Mojo function
vectorize
vectorize[origins: origin.set, //, func: fn[Int](Int) capturing -> None, simd_width: Int, /, *, unroll_factor: Int = 1](size: Int)
Simplifies SIMD optimized loops by mapping a function across a range from 0 to size
, incrementing by simd_width
at each step. The remainder of size % simd_width
will run in separate iterations.
The below example demonstrates how you could improve the performance of a loop, by setting multiple values at the same time using SIMD registers on the machine:
from algorithm.functional import vectorize
# The amount of elements to loop through
alias size = 10
# How many Dtype.int32 elements fit into the SIMD register (4 on 128bit)
alias simd_width = simdwidthof[DType.int32]()
fn main():
var p = UnsafePointer[Int32].alloc(size)
# @parameter allows the closure to capture the `p` pointer
@parameter
fn closure[simd_width: Int](i: Int):
print("storing", simd_width, "els at pos", i)
p.store[width=simd_width](i, i)
vectorize[closure, simd_width](size)
print(p.load[width=size]())
from algorithm.functional import vectorize
# The amount of elements to loop through
alias size = 10
# How many Dtype.int32 elements fit into the SIMD register (4 on 128bit)
alias simd_width = simdwidthof[DType.int32]()
fn main():
var p = UnsafePointer[Int32].alloc(size)
# @parameter allows the closure to capture the `p` pointer
@parameter
fn closure[simd_width: Int](i: Int):
print("storing", simd_width, "els at pos", i)
p.store[width=simd_width](i, i)
vectorize[closure, simd_width](size)
print(p.load[width=size]())
On a machine with a SIMD register size of 128, this will set 4xInt32 values on each iteration. The remainder of 10 % 4 is 2, so those last two elements will be set in two separate iterations:
storing 4 els at pos 0
storing 4 els at pos 4
storing 1 els at pos 8
storing 1 els at pos 9
[0, 0, 0, 0, 4, 4, 4, 4, 8, 9]
storing 4 els at pos 0
storing 4 els at pos 4
storing 1 els at pos 8
storing 1 els at pos 9
[0, 0, 0, 0, 4, 4, 4, 4, 8, 9]
You can also unroll the loop to potentially improve performance at the cost of binary size:
vectorize[closure, width, unroll_factor=2](size)
vectorize[closure, width, unroll_factor=2](size)
In the generated assembly the function calls will be repeated, resulting in fewer arithmetic, comparison, and conditional jump operations. The assembly would look like this in psuedocode:
closure[4](0)
closure[4](4)
# Remainder loop won't unroll unless `size` is passed as a parameter
for i in range(8, 10):
closure[1](i)
closure[1](i)
closure[4](0)
closure[4](4)
# Remainder loop won't unroll unless `size` is passed as a parameter
for i in range(8, 10):
closure[1](i)
closure[1](i)
You can pass size
as a parameter if it's compile time known to reduce the
iterations for the remainder. This only occurs if the remainder is an
exponent of 2 (2, 4, 8, 16, ...). The remainder loop will still unroll for
performance improvements if not an exponent of 2.
Parameters:
- origins (
origin.set
): The capture origins. - func (
fn[Int](Int) capturing -> None
): The function that will be called in the loop body. - simd_width (
Int
): The SIMD vector width. - unroll_factor (
Int
): The unroll factor for the main loop (Default 1).
Args:
- size (
Int
): The upper limit for the loop.
vectorize[origins: origin.set, //, func: fn[Int](Int) capturing -> None, simd_width: Int, /, *, size: Int, unroll_factor: Int = $3 if is_nvidia_gpu() else 1]()
Simplifies SIMD optimized loops by mapping a function across a range from 0 to size
, incrementing by simd_width
at each step. The remainder of size % simd_width
will run in a single iteration if it's an exponent of 2.
The below example demonstrates how you could improve the performance of a loop, by setting multiple values at the same time using SIMD registers on the machine:
from algorithm.functional import vectorize
# The amount of elements to loop through
alias size = 10
# How many Dtype.int32 elements fit in SIMD register (4 on 128bit)
alias simd_width = simdwidthof[DType.int32]()
fn main():
var p = UnsafePointer[Int32].alloc(size)
# @parameter allows the closure to capture the `p` pointer
@parameter
fn closure[simd_width: Int](i: Int):
print("storing", simd_width, "els at pos", i)
p.store[width=simd_width](i, i)
vectorize[closure, simd_width, size=size]()
print(p.load[width=size]())
from algorithm.functional import vectorize
# The amount of elements to loop through
alias size = 10
# How many Dtype.int32 elements fit in SIMD register (4 on 128bit)
alias simd_width = simdwidthof[DType.int32]()
fn main():
var p = UnsafePointer[Int32].alloc(size)
# @parameter allows the closure to capture the `p` pointer
@parameter
fn closure[simd_width: Int](i: Int):
print("storing", simd_width, "els at pos", i)
p.store[width=simd_width](i, i)
vectorize[closure, simd_width, size=size]()
print(p.load[width=size]())
On a machine with a SIMD register size of 128, this will set 4xInt32 values on each iteration. The remainder of 10 % 4 is 2, so those last two elements will be set in a single iteration:
storing 4 els at pos 0
storing 4 els at pos 4
storing 2 els at pos 8
[0, 0, 0, 0, 4, 4, 4, 4, 8, 8]
storing 4 els at pos 0
storing 4 els at pos 4
storing 2 els at pos 8
[0, 0, 0, 0, 4, 4, 4, 4, 8, 8]
If the remainder is not an exponent of 2 (2, 4, 8, 16 ...) there will be a
seperate iteration for each element. However passing size
as a parameter
also allows the loop for the remaining elements to be unrolled.
You can also unroll the main loop to potentially improve performance at the cost of binary size:
vectorize[closure, width, size=size, unroll_factor=2]()
vectorize[closure, width, size=size, unroll_factor=2]()
In the generated assembly the function calls will be repeated, resulting in fewer arithmetic, comparison, and conditional jump operations. The assembly would look like this in psuedocode:
closure[4](0)
closure[4](4)
closure[2](8)
closure[4](0)
closure[4](4)
closure[2](8)
Parameters:
- origins (
origin.set
): The capture origins. - func (
fn[Int](Int) capturing -> None
): The function that will be called in the loop body. - simd_width (
Int
): The SIMD vector width. - size (
Int
): The upper limit for the loop. - unroll_factor (
Int
): The unroll factor for the main loop (Default 1).
Was this page helpful?
Thank you! We'll create more content like this.
Thank you for helping us improve!