Mojo function

sum

sum[val_type: DType, simd_width: Int, //](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]

Computes the sum of values across all lanes in a warp.

This is a convenience wrapper around lane_group_sum_and_broadcast that operates on the entire warp. It performs a parallel reduction using warp shuffle operations to compute the sum across all lanes in the warp.

Parameters:

val_type (DType): The data type of the SIMD elements (e.g. float32, int32).
simd_width (Int): The number of elements in the SIMD vector.

Args:

val (SIMD[val_type, simd_width]): The SIMD value to reduce. Each lane contributes its value to the sum.

Returns:

A SIMD value containing the sum across the entire warp. The result is broadcast, so every lane in the warp receives the same value.
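
The reduce-and-broadcast behavior described above can be illustrated with a small host-side model. This is a sketch in Python, not Mojo and not the library's implementation: it assumes a 32-lane warp, assumes the reduction is a butterfly (XOR) shuffle pattern, and assumes the sum is taken element-wise across lanes for each SIMD element. The name warp_sum_model is hypothetical.

```python
WARP_SIZE = 32  # assumption: 32-lane warp; some GPUs use 64 lanes


def warp_sum_model(lanes):
    """Model an all-lanes warp sum built from XOR (butterfly) shuffles.

    lanes[i] is the SIMD value held by lane i, given as a list of numbers.
    Returns the value each lane holds afterwards; all lanes end up equal.
    """
    n = len(lanes)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    vals = [list(v) for v in lanes]
    offset = n // 2
    while offset > 0:
        # Every lane reads its partner's value (lane index XOR offset)
        # and adds it element-wise to its own value.
        vals = [
            [a + b for a, b in zip(vals[i], vals[i ^ offset])]
            for i in range(n)
        ]
        offset //= 2
    return vals


# Each lane contributes a SIMD value of width 2: [lane, 2 * lane].
lanes = [[lane, 2 * lane] for lane in range(WARP_SIZE)]
result = warp_sum_model(lanes)
print(result[0], result[31])
```

After log2(32) = 5 exchange steps, every lane holds the same vector, which matches the "broadcast to all lanes" wording in the Returns section. No separate broadcast step is needed in this pattern because each lane accumulates the full sum on its own.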

sum[intermediate_type: DType, *, reduction_method: ReductionMethod, output_type: DType](x: SIMD[dtype, size]) -> SIMD[output_type, 1]

Performs a warp-level reduction to compute the sum of values across threads.

This function provides two reduction methods:

Warp shuffle: Uses warp shuffle operations to sum values across threads. The output dtype must match the input dtype.
Tensor core: Uses tensor cores for the reduction. The input is first cast to intermediate_type.

The tensor core method will cast the input to the specified intermediate dtype before reduction to ensure compatibility with tensor core operations. The warp shuffle method requires the output dtype to match the input dtype.

Constraints:

For warp shuffle reduction, output_type must match the input value dtype.
For tensor core reduction, the input is cast to intermediate_type before the reduction.

Parameters:

intermediate_type (DType): The data type to cast to when using tensor core reduction.
reduction_method (ReductionMethod): WARP for warp shuffle or TENSOR_CORE for tensor core reduction.
output_type (DType): The desired output data type for the reduced value.

Args:

x (SIMD[dtype, size]): The SIMD value to reduce across the warp.

Returns:

A scalar containing the sum of the input values across all threads in the warp, cast to the specified output dtype.
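
The dtype rules for the two reduction methods can be sketched with a host-side model. This is Python, not Mojo, and it only mirrors the constraints stated above. Several things are assumptions: the helper names cast and warp_sum_scalar_model are hypothetical, dtypes are modelled as strings, each thread's SIMD elements are assumed to be included in the total, and accumulation precision inside the hardware is not modelled. In Mojo the dtype constraint would be checked at compile time, whereas this model raises at run time.

```python
import struct


def cast(value, dtype):
    """Round a Python float to the named dtype (hypothetical helper)."""
    formats = {"float16": "e", "float32": "f"}
    if dtype not in formats:
        raise ValueError("unsupported dtype in this model: " + dtype)
    fmt = formats[dtype]
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def warp_sum_scalar_model(lanes, input_type, intermediate_type,
                          reduction_method, output_type):
    """Model the dtype handling of the scalar-returning warp sum."""
    if reduction_method == "WARP":
        # Warp shuffle: output dtype must match the input dtype.
        if output_type != input_type:
            raise ValueError("WARP requires output_type == input dtype")
        work_type = input_type
    elif reduction_method == "TENSOR_CORE":
        # Tensor core: input is cast to intermediate_type first.
        work_type = intermediate_type
    else:
        raise ValueError("unknown reduction method: " + reduction_method)
    # Sum every element of every lane after casting to the working dtype.
    total = sum(cast(x, work_type) for simd in lanes for x in simd)
    return cast(total, output_type)


lanes = [[1.0, 0.5] for _ in range(32)]  # 32 threads, SIMD width 2
tc_total = warp_sum_scalar_model(
    lanes, "float32", "float16", "TENSOR_CORE", "float32")
warp_total = warp_sum_scalar_model(
    lanes, "float32", "float32", "WARP", "float32")
print(tc_total, warp_total)
```

Both calls give the same total here because 1.0 and 0.5 are exactly representable in float16. With inputs that are not, the cast to intermediate_type in the tensor core path can change the result, which is worth keeping in mind when choosing intermediate_type.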