Mojo function

multimem_ld_reduce

multimem_ld_reduce[dtype: DType, *, count: Int, reduction: ReduceOp, scope: Scope, consistency: Consistency, accum_type: DType = get_accum_type[::DType,::DType](), output_width: Int = 1](addr: UnsafePointer[SIMD[dtype, 1], address_space=AddressSpace(1)]) -> StaticTuple[SIMD[accum_type, output_width], count]

Performs a vectorized load-reduce operation using NVIDIA's multimem feature.

This function loads `count` values from global memory and reduces them in a single instruction instead of separate load and reduce steps. It relies on NVIDIA's multimem feature, which is available on SM90+ GPUs.

Constraints:

Only supported on SM90+ GPUs.
count must be 2 or 4.
dtype must be float32, float16, or bfloat16.

Parameters:

dtype (DType): Data type of the values to load (float32, float16, or bfloat16).
count (Int): Number of elements to load and reduce (2 or 4).
reduction (ReduceOp): Type of reduction operation to perform.
scope (Scope): Memory scope for the operation.
consistency (Consistency): Memory consistency model to use.
accum_type (DType): Data type used for accumulation. Defaults to the result of get_accum_type for dtype, typically a wider type than the input (e.g. float32 for float16 inputs), to maintain precision during reduction.
output_width (Int): Width of each output SIMD vector (default 1).

Args:

addr (UnsafePointer): Pointer to the global memory location to load from.

Returns:

StaticTuple: A StaticTuple containing `count` SIMD vectors of type `accum_type` and width `output_width` that hold the results of the load-reduce operation.
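
Example:

The following is a minimal sketch of how a call might look, not a verified implementation. It follows the signature above, but the import paths, the enum member names `ReduceOp.ADD`, `Scope.GPU`, and `Consistency.RELAXED`, and the use of `AddressSpace.GLOBAL` for `AddressSpace(1)` are assumptions that may differ between Mojo releases. The helper `load_reduce_sum4` is hypothetical, and `addr` is assumed to be a multimem address that the caller has already set up on an SM90+ GPU; that setup is not shown.

```mojo
# Sketch only: import paths and enum member names are assumptions.
from gpu.intrinsics import Scope
from gpu.memory import AddressSpace, Consistency, ReduceOp, multimem_ld_reduce
from memory import UnsafePointer


fn load_reduce_sum4(
    addr: UnsafePointer[
        SIMD[DType.float32, 1], address_space = AddressSpace.GLOBAL
    ],
) -> SIMD[DType.float32, 4]:
    # Hypothetical device-side helper; call it from GPU kernel code only.
    var parts = multimem_ld_reduce[
        DType.float32,
        count=4,  # must be 2 or 4
        reduction = ReduceOp.ADD,  # assumed member name
        scope = Scope.GPU,  # assumed member name
        consistency = Consistency.RELAXED,  # assumed member name
        accum_type = DType.float32,  # set explicitly instead of using the default
    ](addr)

    # With the default output_width of 1, `parts` holds four width-1 vectors.
    # Repack them into a single 4-wide vector.
    return SIMD[DType.float32, 4](
        parts[0][0], parts[1][0], parts[2][0], parts[3][0]
    )
```

Passing `accum_type` explicitly keeps the return type independent of the default, which makes the repacking step easier to follow.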