Mojo function
multimem_ld_reduce
multimem_ld_reduce[dtype: DType, *, count: Int, reduction: ReduceOp, scope: Scope, consistency: Consistency, accum_type: DType = get_accum_type[dtype](), output_width: Int = 1](addr: UnsafePointer[Scalar[dtype], address_space=AddressSpace(1)]) -> StaticTuple[SIMD[dtype, output_width], count]
Performs a vectorized load-reduce operation using NVIDIA's multimem feature.
This function loads multiple values from global memory and reduces them in a single instruction. It uses NVIDIA's multimem feature, available on SM90+ GPUs, for improved performance.
Constraints:
- Only supported on SM90+ GPUs.
- Total bit width (count * output_width * size_of[dtype] * 8) must be 32, 64, or 128 bits.
- Type must be a floating point type.
- float64 requires count=1 (no .vec qualifier allowed).
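The bit-width constraints above are plain arithmetic, so they can be checked before picking parameters. The following is a minimal Python sketch of that check, not the Mojo implementation: the helper name `is_valid_config` and the use of a `dtype_bits` integer in place of a `DType` are assumptions made for illustration.

```python
# Illustrative model only; names here are hypothetical, not part of the Mojo API.
VALID_TOTAL_BITS = (32, 64, 128)


def is_valid_config(dtype_bits: int, count: int, output_width: int) -> bool:
    """Check the documented bit-width constraints for one configuration."""
    total_bits = count * output_width * dtype_bits
    if total_bits not in VALID_TOTAL_BITS:
        return False
    # float64 is modeled as the only 64-bit dtype; it must stay scalar (count=1).
    if dtype_bits == 64 and count != 1:
        return False
    return True


print(is_valid_config(16, 4, 2))  # 16-bit type, four packed pairs: 128 bits
print(is_valid_config(32, 4, 1))  # 32-bit type, four scalars: 128 bits
print(is_valid_config(64, 2, 1))  # 64-bit type with count=2: rejected
print(is_valid_config(16, 1, 1))  # a single 16-bit value is only 16 bits: rejected
```

Under this model, a 16-bit type needs at least two elements (for example `count=1, output_width=2`) to reach the 32-bit minimum.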
Parameters:
- dtype (DType): Data dtype for the operation (must be a floating point type).
- count (Int): Vector size for PTX (corresponds to .v2, .v4, .v8 qualifiers, or no .v for scalar).
- reduction (ReduceOp): Type of reduction operation to perform.
- scope (Scope): Memory scope for the operation.
- consistency (Consistency): Memory consistency model to use.
- accum_type (DType): Data dtype used for accumulation. Defaults to a wider dtype than input (e.g. float32 for float16 inputs) to maintain precision during reduction.
- output_width (Int): Number of elements packed into a single output register (e.g. bf16x2).
Args:
- addr (UnsafePointer): Pointer to global memory where data will be loaded from.
Returns:
StaticTuple: A StaticTuple containing 'count' SIMD vectors of width 'output_width' holding the results of the load-reduce operation.
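To make the shape of the return value concrete, here is a small host-side Python model of a sum reduction. It is a sketch under assumptions: the page does not describe where the reduced values come from, so the `replicas` argument (one list of values per memory location backing the multimem address) and the function name `model_ld_reduce_add` are hypothetical. It only illustrates how `count` and `output_width` group the results.

```python
def model_ld_reduce_add(replicas, count, output_width):
    """Model the result layout: `count` vectors of `output_width` reduced values.

    `replicas` is a hypothetical stand-in for the memory locations being reduced.
    """
    n = count * output_width
    # Element-wise sum of the first n values across every replica.
    flat = [sum(rep[i] for rep in replicas) for i in range(n)]
    # Regroup the flat results into `count` vectors of `output_width` elements.
    return tuple(
        tuple(flat[v * output_width:(v + 1) * output_width])
        for v in range(count)
    )


replicas = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
]
print(model_ld_reduce_add(replicas, count=2, output_width=2))
# prints ((11.0, 22.0), (33.0, 44.0))
```

The outer tuple plays the role of the `StaticTuple` and each inner tuple the role of one `SIMD[dtype, output_width]` value.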
multimem_ld_reduce[dtype: DType, *, simd_width: Int, reduction: ReduceOp, scope: Scope, consistency: Consistency, accum_type: DType = get_accum_type[dtype]()](addr: UnsafePointer[Scalar[dtype], address_space=AddressSpace(1)]) -> SIMD[dtype, simd_width]
Simplified multimem_ld_reduce that automatically calculates optimal packing.
This wrapper automatically determines the optimal output_width and count parameters based on the requested simd_width and data type, using 32-bit word packing for efficiency.
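One way to picture the 32-bit word packing is sketched below in Python. This is an assumption about how such a derivation could work, not the wrapper's actual code: `derive_packing` and its `dtype_bits` argument are hypothetical names, and the real implementation may choose differently in edge cases.

```python
def derive_packing(dtype_bits: int, simd_width: int) -> tuple[int, int]:
    """Return a hypothetical (count, output_width) pair for simd_width elements."""
    if simd_width not in (1, 2, 4, 8):
        raise ValueError("simd_width must be 1, 2, 4, or 8")
    # Elements that fit in one 32-bit word; at least one for 32- and 64-bit types.
    per_word = max(1, 32 // dtype_bits)
    # Never pack more elements per register than were requested in total.
    output_width = min(per_word, simd_width)
    count = simd_width // output_width
    return count, output_width


print(derive_packing(16, 8))  # prints (4, 2): four registers of two 16-bit values
print(derive_packing(32, 4))  # prints (4, 1): four 32-bit scalars
print(derive_packing(16, 2))  # prints (1, 2): one packed 32-bit word
```

The derived pair still has to satisfy the total bit-width constraint; this sketch does not check it.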
Constraints:
- Only supported on SM90+ GPUs.
- simd_width must be 1, 2, 4, or 8.
- Total bit width (count * output_width * size_of[dtype] * 8) must be 32, 64, or 128 bits.
- Type must be a floating point type.
- float64 requires count=1 (no .vec qualifier allowed).
Parameters:
- dtype (DType): Data dtype for the operation (must be a floating point type).
- simd_width (Int): Total number of elements to process.
- reduction (ReduceOp): Type of reduction operation to perform.
- scope (Scope): Memory scope for the operation.
- consistency (Consistency): Memory consistency model to use.
- accum_type (DType): Data dtype used for accumulation.
Returns:
SIMD: A SIMD vector containing simd_width elements with the reduction results.