Mojo function

st_matrix

st_matrix[dtype: DType, //, simd_width: Int, *, transpose: Bool = False](ptr: UnsafePointer[SIMD[dtype, 1], address_space=AddressSpace(3)], d: SIMD[float32, simd_width])

Performs warp-synchronized copy from registers to shared memory.

This function stores data from registers to shared memory in a format that can be directly used by tensor core Matrix Multiply-Accumulate (MMA) instructions. It uses the NVIDIA stmatrix instruction to perform an efficient warp-synchronized store.

Note: The function performs a warp-synchronized operation. All threads in the warp must execute this instruction; if some threads skip it, the warp can deadlock.

Constraints:

Must be used with shared memory pointers.
Number of registers (simd_width, one 32-bit register per element of d) must be 1, 2, or 4.
Data must be properly aligned for matrix operations.
All threads in warp must participate.
Only supported on NVIDIA GPUs with tensor core capabilities that implement the stmatrix instruction (sm_90 or newer, per the PTX ISA).

Parameters:

dtype (DType): Data type of elements to store.
simd_width (Int): Width of the SIMD vector d, that is, the number of 32-bit registers each thread stores (1, 2, or 4).
transpose (Bool): If True, transposes the matrix during store.

Args:

ptr (UnsafePointer): Pointer to shared memory where data will be stored.
d (SIMD): SIMD vector containing the data to store.
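
Example (layout model):

The following is a host-side Python sketch of the register-to-shared-memory pattern described above. It is not Mojo and does not call st_matrix. It assumes the function lowers to the PTX stmatrix m8n8 form with 16-bit elements, where each 32-bit register packs two adjacent elements and lane T owns row T // 4, columns 2 * (T % 4) and 2 * (T % 4) + 1 of an 8x8 tile. The names model_st_matrix, regs, and num_regs are hypothetical and exist only for this illustration. Each tile is modeled as a dense 8x8 array; on hardware, the address of each 8-element row is supplied by individual lanes.

```python
# Host-side model of the stmatrix m8n8 (16-bit element) store pattern.
# Plain Python for illustration only; it does not touch the GPU.

WARP_SIZE = 32
TILE = 8  # each stored matrix is 8x8 elements of 16 bits


def model_st_matrix(regs, num_regs, transpose=False):
    """Return num_regs 8x8 tiles as they would land in shared memory.

    regs[lane][r] is a pair (lo, hi): the two 16-bit values packed into
    32-bit register r of that lane (hypothetical host-side encoding).
    """
    if num_regs not in (1, 2, 4):
        raise ValueError("number of registers must be 1, 2, or 4")
    if len(regs) != WARP_SIZE:
        raise ValueError("all 32 lanes of the warp must participate")

    tiles = [[[None] * TILE for _ in range(TILE)] for _ in range(num_regs)]
    for lane in range(WARP_SIZE):
        row = lane // 4       # fragment row owned by this lane
        col = 2 * (lane % 4)  # first of two adjacent fragment columns
        for r in range(num_regs):
            lo, hi = regs[lane][r]
            for offset, value in ((0, lo), (1, hi)):
                i, j = row, col + offset
                if transpose:
                    # Transposed store: each memory row holds a matrix column.
                    i, j = j, i
                tiles[r][i][j] = value
    return tiles


# Lane T holds the values 2*T and 2*T + 1 in its single register.
demo_regs = [[(2 * lane, 2 * lane + 1)] for lane in range(WARP_SIZE)]
demo_tiles = model_st_matrix(demo_regs, num_regs=1)
print(demo_tiles[0][0])  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With this encoding, the non-transposed tile comes out in plain row-major order, which makes it clear that four consecutive lanes together fill one 8-element row. Passing transpose=True swaps the row and column index of every element.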