Mojo function
st_matrix
st_matrix[dtype: DType, //, simd_width: Int, *, transpose: Bool = False](ptr: UnsafePointer[SIMD[dtype, 1], address_space=AddressSpace(3)], d: SIMD[float32, simd_width])
Performs warp-synchronized copy from registers to shared memory.
This function stores data from registers to shared memory in a format that can be directly used by tensor core Matrix Multiply-Accumulate (MMA) instructions. It uses the NVIDIA stmatrix instruction to perform an efficient warp-synchronized store.
Note: This is a warp-synchronized operation. Every thread in the warp must execute the call; if some threads skip it, the warp can deadlock.
Constraints:
- Must be used with shared memory pointers.
- Number of registers must be 1, 2, or 4.
- Data must be properly aligned for matrix operations.
- All threads in warp must participate.
- Only supported on NVIDIA GPUs with tensor core capabilities.
Parameters:
- dtype (DType): Data type of elements to store.
- simd_width (Int): Width of the SIMD vector.
- transpose (Bool): If True, transposes the matrix during store.
Args:
- ptr (UnsafePointer[SIMD[dtype, 1], address_space=AddressSpace(3)]): Pointer to shared memory where data will be stored.
- d (SIMD[float32, simd_width]): SIMD vector containing the data to store.
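To make the effect of simd_width and transpose more concrete, the following is a minimal CPU-side sketch in Python. It is not Mojo code and it does not call st_matrix. It only models where each lane's data would land, assuming the 8x8 fragment layout that the PTX ISA documentation describes for stmatrix. Nothing on this page states that layout, so the following are assumptions: each 32-bit register packs two 16-bit elements, lane t holds row t // 4 and columns 2 * (t % 4) and 2 * (t % 4) + 1 of each matrix, and lanes 8j to 8j + 7 supply the row addresses for matrix j. The names st_matrix_model and make_regs are hypothetical helpers, not part of any library.

```python
WARP_SIZE = 32


def st_matrix_model(smem, row_addrs, regs, transpose=False):
    """Model one warp-wide store on the CPU (assumed stmatrix m8n8 layout).

    smem      : flat list standing in for shared memory, one slot per
                16-bit element
    row_addrs : row_addrs[lane] is the element offset that lane passes as ptr
    regs      : regs[lane][j] is a (lo, hi) pair, the two 16-bit values
                assumed to be packed into register j of that lane
    """
    simd_width = len(regs[0])
    if simd_width not in (1, 2, 4):
        raise ValueError("number of registers must be 1, 2, or 4")
    if len(regs) != WARP_SIZE or len(row_addrs) != WARP_SIZE:
        raise ValueError("all 32 lanes of the warp must participate")

    for j in range(simd_width):          # one 8x8 matrix per register
        for lane in range(WARP_SIZE):
            frag_row = lane // 4
            for k, value in enumerate(regs[lane][j]):
                frag_col = 2 * (lane % 4) + k
                if transpose:
                    # Element (row, col) is written to position (col, row).
                    dst_row, dst_col = frag_col, frag_row
                else:
                    dst_row, dst_col = frag_row, frag_col
                # Lanes 8j .. 8j+7 provide the row addresses of matrix j.
                base = row_addrs[8 * j + dst_row]
                smem[base + dst_col] = value
    return smem


def make_regs(simd_width):
    """Fill registers so matrix j holds 100*j + 8*row + col at (row, col)."""
    regs = []
    for lane in range(WARP_SIZE):
        row, col = lane // 4, 2 * (lane % 4)
        regs.append([
            (100 * j + 8 * row + col, 100 * j + 8 * row + col + 1)
            for j in range(simd_width)
        ])
    return regs


# Lane l points at a contiguous 8-element row starting at offset 8 * l.
row_addrs = [8 * lane for lane in range(WARP_SIZE)]

plain = st_matrix_model([None] * 256, row_addrs, make_regs(4))
flipped = st_matrix_model([None] * 256, row_addrs, make_regs(4), transpose=True)

print(plain[0:8])    # first stored row of matrix 0
print(flipped[0:8])  # with transpose=True this is column 0 of matrix 0
```

In this model, the first stored row is 0 through 7 without transpose and 0, 8, 16, ..., 56 with transpose, which matches the "transposes the matrix during store" description above. The model ignores alignment and synchronization, so it illustrates the data movement only. Check the lane and address mapping against the PTX documentation before relying on it.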