Mojo struct
TMATensorTile
struct TMATensorTile[dtype: DType, layout: Layout, desc_layout: Layout = layout]
A hardware-accelerated tensor memory access (TMA) tile for efficient asynchronous data movement.
The TMATensorTile struct provides a high-performance interface for asynchronous data transfers between global memory and shared memory in GPU tensor operations. It encapsulates a TMA descriptor that defines the memory access pattern and provides methods for various asynchronous operations.
Performance:
- Hardware-accelerated memory transfers using TMA instructions
- Supports prefetching of descriptors for latency hiding
- Enforces 128-byte alignment requirements for optimal memory access
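Example:
As a minimal sketch of how the struct is parameterized (the import paths assume the `layout.tma_async` module and may differ between Mojo releases):

```mojo
from layout import Layout
from layout.tma_async import TMATensorTile

# A 64x64 bfloat16 tile; desc_layout defaults to the shared-memory layout
# when no hardware-specific descriptor shape (e.g., for WGMMA) is needed.
alias BF16Tile = TMATensorTile[DType.bfloat16, Layout.row_major(64, 64)]
```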
Parameters
- dtype (`DType`): The data type of the tensor elements.
- layout (`Layout`): The layout of the tile in shared memory, typically specified as row_major.
- desc_layout (`Layout`, default `layout`): The layout of the descriptor, which can differ from the shared memory layout to accommodate hardware requirements like WGMMA.
Fields
- descriptor (`TMADescriptor`): The TMA descriptor that defines the memory access pattern.
Implemented traits
AnyType, Copyable, ExplicitlyCopyable, Movable, UnknownDestructibility
Methods
__init__
@implicit
__init__(out self, descriptor: TMADescriptor)
Initializes a new TMATensorTile with the provided TMA descriptor.
Args:
- descriptor (`TMADescriptor`): The TMA descriptor that defines the memory access pattern.
__copyinit__
__copyinit__(out self, other: Self)
Copy initializes this TMATensorTile from another instance.
Args:
- other (`Self`): The other `TMATensorTile` instance to copy from.
prefetch_descriptor
prefetch_descriptor(self)
Prefetches the TMA descriptor into cache to reduce latency.
This method helps hide memory access latency by prefetching the descriptor before it's needed for actual data transfers.
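Example:
A sketch of the usual idiom: issue the prefetch once, early in the kernel, before the main transfer loop (`thread_idx` comes from the `gpu` module; the single-thread guard is a common convention, not a requirement stated here):

```mojo
from gpu import thread_idx
from layout import Layout
from layout.tma_async import TMATensorTile

fn warm_descriptor(
    tma_tile: TMATensorTile[DType.bfloat16, Layout.row_major(64, 64)]
):
    # Prefetch early so the descriptor is cache-resident by the time
    # the first async_copy is issued.
    if thread_idx.x == 0:
        tma_tile.prefetch_descriptor()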
async_copy
async_copy(self, dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], mem_barrier: TMABarrier, coords: Tuple[UInt, UInt])
Schedules an asynchronous copy from global memory to shared memory at specified coordinates.
This method initiates a hardware-accelerated asynchronous transfer of data from global memory to the specified destination in shared memory. The transfer is tracked by the provided memory barrier.
Constraints:
- The destination tensor must be 128-byte aligned in shared memory.
- The descriptor layout may be smaller than the shared memory tile shape to accommodate hardware requirements.
Args:
- dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The destination tensor in shared memory where data will be copied. Must be 128-byte aligned.
- mem_barrier (`TMABarrier`): The memory barrier used to track and synchronize the asynchronous transfer.
- coords (`Tuple[UInt, UInt]`): The 2D coordinates in the source tensor from which to copy data.
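Example:
In the usual pattern, one thread arms the barrier with the expected byte count and issues the copy, then every thread waits. The sketch below assumes `TMABarrier` exposes `expect_bytes` and `wait`, that `LayoutTensor.stack_allocation` allocates the shared-memory tile, and that coordinates are (x, y) element offsets into the source tensor; these are assumptions about the surrounding API, not guarantees of this page.

```mojo
from gpu import block_idx, thread_idx
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.tma_async import TMABarrier, TMATensorTile

alias tile_layout = Layout.row_major(64, 64)

fn load_one_tile(tma_tile: TMATensorTile[DType.bfloat16, tile_layout]):
    # Shared-memory destination; TMA requires 128-byte alignment.
    var smem_tile = LayoutTensor[
        DType.bfloat16,
        tile_layout,
        MutableAnyOrigin,
        address_space = AddressSpace.SHARED,
        alignment=128,
    ].stack_allocation()

    var mbar = TMABarrier()

    if thread_idx.x == 0:
        # Arm the barrier with the transfer size: 64 * 64 bf16 elements.
        mbar.expect_bytes(64 * 64 * 2)
        # Each block loads the tile at its own offset (coordinate order
        # is an assumption about the descriptor's convention).
        tma_tile.async_copy(
            smem_tile,
            mbar,
            (UInt(block_idx.x * 64), UInt(block_idx.y * 64)),
        )

    # All threads block until the TMA engine signals the barrier.
    mbar.wait()
```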
async_copy_3d
async_copy_3d(self, dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], mem_barrier: TMABarrier, coords: Tuple[UInt, UInt, UInt])
Schedules an asynchronous copy from global memory to shared memory at specified 3D coordinates.
This method initiates a hardware-accelerated asynchronous transfer of data from global memory to the specified destination in shared memory for 3D tensors. The transfer is tracked by the provided memory barrier.
Constraints:
- The destination tensor must be 128-byte aligned in shared memory.
- The descriptor layout may be smaller than the shared memory tile shape to accommodate hardware requirements.
Args:
- dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The destination tensor in shared memory where data will be copied. Must be 128-byte aligned.
- mem_barrier (`TMABarrier`): The memory barrier used to track and synchronize the asynchronous transfer.
- coords (`Tuple[UInt, UInt, UInt]`): The 3D coordinates in the source tensor from which to copy data.
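Example:
Continuing the `load_one_tile` sketch above, only the coordinate tuple changes for the 3D case (`batch`, `row`, and `col` are placeholder values):

```mojo
# Fragment reusing smem_tile and mbar from the load_one_tile sketch.
if thread_idx.x == 0:
    mbar.expect_bytes(64 * 64 * 2)
    tma_tile.async_copy_3d(
        smem_tile, mbar, (UInt(batch), UInt(row), UInt(col))
    )
mbar.wait()
```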
async_multicast_load
async_multicast_load(self, dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], mem_barrier: TMABarrier, coords: Tuple[UInt, UInt], multicast_mask: SIMD[uint16, 1])
Schedules an asynchronous multicast load from global memory to multiple shared memory locations.
This method initiates a hardware-accelerated asynchronous transfer of data from global memory to multiple destination locations in shared memory across different CTAs (Cooperative Thread Arrays) as specified by the multicast mask.
Constraints:
- The destination tensor must be 128-byte aligned in shared memory.
Args:
- dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The destination tensor in shared memory where data will be copied. Must be 128-byte aligned.
- mem_barrier (`TMABarrier`): The memory barrier used to track and synchronize the asynchronous transfer.
- coords (`Tuple[UInt, UInt]`): The 2D coordinates in the source tensor from which to copy data.
- multicast_mask (`SIMD[uint16, 1]`): A bit mask specifying which CTAs should receive the data.
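Example:
A hedged fragment, again reusing the names from the `load_one_tile` sketch. Bit i of the mask selecting CTA i of the cluster, and each receiving CTA arming its own barrier, are assumptions about the multicast convention:

```mojo
# Broadcast the tile at (0, 0) to CTAs 0 and 1 of the cluster.
var two_ctas = UInt16((1 << 0) | (1 << 1))
if thread_idx.x == 0:
    mbar.expect_bytes(64 * 64 * 2)
    tma_tile.async_multicast_load(
        smem_tile, mbar, (UInt(0), UInt(0)), two_ctas
    )
mbar.wait()
```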
async_store
async_store(self, src: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], coords: Tuple[UInt, UInt])
Schedules an asynchronous store from shared memory to global memory.
This method initiates a hardware-accelerated asynchronous transfer of data from shared memory to global memory at the specified coordinates.
Constraints:
- The source tensor must be 128-byte aligned in shared memory.
Args:
- src (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The source tensor in shared memory from which data will be copied. Must be 128-byte aligned.
- coords (`Tuple[UInt, UInt]`): The 2D coordinates in the destination tensor where data will be stored.
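Example:
A fragment in the same style: shared-memory writes are made visible with a block barrier before one thread issues the store. Note that `async_store` takes no `TMABarrier`; tracking store completion with commit/wait-group primitives is an assumption (the `cp_async_bulk_*` names are assumed to live in `gpu.memory`):

```mojo
from gpu import barrier
from gpu.memory import (  # assumed module and names
    cp_async_bulk_commit_group,
    cp_async_bulk_wait_group,
)

# Make every thread's writes to smem_tile visible before the store.
barrier()
if thread_idx.x == 0:
    tma_tile.async_store(smem_tile, (UInt(row), UInt(col)))
    cp_async_bulk_commit_group()   # assumed: batch the outstanding store
    cp_async_bulk_wait_group[0]()  # assumed: drain it before reusing smem
```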
async_reduce
async_reduce[reduction_kind: ReduceOp](self, src: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], coords: Tuple[UInt, UInt])
Schedules an asynchronous reduction operation from shared memory to global memory.
This method initiates a hardware-accelerated asynchronous reduction operation that combines data from shared memory with data in global memory using the specified reduction operation. The reduction is performed element-wise at the specified coordinates in the global tensor.
Constraints:
- The source tensor must be 128-byte aligned in shared memory.
Parameters:
- reduction_kind (`ReduceOp`): The type of reduction operation to perform (e.g., ADD, MIN, MAX). This determines how values are combined during the reduction.
Args:
- src (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The source tensor in shared memory containing the data to be reduced. Must be 128-byte aligned.
- coords (`Tuple[UInt, UInt]`): The 2D coordinates in the destination tensor where the reduction will be applied.
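Example:
Accumulating a block's partial tile into global memory looks like the store fragment above with a reduction kind; the `ReduceOp` import location is an assumption:

```mojo
from gpu import barrier
from gpu.memory import ReduceOp  # assumed import location

barrier()  # finish writing partial results into smem_tile first
if thread_idx.x == 0:
    # Element-wise: global[row : row + 64, col : col + 64] += smem_tile
    tma_tile.async_reduce[ReduceOp.ADD](smem_tile, (UInt(row), UInt(col)))
```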
smem_tensormap_init
smem_tensormap_init(self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3)])
Initializes a TMA descriptor in shared memory from this tensor tile's descriptor.
This method copies the TMA descriptor from global memory to shared memory, allowing for faster access during kernel execution. The descriptor is copied in 16-byte chunks using asynchronous copy operations for efficiency.
Note:
- Only one thread should call this method to avoid race conditions
- The descriptor is copied in 8 chunks of 16 bytes each (total 128 bytes)
Args:
- smem_tma_descriptor_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3)]`): Pointer to the location in shared memory where the descriptor will be stored. Must be properly aligned.
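Example:
A sketch of the single-thread idiom, assuming `stack_allocation` from the `memory` module can reserve an aligned shared-memory slot and that `TMADescriptor` is importable from `layout.tma_async` (both import paths are assumptions):

```mojo
from gpu import barrier, thread_idx
from gpu.memory import AddressSpace
from layout.tma_async import TMADescriptor  # assumed path
from memory import stack_allocation

# One aligned TMADescriptor slot in shared memory.
var smem_desc = stack_allocation[
    1, TMADescriptor, alignment=128, address_space = AddressSpace.SHARED
]()
if thread_idx.x == 0:
    tma_tile.smem_tensormap_init(smem_desc)
barrier()  # other threads must not read the descriptor before it lands
```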
replace_tensormap_global_address_in_gmem
replace_tensormap_global_address_in_gmem[dtype: DType, src_layout: Layout](self, new_src: LayoutTensor[dtype, src_layout, MutableAnyOrigin])
Replaces the global memory address in the TMA descriptor stored in global memory.
This method allows dynamically changing the source tensor for TMA operations without recreating the entire descriptor, which is useful for reusing descriptors with different data sources. The operation modifies the descriptor in global memory directly.
Note: A memory fence may be required after this operation to ensure visibility of the changes to other threads.
Parameters:
- dtype (`DType`): The data type of the new source tensor.
- src_layout (`Layout`): The layout of the new source tensor.
Args:
- new_src (`LayoutTensor[dtype, src_layout, MutableAnyOrigin]`): The new source tensor whose address will replace the current one in the descriptor. Must have a layout compatible with the original tensor.
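Example:
A fragment showing the retarget-then-fence pattern; the pairing with `tensormap_fence_release` and `tensormap_fence_acquire` follows the notes on those methods below, while the block barrier is an assumption about a reasonable kernel structure:

```mojo
if thread_idx.x == 0:
    tma_tile.replace_tensormap_global_address_in_gmem(new_src)
    tma_tile.tensormap_fence_release()  # publish the modified descriptor
barrier()
# Warp-aligned: every thread of the warp calls the acquire fence
# before the next TMA operation reads the descriptor.
tma_tile.tensormap_fence_acquire()
```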
tensormap_fence_acquire
tensormap_fence_acquire(self)
Establishes a memory fence for TMA operations with acquire semantics.
This method ensures proper ordering of memory operations by creating a barrier that prevents subsequent TMA operations from executing before prior operations have completed. It is particularly important when reading from a descriptor that might have been modified by other threads or processes.
The acquire semantics ensure that all memory operations after this fence will observe any modifications made to the descriptor before the fence.
Notes:
- The entire warp must call this function as the instruction is warp-aligned.
- Typically used in pairs with `tensormap_fence_release` for proper synchronization.
tensormap_fence_release
tensormap_fence_release(self)
Establishes a memory fence for TMA operations with release semantics.
This method ensures proper ordering of memory operations by creating a barrier that ensures all prior memory operations are visible before subsequent operations can proceed. It is particularly important when modifying a TMA descriptor in global memory that might be read by other threads or processes.
The release semantics ensure that all memory operations before this fence will be visible to any thread that observes operations after the fence.
Notes:
- Typically used after modifying a tensormap descriptor in global memory.
- Often paired with `tensormap_fence_acquire` for proper synchronization.
replace_tensormap_global_address_in_shared_mem
replace_tensormap_global_address_in_shared_mem[dtype: DType, src_layout: Layout](self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3)], new_src: LayoutTensor[dtype, src_layout, MutableAnyOrigin])
Replaces the global memory address in the TMA descriptor stored in shared memory.
This method allows dynamically changing the source tensor for TMA operations without recreating the entire descriptor, which is useful for reusing descriptors with different data sources. The operation modifies a descriptor that has been previously copied to shared memory.
Notes:
- Only one thread should call this method to avoid race conditions.
- A memory fence may be required after this operation to ensure visibility
of the changes to other threads.
- Typically used with descriptors previously initialized with `smem_tensormap_init`.
Parameters:
- dtype (`DType`): The data type of the new source tensor.
- src_layout (`Layout`): The layout of the new source tensor.
Args:
- smem_tma_descriptor_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3)]`): Pointer to the TMA descriptor in shared memory that will be modified.
- new_src (`LayoutTensor[dtype, src_layout, MutableAnyOrigin]`): The new source tensor whose address will replace the current one in the descriptor. Must have a layout compatible with the original tensor.
tensormap_cp_fence_release
tensormap_cp_fence_release(self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3)])
Establishes a memory fence for TMA operations with release semantics for shared memory descriptors.
This method ensures proper ordering of memory operations by creating a barrier that ensures all prior memory operations are visible before subsequent operations can proceed. It is specifically designed for synchronizing between global memory and shared memory TMA descriptors.
The release semantics ensure that all memory operations before this fence will be visible to any thread that observes operations after the fence.
Notes:
- The entire warp must call this function as the instruction is warp-aligned.
- Typically used after modifying a tensormap descriptor in shared memory.
- More specialized than the general `tensormap_fence_release` for cross-memory-space synchronization.
Args:
- smem_tma_descriptor_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3)]`): Pointer to the TMA descriptor in shared memory that is being synchronized with the global memory descriptor.
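Example:
Putting the shared-memory descriptor pieces together, a hedged end-to-end fragment (reusing `smem_desc` and `new_src` from the sketches above; exact ordering requirements may differ by hardware):

```mojo
if thread_idx.x == 0:
    tma_tile.smem_tensormap_init(smem_desc)
    tma_tile.replace_tensormap_global_address_in_shared_mem(
        smem_desc, new_src
    )
# Warp-aligned: the whole warp publishes the updated smem descriptor...
tma_tile.tensormap_cp_fence_release(smem_desc)
# ...and acquires before subsequent copies read from new_src.
tma_tile.tensormap_fence_acquire()
```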