Mojo struct
TMATensorTile
struct TMATensorTile[dtype: DType, layout: Layout, desc_layout: Layout = layout]
A hardware-accelerated tensor memory access (TMA) tile for efficient asynchronous data movement.
The TMATensorTile struct provides a high-performance interface for asynchronous data transfers between global memory and shared memory in GPU tensor operations. It encapsulates a TMA descriptor that defines the memory access pattern and provides methods for various asynchronous operations.
Performance:
- Hardware-accelerated memory transfers using TMA instructions
- Supports prefetching of descriptors for latency hiding
- Enforces 128-byte alignment requirements for optimal memory access
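Example:
As a minimal sketch of how the struct is parameterized (the import paths assume the `layout.tma_async` module and may differ between Mojo releases):

```mojo
from layout import Layout
from layout.tma_async import TMATensorTile

# A 64x64 bfloat16 tile; desc_layout defaults to the shared-memory layout
# when no hardware-specific descriptor shape (e.g., for WGMMA) is needed.
alias BF16Tile = TMATensorTile[DType.bfloat16, Layout.row_major(64, 64)]
```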
Parameters
- dtype (`DType`): The data type of the tensor elements.
- layout (`Layout`): The layout of the tile in shared memory, typically specified as row_major.
- desc_layout (`Layout`, default `layout`): The layout of the descriptor, which can differ from the shared memory layout to accommodate hardware requirements like WGMMA.
Fields
- descriptor (`TMADescriptor`): The TMA descriptor that defines the memory access pattern.
Implemented traits
AnyType, Copyable, ExplicitlyCopyable, Movable, UnknownDestructibility
Methods
__init__
@implicit
__init__(out self, descriptor: TMADescriptor)
Initializes a new TMATensorTile with the provided TMA descriptor.
Args:
- descriptor (`TMADescriptor`): The TMA descriptor that defines the memory access pattern.
__copyinit__
__copyinit__(out self, other: Self)
Copy initializes this TMATensorTile from another instance.
Args:
- other (`Self`): The other `TMATensorTile` instance to copy from.
prefetch_descriptor
prefetch_descriptor(self)
Prefetches the TMA descriptor into cache to reduce latency.
This method helps hide memory access latency by prefetching the descriptor before it's needed for actual data transfers.
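Example:
A sketch of the usual idiom: issue the prefetch once, early in the kernel, before the main transfer loop (`thread_idx` comes from the `gpu` module; the single-thread guard is a common convention, not a requirement stated here):

```mojo
from gpu import thread_idx
from layout import Layout
from layout.tma_async import TMATensorTile

fn warm_descriptor(
    tma_tile: TMATensorTile[DType.bfloat16, Layout.row_major(64, 64)]
):
    # Prefetch early so the descriptor is cache-resident by the time
    # the first async_copy is issued.
    if thread_idx.x == 0:
        tma_tile.prefetch_descriptor()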
async_copy
async_copy(self, dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], mem_barrier: TMABarrier, coords: Tuple[UInt, UInt])
Schedules an asynchronous copy from global memory to shared memory at specified coordinates.
This method initiates a hardware-accelerated asynchronous transfer of data from global memory to the specified destination in shared memory. The transfer is tracked by the provided memory barrier.
Constraints:
- The destination tensor must be 128-byte aligned in shared memory.
- The descriptor layout may be smaller than the shared memory tile shape to accommodate hardware requirements.
Args:
- dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The destination tensor in shared memory where data will be copied. Must be 128-byte aligned.
- mem_barrier (`TMABarrier`): The memory barrier used to track and synchronize the asynchronous transfer.
- coords (`Tuple[UInt, UInt]`): The 2D coordinates in the source tensor from which to copy data.
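Example:
In the usual pattern, one thread arms the barrier with the expected byte count and issues the copy, then every thread waits. The sketch below assumes `TMABarrier` exposes `expect_bytes` and `wait`, that `LayoutTensor.stack_allocation` allocates the shared-memory tile, and that coordinates are (x, y) element offsets into the source tensor; these are assumptions about the surrounding API, not guarantees of this page.

```mojo
from gpu import block_idx, thread_idx
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.tma_async import TMABarrier, TMATensorTile

alias tile_layout = Layout.row_major(64, 64)

fn load_one_tile(tma_tile: TMATensorTile[DType.bfloat16, tile_layout]):
    # Shared-memory destination; TMA requires 128-byte alignment.
    var smem_tile = LayoutTensor[
        DType.bfloat16,
        tile_layout,
        MutableAnyOrigin,
        address_space = AddressSpace.SHARED,
        alignment=128,
    ].stack_allocation()

    var mbar = TMABarrier()

    if thread_idx.x == 0:
        # Arm the barrier with the transfer size: 64 * 64 bf16 elements.
        mbar.expect_bytes(64 * 64 * 2)
        # Each block loads the tile at its own offset (coordinate order
        # is an assumption about the descriptor's convention).
        tma_tile.async_copy(
            smem_tile,
            mbar,
            (UInt(block_idx.x * 64), UInt(block_idx.y * 64)),
        )

    # All threads block until the TMA engine signals the barrier.
    mbar.wait()
```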
async_copy_3d
async_copy_3d(self, dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], mem_barrier: TMABarrier, coords: Tuple[UInt, UInt, UInt])
Schedules an asynchronous copy from global memory to shared memory at specified 3D coordinates.
This method initiates a hardware-accelerated asynchronous transfer of data from global memory to the specified destination in shared memory for 3D tensors. The transfer is tracked by the provided memory barrier.
Constraints:
- The destination tensor must be 128-byte aligned in shared memory.
- The descriptor layout may be smaller than the shared memory tile shape to accommodate hardware requirements.
Args:
- dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The destination tensor in shared memory where data will be copied. Must be 128-byte aligned.
- mem_barrier (`TMABarrier`): The memory barrier used to track and synchronize the asynchronous transfer.
- coords (`Tuple[UInt, UInt, UInt]`): The 3D coordinates in the source tensor from which to copy data.
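Example:
Continuing the `load_one_tile` sketch above, only the coordinate tuple changes for the 3D case (`batch`, `row`, and `col` are placeholder values):

```mojo
# Fragment reusing smem_tile and mbar from the load_one_tile sketch.
if thread_idx.x == 0:
    mbar.expect_bytes(64 * 64 * 2)
    tma_tile.async_copy_3d(
        smem_tile, mbar, (UInt(batch), UInt(row), UInt(col))
    )
mbar.wait()
```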
async_multicast_load
async_multicast_load(self, dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], mem_barrier: TMABarrier, coords: Tuple[UInt, UInt], multicast_mask: SIMD[uint16, 1])
Schedules an asynchronous multicast load from global memory to multiple shared memory locations.
This method initiates a hardware-accelerated asynchronous transfer of data from global memory to multiple destination locations in shared memory across different CTAs (Cooperative Thread Arrays) as specified by the multicast mask.
Constraints:
- The destination tensor must be 128-byte aligned in shared memory.
Args:
- dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The destination tensor in shared memory where data will be copied. Must be 128-byte aligned.
- mem_barrier (`TMABarrier`): The memory barrier used to track and synchronize the asynchronous transfer.
- coords (`Tuple[UInt, UInt]`): The 2D coordinates in the source tensor from which to copy data.
- multicast_mask (`SIMD[uint16, 1]`): A bit mask specifying which CTAs should receive the data.
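Example:
A hedged fragment, again reusing the names from the `load_one_tile` sketch. Bit i of the mask selecting CTA i of the cluster, and each receiving CTA arming its own barrier, are assumptions about the multicast convention:

```mojo
# Broadcast the tile at (0, 0) to CTAs 0 and 1 of the cluster.
var two_ctas = UInt16((1 << 0) | (1 << 1))
if thread_idx.x == 0:
    mbar.expect_bytes(64 * 64 * 2)
    tma_tile.async_multicast_load(
        smem_tile, mbar, (UInt(0), UInt(0)), two_ctas
    )
mbar.wait()
```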
async_store
async_store(self, src: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], coords: Tuple[UInt, UInt])
Schedules an asynchronous store from shared memory to global memory.
This method initiates a hardware-accelerated asynchronous transfer of data from shared memory to global memory at the specified coordinates.
Constraints:
- The source tensor must be 128-byte aligned in shared memory.
Args:
- src (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The source tensor in shared memory from which data will be copied. Must be 128-byte aligned.
- coords (`Tuple[UInt, UInt]`): The 2D coordinates in the destination tensor where data will be stored.
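Example:
A fragment in the same style: shared-memory writes are made visible with a block barrier before one thread issues the store. Note that `async_store` takes no `TMABarrier`; tracking store completion with commit/wait-group primitives is an assumption (the `cp_async_bulk_*` names are assumed to live in `gpu.memory`):

```mojo
from gpu import barrier
from gpu.memory import (  # assumed module and names
    cp_async_bulk_commit_group,
    cp_async_bulk_wait_group,
)

# Make every thread's writes to smem_tile visible before the store.
barrier()
if thread_idx.x == 0:
    tma_tile.async_store(smem_tile, (UInt(row), UInt(col)))
    cp_async_bulk_commit_group()   # assumed: batch the outstanding store
    cp_async_bulk_wait_group[0]()  # assumed: drain it before reusing smem
```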
async_reduce
async_reduce[reduction_kind: ReduceOp](self, src: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], coords: Tuple[UInt, UInt])
Schedules an asynchronous reduction operation from shared memory to global memory.
This method initiates a hardware-accelerated asynchronous reduction operation that combines data from shared memory with data in global memory using the specified reduction operation. The reduction is performed element-wise at the specified coordinates in the global tensor.
Constraints:
- The source tensor must be 128-byte aligned in shared memory.
Parameters:
- reduction_kind (`ReduceOp`): The type of reduction operation to perform (e.g., ADD, MIN, MAX). This determines how values are combined during the reduction.
Args:
- src (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The source tensor in shared memory containing the data to be reduced. Must be 128-byte aligned.
- coords (`Tuple[UInt, UInt]`): The 2D coordinates in the destination tensor where the reduction will be applied.
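Example:
Accumulating a block's partial tile into global memory looks like the store fragment above with a reduction kind; the `ReduceOp` import location is an assumption:

```mojo
from gpu import barrier
from gpu.memory import ReduceOp  # assumed import location

barrier()  # finish writing partial results into smem_tile first
if thread_idx.x == 0:
    # Element-wise: global[row : row + 64, col : col + 64] += smem_tile
    tma_tile.async_reduce[ReduceOp.ADD](smem_tile, (UInt(row), UInt(col)))
```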
smem_tensormap_init
smem_tensormap_init(self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3)])
Initializes a TMA descriptor in shared memory from this tensor tile's descriptor.
This method copies the TMA descriptor from global memory to shared memory, allowing for faster access during kernel execution. The descriptor is copied in 16-byte chunks using asynchronous copy operations for efficiency.
Note:
- Only one thread should call this method to avoid race conditions
- The descriptor is copied in 8 chunks of 16 bytes each (total 128 bytes)
Args:
- smem_tma_descriptor_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3)]`): Pointer to the location in shared memory where the descriptor will be stored. Must be properly aligned.
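Example:
A sketch of the single-thread idiom, assuming `stack_allocation` from the `memory` module can reserve an aligned shared-memory slot and that `TMADescriptor` is importable from `layout.tma_async` (both import paths are assumptions):

```mojo
from gpu import barrier, thread_idx
from gpu.memory import AddressSpace
from layout.tma_async import TMADescriptor  # assumed path
from memory import stack_allocation

# One aligned TMADescriptor slot in shared memory.
var smem_desc = stack_allocation[
    1, TMADescriptor, alignment=128, address_space = AddressSpace.SHARED
]()
if thread_idx.x == 0:
    tma_tile.smem_tensormap_init(smem_desc)
barrier()  # other threads must not read the descriptor before it lands
```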
replace_tensormap_global_address_in_gmem
replace_tensormap_global_address_in_gmem[dtype: DType, src_layout: Layout](self, new_src: LayoutTensor[dtype, src_layout, MutableAnyOrigin])
Replaces the global memory address in the TMA descriptor stored in global memory.
This method allows dynamically changing the source tensor for TMA operations without recreating the entire descriptor, which is useful for reusing descriptors with different data sources. The operation modifies the descriptor in global memory directly.
Note: A memory fence may be required after this operation to ensure visibility of the changes to other threads.
Parameters:
- dtype (`DType`): The data type of the new source tensor.
- src_layout (`Layout`): The layout of the new source tensor.
Args:
- new_src (`LayoutTensor[dtype, src_layout, MutableAnyOrigin]`): The new source tensor whose address will replace the current one in the descriptor. Must have a layout compatible with the original tensor.
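Example:
A fragment showing the retarget-then-fence pattern; the pairing with `tensormap_fence_release` and `tensormap_fence_acquire` follows the notes on those methods below, while the block barrier is an assumption about a reasonable kernel structure:

```mojo
if thread_idx.x == 0:
    tma_tile.replace_tensormap_global_address_in_gmem(new_src)
    tma_tile.tensormap_fence_release()  # publish the modified descriptor
barrier()
# Warp-aligned: every thread of the warp calls the acquire fence
# before the next TMA operation reads the descriptor.
tma_tile.tensormap_fence_acquire()
```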
tensormap_fence_acquire
tensormap_fence_acquire(self)
Establishes a memory fence for TMA operations with acquire semantics.
This method ensures proper ordering of memory operations by creating a barrier that prevents subsequent TMA operations from executing before prior operations have completed. It is particularly important when reading from a descriptor that might have been modified by other threads or processes.
The acquire semantics ensure that all memory operations after this fence will observe any modifications made to the descriptor before the fence.
Notes:
- The entire warp must call this function as the instruction is warp-aligned.
- Typically used in pairs with `tensormap_fence_release` for proper synchronization.
tensormap_fence_release
tensormap_fence_release(self)
Establishes a memory fence for TMA operations with release semantics.
This method ensures proper ordering of memory operations by creating a barrier that ensures all prior memory operations are visible before subsequent operations can proceed. It is particularly important when modifying a TMA descriptor in global memory that might be read by other threads or processes.
The release semantics ensure that all memory operations before this fence will be visible to any thread that observes operations after the fence.
Notes:
- Typically used after modifying a tensormap descriptor in global memory.
- Often paired with `tensormap_fence_acquire` for proper synchronization.
replace_tensormap_global_address_in_shared_mem
replace_tensormap_global_address_in_shared_mem[dtype: DType, src_layout: Layout](self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3)], new_src: LayoutTensor[dtype, src_layout, MutableAnyOrigin])
Replaces the global memory address in the TMA descriptor stored in shared memory.
This method allows dynamically changing the source tensor for TMA operations without recreating the entire descriptor, which is useful for reusing descriptors with different data sources. The operation modifies a descriptor that has been previously copied to shared memory.
Notes:
- Only one thread should call this method to avoid race conditions.
- A memory fence may be required after this operation to ensure visibility
of the changes to other threads.
- Typically used with descriptors previously initialized with `smem_tensormap_init`.
Parameters:
- dtype (`DType`): The data type of the new source tensor.
- src_layout (`Layout`): The layout of the new source tensor.
Args:
- smem_tma_descriptor_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3)]`): Pointer to the TMA descriptor in shared memory that will be modified.
- new_src (`LayoutTensor[dtype, src_layout, MutableAnyOrigin]`): The new source tensor whose address will replace the current one in the descriptor. Must have a layout compatible with the original tensor.
tensormap_cp_fence_release
tensormap_cp_fence_release(self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3)])
Establishes a memory fence for TMA operations with release semantics for shared memory descriptors.
This method ensures proper ordering of memory operations by creating a barrier that ensures all prior memory operations are visible before subsequent operations can proceed. It is specifically designed for synchronizing between global memory and shared memory TMA descriptors.
The release semantics ensure that all memory operations before this fence will be visible to any thread that observes operations after the fence.
Notes:
- The entire warp must call this function as the instruction is warp-aligned.
- Typically used after modifying a tensormap descriptor in shared memory.
- More specialized than the general `tensormap_fence_release` for cross-memory-space synchronization.
Args:
- smem_tma_descriptor_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3)]`): Pointer to the TMA descriptor in shared memory that is being synchronized with the global memory descriptor.
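Example:
Putting the shared-memory descriptor pieces together, a hedged end-to-end fragment (reusing `smem_desc` and `new_src` from the sketches above; exact ordering requirements may differ by hardware):

```mojo
if thread_idx.x == 0:
    tma_tile.smem_tensormap_init(smem_desc)
    tma_tile.replace_tensormap_global_address_in_shared_mem(
        smem_desc, new_src
    )
# Warp-aligned: the whole warp publishes the updated smem descriptor...
tma_tile.tensormap_cp_fence_release(smem_desc)
# ...and acquires before subsequent copies read from new_src.
tma_tile.tensormap_fence_acquire()
```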