Mojo function
create_tma_tile
```mojo
create_tma_tile[*tile_sizes: Int, *, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle.SWIZZLE_NONE](ctx: DeviceContext, tensor: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]) -> TMATensorTile[dtype, ...]
```
Creates a `TMATensorTile` with specified tile dimensions and swizzle mode.
This function creates a hardware-accelerated Tensor Memory Access (TMA) descriptor for efficient asynchronous data transfers between global memory and shared memory. It configures the tile dimensions and memory access patterns based on the provided parameters.
Constraints:
- The last dimension's size in bytes must not exceed the swizzle mode's byte limit (32B for `SWIZZLE_32B`, 64B for `SWIZZLE_64B`, 128B for `SWIZZLE_128B`).
- This overload only supports 2D tensors.
Parameters:
- `*tile_sizes` (`Int`): The dimensions of the tile to be transferred. For 2D tensors, this should be `[height, width]`. The dimensions determine the shape of data transferred in each TMA operation.
- `swizzle_mode` (`TensorMapSwizzle`): The swizzling mode to use for memory access optimization. Swizzling can improve memory access patterns for specific hardware configurations.
Args:
- `ctx` (`DeviceContext`): The CUDA device context used to create the TMA descriptor.
- `tensor` (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The source tensor from which data will be transferred. This defines the global memory layout and data type.
Returns:
A `TMATensorTile` configured with the specified tile dimensions and swizzle mode, ready for use in asynchronous data transfer operations.
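As a rough usage sketch (the import paths, the tensor setup, and the helper name `build_descriptor` are assumptions for illustration, not taken from this page):

```mojo
# Hypothetical sketch: import paths and tensor setup are assumptions.
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor
from layout.tma_async import create_tma_tile  # assumed module path
from gpu.memory import TensorMapSwizzle       # assumed module path

fn build_descriptor(ctx: DeviceContext) raises:
    # Assume `src` is a 64x64 float32 LayoutTensor already resident in
    # global memory (allocation and initialization elided).
    # Each TMA operation will then move one 32x32 tile of `src`.
    # The last tile dimension is 32 elements * 4 bytes = 128B, which
    # satisfies SWIZZLE_128B's byte limit for the last dimension.
    var tma_tile = create_tma_tile[
        32, 32, swizzle_mode = TensorMapSwizzle.SWIZZLE_128B
    ](ctx, src)
```

The returned descriptor is typically passed into a GPU kernel, where it drives asynchronous global-to-shared-memory copies.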
```mojo
create_tma_tile[type: DType, rank: Int, tile_shape: IndexList[rank], /, is_k_major: Bool = True, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle.SWIZZLE_NONE, *, __tile_layout: Layout = Layout.row_major(tile_shape[0], tile_shape[1]), __desc_layout: Layout = _tma_desc_tile_layout[...]](ctx: DeviceContext, tensor: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]) -> TMATensorTile[type, __tile_layout, __desc_layout]
```
Creates a `TMATensorTile` with advanced configuration options for 2D or 3D tensors.
This overload provides more control over the TMA descriptor creation, allowing specification of data type, rank, and layout orientation. It supports both 2D and 3D tensors and provides fine-grained control over the memory access patterns.
Constraints:
- Only supports 2D and 3D tensors (rank must be 2 or 3).
- For non-`SWIZZLE_NONE` modes, the K dimension size in bytes must be a multiple of the swizzle mode's byte size.
- For MN-major layout, only `SWIZZLE_128B` is supported.
- For 3D tensors, only K-major layout is supported.
Parameters:
- `type` (`DType`): The data type of the tensor elements.
- `rank` (`Int`): The dimensionality of the tensor (must be 2 or 3).
- `tile_shape` (`IndexList[rank]`): The shape of the tile to be transferred.
- `is_k_major` (`Bool`): Whether the tensor layout is K-major (`True`) or MN-major (`False`). Defaults to `True`. K-major is typically used for weight matrices, while MN-major is used for activation matrices in matrix multiplication operations.
- `swizzle_mode` (`TensorMapSwizzle`): The swizzling mode to use for memory access optimization. Defaults to `TensorMapSwizzle.SWIZZLE_NONE`.
- `__tile_layout` (`Layout`): Internal parameter for the tile layout in shared memory. Defaults to `Layout.row_major(tile_shape[0], tile_shape[1])`.
- `__desc_layout` (`Layout`): Internal parameter for the descriptor layout, which may differ from the tile layout to accommodate hardware requirements. Defaults to `_tma_desc_tile_layout[...]`.
Args:
- `ctx` (`DeviceContext`): The CUDA device context used to create the TMA descriptor.
- `tensor` (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The source tensor from which data will be transferred. This defines the global memory layout and must match the specified data type.
Returns:
A `TMATensorTile` configured with the specified parameters, ready for use in asynchronous data transfer operations.
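A sketch of this advanced overload (import paths, the `weights` tensor, and the helper name are assumptions for illustration):

```mojo
# Hypothetical sketch: import paths and tensor setup are assumptions.
from gpu.host import DeviceContext
from layout.tma_async import create_tma_tile  # assumed module path
from gpu.memory import TensorMapSwizzle       # assumed module path
from utils.index import Index

fn build_weight_descriptor(ctx: DeviceContext) raises:
    # Assume `weights` is a K-major bfloat16 LayoutTensor in global
    # memory (allocation elided). bfloat16 is 2 bytes, so a K extent of
    # 64 elements gives 128B per row -- a multiple of SWIZZLE_128B's
    # byte size, as the constraints require.
    var tma_tile = create_tma_tile[
        DType.bfloat16,
        2,              # rank: 2D tensor
        Index(64, 64),  # tile shape
        is_k_major=True,
        swizzle_mode = TensorMapSwizzle.SWIZZLE_128B,
    ](ctx, weights)
```

Choosing `is_k_major=True` here matches the typical use of this overload for weight matrices in matrix multiplication, per the parameter description above.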