Mojo function

cp_async_bulk_tensor_shared_cluster_global

cp_async_bulk_tensor_shared_cluster_global[dst_type: AnyType, mbr_type: AnyType, rank: Int, /, *, cta_group: Int = 1](dst_mem: UnsafePointer[dst_type, address_space=AddressSpace(3)], tma_descriptor: UnsafePointer[NoneType], mem_bar: UnsafePointer[mbr_type, address_space=AddressSpace(3)], coords: IndexList[rank])

Initiates an asynchronous bulk copy operation of tensor data from global memory to shared memory.

This function performs an asynchronous copy of tensor data using NVIDIA's Tensor Memory Accelerator (TMA). It supports both rank-1 and rank-2 tensors and uses cluster-level synchronization for efficient data movement.

Notes:

This operation is asynchronous: the call returns before the data has arrived, so wait on the memory barrier before reading the destination.
Only supports rank-1 and rank-2 tensors.
Requires an NVIDIA GPU with TMA support (Hopper, sm_90, or newer).
The memory barrier must be initialized before it is passed to this function.

Parameters:

dst_type (AnyType): The data type of the destination memory.
mbr_type (AnyType): The data type of the memory barrier.
rank (Int): The dimensionality of the tensor (1, 2, or 3).
cta_group (Int): The CTA group to use for the copy operation. Must be 1 or 2.

Args:

dst_mem (UnsafePointer): Pointer to the destination in shared memory where the tensor data will be copied. Must be properly aligned according to TMA requirements.
tma_descriptor (UnsafePointer): Pointer to the TMA descriptor that contains metadata about the tensor layout and memory access patterns.
mem_bar (UnsafePointer): Pointer to a shared memory barrier used for synchronizing the asynchronous copy operation across threads in the cluster.
coords (IndexList): Coordinates specifying which tile of the tensor to copy. For rank-1 tensors, this is a single coordinate. For rank-2 tensors, this contains both row and column coordinates.
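
The following is a minimal sketch of a rank-2 call, written only from the signature above. The helper name `issue_tile_copy_2d`, the coordinate names `c0` and `c1`, and the import paths are assumptions for illustration; the page does not name the module or state the order of the coordinates, so check both against your Mojo version. Barrier initialization and the wait are left to the caller.

```mojo
# Assumed import locations; the page above does not state them.
from gpu.memory import AddressSpace, cp_async_bulk_tensor_shared_cluster_global
from memory import UnsafePointer
from utils.index import IndexList


# Hypothetical helper, not part of the library.
fn issue_tile_copy_2d[
    dst_type: AnyType, mbr_type: AnyType
](
    dst_mem: UnsafePointer[dst_type, address_space = AddressSpace(3)],
    tma_descriptor: UnsafePointer[NoneType],
    mem_bar: UnsafePointer[mbr_type, address_space = AddressSpace(3)],
    c0: Int,
    c1: Int,
):
    # Two coordinates select the tile of a rank-2 tensor.
    var coords = IndexList[2](c0, c1)

    # Parameters are spelled out to mirror the documented signature;
    # cta_group keeps its default of 1. The call only starts the copy.
    cp_async_bulk_tensor_shared_cluster_global[dst_type, mbr_type, 2](
        dst_mem, tma_descriptor, mem_bar, coords
    )
    # The contents of dst_mem are not valid yet: the caller must wait on
    # mem_bar before reading them.
```

Typically a single thread issues the copy while every thread that reads the tile waits on the barrier, which keeps the request from being issued more than once per tile.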