
Mojo function

cp_async_k_major

```mojo
cp_async_k_major[type: DType](
    dst: LayoutTensor[type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment],
    src: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment],
)
```

Asynchronously copy data from DRAM to SRAM using TMA (Tensor Memory Accelerator) with K-major layout.

This function performs an asynchronous copy operation from global memory (DRAM) to shared memory (SRAM) using NVIDIA's Tensor Memory Accelerator (TMA) hardware. It optimizes for K-major memory access patterns, which is particularly beneficial for certain tensor operations like matrix multiplications where the inner dimension (K) is accessed contiguously.

The function automatically determines the optimal tile size and thread distribution based on the tensor shapes and hardware capabilities, leveraging TMA's efficient memory transfer mechanisms.

Example:

```mojo
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor

var global_data = LayoutTensor[
    DType.float32, Layout.row_major(128, 128),
    address_space = AddressSpace.GLOBAL,
]()
var shared_data = LayoutTensor[
    DType.float32, Layout.row_major(32, 32),
    address_space = AddressSpace.SHARED,
]()

# Copy data with K-major layout optimization
cp_async_k_major[DType.float32](shared_data, global_data)

# Wait for the asynchronous copy to complete
cp_async_wait_all()
```

Performance:

- Uses TMA hardware acceleration for optimal memory transfer performance.
- Optimizes for K-major access patterns, which can significantly improve
performance for certain tensor operations like matrix multiplications.
- Performs asynchronous transfers, allowing computation to overlap with memory operations.
- Automatically determines optimal tile sizes based on tensor dimensions.
- Uses hardware-accelerated swizzling to reduce shared memory bank conflicts.

Notes:

- This function requires NVIDIA GPUs with TMA support (compute capability 9.0+).
- The source tensor must be in GENERIC or GLOBAL address space (DRAM).
- The destination tensor must be in SHARED address space (SRAM).
- Both tensors must have the same data type.
- This function is asynchronous, so you must call `cp_async_wait_all()` or
`cp_async_wait_group()` to ensure the copy has completed before using the data.
- K-major layout is particularly beneficial for matrix multiplication operations
where the inner dimension (K) is accessed contiguously.
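
The notes above mention `cp_async_wait_group()` as an alternative to waiting on all outstanding copies, which enables software pipelining: fetching the next tile while the current one is consumed. The sketch below illustrates that pattern only; the buffer setup, the `src_tile`, `compute_on`, and `swap` helpers, the `barrier()` call, and the exact argument convention of `cp_async_wait_group` are all assumptions for illustration, not the verified API.

```mojo
# Illustrative double-buffering sketch (not a complete kernel).
# Assumes `buf0`/`buf1` are SHARED-memory tiles, `src_tile(i)` yields
# the i-th GLOBAL-memory tile, and `compute_on` consumes one tile;
# all three are hypothetical helpers.

# Issue the copy for tile 0 into the first buffer.
cp_async_k_major[DType.float32](buf0, src_tile(0))

for i in range(1, num_tiles):
    # Start fetching the next tile into the other buffer.
    cp_async_k_major[DType.float32](buf1, src_tile(i))

    # Wait until at most one copy group is still in flight,
    # i.e. tile i-1 has landed in buf0.
    cp_async_wait_group(1)
    barrier()  # make the data visible to all threads in the block

    compute_on(buf0)
    swap(buf0, buf1)

# Drain the final outstanding copy and process the last tile.
cp_async_wait_all()
barrier()
compute_on(buf0)
```

The key property this pattern relies on is that the copy issued last can remain in flight while earlier copies are waited on, so memory transfer for tile `i` overlaps with computation on tile `i-1`.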

Constraints:

  • Requires NVIDIA GPUs with TMA support (compute capability 9.0+).
  • Source tensor must be in GENERIC or GLOBAL address space.
  • Destination tensor must be in SHARED address space.
  • Both tensors must have the same data type.
  • Source and destination tensors must be 2D.

Parameters:

  • type (DType): The data type of the tensor elements.

Args:

  • dst (LayoutTensor[type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The destination tensor, which must be in shared memory (SRAM).
  • src (LayoutTensor[type, layout, origin, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The source tensor, which must be in global or generic memory (DRAM).