Mojo function
copy_dram_to_sram
copy_dram_to_sram[src_thread_layout: Layout, dst_thread_layout: Layout = src_thread_layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = src_thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])
Synchronously copy data from DRAM (global memory) to SRAM (shared memory) in a GPU context.
The copy workload is distributed across multiple threads for parallel execution. It uses thread affinity mapping to ensure efficient work distribution and supports vectorized memory operations for optimal performance.
Example:
```mojo
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_dram_to_sram

var global_data = LayoutTensor[
    DType.float32, Layout((128, 128)), address_space=AddressSpace.GLOBAL
]()
var shared_data = LayoutTensor[
    DType.float32, Layout((32, 32)), address_space=AddressSpace.SHARED
]()

# Copy data using a 2D thread layout with 8x8 threads.
copy_dram_to_sram[Layout((8, 8))](shared_data, global_data)
```
Performance:
- Distributes the copy workload across multiple threads for parallel execution.
- Supports vectorized loads and stores for better memory throughput.
- Can use swizzling to optimize memory access patterns and reduce bank conflicts.
- Thread affinity mapping ensures efficient work distribution.
- For masked tensors, performs bounds checking to handle edge cases correctly.
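To make the work-distribution idea concrete, here is a plain-Python sketch (not Mojo, with hypothetical function names) of how an 8x8 thread layout can cover a 32x32 tile: each thread copies the elements congruent to its position modulo the thread-layout shape, so the 64 threads partition the tile exactly. The real mapping is defined by the library's Layout algebra; this only models the strided-distribution concept.

```python
# Conceptual model of distributing a tile copy over a thread layout.
# Names are hypothetical; the real mapping comes from Mojo's Layout algebra.
def distribute_copy(tile_rows, tile_cols, thread_rows, thread_cols):
    """Return {thread_id: [(r, c), ...]} - the elements each thread copies."""
    work = {}
    for tr in range(thread_rows):
        for tc in range(thread_cols):
            tid = tr * thread_cols + tc
            # Each thread handles the elements congruent to its (row, col)
            # position modulo the thread-layout shape (a strided distribution).
            work[tid] = [(r, c)
                         for r in range(tr, tile_rows, thread_rows)
                         for c in range(tc, tile_cols, thread_cols)]
    return work

work = distribute_copy(32, 32, 8, 8)
# 64 threads, each copying (32*32)/64 = 16 elements, covering the tile exactly.
assert len(work) == 64
assert all(len(v) == 16 for v in work.values())
assert sorted(e for v in work.values() for e in v) == \
       [(r, c) for r in range(32) for c in range(32)]
```

With vectorized loads, each thread would copy a small vector per step instead of a scalar, but the partitioning idea is the same.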
Notes:
- The source tensor must be in GENERIC or GLOBAL address space (DRAM).
- The destination tensor must be in SHARED address space (SRAM).
- Both tensors must have the same data type.
- This function is synchronous, meaning all threads must complete their
copy operations before proceeding.
- For optimal performance, the thread layouts should match the memory
access patterns of the tensors.
- This function is particularly useful in GPU kernels for loading data
from global memory to shared memory for faster access.
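The bounds checking mentioned for masked tensors can be sketched in plain Python (hypothetical names, not the library's implementation): when the global tensor's extent is not a multiple of the tile size, an edge tile copies only its in-bounds elements and leaves the rest of the shared-memory tile untouched.

```python
# Conceptual sketch of the bounds checking done for masked tensors.
# Plain Python with hypothetical names, not the library's implementation.
def copy_masked_tile(src, src_rows, src_cols, tile_r0, tile_c0, tile=32):
    """Copy a tile starting at (tile_r0, tile_c0), skipping out-of-bounds
    elements; the destination tile is zero-initialized."""
    out = [[0.0] * tile for _ in range(tile)]
    for r in range(tile):
        for c in range(tile):
            gr, gc = tile_r0 + r, tile_c0 + c
            if gr < src_rows and gc < src_cols:  # the mask / bounds check
                out[r][c] = src[gr][gc]
    return out

# A 100x100 tensor tiled by 32: the last tile copies only a 4x4 corner.
src = [[1.0] * 100 for _ in range(100)]
tile = copy_masked_tile(src, 100, 100, 96, 96)
assert sum(map(sum, tile)) == 16.0  # 4*4 in-bounds elements copied
```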
Constraints:
- Source and destination tensors must have the same data type.
- Source tensor must be in GENERIC or GLOBAL address space.
- Destination tensor must be in SHARED address space.
- For non-masked tensors, the fragment sizes must match.
Parameters:
- src_thread_layout (Layout): Layout defining how threads are organized for the source tensor. This determines how the workload is distributed among threads.
- dst_thread_layout (Layout): Layout defining how threads are organized for the destination tensor. Defaults to the same as src_thread_layout if not specified.
- swizzle (OptionalReg[Swizzle]): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts.
- num_threads (Int): Total number of threads participating in the copy operation. Defaults to the size of src_thread_layout.
- thread_scope (ThreadScope): Scope at which thread operations are performed (BLOCK or WARP). Defaults to BLOCK, where all threads in a block participate.
Args:
- dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The destination tensor, which must be in shared memory (SRAM).
- src (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The source tensor, which must be in global or generic memory (DRAM).
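To illustrate what the swizzle parameter buys, here is a plain-Python sketch of a classic XOR swizzle (not the library's actual Swizzle type): with 32 four-byte shared-memory banks, threads writing a column of a row-major 32-wide tile all hit the same bank; XOR-ing the column index with the row index permutes columns within each row and spreads the accesses across all banks.

```python
BANKS = 32  # shared-memory banks on typical GPUs (4-byte words)

def bank(row, col, width=32):
    """Bank hit by element (row, col) of a row-major tile of 4-byte words."""
    return (row * width + col) % BANKS

def xor_swizzle(row, col):
    """Classic XOR swizzle: permute columns within a row by the row index."""
    return row, col ^ (row % BANKS)

# Unswizzled: 32 threads accessing column 0 all hit one bank (worst case).
naive = [bank(r, 0) for r in range(32)]
assert len(set(naive)) == 1

# Swizzled: the same column access is spread across all 32 banks.
swizzled = [bank(*xor_swizzle(r, 0)) for r in range(32)]
assert len(set(swizzled)) == 32
```

XOR with a per-row constant is a bijection on column indices, so each row still stores exactly its own elements, just in a permuted order.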
copy_dram_to_sram[src_thread_layout: Layout, dst_thread_layout: Layout = src_thread_layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = src_thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src_base: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])
Copies data from DRAM to SRAM on AMD GPUs. It uses the buffer_load intrinsic to load data and can check bounds. In addition to dst and src, it takes src_base as an argument to construct the buffer descriptor of the src tensor; src_base is the original global memory tensor from which src is derived.
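The role of src_base can be sketched in plain Python (field and function names are hypothetical): the buffer descriptor must describe the original allocation, not the derived view, because the hardware bounds-checks each access against the descriptor's extent and returns zeros for out-of-bounds lanes instead of faulting.

```python
# Hedged sketch of why src_base is needed on AMD GPUs.
# Field and function names are hypothetical, not the actual intrinsic API.
def make_buffer_descriptor(base_addr, num_bytes):
    """Descriptor built from the *base* tensor: address plus total extent."""
    return {"base": base_addr, "extent": num_bytes}

def buffer_load_in_bounds(desc, offset):
    """True if a byte offset is within the descriptor's extent; the hardware
    returns zeros (rather than faulting) for out-of-bounds accesses."""
    return 0 <= offset < desc["extent"]

# Descriptor for a 128x128 float32 base tensor (src_base):
desc = make_buffer_descriptor(base_addr=0x1000, num_bytes=128 * 128 * 4)
# A tile view (src) contributes only an offset into the base allocation:
assert buffer_load_in_bounds(desc, 1024)            # inside the allocation
assert not buffer_load_in_bounds(desc, 128 * 128 * 4)  # past the end: clamped
```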
copy_dram_to_sram[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src_base: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])
Synchronously copy data from DRAM to SRAM using a unified thread layout for AMD GPUs.
This is a convenience wrapper around the more general copy_dram_to_sram function that uses the same layout for both source and destination tensors. It's specifically designed for AMD GPUs, where the buffer_load intrinsic requires the original base tensor.
Performance:
- Simplifies API usage when the same thread layout is appropriate for both
source and destination tensors.
- Optimized for AMD GPUs using buffer_load intrinsics for efficient memory transfers.
- Distributes the copy workload across multiple threads for parallel execution.
Notes:
- This function is only supported on AMD GPUs.
- The source tensor must be in GENERIC or GLOBAL address space (DRAM).
- The destination tensor must be in SHARED address space (SRAM).
- Both tensors must have the same data type.
Parameters:
- thread_layout (Layout): Layout defining how threads are organized for both source and destination. This determines how the workload is distributed among threads.
- swizzle (OptionalReg[Swizzle]): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts.
- num_threads (Int): Total number of threads participating in the copy operation. Defaults to the size of thread_layout.
- thread_scope (ThreadScope): Scope at which thread operations are performed (BLOCK or WARP). Defaults to BLOCK, where all threads in a block participate.
Args:
- dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The destination tensor, which must be in shared memory (SRAM).
- src (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The source tensor, which must be in global or generic memory (DRAM).
- src_base (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The original global memory tensor from which src is derived, used to construct the buffer descriptor for AMD GPUs.
copy_dram_to_sram[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])
Synchronously copy data from DRAM to SRAM using a unified thread layout.
This is a convenience wrapper around the more general copy_dram_to_sram function that uses the same layout for both source and destination tensors. It simplifies the API for the common case where the same thread distribution pattern works well for both tensors.
Example:
```mojo
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_dram_to_sram

var global_data = LayoutTensor[
    DType.float32, Layout((128, 128)), address_space=AddressSpace.GLOBAL
]()
var shared_data = LayoutTensor[
    DType.float32, Layout((32, 32)), address_space=AddressSpace.SHARED
]()

# Copy data using a 2D thread layout with 8x8 threads.
copy_dram_to_sram[Layout((8, 8))](shared_data, global_data)
```
Performance:
- Simplifies API usage when the same thread layout is appropriate for both
source and destination tensors.
- Distributes the copy workload across multiple threads for parallel execution.
- Supports vectorized loads and stores for better memory throughput.
- Can use swizzling to optimize memory access patterns and reduce bank conflicts.
Notes:
- The source tensor must be in GENERIC or GLOBAL address space (DRAM).
- The destination tensor must be in SHARED address space (SRAM).
- Both tensors must have the same data type.
- This function is synchronous, meaning all threads must complete their
copy operations before proceeding.
Parameters:
- thread_layout (Layout): Layout defining how threads are organized for both source and destination. This determines how the workload is distributed among threads.
- swizzle (OptionalReg[Swizzle]): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts.
- num_threads (Int): Total number of threads participating in the copy operation. Defaults to the size of thread_layout.
- thread_scope (ThreadScope): Scope at which thread operations are performed (BLOCK or WARP). Defaults to BLOCK, where all threads in a block participate.
Args:
- dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The destination tensor, which must be in shared memory (SRAM).
- src (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The source tensor, which must be in global or generic memory (DRAM).