Mojo function
copy_sram_to_local
```mojo
copy_sram_to_local[
    src_warp_layout: Layout,
    axis: OptionalReg[Int] = None,
](
    dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment],
    src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment],
)
```
Synchronously copy data from SRAM (shared memory) to local memory.
This function performs a synchronous memory transfer from SRAM (shared memory) to local memory (registers) using the specified thread layout for workload distribution.
Example:
```mojo
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_sram_to_local

# Inside a GPU kernel: a 32x32 tile in shared memory and a 4x4
# per-thread fragment in registers.
var shared_data = LayoutTensor[DType.float32, Layout.row_major(32, 32),
    MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
var local_data = LayoutTensor[DType.float32, Layout.row_major(4, 4),
    MutableAnyOrigin, address_space = AddressSpace.LOCAL].stack_allocation()

# Copy data using a thread layout with 8 threads.
copy_sram_to_local[Layout(8)](local_data, shared_data)
```
Performance:
- Distributes the copy workload across multiple threads for parallel execution.
- Optimized for transferring data from shared memory to registers.
- Supports optional axis-specific distribution for specialized access patterns.
Constraints:
- The source tensor must be in the SHARED address space (SRAM).
- The destination tensor must be in the LOCAL address space (registers).
- Both tensors must have the same data type.
Parameters:
- `src_warp_layout` (`Layout`): Layout defining how threads are organized for the source tensor. This determines how the workload is distributed among threads.
- `axis` (`OptionalReg[Int]`): Optional parameter specifying which axis to distribute along. When provided, distribution happens along the specified axis. When `None` (the default), distribution uses the standard layout pattern. See the sketch after this list.
Args:
- `dst` (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The destination tensor, which must be in local memory (registers).
- `src` (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The source tensor, which must be in shared memory (SRAM).
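Because the copy is distributed across the threads of a warp, `copy_sram_to_local` is meant to be called from device code. Below is a minimal launch sketch assuming the standard `DeviceContext` API from `gpu.host`; the kernel name, tile sizes, fill step, and launch configuration are illustrative assumptions rather than part of this function's contract:

```mojo
from gpu.host import DeviceContext
from gpu.memory import AddressSpace
from gpu.sync import barrier
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_sram_to_local

fn copy_kernel():
    # Shared-memory tile cooperatively produced by the thread block.
    var smem_tile = LayoutTensor[DType.float32, Layout.row_major(32, 32),
        MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
    # ... threads cooperatively fill smem_tile here ...
    barrier()  # make the shared tile visible to every thread

    # Per-thread register fragment receiving this thread's share.
    var reg_frag = LayoutTensor[DType.float32, Layout.row_major(4, 4),
        MutableAnyOrigin, address_space = AddressSpace.LOCAL].stack_allocation()
    copy_sram_to_local[Layout(8)](reg_frag, smem_tile)

def main():
    var ctx = DeviceContext()
    # One block of 8 threads matches the Layout(8) warp layout above.
    ctx.enqueue_function[copy_kernel](grid_dim=1, block_dim=8)
    ctx.synchronize()
```

Synchronizing with `barrier()` before the copy ensures every thread reads a fully populated tile.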