Mojo function

copy_sram_to_dram

`copy_sram_to_dram[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = None, num_threads: Int = thread_layout.size(), binary_op: OptionalReg[fn[DType, Int](lhs: SIMD[$0, $1], rhs: SIMD[$0, $1]) -> SIMD[$0, $1]] = None](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])`

Synchronously copy data from SRAM (shared memory) to DRAM (global memory).

This function performs a synchronous memory transfer from SRAM (shared memory) to DRAM (global memory) using the specified thread layout for workload distribution. It supports optional swizzling for optimized memory access patterns and binary operations for combining data during the transfer.

Example:

```mojo
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_sram_to_dram

fn copy_tile(
    # Destination tile in global/generic memory (DRAM), e.g. a kernel argument.
    global_data: LayoutTensor[DType.float32, Layout.row_major(32, 32), MutableAnyOrigin],
):
    # 32x32 tile staged in shared memory (SRAM).
    var shared_data = LayoutTensor[
        DType.float32,
        Layout.row_major(32, 32),
        MutableAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    # ... produce data in shared_data ...
    # Copy data using a 2D thread layout with 8x8 threads.
    copy_sram_to_dram[Layout.row_major(8, 8)](global_data, shared_data)
```

Performance:

- Distributes the copy workload across multiple threads for parallel execution.
- Supports vectorized loads and stores for better memory throughput.
- Can use swizzling to optimize memory access patterns.
- Supports binary operations to combine data during transfer (e.g., for reduction operations; see the sketch after this list).
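
For the reduction use case, here is a minimal sketch, assuming a kernel context like the example above. The helper `add_op` is hypothetical, and passing it directly to `binary_op` relies on `OptionalReg`'s implicit conversion from the wrapped value, which is an assumption rather than confirmed API behavior:

```mojo
from layout import Layout
from layout.layout_tensor import copy_sram_to_dram

# Hypothetical elementwise add matching the documented binary_op shape:
# fn[DType, Int](lhs: SIMD[$0, $1], rhs: SIMD[$0, $1]) -> SIMD[$0, $1]
fn add_op[dtype: DType, width: Int](
    lhs: SIMD[dtype, width], rhs: SIMD[dtype, width]
) -> SIMD[dtype, width]:
    return lhs + rhs

# Combines the incoming shared tile with the existing global data
# (here, by addition) instead of overwriting it.
copy_sram_to_dram[Layout.row_major(8, 8), binary_op=add_op](
    global_data, shared_data
)
```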

Notes:

- The source tensor must be in SHARED address space (SRAM).
- The destination tensor must be in GENERIC or GLOBAL address space (DRAM).
- Supports FP32 to half-precision downcast during copy if needed.
- Handles masked tensors with proper bounds checking.
- This function is synchronous: all threads must complete their copy operations before proceeding (see the kernel sketch after these notes).
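
The typical staging pattern looks like the following kernel-level sketch. It is a minimal sketch under stated assumptions: `copy_dram_to_sram` from the same module, `barrier` from the `gpu` package, and illustrative 32x32 tile and 8x8 thread shapes:

```mojo
from gpu import barrier
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_dram_to_sram, copy_sram_to_dram

alias tile_layout = Layout.row_major(32, 32)

fn stage_through_sram(
    dst: LayoutTensor[DType.float32, tile_layout, MutableAnyOrigin],
    src: LayoutTensor[DType.float32, tile_layout, MutableAnyOrigin],
):
    # Stack-allocate a tile in shared memory (SRAM).
    var tile = LayoutTensor[
        DType.float32,
        tile_layout,
        MutableAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # Stage the global tile into shared memory.
    copy_dram_to_sram[Layout.row_major(8, 8)](tile, src)
    barrier()  # every thread sees the fully staged tile

    # ... transform `tile` in shared memory ...

    barrier()  # all writes to `tile` are complete
    copy_sram_to_dram[Layout.row_major(8, 8)](dst, tile)
```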

Constraints:

- Source tensor must be in SHARED address space with a static layout.
- Destination tensor must be in GENERIC or GLOBAL address space.
- For type conversion, only FP32 to half-precision is supported.
- For vectorized copy with type conversion, both tensors must have element layouts matching the SIMD width of the destination type.

Parameters:

- `thread_layout` (`Layout`): Layout defining how threads are organized for both source and destination. This determines how the workload is distributed among threads.
- `swizzle` (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the source indices, which can improve memory access patterns and reduce bank conflicts (see the sketch after this list).
- `num_threads` (`Int`): Total number of threads participating in the copy operation. Defaults to the size of `thread_layout`.
- `binary_op` (`OptionalReg[fn[DType, Int](lhs: SIMD[$0, $1], rhs: SIMD[$0, $1]) -> SIMD[$0, $1]]`): Optional binary operation to apply during the copy, combining source data with existing destination data.
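
As a sketch of supplying a swizzle, assuming tensors like those in the example above: the `Swizzle` constructor from `layout.swizzle` and the particular `(3, 0, 3)` pattern are illustrative assumptions, since useful values depend on the tile shape and element type:

```mojo
from layout import Layout
from layout.layout_tensor import copy_sram_to_dram
from layout.swizzle import Swizzle

# Illustrative XOR-based swizzle pattern; concrete parameters are
# hardware- and tile-dependent (assumed values, not a recommendation).
alias swizzle_pattern = Swizzle(3, 0, 3)

# Relies on OptionalReg's implicit conversion from a Swizzle value.
copy_sram_to_dram[Layout.row_major(8, 8), swizzle=swizzle_pattern](
    global_data, shared_data
)
```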

Args:

- `dst` (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The destination tensor, which must be in global or generic memory (DRAM).
- `src` (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The source tensor, which must be in shared memory (SRAM).