Mojo function
copy_local_to_sram
```mojo
copy_local_to_sram[
    thread_layout: Layout,
    swizzle: OptionalReg[Swizzle] = None,
    thread_scope: ThreadScope = ThreadScope.BLOCK,
    row_major: Bool = False,
](
    dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment],
    src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment],
)
```
Synchronously copy data from local memory (registers) to SRAM (shared memory).
This function performs a synchronous copy from register memory to shared memory in a GPU context, distributing the workload across multiple threads for parallel execution. It is particularly useful for transferring processed data from registers to shared memory so that it can be shared between threads in the same block.
Example:
```mojo
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_local_to_sram

var register_data = LayoutTensor[DType.float32, Layout.row_major(16, 16),
    MutableAnyOrigin, address_space = AddressSpace.LOCAL].stack_allocation()
var shared_data = LayoutTensor[DType.float32, Layout.row_major(16, 16),
    MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()

# Process data in registers
# ...

# Copy processed data to shared memory for inter-thread communication
copy_local_to_sram[Layout.row_major(8, 8)](shared_data, register_data)
```
Performance:
- Distributes the copy workload across multiple threads for parallel execution.
- Can use swizzling to optimize memory access patterns and reduce bank conflicts.
- Optimized for transferring data from registers to shared memory.
- On AMD GPUs, the `row_major` parameter can be used to match the memory access
pattern used during prefetching from DRAM to registers.
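The swizzling and `row_major` options above can be sketched as follows. This is an illustrative fragment, not a complete kernel: `dst_smem` (SHARED) and `src_regs` (LOCAL) are assumed to be declared elsewhere, and the `make_swizzle` parameters are assumptions chosen for illustration rather than values prescribed by this page.

```mojo
from layout import Layout
from layout.layout_tensor import copy_local_to_sram
from layout.swizzle import make_swizzle

alias thread_layout = Layout.row_major(8, 8)

# A swizzle can be supplied to reduce shared-memory bank conflicts;
# these make_swizzle parameters are illustrative assumptions.
alias sw = make_swizzle[num_rows=8, row_size=16, access_size=1]()
copy_local_to_sram[thread_layout, swizzle=sw](dst_smem, src_regs)

# On AMD GPUs, row_major=True matches a DRAM -> registers -> SRAM
# prefetching access pattern.
copy_local_to_sram[thread_layout, row_major=True](dst_smem, src_regs)
```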
Notes:
- The destination tensor must be in SHARED address space (SRAM).
- The source tensor must be in LOCAL address space (registers).
- This function is particularly useful in GPU kernels for sharing processed
data between threads in the same block.
- The `row_major` parameter is specifically designed for AMD GPUs when using
a prefetching pattern from DRAM to SRAM via registers.
Constraints:
- Destination tensor must be in SHARED address space.
- Source tensor must be in LOCAL address space.
- For optimal performance, the thread layout should match the memory access patterns of the tensors.
Parameters:
- **thread_layout** (`Layout`): Layout defining how threads are organized for the operation. This determines how the workload is distributed among threads.
- **swizzle** (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts.
- **thread_scope** (`ThreadScope`): Scope at which thread operations are performed (BLOCK or WARP). Defaults to BLOCK, where all threads in a block participate.
- **row_major** (`Bool`): Whether to use row-major ordering for the copy operation. This is particularly relevant when prefetching from DRAM to SRAM via registers on AMD GPUs. Defaults to False.
Args:
- **dst** (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM).
- **src** (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The source tensor, which must be in local memory (registers).