Mojo function
copy_local_to_sram
```mojo
copy_local_to_sram[
    thread_layout: Layout,
    swizzle: OptionalReg[Swizzle] = None,
    thread_scope: ThreadScope = ThreadScope.BLOCK,
    row_major: Bool = False,
](
    dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment],
    src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment],
)
```
Synchronously copy data from local memory (registers) to SRAM (shared memory).
This function performs a synchronous copy from register memory to shared memory in a GPU context, distributing the workload across multiple threads for parallel execution. It is particularly useful for transferring processed data from registers to shared memory so that it can be shared between threads in the same block.
Example:
```mojo
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_local_to_sram

var register_data = LayoutTensor[DType.float32, Layout.row_major(16, 16),
    MutableAnyOrigin, address_space = AddressSpace.LOCAL].stack_allocation()
var shared_data = LayoutTensor[DType.float32, Layout.row_major(16, 16),
    MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()

# Process data in registers
# ...

# Copy processed data to shared memory for inter-thread communication
copy_local_to_sram[Layout.row_major(8, 8)](shared_data, register_data)
```
Performance:
- Distributes the copy workload across multiple threads for parallel execution.
- Can use swizzling to optimize memory access patterns and reduce bank conflicts.
- Optimized for transferring data from registers to shared memory.
- On AMD GPUs, the `row_major` parameter can be used to match the memory access
pattern used during prefetching from DRAM to registers.
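The swizzling and `row_major` options above can be sketched as follows. This is an illustrative fragment, not a complete kernel: `dst_smem` (SHARED) and `src_regs` (LOCAL) are assumed to be declared elsewhere, and the `make_swizzle` parameters are assumptions chosen for illustration rather than values prescribed by this page.

```mojo
from layout import Layout
from layout.layout_tensor import copy_local_to_sram
from layout.swizzle import make_swizzle

alias thread_layout = Layout.row_major(8, 8)

# A swizzle can be supplied to reduce shared-memory bank conflicts;
# these make_swizzle parameters are illustrative assumptions.
alias sw = make_swizzle[num_rows=8, row_size=16, access_size=1]()
copy_local_to_sram[thread_layout, swizzle=sw](dst_smem, src_regs)

# On AMD GPUs, row_major=True matches a DRAM -> registers -> SRAM
# prefetching access pattern.
copy_local_to_sram[thread_layout, row_major=True](dst_smem, src_regs)
```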
Notes:
- The destination tensor must be in SHARED address space (SRAM).
- The source tensor must be in LOCAL address space (registers).
- This function is particularly useful in GPU kernels for sharing processed
data between threads in the same block.
- The `row_major` parameter is specifically designed for AMD GPUs when using
a prefetching pattern from DRAM to SRAM via registers.
Constraints:
- Destination tensor must be in SHARED address space.
- Source tensor must be in LOCAL address space.
- For optimal performance, the thread layout should match the memory access patterns of the tensors.
Parameters:
- **thread_layout** (`Layout`): Layout defining how threads are organized for the operation. This determines how the workload is distributed among threads.
- **swizzle** (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts.
- **thread_scope** (`ThreadScope`): Scope at which thread operations are performed (BLOCK or WARP). Defaults to BLOCK, where all threads in a block participate.
- **row_major** (`Bool`): Whether to use row-major ordering for the copy operation. This is particularly relevant when prefetching from DRAM to SRAM via registers on AMD GPUs. Defaults to False.
Args:
- **dst** (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM).
- **src** (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The source tensor, which must be in local memory (registers).