Mojo function
copy_dram_to_sram
copy_dram_to_sram[src_thread_layout: Layout, dst_thread_layout: Layout = src_thread_layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = src_thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])
Synchronously copy data from DRAM (global memory) to SRAM (shared memory) in a GPU context.
The copy workload is distributed across multiple threads for parallel execution. It uses thread affinity mapping to ensure efficient work distribution and supports vectorized memory operations for optimal performance.
Example:
```mojo
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_dram_to_sram

var global_data = LayoutTensor[
    DType.float32, Layout((128, 128)), address_space=AddressSpace.GLOBAL
]()
var shared_data = LayoutTensor[
    DType.float32, Layout((32, 32)), address_space=AddressSpace.SHARED
]()

# Copy data using a 2D thread layout with 8x8 threads.
copy_dram_to_sram[Layout((8, 8))](shared_data, global_data)
```
Performance:
- Distributes the copy workload across multiple threads for parallel execution.
- Supports vectorized loads and stores for better memory throughput.
- Can use swizzling to optimize memory access patterns and reduce bank conflicts.
- Thread affinity mapping ensures efficient work distribution.
- For masked tensors, performs bounds checking to handle edge cases correctly.
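To make the work-distribution idea concrete, here is a plain-Python sketch (not Mojo, with hypothetical function names) of how an 8x8 thread layout can cover a 32x32 tile: each thread copies the elements congruent to its position modulo the thread-layout shape, so the 64 threads partition the tile exactly. The real mapping is defined by the library's Layout algebra; this only models the strided-distribution concept.

```python
# Conceptual model of distributing a tile copy over a thread layout.
# Names are hypothetical; the real mapping comes from Mojo's Layout algebra.
def distribute_copy(tile_rows, tile_cols, thread_rows, thread_cols):
    """Return {thread_id: [(r, c), ...]} - the elements each thread copies."""
    work = {}
    for tr in range(thread_rows):
        for tc in range(thread_cols):
            tid = tr * thread_cols + tc
            # Each thread handles the elements congruent to its (row, col)
            # position modulo the thread-layout shape (a strided distribution).
            work[tid] = [(r, c)
                         for r in range(tr, tile_rows, thread_rows)
                         for c in range(tc, tile_cols, thread_cols)]
    return work

work = distribute_copy(32, 32, 8, 8)
# 64 threads, each copying (32*32)/64 = 16 elements, covering the tile exactly.
assert len(work) == 64
assert all(len(v) == 16 for v in work.values())
assert sorted(e for v in work.values() for e in v) == \
       [(r, c) for r in range(32) for c in range(32)]
```

With vectorized loads, each thread would copy a small vector per step instead of a scalar, but the partitioning idea is the same.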
Notes:
- The source tensor must be in GENERIC or GLOBAL address space (DRAM).
- The destination tensor must be in SHARED address space (SRAM).
- Both tensors must have the same data type.
- This function is synchronous, meaning all threads must complete their
copy operations before proceeding.
- For optimal performance, the thread layouts should match the memory
access patterns of the tensors.
- This function is particularly useful in GPU kernels for loading data
from global memory to shared memory for faster access.
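The bounds checking mentioned for masked tensors can be sketched in plain Python (hypothetical names, not the library's implementation): when the global tensor's extent is not a multiple of the tile size, an edge tile copies only its in-bounds elements and leaves the rest of the shared-memory tile untouched.

```python
# Conceptual sketch of the bounds checking done for masked tensors.
# Plain Python with hypothetical names, not the library's implementation.
def copy_masked_tile(src, src_rows, src_cols, tile_r0, tile_c0, tile=32):
    """Copy a tile starting at (tile_r0, tile_c0), skipping out-of-bounds
    elements; the destination tile is zero-initialized."""
    out = [[0.0] * tile for _ in range(tile)]
    for r in range(tile):
        for c in range(tile):
            gr, gc = tile_r0 + r, tile_c0 + c
            if gr < src_rows and gc < src_cols:  # the mask / bounds check
                out[r][c] = src[gr][gc]
    return out

# A 100x100 tensor tiled by 32: the last tile copies only a 4x4 corner.
src = [[1.0] * 100 for _ in range(100)]
tile = copy_masked_tile(src, 100, 100, 96, 96)
assert sum(map(sum, tile)) == 16.0  # 4*4 in-bounds elements copied
```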
Constraints:
- Source and destination tensors must have the same data type.
- Source tensor must be in GENERIC or GLOBAL address space.
- Destination tensor must be in SHARED address space.
- For non-masked tensors, the fragment sizes must match.
Parameters:
- src_thread_layout (Layout): Layout defining how threads are organized for the source tensor. This determines how the workload is distributed among threads.
- dst_thread_layout (Layout): Layout defining how threads are organized for the destination tensor. Defaults to the same as src_thread_layout if not specified.
- swizzle (OptionalReg[Swizzle]): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts.
- num_threads (Int): Total number of threads participating in the copy operation. Defaults to the size of src_thread_layout.
- thread_scope (ThreadScope): Scope at which thread operations are performed (BLOCK or WARP). Defaults to BLOCK, where all threads in a block participate.
Args:
- dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The destination tensor, which must be in shared memory (SRAM).
- src (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The source tensor, which must be in global or generic memory (DRAM).
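To illustrate what the swizzle parameter buys, here is a plain-Python sketch of a classic XOR swizzle (not the library's actual Swizzle type): with 32 four-byte shared-memory banks, threads writing a column of a row-major 32-wide tile all hit the same bank; XOR-ing the column index with the row index permutes columns within each row and spreads the accesses across all banks.

```python
BANKS = 32  # shared-memory banks on typical GPUs (4-byte words)

def bank(row, col, width=32):
    """Bank hit by element (row, col) of a row-major tile of 4-byte words."""
    return (row * width + col) % BANKS

def xor_swizzle(row, col):
    """Classic XOR swizzle: permute columns within a row by the row index."""
    return row, col ^ (row % BANKS)

# Unswizzled: 32 threads accessing column 0 all hit one bank (worst case).
naive = [bank(r, 0) for r in range(32)]
assert len(set(naive)) == 1

# Swizzled: the same column access is spread across all 32 banks.
swizzled = [bank(*xor_swizzle(r, 0)) for r in range(32)]
assert len(set(swizzled)) == 32
```

XOR with a per-row constant is a bijection on column indices, so each row still stores exactly its own elements, just in a permuted order.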
copy_dram_to_sram[src_thread_layout: Layout, dst_thread_layout: Layout = src_thread_layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = src_thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src_base: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])
Copies data from DRAM to SRAM on AMD GPUs. It uses the buffer_load intrinsic to load data and can check bounds. In addition to dst and src, it takes src_base as an argument to construct the buffer descriptor of the src tensor; src_base is the original global memory tensor from which src is derived.
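The role of src_base can be sketched in plain Python (field and function names are hypothetical): the buffer descriptor must describe the original allocation, not the derived view, because the hardware bounds-checks each access against the descriptor's extent and returns zeros for out-of-bounds lanes instead of faulting.

```python
# Hedged sketch of why src_base is needed on AMD GPUs.
# Field and function names are hypothetical, not the actual intrinsic API.
def make_buffer_descriptor(base_addr, num_bytes):
    """Descriptor built from the *base* tensor: address plus total extent."""
    return {"base": base_addr, "extent": num_bytes}

def buffer_load_in_bounds(desc, offset):
    """True if a byte offset is within the descriptor's extent; the hardware
    returns zeros (rather than faulting) for out-of-bounds accesses."""
    return 0 <= offset < desc["extent"]

# Descriptor for a 128x128 float32 base tensor (src_base):
desc = make_buffer_descriptor(base_addr=0x1000, num_bytes=128 * 128 * 4)
# A tile view (src) contributes only an offset into the base allocation:
assert buffer_load_in_bounds(desc, 1024)            # inside the allocation
assert not buffer_load_in_bounds(desc, 128 * 128 * 4)  # past the end: clamped
```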
copy_dram_to_sram[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src_base: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])
Synchronously copy data from DRAM to SRAM using a unified thread layout for AMD GPUs.
This is a convenience wrapper around the more general copy_dram_to_sram function that uses the same layout for both source and destination tensors. It's specifically designed for AMD GPUs, where the buffer_load intrinsic requires the original base tensor.
Performance:
- Simplifies API usage when the same thread layout is appropriate for both
source and destination tensors.
- Optimized for AMD GPUs using buffer_load intrinsics for efficient memory transfers.
- Distributes the copy workload across multiple threads for parallel execution.
Notes:
- This function is only supported on AMD GPUs.
- The source tensor must be in GENERIC or GLOBAL address space (DRAM).
- The destination tensor must be in SHARED address space (SRAM).
- Both tensors must have the same data type.
Parameters:
- thread_layout (Layout): Layout defining how threads are organized for both source and destination. This determines how the workload is distributed among threads.
- swizzle (OptionalReg[Swizzle]): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts.
- num_threads (Int): Total number of threads participating in the copy operation. Defaults to the size of thread_layout.
- thread_scope (ThreadScope): Scope at which thread operations are performed (BLOCK or WARP). Defaults to BLOCK, where all threads in a block participate.
Args:
- dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The destination tensor, which must be in shared memory (SRAM).
- src (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The source tensor, which must be in global or generic memory (DRAM).
- src_base (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The original global memory tensor from which src is derived, used to construct the buffer descriptor for AMD GPUs.
copy_dram_to_sram[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])
Synchronously copy data from DRAM to SRAM using a unified thread layout.
This is a convenience wrapper around the more general copy_dram_to_sram function that uses the same layout for both source and destination tensors. It simplifies the API for the common case where the same thread distribution pattern works well for both tensors.
Example:
```mojo
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_dram_to_sram

var global_data = LayoutTensor[
    DType.float32, Layout((128, 128)), address_space=AddressSpace.GLOBAL
]()
var shared_data = LayoutTensor[
    DType.float32, Layout((32, 32)), address_space=AddressSpace.SHARED
]()

# Copy data using a 2D thread layout with 8x8 threads.
copy_dram_to_sram[Layout((8, 8))](shared_data, global_data)
```
Performance:
- Simplifies API usage when the same thread layout is appropriate for both
source and destination tensors.
- Distributes the copy workload across multiple threads for parallel execution.
- Supports vectorized loads and stores for better memory throughput.
- Can use swizzling to optimize memory access patterns and reduce bank conflicts.
Notes:
- The source tensor must be in GENERIC or GLOBAL address space (DRAM).
- The destination tensor must be in SHARED address space (SRAM).
- Both tensors must have the same data type.
- This function is synchronous, meaning all threads must complete their
copy operations before proceeding.
Parameters:
- thread_layout (Layout): Layout defining how threads are organized for both source and destination. This determines how the workload is distributed among threads.
- swizzle (OptionalReg[Swizzle]): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts.
- num_threads (Int): Total number of threads participating in the copy operation. Defaults to the size of thread_layout.
- thread_scope (ThreadScope): Scope at which thread operations are performed (BLOCK or WARP). Defaults to BLOCK, where all threads in a block participate.
Args:
- dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The destination tensor, which must be in shared memory (SRAM).
- src (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The source tensor, which must be in global or generic memory (DRAM).