Mojo function
copy_dram_to_sram_async
copy_dram_to_sram_async[src_thread_layout: Layout, dst_thread_layout: Layout, swizzle: Bool = False, fill: Fill = Fill(0), eviction_policy: CacheEviction = CacheEviction(0), num_threads: Int = src_thread_layout.size()](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])
Asynchronously copy data from DRAM (global memory) to SRAM (shared memory) in a GPU context.
This function performs an asynchronous copy operation from global memory (DRAM) to shared memory (SRAM) using NVIDIA's `cp.async` hardware mechanism. It distributes the workload across multiple threads and allows computation to overlap with memory transfers for improved performance.
Example:
```mojo
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_dram_to_sram_async, cp_async_wait_all

# Inside a GPU kernel, where `global_data` is a 128x128 float32 tensor
# in global memory (for example, a kernel argument):
var shared_data = LayoutTensor[
    DType.float32,
    Layout.row_major(32, 32),
    MutableAnyOrigin,
    address_space = AddressSpace.SHARED,
].stack_allocation()
# Asynchronously copy a 32x32 tile using 8x8 thread layouts.
copy_dram_to_sram_async[Layout.row_major(8, 8), Layout.row_major(8, 8)](
    shared_data, global_data.tile[32, 32](0, 0)
)
# Perform other computations while the copy is in progress.
# Wait for the asynchronous copy to complete.
cp_async_wait_all()
```
Performance:
- Performs asynchronous transfers, allowing computation to overlap with memory operations.
- Distributes the copy workload across multiple threads for parallel execution.
- Can use swizzling to optimize memory access patterns and reduce bank conflicts.
- Supports different cache eviction policies to optimize memory hierarchy usage.
- For masked tensors, performs bounds checking to handle edge cases correctly.
Notes:
- This function requires NVIDIA GPUs with cp.async support (compute capability 8.0+).
- The source tensor must be in GENERIC or GLOBAL address space (DRAM).
- The destination tensor must be in SHARED address space (SRAM).
- Both tensors must have the same data type.
- This function is asynchronous, so you must call `cp_async_wait_all()` or
`cp_async_wait_group()` to ensure the copy has completed before using the data.
- The maximum size of each element that can be copied is 16 bytes.
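The notes above distinguish `cp_async_wait_all()`, which drains every outstanding copy, from `cp_async_wait_group()`, which waits only until a given number of copy groups remain in flight. A double-buffered sketch of the latter, with hypothetical placeholders (`smem_a`, `smem_b`, `gmem_tile`, `compute`, `num_tiles`) and group-commit bookkeeping elided:

```mojo
# Prefetch tile 0, then overlap each tile's compute with the next fetch.
copy_dram_to_sram_async[Layout.row_major(8, 8)](smem_a, gmem_tile(0))
for i in range(1, num_tiles):
    # Issue the copy for tile i into the buffer not currently in use...
    copy_dram_to_sram_async[Layout.row_major(8, 8)](
        smem_b if i % 2 == 1 else smem_a, gmem_tile(i)
    )
    # ...then wait until only that newest copy is still in flight.
    cp_async_wait_group(1)
    compute(smem_a if i % 2 == 1 else smem_b)
# Drain the last copy before consuming the final tile.
cp_async_wait_all()
compute(smem_a if num_tiles % 2 == 1 else smem_b)
```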
Constraints:
- Requires NVIDIA GPUs with `cp.async` support (compute capability 8.0+).
- Source tensor must be in `GENERIC` or `GLOBAL` address space.
- Destination tensor must be in `SHARED` address space.
- Both tensors must have the same data type.
- Element size must be 4, 8, or 16 bytes.
Parameters:
- src_thread_layout (`Layout`): Layout defining how threads are organized for the source tensor. This determines how the workload is distributed among threads.
- dst_thread_layout (`Layout`): Layout defining how threads are organized for the destination tensor.
- swizzle (`Bool`): Whether to apply swizzling to the destination indices to reduce bank conflicts. Defaults to `False`.
- fill (`Fill`): Fill policy for handling out-of-bounds accesses. Options include:
  - `Fill.NONE`: No special handling (default).
  - `Fill.ZERO`: Fill out-of-bounds elements with zeros.
- eviction_policy (`CacheEviction`): Cache eviction policy for the source data. Options include:
  - `CacheEviction.EVICT_NORMAL`: Normal eviction (default).
  - `CacheEviction.EVICT_FIRST`: Evict data after first use.
  - `CacheEviction.EVICT_LAST`: Keep data in cache until last use.
- num_threads (`Int`): Total number of threads participating in the copy operation. Defaults to the size of `src_thread_layout`.
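As a sketch of how the optional parameters compose (hypothetical: assumes `shared_data` and a `global_tile` set up as in the example above):

```mojo
# Swizzle destination indices to reduce bank conflicts, zero-fill
# out-of-bounds reads, and mark the source data for early eviction
# since it is read only once.
copy_dram_to_sram_async[
    src_thread_layout = Layout.row_major(8, 8),
    dst_thread_layout = Layout.row_major(8, 8),
    swizzle=True,
    fill = Fill.ZERO,
    eviction_policy = CacheEviction.EVICT_FIRST,
](shared_data, global_tile)
```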
Args:
- dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM).
- src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): The source tensor, which must be in global or generic memory (DRAM).
copy_dram_to_sram_async[thread_layout: Layout, swizzle: Bool = False, masked: Bool = False, fill: Fill = Fill(0), eviction_policy: CacheEviction = CacheEviction(0), num_threads: Int = thread_layout.size()](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])
Asynchronous copy from DRAM to SRAM with thread affinity mapping.
This function performs an asynchronous memory transfer from DRAM (global memory) to SRAM (shared memory) using the specified thread layout for distribution.
Notes:
This is a convenience wrapper around the more general `copy_dram_to_sram_async` function, using the same thread layout for both source and destination.
Parameters:
- thread_layout (`Layout`): The layout used to distribute work across threads.
- swizzle (`Bool`): Whether to apply memory access swizzling for better performance.
- masked (`Bool`): Whether the copy operation should use masking.
- fill (`Fill`): Fill policy for uninitialized memory regions.
- eviction_policy (`CacheEviction`): Cache eviction policy to use during the transfer.
- num_threads (`Int`): Number of threads to use for the operation. Defaults to the size of `thread_layout`.
Args:
- dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): Destination tensor in SRAM.
- src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]`): Source tensor in DRAM.
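For example (a sketch; assumes a GPU kernel context where `global_data` is a 128x128 float32 tensor in global memory):

```mojo
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_dram_to_sram_async, cp_async_wait_all

var shared_data = LayoutTensor[
    DType.float32,
    Layout.row_major(32, 32),
    MutableAnyOrigin,
    address_space = AddressSpace.SHARED,
].stack_allocation()

# A single 8x8 thread layout serves as both source and destination layout.
copy_dram_to_sram_async[Layout.row_major(8, 8)](
    shared_data, global_data.tile[32, 32](0, 0)
)
cp_async_wait_all()
```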