
Mojo function

copy_dram_to_local

copy_dram_to_local[src_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_base: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], offset: OptionalReg[UInt] = None)

Efficiently copy data from global memory (DRAM) to registers for AMD GPUs.

This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It utilizes the hardware's buffer_load intrinsic to efficiently transfer data from global memory to registers while handling bounds checking. The function distributes the copy operation across multiple threads for maximum throughput.

Notes:

  • The offset calculation method significantly impacts performance. The current implementation optimizes for throughput over flexibility.
  • This function is particularly useful for prefetching data into registers before performing computations, reducing memory access latency.

Constraints:

  • Only supported on AMD GPUs.
  • The destination element layout size must match the SIMD width.
  • Source fragments must be rank 2 with known dimensions.

Parameters:

  • src_thread_layout (Layout): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads.
  • thread_scope (ThreadScope): Defines whether operations are performed at BLOCK or WARP level. BLOCK scope involves all threads in a thread block, while WARP scope restricts operations to threads within the same warp. Defaults to ThreadScope.BLOCK.

Args:

  • dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]): The destination tensor in register memory (LOCAL address space).
  • src (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]): The source tensor in global memory (DRAM) to be copied.
  • src_base (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]): The original global memory tensor from which src is derived. This is used to construct the buffer descriptor required by AMD's buffer_load intrinsic.
  • offset (OptionalReg[UInt]): An optional offset into the global memory buffer, applied when computing the source address.
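A rough usage sketch of this overload follows. It is illustrative only: the tile shapes, thread layout, and the surrounding LayoutTensor calls (tile, vectorize, stack_allocation) are assumptions about the broader layout API, not taken from this reference. The source tile is a view of src_base, which supplies the buffer descriptor for AMD's buffer_load intrinsic.

```mojo
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_dram_to_local

alias dtype = DType.float32
# Hypothetical shapes: 32 threads cover a 16x16 tile, 2x4 elements each.
alias thread_layout = Layout.row_major(8, 4)
alias frag_layout = Layout.row_major(2, 4)

fn load_tile_amd(
    src_base: LayoutTensor[dtype, Layout.row_major(64, 64), MutableAnyOrigin],
    block_row: Int,
    block_col: Int,
):
    # Per-thread register fragment in the LOCAL address space.
    var dst = LayoutTensor[
        dtype,
        frag_layout,
        MutableAnyOrigin,
        address_space = AddressSpace.LOCAL,
    ].stack_allocation()

    # Vectorize so the destination element layout size matches the SIMD
    # width, per the constraint above; src is a 16x16 tile view of src_base.
    copy_dram_to_local[src_thread_layout=thread_layout](
        dst.vectorize[1, 4](),
        src_base.tile[16, 16](block_row, block_col),
        src_base,
    )
```

Passing src_base alongside the tile view lets the hardware perform bounds checking against the full buffer rather than the tile.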

copy_dram_to_local[src_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_iter: LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], bounds: Int)

Efficiently copy data from global memory (DRAM) to registers for AMD GPUs.

This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It utilizes the hardware's buffer_load intrinsic to efficiently transfer data from global memory to registers while handling bounds checking. The function distributes the copy operation across multiple threads for maximum throughput.

Notes:

  • The offset calculation method significantly impacts performance. The current implementation optimizes for throughput over flexibility.
  • This function is particularly useful for prefetching data into registers before performing computations, reducing memory access latency.

Constraints:

  • Only supported on AMD GPUs.
  • The destination element layout size must match the SIMD width.
  • Source fragments must be rank 2 with known dimensions.

Parameters:

  • src_thread_layout (Layout): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads.
  • thread_scope (ThreadScope): Defines whether operations are performed at BLOCK or WARP level. BLOCK scope involves all threads in a thread block, while WARP scope restricts operations to threads within the same warp. Defaults to ThreadScope.BLOCK.

Args:

  • dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]): The destination tensor in register memory (LOCAL address space).
  • src_iter (LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]): The source tensor iterator.
  • bounds (Int): The size of the buffer, measured from the base pointer of src_iter; used for hardware bounds checking.
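The iterator overload above might be used to prefetch successive tiles, for example in the K-loop of a matrix multiply. This is a hedged sketch: the tiled_iterator, vectorize, and stack_allocation calls and all shapes are assumptions about the surrounding layout API, not taken from this reference.

```mojo
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_dram_to_local

alias dtype = DType.float32
alias thread_layout = Layout.row_major(8, 4)

fn prefetch_k_tiles(
    src: LayoutTensor[dtype, Layout.row_major(64, 64), MutableAnyOrigin],
    block_row: Int,
    num_k_tiles: Int,
):
    var dst = LayoutTensor[
        dtype,
        Layout.row_major(2, 4),
        MutableAnyOrigin,
        address_space = AddressSpace.LOCAL,
    ].stack_allocation()

    # Iterate over 16x16 tiles along the K axis of src.
    var src_iter = src.tiled_iterator[16, 16, axis=1](block_row, 0)
    for _ in range(num_k_tiles):
        # bounds is the full buffer size relative to src_iter's pointer.
        copy_dram_to_local[src_thread_layout=thread_layout](
            dst.vectorize[1, 4](), src_iter, src.size()
        )
        src_iter += 1
        # ... consume dst here before the next prefetch ...
```

Advancing the iterator between copies lets each loop iteration reuse the same register fragment for the next tile.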

copy_dram_to_local[src_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])

Efficiently copy data from global memory (DRAM) to registers.

This function implements an optimized memory transfer operation from global memory to register memory. It distributes the copy operation across multiple threads for maximum throughput while handling bounds checking for safety.

Constraints:

  • The source tensor must be in GLOBAL address space (DRAM).
  • The destination tensor must be in LOCAL address space (registers).
  • Both tensors must have compatible data types.

Parameters:

  • src_thread_layout (Layout): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads.
  • thread_scope (ThreadScope): Defines whether operations are performed at BLOCK or WARP level. BLOCK scope involves all threads in a thread block, while WARP scope restricts operations to threads within the same warp. Defaults to ThreadScope.BLOCK.

Args:

  • dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]): The destination tensor in register memory (LOCAL address space).
  • src (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]): The source tensor in global memory (DRAM).
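A minimal sketch of this generic overload follows. All names and shapes are illustrative assumptions (including stack_allocation and the 16x16/8x4 split), not taken from this reference; the source must live in the GLOBAL address space and the destination in LOCAL, per the constraints above.

```mojo
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_dram_to_local

alias dtype = DType.float32
alias thread_layout = Layout.row_major(8, 4)

fn prefetch(
    src: LayoutTensor[dtype, Layout.row_major(16, 16), MutableAnyOrigin],
):
    # Each of the 32 threads receives a 2x4 fragment of the 16x16 tile.
    var dst = LayoutTensor[
        dtype,
        Layout.row_major(2, 4),
        MutableAnyOrigin,
        address_space = AddressSpace.LOCAL,
    ].stack_allocation()
    copy_dram_to_local[src_thread_layout=thread_layout](dst, src)
```

Because thread_scope defaults to ThreadScope.BLOCK, all threads in the block participate in the copy; pass ThreadScope.WARP to restrict it to one warp.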