Mojo function
copy_local_to_local
copy_local_to_local(dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])
Synchronously copy data between local memory (register) tensors with type conversion.
This function performs a synchronous copy operation between register tensors in a GPU context, with support for converting from float32 to half-precision formats (bfloat16/float16). It's particularly optimized for specific tensor layouts commonly used in matrix multiplication operations.
Example:
```mojo
from gpu.memory import AddressSpace
from layout import LayoutTensor, Layout
from layout.layout_tensor import copy_local_to_local

# float32 source fragment held in registers.
var src_reg = LayoutTensor[
    DType.float32,
    Layout.row_major(16, 8),
    MutableAnyOrigin,
    address_space = AddressSpace.LOCAL,
].stack_allocation()

# bfloat16 destination fragment with the same total size.
var dst_reg = LayoutTensor[
    DType.bfloat16,
    Layout.row_major(16, 8),
    MutableAnyOrigin,
    address_space = AddressSpace.LOCAL,
].stack_allocation()

# Process data in float32 registers
# ...

# Convert and copy to bfloat16 registers
copy_local_to_local(dst_reg, src_reg)
```
Performance:
- Optimized for specific 2D tensor layouts with contiguous inner dimensions.
- Special fast path for 2D tensors with specific layouts used in matrix multiplication.
- For MMA (Matrix Multiply-Accumulate) operations, efficiently handles the conversion
between output fragments and input fragments with different layouts.
- Falls back to element-wise copy for general cases.
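For the general case, the fallback is semantically an element-wise cast-and-copy over the flattened tensors. A rough sketch of that behavior (illustrative only, not the actual implementation; `dst` and `src` stand for the two register tensors):

```mojo
# Illustrative sketch of the fallback path: visit every element,
# casting float32 to the destination's half-precision dtype.
# The real fast paths vectorize this when the layouts allow it.
for i in range(dst.size()):
    dst[i] = src[i].cast[dst.dtype]()
```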
Notes:
- Both source and destination tensors must be in LOCAL address space (registers).
- This function currently only supports copying from float32 to half-precision formats.
- For 2D tensors with stride[1] == 1, a specialized fast path is used that's optimized
for matrix multiplication patterns.
- This function is particularly useful in GPU kernels for converting between different
precision formats while keeping data in registers.
Constraints:
- Destination tensor must be in LOCAL address space.
- Source tensor must be in LOCAL address space.
- Destination tensor must have a half-precision floating-point data type.
- Source tensor must have float32 data type.
- Both tensors must have the same total size.
Args:
- dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The destination tensor, which must be in local memory (registers) and have a half-precision floating-point data type (bfloat16 or float16).
- src (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The source tensor, which must be in local memory (registers) and have float32 data type.