Mojo function

copy_local_to_local

copy_local_to_local(dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])

Synchronously copy data between local memory (register) tensors with type conversion.

This function performs a synchronous copy operation between register tensors in a GPU context, with support for converting from float32 to half-precision formats (bfloat16/float16). It's particularly optimized for specific tensor layouts commonly used in matrix multiplication operations.

Example:

```mojo
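from layout import LayoutTensor, Layout
from layout.layout_tensor import copy_local_to_local
# Assumed import path for AddressSpace; adjust to your Mojo version.
from gpu.memory import AddressSpace

# Source fragment: float32 values held in registers (LOCAL address space).
var src_reg = LayoutTensor[DType.float32, Layout((16, 8)),
                           address_space=AddressSpace.LOCAL]()
# Destination fragment: same shape, half-precision element type.
var dst_reg = LayoutTensor[DType.bfloat16, Layout((16, 8)),
                           address_space=AddressSpace.LOCAL]()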

# Process data in float32 registers
# ...

# Convert and copy to bfloat16 registers
copy_local_to_local(dst_reg, src_reg)
```

Performance:

- Optimized for specific 2D tensor layouts with contiguous inner dimensions.
- Special fast path for 2D tensors with specific layouts used in matrix multiplication.
- For MMA (Matrix Multiply-Accumulate) operations, efficiently handles the conversion
between output fragments and input fragments with different layouts.
- Falls back to element-wise copy for general cases.

Notes:

- Both source and destination tensors must be in LOCAL address space (registers).
- This function currently only supports copying from float32 to half-precision formats.
- For 2D tensors with stride[1] == 1, a specialized fast path is used that's optimized
for matrix multiplication patterns.
- This function is particularly useful in GPU kernels for converting between different
precision formats while keeping data in registers.
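
The float32 to half-precision conversion mentioned in the notes can be illustrated per element on a plain `SIMD` vector. This is a minimal sketch, not the library's register-tensor implementation: the helper name `to_f16` is hypothetical, and it uses float16 rather than bfloat16 only so that it can be run on a CPU without extra assumptions.

```mojo
# Illustrative sketch only: models the per-element float32 -> float16
# narrowing on a SIMD vector. `to_f16` is a hypothetical helper and is not
# part of the layout package.
fn to_f16(v: SIMD[DType.float32, 4]) -> SIMD[DType.float16, 4]:
    # SIMD.cast converts every lane to the target dtype.
    return v.cast[DType.float16]()


fn main():
    var src = SIMD[DType.float32, 4](1.0, 2.5, 0.1, 2049.0)
    var dst = to_f16(src)
    # Cast back to float32 to inspect what survived the narrowing.
    var roundtrip = dst.cast[DType.float32]()
    print("float32 source:   ", src)
    print("float16 result:   ", dst)
    print("roundtrip float32:", roundtrip)
```

Values such as 1.0 and 2.5 are exactly representable in float16 and survive the round trip, while values such as 0.1 or 2049.0 are rounded. The same kind of precision loss applies when `copy_local_to_local` narrows a float32 fragment.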

Constraints:

  • Destination tensor must be in LOCAL address space.
  • Source tensor must be in LOCAL address space.
  • Destination tensor must have a half-precision floating-point data type.
  • Source tensor must have float32 data type.
  • Both tensors must have the same total size.
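
The data type and size constraints above can be restated as compile-time checks. The sketch below is a hypothetical helper (`check_copy_constraints` is not part of the layout package) and assumes the `constrained` builtin is available in your Mojo version. The address-space constraints are omitted to keep the example free of GPU-specific imports.

```mojo
# Hypothetical helper: restates the documented dtype and size constraints
# as compile-time checks on plain parameters. It does not inspect real
# LayoutTensor values and skips the LOCAL address-space checks.
fn check_copy_constraints[
    dst_type: DType, src_type: DType, dst_size: Int, src_size: Int
]():
    constrained[src_type == DType.float32, "source must be float32"]()
    constrained[
        dst_type == DType.bfloat16 or dst_type == DType.float16,
        "destination must be bfloat16 or float16",
    ]()
    constrained[
        dst_size == src_size, "tensors must have the same total size"
    ]()


fn main():
    # A 16x8 float32 fragment copied into a 16x8 bfloat16 fragment passes.
    check_copy_constraints[DType.bfloat16, DType.float32, 16 * 8, 16 * 8]()
    print("constraints satisfied")
```

Swapping the two dtype arguments, or passing mismatched sizes, should make compilation fail with the corresponding message, which mirrors how a violated constraint on `copy_local_to_local` would surface at compile time rather than at run time.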

Args:

  • dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The destination tensor, which must be in local memory (registers) and have a half-precision floating-point data type (bfloat16 or float16).
  • src (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The source tensor, which must be in local memory (registers) and have float32 data type.