Mojo struct

LayoutTensor

@register_passable(trivial) struct LayoutTensor[mut: Bool, //, dtype: DType, layout: Layout, origin: Origin[mut], /, *, address_space: AddressSpace = AddressSpace(0), element_layout: Layout = Layout(IntTuple(1), IntTuple(1)), layout_bitwidth: Int = bitwidthof[DType.index](), masked: Bool = False, alignment: Int = alignof[dtype]()]

A high-performance tensor with explicit memory layout and hardware-optimized access patterns.

LayoutTensor provides a powerful abstraction for multi-dimensional data with precise control over memory organization. It supports various memory layouts (row-major, column-major, tiled), hardware-specific optimizations, and efficient parallel access patterns.

Example:

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 5 * 4](uninitialized=True)
var tensor_5x4 = LayoutTensor[DType.float32, Layout.row_major(5, 4)](storage)
```

Parameters

  • mut (Bool): The inferred mutability of the underlying pointer.
  • dtype (DType): The data type of the underlying pointer.
  • layout (Layout): The memory layout of the Tensor.
  • origin (Origin[mut]): The origin of the underlying pointer.
  • address_space (AddressSpace): The address space of the underlying pointer.
  • element_layout (Layout): The memory layout of each element in the Tensor.
  • layout_bitwidth (Int): The bitwidth of each dimension of runtime layout.
  • masked (Bool): If true the tensor is masked and runtime layouts determine the shape.
  • alignment (Int): Alignment of the data pointer.

Aliases

  • rank = layout.rank()
  • index_type = _get_index_type(layout, address_space)
  • uint_type = SIMD[_get_unsigned_type(layout, address_space), 1]
  • element_size = element_layout.size()
  • element_type = SIMD[dtype, element_layout.size()]

Fields

  • ptr (UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin])
  • runtime_layout (RuntimeLayout[layout, bitwidth=layout_bitwidth])
  • runtime_element_layout (RuntimeLayout[element_layout])

Implemented traits

AnyType, CollectionElement, CollectionElementNew, Copyable, ExplicitlyCopyable, Movable, Stringable, UnknownDestructibility, Writable

Methods

__init__

@implicit __init__(span: Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment]) -> Self

Create a LayoutTensor from a Span of the underlying data. The layout must be fully static.

Args:

  • span (Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment]): The span containing the underlying data.

__init__(span: Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment], runtime_layout: RuntimeLayout[layout, bitwidth=bitwidth]) -> Self

Create a LayoutTensor from a Span and a runtime layout. The element layout must be fully static.

Args:

  • span (Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment]): The span containing the underlying data.
  • runtime_layout (RuntimeLayout[layout, bitwidth=bitwidth]): The runtime layout of the LayoutTensor.

__init__(span: Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment], runtime_layout: RuntimeLayout[layout, bitwidth=layout_bitwidth], element_runtime_layout: RuntimeLayout[element_layout]) -> Self

Create a LayoutTensor from a Span, a runtime layout for the tensor, and the runtime layout of each element.

Args:

  • span (Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment]): The span containing the underlying data.
  • runtime_layout (RuntimeLayout[layout, bitwidth=layout_bitwidth]): The runtime layout of the LayoutTensor.
  • element_runtime_layout (RuntimeLayout[element_layout]): The runtime layout of each element.

@implicit __init__(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> Self

Create a LayoutTensor from an UnsafePointer. The layout must be fully static.

Args:

  • ptr (UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]): The UnsafePointer pointing to the underlying data.

__init__(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], runtime_layout: RuntimeLayout[layout, bitwidth=bitwidth]) -> Self

Create a LayoutTensor from an UnsafePointer and a runtime layout. The element layout must be fully static.

Args:

  • ptr (UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]): The UnsafePointer pointing to the underlying data.
  • runtime_layout (RuntimeLayout[layout, bitwidth=bitwidth]): The runtime layout of the LayoutTensor.

__init__(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], runtime_layout: RuntimeLayout[layout, bitwidth=layout_bitwidth], element_runtime_layout: RuntimeLayout[element_layout]) -> Self

Create a LayoutTensor from an UnsafePointer, a runtime layout for the tensor, and the runtime layout of each element.

Args:

  • ptr (UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]): The UnsafePointer pointing to the underlying data.
  • runtime_layout (RuntimeLayout[layout, bitwidth=layout_bitwidth]): The runtime layout of the LayoutTensor.
  • element_runtime_layout (RuntimeLayout[element_layout]): The runtime layout of each element.

@implicit __init__(device_buffer: DeviceBuffer[dtype, address_space, mut, origin]) -> Self

Create a LayoutTensor from a DeviceBuffer. The layout must have statically known dimensions.

```mojo
from gpu.host import DeviceContext, DeviceBuffer
from layout import Layout, LayoutTensor

alias dtype = DType.float32
alias layout = Layout.row_major(4, 4)

var ctx = DeviceContext()
var dev_buf = ctx.enqueue_create_buffer[dtype](16)  # 4 x 4 = 16 elements
var tensor = LayoutTensor[dtype, layout](dev_buf)
```

Args:

  • device_buffer (DeviceBuffer[dtype, address_space, mut, origin]): Contains the underlying data to point to.

__init__(device_buffer: DeviceBuffer[dtype, address_space, mut, origin], runtime_layout: RuntimeLayout[layout, bitwidth=bitwidth]) -> Self

Create a LayoutTensor from a DeviceBuffer. The layout must have statically known dimensions.

Args:

  • device_buffer (DeviceBuffer[dtype, address_space, mut, origin]): The DeviceBuffer containing the underlying data.
  • runtime_layout (RuntimeLayout[layout, bitwidth=bitwidth]): The runtime layout of the LayoutTensor.

__init__(device_buffer: DeviceBuffer[dtype, address_space, mut, origin], runtime_layout: RuntimeLayout[layout, bitwidth=layout_bitwidth], element_runtime_layout: RuntimeLayout[element_layout]) -> Self

Create a LayoutTensor from a DeviceBuffer, a runtime layout for the tensor, and the runtime layout of each element.

Args:

  • device_buffer (DeviceBuffer[dtype, address_space, mut, origin]): The DeviceBuffer containing the underlying data.
  • runtime_layout (RuntimeLayout[layout, bitwidth=layout_bitwidth]): The runtime layout of the LayoutTensor.
  • element_runtime_layout (RuntimeLayout[element_layout]): The runtime layout of each element.

__getitem__

__getitem__(self, *dims: Int) -> SIMD[dtype, element_layout.size()]

Retrieves a single element from the tensor at the specified indices.

This method provides array-like indexing for the tensor. The number of indices provided must match the rank of the tensor, otherwise an error will occur at runtime.

Args:

  • *dims (Int): The indices specifying the element's position in each dimension. For example, in a 3D tensor, you would use (i, j, k).

Returns:

The element at the specified position with the tensor's data type.
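As a hedged sketch, pairing this with `__setitem__` on a small rank-2 tensor (reusing the storage pattern from the example at the top of this page):

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 4 * 4](uninitialized=True)
var t = LayoutTensor[DType.float32, Layout.row_major(4, 4)](storage)

t[1, 2] = 42.0    # __setitem__: write one element of a rank-2 tensor
var v = t[1, 2]   # __getitem__: read it back
```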

__setitem__

__setitem__(self, d0: Int, val: SIMD[dtype, element_layout.size()])

Sets a single element in a rank-1 tensor at the specified index.

This method provides array-like element assignment for rank-1 tensors.

Notes:

- No bounds checking is performed. Accessing out-of-bounds indices
will result in undefined behavior.

Args:

  • d0 (Int): The index along the first dimension.
  • val (SIMD[dtype, element_layout.size()]): The value to write to the tensor at the specified position.

__setitem__(self, d0: Int, d1: Int, val: SIMD[dtype, element_layout.size()])

Sets a single element in a rank-2 tensor at the specified indices.

This method provides array-like element assignment for rank-2 tensors.

Performance:

- Direct memory access with minimal overhead.
- Memory access pattern follows the tensor's stride configuration.

Notes:

- No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior.

Args:

  • d0 (Int): The index along the first dimension.
  • d1 (Int): The index along the second dimension.
  • val (SIMD[dtype, element_layout.size()]): The value to write to the tensor at the specified position.

__setitem__(self, d0: Int, d1: Int, d2: Int, val: SIMD[dtype, element_layout.size()])

Sets a single element in a rank-3 tensor at the specified indices.

This method provides array-like element assignment for rank-3 tensors.

Performance:

- Direct memory access with minimal overhead.
- Memory access pattern follows the tensor's stride configuration.

Notes:

- No bounds checking is performed. Accessing out-of-bounds indices
will result in undefined behavior.

Args:

  • d0 (Int): The index along the first dimension.
  • d1 (Int): The index along the second dimension.
  • d2 (Int): The index along the third dimension.
  • val (SIMD[dtype, element_layout.size()]): The value to write to the tensor at the specified position.

__setitem__(self, d0: Int, d1: Int, d2: Int, d3: Int, val: SIMD[dtype, element_layout.size()])

Sets a single element in a rank-4 tensor at the specified indices.

This method provides array-like element assignment for rank-4 tensors.

Performance:

- Direct memory access with minimal overhead.
- Memory access pattern follows the tensor's stride configuration.

Notes:

- No bounds checking is performed. Accessing out-of-bounds indices
will result in undefined behavior.

Args:

  • d0 (Int): The index along the first dimension.
  • d1 (Int): The index along the second dimension.
  • d2 (Int): The index along the third dimension.
  • d3 (Int): The index along the fourth dimension.
  • val (SIMD[dtype, element_layout.size()]): The value to write to the tensor at the specified position.

__add__

__add__(self, other: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]

Add a scalar value to each element of the tensor.

Performs an elementwise addition operation, adding the scalar value to each element in the tensor. This operation creates a new tensor with the results.

Performance:

- This operation creates a copy of the tensor before performing the addition.
- For in-place addition, use the `__iadd__` method instead.

Args:

  • other (SIMD[dtype, 1]): The scalar value to add to each element.

Returns:

A new tensor containing the results of the addition operation.

__add__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]

Add another tensor to this tensor elementwise.

Performs an elementwise addition between this tensor and another tensor. This operation creates a new tensor with the results.

Limited broadcasting is supported:

  • For tensors of the same rank, shapes must match exactly.
  • For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor.

Performance:

- This operation creates a copy of the tensor before performing the addition.
- For in-place addition, use the `__iadd__` method instead.

Parameters:

  • other_layout (Layout): The layout of the other tensor.

Args:

  • other (LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth]): The tensor to add to this tensor.

Returns:

A new tensor containing the results of the addition operation.
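A brief sketch of both forms (assuming `a` and `b` are mutable float32 tensors of the same shape, built as in the examples above):

```mojo
# Elementwise add: allocates a new result tensor; `a` and `b` are unchanged.
var c = a + b
# In-place variant: modifies `a` directly, without creating a copy.
a += b
```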

__sub__

__sub__(self, other: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]

Subtract a scalar value from each element of the tensor.

Performs an elementwise subtraction operation, subtracting the scalar value from each element in the tensor. This operation creates a new tensor with the results.

Performance:

- This operation creates a copy of the tensor before performing the subtraction.
- For in-place subtraction, use the `__isub__` method instead.

Args:

  • other (SIMD[dtype, 1]): The scalar value to subtract from each element.

Returns:

A new tensor containing the results of the subtraction operation.

__sub__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]

Subtract another tensor from this tensor elementwise.

Performs an elementwise subtraction between this tensor and another tensor. This operation creates a new tensor with the results.

Limited broadcasting is supported:

  • For tensors of the same rank, shapes must match exactly.
  • For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor.

Performance:

- This operation creates a copy of the tensor before performing the subtraction.
- For in-place subtraction, use the `__isub__` method instead.

Parameters:

  • other_layout (Layout): The layout of the other tensor.

Args:

  • other (LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth]): The tensor to subtract from this tensor.

Returns:

A new tensor containing the results of the subtraction operation.

__mul__

__mul__(self, other: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]

Multiply each element of the tensor by a scalar value.

Performs an elementwise multiplication operation, multiplying each element in the tensor by the scalar value. This operation creates a new tensor with the results.

Performance:

- This operation creates a copy of the tensor before performing the multiplication.
- For in-place multiplication, use the `__imul__` method instead.

Args:

  • other (SIMD[dtype, 1]): The scalar value to multiply with each element.

Returns:

A new tensor containing the results of the multiplication operation.

__mul__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]

Multiply this tensor with another tensor elementwise.

Performs an elementwise multiplication (Hadamard product) between this tensor and another tensor. This operation creates a new tensor with the results.

Limited broadcasting is supported:

  • For tensors of the same rank, shapes must match exactly.
  • For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor.

Note: This is NOT a matrix multiplication operation. For matrix multiplication, use the appropriate matmul function instead.

Performance:

- This operation creates a copy of the tensor before performing the multiplication.
- For in-place multiplication, use the `__imul__` method instead.

Parameters:

  • other_layout (Layout): The layout of the other tensor.

Args:

  • other (LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth]): The tensor to multiply with this tensor.

Returns:

A new tensor containing the results of the elementwise multiplication.

__truediv__

__truediv__(self, other: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]

Divide each element of the tensor by a scalar value.

Performs an elementwise division operation, dividing each element in the tensor by the scalar value. This operation creates a new tensor with the results.

Performance:

- This operation creates a copy of the tensor before performing the division.
- For in-place division, use the `__itruediv__` method instead.

Notes:

- Division by zero will result in undefined behavior or errors depending on the dtype.
- For integer dtypes, this performs integer division.

Args:

  • other (SIMD[dtype, 1]): The scalar value to divide each element by.

Returns:

A new tensor containing the results of the division operation.

__truediv__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]

Divide this tensor by another tensor elementwise.

Performs an elementwise division between this tensor and another tensor. This operation creates a new tensor with the results.

Limited broadcasting is supported:

  • For tensors of the same rank, shapes must match exactly.
  • For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor.

Performance:

- This operation creates a copy of the tensor before performing the division.
- For in-place division, use the `__itruediv__` method instead.

Notes:

- Division by zero will result in undefined behavior or errors depending on the dtype.
- For integer dtypes, this performs integer division.

Parameters:

  • other_layout (Layout): The layout of the other tensor.

Args:

  • other (LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth]): The tensor to divide this tensor by.

Returns:

A new tensor containing the results of the division operation.

__iadd__

__iadd__(self, other: SIMD[dtype, 1])

Add a scalar value to each element of the tensor in-place.

Performs an elementwise addition operation, adding the scalar value to each element in the tensor. This operation modifies the tensor in-place.

Performance:

- This operation modifies the tensor directly without creating a copy.

Args:

  • other (SIMD[dtype, 1]): The scalar value to add to each element.

__iadd__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth])

Add another tensor to this tensor elementwise in-place.

Performs an elementwise addition between this tensor and another tensor. This operation modifies the tensor in-place.

Limited broadcasting is supported:

  • For tensors of the same rank, shapes must match exactly.
  • For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor.

Performance:

- This operation modifies the tensor directly without creating a copy.

Parameters:

  • other_layout (Layout): The layout of the other tensor.

Args:

  • other (LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth]): The tensor to add to this tensor.

__isub__

__isub__(self, other: SIMD[dtype, 1])

Subtract a scalar value from each element of the tensor in-place.

Performs an elementwise subtraction operation, subtracting the scalar value from each element in the tensor. This operation modifies the tensor in-place.

Performance:

- This operation modifies the tensor directly without creating a copy.

Args:

  • other (SIMD[dtype, 1]): The scalar value to subtract from each element.

__isub__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth])

Subtract another tensor from this tensor elementwise in-place.

Performs an elementwise subtraction between this tensor and another tensor. This operation modifies the tensor in-place.

Limited broadcasting is supported:

  • For tensors of the same rank, shapes must match exactly.
  • For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor.

Performance:

- This operation modifies the tensor directly without creating a copy.

Parameters:

  • other_layout (Layout): The layout of the other tensor.

Args:

  • other (LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth]): The tensor to subtract from this tensor.

__imul__

__imul__(self, other: SIMD[dtype, 1])

Multiply each element of the tensor by a scalar value in-place.

Performs an elementwise multiplication operation, multiplying each element in the tensor by the scalar value. This operation modifies the tensor in-place.

Performance:

- This operation modifies the tensor directly without creating a copy.

Args:

  • other (SIMD[dtype, 1]): The scalar value to multiply with each element.

__imul__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth])

Multiply this tensor with another tensor elementwise in-place.

Performs an elementwise multiplication (Hadamard product) between this tensor and another tensor. This operation modifies the tensor in-place.

Limited broadcasting is supported:

  • For tensors of the same rank, shapes must match exactly.
  • For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor.

Note: This is NOT a matrix multiplication operation. For matrix multiplication, use the appropriate matmul function instead.

Performance:

- This operation modifies the tensor directly without creating a copy.

Parameters:

  • other_layout (Layout): The layout of the other tensor.

Args:

  • other (LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth]): The tensor to multiply with this tensor.

__itruediv__

__itruediv__(self, other: SIMD[dtype, 1])

Divide each element of the tensor by a scalar value in-place.

Performs an elementwise division operation, dividing each element in the tensor by the scalar value. This operation modifies the tensor in-place.

Performance:

- This operation modifies the tensor directly without creating a copy.

Notes:

- Division by zero will result in undefined behavior or errors depending on the dtype.
- For integer dtypes, this performs integer division.

Args:

  • other (SIMD[dtype, 1]): The scalar value to divide each element by.

__itruediv__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth])

Divide this tensor by another tensor elementwise in-place.

Performs an elementwise division between this tensor and another tensor. This operation modifies the tensor in-place.

Limited broadcasting is supported:

  • For tensors of the same rank, shapes must match exactly.
  • For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor.

Performance:

- This operation modifies the tensor directly without creating a copy.

Notes:

- Division by zero will result in undefined behavior or errors depending on the dtype.
- For integer dtypes, this performs integer division.

Parameters:

  • other_layout (Layout): The layout of the other tensor.

Args:

  • other (LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth]): The tensor to divide this tensor by.

copy

copy(self) -> Self

Explicitly copy this LayoutTensor.

Returns:

A copy of the value.

bitcast

bitcast[new_type: DType, /, address_space: AddressSpace = address_space, element_layout: Layout = element_layout](self) -> LayoutTensor[new_type, layout, origin, address_space=address_space, element_layout=element_layout, masked=masked]

Bitcast the underlying pointer to a new data type.

Parameters:

  • new_type (DType): The data type to cast the underlying pointer to.
  • address_space (AddressSpace): The address space of the returned LayoutTensor.
  • element_layout (Layout): The element layout of the returned LayoutTensor.

origin_cast

origin_cast[mut: Bool = mut, origin: Origin[mut] = origin](self) -> LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]

Changes the origin or mutability of the underlying pointer.

Parameters:

  • mut (Bool): Whether the origin is mutable.
  • origin (Origin[mut]): Origin of the destination pointer.

Returns:

A new LayoutTensor over the same data and address as the original, with the specified mutability and origin.

get_immutable

get_immutable(self) -> LayoutTensor[dtype, layout, (muttoimm origin._mlir_origin), address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]

Return an immutable version of this tensor.

Returns:

A LayoutTensor covering the same elements, but without mutability.

load

load[width: Int](self, m: Int, n: Int) -> SIMD[dtype, width]

Load a SIMD vector from the tensor at the specified 2D coordinates.

Performs a vectorized load operation from the tensor's memory, retrieving 'width' consecutive elements starting at position (m, n). This method enables efficient SIMD operations on tensor data.

Performance:

- Uses unaligned memory access which may be slower on some architectures.
- For aligned access, use aligned_load instead when data alignment is guaranteed.
- The load operation is optimized based on the tensor's memory layout.

Notes:

- No bounds checking is performed. Accessing out-of-bounds indices will
result in undefined behavior.
- The elements are loaded according to the tensor's stride configuration.

Parameters:

  • width (Int): The number of elements to load into the SIMD vector. Should match the target hardware's vector width for optimal performance.

Args:

  • m (Int): The row index (first dimension).
  • n (Int): The column index (second dimension).

Returns:

A SIMD vector containing 'width' consecutive elements from the tensor.
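A hedged sketch of a vectorized read-modify-write (assuming `t` is a row-major float32 tensor whose rows are at least 4 elements long):

```mojo
# Load 4 consecutive elements starting at (0, 0), scale them, store them back.
var vec = t.load[4](0, 0)
t.store[4](0, 0, vec * 2.0)
```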

prefetch

prefetch(self, m: Int, n: Int)

Prefetch tensor data at the specified 2D coordinates into cache.

Issues a software prefetch hint to the processor to load the data at position (m, n) into the cache hierarchy. This can improve performance by reducing memory latency for subsequent accesses to the same location.

Performance:

- Prefetching is a performance hint and does not guarantee data will be cached.
- Most effective when issued sufficiently ahead of the actual data access.
- Uses high locality prefetch to the data cache, optimized for data that
will be accessed multiple times.
- Can reduce memory access latency by 50-90% when used correctly.

Notes:

- Excessive prefetching can pollute the cache and degrade performance.
- Most beneficial for predictable access patterns that would otherwise
cause cache misses.
- No operation is performed on the prefetched data.

Args:

  • m (Int): The row index (first dimension).
  • n (Int): The column index (second dimension).
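For example, a sketch of prefetching one row ahead of the row being processed (`rows` and `process_row` are hypothetical stand-ins for the surrounding loop logic):

```mojo
for m in range(rows - 1):
    t.prefetch(m + 1, 0)   # hint the next row into cache ahead of use
    process_row(t, m)      # work on the current row (hypothetical helper)
```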

aligned_load

aligned_load[width: Int](self, m: Int, n: Int) -> SIMD[dtype, width]

Load a SIMD vector with alignment guarantees from the tensor.

Performs an aligned vectorized load operation from the tensor's memory, retrieving 'width' consecutive elements starting at position (m, n). The alignment is automatically calculated based on the SIMD width and dtype.

Performance:

- Uses aligned memory access which is faster than unaligned access on most architectures.
- The alignment is automatically calculated based on the SIMD width and dtype.
- Can be up to 2x faster than unaligned loads on architectures that require alignment.

Notes:

- The caller must ensure that the memory at (m, n) is properly aligned.
Misaligned access with this method may cause hardware exceptions on some architectures.
- No bounds checking is performed. Accessing out-of-bounds indices will
result in undefined behavior.

Parameters:

  • width (Int): The number of elements to load into the SIMD vector. Should match the target hardware's vector width for optimal performance.

Args:

  • m (Int): The row index (first dimension).
  • n (Int): The column index (second dimension).

Returns:

A SIMD vector containing 'width' consecutive elements from the tensor.

store

store[width: Int](self, m: Int, n: Int, val: SIMD[dtype, width])

Store a SIMD vector to the tensor at the specified 2D coordinates.

Performs a vectorized store operation to the tensor's memory, writing 'width' consecutive elements starting at position (m, n). This method enables efficient SIMD operations on tensor data.

Performance:

- Uses unaligned memory access which may be slower on some architectures.
- For aligned access, use aligned_store instead when data alignment is guaranteed.
- The store operation is optimized based on the tensor's memory layout.

Notes:

- No bounds checking is performed. Accessing out-of-bounds indices will
result in undefined behavior.
- The elements are stored according to the tensor's stride configuration.
- This operation modifies the tensor's data in-place.

Parameters:

  • width (Int): The number of elements in the SIMD vector to store. Should match the target hardware's vector width for optimal performance.

Args:

  • m (Int): The row index (first dimension) where the store operation begins.
  • n (Int): The column index (second dimension) where the store operation begins.
  • val (SIMD[dtype, width]): The SIMD vector containing the values to store in the tensor.

aligned_store

aligned_store[width: Int](self, m: Int, n: Int, val: SIMD[dtype, width])

Store a SIMD vector with alignment guarantees to the tensor.

Performs an aligned vectorized store operation to the tensor's memory, writing 'width' consecutive elements starting at position (m, n). The alignment is automatically calculated based on the SIMD width and dtype.

Performance:

- Uses aligned memory access which is faster than unaligned access on most architectures.
- The alignment is automatically calculated based on the SIMD width and dtype.
- Can be up to 2x faster than unaligned stores on architectures that require alignment.
- Particularly important for streaming stores that bypass the cache.

Notes:

- The caller must ensure that the memory at (m, n) is properly aligned.
Misaligned access with this method may cause hardware exceptions on some architectures.
- No bounds checking is performed. Accessing out-of-bounds indices will
result in undefined behavior.
- This operation modifies the tensor's data in-place.

Parameters:

  • width (Int): The number of elements in the SIMD vector to store. Should match the target hardware's vector width for optimal performance.

Args:

  • m (Int): The row index (first dimension) where the store operation begins.
  • n (Int): The column index (second dimension) where the store operation begins.
  • val (SIMD[dtype, width]): The SIMD vector containing the values to store in the tensor.

stack_allocation

static stack_allocation[*, alignment: Int = alignment]() -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]

Allocates stack memory for a LayoutTensor with a fully static layout.

Creates a new LayoutTensor instance with memory allocated on the stack rather than the heap. This provides deterministic memory management and potentially better performance for tensors with known sizes at compile time.

Performance:

- Stack allocation is typically faster than heap allocation.
- Proper alignment can significantly improve memory access performance,
especially for vectorized operations.
- No dynamic memory management overhead (no malloc/free calls).

Notes:

- Only works with tensors that have fully static layouts known at compile time.
- Stack memory is limited, so this should only be used for reasonably sized tensors.
- The allocated memory is automatically freed when the function returns.

Constraints:

  • The layout must be fully static (all dimensions known at compile time).
  • The alignment must be a multiple of the tensor's minimum required alignment.

Parameters:

  • alignment (Int): Memory alignment value for the allocation in bytes. Must be a multiple of the tensor's minimum required alignment. Default is the tensor's natural alignment based on its data type and layout.

Returns:

A new LayoutTensor instance with memory allocated on the stack.
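A minimal sketch, assuming a fully static row-major layout and `MutableAnyOrigin` for the origin parameter:

```mojo
from layout import Layout, LayoutTensor

alias layout = Layout.row_major(8, 8)
# Scratch tile on the stack; freed automatically when the function returns.
var tile = LayoutTensor[DType.float32, layout, MutableAnyOrigin].stack_allocation()
```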

shape

static shape[idx: Int]() -> Int

Returns the size of the tensor along the specified dimension.

Provides static access to the tensor's shape information. This method returns the size of a specific dimension without requiring an instance of the tensor, as the shape is part of the tensor's static type information.

Performance:

- This is a compile-time operation with no runtime cost when used
with static dimensions.

Notes:

- This is a static method that operates on the tensor's type information,
not on a specific tensor instance.
- For dynamic dimensions, use the instance method `dim()` instead.

Parameters:

  • idx (Int): The dimension index to query (0-based). For example, in a 3D tensor with shape [10, 20, 30]: - shape[0]() returns 10 (first dimension). - shape[1]() returns 20 (second dimension). - shape[2]() returns 30 (third dimension).

Returns:

The size of the tensor along the specified dimension as an integer.

stride

static stride[idx: Int]() -> Int

Returns the memory stride of the tensor along the specified dimension.

Provides static access to the tensor's stride information. The stride represents the number of elements to skip in memory to move one position along a particular dimension. This method returns the stride without requiring an instance of the tensor, as the stride is part of the tensor's static type information.

Performance:

- This is a compile-time operation with no runtime cost when used
with static dimensions.
- Understanding stride patterns is crucial for optimizing memory access
patterns in performance-critical code.

Notes:

- Strides depend on the memory layout (row-major, column-major, or custom).
- For non-contiguous tensors (e.g., tensor slices), strides may not follow
a simple pattern.

Parameters:

  • idx (Int): The dimension index to query (0-based). For example, in a 2D tensor with shape [10, 20] and row-major layout: - stride[0]() might return 20 (moving one row requires skipping 20 elements). - stride[1]() might return 1 (moving one column requires skipping 1 element).

Returns:

The memory stride of the tensor along the specified dimension as an integer.
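Because both `shape` and `stride` are static, they can feed `alias` declarations. A sketch, assuming the 10×20 row-major layout from the examples above:

```mojo
alias layout = Layout.row_major(10, 20)
alias T = LayoutTensor[DType.float32, layout, MutableAnyOrigin]

alias rows = T.shape[0]()         # 10
alias row_stride = T.stride[0]()  # 20: rows are 20 elements apart in memory
```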

dim

dim(self, idx: Int) -> Int

Returns the runtime dimension size of the tensor along the specified axis.

Unlike the static shape method, this instance method provides access to the tensor's actual dimension sizes at runtime, which is necessary for tensors with dynamic shapes or when working with tensor slices.

Performance:

- This is a runtime operation that accesses the tensor's runtime layout information.
- For static dimensions known at compile time, prefer the static `shape` method
when possible for better performance.

Notes:

- This method works with both static and dynamic dimensions.
- For tensors with masked or partial views, this returns the actual
size of the view, not the original tensor.

Constraints:

  • Only works with tensors that have depth-1 layouts (no nested shapes).

Args:

  • idx (Int): The dimension index to query (0-based). For example, in a 3D tensor with shape [10, 20, 30]: - dim(0) returns 10 (first dimension). - dim(1) returns 20 (second dimension). - dim(2) returns 30 (third dimension).

Returns:

The size of the tensor along the specified dimension as an integer.

coalesce

coalesce(self) -> LayoutTensor[dtype, coalesce(layout, False), origin, address_space=address_space, element_layout=element_layout]

Creates a tensor with a coalesced memory layout from this tensor.

Coalescing a tensor's layout means reorganizing its memory representation to be as contiguous as possible, which can improve memory access patterns and performance. This operation does not move or copy data; it only changes how the same memory is interpreted.

Performance:

- Coalesced layouts typically provide better cache utilization and
memory access patterns.
- This operation is zero-cost at runtime as it only changes the
layout information, not the actual data.
- Particularly beneficial before operations that perform sequential
memory access or vectorized operations.

Notes:

- The coalesced tensor shares the same memory as the original tensor,
so modifications to one will affect the other.
- The shape of the tensor remains the same, only the stride information
is optimized.
- For already optimally coalesced tensors, this operation has no effect.
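As a one-line sketch (with `t` a tensor built as in the earlier examples):

```mojo
# Zero-cost reinterpretation: same buffer, merged dimensions and strides.
var flat = t.coalesce()
```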

tile

tile[*tile_sizes: Int](self, *tile_coords: Int) -> LayoutTensor[dtype, _compute_tile_layout[*::Int]().__getitem__(0), origin, address_space=address_space, element_layout=element_layout, masked=masked if masked else _tile_is_masked[layout::layout::Layout,*::Int]()]

Extract a tile (sub-tensor) from this tensor with specified dimensions and position.

Tiling is a fundamental operation for high-performance tensor computations that divides a tensor into smaller blocks for better cache locality and parallelism. This method extracts a specific tile at the given coordinates without copying data.

Example: For a 4×4 tensor with values:

```
[1 2 3 4]
[2 3 4 5]
[5 4 3 2]
[1 1 1 1]
```

`tile[2, 2](1, 0)` will extract the tile:

```
[5 4]
[1 1]
```

Performance:

- Creates a view without copying data, making it very efficient.
- Optimized for both static and dynamic layouts with different code paths.
- Properly handles edge cases where tiles may be partially outside the tensor.
- Maintains stride information for efficient memory access within the tile.

Notes:

- The resulting tile is a view into the original tensor, so modifications
to the tile will affect the original tensor.
- For tiles at the edges of the tensor, the actual dimensions may be smaller
than the requested tile_sizes if masking is enabled.
- The implementation automatically selects between static and dynamic tiling
based on the tensor's layout properties.

Parameters:

  • *tile_sizes (Int): The dimensions of each tile along each axis of the tensor. For example, in a 2D tensor, tile[32, 32] creates 32×32 tiles.

Args:

  • *tile_coords (Int): The coordinates of the specific tile to extract. For example, tile[32, 32](1, 2) extracts the tile at position (1, 2) in the grid of 32×32 tiles.

tiled_iterator

tiled_iterator[*tile_sizes: Int, *, axis: Int = 0](self, *tile_coords: Int) -> LayoutTensorIter[dtype, _compute_tile_layout[*::Int]().__getitem__(0), origin, address_space=address_space, axis=OptionalReg[Int]({:_stdlib::_builtin::_int::_Int axis, 0}), layout_bitwidth=layout_bitwidth, masked=masked if masked else _tile_is_masked[layout::layout::Layout,*::Int]()]

Create an iterator that traverses tiles along a specified axis.

This method creates an iterator that allows efficient traversal of tiles within a tensor. The iterator starts at the specified tile coordinates and can move along the specified axis, providing access to consecutive tiles.

Performance:

- Provides efficient sequential access to tiles with good cache locality.
- Optimized for both static and dynamic layouts with different code paths.
- Maintains stride information for efficient memory access within each tile.
- Properly handles edge cases where tiles may be partially outside the tensor.

Notes:

- The iterator provides views into the original tensor, so modifications
through the iterator will affect the original tensor.
- For tiles at the edges of the tensor, the actual dimensions may be smaller
than the requested tile_sizes if masking is enabled.
- The iterator is not circular by default, meaning it will not wrap around
when reaching the end of the tensor along the iteration axis.
- The implementation automatically selects between static and dynamic tiling
based on the tensor's layout properties.

Example:

```mojo
var iter = tensor.tiled_iterator[16, 16, axis=0](0, 0)
for i in range(num_tiles_along_axis):
    var tile = iter.get()
    # Process tile
    iter.next()
```

Parameters:

  • *tile_sizes (Int): The dimensions of each tile along each axis of the tensor. For example, in a 2D tensor, tiled_iterator[32, 32] creates an iterator over 32×32 tiles.
  • axis (Int): The axis along which the iterator will traverse. Default is 0 (first dimension). For example, with axis=0, the iterator will move vertically through tiles.

Args:

  • *tile_coords (Int): The starting coordinates of the tile where iteration begins.

Returns:

A LayoutTensorIter that can be used to traverse tiles along the specified axis.

split

split[count: Int, axis: Int = 0](self) -> StaticTuple[LayoutTensor[dtype, _compute_tile_layout[::Int,::Int]().__getitem__(0), origin, address_space=address_space, element_layout=element_layout, alignment=alignment], count]

Split the LayoutTensor along an axis and return a StaticTuple of LayoutTensors.

Parameters:

  • count (Int): The number of partitions to split the tensor into.
  • axis (Int): The axis along which to split.

split[axis: Int = 0, alignment: Int = 1](self, count: Int, idx: Int) -> LayoutTensor[dtype, layout.make_shape_unknown[::Int](), origin, address_space=address_space, element_layout=element_layout]

Retrieve a specific partition of the tensor after splitting along a specified axis.

This method divides the tensor into 'count' partitions along the specified axis and returns the partition at index 'idx'. The partitioning is done with alignment considerations to optimize memory access patterns.

Unlike the overloaded split method that returns all partitions, this method returns only a single partition, making it more memory-efficient for cases where only one partition is needed at a time.

Notes:

- The shape along the split axis becomes unknown at compile time.
- Only works with dimensions that have statically known sizes.
- The last partition may be smaller than others if the dimension size
is not evenly divisible by 'count'.
- Partition sizes are aligned up to the specified alignment value,
which can improve performance for vectorized operations.

Performance:

- Uses aligned partitioning to improve memory access patterns.
- Avoids creating all partitions in memory, reducing memory usage.
- Maintains the original tensor's stride information for efficient
element access within the partition.

Constraints:

  • The dimension being split must have a statically known size.
  • Cannot split dimensions with unknown or dynamic sizes.

Parameters:

  • axis (Int): The axis along which to split the tensor. Defaults to 0 (first dimension).
  • alignment (Int): Memory alignment value for the partition size. Defaults to 1.

Args:

  • count (Int): The number of partitions to divide the tensor into.
  • idx (Int): The index of the partition to return (0-based).

Returns:

A LayoutTensor representing the requested partition.
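A short sketch of this single-partition overload (assuming `t` has a statically known first dimension that divides evenly into 4 parts):

```mojo
# Split along axis 0 into 4 partitions and take the partition at index 2.
var part = t.split[0](4, 2)
```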

distribute

distribute[threads_layout: Layout, axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), submode_axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1})](self, thread_id: UInt) -> LayoutTensor[dtype, _compute_distribute_layout[layout::layout::Layout,layout::layout::Layout,stdlib::collections::optional::OptionalReg[::Int]]().__getitem__(1), origin, address_space=address_space, element_layout=element_layout, masked=masked if masked else _distribute_is_masked[layout::layout::Layout,layout::layout::Layout,stdlib::collections::optional::OptionalReg[::Int]]()]

Distribute tensor workload across multiple threads in a structured pattern.

This method partitions a tensor across multiple threads for parallel processing, assigning each thread a specific portion of the tensor. The distribution pattern is determined by the threads_layout parameter, which defines the logical arrangement of threads.

Example: For a 4×4 tensor distributed across 4 threads in a 2×2 grid:

- Thread 0 might get the top-left quadrant
- Thread 1 might get the top-right quadrant
- Thread 2 might get the bottom-left quadrant
- Thread 3 might get the bottom-right quadrant

If axis=0 is specified with the same setup:

- Thread 0 and Thread 2 would get the same data (left half)
- Thread 1 and Thread 3 would get the same data (right half)

Performance:

- Creates a view without copying data, making it very efficient for parallel processing.
- The swizzle parameter can significantly improve cache locality and memory access patterns.
- Optimized for both static and dynamic layouts with different code paths.

Notes:

- The resulting tensor is a view into the original tensor, so modifications
will affect the original tensor.
- For optimal performance, the `threads_layout` should match the hardware's
thread organization (e.g., warp/wavefront size and shape).
- When using swizzling, carefully consider the memory access patterns to
avoid cache thrashing or bank conflicts.
- This function is particularly useful for GPU programming where threads
are organized in structured grids.

Constraints:

  • For dynamic layouts, the shape must be known at runtime and the threads_layout must be fully static.

Parameters:

  • threads_layout (Layout): Defines the logical arrangement of threads (e.g., 2×2 grid of 4 threads). This layout determines how the tensor is partitioned.
  • axis (OptionalReg[Int]): Optional. If specified, restricts distribution to only this axis. For example, with axis=0 in a 2D thread layout, threads that differ only in their second coordinate will receive the same data.
  • swizzle (OptionalReg[Swizzle]): Optional. A function that remaps the distribution pattern to improve memory access patterns or cache locality.
  • submode_axis (OptionalReg[Int]): Optional. Specifies an axis for specialized distribution modes.

Args:

  • thread_id (UInt): The ID of the current thread (0-based).

Returns:

A view into the original tensor representing the portion assigned to this thread.
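
A minimal sketch of the 2×2 distribution described above; in a real GPU kernel each thread passes its own runtime thread ID rather than a literal:

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 4 * 4](uninitialized = True)
var tensor = LayoutTensor[DType.float32, Layout.row_major(4, 4)](storage)

# Partition the 4x4 tensor across a 2x2 grid of threads. Thread 0
# receives its fragment; the other threads would pass 1, 2, or 3.
var fragment = tensor.distribute[Layout.row_major(2, 2)](0)
```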

vectorize

vectorize[*vector_shape: Int](self) -> LayoutTensor[dtype, coalesce(_compute_tile_layout[*::Int]().__getitem__(1), True), origin, address_space=address_space, element_layout=_divide_tiles[*::Int]().__getitem__(0), masked=masked]

Reshape a tensor into a vectorized form for efficient SIMD operations.

This method transforms the tensor's logical layout to enable efficient vectorized processing, treating blocks of elements as vector units. The transformation is particularly useful for SIMD (Single Instruction Multiple Data) operations and hardware acceleration.

Example: For a 16×16 tensor, `vectorize[4, 4]` will produce a 4×4 tensor where each element represents a 4×4 block from the original tensor.

Performance:

- Creates a view without copying data, making it very efficient.
- Enables hardware-accelerated vector operations on blocks of data.
- Improves cache locality by grouping related elements together.
- Particularly beneficial for operations that can leverage SIMD instructions.

Notes:

- The tensor dimensions must be divisible by the corresponding vector dimensions.
- For dimensions with unknown size, the corresponding vector dimension must be 1.
- The resulting tensor has the same data but a different logical organization.
- Modifications to the vectorized tensor affect the original tensor.
- This transformation is particularly useful for GPU and vector processor optimizations.

Constraints:

  • Each tensor dimension must be divisible by the corresponding vector dimension.
  • Vector dimensions must be smaller than or equal to the corresponding tensor dimensions.
  • For dimensions with unknown size, the vector dimension must be 1.

Parameters:

  • *vector_shape (Int): The dimensions of each vector unit along each axis of the tensor. For example, in a 2D tensor, vectorize[4, 4] treats 4×4 blocks as vector units.

Returns:

A view of the tensor with a vectorized layout, where each element in the resulting tensor represents a vector of elements from the original tensor.
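
A minimal sketch of the 16×16 example above:

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 16 * 16](uninitialized = True)
var tensor = LayoutTensor[DType.float32, Layout.row_major(16, 16)](storage)

# View the 16x16 tensor as a 4x4 grid whose elements are 4x4 blocks;
# each access to `vectorized` yields a SIMD vector of 16 values.
var vectorized = tensor.vectorize[4, 4]()
```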

slice

slice[d0_slice: Slice, d1_slice: Slice](self) -> LayoutTensor[dtype, _compute_slice_layout(d0_slice, d1_slice), origin, address_space=address_space, element_layout=element_layout]

Extract a slice from a rank-2 tensor using slice objects.

This method creates a view into a subset of the tensor defined by the slice specifications for each dimension. The slice is a continuous region of the tensor with no gaps (step size must be 1).

Example: For a 4×4 tensor with values:

```
[1 2 3 4]
[5 6 7 8]
[9 10 11 12]
[13 14 15 16]
```

```mojo
slice[Slice(1, 3), Slice(0, 2)]
```

will extract:

```
[5 6]
[9 10]
```

Performance:

- Creates a view without copying data, making it very efficient.
- Maintains the original tensor's stride information for efficient memory access.
- Zero-cost abstraction at runtime when used with compile-time constant slices.

Notes:

- The slice is a view into the original tensor, so modifications to the
slice will affect the original tensor.
- Only supports rank-2 tensors. For higher-rank tensors, use the overloaded
version with slice indices.
- The step size must be 1 (no gaps allowed in the slice).
- Slice bounds are not checked at runtime; accessing out-of-bounds indices
will result in undefined behavior.

Constraints:

  • Only works with rank-2 tensors.

Parameters:

  • d0_slice (Slice): Slice specification for the first dimension (rows). Defines the start and end indices for the slice along this dimension.
  • d1_slice (Slice): Slice specification for the second dimension (columns). Defines the start and end indices for the slice along this dimension.

Returns:

A view into the original tensor representing the specified slice.
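
A runnable sketch of the 4×4 example above (storage left uninitialized for brevity):

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 4 * 4](uninitialized = True)
var tensor = LayoutTensor[DType.float32, Layout.row_major(4, 4)](storage)

# View rows 1-2 and columns 0-1 as a 2x2 slice; writes through `sub`
# are visible through `tensor`.
var sub = tensor.slice[Slice(1, 3), Slice(0, 2)]()
```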

slice[d0_slice: Slice, d1_slice: Slice, slice_indices: Index[2], __offset_dims: Int = (layout.rank() + -2)](self, offsets: Index[__offset_dims]) -> LayoutTensor[dtype, _compute_slice_layout(d0_slice, d1_slice, slice_indices.__getitem__[::Indexer](0), slice_indices.__getitem__[::Indexer](1)), origin, address_space=address_space, element_layout=element_layout]

Extract a 2D slice from a higher-rank tensor at specific indices.

This method creates a view into a 2D subset of a higher-rank tensor by:

  1. Selecting two dimensions to slice using the slice_indices parameter
  2. Applying slice specifications to those dimensions
  3. Using fixed offsets for all other dimensions

Example: For a 3×4×5 tensor, `slice[Slice(1, 3), Slice(0, 2), IndexList[2](0, 2)](1)` will extract a 2×2 slice from dimensions 0 and 2, with dimension 1 fixed at index 1.

Performance:

- Creates a view without copying data, making it very efficient.
- Maintains the original tensor's stride information for efficient memory access.
- Zero-cost abstraction at runtime when used with compile-time constant slices.

Notes:

- The slice is a view into the original tensor, so modifications to the
slice will affect the original tensor.
- The slice indices must be ordered (e.g., [0, 2] is valid, [2, 0] is not).
- The step size must be 1 (no gaps allowed in the slice).
- Slice bounds are not checked at runtime; accessing out-of-bounds indices
will result in undefined behavior.

Constraints:

  • Slice step size must be 1 (no gaps).
  • Slice indices must be ordered (ascending).
  • Tensor rank must be at least 2.

Parameters:

  • d0_slice (Slice): Slice specification for the first selected dimension.
  • d1_slice (Slice): Slice specification for the second selected dimension.
  • slice_indices (Index[2]): Indices of the two dimensions to slice (must be ordered).
  • __offset_dims (Int): Internal parameter representing number of fixed dimensions.

Args:

  • offsets (Index[__offset_dims]): Fixed index values for all dimensions not being sliced.

Returns:

A 2D view into the original tensor representing the specified slice.
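
A sketch of the 3×4×5 example above, assuming `IndexList` from the `utils` package and passing the fixed offset for dimension 1 explicitly:

```mojo
from layout import Layout, LayoutTensor
from utils import IndexList

var storage = InlineArray[Scalar[DType.float32], 3 * 4 * 5](uninitialized = True)
var tensor = LayoutTensor[DType.float32, Layout.row_major(3, 4, 5)](storage)

# 2x2 slice over dimensions 0 and 2, holding dimension 1 fixed at
# index 1; one offset is needed because one dimension is not sliced.
var sub = tensor.slice[Slice(1, 3), Slice(0, 2), IndexList[2](0, 2)](IndexList[1](1))
```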

slice_1d

slice_1d[d0_slice: Slice, slice_indices: Index[1], __offset_dims: Int = (layout.rank() + -1)](self, offsets: Index[__offset_dims]) -> LayoutTensor[dtype, _compute_slice_layout(d0_slice, slice_indices.__getitem__[::Indexer](0)), origin, address_space=address_space, element_layout=element_layout]

Extract a 1D slice from a higher-rank tensor at a specific index.

This method creates a view into a 1D subset of a higher-rank tensor by:

  1. Selecting one dimension to slice using the slice_indices parameter
  2. Applying a slice specification to that dimension
  3. Using fixed offsets for all other dimensions

Example: For a 3×4×5 tensor, `slice_1d[Slice(1, 3), IndexList[1](0)](1, 2)` will extract a 1D slice from dimension 0, with dimensions 1 and 2 fixed at indices 1 and 2.

Performance:

- Creates a view without copying data, making it very efficient.
- Maintains the original tensor's stride information for efficient memory access.
- Zero-cost abstraction at runtime when used with compile-time constant slices.

Notes:

- The slice is a view into the original tensor, so modifications to the
slice will affect the original tensor.
- The step size must be 1 (no gaps allowed in the slice).
- Slice bounds are not checked at runtime; accessing out-of-bounds indices
will result in undefined behavior.
- This function exists as a workaround for compiler limitations with overloading.

Constraints:

  • Slice step size must be 1 (no gaps).
  • Tensor rank must be at least 1.

Parameters:

  • d0_slice (Slice): Slice specification for the selected dimension.
  • slice_indices (Index[1]): Index of the dimension to slice.
  • __offset_dims (Int): Internal parameter representing number of fixed dimensions.

Args:

  • offsets (Index[__offset_dims]): Fixed index values for all dimensions not being sliced.

Returns:

A 1D view into the original tensor representing the specified slice.
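
A sketch of the 3×4×5 example above, with the fixed offsets passed as an explicit `IndexList`:

```mojo
from layout import Layout, LayoutTensor
from utils import IndexList

var storage = InlineArray[Scalar[DType.float32], 3 * 4 * 5](uninitialized = True)
var tensor = LayoutTensor[DType.float32, Layout.row_major(3, 4, 5)](storage)

# 1D slice of dimension 0 over indices [1, 3), holding dimensions 1
# and 2 fixed at indices 1 and 2; the result is a length-2 view.
var line = tensor.slice_1d[Slice(1, 3), IndexList[1](0)](IndexList[2](1, 2))
```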

transpose

transpose[M: Int = shape[::Int](), N: Int = shape[::Int]()](self) -> LayoutTensor[dtype, composition(layout, __init__[::Origin[::Bool(IntTuple(N, M), IntTuple(M, 1))), origin, address_space=address_space, element_layout=element_layout]

Create a transposed view of a rank-2 tensor.

This method creates a view of the tensor with its dimensions swapped, effectively converting rows to columns and columns to rows. The transposition is performed without copying data, by adjusting the tensor's layout information.

Example: For a 2×3 tensor with values:

```
[1 2 3]
[4 5 6]
```

`transpose()` will produce a 3×2 tensor:

```
[1 4]
[2 5]
[3 6]
```

Performance:

- Creates a view without copying data, making it very efficient.
- The operation is zero-cost at runtime as it only changes the layout information.
- Memory access patterns may be less efficient in the transposed view due to
non-contiguous memory access, especially for row-major storage.

Notes:

- The transposed tensor shares the same memory as the original tensor,
so modifications to one will affect the other.
- Only works with rank-2 tensors.
- For optimal performance when repeatedly accessing the transposed data,
consider creating a physical copy with the transposed layout.

Constraints:

  • Only works with rank-2 tensors.

Parameters:

  • M (Int): The size of the first dimension (rows) of the original tensor. Defaults to the static shape value of the first dimension.
  • N (Int): The size of the second dimension (columns) of the original tensor. Defaults to the static shape value of the second dimension.

Returns:

A view of the tensor with dimensions transposed (rows become columns and vice versa).
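
A minimal sketch of the 2×3 example above:

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 2 * 3](uninitialized = True)
var tensor = LayoutTensor[DType.float32, Layout.row_major(2, 3)](storage)

# A 3x2 view of the same memory; no data is copied or moved.
var transposed = tensor.transpose()
```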

reshape

reshape[dst_layout: Layout](self) -> LayoutTensor[dtype, dst_layout, origin, address_space=address_space, element_layout=element_layout, masked=masked]

Create a view of the tensor with a different shape.

This method creates a view of the tensor with a new shape, without changing the underlying data. The total number of elements must remain the same.

Example: For a 2×6 tensor, `reshape[Layout((3, 4))]()` will produce a 3×4 tensor with the same elements in row-major order.

Performance:

- Creates a view without copying data, making it very efficient.
- The operation is zero-cost at runtime as it only changes the layout information.
- Memory access patterns may change, potentially affecting performance
depending on the original and target layouts.

Notes:

- The reshaped tensor shares the same memory as the original tensor,
so modifications to one will affect the other.
- The total number of elements must remain the same after reshaping.
- The reshape operation assumes a row-major (C-style) memory layout.
- For tensors with complex strides or non-contiguous memory, reshaping
may not produce the expected results.
- Masked tensors cannot be reshaped.

Constraints:

  • Cannot reshape masked tensors.
  • The total number of elements must be the same in both layouts.

Parameters:

  • dst_layout (Layout): The target layout for the reshaped tensor. Must have the same total number of elements as the original tensor.

Returns:

A view of the tensor with the new shape specified by dst_layout.
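
A minimal sketch of the 2×6 to 3×4 reshape above, using `Layout.row_major` for the target layout:

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 2 * 6](uninitialized = True)
var tensor = LayoutTensor[DType.float32, Layout.row_major(2, 6)](storage)

# A 3x4 view of the same 12 elements in row-major order.
var reshaped = tensor.reshape[Layout.row_major(3, 4)]()
```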

composition

composition[rhs_layout: Layout, dst_layout: Layout = composition(layout, rhs_layout)](self) -> LayoutTensor[dtype, dst_layout, origin, address_space=address_space, element_layout=element_layout]

Create a view of the tensor with a composed layout.

This method creates a view of the tensor with a new layout that is the composition of the original layout with another layout. Layout composition allows for complex transformations of the tensor's logical structure without copying data.

Example: For a 4×4 tensor with a standard row-major layout, composing with a layout that represents a 2×2 tiling would result in a tensor that logically views the data as 2×2 blocks.

Performance:

- Creates a view without copying data, making it very efficient.
- The operation is zero-cost at runtime as it only changes the layout information.
- Can be used to optimize memory access patterns for specific algorithms.

Notes:

- The composed tensor shares the same memory as the original tensor,
so modifications to one will affect the other.
- Layout composition is a powerful tool for expressing complex data transformations
like tiling, transposition, and reshaping in a unified framework.
- Understanding the mathematical properties of layout composition is important
for correctly using this function.

Constraints:

  • The layouts must be compatible for composition.
  • The total number of elements must remain the same after composition.

Parameters:

  • rhs_layout (Layout): The layout to compose with the tensor's current layout.
  • dst_layout (Layout): The resulting layout after composition. Defaults to the composition of the tensor's layout with rhs_layout.

Returns:

A view of the tensor with the composed layout.
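
A minimal sketch, assuming composition with a same-size column-major layout so the view traverses the data in transposed order:

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 4 * 4](uninitialized = True)
var tensor = LayoutTensor[DType.float32, Layout.row_major(4, 4)](storage)

# Compose the row-major layout with a 4x4 column-major layout; the
# result indexes the same 16 elements in a transposed traversal.
var composed = tensor.composition[Layout.col_major(4, 4)]()
```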

distance

distance[_uint_dtype: DType = uint32 if (address_space == AddressSpace(3)) else uint64](self, addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space]) -> SIMD[_uint_dtype, 1]

Calculate the element-wise distance between this tensor's pointer and another pointer.

This method computes the number of elements (not bytes) between the tensor's pointer and the provided address. This is useful for determining offsets within a larger memory allocation or for pointer arithmetic operations.

Example: If tensor.ptr points to element at index 100 in a buffer, and addr points to element at index 50, then distance(addr) would return 50.

Performance:

- This is a lightweight operation that only involves pointer arithmetic.
- The operation is optimized based on the address space, using smaller
integer types for shared memory to improve efficiency.

Notes:

- The distance is calculated in elements, not bytes.
- The result can be positive or negative depending on the relative positions
of the pointers.
- This function is particularly useful for GPU programming where understanding
memory offsets is critical for performance.
- Care should be taken when using this with pointers from different allocations,
as the result would be meaningless.

Parameters:

  • _uint_dtype (DType): The unsigned integer type to use for the result. Defaults to uint32 for shared memory and uint64 for other address spaces.

Args:

  • addr (UnsafePointer[SIMD[dtype, 1], address_space=address_space]): The target pointer to calculate the distance to.

Returns:

The number of elements between this tensor's pointer and the provided address. The result is of type _uint_dtype.
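
A sketch using the `tile` method from this page to obtain a view at a known element offset; the arithmetic in the comment is illustrative:

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 8 * 8](uninitialized = True)
var tensor = LayoutTensor[DType.float32, Layout.row_major(8, 8)](storage)

# The (1, 0) tile of 4x8 blocks starts 4 rows into the buffer, i.e.
# 4 * 8 = 32 elements past the base pointer, so the distance is 32.
var block = tensor.tile[4, 8](1, 0)
var offset = block.distance(tensor.ptr)
```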

distance[_layout: Layout, _uint_dtype: DType = _get_unsigned_type(_layout, address_space)](self, src: LayoutTensor[dtype, _layout, origin, address_space=address_space]) -> SIMD[_uint_dtype, 1]

Calculate the element-wise distance between this tensor and another tensor.

This method computes the number of elements (not bytes) between this tensor's pointer and another tensor's pointer. This is useful for determining the relative positions of tensors within a larger memory allocation.

Example: If tensor1 points to element at index 100 in a buffer, and tensor2 points to element at index 50, then tensor1.distance(tensor2) would return 50.

Performance:

- This is a lightweight operation that only involves pointer arithmetic.
- The operation is optimized based on the address space and layout,
using appropriate integer types for efficiency.

Notes:

- The distance is calculated in elements, not bytes.
- The result can be positive or negative depending on the relative positions
of the tensors.
- This function is particularly useful for GPU programming where understanding
memory offsets is critical for performance.
- Both tensors must be in the same address space for the result to be meaningful.
- This overload is more type-safe than the pointer-based version as it
ensures the tensors have compatible data types and address spaces.

Parameters:

  • _layout (Layout): The layout of the source tensor.
  • _uint_dtype (DType): The unsigned integer type to use for the result. Automatically determined based on the layout and address space.

Args:

  • src (LayoutTensor[dtype, _layout, origin, address_space=address_space]): The source tensor to calculate the distance to.

Returns:

The number of elements between this tensor's pointer and the source tensor's pointer. The result is of type _uint_dtype.
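
A sketch comparing two tiles of the same buffer (again assuming the `tile` method from this page):

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 8 * 8](uninitialized = True)
var tensor = LayoutTensor[DType.float32, Layout.row_major(8, 8)](storage)

# The second 4x8 row tile starts 32 elements after the first.
var first = tensor.tile[4, 8](0, 0)
var second = tensor.tile[4, 8](1, 0)
var offset = second.distance(first)
```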

copy_from

copy_from(self, other: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])

Copy data from another tensor to this tensor.

This method performs an element-by-element copy from the source tensor to this tensor, respecting the layouts of both tensors. The copy operation handles different memory layouts correctly, ensuring that elements are copied to their proper positions regardless of how the data is arranged in memory.

Example:

```mojo
from layout import Layout, LayoutTensor

var src_storage = InlineArray[Scalar[DType.float32], 2 * 3](uninitialized = True)
var dst_storage = InlineArray[Scalar[DType.float32], 3 * 2](uninitialized = True)
var src = LayoutTensor[DType.float32, Layout.row_major(2, 3)](src_storage)
var dst = LayoutTensor[DType.float32, Layout.row_major(3, 2)](dst_storage)
dst.copy_from(src)  # Copies all elements from src to dst
```

Performance:

- Performs element-by-element copying, which may be less efficient than
vectorized or bulk memory operations.
- The copy respects the memory layout of both tensors, which may involve
non-contiguous memory access patterns.
- For optimal performance with large tensors, consider using specialized
copy functions that can leverage hardware acceleration.

Notes:

- Both tensors must have statically known shapes.
- The total number of elements must be the same in both tensors.
- The element sizes must match between the tensors.
- This function handles different memory layouts correctly, making it suitable
for copying between tensors with different shapes or strides.
- The copy is performed element by element, not as a bulk memory copy.

Constraints:

  • Both tensors must have statically known shapes.
  • The total number of elements must be the same in both tensors.
  • The element sizes must match between the tensors.

Args:

  • other (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The source tensor to copy data from. Must have the same total number of elements as this tensor.

copy_from_async

copy_from_async[is_masked: Bool = False, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), fill: Fill = Fill(0), eviction_policy: CacheEviction = CacheEviction(0)](self, src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src_idx_bound: SIMD[_get_index_type(layout, address_space), 1] = __init__[__mlir_type.!pop.int_literal](0), base_offset: SIMD[_get_unsigned_type(layout, address_space), 1] = __init__[__mlir_type.!pop.int_literal](0))

Asynchronously copy data from another tensor to this tensor using GPU hardware.

This method performs an asynchronous copy from the source tensor to this tensor using GPU hardware acceleration. It's specifically designed for copying data from global memory to shared memory in GPU kernels, leveraging hardware-specific asynchronous copy mechanisms for improved performance.

Example:

```mojo
from layout import LayoutTensor, Layout, AddressSpace

# Schematic: in a real kernel, the global and shared tensors are backed
# by buffers provided by the GPU runtime.
var global_data = LayoutTensor[
    DType.float32, Layout.row_major(128, 128),
    address_space = AddressSpace.GLOBAL,
]()
var shared_data = LayoutTensor[
    DType.float32, Layout.row_major(32, 32),
    address_space = AddressSpace.SHARED,
]()
shared_data.copy_from_async(global_data)
```

Performance:

- Uses hardware-accelerated asynchronous copy mechanisms for optimal performance.
- Particularly efficient for copying data from global memory to shared memory
in GPU kernels.
- Supports vectorized copies for 4, 8, or 16-byte elements for better throughput.
- Can bypass L1 cache with appropriate eviction policies for specific access patterns.
- Swizzling can improve memory access patterns and reduce bank conflicts.

Notes:

- For vectorized copies, both tensors must have contiguous element layouts.
- Asynchronous copies allow computation to overlap with memory transfers.
- A synchronization barrier is required before using the copied data.

Constraints:

  • Destination must be in shared memory.
  • Source and destination data types must match.
  • Element size must be 4, 8, or 16 bytes.
  • Destination tensor must have a static layout.

Parameters:

  • is_masked (Bool): Whether to perform a masked copy, where elements outside the src_idx_bound are not copied or filled with zeros.
  • swizzle (OptionalReg[Swizzle]): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns.
  • fill (Fill): Fill policy for elements that are not copied (only used with masked copies).
  • eviction_policy (CacheEviction): Cache eviction policy for the source data.

Args:

  • src (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The source tensor to copy data from.
  • src_idx_bound (SIMD[_get_index_type(layout, address_space), 1]): For masked copies, the upper bound index for valid source elements.
  • base_offset (SIMD[_get_unsigned_type(layout, address_space), 1]): Base offset for swizzling calculations.

fill

fill(self: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], val: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]

Fill the entire tensor with a single value.

This method sets all elements of the tensor to the specified value. It works with both statically and dynamically shaped tensors, filling all elements regardless of the tensor's layout.

Example:

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 3 * 4](uninitialized = True)
var tensor = LayoutTensor[DType.float32, Layout.row_major(3, 4)](storage)
_ = tensor.fill(0.0)  # Sets all elements to 0.0
```

Performance:

- For statically known layouts, the fill operation is unrolled at compile time.
- For dynamic layouts, a runtime loop is used.
- No vectorization is applied, so performance may be suboptimal for large tensors.
- Consider using hardware-specific fill operations for better performance
with large tensors.

Notes:

- The tensor must be mutable (mut=True).
- The fill operation respects the tensor's layout, filling all elements
regardless of how they are arranged in memory.
- This method can be used with tensors of any rank and shape.
- For tensors with element_layout, all elements within each logical element
are filled with the same value.

Args:

  • val (SIMD[dtype, 1]): The value to fill the tensor with. Must be of the same data type as the tensor.

Returns:

The tensor itself (self), allowing for method chaining.

__str__

__str__(self) -> String

write_to

write_to[W: Writer](self, mut writer: W)

Format and write the tensor's contents to a writer.

This method formats the tensor's contents and writes them to the provided writer. For 2D tensors, it formats the output in a 2D grid. For tensors of other ranks, it prints all values in column-major coordinate order.

Example:

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 2 * 3](uninitialized = True)
var tensor = LayoutTensor[DType.float32, Layout.row_major(2, 3)](storage)
_ = tensor.fill(1.0)
print(tensor)  # Internally calls `write_to` with a StringWriter
```

Output for a 2×3 tensor:

```
[[1.0, 1.0, 1.0],
[1.0, 1.0, 1.0]]
```

Notes:

- For 2D tensors, the output is formatted as a 2D grid with rows and columns.
- For tensors of other ranks, values are printed in column-major coordinate order.
- Empty tensors (size 0) produce no output.
- This method is used by the `__str__` method to convert the tensor to a string.
- The formatting is designed for human readability rather than parsing.
- For large tensors, the output may be truncated to avoid excessive output.

Parameters:

  • W (Writer): The writer type that will receive the formatted output.

Args:

  • writer (W): The writer instance to write the formatted output to.