# Mojo API Documentation > The Mojo API reference. This file contains all documentation content in a single document following the llmtxt.org standard. ## max The MAX Mojo API reference. The MAX API provides a state-of-the-art graph compiler and runtime library that executes AI models with incredible speed on a wide range of hardware. ## Packages * [​`tensor`](/max/api/mojo/tensor/): APIs to create and manage tensors in a graph. --- ## tensor APIs to create and manage tensors in a graph. ## Modules * [​`io_spec`](/max/api/mojo/tensor/io_spec/): * [​`managed_tensor_slice`](/max/api/mojo/tensor/managed_tensor_slice/): Implements the `ManagedTensorSlice` type - a view of a tensor that doesn't own the underlying data. This type is used to build custom graph operations. * [​`tensor_spec`](/max/api/mojo/tensor/tensor_spec/): You can import these APIs from the `max.tensor` package. * [​`transitional`](/max/api/mojo/tensor/transitional/): Utilities for the transitional period during NDBuffer deprecation. --- ## IO `@register_passable(trivial)` `struct IO` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `FusedInput` `alias FusedInput = IO(2)` ### `FusedOutput` `alias FusedOutput = IO(3)` ### `Input` `alias Input = IO(1)` ### `Output` `alias Output = IO(0)` ### `Unknown` `alias Unknown = IO(-1)` ## Methods ### `__init__` `__init__(value: Int) -> Self` ### `__eq__` `__eq__(self, other: Self) -> Bool` --- ## IOSpec `@register_passable(trivial)` `struct IOSpec[mut: Bool, input: IO]` Parameter used to encode whether a particular tensor argument to a DPS kernel is an output, input, or mutable input. ```mojo Input == IOSpec[False, IO.Input]() Output == IOSpec[True, IO.Output]() MutableInput == IOSpec[True, IO.Input]() FusedInput == IOSpec[False, IO.FusedInput]() FusedOutput == IOSpec[True, IO.FusedOutput]() ``` ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` --- ## io_spec ## Aliases ### `FusedInput` `alias FusedInput = IOSpec()` ### `FusedOutput` `alias FusedOutput = IOSpec()` ### `Input` `alias Input = IOSpec()` ### `IOUnknown` `alias IOUnknown = IOSpec()` ### `MutableInput` `alias MutableInput = IOSpec()` ### `Output` `alias Output = IOSpec()` ## Structs * [​`IO`](/max/api/mojo/tensor/io_spec/IO): * [​`IOSpec`](/max/api/mojo/tensor/io_spec/IOSpec): Parameter used to encode whether a particular tensor argument to a DPS kernel is an output, input, or mutable input. --- ## DynamicTensor `struct DynamicTensor[dtype: DType, rank: Int]` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `Type` `alias Type = ManagedTensorSlice[IOSpec(), static_spec=create_unknown()]` --- ## ManagedTensorSlice `@register_passable(trivial)` `struct ManagedTensorSlice[mut: Bool, input: IO, dtype: DType, rank: Int, //, io_spec: IOSpec[mut, input], *, static_spec: StaticTensorSpec[dtype, rank]]` A view of a tensor that does not own the underlying allocated pointer. When the object lifetime ends it does not free the underlying pointer. Conversely, if a `ManagedTensorSlice` is created, it will not extend the life of the underlying pointer. Therefore, the user must take care to keep the pointer alive until the last use of a `ManagedTensorSlice` instance. This type is useful for writing custom operations where memory is managed by an external runtime, such as MAX's inference stack.
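For example, here is a minimal sketch of that lifetime caveat, using the `DynamicTensor` convenience alias. The import paths are assumptions, and the snippet is illustrative rather than canonical:

```mojo
from max.tensor import DynamicTensor  # assumed import path
from memory import UnsafePointer
from utils.index import IndexList

fn main():
    # The caller owns the buffer; the slice is only a view of it.
    var data = UnsafePointer[Float32].alloc(6)
    for i in range(6):
        data[i] = Float32(i)
    var view = DynamicTensor[DType.float32, 2].Type(data, IndexList[2](2, 3))
    print(view[1, 2])  # 5.0: reads through to `data`
    # Dropping `view` neither frees `data` nor extends its lifetime;
    # `view` must not be used after the buffer is freed.
    data.free()
```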
## Implemented traits `AnyType`, `Copyable`, `DevicePassable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `address_space` `alias address_space = static_spec.address_space` ### `alignment` `alias alignment = static_spec.alignment` ### `device_type` `alias device_type = LayoutTensor[dtype, static_spec.to_layout(), MutableAnyOrigin]` ### `exclusive` `alias exclusive = static_spec.exclusive` ## Methods ### `__init__` `__init__(ptr: UnsafePointer[SIMD[dtype, 1]], slices: InlineArray[Slice, rank], slicer_spec: RuntimeTensorSpec[dtype, rank]) -> Self` Initializes a ManagedTensorSlice from a pointer, an array of slices, and a tensor spec. In general, custom operations should not create `ManagedTensorSlice` instances, but instead use the ones provided by the MAX inference engine. `__init__(ptr: UnsafePointer[SIMD[dtype, 1]], shape: IndexList[rank]) -> Self` Initializes a ManagedTensorSlice from a pointer and shape. In general, custom operations should not create `ManagedTensorSlice` instances, but instead use the ones provided by the MAX inference engine. `__init__(ptr: UnsafePointer[SIMD[dtype, 1]], shape: IndexList[rank], strides: IndexList[rank]) -> Self` Initializes a ManagedTensorSlice from a pointer, shape, and strides. In general, custom operations should not create `ManagedTensorSlice` instances, but instead use the ones provided by the MAX inference engine. ### `__getitem__` `__getitem__(self, indices: IndexList[rank]) -> SIMD[dtype, 1]` Gets the value at the specified indices. **Args:** * ​indices (`IndexList[rank]`): The indices of the value to retrieve. **Returns:** The value at the specified indices. `__getitem__(self, *indices: Int) -> SIMD[dtype, 1]` Gets the value at the specified indices. **Args:** * ​\*indices (`Int`): The indices of the value to retrieve. **Returns:** The value at the specified indices. ### `__setitem__` `__setitem__(self, *indices: Int, *, val: SIMD[dtype, 1])` Stores the value at the specified indices. **Args:** * ​\*indices (`Int`): The indices of the value to store. * ​val (`SIMD[dtype, 1]`): The value to store. `__setitem__(self, indices: IndexList[rank], val: SIMD[dtype, 1])` Stores the value at the specified indices. **Args:** * ​indices (`IndexList[rank]`): The indices of the value to store. * ​val (`SIMD[dtype, 1]`): The value to store. ### `get_type_name` `static get_type_name() -> String` ### `get_device_type_name` `static get_device_type_name() -> String` ### `spec` `spec(self) -> RuntimeTensorSpec[dtype, rank]` Gets the `RuntimeTensorSpec` of this tensor slice, which provides metadata about the tensor slice. **Returns:** The `RuntimeTensorSpec` for this tensor slice. ### `shape` `shape(self) -> IndexList[rank]` Gets the shape of this tensor slice, as an `IndexList`. **Returns:** The shape of this tensor slice. ### `dim_size` `dim_size(self, index: Int) -> Int` Gets the size of a given dimension of this tensor slice using a runtime value. **Args:** * ​index (`Int`): The zero-based index of the dimension. **Returns:** The size of the tensor slice in the given dimension. `dim_size[index: Int](self) -> Int` Gets the size of a given dimension of this tensor slice using a compile-time value. **Parameters:** * ​index (`Int`): The zero-based index of the dimension. **Returns:** The size of the tensor slice in the given dimension. ### `strides` `strides(self) -> IndexList[rank]` Gets the strides of this tensor slice, as an `IndexList`. **Returns:** The strides of this tensor slice.
### `stride_length` `stride_length(self, index: Int) -> Int` Gets the length of the stride of a given dimension of this tensor slice using a runtime value. **Args:** * ​index (`Int`): The zero-based index of the dimension. **Returns:** The stride of the tensor slice in the given dimension. `stride_length[index: Int](self) -> Int` Gets the length of the stride of a given dimension of this tensor slice using a compile-time value. **Parameters:** * ​index (`Int`): The zero-based index of the dimension. **Returns:** The stride of the tensor slice in the given dimension. ### `size` `size(self) -> Int` Computes the tensor slice's number of elements. **Returns:** The total number of elements in the tensor slice. ### `unsafe_ptr` `unsafe_ptr[__type: DType = dtype](self) -> UnsafePointer[SIMD[__type, 1]]` Get the pointer stored in this tensor slice. Because this method exposes the pointer stored in this tensor slice, callers can break the invariants of the tensor slice and cause unexpected behavior. Use it with caution. **Parameters:** * ​\_\_type (`DType`): The type of the `UnsafePointer` in this tensor slice. **Returns:** The `UnsafePointer` which contains the data for this tensor slice. ### `load` `load[width: Int, _rank: Int](self, index: IndexList[_rank]) -> SIMD[dtype, width]` Gets data from this tensor slice as a `SIMD`. **Parameters:** * ​width (`Int`): The width of the `SIMD` value. This must be large enough to contain the data from this tensor slice. * ​\_rank (`Int`): The rank of the tensor slice. **Args:** * ​index (`IndexList[_rank]`): An `IndexList` of size `_rank` to indicate the dimension of the tensor slice to obtain data from. **Returns:** Data from this tensor slice at dimension `index`. ### `store` `store[width: Int, _rank: Int, element_alignment: Int = 1](self: ManagedTensorSlice[io_spec, static_spec=static_spec], index: IndexList[_rank], val: SIMD[dtype, width])` Sets data in this tensor slice from a `SIMD`. **Parameters:** * ​width (`Int`): The width of the `SIMD` value. * ​\_rank (`Int`): The rank of the tensor slice. * ​element\_alignment (`Int`): Indicates the alignment of the pointer stored to memory. This is needed to issue vector stores for GPUs with strict alignment requirements. **Args:** * ​index (`IndexList[_rank]`): An `IndexList` of size `_rank` to indicate the dimension of the tensor slice to set data in. * ​val (`SIMD[dtype, width]`): The data to set into this tensor slice. ### `with_layout` `with_layout[new_rank: Int, //, new_static_shape: DimList, new_static_strides: DimList](self, new_runtime_shape: IndexList[new_rank], new_runtime_strides: IndexList[new_rank], offset_ptr: OptionalReg[UnsafePointer[SIMD[dtype, 1]]] = OptionalReg[UnsafePointer[SIMD[dtype, 1]]]({:i1 0, 1})) -> ManagedTensorSlice[io_spec, static_spec=static_spec.with_layout[::Int](new_static_shape, new_static_strides)]` ### `to_layout_tensor` `to_layout_tensor(self) -> LayoutTensor[dtype, static_spec.to_layout(), MutableAnyOrigin]` ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this buffer to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__repr__` `__repr__(self) -> String` Gets the buffer as a string. **Returns:** A compact string representation of the buffer. ### `__str__` `__str__(self) -> String` Gets the buffer as a string. **Returns:** A compact string representation of the buffer.
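As a concrete illustration of the accessors above, here is a hedged sketch of element access inside a custom op's `execute` body. The registration boilerplate is omitted, and `x` and `out` are assumed to be rank-2 input and output slices of the same dtype and shape, as provided by the MAX inference engine:

```mojo
# Sketch only: `x` (input) and `out` (output) are assumed rank-2 slices
# with identical dtype and shape; `IndexList` comes from `utils.index`.

# Scalar path: copy one element at a time via __getitem__/__setitem__.
for i in range(x.dim_size(0)):
    for j in range(x.dim_size[1]()):  # compile-time-index overload
        out[i, j] = x[i, j]

# Vector path: move four contiguous elements of row 0 with load/store.
var v = x.load[4](IndexList[2](0, 0))  # SIMD[dtype, 4]
out.store[4](IndexList[2](0, 0), v)
```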
--- ## VariadicTensors `@register_passable(trivial)` `struct VariadicTensors[mut: Bool, input: IO, //, dtype: DType, rank: Int, size: Int, io_spec: IOSpec[mut, input], *, static_specs: StaticTuple[StaticTensorSpec[dtype, rank], size]]` A tuple-like container of tensors representing variadic arguments from the graph compiler. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__getitem__` `__getitem__[index: Int](self) -> ManagedTensorSlice[io_spec, static_spec=static_specs.__getitem__[::Indexer](index)]` Returns the tensor at the given position in the variadic argument pack. **Parameters:** * ​index (`Int`): The index into the variadic tensor arguments. **Returns:** The tensor at the specified index. ### `__len__` `__len__(self) -> Int` Returns the number of variadic arguments in the pack. **Returns:** The number of variadic arguments. --- ## foreach `foreach[dtype: DType, rank: Int, //, func: fn[Int](IndexList[rank]) capturing -> SIMD[dtype, $0], *, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), simd_width: Int = get_kernel_simd_width[::DType,::StringSlice[::Bool(), _synchronous: Bool = False, _trace_name: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("mogg.for_each")](tensor: ManagedTensorSlice[io_spec, static_spec=static_spec], ctx: DeviceContextPtr = DeviceContextPtr())` Apply the function `func` to each element of the tensor slice. **Parameters:** * ​dtype (`DType`): The data type of the elements in the tensor slice. * ​rank (`Int`): The rank of the tensor slice. * ​func (`fn[Int](IndexList[rank]) capturing -> SIMD[dtype, $0]`): The function to apply to each element of the tensor slice. * ​target (`StringSlice[StaticConstantOrigin]`): Indicates the type of the target device (e.g. "cpu", "gpu"). * ​simd\_width (`Int`): The SIMD width for the target (usually leave this as its default value). * ​\_synchronous (`Bool`): True to run the custom op synchronously in the runtime (defaults to False). * ​\_trace\_name (`StringSlice[StaticConstantOrigin]`): Name of the executed operation displayed in the trace\_description. **Args:** * ​tensor (`ManagedTensorSlice[io_spec, static_spec=static_spec]`): The output tensor slice which receives the return values from `func`. * ​ctx (`DeviceContextPtr`): The call context (forward this from the custom operation). `foreach[: origin.set, dtype: DType, rank: Int, //, func: fn[Int](IndexList[rank]) capturing -> SIMD[dtype, $0], out_func: fn[Int](IndexList[rank]) capturing -> None, *, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), simd_width: Int = get_kernel_simd_width[::DType,::StringSlice[::Bool(), _synchronous: Bool = False, _trace_name: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("mogg.for_each")](tensor: ManagedTensorSlice[io_spec, static_spec=static_spec], ctx: DeviceContextPtr = DeviceContextPtr())` Apply the function `func` to each element of the tensor slice. **Parameters:** * ​dtype (`DType`): The data type of the elements in the tensor slice. * ​rank (`Int`): The rank of the tensor slice. * ​func (`fn[Int](IndexList[rank]) capturing -> SIMD[dtype, $0]`): The function to apply to each element of the tensor slice. * ​out\_func (`fn[Int](IndexList[rank]) capturing -> None`): The function to apply on each output element. * ​target (`StringSlice[StaticConstantOrigin]`): Indicates the type of the target device (e.g. "cpu", "gpu").
* ​simd\_width (`Int`): The SIMD width for the target (usually leave this as its default value). * ​\_synchronous (`Bool`): True to run the custom op synchronously in the runtime (defaults to False). * ​\_trace\_name (`StringSlice[StaticConstantOrigin]`): Name of the executed operation displayed in the trace\_description. **Args:** * ​tensor (`ManagedTensorSlice[io_spec, static_spec=static_spec]`): The input tensor slice whose values are consumed. * ​ctx (`DeviceContextPtr`): The call context (forward this from the custom operation). --- ## managed_tensor_slice Implements the `ManagedTensorSlice` type - a view of a tensor that doesn't own the underlying data. This type is used to build custom graph operations. ## Aliases ### `InputTensor` `alias InputTensor = ManagedTensorSlice[IOSpec(), static_spec=?]` ### `InputVariadicTensors` `alias InputVariadicTensors = VariadicTensors[?, ?, ?, IOSpec(), static_specs=?]` ### `OutputTensor` `alias OutputTensor = ManagedTensorSlice[IOSpec(), static_spec=?]` ### `OutputVariadicTensors` `alias OutputVariadicTensors = VariadicTensors[?, ?, ?, IOSpec(), static_specs=?]` ## Structs * [​`DynamicTensor`](/max/api/mojo/tensor/managed_tensor_slice/DynamicTensor): * [​`ManagedTensorSlice`](/max/api/mojo/tensor/managed_tensor_slice/ManagedTensorSlice): A view of a tensor that does not own the underlying allocated pointer. When the object lifetime ends it does not free the underlying pointer. Conversely, if a `ManagedTensorSlice` is created, it will not extend the life of the underlying pointer. * [​`VariadicTensors`](/max/api/mojo/tensor/managed_tensor_slice/VariadicTensors): A tuple-like container of tensors representing variadic arguments from the graph compiler. ## Functions * [​`foreach`](/max/api/mojo/tensor/managed_tensor_slice/foreach): Apply the function `func` to each element of the tensor slice. * [​`rebuild_mix_precision_static_tensor_specs_with_input_lambda`](/max/api/mojo/tensor/managed_tensor_slice/rebuild_mix_precision_static_tensor_specs_with_input_lambda): * [​`rebuild_mix_precision_static_tensor_specs_with_output_lambda`](/max/api/mojo/tensor/managed_tensor_slice/rebuild_mix_precision_static_tensor_specs_with_output_lambda): * [​`rebuild_static_tensor_specs_with_input_lambda`](/max/api/mojo/tensor/managed_tensor_slice/rebuild_static_tensor_specs_with_input_lambda): * [​`rebuild_static_tensor_specs_with_output_lambda`](/max/api/mojo/tensor/managed_tensor_slice/rebuild_static_tensor_specs_with_output_lambda): * [​`trace_slice_arg`](/max/api/mojo/tensor/managed_tensor_slice/trace_slice_arg): Helper to stringify the type and shape of a kernel argument for tracing.
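Putting this module together, the following is a hedged sketch of a complete elementwise custom op built on `foreach` and the `InputTensor`/`OutputTensor` aliases. It is modeled on the MAX custom-operation examples; the `@compiler.register` decorator and import paths are assumptions and may differ across MAX versions:

```mojo
import compiler
from max.tensor import InputTensor, OutputTensor, foreach  # assumed import path
from runtime.asyncrt import DeviceContextPtr
from utils.index import IndexList

@compiler.register("add_one")
struct AddOne:
    @staticmethod
    fn execute[target: StaticString](
        output: OutputTensor,
        x: InputTensor[dtype = output.dtype, rank = output.rank],
        ctx: DeviceContextPtr,
    ) raises:
        @parameter
        @always_inline
        fn add_one[width: Int](idx: IndexList[x.rank]) -> SIMD[x.dtype, width]:
            # Elementwise body: loads a SIMD group from `x` and adds one.
            return x.load[width](idx) + 1

        # `foreach` writes the result of `add_one` into every element of
        # `output`, dispatching to the device named by `target`.
        foreach[add_one, target=target](output, ctx)
```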
--- ## rebuild_mix_precision_static_tensor_specs_with_input_lambda `rebuild_mix_precision_static_tensor_specs_with_input_lambda[func_type: AnyTrivialRegType, //, src_type: DType, dst_type: DType, rank: Int](spec: StaticTensorSpec[src_type, rank], in_lambda: func_type) -> StaticTensorSpec[dst_type, rank]` --- ## rebuild_mix_precision_static_tensor_specs_with_output_lambda `rebuild_mix_precision_static_tensor_specs_with_output_lambda[func_type: AnyTrivialRegType, //, src_rank: Int, src_shape: DimList, src_type: DType](spec: StaticTensorSpec[dtype, rank], out_lambda: func_type) -> StaticTensorSpec[src_type, src_rank]` --- ## rebuild_static_tensor_specs_with_input_lambda `rebuild_static_tensor_specs_with_input_lambda[func_type: AnyTrivialRegType, //, dtype: DType, rank: Int](spec: StaticTensorSpec[dtype, rank], in_lambda: func_type) -> StaticTensorSpec[dtype, rank]` --- ## rebuild_static_tensor_specs_with_output_lambda `rebuild_static_tensor_specs_with_output_lambda[func_type: AnyTrivialRegType, //, dtype: DType, rank: Int](spec: StaticTensorSpec[dtype, rank], out_lambda: func_type) -> StaticTensorSpec[dtype, rank]` --- ## trace_slice_arg `trace_slice_arg(name: String, buf: ManagedTensorSlice[io_spec, static_spec=static_spec]) -> String` Helper to stringify the type and shape of a kernel argument for tracing. **Args:** * ​name (`String`): The name of the argument. * ​buf (`ManagedTensorSlice[io_spec, static_spec=static_spec]`): The tensor slice to trace. **Returns:** A string representation of the buffer with its shape and data type. --- ## RuntimeTensorSpec `@register_passable(trivial)` `struct RuntimeTensorSpec[type: DType, rank: Int]` ## Fields * ​shape (`IndexList[rank]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__getitem__` `__getitem__(self, idx: Int) -> Int` ### `bytecount` `bytecount(self) -> Int` Gets the total byte count. **Returns:** The total byte count. --- ## tensor_spec You can import these APIs from the `max.tensor` package. For example: ```mojo from max.tensor import RuntimeTensorSpec ``` ## Structs * [​`RuntimeTensorSpec`](/max/api/mojo/tensor/tensor_spec/RuntimeTensorSpec): --- ## transitional Utilities for the transitional period during NDBuffer deprecation. ## Functions * [​`managed_tensor_slice_to_ndbuffer`](/max/api/mojo/tensor/transitional/managed_tensor_slice_to_ndbuffer): --- ## managed_tensor_slice_to_ndbuffer `managed_tensor_slice_to_ndbuffer[: DType, : Int, spec: StaticTensorSpec[$0, $1], //](tensor: ManagedTensorSlice[io_spec, static_spec=spec]) -> NDBuffer[dtype, rank, MutableAnyOrigin, spec.shape, spec.strides, alignment=spec.alignment, address_space=spec.address_space, exclusive=spec.exclusive]` --- ## kv_cache Contains implementations for several types of key-value caches. [KV caches](/glossary/ai/kv-cache) are used in transformer models to store key-value tensors output from self-attention layers. These APIs are used in the higher-level functions in the [`nn`](/mojo/kernels/nn) package. ## Modules * [​`types`](./types/): This module contains the types for the key-value cache APIs. --- ## ContinuousBatchingKVCache `@register_passable(trivial)` `struct ContinuousBatchingKVCache[type_: DType, kv_params_: KVCacheStaticParams]` Wrapper for the ContinuousKVCache of a given layer in the transformer model. This abstracts the Pointer indirection for accessing the ContinuousKVCache for a given batch entry. This is the type that is passed to the KV projection and flash attention kernels.
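For orientation before the field and method listings, here is a hedged sketch of how a kernel might read a single cached key element through the documented accessors (the helper name and import path are hypothetical):

```mojo
from kv_cache.types import ContinuousBatchingKVCache, KVCacheStaticParams  # assumed import path

# Hypothetical helper: read one key element for batch entry `b`,
# head `h`, token `t`, head-dim position 0.
fn read_key[
    dtype: DType, kv_params: KVCacheStaticParams
](k_cache: ContinuousBatchingKVCache[dtype, kv_params], b: Int, h: Int, t: Int) -> Scalar[dtype]:
    # Only tokens below the cache length for this batch entry are valid.
    if t < k_cache.cache_length(b):
        return k_cache.load[width=1](b, h, t, 0)
    return 0
```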
## Fields * ​blocks (`NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]`): * ​cache\_lengths (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​lookup\_table (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​max\_seq\_length (`SIMD[uint32, 1]`): * ​max\_cache\_length (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `KVCacheT`, `Movable`, `UnknownDestructibility` ## Aliases ### `blocks_shape` `alias blocks_shape = __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size))` ### `blocks_stride` `alias blocks_stride = _strides_from_shape[::DimList,::Int]()` ### `blocks_type` `alias blocks_type = NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]` ### `kv_params` `alias kv_params = kv_params_` ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(blocks: NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()], cache_lengths: NDBuffer[uint32, 1, MutableAnyOrigin], lookup_table: NDBuffer[uint32, 1, MutableAnyOrigin], max_seq_length: SIMD[uint32, 1], max_cache_length: SIMD[uint32, 1]) -> Self` ### `max_tile_size` `static max_tile_size() -> Int` Returns the maximum tile size for the KVCache. ### `cache_lengths_nd` `cache_lengths_nd(self) -> NDBuffer[uint32, 1, MutableAnyOrigin]` ### `cache_length` `cache_length(self, batch_idx: Int) -> Int` ### `load` `load[width: Int](self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int) -> SIMD[type_, width]` ### `store` `store(self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int, val: SIMD[type_, size])` ### `empty_cache` `empty_cache(self) -> Bool` Returns true if the cache\_lengths for all requests is 0, false otherwise. ### `max_prompt_length` `max_prompt_length(self) -> SIMD[uint32, 1]` Returns the maximum sequence length across all batches of the current request. ### `max_context_length` `max_context_length(self) -> SIMD[uint32, 1]` Returns the maximum cache length used across all batches of the current request. ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self, batch_idx: Int, start_tok_idx: Int, head_idx: Int, head_dim_idx: Int = 0) -> UnsafePointer[SIMD[type_, 1]]` --- ## ContinuousBatchingKVCacheCollection `struct ContinuousBatchingKVCacheCollection[type_: DType, kv_params_: KVCacheStaticParams]` This is a "view" of the cache for the given sequences in the batch. This object does not own the underlying buffers in k\_cache and v\_cache; it borrows them from the BlockWrappers in our KVCacheManager.
It does own the Pointer\[NDBuffer\[type, 3]] and the valid\_lengths buffer. ## Fields * ​cache\_lengths (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​lookup\_table (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​blocks (`NDBuffer[type_, 6, MutableAnyOrigin, DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]`): * ​max\_seq\_length (`SIMD[uint32, 1]`): * ​max\_cache\_length (`SIMD[uint32, 1]`): * ​kv\_cache\_dynamic\_shape (`IndexList[4]`): * ​kv\_cache\_dynamic\_strides (`IndexList[4]`): ## Implemented traits `AnyType`, `Copyable`, `KVCollectionT`, `Movable`, `UnknownDestructibility` ## Aliases ### `blocks_shape` `alias blocks_shape = DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size))` ### `blocks_stride` `alias blocks_stride = _strides_from_shape[::DimList,::Int]()` ### `blocks_type` `alias blocks_type = NDBuffer[type_, 6, MutableAnyOrigin, DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]` ### `CacheType` `alias CacheType = ContinuousBatchingKVCache[type_, kv_params_]` ### `kv_params` `alias kv_params = kv_params_` ### `name_str` `alias name_str = "continuous_batching"` ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(out self, blocks: NDBuffer[type_, 6, MutableAnyOrigin], cache_lengths: NDBuffer[uint32, 1, MutableAnyOrigin], lookup_table: NDBuffer[uint32, 1, MutableAnyOrigin], max_seq_length: SIMD[uint32, 1], max_cache_length: SIMD[uint32, 1])` ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value. ### `get_key_cache` `get_key_cache(self, layer_idx: Int) -> ContinuousBatchingKVCache[type_, kv_params_]` ### `get_value_cache` `get_value_cache(self, layer_idx: Int) -> ContinuousBatchingKVCache[type_, kv_params_]` ### `cache_length` `cache_length(self, bs_idx: Int) -> Int` --- ## KVCacheStaticParams `@register_passable(trivial)` `struct KVCacheStaticParams` ## Fields * ​num\_heads (`UInt`): * ​head\_size (`UInt`): ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` --- ## KVCacheT Trait for different KVCache types and implementations. Represents a single (key or value) cache. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `kv_params` `alias kv_params` ### `type` `alias type` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `cache_lengths_nd` `cache_lengths_nd(self: _Self) -> NDBuffer[uint32, 1, MutableAnyOrigin]` Returns the cache lengths as an NDBuffer. ### `cache_length` `cache_length(self: _Self, batch_idx: Int) -> Int` Returns the length of the cache for a given batch index. ### `load` `load[width: Int](self: _Self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int) -> SIMD[get_vtable_entry(:trait _Self, "type"), width]` Loads an element from the given index.
### `store` `store(self: _Self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int, val: SIMD[get_vtable_entry(:trait _Self, "type"), size])` Stores an element at the given index. ### `empty_cache` `empty_cache(self: _Self) -> Bool` Returns true if the cache\_lengths for all requests is 0, false otherwise. ### `max_prompt_length` `max_prompt_length(self: _Self) -> SIMD[uint32, 1]` Returns the maximum sequence length across all batches of the current request. ### `max_context_length` `max_context_length(self: _Self) -> SIMD[uint32, 1]` Returns the maximum cache length used across all batches of the current request. ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self: _Self, batch_idx: Int, start_tok_idx: Int, head_idx: Int, head_dim_idx: Int = 0) -> UnsafePointer[SIMD[get_vtable_entry(:trait _Self, "type"), 1]]` Returns a pointer to the KVCache block at the given index. Paged KVCache implementations must have a block\_size that is a multiple of, and greater than, the layout's first dimension. ### `max_tile_size` `static max_tile_size() -> Int` Returns the maximum tile size for the KVCache. --- ## KVCollectionT Trait for a pair of caches (keys and values). ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `CacheType` `alias CacheType` ### `kv_params` `alias kv_params` ### `name_str` `alias name_str` ### `type` `alias type` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `get_key_cache` `get_key_cache(self: _Self, layer_idx: Int) -> get_vtable_entry(:trait _Self, "CacheType")` ### `get_value_cache` `get_value_cache(self: _Self, layer_idx: Int) -> get_vtable_entry(:trait _Self, "CacheType")` ### `cache_length` `cache_length(self: _Self, bs_idx: Int) -> Int` --- ## PagedKVCache `@register_passable(trivial)` `struct PagedKVCache[type_: DType, kv_params_: KVCacheStaticParams, page_size: Int]` The PagedKVCache is a wrapper around the KVCache blocks for a given layer. It is used to access the KVCache blocks for PagedAttention.
## Fields * ​blocks (`NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]`): * ​cache\_lengths (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​lookup\_table (`NDBuffer[uint32, 2, MutableAnyOrigin]`): * ​max\_seq\_length (`SIMD[uint32, 1]`): * ​max\_cache\_length (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `KVCacheT`, `Movable`, `UnknownDestructibility` ## Aliases ### `blocks_shape` `alias blocks_shape = __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size))` ### `blocks_stride` `alias blocks_stride = _strides_from_shape[::DimList,::Int]()` ### `blocks_type` `alias blocks_type = NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]` ### `kv_params` `alias kv_params = kv_params_` ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(blocks: NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()], cache_lengths: NDBuffer[uint32, 1, MutableAnyOrigin], lookup_table: NDBuffer[uint32, 2, MutableAnyOrigin], max_seq_length: SIMD[uint32, 1], max_cache_length: SIMD[uint32, 1]) -> Self` ### `max_tile_size` `static max_tile_size() -> Int` Returns the maximum tile size for the KVCache. ### `cache_lengths_nd` `cache_lengths_nd(self) -> NDBuffer[uint32, 1, MutableAnyOrigin]` ### `cache_length` `cache_length(self, batch_idx: Int) -> Int` Returns the length of the cache for a given batch index. ### `load` `load[width: Int](self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int) -> SIMD[type_, width]` Loads an element from the given index. ### `store` `store(self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int, val: SIMD[type_, size])` Stores an element at the given index. ### `empty_cache` `empty_cache(self) -> Bool` Returns true if the cache\_lengths for all requests is 0, false otherwise. ### `max_prompt_length` `max_prompt_length(self) -> SIMD[uint32, 1]` Returns the maximum sequence length across all batches of the current request. ### `max_context_length` `max_context_length(self) -> SIMD[uint32, 1]` Returns the maximum cache length used across all batches of the current request. 
### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self, batch_idx: Int, start_tok_idx: Int, head_idx: Int, head_dim_idx: Int = 0) -> UnsafePointer[SIMD[type_, 1]]` --- ## PagedKVCacheCollection `struct PagedKVCacheCollection[type_: DType, kv_params_: KVCacheStaticParams, page_size: Int]` ## Fields * ​blocks (`NDBuffer[type_, 6, MutableAnyOrigin, DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]`): * ​cache\_lengths (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​lookup\_table (`NDBuffer[uint32, 2, MutableAnyOrigin]`): * ​max\_seq\_length (`SIMD[uint32, 1]`): * ​max\_cache\_length (`SIMD[uint32, 1]`): * ​kv\_cache\_dynamic\_shape (`IndexList[4]`): * ​kv\_cache\_dynamic\_strides (`IndexList[4]`): ## Implemented traits `AnyType`, `Copyable`, `KVCollectionT`, `Movable`, `UnknownDestructibility` ## Aliases ### `blocks_shape` `alias blocks_shape = DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size))` ### `blocks_stride` `alias blocks_stride = _strides_from_shape[::DimList,::Int]()` ### `blocks_type` `alias blocks_type = NDBuffer[type_, 6, MutableAnyOrigin, DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]` ### `CacheType` `alias CacheType = PagedKVCache[type_, kv_params_, page_size]` ### `kv_params` `alias kv_params = kv_params_` ### `name_str` `alias name_str = "paged"` ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(out self, blocks: NDBuffer[type_, 6, MutableAnyOrigin], cache_lengths: NDBuffer[uint32, 1, MutableAnyOrigin], lookup_table: NDBuffer[uint32, 2, MutableAnyOrigin], max_seq_length: SIMD[uint32, 1], max_cache_length: SIMD[uint32, 1])` ### `__copyinit__` `__copyinit__(out self, other: Self)` ### `__moveinit__` `__moveinit__(out self, owned other: Self)` ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value. ### `get_key_cache` `get_key_cache(self, layer_idx: Int) -> PagedKVCache[type_, kv_params_, page_size]` ### `get_value_cache` `get_value_cache(self, layer_idx: Int) -> PagedKVCache[type_, kv_params_, page_size]` ### `cache_length` `cache_length(self, bs_idx: Int) -> Int` --- ## types This module contains the types for the key-value cache APIs. The module includes structs implementing several different types of [KV caches](/glossary/ai/kv-cache). This module also defines two traits that specify the roles of the different structs: * `KVCacheT`: Defines the interface for a single (key or value) cache. * `KVCollectionT`: Defines the interface for a pair of caches (keys and values). ## Structs * [​`ContinuousBatchingKVCache`](./ContinuousBatchingKVCache): Wrapper for the ContinuousKVCache of a given layer in the transformer model. * [​`ContinuousBatchingKVCacheCollection`](./ContinuousBatchingKVCacheCollection): This is a "view" of the cache for the given sequences in the batch. * [​`KVCacheStaticParams`](./KVCacheStaticParams): * [​`PagedKVCache`](./PagedKVCache): The PagedKVCache is a wrapper around the KVCache blocks for a given layer. It is used to access the KVCache blocks for PagedAttention. * [​`PagedKVCacheCollection`](./PagedKVCacheCollection): ## Traits * [​`KVCacheT`](./KVCacheT): Trait for different KVCache types and implementations. * [​`KVCollectionT`](./KVCollectionT): Trait for a pair of caches (keys and values).
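Because both collection types above conform to `KVCollectionT`, kernels can be written generically against the traits rather than a concrete cache type. A minimal hedged sketch (the helper name is hypothetical and the import path assumed):

```mojo
from kv_cache.types import KVCollectionT  # assumed import path

# Hypothetical helper: find the longest cached sequence in a batch using
# only the trait's documented `cache_length` method.
fn longest_cached_sequence[C: KVCollectionT](kv: C, batch_size: Int) -> Int:
    var longest = 0
    for b in range(batch_size):
        longest = max(longest, kv.cache_length(b))
    return longest
```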
--- ## Element `struct Element[dtype: DType, layout: Layout, /, index_type: DType = _get_index_type(layout)]` A wrapper around SIMD types that provides layout-driven vectorized operations. The `Element` struct extends SIMD types with layout-aware load and store operations, enabling efficient vectorized access to multi-dimensional data. It maps between logical tensor coordinates and physical memory locations according to the specified layout. ## Parameters * ​dtype (`DType`): The data type of the elements. * ​layout (`Layout`): The memory layout describing how elements are organized. * ​index\_type (`DType`): The integer type of the index pointing to each element. ## Fields * ​element\_data (`SIMD[dtype, layout.size()]`): The actual SIMD data stored in this element. This field contains the vectorized data values that can be processed efficiently using SIMD operations. * ​runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): The runtime layout information for memory access patterns. This field stores the layout information needed to map between logical tensor coordinates and physical memory locations, supporting both compile-time and runtime-determined access patterns. ## Implemented traits `AnyType`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `element_data_type` `alias element_data_type = SIMD[dtype, layout.size()]` The SIMD type used to store and process the element data. This type alias defines a SIMD vector with the specified data type and size matching the layout's total element count, enabling efficient vectorized operations. ## Methods ### `__init__` `@implicit` `__init__(out self, element_data: SIMD[dtype, layout.size()])` Initializes an Element with the given SIMD data. **Args:** * ​element\_data (`SIMD[dtype, layout.size()]`): The SIMD data to initialize the element with. `__init__(out self, element_data: SIMD[dtype, layout.size()], runtime_layout: RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type])` Initializes an Element with the given SIMD data and runtime layout. **Args:** * ​element\_data (`SIMD[dtype, layout.size()]`): The SIMD data to initialize the element with. * ​runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): The runtime layout to use for memory access. ### `load` `static load(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], runtime_layout: RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type] = RuntimeLayout()) -> Self` Loads data from memory according to the specified layout. This method loads data from memory using the layout information to determine the memory access pattern. It supports both rank-1 and rank-2 layouts with various stride patterns, optimizing for contiguous memory access when possible. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to load from. * ​runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): The runtime layout to use for memory access. **Returns:** A new `Element` containing the loaded data. ### `masked_load` `static masked_load(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], runtime_layout: RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type] = RuntimeLayout()) -> Self` Loads data from memory with masking for partial loads. 
This method loads data from memory using the layout information, but also handles cases where the runtime dimensions are smaller than the static layout dimensions. It ensures that only valid memory locations are accessed. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to load from. * ​runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): The runtime layout to use for memory access. **Returns:** A new `Element` containing the loaded data, with zeros in positions beyond the runtime dimensions. ### `store` `store(self, ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin])` Stores element data to memory according to the specified layout. This method performs a layout-aware store operation, writing data to memory following the access patterns defined by the layout. It optimizes memory writes based on the layout's stride patterns to maximize performance. The method handles different memory layout patterns: * For rank-1 tensors with contiguous memory (stride=1), it uses vectorized stores * For rank-2 tensors with contiguous rows or columns, it uses optimized slice-based stores * For non-contiguous memory layouts, it performs element-by-element stores Unlike `masked_store()`, this method assumes the full static dimensions will be written and does not perform runtime dimension boundary checking. Note: This method is constrained to layouts with rank <= 2. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): Mutable pointer to the memory location where data will be stored. ### `masked_store` `masked_store(self, ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin])` Stores element data to memory with masking for partial stores. This method performs a layout-aware store operation with boundary checking. It ensures that only valid memory locations are written to when the runtime dimensions are smaller than the static layout dimensions, preventing out-of-bounds memory access. The method optimizes for different memory layouts: * For contiguous memory (stride=1), it uses vectorized stores when possible * For non-contiguous memory, it performs element-by-element stores * For all patterns, it respects runtime dimension bounds Note: This method is constrained to layouts with rank <= 2. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): Pointer to the memory location where data will be stored. ### `__str__` `__str__(self) -> String` Returns a string representation of the element. **Returns:** A string representation of the element's data. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the element to the specified writer. **Parameters:** * ​W (`Writer`): Type parameter representing a Writer implementation. **Args:** * ​writer (`W`): The writer to output the element representation to. --- ## MemoryElement `struct MemoryElement[dtype: DType, layout: Layout, address_space: AddressSpace, alignment: Int, /, *, index_type: DType = _get_index_type(layout, address_space)]` Represents data in memory organized according to a specific layout. The `MemoryElement` struct provides a high-level interface for accessing data in memory with a specific layout. It encapsulates a pointer to the memory location and the runtime layout information needed to access the data correctly.
This abstraction enables efficient memory operations that respect the underlying memory organization, supporting vectorized loads and stores while handling different memory layouts transparently. ## Parameters * ​dtype (`DType`): The data type of the elements. * ​layout (`Layout`): The memory layout describing how elements are organized. * ​address\_space (`AddressSpace`): The memory address space where the data is located. * ​alignment (`Int`): The memory alignment requirement for the data. * ​index\_type (`DType`): The integer type of the index pointing to each memory element. ## Fields * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment]`): Pointer to the memory location where the data is stored. This pointer provides access to the underlying memory with the specified address space and alignment requirements. It points to the first element of the data structure in memory. * ​runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): Runtime layout information used for memory access calculations. This field stores the runtime layout information needed to compute memory offsets for accessing elements according to the specified layout pattern. It handles both compile-time known dimensions and runtime-determined dimensions. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment], runtime_layout: RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type])` Initializes a `MemoryElement` with the given pointer and runtime layout. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment]`): Pointer to the memory location of the element. * ​runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): The runtime layout to use for memory access. ### `load` `load(self, out result: Element[dtype, layout, index_type])` Loads data from memory according to the specified layout. This method performs a layout-aware load operation, reading data from memory following the access patterns defined by the layout. It optimizes memory reads based on the layout's stride patterns to maximize performance. The method leverages the underlying `Element.load` implementation which handles different memory layout patterns including contiguous and strided access. **Returns:** An `Element` containing the loaded data organized according to the layout. ### `store` `store(self, src: Element[dtype, layout, index_type])` Stores element data to the memory location of this MemoryElement. This method performs a layout-aware store operation, writing data to memory following the access patterns defined by the layout. It optimizes memory writes based on the layout's stride patterns to maximize performance. The method delegates to the `Element.store` implementation which handles different memory layout patterns including vectorized stores for contiguous memory and element-by-element stores for non-contiguous layouts. **Args:** * ​src (`Element[dtype, layout, index_type]`): The `Element` containing the data to store. ### `transfer` `transfer(self, src: MemoryElement[dtype, layout, address_space, alignment, index_type=index_type])` Transfers data from another `MemoryElement` to this one. This method efficiently transfers data between memory locations with potentially different layouts and data types. It performs the following operations: 1. 
Loads data from the source `MemoryElement` using its layout 2. Converts the data to the destination data type if necessary 3. Stores the converted data to the destination memory location using its layout This provides a high-performance way to copy and convert data between different memory representations while respecting both source and destination memory layouts. **Args:** * ​src (`MemoryElement[dtype, layout, address_space, alignment, index_type=index_type]`): The source `MemoryElement` to transfer data from. --- ## element Provides element-based access to memory using layout-driven vectorization. This module implements efficient memory access patterns for multi-dimensional data using the layout system. It provides abstractions for loading and storing data with specific memory layouts, enabling vectorized operations that respect the underlying memory organization. Key components: * `Element`: A wrapper around SIMD types that provides layout-driven vectorized operations * `MemoryElement`: Represents data in memory organized according to a specific layout These components enable efficient tensor operations by ensuring memory accesses follow optimal patterns defined by the layout system. ## Structs * [​`Element`](./Element): A wrapper around SIMD types that provides layout-driven vectorized operations. * [​`MemoryElement`](./MemoryElement): Represents data in memory organized according to a specific layout. --- ## layout Provides layout and layout tensor types, which abstract memory layout for multidimensional data. * The [`Layout`](/mojo/kernels/layout/layout/Layout) type represents a mapping between a set of logical coordinates and a linear index. It can be used, for example, to map logical tensor coordinates to a memory address, or to map GPU threads to tiles of data. * The [`LayoutTensor`](/mojo/kernels/layout/layout_tensor/LayoutTensor) type is a high-performance tensor with explicit memory layout via a `Layout`. ## Modules * [​`element`](./element/): Provides element-based access to memory using layout-driven vectorization. * [​`int_tuple`](./int_tuple/): Hierarchical integer tuple data structures for high-performance tensor operations. * [​`layout`](./layout/): Provides a high-performance tensor layout system for memory mapping and indexing. * [​`layout_tensor`](./layout_tensor/): Provides the `LayoutTensor` type for representing multidimensional data. * [​`math`](./math/): Implements math methods that work on layout tensors. * [​`runtime_layout`](./runtime_layout/): Provides the `RuntimeLayout` type and functions for working with it. You can use `RuntimeLayout` to define a layout where the dimensions are not known at compile time. * [​`runtime_tuple`](./runtime_tuple/): Provides the `RuntimeTuple` data structure and related utility functions for handling tuple-like data with both compile-time and runtime elements. `RuntimeTuple` is designed for high-performance tensor operations, supporting efficient manipulation of multi-dimensional data structures like shapes, indices, and coordinates. * [​`swizzle`](./swizzle/): Defines swizzle layouts for optimizing memory access patterns. 
* [​`tensor_builder`](./tensor_builder/): Tensor Builder Module * [​`tensor_core`](./tensor_core/): Tensor Core Module for High-Performance Matrix Operations * [​`tensor_core_async`](./tensor_core_async/): Tensor Core Async Module * [​`tma_async`](./tma_async/): Tensor Memory Accelerator (TMA) Asynchronous Operations Module --- ## IntArray `@register_passable` `struct IntArray` A memory-efficient, register-passable array of integers. `IntArray` provides a low-level implementation of a dynamically-sized integer array with direct memory management. It supports both owned and non-owned (view) modes for efficient memory sharing without copying. This struct serves as the underlying storage mechanism for `IntTuple` and related data structures, optimized for high-performance tensor operations. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(size: Int = 0) -> Self` Initialize a new owned `IntArray` with the specified size. **Args:** * ​size (`Int`): Number of integers to allocate space for. Defaults to 0. `__init__(*, non_owned: Self, offset: Int = 0) -> Self` Create a non-owned view into another `IntArray`. Creates a view starting at the specified offset in the source array. The resulting array doesn't own the memory and won't free it when destroyed. **Args:** * ​non\_owned (`Self`): The source array to create a view into. * ​offset (`Int`): Starting position in the source array. Defaults to 0. ### `__copyinit__` `__copyinit__(existing: Self) -> Self` Initialize by copying an existing `IntArray`. For owned arrays, this performs a deep copy of the data. For non-owned arrays, this creates another view of the same data (zero-copy operation). **Args:** * ​existing (`Self`): The source array to copy from. ### `__del__` `__del__(owned self)` Destroy the `IntArray` and free its memory if owned. Only frees memory for owned arrays (positive \_size) to prevent double-free errors with views. ### `__getitem__` `__getitem__(self, idx: Int) -> Int` Access an element at the specified index. Note: Bounds checking is only performed when `INT_TUPLE_VALIDATION` is enabled. **Args:** * ​idx (`Int`): Zero-based index of the element to access. **Returns:** The integer value at the specified index. ### `__setitem__` `__setitem__(mut self, idx: Int, value: Int)` Set the value at the specified index. Note: Bounds checking is only performed when `INT_TUPLE_VALIDATION` is enabled. **Args:** * ​idx (`Int`): Zero-based index of the element to modify. * ​value (`Int`): The integer value to store at the specified index. ### `owning` `owning(self) -> Bool` Check if this `IntArray` owns its memory. **Returns:** True if this array owns its memory (positive \_size), False if it's a view (negative \_size). ### `size` `size(self) -> Int` Get the number of elements in the array. **Returns:** The number of elements in the array, regardless of ownership status. ### `copy_from` `copy_from(mut self, offset: Int, source: Self, size: Int)` Copy elements from another `IntArray`. **Args:** * ​offset (`Int`): Destination offset in this array. * ​source (`Self`): Source array to copy from. * ​size (`Int`): Number of elements to copy. `copy_from(mut self, dst_offset: Int, source: Self, src_offset: Int, size: Int)` Copy elements from another IntArray with source offset. **Args:** * ​dst\_offset (`Int`): Destination offset in this array. * ​source (`Self`): Source array to copy from. * ​src\_offset (`Int`): Source offset in the source array. * ​size (`Int`): Number of elements to copy. 
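A short hedged example of `IntArray`'s owned and view modes, using only the constructors and accessors documented above (the import path is assumed):

```mojo
from layout.int_tuple import IntArray  # assumed import path

fn main():
    var arr = IntArray(4)  # owned allocation of four Ints
    for i in range(arr.size()):
        arr[i] = i * 10
    # A zero-copy view starting at offset 1; it never frees the memory.
    var view = IntArray(non_owned=arr, offset=1)
    print(view[0])  # 10
    print(arr.owning(), view.owning())  # True False
```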
--- ## IntTuple `struct IntTuple[origin: ImmutableOrigin = {}]` A hierarchical, nested tuple of integers with efficient memory management. IntTuple provides a flexible data structure for representing multi-dimensional shapes, indices, and other nested integer collections. It supports both flat and hierarchical representations with efficient memory sharing. This structure is fundamental for tensor operations, layout specifications, and dimension handling in high-performance computing contexts. ## Parameters * ​origin (`ImmutableOrigin`): Origin tracking for memory safety. Defaults to the current origin. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `EqualityComparable`, `Intable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `MinimumValue` `alias MinimumValue = -65534` Minimum allowed value for integers in an `IntTuple`. This constant defines the lower bound for integer values that can be stored directly in an `IntTuple`. Values below this threshold are reserved for internal use to represent structural information like sub-tuple offsets. ## Methods ### `__init__` `__init__(out self)` Initialize an empty IntTuple. Creates an `IntTuple` with zero elements, which can be used as a starting point for building tuples incrementally with `append` or `extend`. Performance: * Minimal allocation (just a single element for length). * Structure validation only performed when `INT_TUPLE_VALIDATION` is enabled. `__init__(out self, *, num_elems: Int)` Initialize an `IntTuple` with a specified number of uninitialized elements. Creates an `IntTuple` with space for the specified number of elements, but does not initialize the elements themselves. Note: Structure validation only performed when `INT_TUPLE_VALIDATION` is enabled. **Args:** * ​num\_elems (`Int`): The number of elements to allocate space for. `@implicit` `__init__(out self, *elements: Int)` Initialize an `IntTuple` with a variadic list of integers. Creates an `IntTuple` containing the provided integer values. This constructor is implicit, allowing direct conversion from integer lists. **Args:** * ​\*elements (`Int`): Variable number of integer values to store in the tuple. `__init__(out self, elements: VariadicList[Int])` Initialize an `IntTuple` with a list of integers. Creates an `IntTuple` containing the provided integer values. This constructor is implicit, allowing direct conversion from integer lists. Notes: * Pre-allocates exact memory needed for efficiency. * Validates that all values are above `MinimumValue`. If any value is less than `MinimumValue`, aborts with an error message. * Structure validation only performed when `INT_TUPLE_VALIDATION` is enabled. **Args:** * ​elements (`VariadicList[Int]`): List of integer values to store in the tuple. `@implicit` `__init__(out self, value: Int)` Initialize an `IntTuple` with a single integer value. Creates an `IntTuple` containing a single integer element. **Args:** * ​value (`Int`): The integer value to store in the tuple. `__init__(out self, *elements: IntTuple[origin], *, __list_literal__: Tuple[] = Tuple())` Initialize an `IntTuple` with nested IntTuples. Creates a hierarchical `IntTuple` containing the provided `IntTuple` elements, preserving their nested structure. **Args:** * ​\*elements (`IntTuple[origin]`): Variable number of `IntTuple` values to store in the tuple. * ​`__list_literal__` (`Tuple[]`): Specifies that this constructor can be used for list literals.
`__init__(out self, *, non_owned: IntArray)` Initialize an `IntTuple` with a non-owned `IntArray`. Creates an `IntTuple` that uses the provided `IntArray` as its storage without taking ownership. This allows creating views into existing `IntTuple` data without copying. **Args:** * ​non\_owned (`IntArray`): The `IntArray` to use as storage without taking ownership. `__init__(out self, existing: Self, rng: _StridedRange)` Initialize an `IntTuple` as a slice of an existing `IntTuple`. Creates a new `IntTuple` containing only the elements from the existing `IntTuple` that are specified by the range. Notes: * Preserves nested structure of elements in the slice. * Structure validation only performed when `INT_TUPLE_VALIDATION` is enabled. **Args:** * ​existing (`Self`): The source `IntTuple` to slice from. * ​rng (`_StridedRange`): The range of indices to include in the new `IntTuple`. `__init__(out self, dimlist: DimList)` Initialize an `IntTuple` from a DimList. Creates an `IntTuple` containing the dimensions from a DimList, handling both defined and undefined dimensions appropriately. Notes: * Converts undefined dimensions to `UNKNOWN_VALUE`. * Validates that all values are above `MinimumValue`. If any value is less than `MinimumValue`, aborts with an error message. **Args:** * ​dimlist (`DimList`): The DimList containing dimension information. `@implicit` `__init__(out self, zipper: _zip[origin, 2])` Initialize an `IntTuple` from a zip iterator. Creates an `IntTuple` by appending each element from the zip iterator. This constructor is implicit, allowing direct conversion from zip iterators. Note: This implementation is not optimized and may be improved in future versions. **Args:** * ​zipper (`_zip[origin, 2]`): A zip iterator containing pairs of elements to append. ### `__copyinit__` `__copyinit__(out self, existing: Self)` Initialize by copying an existing `IntTuple`. Creates a deep copy of the provided `IntTuple`, copying all its data into newly allocated memory. Note: There is a Mojo bug where this method unnecessarily propagates the origin of self to the new copy. **Args:** * ​existing (`Self`): The `IntTuple` to copy from. ### `__getitem__` `__getitem__(self, _idx: Int) -> IntTuple[self]` Retrieves an element at the specified index from the `IntTuple`. Supports negative indexing (e.g., `-1` for the last element). Notes: If index validation is enabled and the index is out of bounds, aborts with an error message. **Args:** * ​\_idx (`Int`): The index of the element to retrieve. **Returns:** An `IntTuple` containing either a single value or a sub-tuple. `__getitem__(self, span: Slice) -> Self` Retrieves a slice of elements from the `IntTuple`. Creates a new `IntTuple` containing the elements specified by the slice. **Args:** * ​span (`Slice`): A slice object specifying the range of elements to retrieve. **Returns:** A new `IntTuple` containing the specified elements. ### `__lt__` `__lt__(self, rhs: IntTuple[origin]) -> Bool` Compare two `IntTuple`s lexicographically. This function performs element-wise comparison of two `IntTuple`s and determines if the first is lexicographically less than the second. It compares corresponding elements until it finds a pair where the elements differ. Example: ```mojo from layout.int_tuple import IntTuple var tuple1 = IntTuple(1, 2, 3) var tuple2 = IntTuple(1, 2, 4) var result = tuple1 < tuple2 ``` **Args:** * ​rhs (`IntTuple[origin]`): The other `IntTuple` to compare. **Returns:** True if `self` is lexicographically less than `rhs`, False otherwise.
### `__eq__` `__eq__(self, other: Self) -> Bool` Equality operator for `IntTuple`. **Args:** * ​other (`Self`): The `IntTuple` to compare with. **Returns:** True if the `IntTuple`s are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Inequality operator for `IntTuple`. **Args:** * ​other (`Self`): The `IntTuple` to compare with. **Returns:** True if the `IntTuple`s are not equal, False otherwise. ### `elements_size` `static elements_size[origin: ImmutableOrigin](elements: VariadicListMem[IntTuple[origin], origin, is_owned]) -> Int` Calculate the total storage size needed for a list of IntTuples. Computes the sum of sizes for all elements, accounting for both direct integer values and nested sub-tuples. **Parameters:** * ​origin (`ImmutableOrigin`): Origin of the elements in the `IntTuple`. **Args:** * ​elements (`VariadicListMem[IntTuple[origin], origin, is_owned]`): List of `IntTuple` elements to measure. **Returns:** The total storage size required for all elements. `static elements_size[origin: ImmutableOrigin, n: Int](elements: InlineArray[Pointer[IntTuple, origin], n], idx: Int) -> Int` Calculate the total storage size needed for IntTuples at a specific index. Computes the sum of sizes for all elements at the given index in an array of `IntTuple` pointers. **Parameters:** * ​origin (`ImmutableOrigin`): Origin tracking for memory safety. * ​n (`Int`): Size of the inline array. **Args:** * ​elements (`InlineArray[Pointer[IntTuple, origin], n]`): Array of pointers to `IntTuple`s. * ​idx (`Int`): Index to access in each `IntTuple`. **Returns:** The total storage size required for all elements at the specified index. ### `owned_copy` `owned_copy(self) -> IntTuple` Create a deep copy of this `IntTuple` with its own memory ownership. This method creates a completely independent copy of the `IntTuple` with newly allocated memory. Unlike `__copyinit__`, this method can be called on an existing instance to create a separate copy. Example: ```mojo from layout import IntTuple var original = IntTuple(1, 2, 3) var copy = original.owned_copy() # Modifying copy will not affect original ``` . **Returns:** A new `IntTuple` containing the same data as this one but with independent memory ownership. ### `replace_entry` `replace_entry(self, idx: Int, value: IntTuple[origin]) -> IntTuple` Replace an entry in the tuple with another `IntTuple`. Creates a new `IntTuple` with the element at the specified index replaced by the provided `IntTuple`. Note: If the index is out of bounds and `INT_TUPLE_VALIDATION` is enabled, aborts with an error message. **Args:** * ​idx (`Int`): The index of the element to replace. * ​value (`IntTuple[origin]`): The `IntTuple` to insert at the specified index. **Returns:** A new `IntTuple` with the replacement applied. `replace_entry(mut self, idx: Int, *, int_value: Int)` Replace an integer value at the specified index in-place. Directly modifies the tuple by replacing the integer value at the given index. This is more efficient than creating a new tuple when only a single value needs to be changed. Note: If the index is out of bounds and `INT_TUPLE_VALIDATION` is enabled, aborts with an error message. **Args:** * ​idx (`Int`): The index of the element to replace. * ​int\_value (`Int`): The integer value to insert at the specified index. ### `count_values` `count_values(self) -> Int` Count the total number of integer values in this tuple hierarchy. Recursively traverses the nested tuple structure and counts all integer values. 
This is useful for determining the size needed for flattened representations. Note: For a flat tuple, this will return the same value as `len(self)`. For nested tuples, it counts all leaf integer values. **Returns:** The total count of integer values in this tuple and all nested tuples. ### `flatten` `flatten(self) -> IntTuple` Flatten a nested `IntTuple` into a single-level `IntTuple`. This function converts a hierarchical `IntTuple` structure into a flat sequence of integer values, preserving the order of elements. **Returns:** A new `IntTuple` containing all integer values in a flat structure. ### `all_known` `all_known(self) -> Bool` Check if all values in this tuple hierarchy are known (not `UNKNOWN_VALUE`). Recursively traverses the nested tuple structure and checks if any value is equal to `UNKNOWN_VALUE`. **Returns:** True if all values in this tuple and nested tuples are known, False if any value is `UNKNOWN_VALUE`. ### `append` `append(mut self, *elements: IntTuple[origin])` Append one or more `IntTuple` elements to this tuple. This method modifies the tuple in-place by adding the provided elements to the end of the tuple. It handles both value tuples and nested tuples. Notes: * This operation requires reallocating the underlying `IntArray` storage to accommodate the new elements, which may impact performance for large tuples. * Aborts if called on a non-owning (sub-tuple) instance. **Args:** * ​\*elements (`IntTuple[origin]`): Variable number of `IntTuple` objects to append to this tuple. ### `extend` `extend(mut self, tuple: IntTuple[origin])` Extends this tuple by appending all elements from another tuple. This method modifies the tuple in-place by adding all elements from the provided tuple to the end of this tuple. It efficiently handles both value elements and nested tuples. Notes: * This operation requires reallocating the underlying `IntArray` storage to accommodate the new elements, which may impact performance for large tuples. * Aborts if called on a non-owning (sub-tuple) instance. * If the input tuple is empty, this method returns without making any changes. **Args:** * ​tuple (`IntTuple[origin]`): The `IntTuple` whose elements will be appended to this tuple. ### `size` `size(self) -> Int` Returns the total size of the `IntTuple` in memory. For owning tuples, returns the size of the underlying `IntArray`. For non-owning tuples, calculates the size recursively. **Returns:** The total size in memory units. ### `tuple_size` `static tuple_size(data: IntArray) -> Int` Recursively calculates the size of a tuple represented by an `IntArray`. This method traverses the tuple structure, accounting for both direct values and nested sub-tuples to compute the total memory footprint. **Args:** * ​data (`IntArray`): `IntArray` containing the tuple data. **Returns:** The total size of the tuple in memory units. ### `validate_structure` `validate_structure(self)` Validates the internal structure of the `IntTuple`. Ensures that the actual size of the underlying data matches the computed size based on the tuple's structure. This helps detect memory corruption or implementation errors. Aborts execution with an error message if validation fails. ### `__len__` `__len__(self) -> Int` Returns the number of elements in the `IntTuple`. This is the logical length of the tuple, not its memory size. **Returns:** The number of elements in the tuple. ### `__iter__` `__iter__(self) -> _IntTupleIter[self, origin]` Returns an iterator over the elements of the `IntTuple`. 
This enables iteration through the tuple using for-loops. **Returns:** An iterator object for this `IntTuple`. ### `is_value` `is_value(self) -> Bool` Determines if this `IntTuple` represents a single value rather than a tuple. **Returns:** True if this `IntTuple` contains exactly one element that is a value, False otherwise. `is_value(self, i: Int) -> Bool` Determines if the element at the specified index is a value rather than a tuple. Notes: If index validation is enabled and the index is out of bounds, aborts with an error message. **Args:** * ​i (`Int`): The index of the element to check. **Returns:** True if the element at index i is a value, False if it's a tuple. ### `is_tuple` `is_tuple(self) -> Bool` Determines if this `IntTuple` represents a tuple rather than a single value. **Returns:** True if this `IntTuple` is a tuple (not a single value), False otherwise. `is_tuple(self, i: Int) -> Bool` Determines if the element at the specified index is a tuple rather than a value. Notes: This is the complement of is\_value(i). **Args:** * ​i (`Int`): The index of the element to check. **Returns:** True if the element at index i is a tuple, False if it's a value. ### `value` `value(self) -> Int` Retrieves the value of this `IntTuple` if it represents a single value. This method should only be called if `is_value()` returns True. **Returns:** The integer value stored in this `IntTuple`. `value(self, i: Int) -> Int` Retrieves the value of the element at the specified index. This method should only be called if `is_value(i)` returns True. Notes: If the element is not a value, the behavior is undefined. **Args:** * ​i (`Int`): The index of the element to retrieve. **Returns:** The integer value stored at the specified index. ### `tuple` `tuple(ref self) -> ref [self] Self` Returns a reference to this `IntTuple` as a tuple. Notes: This method is used to access the current `IntTuple` as a tuple without creating a copy of the data. **Returns:** A reference to this `IntTuple` to avoid unnecessary copying. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes a string representation of this `IntTuple` to the provided writer. Notes: For single values, writes just the value. For tuples, writes a comma-separated list of elements enclosed in parentheses. **Parameters:** * ​W (`Writer`): A type that conforms to the Writer trait. **Args:** * ​writer (`W`): The writer to output the string representation to. ### `__str__` `__str__(self) -> String` Returns a string representation of this `IntTuple`. **Returns:** A string representation of the `IntTuple`, using the `write_to` method. ### `is_equal` `static is_equal(a: IntTuple[origin], b: IntTuple[origin]) -> Bool` Compares two `IntTuple`s for equality. Notes: Handles nested tuples and special cases where a single-element tuple is equivalent to its contained value. **Args:** * ​a (`IntTuple[origin]`): The first `IntTuple` to compare. * ​b (`IntTuple[origin]`): The second `IntTuple` to compare. **Returns:** True if the `IntTuple`s are equal in structure and values, False otherwise. ### `__repr__` `__repr__(self) -> String` Returns a string representation of this `IntTuple` for debugging. **Returns:** A string representation of the `IntTuple`, same as `__str__`. ### `__int__` `__int__(self) -> Int` Converts this `IntTuple` to an integer. This method should only be called if `is_value()` returns True. Notes: If the `IntTuple` is not a single value, the behavior is undefined. **Returns:** The integer value stored in this `IntTuple`. 
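Before moving on to the free functions, a brief sketch that exercises several of the inspection methods together; the outputs in the comments are what the documented semantics suggest.

```mojo
from layout import IntTuple

var t = IntTuple(2, IntTuple(3, 4))

print(len(t))            # 2: top-level elements
print(t.count_values())  # 3: leaf integer values
print(t.flatten())       # (2, 3, 4)
print(t.is_tuple(1))     # True: the second element is a sub-tuple
print(t.value(0))        # 2: safe because is_value(0) is True

for elem in t:
    print(elem)          # 2, then (3, 4)
```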
--- ## abs `abs(t: IntTuple[origin]) -> IntTuple` Compute the absolute value of each element in an `IntTuple`. This function applies the absolute value operation to each integer in a potentially nested `IntTuple` structure. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to transform. **Returns:** A new `IntTuple` with the same structure but with absolute values. --- ## apply `apply[: origin.set, //, func: fn(Int) capturing -> Int](t: IntTuple[origin]) -> IntTuple` Apply a function to each integer value in an `IntTuple`. This function recursively applies the given function to each integer value in a potentially nested `IntTuple` structure, preserving the structure. **Parameters:** * ​func (`fn(Int) capturing -> Int`): Function to apply to each integer value. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to transform. **Returns:** A new `IntTuple` with the same structure but with each integer value transformed by the function. --- ## apply_predicate `apply_predicate[predicate: fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> Bool](a: IntTuple[origin], b: IntTuple[origin]) -> Bool` Apply a predicate function recursively to two `IntTuple`s. This function traverses two `IntTuple`s with the same structure and applies a predicate function to corresponding elements. The predicate is applied only to the leaf nodes (integer values). Note: If the structures of the two `IntTuple`s don't match (different nesting or length), the function returns False without applying the predicate. **Parameters:** * ​predicate (`fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> Bool`): A function that takes two `IntTuple`s (containing integer values) and returns a boolean result. **Args:** * ​a (`IntTuple[origin]`): First `IntTuple` to compare. * ​b (`IntTuple[origin]`): Second `IntTuple` to compare. **Returns:** True if the predicate returns True for all corresponding elements and the structures match, False otherwise. --- ## apply_zip `apply_zip[func: fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> IntTuple](t1: IntTuple[origin], t2: IntTuple[origin]) -> IntTuple` Apply a function to pairs of elements from two `IntTuple`s. This function zips two `IntTuple`s together and applies the given function to each pair of elements, creating a new `IntTuple` with the results. **Parameters:** * ​func (`fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> IntTuple`): Function that takes two `IntTuple`s and returns an `IntTuple`. **Args:** * ​t1 (`IntTuple[origin]`): First `IntTuple`. * ​t2 (`IntTuple[origin]`): Second `IntTuple`. **Returns:** A new `IntTuple` containing the results of applying func to each pair. `apply_zip[: origin.set, //, func: fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) capturing -> IntTuple](t1: IntTuple[origin], t2: IntTuple[origin]) -> IntTuple` Apply a capturing function to pairs of elements from two `IntTuple`s. This overload allows the function to capture variables from its environment. **Parameters:** * ​func (`fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) capturing -> IntTuple`): Capturing function that takes two `IntTuple`s and returns an `IntTuple`. **Args:** * ​t1 (`IntTuple[origin]`): First `IntTuple`. * ​t2 (`IntTuple[origin]`): Second `IntTuple`. **Returns:** A new `IntTuple` containing the results of applying func to each pair. 
`apply_zip[func: fn[ImmutableOrigin, ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1], IntTuple[$2]) -> IntTuple](t1: IntTuple[origin], t2: IntTuple[origin], t3: IntTuple[origin]) -> IntTuple` Apply a function to triplets of elements from three `IntTuple`s. This function zips three `IntTuple`s together and applies the given function to each triplet of elements, creating a new `IntTuple` with the results. **Parameters:** * ​func (`fn[ImmutableOrigin, ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1], IntTuple[$2]) -> IntTuple`): Function that takes three `IntTuple`s and returns an `IntTuple`. **Args:** * ​t1 (`IntTuple[origin]`): First `IntTuple`. * ​t2 (`IntTuple[origin]`): Second `IntTuple`. * ​t3 (`IntTuple[origin]`): Third `IntTuple`. **Returns:** A new `IntTuple` containing the results of applying func to each triplet. `apply_zip[: origin.set, //, func: fn[ImmutableOrigin, ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1], IntTuple[$2]) capturing -> IntTuple](t1: IntTuple[origin], t2: IntTuple[origin], t3: IntTuple[origin]) -> IntTuple` Apply a capturing function to triplets of elements from three `IntTuple`s. This overload allows the function to capture variables from its environment. **Parameters:** * ​func (`fn[ImmutableOrigin, ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1], IntTuple[$2]) capturing -> IntTuple`): Capturing function that takes three `IntTuple`s and returns an `IntTuple`. **Args:** * ​t1 (`IntTuple[origin]`): First `IntTuple`. * ​t2 (`IntTuple[origin]`): Second `IntTuple`. * ​t3 (`IntTuple[origin]`): Third `IntTuple`. **Returns:** A new `IntTuple` containing the results of applying func to each triplet. --- ## compact_order `compact_order(shape: IntTuple[origin], order: IntTuple[origin]) -> IntTuple` Create a compact stride based on shape and order. This function generates a stride tuple where lower order numbers imply faster varying strides. The resulting shape and stride form a bijective layout. Performance: * Always inlined for optimal performance in tight loops. * Flattens inputs and re-nests results for consistent behavior. Example: ```mojo from layout import IntTuple from layout.int_tuple import compact_order # Create a compact layout with dimensions (2,3,4,5) and ordering (1,4,3,5) var x = compact_order(IntTuple(2,3,4,5), IntTuple(1,4,3,5)) # returns (1,8,2,24) # Create a compact layout with nested dimensions and corresponding ordering var y = compact_order(IntTuple(2,IntTuple(3,4),5), IntTuple(1,IntTuple(2,3),4)) # returns (1,(2,6),24) ``` . **Args:** * ​shape (`IntTuple[origin]`): The shape tuple defining dimensions. * ​order (`IntTuple[origin]`): The order tuple defining the relative ordering of dimensions. **Returns:** A stride tuple that creates a compact memory layout according to the specified order. --- ## compatible `compatible(a: IntTuple[origin], b: IntTuple[origin]) -> Bool` Test if two shapes are compatible for tensor operations. This function checks if shape A is compatible with shape B, meaning: 1. The total size of A and B are the same 2. Any coordinate into A can also be used as a coordinate into B Compatible can also be thought of as a partial order on A and B: A <= B. **Args:** * ​a (`IntTuple[origin]`): The first `IntTuple` to compare. * ​b (`IntTuple[origin]`): The second `IntTuple` to compare. **Returns:** True if shape A is compatible with shape B, False otherwise. --- ## congruent `congruent(a: IntTuple[origin], b: IntTuple[origin]) -> Bool` Test if two `IntTuple`s have the same hierarchical structure.
This function checks if two `IntTuple`s have identical nesting patterns, regardless of the actual integer values they contain. **Args:** * ​a (`IntTuple[origin]`): First `IntTuple` to compare. * ​b (`IntTuple[origin]`): Second `IntTuple` to compare. **Returns:** True if both `IntTuple`s have the same hierarchical structure, False otherwise. --- ## crd2idx `crd2idx(crd: IntTuple[origin], shape: IntTuple[origin]) -> Int` Map a logical coordinate to a linear index. This function converts a multi-dimensional coordinate to a linear index based on the shape. It uses default strides computed from the shape. **Args:** * ​crd (`IntTuple[origin]`): The coordinate tuple to convert. * ​shape (`IntTuple[origin]`): The shape of the tensor/array. **Returns:** The linear index corresponding to the coordinate. `crd2idx(crd: IntTuple[origin], shape: IntTuple[origin], _stride: IntTuple[origin]) -> Int` Map a logical coordinate to a linear index with custom strides. This function converts a multi-dimensional coordinate to a linear index based on the shape and stride information. If no stride is provided, it computes default strides from the shape. The function handles various input combinations: * Tuple coordinates with tuple shapes and strides * Single integer coordinate with tuple shapes and strides * Single integer coordinate with single integer shape and stride Aborts: * If coordinate and shape dimensions don't match. * If shape and stride dimensions don't match. * If input type combinations are invalid. **Args:** * ​crd (`IntTuple[origin]`): The coordinate(s) to convert, can be a single value or a tuple of coordinates. * ​shape (`IntTuple[origin]`): The shape of the tensor/array, can be a single value or a tuple of dimensions. * ​\_stride (`IntTuple[origin]`): Optional custom strides; if empty, strides are computed from the shape using `prefix_product`. **Returns:** The linear index corresponding to the coordinate. --- ## depth `depth(src: IntTuple[origin]) -> Int` Calculates the maximum nesting depth of an `IntTuple`. This function recursively traverses the `IntTuple` structure to determine its maximum nesting depth. A scalar value has depth 0, a flat tuple has depth 1, and nested tuples increase the depth accordingly. Example: ```mojo from layout import IntTuple from layout.int_tuple import depth print(depth(IntTuple(1))) # prints 0 print(depth(IntTuple(1, 2))) # prints 1 print(depth(IntTuple(IntTuple(1, 2)))) # prints 2 ``` . **Args:** * ​src (`IntTuple[origin]`): The `IntTuple` to measure the depth of. **Returns:** An integer representing the maximum nesting depth. --- ## fill_like `fill_like(src: IntTuple[origin], val: Int) -> IntTuple` Creates an `IntTuple` with the same structure as the source but filled with a specified value. This function recursively traverses the source `IntTuple` and creates a new `IntTuple` with identical structure, but with all leaf values replaced by the specified value. **Args:** * ​src (`IntTuple[origin]`): The source `IntTuple` whose structure will be copied. * ​val (`Int`): The integer value to fill the new `IntTuple` with. **Returns:** A new `IntTuple` with the same structure as src but filled with val. --- ## flatten `flatten(t: IntTuple[origin]) -> IntTuple` Flatten a nested `IntTuple` into a single-level `IntTuple`. This function converts a hierarchical `IntTuple` structure into a flat sequence of integer values, preserving the order of elements. **Args:** * ​t (`IntTuple[origin]`): The nested `IntTuple` to flatten. **Returns:** A new `IntTuple` containing all integer values in a flat structure.
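The structural helpers `depth`, `fill_like`, and `flatten` compose naturally. A minimal sketch:

```mojo
from layout import IntTuple
from layout.int_tuple import depth, fill_like, flatten

var shape = IntTuple(2, IntTuple(3, 4))

print(depth(shape))         # 2: one level of nesting below the top
print(flatten(shape))       # (2, 3, 4)
print(fill_like(shape, 0))  # (0, (0, 0)): same structure, zero-filled
```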
--- ## idx2crd `idx2crd(idx: IntTuple[origin], shape: IntTuple[origin]) -> IntTuple` Converts a linear index to a coordinate tuple within a given shape. This function splits an index into a coordinate within a Shape via a colexicographical enumeration of coordinates in Shape. **Args:** * ​idx (`IntTuple[origin]`): The linear index to convert. * ​shape (`IntTuple[origin]`): The shape of the tensor/array. **Returns:** A new `IntTuple` containing the coordinates corresponding to the linear index. `idx2crd(idx: IntTuple[origin], shape: IntTuple[origin], _stride: IntTuple[origin]) -> IntTuple` Converts a linear index to a coordinate tuple within a given shape using custom strides. **Args:** * ​idx (`IntTuple[origin]`): The linear index to convert. * ​shape (`IntTuple[origin]`): The shape of the tensor/array. * ​\_stride (`IntTuple[origin]`): Custom strides to use for the conversion. **Returns:** A new `IntTuple` containing the coordinates corresponding to the linear index. --- ## idx2crd2 `idx2crd2(idx: IntTuple[origin], shape: IntTuple[origin], _stride: IntTuple[origin]) -> IntTuple` Convert a linear index to coordinates. This function handles the actual conversion logic for different input combinations. Notes: * Handles four cases: tuple-tuple-tuple, tuple-int-int, int-tuple-tuple, and int-int-int. * When input shapes don't match, `abort()` will be called. **Args:** * ​idx (`IntTuple[origin]`): The linear index to convert. * ​shape (`IntTuple[origin]`): The shape of the tensor/array. * ​\_stride (`IntTuple[origin]`): Custom strides to use for the conversion. If empty, strides are computed from the shape using prefix\_product. **Returns:** A new IntTuple containing the coordinates corresponding to the linear index. --- ## int_tuple Hierarchical integer tuple data structures for high-performance tensor operations. This module provides a flexible, memory-efficient implementation of nested integer tuples optimized for tensor shape, stride, and index operations in high-performance computing. The core data structures support both flat and hierarchical representations with efficient memory sharing and zero-copy views. 
Key components: * `IntArray`: Low-level register-passable array with direct memory management * `IntTuple`: Hierarchical nested tuple with efficient memory layout and operations * Utility functions for tensor shape manipulation, coordinate transformations, and layout operations Performance features: * Register-passable data structures for optimal compiler optimizations * Zero-copy views for efficient memory sharing * Specialized memory layout for nested structures * Optimized algorithms for common tensor operations Common operations: * Shape manipulation: `flatten`, `to_nest`, `apply`, `product`, `sum` * Coordinate transformations: `idx2crd`, `crd2idx` * Layout operations: `compact_order`, `prefix_product` * Structural comparisons: `congruent`, `compatible`, `weakly_congruent` Example usage: ```mojo from layout import IntTuple from layout.int_tuple import flatten, compact_order, size # Create nested tuples var shape = IntTuple(2, IntTuple(3, 4), 5) # Represents shape (2, (3, 4), 5) # Flatten a nested tuple var flat = flatten(shape) # Results in (2, 3, 4, 5) # Create compact strides for a given shape and order var order = IntTuple(1, IntTuple(2, 3), 4) var strides = compact_order(shape, order) # Results in (1, (2, 6), 24) # Calculate total size (product of all elements) var total_size = size(shape) # Results in 120 ``` ## Aliases ### `INT_TUPLE_VALIDATION` `alias INT_TUPLE_VALIDATION = False` ### `IntList` `alias IntList = List[Int, True]` A type alias for a List of integers with ownership. This alias defines a List that contains Int values and has ownership of its data. It's used throughout the module for storing and manipulating collections of integers, particularly for operations like permutations and indices. ### `UNKNOWN_VALUE` `alias UNKNOWN_VALUE = -1` Special value indicating an unknown or unspecified dimension. This constant is used throughout the `IntTuple` system to represent dimensions that are not known at compile time or have not been specified. ## Structs * [​`IntArray`](./IntArray): A memory-efficient, register-passable array of integers. * [​`IntTuple`](./IntTuple): A hierarchical, nested tuple of integers with efficient memory management. ## Functions * [​`abs`](./abs): Compute the absolute value of each element in an `IntTuple`. * [​`apply`](./apply): Apply a function to each integer value in an `IntTuple`. * [​`apply_predicate`](./apply_predicate): Apply a predicate function recursively to two `IntTuple`s. * [​`apply_zip`](./apply_zip): Apply a function to pairs of elements from two `IntTuple`s. * [​`compact_order`](./compact_order): Create a compact stride based on shape and order. * [​`compatible`](./compatible): Test if two shapes are compatible for tensor operations. * [​`congruent`](./congruent): Test if two `IntTuple`s have the same hierarchical structure. * [​`crd2idx`](./crd2idx): Map a logical coordinate to a linear index. * [​`depth`](./depth): Calculates the maximum nesting depth of an `IntTuple`. * [​`fill_like`](./fill_like): Creates an `IntTuple` with the same structure as the source but filled with a specified value. * [​`flatten`](./flatten): Flatten a nested `IntTuple` into a single-level `IntTuple`. * [​`idx2crd`](./idx2crd): Converts a linear index to a coordinate tuple within a given shape. * [​`idx2crd2`](./idx2crd2): Convert a linear index to coordinates. * [​`inner_product`](./inner_product): Compute the inner product of two `IntTuple`s. * [​`is_flat`](./is_flat): Check if an `IntTuple` is flat. 
* [​`is_int`](./is_int): Check if an `IntTuple` represents a single integer value. * [​`is_tuple`](./is_tuple): Check if an `IntTuple` represents a nested tuple. * [​`mul`](./mul): Multiply each element in an `IntTuple` by a scalar value. * [​`prefix_product`](./prefix_product): Compute the exclusive prefix product of an `IntTuple`. * [​`product`](./product): Calculate the product of all values in an `IntTuple`. * [​`product_each`](./product_each): Compute the product of elements in each sub-tuple of an `IntTuple`. * [​`propagate_unknown`](./propagate_unknown): Propagates unknown dimensions from the target `IntTuple` to the source `IntTuple`. * [​`reduce`](./reduce): Apply a reduction function to an `IntTuple` with an initial value. * [​`reverse`](./reverse): Reverses the order of elements in an `IntTuple`, recursively. * [​`shallow_apply`](./shallow_apply): Apply a function to each top-level element of an `IntTuple`. * [​`shape_div`](./shape_div): Performs division operation between shape tuples. * [​`signum`](./signum): Calculate the sign of an integer. * [​`size`](./size): Calculate the total size (product of all elements) of an `IntTuple`. * [​`sorted`](./sorted): Sort an IntTuple using the provided comparison function. * [​`sum`](./sum): Calculate the sum of all values in an `IntTuple`. * [​`to_nest`](./to_nest): Nests a flat `IntTuple` according to the structure of a nested `IntTuple`. * [​`to_unknown`](./to_unknown): Create an `IntTuple` with the same structure but filled with `UNKNOWN_VALUE`. * [​`tuple_max`](./tuple_max): Calculate the maximum value in an `IntTuple`. * [​`tuple_min`](./tuple_min): Compute the element-wise minimum of two `IntTuple`s. * [​`weakly_compatible`](./weakly_compatible): Test if shape A is weakly compatible with shape B. * [​`weakly_congruent`](./weakly_congruent): Test if two IntTuples have similar hierarchical structures. * [​`zip`](./zip): Create a zip iterator from an array of `IntTuple` pointers. --- ## inner_product `inner_product(a: IntTuple[origin], b: IntTuple[origin]) -> Int` Compute the inner product of two `IntTuple`s. For flat tuples, this is the sum of element-wise products. For nested tuples, the function recurses into corresponding nested elements. Note: If the input tuples have different lengths, `abort()` will be called. **Args:** * ​a (`IntTuple[origin]`): First `IntTuple`. * ​b (`IntTuple[origin]`): Second `IntTuple`. **Returns:** The inner product as an `Int`. --- ## is_flat `is_flat(t: IntTuple[origin]) -> Bool` Check if an `IntTuple` is flat. This function checks if the `IntTuple` is flat, meaning it has no nested elements. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to check. **Returns:** True if the `IntTuple` is flat, False otherwise. --- ## is_int `is_int(t: IntTuple[origin]) -> Bool` Check if an `IntTuple` represents a single integer value. This function determines whether the given `IntTuple` contains a single integer value rather than a nested tuple structure. Example: ```mojo from layout.int_tuple import is_int, IntTuple var single_value = IntTuple(5) var nested_tuple = IntTuple(1, 2, 3) var result1 = is_int(single_value) # Returns True var result2 = is_int(nested_tuple) # Returns False ``` . **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to check. **Returns:** True if the `IntTuple` contains a single integer value, False if it's a nested tuple. --- ## is_tuple `is_tuple(t: IntTuple[origin]) -> Bool` Check if an `IntTuple` represents a nested tuple. 
This function determines whether the given `IntTuple` contains nested elements rather than a single integer value. It is the complement of the `is_int` function. Example: ```mojo from layout.int_tuple import is_tuple, IntTuple var single_value = IntTuple(5) var nested_tuple = IntTuple(1, 2, 3) var result1 = is_tuple(single_value) # Returns False var result2 = is_tuple(nested_tuple) # Returns True ``` . **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to check. **Returns:** True if the `IntTuple` contains nested elements, False if it's a single integer value. --- ## mul `mul(lhs: IntTuple[origin], rhs: Int) -> IntTuple` Multiply each element in an `IntTuple` by a scalar value. This function creates a new `IntTuple` where each element (at any nesting level) is multiplied by the provided integer value. **Args:** * ​lhs (`IntTuple[origin]`): The `IntTuple` whose elements will be multiplied. * ​rhs (`Int`): The scalar integer to multiply each element by. **Returns:** A new `IntTuple` with the same structure as the input but with all elements multiplied by the scalar value. --- ## prefix_product `prefix_product(a: IntTuple[origin]) -> IntTuple` Compute the exclusive prefix product of an `IntTuple`. This is a convenience wrapper that initializes the prefix product with 1. **Args:** * ​a (`IntTuple[origin]`): The input `IntTuple` to compute the prefix product for. **Returns:** A new `IntTuple` containing the exclusive prefix product of the input. `prefix_product(a: IntTuple[origin], init: Int) -> IntTuple` Compute the exclusive prefix product of an `IntTuple` with an initial value. This function delegates to the implementation in prefix\_product2. **Args:** * ​a (`IntTuple[origin]`): The input `IntTuple` to compute the prefix product for. * ​init (`Int`): The initial value(s) for the prefix product, defaults to 1. **Returns:** A new `IntTuple` containing the exclusive prefix product of the input. --- ## product `product(t: IntTuple[origin]) -> Int` Calculate the product of all values in an `IntTuple`. This function recursively computes the product of all integer values in a potentially nested `IntTuple` structure. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to multiply. **Returns:** The product of all integer values, or `UNKNOWN_VALUE` if any value in the tuple is `UNKNOWN_VALUE`. --- ## product_each `product_each(t: IntTuple[origin]) -> IntTuple` Compute the product of elements in each sub-tuple of an `IntTuple`. For each immediate child of the input tuple, this function computes the product of all elements within that child. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` containing sub-tuples. **Returns:** A new `IntTuple` where each element is the product of the corresponding sub-tuple in the input. --- ## propagate_unknown `propagate_unknown(src: IntTuple[origin], target: IntTuple[origin]) -> IntTuple` Propagates unknown dimensions from the target `IntTuple` to the source `IntTuple`. This function creates a new `IntTuple` by combining the source and target `IntTuple`s, preserving unknown dimensions (UNKNOWN\_VALUE) from the target while using values from the source for known dimensions. **Args:** * ​src (`IntTuple[origin]`): The source `IntTuple` containing known dimension values. * ​target (`IntTuple[origin]`): The target `IntTuple` that may contain unknown dimensions (UNKNOWN\_VALUE). **Returns:** A new `IntTuple` with unknown dimensions from target and known dimensions from src. 
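A small sketch tying `prefix_product`, `product`, and `propagate_unknown` together, assuming `UNKNOWN_VALUE` prints as its underlying `-1`:

```mojo
from layout import IntTuple
from layout.int_tuple import UNKNOWN_VALUE, prefix_product, product, propagate_unknown

var shape = IntTuple(3, 4, 5)

print(prefix_product(shape))  # (1, 3, 12): exclusive running product
print(product(shape))         # 60

# Unknown dimensions in the target survive; known values come from the source.
var target = IntTuple(3, UNKNOWN_VALUE, 5)
print(propagate_unknown(shape, target))  # (3, -1, 5)
```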
--- ## reduce `reduce[: origin.set, //, reducer: fn[ImmutableOrigin](a: Int, b: IntTuple[$0]) capturing -> Int](t: IntTuple[origin], initializer: Int) -> Int` Apply a reduction function to an `IntTuple` with an initial value. This function iterates through each element of the `IntTuple` and applies the provided reduction function cumulatively, starting with the initializer. **Parameters:** * ​reducer (`fn[ImmutableOrigin](a: Int, b: IntTuple[$0]) capturing -> Int`): A function that combines the accumulated result with the next element. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to reduce. * ​initializer (`Int`): The initial value for the reduction operation. **Returns:** The final accumulated result after applying the reduction function to all elements in the `IntTuple`. --- ## reverse `reverse(src: IntTuple[origin]) -> IntTuple` Reverses the order of elements in an `IntTuple`, recursively. This function reverses the top-level elements of the `IntTuple` and recursively reverses any nested `IntTuple`s. Example: ```mojo from layout.int_tuple import IntTuple, reverse var t = IntTuple(1, 2, IntTuple(3, 4)) var reversed = reverse(t) # returns ((4, 3), 2, 1) ``` . **Args:** * ​src (`IntTuple[origin]`): The source `IntTuple` to reverse. **Returns:** A new `IntTuple` with elements in reversed order. --- ## shallow_apply `shallow_apply[func: fn[ImmutableOrigin](IntTuple[$0]) -> Int](t: IntTuple[origin]) -> IntTuple` Apply a function to each top-level element of an `IntTuple`. Unlike `apply()`, this function only operates on the immediate children of the input tuple without recursing into nested tuples. **Parameters:** * ​func (`fn[ImmutableOrigin](IntTuple[$0]) -> Int`): Function that takes an `IntTuple` and returns an `Int`. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` whose elements will be transformed. **Returns:** A new `IntTuple` with the function applied to each top-level element. --- ## shape_div `shape_div(a: IntTuple[origin], b: IntTuple[origin]) -> IntTuple` Performs division operation between shape tuples. Handles four cases: 1. tuple-tuple: Performs shape\_div element-wise when dimensions match 2. tuple-int: Folds the division of b across each element of a Example: `shape_div((4,5,6),40)` -> `shape_div((1,5,6),10)` -> `shape_div((1,1,6),2)` -> `(1,1,3)` 3. int-tuple: Returns `shape_div(a, product(b))` 4. int-int: Enforces the divisibility condition `a % b == 0 || b % a == 0` when possible. Returns `a / b` with rounding away from `0` (that is, `1` or `-1` when `a < b`). **Args:** * ​a (`IntTuple[origin]`): The dividend `IntTuple`. * ​b (`IntTuple[origin]`): The divisor `IntTuple`. **Returns:** A new `IntTuple` containing the result of the division operation. --- ## signum `signum(a: Int) -> Int` Calculate the sign of an integer. This function determines the sign of the input integer and returns a corresponding indicator value. Example: ```mojo from layout.int_tuple import signum var result1 = signum(5) # Returns 1 var result2 = signum(-10) # Returns -1 var result3 = signum(0) # Returns 0 ``` . **Args:** * ​a (`Int`): The integer value to determine the sign of. **Returns:** 1 if `a` > 0, -1 if `a` < 0, and 0 if `a` == 0. --- ## size `size(a: IntTuple[origin]) -> Int` Calculate the total size (product of all elements) of an `IntTuple`. This function computes the product of all integer values in the `IntTuple`, regardless of nesting level. **Args:** * ​a (`IntTuple[origin]`): The `IntTuple` whose elements will be multiplied together. **Returns:** The product of all elements in the `IntTuple`.
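The folding rule of `shape_div` and the `size` helper can be seen end to end in a sketch that mirrors the worked example above:

```mojo
from layout import IntTuple
from layout.int_tuple import shape_div, size

var shape = IntTuple(4, 5, 6)

print(size(shape))                     # 120: product of all elements
print(shape_div(shape, IntTuple(40)))  # (1, 1, 3): divisor folded left to right
```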
--- ## sorted `sorted[cmp: fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> Bool = __lt__](tuple: IntTuple[origin]) -> IntTuple` Sort an IntTuple using the provided comparison function. This function implements a merge sort algorithm to efficiently sort the elements of an IntTuple. The sorting is stable and has `O(n log n)` time complexity. **Parameters:** * ​cmp (`fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> Bool`): A comparison function that takes two `IntTuple` elements and returns True if the first should come before the second. Defaults to the `__lt__` function, which performs lexicographical ordering. **Args:** * ​tuple (`IntTuple[origin]`): The `IntTuple` to be sorted. **Returns:** A new `IntTuple` containing the same elements as the input but sorted according to the comparison function. --- ## sum `sum(t: IntTuple[origin]) -> Int` Calculate the sum of all values in an `IntTuple`. This function recursively computes the sum of all integer values in a potentially nested `IntTuple` structure. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to sum. **Returns:** The sum of all integer values, or `UNKNOWN_VALUE` if any value in the tuple is `UNKNOWN_VALUE`. --- ## to_nest `to_nest(nested: IntTuple[origin], flat: IntTuple[origin]) -> IntTuple` Nests a flat `IntTuple` according to the structure of a nested `IntTuple`. This function reshapes a flat sequence of values into a hierarchical structure that matches the pattern of a template nested `IntTuple`. Example: ```mojo from layout import IntTuple from layout.int_tuple import to_nest var result = to_nest(IntTuple(2, IntTuple(3, 4), 5), IntTuple(1, 2, 3, 4)) # returns IntTuple(1, (2, 3), 4) ``` . **Args:** * ​nested (`IntTuple[origin]`): The template `IntTuple` defining the desired structure. * ​flat (`IntTuple[origin]`): The flat `IntTuple` containing the values to be nested. **Returns:** A new `IntTuple` with the values from flat arranged in the structure of nested. --- ## to_unknown `to_unknown(t: IntTuple[origin]) -> IntTuple` Create an `IntTuple` with the same structure but filled with `UNKNOWN_VALUE`. This function preserves the hierarchical structure of the input `IntTuple` but replaces all integer values with `UNKNOWN_VALUE`. **Args:** * ​t (`IntTuple[origin]`): The template `IntTuple` defining the structure. **Returns:** A new `IntTuple` with the same structure as t but with all values replaced by `UNKNOWN_VALUE`. --- ## tuple_max `tuple_max(t: IntTuple[origin]) -> Int` Calculate the maximum value in an `IntTuple`. This function recursively finds the maximum integer value in a potentially nested `IntTuple` structure. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to search. **Returns:** The maximum integer value found in the tuple. --- ## tuple_min `tuple_min(a: IntTuple[origin], b: IntTuple[origin]) -> IntTuple` Compute the element-wise minimum of two `IntTuple`s. This function compares corresponding elements of two `IntTuple`s and returns a new `IntTuple` containing the minimum value at each position. Aborts: If the input tuples have different lengths. Note: If either input contains `UNKNOWN_VALUE`, the result will be `UNKNOWN_VALUE`. **Args:** * ​a (`IntTuple[origin]`): First `IntTuple`. * ​b (`IntTuple[origin]`): Second `IntTuple`. **Returns:** A new `IntTuple` with each element being the minimum of the corresponding elements in a and b.
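A minimal sketch of the two reduction helpers above on flat tuples:

```mojo
from layout import IntTuple
from layout.int_tuple import tuple_max, tuple_min

var a = IntTuple(2, 7, 1)
var b = IntTuple(5, 3, 4)

print(tuple_max(a))     # 7: largest leaf value
print(tuple_min(a, b))  # (2, 3, 1): element-wise minimum
```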
--- ## weakly_compatible `weakly_compatible(a: IntTuple[origin], b: IntTuple[origin]) -> Bool` Test if shape A is weakly compatible with shape B. A shape A is weakly compatible with shape B if there exists a shape C congruent to A such that compatible(elem\_scale(A,C), B). This establishes a partial order relation between shapes where A <= B. **Args:** * ​a (`IntTuple[origin]`): The first `IntTuple` to compare. * ​b (`IntTuple[origin]`): The second `IntTuple` to compare. **Returns:** True if shape A is weakly compatible with shape B, False otherwise. --- ## weakly_congruent `weakly_congruent(a: IntTuple[origin], b: IntTuple[origin]) -> Bool` Test if two IntTuples have similar hierarchical structures. This function establishes a partial order relation between IntTuples based on their hierarchical structure. It's less strict than congruent. **Args:** * ​a (`IntTuple[origin]`): First IntTuple to compare. * ​b (`IntTuple[origin]`): Second IntTuple to compare. **Returns:** True if a's structure is compatible with b's structure, False otherwise. --- ## zip `zip[origin: ImmutableOrigin, n: Int](ts: InlineArray[Pointer[IntTuple, origin], n]) -> _zip[origin, n]` Create a zip iterator from an array of `IntTuple` pointers. This function creates a zip iterator that allows simultaneous traversal of multiple `IntTuple` collections. **Parameters:** * ​origin (`ImmutableOrigin`): The origin tracking parameter for memory safety. * ​n (`Int`): The number of `IntTuple` collections being zipped together. **Args:** * ​ts (`InlineArray[Pointer[IntTuple, origin], n]`): Array of pointers to the `IntTuple` collections to zip. **Returns:** A `_zip` object that can be iterated over. `zip(a: IntTuple[origin], b: IntTuple[origin], out result: _zip[{a, b}, 2])` Create a zip iterator for two `IntTuple`s. This function creates a zip iterator that allows simultaneous traversal of two `IntTuple`s, yielding pairs of corresponding elements. **Args:** * ​a (`IntTuple[origin]`): First `IntTuple` to zip. * ​b (`IntTuple[origin]`): Second `IntTuple` to zip. **Returns:** The resulting zip iterator for the input `IntTuple`s. `zip(a: IntTuple[origin], b: IntTuple[origin], c: IntTuple[origin], out result: _zip[{a, b, c}, 3])` Create a zip iterator for three `IntTuple`s. This function creates a zip iterator that allows simultaneous traversal of three `IntTuple`s, yielding triplets of corresponding elements. **Args:** * ​a (`IntTuple[origin]`): First `IntTuple` to zip. * ​b (`IntTuple[origin]`): Second `IntTuple` to zip. * ​c (`IntTuple[origin]`): Third `IntTuple` to zip. **Returns:** The resulting zip iterator for the input `IntTuple`s. --- ## Layout `struct Layout` Represents a memory layout for multi-dimensional data. The Layout struct is the primary implementation of the LayoutTrait, providing a concrete representation of memory layouts using shape and stride information. It maps between logical coordinates and linear memory indices, enabling efficient access to multi-dimensional data. A Layout consists of: * shape: Defines the dimensions of the logical coordinate space * stride: Defines the step sizes in memory for each dimension The Layout struct supports various operations including: * Creation of row-major and column-major layouts * Conversion between coordinates and indices * Composition with other layouts * Iteration over sub-layouts Layouts can be hierarchical, with nested shapes and strides, allowing for complex memory access patterns like blocked or tiled layouts. ## Fields * ​shape (`IntTuple`): The dimensions of the layout.
This field defines the size of each dimension in the logical coordinate space. For example, a shape of (3, 4) represents a 3×4 grid of elements. * ​stride (`IntTuple`): The memory step sizes for each dimension. This field defines how many elements to skip in memory when moving one unit in each dimension. For example, in a row-major 3×4 layout, the strides might be (4, 1), meaning moving one unit in the first dimension requires skipping 4 elements in memory, while moving one unit in the second dimension requires skipping 1 element. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `EqualityComparable`, `LayoutTrait`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `has_shape` `alias has_shape = True` Indicates whether the layout has a valid shape. ## Methods ### `__init__` `__init__(out self)` Initializes an empty layout with no dimensions. Creates a layout with empty shape and stride tuples, which can be populated later using append operations. `@implicit` `__init__(out self, shape: IntTuple[origin])` Initializes a layout with the given shape and column-major strides. Creates a layout with the specified shape and automatically calculates column-major strides (where the first dimension varies fastest in memory). **Args:** * ​shape (`IntTuple[origin]`): The dimensions of the layout. `__init__(out self, shape: IntTuple[origin], stride: IntTuple[origin])` Initializes a layout with the given shape and stride. Creates a layout with explicitly specified shape and stride values. If an empty stride is provided, column-major strides are calculated. **Args:** * ​shape (`IntTuple[origin]`): The dimensions of the layout. * ​stride (`IntTuple[origin]`): The memory step size for each dimension, or empty for column-major. `__init__(out self, *, other: Self)` Explicitly constructs a deep copy of the provided layout. **Args:** * ​other (`Self`): The layout to copy. ### `__getitem__` `__getitem__(self, index: Int) -> Self` Returns a sub-layout for the specified dimension. **Args:** * ​index (`Int`): The dimension index to extract. **Returns:** A Layout containing the shape and stride for the specified dimension. ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if this layout is equal to another layout. Two layouts are considered equal if they have identical shape and stride tuples. **Args:** * ​other (`Self`): The layout to compare with. **Returns:** True if the layouts are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if this layout is not equal to another layout. **Args:** * ​other (`Self`): The layout to compare with. **Returns:** True if the layouts are not equal, False otherwise. ### `idx2crd` `idx2crd(self, idx: IntTuple[origin]) -> IntTuple` Converts a linear index to logical coordinates. This is the inverse operation of the `__call__` method, mapping from a memory index back to the corresponding logical coordinates. **Args:** * ​idx (`IntTuple[origin]`): The linear index to convert. **Returns:** The logical coordinates corresponding to the given index. ### `col_major` `static col_major(*dims: Int) -> Self` Creates a column-major layout with the specified dimensions. In a column-major layout, the first dimension varies fastest in memory, which is the default layout in languages like Fortran and MATLAB. Example: ```mojo from layout import Layout # Create a 3x4 column-major layout var layout = Layout.col_major(3, 4) # Result: Layout with shape (3,4) and stride (1,3) ``` .
**Args:** * ​\*dims (`Int`): Variable number of dimension sizes. **Returns:** A column-major Layout with the specified dimensions. `static col_major(shape: IntTuple[origin]) -> Self` Creates a column-major layout with the specified shape. In a column-major layout, the first dimension varies fastest in memory, which is the default layout in languages like Fortran and MATLAB. Example: ```mojo from layout import Layout from layout.int_tuple import IntTuple # Create a 3x4 column-major layout var layout = Layout.col_major(IntTuple(3, 4)) # Result: Layout with shape (3,4) and stride (1,3) ``` . **Args:** * ​shape (`IntTuple[origin]`): An IntTuple specifying the dimensions. **Returns:** A column-major Layout with the specified shape. ### `row_major` `static row_major(*dims: Int) -> Self` Creates a row-major layout with the specified dimensions. In a row-major layout, the last dimension varies fastest in memory, which is the default layout in languages like C, C++, and Python. Example: ```mojo from layout import Layout # Create a 3x4 row-major layout var layout = Layout.row_major(3, 4) # Result: Layout with shape (3,4) and stride (4,1) ``` . **Args:** * ​\*dims (`Int`): Variable number of dimension sizes. **Returns:** A row-major Layout with the specified dimensions. `static row_major[rank: Int](dims: DimList) -> Self` Creates a row-major layout from a DimList with compile-time rank. This method creates a row-major layout where the last dimension varies fastest in memory. It handles both known and unknown dimensions at compile time, properly calculating strides for each dimension. If any dimension is unknown, subsequent strides will also be marked as unknown. Example: ```mojo from layout import Layout from layout.layout import DimList # Create a row-major layout with compile-time rank var dims = DimList(3, 4) var layout = Layout.row_major[2](dims) # Result: Layout with shape (3,4) and stride (4,1) ``` . **Parameters:** * ​rank (`Int`): The compile-time rank (number of dimensions) of the layout. **Args:** * ​dims (`DimList`): A DimList containing the dimensions of the layout. **Returns:** A row-major Layout with the specified dimensions and computed strides. `static row_major[rank: Int](tuple: IndexList[rank]) -> Self` Creates a row-major layout from an IndexList with compile-time rank. This method creates a row-major layout where the last dimension varies fastest in memory, computing the stride for each dimension from the provided index list. Example: ```mojo from layout import Layout from utils import IndexList # Create a row-major layout from an IndexList var dims = IndexList[2](3, 4) var layout = Layout.row_major[2](dims) # Result: Layout with shape (3,4) and stride (4,1) ``` . **Parameters:** * ​rank (`Int`): The compile-time rank (number of dimensions) of the layout. **Args:** * ​tuple (`IndexList[rank]`): An IndexList containing the dimensions of the layout. **Returns:** A row-major Layout with the specified dimensions and computed strides. `static row_major[rank: Int]() -> Self` Creates a row-major layout with unknown values for each axis from a compile-time rank. Example: ```mojo from layout import Layout var layout = Layout.row_major[2]() # Result: Layout with shape (UNKNOWN_VALUE, UNKNOWN_VALUE) ``` **Parameters:** * ​rank (`Int`): The compile-time rank (number of dimensions) of the layout. **Returns:** A row-major Layout with the given rank.
`static row_major(shape: IntTuple[origin]) -> Self` Creates a row-major layout from an IntTuple of dimensions. In a row-major layout, the last dimension varies fastest in memory. This method computes the appropriate strides for a row-major layout given the input shape. Example: ```mojo from layout import Layout from layout.int_tuple import IntTuple # Create a row-major layout from a shape tuple var shape = IntTuple(3, 4) var layout = Layout.row_major(shape) # Result: Layout with shape (3,4) and stride (4,1) ``` . **Args:** * ​shape (`IntTuple[origin]`): An IntTuple containing the dimensions of the layout. **Returns:** A row-major Layout with the specified shape and computed strides. ### `make_shape_unknown` `make_shape_unknown[axis: Int = -1](self) -> Self` Creates a new Layout with unknown shape dimensions. This method creates a copy of the current Layout but marks either all dimensions or a specific dimension as unknown, while preserving the original strides. This is useful for tiling tensors with runtime sizes where the tile's shape is unknown but the memory layout (strides) remains constant. Example: ```mojo from layout import Layout from layout.int_tuple import IntTuple # Mark all dimensions as unknown var layout = Layout(IntTuple(2, 3)) var unknown = layout.make_shape_unknown() # Result: Layout with shape (?, ?) and original strides # Mark only first dimension as unknown var partial = layout.make_shape_unknown[0]() # Result: Layout with shape (?, 3) and original strides ``` . **Parameters:** * ​axis (`Int`): The dimension to mark as unknown. If UNKNOWN\_VALUE (default), all dimensions are marked as unknown. **Returns:** A new Layout with the specified dimension(s) marked as unknown and original strides preserved. ### `copy` `copy(self) -> Self` Explicitly constructs a copy of this layout. Creates a deep copy of the layout, including its shape and stride tuples. **Returns:** A new Layout instance with identical shape and stride values. ### `__str__` `__str__(self) -> String` Converts the layout to a string representation. **Returns:** A string representation of the layout in the format "(shape:stride)". ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the layout to the specified writer. Formats the layout as "(shape:stride)" and writes it to the provided writer. **Parameters:** * ​W (`Writer`): Type parameter representing a Writer implementation. **Args:** * ​writer (`W`): The writer to output the layout representation to. ### `__len__` `__len__(self) -> Int` Returns the number of dimensions in the layout. **Returns:** The number of elements in the shape tuple. ### `__iter__` `__iter__(self) -> _LayoutIter[self]` Returns an iterator over the layout's dimensions. Each iteration yields a Layout containing the shape and stride for one dimension. **Returns:** An iterator over the layout's dimensions. ### `size` `size(self) -> Int` Returns the total number of elements in the layout's domain. Calculates the product of all dimensions in the shape. **Returns:** The total number of elements in the layout. ### `cosize` `cosize(self) -> Int` Returns the size of the memory region spanned by the layout. Calculates the maximum memory index plus one, representing the total memory footprint required by the layout. **Returns:** The size of the memory region required by the layout. ### `rank` `rank(self) -> Int` Returns the number of dimensions in the layout. This is equivalent to `__len__` and returns the number of elements in the shape tuple.
**Returns:** The number of dimensions in the layout. ### `__call__` `__call__(self, idx: IntTuple[origin]) -> Int` Maps logical coordinates to a linear memory index. This is the core functionality of a layout, converting multi-dimensional coordinates to a linear memory location. **Args:** * ​idx (`IntTuple[origin]`): The logical coordinates to map. **Returns:** The linear memory index corresponding to the given coordinates. ### `append` `append(mut self, item: Self)` Appends another layout to this layout. This method adds the shape and stride from the provided layout to this layout, effectively increasing its dimensionality. **Args:** * ​item (`Self`): The layout to append to this layout. ### `all_dims_known` `all_dims_known(self) -> Bool` Checks if all dimensions in the layout have known values. A dimension is considered unknown if its shape or stride is set to the special `UNKNOWN_VALUE` constant. **Returns:** True if all dimensions have known shape and stride values, False otherwise. ### `known_shape` `known_shape(self) -> Bool` Checks if all shape dimensions in the layout have known values. A dimension is considered unknown if its shape is set to the special `UNKNOWN_VALUE` constant. This method only checks shapes, not strides. **Returns:** True if all shape dimensions have known values, False otherwise. --- ## LayoutTrait Defines the interface for mapping between logical coordinates and memory indices. The `LayoutTrait` provides a common interface for all layout types, including basic layouts, swizzles, and composed layouts. It enables mapping from multi-dimensional logical coordinates to linear memory indices, which is essential for tensor operations. Implementations of this trait must provide methods for: 1. Mapping coordinates to indices via the `__call__` method 2. Calculating the total size of the layout's domain 3. Calculating the size of the layout's codomain (memory footprint) 4. Indicating whether the layout has a valid shape This trait serves as the foundation for the layout system, allowing different layout implementations to be used interchangeably in algorithms. ## Implemented traits `AnyType`, `Copyable`, `UnknownDestructibility` ## Aliases ### `has_shape` `alias has_shape` Indicates whether the layout has a valid shape. Layouts and ComposedLayouts with at least one Layout have valid shapes and can be used in layout algebra. Swizzles don't have shapes and should be excluded from layout algebra. ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__call__` `__call__(self: _Self, index: IntTuple[origin]) -> Int` Maps a logical coordinate to a linear memory index. **Args:** * ​index (`IntTuple[origin]`): An IntTuple representing the logical coordinates to map. **Returns:** The linear memory index corresponding to the given coordinates. ### `size` `size(self: _Self) -> Int` Returns the total number of elements in the layout's domain. For a layout with shape (m, n), this returns m \* n, representing the total number of valid coordinates in the layout. **Returns:** The total number of elements in the layout. ### `cosize` `cosize(self: _Self) -> Int` Returns the size of the memory region spanned by the layout. For a layout with shape `(m, n)` and stride `(r, s)`, this returns `(m-1)*r + (n-1)*s + 1`, representing the memory footprint. **Returns:** The size of the memory region required by the layout. 
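To ground the `LayoutTrait` contract, here is a short sketch using the concrete `Layout` from the previous section; the numbers follow the `size` and `cosize` formulas given above.

```mojo
from layout import Layout, IntTuple

var layout = Layout.row_major(3, 4)

print(layout)           # ((3, 4):(4, 1)): "(shape:stride)" formatting
print(layout.size())    # 12: coordinates in the domain
print(layout.cosize())  # 12: (3-1)*4 + (4-1)*1 + 1

# Map logical coordinates to a linear index, then invert the mapping.
print(layout(IntTuple(1, 2)))       # 6 == 1*4 + 2*1
print(layout.idx2crd(IntTuple(6)))  # (1, 2)
```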
--- ## MakeLayoutList `MakeLayoutList(v0: Layout, v1: Layout) -> List[Layout]` Creates a list containing two layouts. This is a convenience function for creating a LayoutList with two elements. **Args:** * ​v0 (`Layout`): The first layout to include in the list. * ​v1 (`Layout`): The second layout to include in the list. **Returns:** A LayoutList containing the two provided layouts. --- ## MakeTileLayoutList `MakeTileLayoutList[*tile_sizes: Int]() -> List[Layout]` Creates a list of layouts for tiling operations. This function creates a list of simple layouts, each with a shape from the provided tile\_sizes and a stride of 1. These layouts can be used for tiling operations. **Parameters:** * ​\*tile\_sizes (`Int`): Variable number of integer tile dimensions. **Returns:** A LayoutList containing layouts for each tile size. --- ## apply_tiler `apply_tiler[func: fn(Layout, Layout) -> Layout](layout_a: Layout, tiler: List[Layout]) -> Layout` Applies a layout transformation function to each element of a layout with a tiler. This utility function applies the specified transformation function to each corresponding pair of elements from the layout and tiler list. It's a generic mechanism for implementing various tiling operations. Example: ```mojo from layout import Layout, LayoutList, IntTuple from layout.layout import apply_tiler, logical_divide # Apply logical_divide to each element of a layout with a tiler var base = Layout.row_major(6, 8) var tilers = LayoutList() tilers.append(Layout(IntTuple(2, 2), IntTuple(1, 2))) var result = apply_tiler[logical_divide](base, tilers) ``` . **Parameters:** * ​func (`fn(Layout, Layout) -> Layout`): A function that takes two layouts and returns a transformed layout. **Args:** * ​layout\_a (`Layout`): The base layout to transform. * ​tiler (`List[Layout]`): A list of layouts to use in the transformation. **Returns:** A new layout resulting from applying the transformation function to each pair. --- ## blocked_product `blocked_product(layout_a: Layout, layout_b: Layout) -> Layout` Creates a blocked layout by combining two layouts. This function creates a hierarchical blocked layout by combining a block layout with a base layout. The result is a layout where each element of the base layout is replaced by a block defined by the first argument, as the example below shows. This is particularly useful for creating tiled layouts for efficient cache utilization in tensor operations like matrix multiplication. Example: ```mojo from layout import Layout from layout.layout import blocked_product # Create a 2x3 matrix layout var matrix = Layout.row_major(2, 3) # Define 2x2 blocks var block = Layout.row_major(2, 2) # Create a blocked layout with 2x2 blocks var blocked = blocked_product(block, matrix) ``` Output: ```plaintext (((2, 2), (2, 3)):((2, 12), (1, 4))) 0 1 2 3 4 5 +----+----+----+----+----+----+ 0 | 0 | 1 | 4 | 5 | 8 | 9 | +----+----+----+----+----+----+ 1 | 2 | 3 | 6 | 7 | 10 | 11 | +----+----+----+----+----+----+ 2 | 12 | 13 | 16 | 17 | 20 | 21 | +----+----+----+----+----+----+ 3 | 14 | 15 | 18 | 19 | 22 | 23 | +----+----+----+----+----+----+ ``` . **Args:** * ​layout\_a (`Layout`): The block layout that defines the structure within each block. * ​layout\_b (`Layout`): The base layout to be blocked, which defines how the blocks are arranged. **Returns:** A new layout representing the blocked structure. --- ## coalesce `coalesce(layout: Layout, keep_rank: Bool = False) -> Layout` Simplifies a layout by combining dimensions with contiguous strides.
This function reduces the rank of a layout by merging dimensions that have contiguous memory layouts, resulting in a simpler but equivalent layout. Example: ```mojo from layout import Layout, IntTuple from layout.layout import coalesce # A layout with shape (2, (1, 4)) and stride (1, (4, 2)) can be coalesced var layout = Layout(IntTuple(2, IntTuple(1, 4)), IntTuple(1, IntTuple(4, 2))) var coalesced = coalesce(layout) # Result: Layout with shape (8) and stride (1) ``` . **Args:** * ​layout (`Layout`): The layout to coalesce. * ​keep\_rank (`Bool`): If True, maintains the original rank of the layout. Default is False. **Returns:** A simplified layout with reduced rank where possible. --- ## complement `complement(layout: Layout, size: Int = 1) -> Layout` Computes the complement layout for a given layout. This function creates a layout that represents the "gaps" or complementary structure of the input layout. It's useful for creating hierarchical layouts where you need to fill in the spaces between existing layout elements. Example: ```mojo from layout import Layout, IntTuple from layout.layout import complement # Compute the complement of a layout var base = Layout(IntTuple(2, 3), IntTuple(3, 1)) var comp = complement(base, 10) # Result: A layout that fills the gaps in the original layout ``` . **Args:** * ​layout (`Layout`): The input layout to compute the complement for. * ​size (`Int`): The total size of the memory region to consider. Defaults to 1. **Returns:** A new layout representing the complement of the input layout. --- ## composition `composition(layout_a: Layout, layout_b: Layout) -> Layout` Composes two layouts to create a new layout. This function creates a new layout by composing two layouts, where the first layout defines the outer structure and the second layout defines the inner structure. The new layout is compatible with `layout_b` (that is, it has the same `size` and every set of coordinates in `layout_b` has an equivalent in the new layout). You can think of `layout_b` as selecting a subset of elements from `layout_a`. Example: ```mojo from layout.layout import Layout, IntTuple from layout.layout import composition # Compose a row-major layout with a tiling layout var base = Layout.row_major(6, 8) var tiling = Layout(IntTuple(3, 2), IntTuple(1, 3)) var composed = composition(base, tiling) # Result: A layout that represents a 3x2 tile from # layout_a ``` . **Args:** * ​layout\_a (`Layout`): The outer layout. * ​layout\_b (`Layout`): The inner layout. **Returns:** A new layout representing the composition of the two layouts. `composition(layout_a: Layout, tiler: List[Layout]) -> Layout` Composes a layout with a list of layouts to create a hierarchical layout. This function creates a new layout by composing each element of the first layout with the corresponding element in the tiler list. If the tiler list is shorter than the layout, the remaining elements from the layout are appended unchanged. Example: ```mojo from layout import Layout, LayoutList, IntTuple from layout.layout import composition # Compose a layout with a list of tiling layouts var base = Layout.row_major(6, 8) var tilers = LayoutList() tilers.append(Layout(IntTuple(2, 2), IntTuple(1, 2))) tilers.append(Layout(IntTuple(3, 3), IntTuple(1, 3))) var composed = composition(base, tilers) # Result: A layout with hierarchical tiling based on the tiler list ``` . **Args:** * ​layout\_a (`Layout`): The base layout to compose with the tiler. 
* ​tiler (`List[Layout]`): A list of layouts to compose with the base layout. **Returns:** A new layout representing the composition of the base layout with the tiler. --- ## cosize `cosize(l: Layout) -> Int` Returns the size of the memory region spanned by the layout. This is a standalone function equivalent to the Layout.cosize() method. **Args:** * ​l (`Layout`): The layout to calculate the cosize for. **Returns:** The size of the memory region required by the layout. --- ## downcast `downcast(layout: Layout, factor: Int) -> Layout` Splits elements in a layout to create a finer layout without changing the total number of elements so that the alignment is preserved. This function is useful for converting between different data type granularities, such as from uint128 to bf16. **Args:** * ​layout (`Layout`): The layout to downcast. * ​factor (`Int`): The number of elements to split into. **Returns:** A new layout with adjusted shape and stride for the finer granularity. --- ## expand_modes_alike `expand_modes_alike(shape_a: IntTuple[origin], stride_a: IntTuple[origin], shape_b: IntTuple[origin], stride_b: IntTuple[origin]) -> InlineArray[IntTuple, 3]` Aligns two shape-stride pairs to have the same hierarchical structure. This function is used to make two layouts compatible for operations by ensuring they have the same hierarchical structure, expanding scalar values into tuples as needed. **Args:** * ​shape\_a (`IntTuple[origin]`): The first shape tuple. * ​stride\_a (`IntTuple[origin]`): The first stride tuple. * ​shape\_b (`IntTuple[origin]`): The second shape tuple. * ​stride\_b (`IntTuple[origin]`): The second stride tuple. **Returns:** An array containing three tuples: the common shape, the expanded stride\_a, and the expanded stride\_b. `expand_modes_alike(layout_a: Layout, layout_b: Layout) -> InlineArray[Layout, 2]` Aligns two layouts to have the same hierarchical structure. This function tiles both layouts so they mirror each other's structure, making them compatible for operations that require matching hierarchies. Example: Given layouts with different structures: * layout\_0: (((3, (5, 2)), 4):((1, (24, 12)), 3)) * layout\_1: ((30, (2, 2)):(2, (60, 1))) The result would be two layouts with matching structures: * (((3, (5, 2)), (2, 2)):((1, (24, 12)), (3, 6))) * (((3, (5, 2)), (2, 2)):((2, (6, 30)), (60, 1))) ```mojo from layout import Layout, IntTuple from layout.layout import expand_modes_alike alias layout_0 = Layout( IntTuple(IntTuple(3, IntTuple(5, 2)), 4), IntTuple(IntTuple(1, IntTuple(24, 12)), 3), ) alias layout_1 = Layout( IntTuple(30, IntTuple(2, 2)), IntTuple(2, IntTuple(60, 1)) ) alias uc = expand_modes_alike(layout_0, layout_1) print(uc[0]) # (((3, (5, 2)), (2, 2)):((1, (24, 12)), (3, 6))) print(uc[1]) # (((3, (5, 2)), (2, 2)):((2, (6, 30)), (60, 1))) ``` . **Args:** * ​layout\_a (`Layout`): The first layout to align. * ​layout\_b (`Layout`): The second layout to align. **Returns:** An array containing two layouts with matching hierarchical structures. --- ## expand_strides `expand_strides(shape: IntTuple[origin], stride: Int) -> IntTuple` Expands a scalar stride into a stride tuple matching a shape tuple. This function creates a stride tuple that matches the structure of a shape tuple, with each stride value calculated based on the cumulative product of shape dimensions. **Args:** * ​shape (`IntTuple[origin]`): The shape tuple to match. * ​stride (`Int`): The base stride value to expand. **Returns:** A stride tuple matching the structure of the shape tuple. 
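To make the cumulative-product rule concrete, here is a small sketch; the printed result reflects our reading of the rule described above rather than verified output:

```mojo
from layout.int_tuple import IntTuple
from layout.layout import expand_strides

# Expand a scalar base stride of 2 across shape (2, 3). The first mode
# keeps the base stride, and each following mode advances by the base
# stride times the product of the preceding dimension sizes: (2, 2 * 2).
var strides = expand_strides(IntTuple(2, 3), 2)
print(strides)  # expected: (2, 4)
```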
--- ## format_layout `format_layout[W: Writer](layout: Layout, mut writer: W)` Formats a 2D layout as a table and writes it to the specified writer. This function creates a visual representation of a 2D layout as a table showing the memory indices for each logical coordinate. **Parameters:** * ​W (`Writer`): Type parameter representing a Writer implementation. **Args:** * ​layout (`Layout`): The 2D layout to format. * ​writer (`W`): The writer to output the formatted layout to. --- ## hierarchical_unzip `hierarchical_unzip(layout_a: Layout, tiler: List[Layout]) -> Layout` Hierarchically unzips a layout according to a list of layouts. This function creates a hierarchical layout by unzipping the first layout according to the layouts in the tiler list. It's useful for decomposing a layout into hierarchical components for more efficient memory access patterns or to enable specialized tensor operations. Example: ```mojo from layout import Layout, LayoutList, IntTuple from layout.layout import hierarchical_unzip # Create a layout to unzip var base = Layout.row_major(6, 8) var tilers = LayoutList() tilers.append(Layout(IntTuple(2, 2))) var result = hierarchical_unzip(base, tilers) ``` . **Args:** * ​layout\_a (`Layout`): The layout to be unzipped. * ​tiler (`List[Layout]`): A list of layouts defining the unzipping patterns. **Returns:** A new layout representing the hierarchical unzipping with components from both the original layout and the tiler layouts. `hierarchical_unzip(layout_a: Layout, layout_b: Layout) -> Layout` Hierarchically unzips a layout according to another layout. This function creates a hierarchical layout by unzipping the first layout according to the second layout. It's a fundamental operation for decomposing a layout into hierarchical components, which enables more efficient memory access patterns for various tensor operations. Example: ```mojo from layout import Layout, IntTuple from layout.layout import hierarchical_unzip # Create layouts var base = Layout.row_major(6, 8) var pattern = Layout(IntTuple(2, 2)) var result = hierarchical_unzip(base, pattern) ``` . **Args:** * ​layout\_a (`Layout`): The layout to be unzipped. * ​layout\_b (`Layout`): The layout defining the unzipping pattern. **Returns:** A new layout representing the hierarchical unzipping of layout\_a according to the pattern defined by layout\_b. --- ## layout Provides a high-performance tensor layout system for memory mapping and indexing. The layout module implements a comprehensive system for describing memory layouts of multi-dimensional tensors, enabling efficient mapping between logical tensor coordinates and physical memory locations. This is a critical component for high-performance tensor operations in machine learning and scientific computing. These low-level primitives require careful use to avoid errors. Understanding the relationship between tensor shapes, strides, and memory layout is essential for effective use. 
Key components: * `LayoutTrait`: Core trait defining the interface for all layout types * `Layout`: Primary struct implementing memory layout with shape and stride information * Layout algebra: Functions for composing, dividing, and transforming layouts * Tiling operations: Functions for hierarchical decomposition of layouts Performance features: * Zero-cost abstractions for mapping between logical and physical indices * Support for both compile-time and runtime-determined shapes * Efficient memory access patterns through layout transformations * Hierarchical tiling for cache-friendly memory access Common use cases: * Defining memory layouts for tensors with different storage formats (row-major, column-major) * Implementing efficient tensor operations with optimal memory access patterns * Supporting hardware-specific memory layouts for accelerators * Enabling zero-copy tensor views and reshaping operations Example: ```mojo from layout import Layout, IntTuple from layout.layout import blocked_product # Create a 3x4 row-major layout var layout = Layout.row_major(3, 4) # Access the memory location for logical coordinates (1, 2) var memory_idx = layout(IntTuple(1, 2)) # Create a tiled layout for blocked matrix multiplication var tiled = blocked_product(Layout(IntTuple(2, 2)), layout) ``` ## Aliases ### `LayoutList` `alias LayoutList = List[Layout]` ## Structs * [​`Layout`](./Layout): Represents a memory layout for multi-dimensional data. ## Traits * [​`LayoutTrait`](./LayoutTrait): Defines the interface for mapping between logical coordinates and memory indices. ## Functions * [​`apply_tiler`](./apply_tiler): Applies a layout transformation function to each element of a layout with a tiler. * [​`blocked_product`](./blocked_product): Creates a blocked layout by combining two layouts. * [​`coalesce`](./coalesce): Simplifies a layout by combining dimensions with contiguous strides. * [​`complement`](./complement): Computes the complement layout for a given layout. * [​`composition`](./composition): Composes two layouts to create a new layout. * [​`cosize`](./cosize): Returns the size of the memory region spanned by the layout. * [​`downcast`](./downcast): Splits elements in a layout to create a finer layout without changing the total number of elements so that the alignment is preserved. * [​`expand_modes_alike`](./expand_modes_alike): Aligns two shape-stride pairs to have the same hierarchical structure. * [​`expand_strides`](./expand_strides): Expands a scalar stride into a stride tuple matching a shape tuple. * [​`format_layout`](./format_layout): Formats a 2D layout as a table and writes it to the specified writer. * [​`hierarchical_unzip`](./hierarchical_unzip): Hierarchically unzips a layout according to a list of layouts. * [​`is_contiguous_dim`](./is_contiguous_dim): Checks if a flat layout is contiguous in a specific dimension. * [​`is_row_major`](./is_row_major): Checks if a layout has row-major ordering for the specified rank. * [​`logical_divide`](./logical_divide): Divides a layout into blocks according to another layout. * [​`logical_product`](./logical_product): Creates a product of two layouts. * [​`make_layout`](./make_layout): Creates a composite layout by concatenating multiple layouts. * [​`make_ordered_layout`](./make_ordered_layout): Creates a layout with strides ordered according to a specified traversal order. * [​`MakeLayoutList`](./MakeLayoutList): Creates a list containing two layouts. * [​`MakeTileLayoutList`](./MakeTileLayoutList): Creates a list of layouts for tiling operations.
* [​`print_layout`](./print_layout): Prints a 2D layout to the standard output. * [​`right_inverse`](./right_inverse): Creates a right inverse of a layout. * [​`size`](./size): Returns the total number of elements in the layout's domain. * [​`sublayout`](./sublayout): Creates a sublayout by selecting specific dimensions from a layout. * [​`tile_to_shape`](./tile_to_shape): Creates a layout by tiling a base layout to match a target shape. * [​`upcast`](./upcast): Fuses consecutive elements in a layout to create a coarser layout. * [​`zip_modes`](./zip_modes): Combines corresponding modes from two layouts. * [​`zipped_divide`](./zipped_divide): Divides a layout into blocks according to another layout. --- ## is_contiguous_dim `is_contiguous_dim(layout: Layout, dim: Int) -> Bool` Checks if a flat layout is contiguous in a specific dimension. This function checks if a flat layout is contiguous in a specified dimension, considering both positive strides and zero strides with a single element. The latter case is necessary for coalesced layouts. **Args:** * ​layout (`Layout`): The layout to check. * ​dim (`Int`): The dimension to check. **Returns:** True if the layout is contiguous in the specified dimension, False otherwise. --- ## is_row_major `is_row_major[rank: Int](layout: Layout) -> Bool` Checks if a layout has row-major ordering for the specified rank. A row-major layout has strides that decrease from left to right, with the rightmost dimension having a stride of 1. **Parameters:** * ​rank (`Int`): The expected rank of the layout. **Args:** * ​layout (`Layout`): The layout to check. **Returns:** True if the layout has row-major ordering for the specified rank, False otherwise. --- ## logical_divide `logical_divide(layout_a: Layout, _layout_b: Layout) -> Layout` Divides a layout into blocks according to another layout. This function creates a hierarchical layout by dividing the first layout according to the second layout. It's useful for creating blocked or tiled representations of tensors. **Args:** * ​layout\_a (`Layout`): The layout to be divided. * ​\_layout\_b (`Layout`): The layout defining the division pattern. **Returns:** A new layout representing the hierarchical division. `logical_divide(layout_a: Layout, tiler: List[Layout]) -> Layout` Divides a layout into blocks according to a list of layouts. This is a variant of logical\_divide that works with a list of layouts for more complex tiling patterns. **Args:** * ​layout\_a (`Layout`): The layout to be divided. * ​tiler (`List[Layout]`): A list of layouts defining the division patterns. **Returns:** A new layout representing the hierarchical division. --- ## logical_product `logical_product(_layout_a: Layout, layout_b: Layout) -> Layout` Creates a product of two layouts. This function creates a hierarchical layout by taking the logical product of two layouts. It's a fundamental operation for creating blocked or tiled layouts. **Args:** * ​\_layout\_a (`Layout`): The first layout. * ​layout\_b (`Layout`): The second layout. **Returns:** A new layout representing the logical product of the two layouts. `logical_product(layout_a: Layout, tiler: List[Layout]) -> Layout` Creates a product of a layout with a list of layouts. This is a variant of logical\_product that works with a list of layouts for more complex tiling patterns. It applies the logical\_product operation to each element of the layout with the corresponding element in the tiler list. 
Example: ```mojo from layout import Layout, LayoutList, IntTuple from layout.layout import logical_product # Create a product of a layout with a list of layouts var base = Layout.row_major(6, 8) var tilers = LayoutList() tilers.append(Layout(IntTuple(2, 2))) var result = logical_product(base, tilers) ``` . **Args:** * ​layout\_a (`Layout`): The base layout to create products with. * ​tiler (`List[Layout]`): A list of layouts defining the product patterns. **Returns:** A new layout representing the logical product with the tiler layouts. --- ## make_layout `make_layout(*layouts: Layout) -> Layout` Creates a composite layout by concatenating multiple layouts. This function combines multiple layouts into a single layout by concatenating their shapes and strides. The resulting layout represents a hierarchical structure where each input layout becomes a component of the output layout. Example: ```mojo from layout import Layout, IntTuple from layout.layout import make_layout var layout1 = Layout(IntTuple(2, 3), IntTuple(3, 1)) var layout2 = Layout(IntTuple(4, 5), IntTuple(5, 1)) var combined = make_layout(layout1, layout2) # Result: Layout with shape ((2, 3), (4, 5)) and stride ((3, 1), (5, 1)) ``` . **Args:** * ​\*layouts (`Layout`): Variable number of `Layout` objects to combine. **Returns:** A new Layout with concatenated shapes and strides from the input layouts. `make_layout(layout_a: Layout, layout_b: Layout) -> Layout` Creates a composite layout from two layouts. This is a specialized version of make\_layout that takes exactly two layouts and combines them into a single layout. This function exists as a workaround for compiler limitations. **Args:** * ​layout\_a (`Layout`): The first layout to include in the composite. * ​layout\_b (`Layout`): The second layout to include in the composite. **Returns:** A new `Layout` with concatenated shapes and strides from the input layouts. --- ## make_ordered_layout `make_ordered_layout(shape: IntTuple[origin], order: IntTuple[origin]) -> Layout` Creates a layout with strides ordered according to a specified traversal order. This function generates a compact (bijective) layout where the stride values follow the traversal order specified by the order parameter. This allows creating layouts with custom memory traversal patterns while maintaining a compact memory representation. Example: ```mojo from layout import IntTuple, Layout from layout.layout import make_ordered_layout # Create a layout with shape (2,3,4,5) where dimensions are traversed # in the order: dim0, dim3, dim2, dim1 var layout = make_ordered_layout( IntTuple(2, 3, 4, 5), IntTuple(1, 4, 3, 2) ) # Result: Layout with shape (2,3,4,5) and stride (1,24,6,2) ``` . **Args:** * ​shape (`IntTuple[origin]`): The shape of the layout. * ​order (`IntTuple[origin]`): The traversal order priority (lower values indicate higher priority). **Returns:** A `Layout` with the specified shape and strides ordered according to the traversal order. --- ## print_layout `print_layout(layout: Layout)` Prints a 2D layout to the standard output. This function visualizes a 2D layout by printing a formatted table showing the memory indices for each logical coordinate. **Args:** * ​layout (`Layout`): The 2D layout to print. --- ## right_inverse `right_inverse(layout: Layout) -> Layout` Creates a right inverse of a layout. The right inverse of a layout maps memory indices back to logical coordinates. This is useful for converting between different memory layouts. **Args:** * ​layout (`Layout`): The layout to invert. 
**Returns:** A new layout representing the right inverse of the input layout. --- ## size `size(l: Layout) -> Int` Returns the total number of elements in the layout's domain. This is a standalone function equivalent to the Layout.size() method. **Args:** * ​l (`Layout`): The layout to calculate the size for. **Returns:** The total number of elements in the layout. --- ## sublayout `sublayout(layout: Layout, *modes: Int) -> Layout` Creates a sublayout by selecting specific dimensions from a layout. This function extracts a subset of dimensions from a layout to create a new layout with lower rank. For example, from a 3D layout, you could extract a 2D layout containing only the first and third dimensions. Example: From a layout with shape (3,4,5), sublayout(layout, 0, 2) would create a layout with shape (3,5). **Args:** * ​layout (`Layout`): The source layout to extract dimensions from. * ​\*modes (`Int`): The indices of dimensions to include in the sublayout. **Returns:** A new layout containing only the specified dimensions. --- ## tile_to_shape `tile_to_shape(tile: Layout, target_shape: IntTuple[origin], order: Optional[IntTuple] = Optional(None)) -> Layout` Creates a layout by tiling a base layout to match a target shape. This function creates a hierarchical layout by repeating a tile layout to match a target shape. It calculates how many times the tile needs to be repeated in each dimension to reach the target shape, and creates a tiler layout with this information. Example: ```mojo from layout import Layout, IntTuple from layout.layout import tile_to_shape # Create a 2x2 tile layout var tile = Layout.row_major(2, 2) # Tile it to create a 6x4 layout var tiled = tile_to_shape(tile, IntTuple(6, 4)) # Result: A layout with 3x2 tiles of size 2x2 each ``` . **Args:** * ​tile (`Layout`): The base layout to be tiled. * ​target\_shape (`IntTuple[origin]`): The desired final shape to tile to. * ​order (`Optional[IntTuple]`): Optional memory ordering for the tiler layout. If None, defaults to column-major ordering. **Returns:** A new layout representing the tiled structure that matches the target shape. --- ## upcast `upcast(layout: Layout, factor: Int) -> Layout` Fuses consecutive elements in a layout to create a coarser layout. This function is useful for converting between different data type granularities, such as from bytes to larger data types like bfloat16 or tf32. **Args:** * ​layout (`Layout`): The layout to upcast. * ​factor (`Int`): The number of consecutive elements to fuse into one. **Returns:** A new layout with adjusted shape and stride for the coarser granularity. --- ## zip_modes `zip_modes(layout_a: Layout, layout_b: Layout) -> Layout` Combines corresponding modes from two layouts. This function creates a new layout by combining corresponding dimensions from two layouts. If a dimension in layout\_b has a non-positive shape, the corresponding dimension from layout\_a is used directly. **Args:** * ​layout\_a (`Layout`): The first layout. * ​layout\_b (`Layout`): The second layout. **Returns:** A new layout with combined dimensions from both input layouts. --- ## zipped_divide `zipped_divide(layout_a: Layout, layout_b: Layout) -> Layout` Divides a layout into blocks according to another layout. This function creates a hierarchical layout by dividing the first layout according to the second layout. It's an alias for hierarchical\_unzip that provides a more intuitive name for the division operation. This is useful for creating blocked or tiled representations of tensors. 
Example: ```mojo from layout import Layout, IntTuple from layout.layout import zipped_divide # Create layouts var base = Layout.row_major(6, 8) var pattern = Layout(IntTuple(2, 2)) var result = zipped_divide(base, pattern) ``` . **Args:** * ​layout\_a (`Layout`): The layout to be divided. * ​layout\_b (`Layout`): The layout defining the division pattern. **Returns:** A new layout representing the hierarchical division of layout\_a according to layout\_b. `zipped_divide(layout_a: Layout, tiler: List[Layout]) -> Layout` Divides a layout into blocks according to a list of layouts. This function creates a hierarchical layout by dividing the first layout according to the layouts in the tiler list. It's an alias for hierarchical\_unzip that provides a more intuitive name for the division operation when working with multiple tiling patterns. Example: ```mojo from layout import Layout, LayoutList, IntTuple from layout.layout import zipped_divide # Create layouts var base = Layout.row_major(6, 8) var tilers = LayoutList() tilers.append(Layout(IntTuple(2, 2))) var result = zipped_divide(base, tilers) ``` . **Args:** * ​layout\_a (`Layout`): The layout to be divided. * ​tiler (`List[Layout]`): A list of layouts defining the division patterns. **Returns:** A new layout representing the hierarchical division of layout\_a according to the patterns in tiler. --- ## LayoutTensor `@register_passable(trivial)` `struct LayoutTensor[mut: Bool, //, dtype: DType, layout: Layout, origin: Origin[mut], /, *, address_space: AddressSpace = AddressSpace(0), element_layout: Layout = Layout(IntTuple(1), IntTuple(1)), layout_int_type: DType = _get_layout_type(layout, address_space), linear_idx_type: DType = _get_index_type(layout, address_space), masked: Bool = False, alignment: Int = alignof[dtype]()]` A high-performance tensor with explicit memory layout and hardware-optimized access patterns. `LayoutTensor` provides a powerful abstraction for multi-dimensional data with precise control over memory organization. It supports various memory layouts (row-major, column-major, tiled), hardware-specific optimizations, and efficient parallel access patterns. Example: ```mojo from layout import Layout, LayoutTensor # Create tensor on CPU using InlineArray to allocate storage space. var storage = InlineArray[Scalar[DType.float32], 5 * 4](uninitialized = True) var tensor_5x4 = LayoutTensor[DType.float32, Layout.row_major(5, 4)](storage) ``` ## Parameters * ​mut (`Bool`): The inferred mutability of the underlying pointer. * ​dtype (`DType`): The data type of the underlying pointer. * ​layout (`Layout`): The memory layout of the tensor. * ​origin (`Origin[mut]`): The origin of the underlying pointer. * ​address\_space (`AddressSpace`): The address space of the underlying pointer. * ​element\_layout (`Layout`): The memory layout of each element in the tensor. * ​layout\_int\_type (`DType`): The integer type of each dimension of runtime layout. * ​linear\_idx\_type (`DType`): The integer type of the index pointing to memory locations. * ​masked (`Bool`): If true the tensor is masked and runtime layouts determine the shape. * ​alignment (`Int`): Alignment of the data pointer. ## Fields * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the underlying memory buffer containing the tensor data.
This pointer respects the specified address space, alignment, mutability, and origin tracking for memory safety and performance optimization. * ​runtime\_layout (`RuntimeLayout[layout, element_type=layout_int_type, linear_idx_type=linear_idx_type]`): Runtime representation of the tensor's memory layout. Handles both compile-time and runtime-determined dimensions, enabling efficient mapping between logical tensor coordinates and physical memory locations. * ​runtime\_element\_layout (`RuntimeLayout[element_layout, element_type=int32, linear_idx_type=linear_idx_type]`): Runtime representation of each element's internal layout. Used when elements themselves have structure, such as in blocked or tiled layouts. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`, `_Expable` ## Aliases ### `element_size` `alias element_size = element_layout.size()` The number of scalar values in each element of the tensor. ### `element_type` `alias element_type = SIMD[dtype, element_layout.size()]` The SIMD vector type used for vectorized operations on tensor elements. ### `rank` `alias rank = layout.rank()` The number of dimensions in the tensor's layout. ## Methods ### `__init__` `@implicit` `__init__(span: Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment]) -> Self` Create a `LayoutTensor` with a `Span`. **Constraints:** Layout must be fully static. **Args:** * ​span (`Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment]`): The `Span` pointing to the underlying data. `__init__(span: Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> Self` Create a `LayoutTensor` with a `Span` and a runtime layout for the tensor. The runtime layout element type will be cast to the layout tensor layout integer type. **Constraints:** * Element layout must be fully static. **Args:** * ​span (`Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment]`): The `Span` pointing to the underlying data. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`. `__init__(span: Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type], element_runtime_layout: RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> Self` Create a `LayoutTensor` with a `Span`, a runtime layout of the tensor, and the runtime layout of each element. The runtime layout element type will be cast to the layout tensor layout integer type. **Constraints:** * Runtime layout and `LayoutTensor` must have the same bitwidth and index type. **Args:** * ​span (`Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment]`): The `Span` pointing to the underlying data. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`. * ​element\_runtime\_layout (`RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of each element.
`@implicit` `__init__(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> Self` Create a `LayoutTensor` with an `UnsafePointer`. **Constraints:** Layout must be fully static. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The `UnsafePointer` pointing to the underlying data. `__init__(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> Self` Create a `LayoutTensor` with an `UnsafePointer` and a runtime layout for the tensor. The runtime layout element type will be cast to the layout tensor layout integer type. **Constraints:** Element layout must be fully static. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The `UnsafePointer` pointing to the underlying data. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`. `__init__(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type], element_runtime_layout: RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> Self` Create a `LayoutTensor` with an `UnsafePointer`, a runtime layout for the tensor, and the runtime layout of each element. The runtime layout element type will be cast to the layout tensor layout integer type. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The `UnsafePointer` pointing to the underlying data. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`. * ​element\_runtime\_layout (`RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of each element. `@implicit` `__init__(ref [origin] device_buffer: DeviceBuffer[dtype]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Create a `LayoutTensor` from a `DeviceBuffer`. The layout must have statically known dimensions. Note that the device buffer memory is on the accelerator device (GPU global memory). Code running on the CPU can use the [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext) to allocate a `DeviceBuffer` and use that to construct a `LayoutTensor` that can be accessed on the GPU. You cannot directly access data in the `DeviceBuffer` or `LayoutTensor` from the CPU. The following example shows a typical pattern for using `DeviceBuffer` to construct a `LayoutTensor` that you can use on the GPU.
```mojo from gpu.host import DeviceContext, DeviceBuffer from layout import Layout, LayoutTensor alias dtype = DType.float32 var ctx = DeviceContext() # Allocate buffers var dev_buf = ctx.enqueue_create_buffer[dtype](16) var host_buf = ctx.enqueue_create_host_buffer[dtype](16) # Ensure buffers have been created ctx.synchronize() # Initialize host buffer and copy to device buffer for i in range(16): host_buf[i] = i ctx.enqueue_copy(dev_buf, host_buf) # Create LayoutTensor to use on device alias layout = Layout.row_major(4, 4) var tensor = LayoutTensor[dtype, layout](dev_buf) ... ``` **Constraints:** * Layout must be fully static. **Args:** * ​device\_buffer (`DeviceBuffer[dtype]`): Contains the underlying data to point to. `@implicit` `__init__(ref [origin] host_buffer: HostBuffer[dtype]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Create a `LayoutTensor` from a `HostBuffer`. The layout must have statically known dimensions. The resulting tensor's data can only be accessed on the CPU. ```mojo from gpu.host import DeviceContext, HostBuffer from layout import Layout, LayoutTensor alias dtype = DType.float32 var ctx = DeviceContext() var host_buf = ctx.enqueue_create_host_buffer[dtype](16) alias layout = Layout.row_major(4, 4) var tensor = LayoutTensor[dtype, layout](host_buf) ``` **Constraints:** * Layout must be fully static. **Args:** * ​host\_buffer (`HostBuffer[dtype]`): Contains the underlying data to point to. `__init__(ref [origin] device_buffer: DeviceBuffer[dtype], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Create a `LayoutTensor` from a `DeviceBuffer` and a runtime layout. The runtime layout element type will be cast to the layout tensor layout integer type. The resulting tensor's data can only be accessed on the GPU. **Constraints:** * Element layout must be fully static. **Args:** * ​device\_buffer (`DeviceBuffer[dtype]`): The `DeviceBuffer` containing the underlying data. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`. `__init__(ref [origin] host_buffer: HostBuffer[dtype], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Create a `LayoutTensor` from a `HostBuffer` and a runtime layout. The runtime layout element type will be cast to the layout tensor layout integer type. The resulting tensor's data can only be accessed on the CPU. **Constraints:** * Element layout must be fully static. **Args:** * ​host\_buffer (`HostBuffer[dtype]`): The `HostBuffer` containing the underlying data. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`.
`__init__(ref [origin] device_buffer: DeviceBuffer[dtype], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type], element_runtime_layout: RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Create a `LayoutTensor` from a `DeviceBuffer`, a runtime layout for the tensor, and the runtime layout of each element. The runtime layout element type will be cast to the layout tensor layout integer type. The resulting tensor's data can only be accessed on the GPU. **Args:** * ​device\_buffer (`DeviceBuffer[dtype]`): The `DeviceBuffer` containing the underlying data. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`. * ​element\_runtime\_layout (`RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of each element. `__init__(ref [origin] host_buffer: HostBuffer[dtype], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type], element_runtime_layout: RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Create a `LayoutTensor` from a `HostBuffer`, a runtime layout for the tensor, and the runtime layout of each element. The runtime layout element type will be cast to the layout tensor layout integer type. The resulting tensor's data can only be accessed on the CPU. **Args:** * ​host\_buffer (`HostBuffer[dtype]`): The `HostBuffer` containing the underlying data. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`. * ​element\_runtime\_layout (`RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of each element. ### `__getitem__` `__getitem__(self, *dims: Int) -> SIMD[dtype, element_layout.size()]` Retrieves a single element from the tensor at the specified indices. This method provides array-like indexing for the tensor. The number of indices provided must match the rank of the tensor, otherwise an error will occur at runtime. **Args:** * ​\*dims (`Int`): The indices specifying the element's position in each dimension. For example, in a 3D tensor, you would use (i, j, k). **Returns:** The element at the specified position with the tensor's data type. `__getitem__(self, crd: RuntimeTuple[S, element_type=element_type]) -> SIMD[dtype, element_layout.size()]` Retrieves a single element from the tensor at the specified indices. This method provides array-like indexing for the tensor. The number of indices provided must match the rank of the tensor, otherwise an error will occur at runtime. **Args:** * ​crd (`RuntimeTuple[S, element_type=element_type]`): The coordinate specifying the element's position in each dimension. For example, in a 3D tensor, you would use (i, j, k). **Returns:** The element at the specified position with the tensor's data type.
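A minimal sketch of round-tripping a value through these indexing operators, using the `InlineArray`-backed constructor shown earlier (the assignment relies on `__setitem__`, documented next):

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 2 * 3](uninitialized = True)
var t = LayoutTensor[DType.float32, Layout.row_major(2, 3)](storage)

# Write an element, then read it back with array-style indexing.
t[1, 2] = 42.0
print(t[1, 2])  # 42.0
```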
### `__setitem__` `__setitem__(self, d0: Int, val: SIMD[dtype, element_layout.size()])` Sets a single element in a rank-1 tensor at the specified index. This method provides array-like element assignment for rank-1 tensors. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. **Args:** * ​d0 (`Int`): The index along the first dimension. * ​val (`SIMD[dtype, element_layout.size()]`): The value to write to the tensor at the specified position. `__setitem__(self, d0: Int, d1: Int, val: SIMD[dtype, element_layout.size()])` Sets a single element in a rank-2 tensor at the specified indices. This method provides array-like element assignment for rank-2 tensors. Performance: * Direct memory access with minimal overhead. * Memory access pattern follows the tensor's stride configuration. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. **Args:** * ​d0 (`Int`): The index along the first dimension. * ​d1 (`Int`): The index along the second dimension. * ​val (`SIMD[dtype, element_layout.size()]`): The value to write to the tensor at the specified position. `__setitem__(self, d0: Int, d1: Int, d2: Int, val: SIMD[dtype, element_layout.size()])` Sets a single element in a rank-3 tensor at the specified indices. This method provides array-like element assignment for rank-3 tensors. Performance: * Direct memory access with minimal overhead. * Memory access pattern follows the tensor's stride configuration. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. **Args:** * ​d0 (`Int`): The index along the first dimension. * ​d1 (`Int`): The index along the second dimension. * ​d2 (`Int`): The index along the third dimension. * ​val (`SIMD[dtype, element_layout.size()]`): The value to write to the tensor at the specified position. `__setitem__(self, d0: Int, d1: Int, d2: Int, d3: Int, val: SIMD[dtype, element_layout.size()])` Sets a single element in a rank-4 tensor at the specified indices. This method provides array-like element assignment for rank-4 tensors. Performance: * Direct memory access with minimal overhead. * Memory access pattern follows the tensor's stride configuration. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. **Args:** * ​d0 (`Int`): The index along the first dimension. * ​d1 (`Int`): The index along the second dimension. * ​d2 (`Int`): The index along the third dimension. * ​d3 (`Int`): The index along the fourth dimension. * ​val (`SIMD[dtype, element_layout.size()]`): The value to write to the tensor at the specified position. `__setitem__(self, d0: Int, d1: Int, d2: Int, d3: Int, d4: Int, val: SIMD[dtype, element_layout.size()])` Sets a single element in a rank-5 tensor at the specified indices. This method provides array-like element assignment for rank-5 tensors. Performance: * Direct memory access with minimal overhead. * Memory access pattern follows the tensor's stride configuration. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. **Args:** * ​d0 (`Int`): The index along the first dimension. * ​d1 (`Int`): The index along the second dimension. * ​d2 (`Int`): The index along the third dimension. * ​d3 (`Int`): The index along the fourth dimension. * ​d4 (`Int`): The index along the fifth dimension. 
* ​val (`SIMD[dtype, element_layout.size()]`): The value to write to the tensor at the specified position. ### `__add__` `__add__(self, other: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Add a scalar value to each element of the tensor. Performs an elementwise addition operation, adding the scalar value to each element in the tensor. This operation creates a new tensor with the results. Performance: * This operation creates a copy of the tensor before performing the addition. * For in-place addition, use the `__iadd__` method instead (`+=` operator). **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to add to each element. **Returns:** A new tensor containing the results of the addition operation. `__add__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Add another tensor to this tensor elementwise. Performs an elementwise addition between this tensor and another tensor. This operation creates a new tensor with the results. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation creates a copy of the tensor before performing the addition. * For in-place addition, use the `__iadd__` method instead (`+=` operator). **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to add to this tensor. **Returns:** A new tensor containing the results of the addition operation. ### `__sub__` `__sub__(self, other: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Subtract a scalar value from each element of the tensor. Performs an elementwise subtraction operation, subtracting the scalar value from each element in the tensor. This operation creates a new tensor with the results. Performance: * This operation creates a copy of the tensor before performing the subtraction. * For in-place subtraction, use the `__isub__` method instead (`-=` operator). **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to subtract from each element. **Returns:** A new tensor containing the results of the subtraction operation. 
`__sub__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Subtract another tensor from this tensor elementwise. Performs an elementwise subtraction between this tensor and another tensor. This operation creates a new tensor with the results. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation creates a copy of the tensor before performing the subtraction. * For in-place subtraction, use the `__isub__` method instead (`-=` operator). **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to subtract from this tensor. **Returns:** A new tensor containing the results of the subtraction operation. ### `__mul__` `__mul__(self, other: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Multiply each element of the tensor by a scalar value. Performs an elementwise multiplication operation, multiplying each element in the tensor by the scalar value. This operation creates a new tensor with the results. Performance: * This operation creates a copy of the tensor before performing the multiplication. * For in-place multiplication, use the `__imul__` method instead (`*=` operator). **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to multiply with each element. **Returns:** A new tensor containing the results of the multiplication operation. `__mul__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Multiply this tensor with another tensor elementwise. Performs an elementwise multiplication (Hadamard product) between this tensor and another tensor. This operation creates a new tensor with the results. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Note: This is NOT a matrix multiplication operation. For matrix multiplication, use the appropriate matmul function instead. Performance: * This operation creates a copy of the tensor before performing the multiplication. * For in-place multiplication, use the `__imul__` method instead (`*=` operator). 
**Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to multiply with this tensor. **Returns:** A new tensor containing the results of the elementwise multiplication. ### `__truediv__` `__truediv__(self, other: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Divide each element of the tensor by a scalar value. Performs an elementwise division operation, dividing each element in the tensor by the scalar value. This operation creates a new tensor with the results. Performance: * This operation creates a copy of the tensor before performing the division. * For in-place division, use the `__itruediv__` method instead (`/=` operator). Notes: * Division by zero will result in undefined behavior or errors depending on the dtype. * For integer dtypes, this performs integer division. **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to divide each element by. **Returns:** A new tensor containing the results of the division operation. `__truediv__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Divide this tensor by another tensor elementwise. Performs an elementwise division between this tensor and another tensor. This operation creates a new tensor with the results. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation creates a copy of the tensor before performing the division. * For in-place division, use the `__itruediv__` method instead (`/=` operator). Notes: * Division by zero will result in undefined behavior or errors depending on the dtype. * For integer dtypes, this performs integer division. **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to divide this tensor by. **Returns:** A new tensor containing the results of the division operation. ### `__iadd__` `__iadd__(self, other: SIMD[dtype, 1])` Add a scalar value to each element of the tensor in-place. Performs an elementwise addition operation, adding the scalar value to each element in the tensor. This operation modifies the tensor in-place. Performance: * This operation modifies the tensor directly without creating a copy. **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to add to each element. 
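A short sketch contrasting the out-of-place operators with the in-place variant documented above, again using the `InlineArray`-backed constructor from the struct example (the tensor-tensor overload of `__iadd__` follows below):

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 4](uninitialized = True)
var t = LayoutTensor[DType.float32, Layout.row_major(4)](storage)
for i in range(4):
    t[i] = Float32(i)

var shifted = t + 1.0  # out-of-place: returns a new tensor, `t` is unchanged
t += 10.0              # in-place: modifies `t` directly
print(t[3], shifted[3])  # 13.0 4.0
```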
`__iadd__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Add another tensor to this tensor elementwise in-place. Performs an elementwise addition between this tensor and another tensor. This operation modifies the tensor in-place. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation modifies the tensor directly without creating a copy. **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to add to this tensor. ### `__isub__` `__isub__(self, other: SIMD[dtype, 1])` Subtract a scalar value from each element of the tensor in-place. Performs an elementwise subtraction operation, subtracting the scalar value from each element in the tensor. This operation modifies the tensor in-place. Performance: * This operation modifies the tensor directly without creating a copy. **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to subtract from each element. `__isub__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Subtract another tensor from this tensor elementwise in-place. Performs an elementwise subtraction between this tensor and another tensor. This operation modifies the tensor in-place. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation modifies the tensor directly without creating a copy. **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to subtract from this tensor. ### `__imul__` `__imul__(self, other: SIMD[dtype, 1])` Multiply each element of the tensor by a scalar value in-place. Performs an elementwise multiplication operation, multiplying each element in the tensor by the scalar value. This operation modifies the tensor in-place. Performance: * This operation modifies the tensor directly without creating a copy. **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to multiply with each element. `__imul__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Multiply this tensor with another tensor elementwise in-place. Performs an elementwise multiplication (Hadamard product) between this tensor and another tensor. This operation modifies the tensor in-place. 
Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Note: This is NOT a matrix multiplication operation. For matrix multiplication, use the appropriate matmul function instead. Performance: * This operation modifies the tensor directly without creating a copy. **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to multiply with this tensor. ### `__itruediv__` `__itruediv__(self, other: SIMD[dtype, 1])` Divide each element of the tensor by a scalar value in-place. Performs an elementwise division operation, dividing each element in the tensor by the scalar value. This operation modifies the tensor in-place. Performance: * This operation modifies the tensor directly without creating a copy. Notes: * Division by zero will result in undefined behavior or errors depending on the dtype. * For integer dtypes, this performs integer division. **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to divide each element by. `__itruediv__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Divide this tensor by another tensor elementwise in-place. Performs an elementwise division between this tensor and another tensor. This operation modifies the tensor in-place. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation modifies the tensor directly without creating a copy. Notes: * Division by zero will result in undefined behavior or errors depending on the dtype. * For integer dtypes, this performs integer division. **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to divide this tensor by. ### `copy` `copy(self) -> Self` Explicitly copy this `LayoutTensor`. **Returns:** A copy of the value. ### `bitcast` `bitcast[new_type: DType, /, address_space: AddressSpace = address_space, element_layout: Layout = element_layout](self) -> LayoutTensor[new_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]` Bitcast the underlying pointer to a new data type. **Parameters:** * ​new\_type (`DType`): The new data type to cast to. * ​address\_space (`AddressSpace`): The address space of the returned `LayoutTensor`. * ​element\_layout (`Layout`): The element layout of the returned `LayoutTensor`. **Returns:** A new `LayoutTensor` with the same memory location but with the specified data type, address space, and element layout.
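As a hedged illustration of `bitcast`, the sketch below reinterprets the storage of a float32 tensor as int32 (both are 4-byte types, so the shape and layout are unchanged). The setup mirrors the `fill` example later in this document; the shape and values are illustrative only:

```mojo
from layout import Layout, LayoutTensor

def main():
    var storage = InlineArray[Float32, 2 * 2](uninitialized=True)
    var floats = LayoutTensor[
        DType.float32,
        Layout.row_major(2, 2),
    ](storage).fill(1.0)

    # View the same memory as int32 without copying. The bit pattern of
    # float32 1.0 is 0x3F800000, i.e. 1065353216 in decimal.
    var bits = floats.bitcast[DType.int32]()
    print(bits)
```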
### `origin_cast` `origin_cast[mut: Bool = mut, origin: Origin[mut] = origin](self) -> LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Changes the origin or mutability of a pointer. **Parameters:** * ​mut (`Bool`): Whether the origin is mutable. * ​origin (`Origin[mut]`): Origin of the destination pointer. **Returns:** A new `LayoutTensor` object with the same type and address as the original `LayoutTensor`, and the new specified mutability and origin. ### `address_space_cast` `address_space_cast[address_space: AddressSpace = address_space](self) -> LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Changes the address space of a pointer. **Parameters:** * ​address\_space (`AddressSpace`): The new address space. **Returns:** A new `LayoutTensor` object with the same type and origin as the original `LayoutTensor`, and the new specified address\_space. ### `get_immutable` `get_immutable(self) -> LayoutTensor[dtype, layout, (muttoimm origin._mlir_origin), address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Return an immutable version of this tensor. **Returns:** A `LayoutTensor` covering the same elements, but without mutability. ### `__exp__` `__exp__(self) -> Self` Computes the element-wise exponential function. Returns a new tensor containing the [element-wise exponential](/mojo/stdlib/math/math/exp/) of the input tensor. **Returns:** A new tensor containing the element-wise exponential. ### `load` `load[width: Int](self, m: Int, n: Int) -> SIMD[dtype, width]` Load a SIMD vector from the tensor at the specified 2D coordinates. Performs a vectorized load operation from the tensor's memory, retrieving `width` consecutive elements starting at position (m, n). This method enables efficient SIMD operations on tensor data. Performance: * Uses unaligned memory access which may be slower on some architectures. * For aligned access, use `aligned_load` instead when data alignment is guaranteed. * The load operation is optimized based on the tensor's memory layout. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. * The elements are loaded according to the tensor's stride configuration. **Parameters:** * ​width (`Int`): The number of elements to load into the SIMD vector. Should match the target hardware's vector width for optimal performance. **Args:** * ​m (`Int`): The row index (first dimension). * ​n (`Int`): The column index (second dimension). **Returns:** A SIMD vector containing 'width' consecutive elements from the tensor. ### `prefetch` `prefetch(self, m: Int, n: Int)` Prefetch tensor data at the specified 2D coordinates into cache. Issues a software prefetch hint to the processor to load the data at position (m, n) into the cache hierarchy. This can improve performance by reducing memory latency for subsequent accesses to the same location. Performance: * Prefetching is a performance hint and does not guarantee data will be cached. * Most effective when issued sufficiently ahead of the actual data access. * Uses high locality prefetch to the data cache, optimized for data that will be accessed multiple times.
* Can reduce memory access latency by 50-90% when used correctly. Notes: * Excessive prefetching can pollute the cache and degrade performance. * Most beneficial for predictable access patterns that would otherwise cause cache misses. * No operation is performed on the prefetched data. **Args:** * ​m (`Int`): The row index (first dimension). * ​n (`Int`): The column index (second dimension). ### `aligned_load` `aligned_load[width: Int](self, m: Int, n: Int) -> SIMD[dtype, width]` Load a SIMD vector with alignment guarantees from the tensor. Performs an aligned vectorized load operation from the tensor's memory, retrieving `width` consecutive elements starting at position (m, n). The alignment is automatically calculated based on the SIMD width and dtype. Performance: * Uses aligned memory access which is faster than unaligned access on most architectures. * The alignment is automatically calculated based on the SIMD width and dtype. * Can be up to 2x faster than unaligned loads on architectures that require alignment. Notes: * The caller must ensure that the memory at (m, n) is properly aligned. Misaligned access with this method may cause hardware exceptions on some architectures. * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. **Parameters:** * ​width (`Int`): The number of elements to load into the SIMD vector. Should match the target hardware's vector width for optimal performance. **Args:** * ​m (`Int`): The row index (first dimension). * ​n (`Int`): The column index (second dimension). **Returns:** A SIMD vector containing 'width' consecutive elements from the tensor. ### `store` `store[width: Int](self, m: Int, n: Int, val: SIMD[dtype, width])` Store a SIMD vector to the tensor at the specified 2D coordinates. Performs a vectorized store operation to the tensor's memory, writing 'width' consecutive elements starting at position (m, n). This method enables efficient SIMD operations on tensor data. Performance: * Uses unaligned memory access which may be slower on some architectures. * For aligned access, use aligned\_store instead when data alignment is guaranteed. * The store operation is optimized based on the tensor's memory layout. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. * The elements are stored according to the tensor's stride configuration. * This operation modifies the tensor's data in-place. **Parameters:** * ​width (`Int`): The number of elements in the SIMD vector to store. Should match the target hardware's vector width for optimal performance. **Args:** * ​m (`Int`): The row index (first dimension) where the store operation begins. * ​n (`Int`): The column index (second dimension) where the store operation begins. * ​val (`SIMD[dtype, width]`): The SIMD vector containing the values to store in the tensor. ### `aligned_store` `aligned_store[width: Int](self, m: Int, n: Int, val: SIMD[dtype, width])` Store a SIMD vector with alignment guarantees to the tensor. Performs an aligned vectorized store operation to the tensor's memory, writing `width` consecutive elements starting at position (m, n). The alignment is automatically calculated based on the SIMD width and dtype. Performance: * Uses aligned memory access which is faster than unaligned access on most architectures. * The alignment is automatically calculated based on the SIMD width and dtype. * Can be up to 2x faster than unaligned stores on architectures that require alignment. 
* Particularly important for streaming stores that bypass the cache. Notes: * The caller must ensure that the memory at (m, n) is properly aligned. Misaligned access with this method may cause hardware exceptions on some architectures. * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. * This operation modifies the tensor's data in-place. **Parameters:** * ​width (`Int`): The number of elements in the SIMD vector to store. Should match the target hardware's vector width for optimal performance. **Args:** * ​m (`Int`): The row index (first dimension) where the store operation begins. * ​n (`Int`): The column index (second dimension) where the store operation begins. * ​val (`SIMD[dtype, width]`): The SIMD vector containing the values to store in the tensor. ### `size` `size(self) -> Int` Get the total number of elements that the tensor can contain. **Returns:** The total number of elements that can be stored in the tensor. ### `stack_allocation` `static stack_allocation[*, alignment: Int = alignment]() -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Allocates stack memory for a `LayoutTensor` with a fully static layout. Creates a new `LayoutTensor` instance with memory allocated on the stack rather than the heap. This provides deterministic memory management and potentially better performance for tensors with known sizes at compile time. Performance: * Stack allocation is typically faster than heap allocation. * Proper alignment can significantly improve memory access performance, especially for vectorized operations. * No dynamic memory management overhead (no malloc/free calls). Notes: * Only works with tensors that have fully static layouts known at compile time. * Stack memory is limited, so this should only be used for reasonably sized tensors. * The allocated memory is automatically freed when the function returns. **Constraints:** * The layout must be fully static (all dimensions known at compile time). * The alignment must be a multiple of the tensor's minimum required alignment. **Parameters:** * ​alignment (`Int`): Memory alignment value for the allocation in bytes. Must be a multiple of the tensor's minimum required alignment. Default is the tensor's natural alignment based on its data type and layout. **Returns:** A new `LayoutTensor` instance with memory allocated on the stack. ### `shape` `static shape[idx: Int]() -> Int` Returns the size of the tensor along the specified dimension. Provides static access to the tensor's shape information. This method returns the size of a specific dimension without requiring an instance of the tensor, as the shape is part of the tensor's static type information. Performance: * This is a compile-time operation with no runtime cost when used with static dimensions. Notes: * This is a static method that operates on the tensor's type information, not on a specific tensor instance. **Parameters:** * ​idx (`Int`): The dimension index to query (0-based). For example, in a 3D tensor with shape \[10, 20, 30]: * `shape[0]()` returns 10 (first dimension). * `shape[1]()` returns 20 (second dimension). * `shape[2]()` returns 30 (third dimension). **Returns:** The size of the tensor along the specified dimension as an integer. ### `stride` `static stride[idx: Int]() -> Int` Returns the memory stride of the tensor along the specified dimension.
Provides static access to the tensor's stride information. The stride represents the number of elements to skip in memory to move one position along a particular dimension. This method returns the stride without requiring an instance of the tensor, as the stride is part of the tensor's static type information. Performance: * This is a compile-time operation with no runtime cost when used with static dimensions. * Understanding stride patterns is crucial for optimizing memory access patterns in performance-critical code. Notes: * Strides depend on the memory layout (row-major, column-major, or custom). * For non-contiguous tensors (e.g., tensor slices), strides may not follow a simple pattern. **Parameters:** * ​idx (`Int`): The dimension index to query (0-based). For example, in a 2D tensor with shape \[10, 20] and row-major layout: * `stride[0]()` might return 20 (moving one row requires skipping 20 elements). * `stride[1]()` might return 1 (moving one column requires skipping 1 element). **Returns:** The memory stride of the tensor along the specified dimension as an integer. ### `dim` `dim(self, idx: Int) -> Int` Returns the runtime dimension size of the tensor along the specified axis. Unlike the parameterized `dim[idx]()` overload below, which takes the dimension index as a compile-time parameter, this method takes the dimension index as a runtime value. **Args:** * ​idx (`Int`): The dimension index to query (0-based). For example, in a 3D tensor with shape `[10, 20, 30]`: * `dim(0)` returns 10 (first dimension). * `dim(1)` returns 20 (second dimension). * `dim(2)` returns 30 (third dimension). **Returns:** The dimension of the tensor along the specified axis as an integer. `dim[idx: Int](self) -> Int` Returns the dimension size of the tensor along the specified axis. Unlike the static `shape` method, this instance method provides access to the tensor's actual dimension sizes. If the dimension is unknown, the runtime layout is used to get the dimension size. Performance: * For static dimensions known at compile time, prefer the static `shape` method when possible for better performance. Notes: * This method works with both static and dynamic dimensions. * For tensors with masked or partial views, this returns the actual size of the view, not the original tensor. **Constraints:** * Only works with tensors that have depth-1 layouts (no nested shapes). **Parameters:** * ​idx (`Int`): The dimension index to query (0-based). For example, in a 3D tensor with shape `[10, 20, 30]`: * `dim[0]()` returns 10 (first dimension). * `dim[1]()` returns 20 (second dimension). * `dim[2]()` returns 30 (third dimension). **Returns:** The size of the tensor along the specified dimension as an integer. ### `coalesce` `coalesce(self) -> LayoutTensor[dtype, coalesce(layout, False), origin, address_space=address_space, element_layout=element_layout]` Creates a tensor with a coalesced memory layout from this tensor. Coalescing a tensor's layout means reorganizing its memory representation to be as contiguous as possible, which can improve memory access patterns and performance. This operation does not move or copy data; it only changes how the same memory is interpreted. Performance: * Coalesced layouts typically provide better cache utilization and memory access patterns. * This operation is zero-cost at runtime as it only changes the layout information, not the actual data. * Particularly beneficial before operations that perform sequential memory access or vectorized operations.
Notes: * The coalesced tensor shares the same memory as the original tensor, so modifications to one will affect the other. * The shape of the tensor remains the same, only the stride information is optimized. * For already optimally coalesced tensors, this operation has no effect. **Returns:** A tensor with the same data but with a coalesced memory layout. The returned tensor has type `LayoutTensor` with the same dtype but with a coalesced layout. ### `tile_type` `static tile_type[*tile_sizes: Int](*tile_coords: Int) -> LayoutTensor[dtype, _compute_tile_layout[*::Int]()[0], origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _tile_is_masked[::Layout,*::Int](), alignment=alignment]` Returns the type of a tile view of the tensor with the specified dimensions and coordinates. **Parameters:** * ​\*tile\_sizes (`Int`): The dimensions of each tile along each axis of the tensor. **Args:** * ​\*tile\_coords (`Int`): The coordinates of the specific tile to extract. **Returns:** The type of a view into the original tensor representing the specified tile. ### `tile` `tile[*tile_sizes: Int](self, *tile_coords: Int) -> LayoutTensor[dtype, _compute_tile_layout[*::Int]()[0], origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _tile_is_masked[::Layout,*::Int](), alignment=alignment]` Extract a tile (sub-tensor) from this tensor with specified dimensions and position. Tiling is a fundamental operation for high-performance tensor computations that divides a tensor into smaller blocks for better cache locality and parallelism. This method extracts a specific tile at the given coordinates without copying data. Example: For a 4×4 tensor with values: ``` [1 2 3 4] [2 3 4 5] [5 4 3 2] [1 1 1 1] ``` `tile[2, 2](1, 0)` will extract the tile: ``` [5 4] [1 1] ``` Performance: * Creates a view without copying data, making it very efficient. * Optimized for both static and dynamic layouts with different code paths. * Properly handles edge cases where tiles may be partially outside the tensor. * Maintains stride information for efficient memory access within the tile. Notes: * The resulting tile is a view into the original tensor, so modifications to the tile will affect the original tensor. * For tiles at the edges of the tensor, the actual dimensions may be smaller than the requested tile\_sizes if masking is enabled. * The implementation automatically selects between static and dynamic tiling based on the tensor's layout properties. **Parameters:** * ​\*tile\_sizes (`Int`): The dimensions of each tile along each axis of the tensor. For example, in a 2D tensor, `tile[32, 32]` creates 32×32 tiles. **Args:** * ​\*tile\_coords (`Int`): The coordinates of the specific tile to extract. For example, `tile[32, 32](1, 2)` extracts the tile at position (1, 2) in the grid of 32×32 tiles. **Returns:** A view into the original tensor representing the specified tile.
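A minimal runnable sketch of `tile`, using the same `InlineArray`-backed setup as the `fill` example later in this document (shapes and values are illustrative). It highlights that a tile is a view, so writes through the tile land in the original tensor:

```mojo
from layout import Layout, LayoutTensor

def main():
    var storage = InlineArray[Float32, 4 * 4](uninitialized=True)
    var t = LayoutTensor[
        DType.float32,
        Layout.row_major(4, 4),
    ](storage).fill(0.0)

    # Tile grid position (1, 0) of 2×2 tiles covers rows 2-3, columns 0-1.
    var block = t.tile[2, 2](1, 0)
    _ = block.fill(7.0)  # writes through the view into `t`
    print(t)
```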
### `tile_with_offset` `tile_with_offset[*tile_sizes: Int](self, *tile_coords: Int, out result: Tuple[LayoutTensor[dtype, _compute_tile_layout[*::Int]()[0], origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _tile_is_masked[::Layout,*::Int](), alignment=alignment], IndexList[len[::Sized](flatten[::Origin[::Bool(layout.shape)), element_type=layout_int_type], SIMD[linear_idx_type, 1]])` Similar to `tile`, but also returns the corner coordinates of the tile as well as the offset. **Parameters:** * ​\*tile\_sizes (`Int`): The dimensions of each tile along each axis of the tensor. **Args:** * ​\*tile\_coords (`Int`): The coordinates of the specific tile to extract. **Returns:** A tuple containing: * The extracted tile as a `LayoutTensor`. * The corner coordinates of the tile. * The offset of the tile. ### `tiled_iterator` `tiled_iterator[*tile_sizes: Int, *, axis: Int = 0](self, *tile_coords: Int) -> LayoutTensorIter[dtype, _compute_tile_layout[*::Int]()[0], origin, address_space=address_space, axis=OptionalReg[Int]({:_stdlib::_builtin::_int::_Int axis, 0}), layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _tile_is_masked[::Layout,*::Int]()]` Create an iterator that traverses tiles along a specified axis. This method creates an iterator that allows efficient traversal of tiles within a tensor. The iterator starts at the specified tile coordinates and can move along the specified axis, providing access to consecutive tiles. Performance: * Provides efficient sequential access to tiles with good cache locality. * Optimized for both static and dynamic layouts with different code paths. * Maintains stride information for efficient memory access within each tile. * Properly handles edge cases where tiles may be partially outside the tensor. Notes: * The iterator provides views into the original tensor, so modifications through the iterator will affect the original tensor. * For tiles at the edges of the tensor, the actual dimensions may be smaller than the requested tile\_sizes if masking is enabled. * The iterator is not circular by default, meaning it will not wrap around when reaching the end of the tensor along the iteration axis. * The implementation automatically selects between static and dynamic tiling based on the tensor's layout properties. Example: ```mojo var iter = tensor.tiled_iterator[16, 16, axis=0](0, 0) for i in range(num_tiles_along_axis): var tile = iter.get() # Process tile iter.next() ``` **Parameters:** * ​\*tile\_sizes (`Int`): The dimensions of each tile along each axis of the tensor. For example, in a 2D tensor, `tiled_iterator[32, 32]` creates an iterator over 32×32 tiles. * ​axis (`Int`): The axis along which the iterator will traverse. Default is 0 (first dimension). For example, with axis=0, the iterator will move vertically through tiles. **Args:** * ​\*tile\_coords (`Int`): The starting coordinates of the tile where iteration begins. **Returns:** A `LayoutTensorIter` that can be used to traverse tiles along the specified axis. ### `split` `split[count: Int, axis: Int = 0](self) -> StaticTuple[LayoutTensor[dtype, _compute_tile_layout[::Int,::Int]()[0], origin, address_space=address_space, element_layout=element_layout, alignment=alignment], count]` Split the `LayoutTensor` along an axis and return a `StaticTuple` of `LayoutTensor`. **Parameters:** * ​count (`Int`): The number of partitions to split the tensor into.
* ​axis (`Int`): The axis along which to split the tensor. **Returns:** A `StaticTuple` containing `count` `LayoutTensors`, each representing an equal-sized partition of the original tensor along the specified axis. Each partition has the same data type and memory characteristics as the original tensor, but with a reduced size along the split axis. `split[axis: Int = 0, alignment: Int = 1](self, count: Int, idx: Int) -> LayoutTensor[dtype, layout.make_shape_unknown[::Int](), origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Retrieve a specific partition of the tensor after splitting along a specified axis. This method divides the tensor into 'count' partitions along the specified axis and returns the partition at index 'idx'. The partitioning is done with alignment considerations to optimize memory access patterns. Unlike the overloaded split method that returns all partitions, this method returns only a single partition, making it more memory-efficient for cases where only one partition is needed at a time. Notes: * The shape along the split axis becomes unknown at compile time. * Only works with dimensions that have statically known sizes. * The last partition may be smaller than others if the dimension size is not evenly divisible by `count`. * Partition sizes are aligned up to the specified alignment value, which can improve performance for vectorized operations. Performance: * Uses aligned partitioning to improve memory access patterns. * Avoids creating all partitions in memory, reducing memory usage. * Maintains the original tensor's stride information for efficient element access within the partition. **Constraints:** * The dimension being split must have a statically known size. * Cannot split dimensions with unknown or dynamic sizes. **Parameters:** * ​axis (`Int`): The axis along which to split the tensor. Defaults to 0 (first dimension). * ​alignment (`Int`): Memory alignment value for the partition size. Defaults to 1. **Args:** * ​count (`Int`): The number of partitions to divide the tensor into. * ​idx (`Int`): The index of the partition to return (0-based). **Returns:** A `LayoutTensor` representing the requested partition. ### `distribute_type` `static distribute_type[threads_layout: Layout, axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1})]() -> LayoutTensor[dtype, _compute_distribute_layout[::Layout,::Layout,::OptionalReg[::Int]]()[1], origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _distribute_is_masked[::Layout,::Layout,::OptionalReg[::Int]]() if is_nvidia_gpu() else False]` Returns the type of the distributed tensor. **Parameters:** * ​threads\_layout (`Layout`): The layout of the threads. * ​axis (`OptionalReg[Int]`): The axis to distribute along. **Returns:** The type of the distributed tensor.
### `distribute` `distribute[threads_layout: Layout, axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), submode_axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1})](self, thread_id: UInt) -> LayoutTensor[dtype, _compute_distribute_layout[::Layout,::Layout,::OptionalReg[::Int]]()[1], origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _distribute_is_masked[::Layout,::Layout,::OptionalReg[::Int]]() if is_nvidia_gpu() else False]` Distribute tensor workload across multiple threads in a structured pattern. This method partitions a tensor across multiple threads for parallel processing, assigning each thread a specific portion of the tensor. The distribution pattern is determined by the threads\_layout parameter, which defines the logical arrangement of threads. Example: For a 4×4 tensor distributed across 4 threads in a 2×2 grid: * Thread 0 might get the top-left quadrant * Thread 1 might get the top-right quadrant * Thread 2 might get the bottom-left quadrant * Thread 3 might get the bottom-right quadrant If axis=0 is specified with the same setup: * Thread 0 and Thread 2 would get the same data (left half) * Thread 1 and Thread 3 would get the same data (right half) Performance: * Creates a view without copying data, making it very efficient for parallel processing. * The swizzle parameter can significantly improve cache locality and memory access patterns. * Optimized for both static and dynamic layouts with different code paths. Notes: * The resulting tensor is a view into the original tensor, so modifications will affect the original tensor. * For optimal performance, the `threads_layout` should match the hardware's thread organization (e.g., warp/wavefront size and shape). * When using swizzling, carefully consider the memory access patterns to avoid cache thrashing or bank conflicts. * This function is particularly useful for GPU programming where threads are organized in structured grids. **Constraints:** * For dynamic layouts, the shape must be known at runtime and the threads\_layout must be fully static. **Parameters:** * ​threads\_layout (`Layout`): Defines the logical arrangement of threads (e.g., 2×2 grid of 4 threads). This layout determines how the tensor is partitioned. * ​axis (`OptionalReg[Int]`): Optional. If specified, restricts distribution to only this axis. For example, with axis=0 in a 2D thread layout, threads that differ only in their second coordinate will receive the same data. * ​swizzle (`OptionalReg[Swizzle]`): Optional. A function that remaps the distribution pattern to improve memory access patterns or cache locality. * ​submode\_axis (`OptionalReg[Int]`): Optional. Specifies an axis for specialized distribution modes. **Args:** * ​thread\_id (`UInt`): The ID of the current thread (0-based). **Returns:** A view into the original tensor representing the portion assigned to this thread. 
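To make the distribution pattern concrete, here is a hedged kernel-side sketch that partitions a 4×4 tensor across a 2×2 logical grid of threads, following the same fragment pattern as the `copy_from_async` example later in this document. The kernel name and shapes are illustrative, not part of the API:

```mojo
from gpu import thread_idx
from layout import Layout, LayoutTensor

fn zero_fragments(
    tensor: LayoutTensor[
        DType.float32, Layout.row_major(4, 4), MutableAnyOrigin
    ]
):
    # Each of the 4 threads in the 2×2 logical grid receives a 2×2
    # fragment. The fragment is a view, so writes land in `tensor`.
    alias thread_layout = Layout.row_major(2, 2)
    var fragment = tensor.distribute[thread_layout](thread_idx.x)
    _ = fragment.fill(0.0)
```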
### `distribute_with_offset` `distribute_with_offset[threads_layout: Layout, axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), submode_axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1})](self, thread_id: UInt, out result: Tuple[LayoutTensor[dtype, _compute_distribute_layout[::Layout,::Layout,::OptionalReg[::Int]]()[1], origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _distribute_is_masked[::Layout,::Layout,::OptionalReg[::Int]]() if is_nvidia_gpu() else False], IndexList[threads_layout.rank(), element_type=layout_int_type], SIMD[linear_idx_type, 1]])` Similar to `distribute`, but also returns the corner coordinates of the tile as well as the offset. **Parameters:** * ​threads\_layout (`Layout`): The layout of the threads. * ​axis (`OptionalReg[Int]`): The axis to distribute along. * ​swizzle (`OptionalReg[Swizzle]`): An optional swizzle function. * ​submode\_axis (`OptionalReg[Int]`): An optional submode axis. **Args:** * ​thread\_id (`UInt`): The ID of the current thread (0-based). **Returns:** A tuple containing: * The distributed tensor. * The corner coordinates of the tile. * The offset of the tile. ### `vectorize_type` `static vectorize_type[*vector_shape: Int]() -> LayoutTensor[dtype, coalesce(_compute_tile_layout[*::Int]()[1], True), origin, address_space=address_space, element_layout=_divide_tiles[*::Int]()[0], layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]` Returns the type of a vectorized view of the tensor with specified vector dimensions. **Parameters:** * ​\*vector\_shape (`Int`): The dimensions of each vector unit along each axis of the tensor. **Returns:** The type of a view into the original tensor with a vectorized layout. ### `vectorize` `vectorize[*vector_shape: Int](self) -> LayoutTensor[dtype, coalesce(_compute_tile_layout[*::Int]()[1], True), origin, address_space=address_space, element_layout=_divide_tiles[*::Int]()[0], layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]` Reshape a tensor into a vectorized form for efficient SIMD operations. This method transforms the tensor's logical layout to enable efficient vectorized processing, treating blocks of elements as vector units. The transformation is particularly useful for SIMD (Single Instruction Multiple Data) operations and hardware acceleration. Example: For a 16×16 tensor, `vectorize[4, 4]` will produce a 4×4 tensor where each element represents a 4×4 block from the original tensor. Performance: * Creates a view without copying data, making it very efficient. * Enables hardware-accelerated vector operations on blocks of data. * Improves cache locality by grouping related elements together. * Particularly beneficial for operations that can leverage SIMD instructions. Notes: * The tensor dimensions must be divisible by the corresponding vector dimensions. * For dimensions with unknown size, the corresponding vector dimension must be 1. * The resulting tensor has the same data but a different logical organization. * Modifications to the vectorized tensor affect the original tensor. * This transformation is particularly useful for GPU and vector processor optimizations. **Constraints:** * Each tensor dimension must be divisible by the corresponding vector dimension. * Vector dimensions must be smaller than or equal to the corresponding tensor dimensions. 
* For dimensions with unknown size, the vector dimension must be 1. **Parameters:** * ​\*vector\_shape (`Int`): The dimensions of each vector unit along each axis of the tensor. For example, in a 2D tensor, `vectorize[4, 4]` treats 4×4 blocks as vector units. **Returns:** A view of the tensor with a vectorized layout, where each element in the resulting tensor represents a vector of elements from the original tensor. ### `slice` `slice[d0_slice: Slice, d1_slice: Slice](self) -> LayoutTensor[dtype, _compute_slice_layout(d0_slice, d1_slice), origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Extract a slice from a rank-2 tensor using slice objects. This method creates a view into a subset of the tensor defined by the slice specifications for each dimension. The slice is a continuous region of the tensor with no gaps (step size must be 1). Example: For a 4×4 tensor `t` with values: ``` [1 2 3 4] [5 6 7 8] [9 10 11 12] [13 14 15 16] ``` ```mojo t.slice[Slice(1, 3), Slice(0, 2)]() ``` will extract: ``` [5 6] [9 10] ``` Performance: * Creates a view without copying data, making it very efficient. * Maintains the original tensor's stride information for efficient memory access. * Zero-cost abstraction at runtime when used with compile-time constant slices. Notes: * The slice is a view into the original tensor, so modifications to the slice will affect the original tensor. * Only supports rank-2 tensors. For higher-rank tensors, use the overloaded version with slice indices. * The step size must be 1 (no gaps allowed in the slice). * Slice bounds are not checked at runtime; accessing out-of-bounds indices will result in undefined behavior. **Constraints:** * Only works with rank-2 tensors. **Parameters:** * ​d0\_slice (`Slice`): Slice specification for the first dimension (rows). Defines the start and end indices for the slice along this dimension. * ​d1\_slice (`Slice`): Slice specification for the second dimension (columns). Defines the start and end indices for the slice along this dimension. **Returns:** A view into the original tensor representing the specified slice. `slice[d0_slice: Slice, d1_slice: Slice, slice_indices: IndexList[2], __offset_dims: Int = (layout.rank() + -2)](self, offsets: IndexList[__offset_dims]) -> LayoutTensor[dtype, _compute_slice_layout(d0_slice, d1_slice, slice_indices.__getitem__[::Indexer](0), slice_indices.__getitem__[::Indexer](1)), origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Extract a 2D slice from a higher-rank tensor at specific indices. This method creates a view into a 2D subset of a higher-rank tensor by: 1. Selecting two dimensions to slice using the slice\_indices parameter. 2. Applying slice specifications to those dimensions. 3. Using fixed offsets for all other dimensions. Example: Given a 3×4×5 tensor, `t`, the following example extracts a 2×2 slice from dimensions 0 and 2, with dimension 1 fixed at index 1. ```mojo var s = t.slice[Slice(1, 3), Slice(0, 2), IndexList[2](0, 2)](1) ``` Performance: * Creates a view without copying data, making it very efficient. * Maintains the original tensor's stride information for efficient memory access. * Zero-cost abstraction at runtime when used with compile-time constant slices. Notes: * The slice is a view into the original tensor, so modifications to the slice will affect the original tensor.
* The slice indices must be ordered (e.g., \[0, 2] is valid, \[2, 0] is not). * The step size must be 1 (no gaps allowed in the slice). * Slice bounds are not checked at runtime; accessing out-of-bounds indices will result in undefined behavior. **Constraints:** * Slice step size must be 1 (no gaps). * Slice indices must be ordered (ascending). * Tensor rank must be at least 2. **Parameters:** * ​d0\_slice (`Slice`): Slice specification for the first selected dimension. * ​d1\_slice (`Slice`): Slice specification for the second selected dimension. * ​slice\_indices (`IndexList[2]`): Indices of the two dimensions to slice (must be ordered). * ​\_\_offset\_dims (`Int`): Internal parameter representing number of fixed dimensions. **Args:** * ​offsets (`IndexList[__offset_dims]`): Fixed index values for all dimensions not being sliced. **Returns:** A 2D view into the original tensor representing the specified slice. ### `slice_1d` `slice_1d[d0_slice: Slice, slice_indices: IndexList[1], __offset_dims: Int = (layout.rank() + -1)](self, offsets: IndexList[__offset_dims]) -> LayoutTensor[dtype, _compute_slice_layout(d0_slice, slice_indices.__getitem__[::Indexer](0)), origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Extract a 1D slice from a higher-rank tensor at a specific index. This method creates a view into a 1D subset of a higher-rank tensor by: 1. Selecting one dimension to slice using the slice\_indices parameter 2. Applying a slice specification to that dimension 3. Using fixed offsets for all other dimensions Example: For a 3×4×5 tensor, `t`, the following example extracts a 1D slice from dimension 0, with dimensions 1 and 2 fixed at indices 1 and 2: ```mojo t.slice_1d[Slice(1, 3), IndexList[1](0)](1, 2) ``` Performance: * Creates a view without copying data, making it very efficient. * Maintains the original tensor's stride information for efficient memory access. * Zero-cost abstraction at runtime when used with compile-time constant slices. Notes: * The slice is a view into the original tensor, so modifications to the slice will affect the original tensor. * The step size must be 1 (no gaps allowed in the slice). * Slice bounds are not checked at runtime; accessing out-of-bounds indices will result in undefined behavior. * This function exists as a workaround for compiler limitations with overloading. **Constraints:** * Slice step size must be 1 (no gaps). * Tensor rank must be at least 1. **Parameters:** * ​d0\_slice (`Slice`): Slice specification for the selected dimension. * ​slice\_indices (`IndexList[1]`): Index of the dimension to slice. * ​\_\_offset\_dims (`Int`): Internal parameter representing number of fixed dimensions. **Args:** * ​offsets (`IndexList[__offset_dims]`): Fixed index values for all dimensions not being sliced. **Returns:** A 1D view into the original tensor representing the specified slice. ### `transpose` `transpose[M: Int = shape[::Int](), N: Int = shape[::Int]()](self) -> LayoutTensor[dtype, composition(layout, __init__[::Origin[::Bool(__init__[::Origin[::Bool(IntTuple(N), IntTuple(M), Tuple()), __init__[::Origin[::Bool(IntTuple(M), IntTuple(1), Tuple()))), origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Create a transposed view of a rank-2 tensor. This method creates a view of the tensor with its dimensions swapped, effectively converting rows to columns and columns to rows.
The transposition is performed without copying data, by adjusting the tensor's layout information. Example: For a 2×3 tensor with values: ``` [1 2 3] [4 5 6] ``` `transpose()` will produce a 3×2 tensor: ``` [1 4] [2 5] [3 6] ``` Performance: * Creates a view without copying data, making it very efficient. * The operation is zero-cost at runtime as it only changes the layout information. * Memory access patterns may be less efficient in the transposed view due to non-contiguous memory access, especially for row-major storage. Notes: * The transposed tensor shares the same memory as the original tensor, so modifications to one will affect the other. * Only works with rank-2 tensors. * For optimal performance when repeatedly accessing the transposed data, consider creating a physical copy with the transposed layout. **Constraints:** * Only works with rank-2 tensors. **Parameters:** * ​M (`Int`): The size of the first dimension (rows) of the original tensor. Defaults to the static shape value of the first dimension. * ​N (`Int`): The size of the second dimension (columns) of the original tensor. Defaults to the static shape value of the second dimension. **Returns:** A view of the tensor with dimensions transposed (rows become columns and vice versa). ### `reshape` `reshape[dst_layout: Layout](self) -> LayoutTensor[dtype, dst_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Create a view of the tensor with a different shape. This method creates a view of the tensor with a new shape, without changing the underlying data. The total number of elements must remain the same. Example: Given a 2×6 row-major tensor, `reshape[Layout.col_major(3, 4)]()` produces a 3×4 tensor with the same elements in column-major order. Performance: * Creates a view without copying data, making it very efficient. * The operation is zero-cost at runtime as it only changes the layout information. * Memory access patterns may change, potentially affecting performance depending on the original and target layouts. Notes: * The reshaped tensor shares the same memory as the original tensor, so modifications to one will affect the other. * The total number of elements must remain the same after reshaping. * The reshape operation assumes a row-major (C-style) memory layout. * For tensors with complex strides or non-contiguous memory, reshaping may not produce the expected results. * Masked tensors cannot be reshaped. **Constraints:** * Cannot reshape masked tensors. * The total number of elements must be the same in both layouts. **Parameters:** * ​dst\_layout (`Layout`): The target layout for the reshaped tensor. Must have the same total number of elements as the original tensor. **Returns:** A view of the tensor with the new shape specified by dst\_layout. ### `composition` `composition[rhs_layout: Layout, dst_layout: Layout = composition(layout, rhs_layout)](self) -> LayoutTensor[dtype, dst_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Create a view of the tensor with a composed layout. This method creates a view of the tensor with a new layout that is the composition of the original layout with another layout. Layout composition allows for complex transformations of the tensor's logical structure without copying data. 
Example: For a 4×4 tensor with a standard row-major layout, composing with a layout that represents a 2×2 tiling would result in a tensor that logically views the data as 2×2 blocks. Performance: * Creates a view without copying data, making it very efficient. * The operation is zero-cost at runtime as it only changes the layout information. * Can be used to optimize memory access patterns for specific algorithms. Notes: * The composed tensor shares the same memory as the original tensor, so modifications to one will affect the other. * Layout composition is a powerful tool for expressing complex data transformations like tiling, transposition, and reshaping in a unified framework. * Understanding the mathematical properties of layout composition is important for correctly using this function. **Constraints:** * The layouts must be compatible for composition. * The total number of elements must remain the same after composition. **Parameters:** * ​rhs\_layout (`Layout`): The layout to compose with the tensor's current layout. * ​dst\_layout (`Layout`): The resulting layout after composition. Defaults to the composition of the tensor's layout with rhs\_layout. **Returns:** A view of the tensor with the composed layout. ### `distance` `distance(self, addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space]) -> SIMD[linear_idx_type, 1]` Calculate the element-wise distance between this tensor's pointer and another pointer. This method computes the number of elements (not bytes) between the tensor's pointer and the provided address. This is useful for determining offsets within a larger memory allocation or for pointer arithmetic operations. Example: If `tensor.ptr` points to an element at index 100 in a buffer, and `addr` points to element at index 50, then `distance(addr)` returns 50. Performance: * This is a lightweight operation that only involves pointer arithmetic. * The operation is optimized based on the address space, using smaller integer types for shared memory to improve efficiency. Notes: * The distance is calculated in elements, not bytes. * The result can be positive or negative depending on the relative positions of the pointers. * This function is particularly useful for GPU programming where understanding memory offsets is critical for performance. * Care should be taken when using this with pointers from different allocations, as the result would be meaningless. **Args:** * ​addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space]`): The target pointer to calculate the distance to. **Returns:** The number of elements between this tensor's pointer and the provided address. The result is of type `linear_idx_type`. `distance[_layout: Layout, _uint_dtype: DType = _get_unsigned_type(_layout, address_space)](self, src: LayoutTensor[dtype, _layout, origin, address_space=address_space]) -> SIMD[_uint_dtype, 1]` Calculate the element-wise distance between this tensor and another tensor. This method computes the number of elements (not bytes) between this tensor's pointer and another tensor's pointer. This is useful for determining the relative positions of tensors within a larger memory allocation. Example: If tensor1 points to element at index 100 in a buffer, and tensor2 points to element at index 50, then `tensor1.distance(tensor2)` would return 50. Performance: * This is a lightweight operation that only involves pointer arithmetic. * The operation is optimized based on the address space and layout, using appropriate integer types for efficiency.
Notes: * The distance is calculated in elements, not bytes. * The result can be positive or negative depending on the relative positions of the tensors. * This function is particularly useful for GPU programming where understanding memory offsets is critical for performance. * Both tensors must be in the same address space for the result to be meaningful. * This overload is more type-safe than the pointer-based version as it ensures the tensors have compatible data types and address spaces. **Parameters:** * ​\_layout (`Layout`): The layout of the source tensor. * ​\_uint\_dtype (`DType`): The unsigned integer type to use for the result. Automatically determined based on the layout and address space. **Args:** * ​src (`LayoutTensor[dtype, _layout, origin, address_space=address_space]`): The source tensor to calculate the distance to. **Returns:** The number of elements between this tensor's pointer and the source tensor's pointer. The result is of type \_uint\_dtype. ### `copy_from` `copy_from(self, other: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Copy data from another tensor to this tensor. This method performs an element-by-element copy from the source tensor to this tensor, respecting the layouts of both tensors. The copy operation handles different memory layouts correctly, ensuring that elements are copied to their proper positions regardless of how the data is arranged in memory. Example: ```mojo from layout import LayoutTensor, Layout var src_storage = InlineArray[Float32, 2 * 3](uninitialized=True) var dst_storage = InlineArray[Float32, 3 * 2](uninitialized=True) var src = LayoutTensor[ DType.float32, Layout([2, 3]), ](src_storage).fill(1.0) var dst = LayoutTensor[ DType.float32, Layout([3, 2]), ](dst_storage) dst.copy_from(src) # Copies all elements from src to dst ``` Performance: * Performs element-by-element copying, which may be less efficient than vectorized or bulk memory operations. * The copy respects the memory layout of both tensors, which may involve non-contiguous memory access patterns. * For optimal performance with large tensors, consider using specialized copy functions that can leverage hardware acceleration. Notes: * Both tensors must have statically known shapes. * The total number of elements must be the same in both tensors. * The element sizes must match between the tensors. * This function handles different memory layouts correctly, making it suitable for copying between tensors with different shapes or strides. * The copy is performed element by element, not as a bulk memory copy. **Args:** * ​other (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor to copy data from. Must have the same total number of elements as this tensor.
### `copy_from_async` `copy_from_async[is_masked: Bool = False, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), fill: Fill = Fill(0), eviction_policy: CacheEviction = CacheEviction(0)](self, src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_idx_bound: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](0), base_offset: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](0))` Asynchronously copy data from another tensor to this tensor using GPU hardware. This method performs an asynchronous copy from the source tensor to this tensor using GPU hardware acceleration. It's specifically designed for copying data from global memory to shared memory in GPU kernels, leveraging hardware-specific asynchronous copy mechanisms for improved performance. For optimal performance, you need to arrange the copy correctly. Use the [`distribute()`](/mojo/kernels/layout/layout_tensor/LayoutTensor/#distribute) method to create thread-local fragments of the source and destination tensors, assigning each thread one or more elements to copy. Optionally, use the [`vectorize()`](/mojo/kernels/layout/layout_tensor/LayoutTensor/#vectorize) method to get vectorized views of both tensors before calling `distribute()`. This allows each thread to copy multiple elements of the tensor. For example: ```mojo var fragment = tensor.vectorize[1, simd_width]().distribute[ thread_layout ](thread_id) ``` The copy operation is asynchronous, so you must call [`async_copy_wait_all()`](/mojo/stdlib/gpu/memory/async_copy_wait_all/) or [`async_copy_wait_group()`](/mojo/stdlib/gpu/memory/async_copy_wait_group/) to ensure the copy has completed before using the data. Example: ```mojo from layout import LayoutTensor, Layout from gpu import thread_idx, block_idx, block_dim, global_idx, grid_dim from gpu.memory import AddressSpace, async_copy_wait_all alias dtype = DType.float32 alias in_size = 128 alias block_size = 16 alias num_blocks = in_size // block_size alias input_layout = Layout.row_major(in_size, in_size) fn kernel(tensor: LayoutTensor[dtype, input_layout, MutableAnyOrigin]): # extract a tile from the input tensor. var global_tile = tensor.tile[block_size, block_size](block_idx.x, block_idx.y) # allocate a shared memory tile alias tile_layout = Layout.row_major(block_size, block_size) var shared_tile = LayoutTensor[ dtype, tile_layout, MutableAnyOrigin, address_space = AddressSpace.SHARED, ].stack_allocation() # Create per-thread tile fragments for copying var tid = thread_idx.y + thread_idx.x * block_dim.x alias thread_layout = Layout.row_major(block_size, block_size) var global_fragment = global_tile.distribute[thread_layout](tid) var shared_fragment = shared_tile.distribute[thread_layout](tid) # async copy to shared memory shared_fragment.copy_from_async(global_fragment) async_copy_wait_all() # ... do something with the shared tile ``` Performance: * Supports vectorized copies for 4, 8, or 16-byte elements for better throughput. * Can bypass L1 cache with appropriate eviction policies for specific access patterns. * Swizzling can improve memory access patterns and reduce bank conflicts. Notes: * For vectorized copies, both tensors must have contiguous element layouts. * Asynchronous copies allow computation to overlap with memory transfers. * A synchronization barrier is required before using the copied data.
**Constraints:**

* Destination must be in shared memory.
* Source and destination data types must match.
* Element size must be 4, 8, or 16 bytes.
* Destination tensor must have a static layout.

**Parameters:**

* is\_masked (`Bool`): Whether to perform a masked copy, where elements outside the `src_idx_bound` are not copied or are filled with zeros.
* swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns.
* fill (`Fill`): Fill policy for elements that are not copied (only used with masked copies).
* eviction\_policy (`CacheEviction`): Cache eviction policy for the source data.

**Args:**

* src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor to copy data from.
* src\_idx\_bound (`SIMD[linear_idx_type, 1]`): For masked copies, the upper bound index for valid source elements.
* base\_offset (`SIMD[linear_idx_type, 1]`): Base offset for swizzling calculations.

### `fill`

`fill[*, use_runtime_layout: Bool = (layout.all_dims_known() ^ True) if (layout.all_dims_known() ^ True) else (layout.size() > ...)](self, val: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`

Fill the entire tensor with a single value.

This method sets all elements of the tensor to the specified value. It works with both statically and dynamically shaped tensors. For statically known layouts, the fill operation is unrolled at compile time. For dynamic layouts, a runtime loop is used. No vectorization is applied, so performance may be suboptimal for large tensors. Consider using hardware-specific fill operations for better performance with large tensors.

This method can be used with tensors of any rank and shape. The fill operation respects the tensor's layout, filling all elements regardless of how they are arranged in memory. For tensors with `element_layout`, all elements within each logical element are filled with the same value.

Example:

```mojo
from layout import Layout, LayoutTensor

def main():
    var storage = InlineArray[Float32, 3 * 4](uninitialized=True)
    var tensor = LayoutTensor[
        DType.float32,
        Layout([3, 4]),
    ](storage).fill(0.0)
    print(tensor)
```

If not using method chaining, you can either reassign the result to the tensor variable, or assign the result to the discard pattern (`_`) to avoid warnings about an unused value:

```mojo
tensor = tensor.fill(0.0)
# or
_ = tensor.fill(0.0)
```

**Parameters:**

* use\_runtime\_layout (`Bool`): Whether to use the runtime layout for filling. Defaults to `True` if the layout is not statically known. If loop bounds are too large, using the runtime layout avoids long compilation times.

**Args:**

* val (`SIMD[dtype, 1]`): The value to fill the tensor with. Must be of the same data type as the tensor.

**Returns:**

The tensor itself (self), allowing for method chaining.

### `__str__`

`__str__(self) -> String`

Convert the tensor to a string representation.

This method converts the tensor to a human-readable string representation by writing its contents to a string. It delegates to the `write_to` method, which formats the tensor appropriately based on its rank and shape.

**Returns:**

A string representation of the tensor.
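As a minimal sketch of capturing that representation (assuming the `String` constructor accepts any `Stringable` value):

```mojo
from layout import Layout, LayoutTensor

def main():
    var storage = InlineArray[Float32, 2 * 2](uninitialized=True)
    var tensor = LayoutTensor[
        DType.float32,
        Layout([2, 2]),
    ](storage).fill(1.0)
    # String(...) invokes __str__, which delegates to write_to.
    var text = String(tensor)
    print(text)
```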
### `write_to` `write_to[W: Writer](self, mut writer: W)` Format and write the tensor's contents to a writer. This method formats the tensor's contents and writes them to the provided writer. For 2D tensors, it formats the output in a 2D grid. For tensors of other ranks, it prints all values in column-major coordinate order. Example: ```mojo from layout import Layout, LayoutTensor def main(): var storage = InlineArray[Float32, 2 * 3](uninitialized=True) var tensor = LayoutTensor[ DType.float32, Layout([2, 3]), ](storage).fill(1.0) print(tensor) # Internally calls `write_to` with a StringWriter ``` Output for a 2×3 tensor: ``` [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]] ``` Notes: * For 2D tensors, the output is formatted as a 2D grid with rows and columns. * For tensors of other ranks, values are printed in column-major coordinate order. * Empty tensors (size 0) produce no output. * This method is used by the `__str__` method to convert the tensor to a string. * The formatting is designed for human readability rather than parsing. * For large tensors, the output may be truncated to avoid excessive output. **Parameters:** * ​W (`Writer`): The writer type that will receive the formatted output. **Args:** * ​writer (`W`): The writer instance to write the formatted output to. --- ## LayoutTensorIter `@register_passable(trivial)` `struct LayoutTensorIter[mut: Bool, //, type: DType, layout: Layout, origin: Origin[mut], /, *, address_space: AddressSpace = AddressSpace(0), alignment: Int = alignof[::DType,__mlir_type.!kgen.target]() if is_nvidia_gpu() else 1, circular: Bool = False, axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), layout_int_type: DType = _get_index_type(address_space), linear_idx_type: DType = _get_index_type(address_space), masked: Bool = False]` Iterator for traversing a memory buffer with a specific layout. `LayoutTensorIter` provides a way to iterate through memory according to a specific layout pattern, constructing layout tensors at each position. This enables efficient traversal of multi-dimensional data structures with custom memory layouts. Notes: The returned layout tensor is NOT vectorized. Users should explicitly vectorize if needed for performance-critical operations. ## Parameters * ​mut (`Bool`): Whether the iterator allows mutation of the underlying data. * ​type (`DType`): The data type of the tensor elements. * ​layout (`Layout`): The memory layout pattern to follow during iteration. * ​origin (`Origin[mut]`): Origin tracking for memory safety. * ​address\_space (`AddressSpace`): The memory address space (`GLOBAL`, `SHARED`, etc.). * ​alignment (`Int`): Memory alignment requirement for the data. * ​circular (`Bool`): Whether iteration wraps around at boundaries. * ​axis (`OptionalReg[Int]`): Optional axis for dimension-specific operations. * ​layout\_int\_type (`DType`): Integer type used for layout indices. * ​linear\_idx\_type (`DType`): Integer type used for indexing into memory. * ​masked (`Bool`): Whether to apply bounds masking during iteration. ## Fields * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory region being iterated, with appropriate type and memory attributes. * ​offset (`SIMD[linear_idx_type, 1]`): Current offset from the base pointer, representing the iterator's position in memory. * ​stride (`SIMD[linear_idx_type, 1]`): Step size between consecutive elements or blocks in memory during iteration. 
* ​bound (`SIMD[linear_idx_type, 1]`): Upper bound of the memory region, limiting the iteration range. * ​runtime\_layout (`RuntimeLayout[layout, element_type=layout_int_type, linear_idx_type=linear_idx_type]`): Runtime representation of the layout pattern used for mapping logical indices to memory locations. * ​dimension\_bound (`SIMD[layout_int_type, 1]`): Boundary value for the current dimension when iterating along a specific axis. * ​idx (`SIMD[linear_idx_type, 1]`): Current logical index position within the iteration sequence. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Aliases ### `layout_uint_type` `alias layout_uint_type = SIMD[layout_int_type, 1]` The unsigned integer type used for layout, based on layout and address space. ### `linear_uint_type` `alias linear_uint_type = SIMD[linear_idx_type, 1]` The unsigned integer type used for indexing into memory. ## Methods ### `__init__` `__init__() -> Self` Initialize an empty iterator. Creates a default iterator with zero values, typically used as a placeholder or default value. `__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], bound: SIMD[linear_idx_type, 1], stride: SIMD[linear_idx_type, 1] = SIMD(layout.size()), offset: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> Self` Initialize an iterator with a pointer and basic parameters. Creates an iterator for a memory region with the specified bounds and stride. **Constraints:** The layout must have all dimensions known at compile time. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the beginning of the memory region. * ​bound (`SIMD[linear_idx_type, 1]`): Upper bound of the memory region. * ​stride (`SIMD[linear_idx_type, 1]`): Step size between consecutive elements (defaults to layout size). * ​offset (`SIMD[linear_idx_type, 1]`): Initial offset from the base pointer. `__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], bound: SIMD[linear_idx_type, 1], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type], stride: SIMD[linear_idx_type, 1] = SIMD(layout.size() if layout.all_dims_known() else -1), offset: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](0), dimension_bound: SIMD[layout_int_type, 1] = __init__[__mlir_type.!pop.int_literal](0), idx: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> Self` Initialize an iterator with a runtime layout. Creates an iterator with a runtime-determined layout, allowing for more flexible memory traversal patterns. **Constraints:** The runtime layout must have the same bitwidth as specified for the iterator. Circular iteration is not supported when an axis is defined. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the beginning of the memory region. * ​bound (`SIMD[linear_idx_type, 1]`): Upper bound of the memory region. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): Layout determined at runtime. * ​stride (`SIMD[linear_idx_type, 1]`): Step size between consecutive elements. * ​offset (`SIMD[linear_idx_type, 1]`): Initial offset from the base pointer. 
* dimension\_bound (`SIMD[layout_int_type, 1]`): Bound for the specified dimension when using masked iteration.
* idx (`SIMD[linear_idx_type, 1]`): Initial index position.

### `__getitem__`

`__getitem__(self) -> LayoutTensor[type, layout, origin, address_space=address_space, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`

Get the layout tensor at the current iterator position.

Operator overload that returns a layout tensor representing the data at the current position of the iterator.

**Returns:**

A layout tensor at the current iterator position.

### `__iadd__`

`__iadd__[T: Intable](mut self, rhs: T)`

Increment the iterator by an integer value.

Advances the iterator by the specified number of positions.

Notes: This function is unsafe. It omits bound checking for performance reasons. The caller must ensure the index doesn't go out of bounds.

**Parameters:**

* T (`Intable`): A type that can be converted to an integer.

**Args:**

* rhs (`T`): The number of positions to advance.

`__iadd__(mut self, rhs: SIMD[linear_idx_type, 1])`

Increment the iterator by an unsigned integer value.

Advances the iterator by the specified number of positions.

Notes: This function is unsafe. It omits bound checking for performance reasons. The caller must ensure the index doesn't go out of bounds.

**Args:**

* rhs (`SIMD[linear_idx_type, 1]`): The number of positions to advance.

### `get`

`get(self) -> LayoutTensor[type, layout, origin, address_space=address_space, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`

Get the layout tensor at the current iterator position.

Returns a layout tensor representing the data at the current position of the iterator.

**Returns:**

A tensor view at the current iterator position with the same type, layout, and memory characteristics as specified by the output parameter.

### `next`

`next[T: Intable](self, rhs: T) -> Self`

Return an iterator pointing to a position ahead by `rhs` steps.

Creates a new iterator that points `rhs` positions ahead of the current one.

**Parameters:**

* T (`Intable`): An integer-convertible type for the step size.

**Args:**

* rhs (`T`): The number of positions to advance.

**Returns:**

A new iterator pointing to the advanced position.

`next(self, rhs: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](1)) -> Self`

Return an iterator pointing to a position ahead by `rhs` steps.

Creates a new iterator that points `rhs` positions ahead of the current one.

**Args:**

* rhs (`SIMD[linear_idx_type, 1]`): The number of positions to advance (defaults to 1).

**Returns:**

A new iterator pointing to the advanced position.

### `next_unsafe`

`next_unsafe(self, rhs: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](1)) -> Self`

Return an iterator pointing to a position ahead by `rhs` steps (unsafe version).

Creates a new iterator that points `rhs` positions ahead of the current one. This is an unsafe version that omits certain checks for performance.

**Constraints:** Cannot be used with masked iterators. The caller must ensure that advancing by `rhs` does not move the iterator out of bounds.

**Args:**

* rhs (`SIMD[linear_idx_type, 1]`): The number of positions to advance (defaults to 1).

**Returns:**

A new iterator pointing to the advanced position.

### `reshape`

`reshape[dst_layout: Layout](self) -> LayoutTensorIter[type, dst_layout, origin, address_space=address_space, alignment=alignment, circular=circular, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]`

Reshape the iterator to a new layout.
This method creates a new iterator with a different layout while preserving the underlying data. The new layout must have the same total size as the original. **Constraints:** * The destination layout must have the same total size as the original. * Both layouts must be contiguous. * Both layouts must have compile-time known dimensions. **Parameters:** * ​dst\_layout (`Layout`): The target layout to reshape to. **Returns:** A new iterator with the specified layout. ### `bitcast` `bitcast[new_type: DType, *, address_space: AddressSpace = address_space, alignment: Int = alignment](self) -> LayoutTensorIter[new_type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]` Reinterpret the iterator's underlying pointer as a different data type. This method performs a bitcast operation, allowing you to view the same memory location as a different data type without copying or converting the data. **Parameters:** * ​new\_type (`DType`): The target data type to cast to. * ​address\_space (`AddressSpace`): The memory address space for the new iterator (defaults to current). * ​alignment (`Int`): Memory alignment requirement for the new iterator (defaults to current). **Returns:** A new LayoutTensorIter with the same layout but different data type. --- ## ThreadScope `@register_passable(trivial)` `struct ThreadScope` Represents the scope of thread operations in GPU programming. This struct defines the scope at which thread operations are performed, particularly for operations like tensor distribution and synchronization. It provides two main scopes: `BLOCK` and `WARP`, which correspond to different levels of thread grouping in GPU programming models. Example: ```mojo from layout.layout_tensor import copy_dram_to_sram, ThreadScope # Distribute tensor at block level (all threads in block participate) copy_dram_to_sram[layout, thread_scope=ThreadScope.BLOCK](dst, src) # Distribute tensor at warp level (only threads in same warp participate) copy_dram_to_sram[layout, thread_scope=ThreadScope.WARP](dst, src) ``` Performance: * WARP scope operations typically have lower synchronization overhead than BLOCK scope operations. * BLOCK scope operations allow coordination across all threads in a block, which is necessary for certain algorithms. * The choice of scope can significantly impact performance and correctness of parallel algorithms. Notes: * The appropriate scope depends on the specific algorithm and hardware. * WARP scope operations may be more efficient for operations that only require coordination within a warp. * BLOCK scope operations are necessary when threads from different warps need to coordinate. * The actual size of a warp or block is hardware-dependent. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `BLOCK` `alias BLOCK = ThreadScope(0)` Represents operations at the thread block level, where all threads in a block participate. ### `WARP` `alias WARP = ThreadScope(1)` Represents operations at the warp level, where only threads within the same warp participate. ## Methods ### `__init__` `@implicit` `__init__(value: Int) -> Self` Initialize a `ThreadScope` with the given integer value. **Args:** * ​value (`Int`): An integer representing the thread scope (0 for `BLOCK`, 1 for `WARP`). ### `__eq__` `__eq__(self, other: Self) -> Bool` Compare two `ThreadScope` objects for equality. 
**Args:** * ​other (`Self`): Another `ThreadScope` object to compare with. **Returns:** True if the thread scopes are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compare two `ThreadScope` objects for inequality. **Args:** * ​other (`Self`): Another `ThreadScope` object to compare with. **Returns:** True if the thread scopes are not equal, False otherwise. ### `__str__` `__str__(self) -> String` Convert the `ThreadScope` to a human-readable string representation. Aborts: If the thread scope has an invalid value. **Returns:** A string representation of the thread scope ("BLOCK" or "WARP"). ### `__int__` `__int__(self) -> Int` Convert the `ThreadScope` to an integer value. **Returns:** The integer value of the thread scope (0 for BLOCK, 1 for WARP). --- ## copy `copy[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), thread_scope: ThreadScope = ThreadScope(0), row_major: Bool = False](dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Synchronously copy data from local memory (registers) to SRAM (shared memory). This function performs a synchronous copy operation from register memory to shared memory in a GPU context, distributing the workload across multiple threads for parallel execution. It's particularly useful for transferring processed data from registers to shared memory for inter-thread communication. Performance: * Distributes the copy workload across multiple threads for parallel execution. * Can use swizzling to optimize memory access patterns and reduce bank conflicts. * Optimized for transferring data from registers to shared memory. * On AMD GPUs, the `row_major` parameter can be used to match the memory access pattern used during prefetching from DRAM to registers. Notes: * The destination tensor must be in `SHARED` address space (SRAM). * The source tensor must be in `LOCAL` address space (registers). * This function is particularly useful in GPU kernels for sharing processed data between threads in the same block. * The `row_major` parameter is specifically designed for AMD GPUs when using a prefetching pattern from DRAM to SRAM via registers. **Constraints:** * Destination tensor must be in SHARED address space. * Source tensor must be in LOCAL address space. * For optimal performance, the thread layout should match the memory access patterns of the tensors. **Parameters:** * ​thread\_layout (`Layout`): Layout defining how threads are organized for the operation. This determines how the workload is distributed among threads. * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. * ​row\_major (`Bool`): Whether to use row-major ordering for the copy operation. This is particularly relevant when prefetching from DRAM to SRAM via registers on AMD GPUs. Defaults to False. 
**Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM). * ​src (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in local memory (registers). --- ## copy_dram_to_local `copy_dram_to_local[src_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_base: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], offset: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}))` Efficiently copy data from global memory (DRAM) to registers for AMD GPUs. This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It utilizes the hardware's buffer\_load intrinsic to efficiently transfer data from global memory to registers while handling bounds checking. The function distributes the copy operation across multiple threads for maximum throughput. Notes: * The offset calculation method significantly impacts performance. Current implementation optimizes for throughput over flexibility. * This function is particularly useful for prefetching data into registers before performing computations, reducing memory access latency. **Constraints:** * Only supported on AMD GPUs. * The destination element layout size must match the SIMD width. * Source fragments must be rank 2 with known dimensions. **Parameters:** * ​src\_thread\_layout (`Layout`): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in register memory (LOCAL address space). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in global memory (DRAM) to be copied. * ​src\_base (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The original global memory tensor from which src is derived. 
This is used to construct the buffer descriptor required by AMD's `buffer_load` intrinsic. * ​offset (`OptionalReg[UInt]`): The offset in the global memory. `copy_dram_to_local[src_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_iter: LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], bounds: SIMD[uint32, 1])` Efficiently copy data from global memory (DRAM) to registers for AMD GPUs. This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It utilizes the hardware's buffer\_load intrinsic to efficiently transfer data from global memory to registers while handling bounds checking. The function distributes the copy operation across multiple threads for maximum throughput. Notes: * The offset calculation method significantly impacts performance. Current implementation optimizes for throughput over flexibility. * This function is particularly useful for prefetching data into registers before performing computations, reducing memory access latency. **Constraints:** * Only supported on AMD GPUs. * The destination element layout size must match the SIMD width. * Source fragments must be rank 2 with known dimensions. **Parameters:** * ​src\_thread\_layout (`Layout`): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in register memory (LOCAL address space). * ​src\_iter (`LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]`): The source tensor iterator. * ​bounds (`SIMD[uint32, 1]`): Bounds of the buffer, based on the ptr of the src\_iter. `copy_dram_to_local[src_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Efficiently copy data from global memory (DRAM) to registers. This function implements an optimized memory transfer operation from global memory to register memory. It distributes the copy operation across multiple threads for maximum throughput while handling bounds checking for safety. **Constraints:** * The source tensor must be in GLOBAL address space (DRAM). 
* The destination tensor must be in LOCAL address space (registers). * Both tensors must have compatible data types. **Parameters:** * ​src\_thread\_layout (`Layout`): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in register memory (LOCAL address space). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in global memory (DRAM). --- ## copy_dram_to_sram `copy_dram_to_sram[src_thread_layout: Layout, dst_thread_layout: Layout = src_thread_layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = src_thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Synchronously copy data from DRAM (global memory) to SRAM (shared memory) in a GPU context. This function performs a synchronous copy operation from global memory (DRAM) to shared memory (SRAM) in a GPU context, distributing the workload across multiple threads for parallel execution. It uses thread affinity mapping to ensure efficient work distribution and supports vectorized memory operations for optimal performance. Performance: * Distributes the copy workload across multiple threads for parallel execution. * Supports vectorized loads and stores for better memory throughput. * Can use swizzling to optimize memory access patterns and reduce bank conflicts. * Thread affinity mapping ensures efficient work distribution. * For masked tensors, performs bounds checking to handle edge cases correctly. Notes: * The source tensor must be in GENERIC or GLOBAL address space (DRAM). * The destination tensor must be in SHARED address space (SRAM). * Both tensors must have the same data type. * This function is synchronous, meaning all threads must complete their copy operations before proceeding. * For optimal performance, the thread layouts should match the memory access patterns of the tensors. * This function is particularly useful in GPU kernels for loading data from global memory to shared memory for faster access. **Constraints:** * Source and destination tensors must have the same data type. * Source tensor must be in GENERIC or GLOBAL address space. * Destination tensor must be in SHARED address space. * For non-masked tensors, the fragment sizes must match. **Parameters:** * ​src\_thread\_layout (`Layout`): Layout defining how threads are organized for the source tensor. This determines how the workload is distributed among threads. 
* ​dst\_thread\_layout (`Layout`): Layout defining how threads are organized for the destination tensor. Defaults to the same as `src_thread_layout` if not specified. * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts. * ​num\_threads (`Int`): Total number of threads participating in the copy operation. Defaults to the size of `src_thread_layout`. * ​thread\_scope (`ThreadScope`): Scope at which thread operations are performed (`BLOCK` or `WARP`). Defaults to `ThreadScope.BLOCK`, where all threads in a block participate. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in global or generic memory (DRAM). `copy_dram_to_sram[src_thread_layout: Layout, dst_thread_layout: Layout = src_thread_layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = src_thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_iter: LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], bound: Int)` Efficiently copy data from global memory (DRAM) to shared memory (SRAM) on AMD GPUs. This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It utilizes the hardware's `buffer_load` intrinsic to efficiently transfer data while handling bounds checking. The function distributes the copy operation across multiple threads for maximum throughput. **Parameters:** * ​src\_thread\_layout (`Layout`): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads. * ​dst\_thread\_layout (`Layout`): The layout used to distribute the destination tensor across threads. Defaults to the same layout as `src_thread_layout`. * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzling pattern to apply when distributing the destination tensor. This can improve memory access patterns and reduce bank conflicts. Defaults to None (no swizzling). * ​num\_threads (`Int`): The total number of threads participating in the copy operation. Defaults to the size of `src_thread_layout`. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in shared memory (SRAM). 
* ​src\_iter (`LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]`): The source tensor iterator in global memory (DRAM) to be copied. * ​bound (`Int`): The bound of the source tensor iterator. `copy_dram_to_sram[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_iter: LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], bound: Int)` Synchronously copy data from DRAM to SRAM using a unified thread layout for AMD GPUs. This is a convenience wrapper around the more general `copy_dram_to_sram()` function that uses the same layout for both source and destination tensors. It's specifically designed for AMD GPUs where the buffer\_load intrinsic requires the original base tensor. Performance: * Simplifies API usage when the same thread layout is appropriate for both source and destination tensors. * Optimized for AMD GPUs using buffer\_load intrinsics for efficient memory transfers. * Distributes the copy workload across multiple threads for parallel execution. Notes: * This function is only supported on AMD GPUs. * The source tensor must be in GENERIC or GLOBAL address space (DRAM). * The destination tensor must be in SHARED address space (SRAM). * Both tensors must have the same data type. **Parameters:** * ​thread\_layout (`Layout`): Layout defining how threads are organized for both source and destination. This determines how the workload is distributed among threads. * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts. * ​num\_threads (`Int`): Total number of threads participating in the copy operation. Defaults to the size of thread\_layout. * ​thread\_scope (`ThreadScope`): Scope at which thread operations are performed (`BLOCK` or `WARP`). Defaults to `BLOCK`, where all threads in a block participate. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM). * ​src\_iter (`LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]`): The source tensor iterator, which must be in global or generic memory (DRAM). * ​bound (`Int`): The bound of the source tensor iterator. 
`copy_dram_to_sram[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])`

Synchronously copy data from DRAM to SRAM using a unified thread layout.

This is a convenience wrapper around the more general `copy_dram_to_sram()` function that uses the same layout for both source and destination tensors. It simplifies the API for the common case where the same thread distribution pattern works well for both tensors.

Performance:

* Simplifies API usage when the same thread layout is appropriate for both source and destination tensors.
* Distributes the copy workload across multiple threads for parallel execution.
* Supports vectorized loads and stores for better memory throughput.
* Can use swizzling to optimize memory access patterns and reduce bank conflicts.

Notes:

* The source tensor must be in `GENERIC` or `GLOBAL` address space (DRAM).
* The destination tensor must be in `SHARED` address space (SRAM).
* Both tensors must have the same data type.
* This function is synchronous, meaning all threads must complete their copy operations before proceeding.

**Parameters:**

* thread\_layout (`Layout`): Layout defining how threads are organized for both source and destination. This determines how the workload is distributed among threads.
* swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts.
* num\_threads (`Int`): Total number of threads participating in the copy operation. Defaults to the size of `thread_layout`.
* thread\_scope (`ThreadScope`): Scope at which thread operations are performed (`BLOCK` or `WARP`). Defaults to `ThreadScope.BLOCK`, where all threads in a block participate.

**Args:**

* dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM).
* src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in global or generic memory (DRAM).
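For example, here is a minimal sketch of this unified-layout overload, patterned after the `copy_from_async` example earlier in this section (the `kernel` function and the tensor and block sizes are illustrative assumptions, not part of the API):

```mojo
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_dram_to_sram
from gpu import block_idx
from gpu.memory import AddressSpace

alias dtype = DType.float32
alias block_size = 16
alias input_layout = Layout.row_major(128, 128)

fn kernel(tensor: LayoutTensor[dtype, input_layout, MutableAnyOrigin]):
    # Tile of global memory assigned to this thread block.
    var global_tile = tensor.tile[block_size, block_size](
        block_idx.x, block_idx.y
    )

    # Shared-memory destination tile with a static layout.
    alias tile_layout = Layout.row_major(block_size, block_size)
    var shared_tile = LayoutTensor[
        dtype,
        tile_layout,
        MutableAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # One thread per element; the function distributes the
    # workload across the block internally.
    alias thread_layout = Layout.row_major(block_size, block_size)
    copy_dram_to_sram[thread_layout](shared_tile, global_tile)

    # ... compute using shared_tile ...
```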
--- ## copy_dram_to_sram_async `copy_dram_to_sram_async[src_thread_layout: Layout, dst_thread_layout: Layout, swizzle: Bool = False, fill: Fill = Fill(0), eviction_policy: CacheEviction = CacheEviction(0), num_threads: Int = src_thread_layout.size()](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Asynchronously copy data from DRAM (global memory) to SRAM (shared memory) in a GPU context. This function performs an asynchronous copy operation from global memory (DRAM) to shared memory (SRAM) in a GPU context, using NVIDIA's cp.async hardware mechanism. It distributes the workload across multiple threads and allows computation to overlap with memory transfers for improved performance. Performance: * Performs asynchronous transfers, allowing computation to overlap with memory operations. * Distributes the copy workload across multiple threads for parallel execution. * Can use swizzling to optimize memory access patterns and reduce bank conflicts. * Supports different cache eviction policies to optimize memory hierarchy usage. * For masked tensors, performs bounds checking to handle edge cases correctly. Notes: * This function requires NVIDIA GPUs with `cp.async` support (compute capability 8.0+). * The source tensor must be in GENERIC or GLOBAL address space (DRAM). * The destination tensor must be in SHARED address space (SRAM). * Both tensors must have the same data type. * This function is asynchronous, so you must call [`async_copy_wait_all()`](/mojo/stdlib/gpu/memory/async_copy_wait_all/) or [`async_copy_wait_group()`](/mojo/stdlib/gpu/memory/async_copy_wait_group/) to ensure the copy has completed before using the data. * The maximum size of each element that can be copied is 16 bytes. **Constraints:** * Requires NVIDIA GPUs with cp.async support (compute capability 8.0+). * Source tensor must be in `GENERIC` or `GLOBAL` address space. * Destination tensor must be in `SHARED` address space. * Both tensors must have the same data type. * Element size must be 4, 8, or 16 bytes. **Parameters:** * ​src\_thread\_layout (`Layout`): Layout defining how threads are organized for the source tensor. This determines how the workload is distributed among threads. * ​dst\_thread\_layout (`Layout`): Layout defining how threads are organized for the destination tensor. * ​swizzle (`Bool`): Whether to apply swizzling to the destination indices to reduce bank conflicts. Defaults to False. * ​fill (`Fill`): Fill policy for handling out-of-bounds accesses. Options include: * `Fill.NONE`: No special handling (default). * `Fill.ZERO`: Fill out-of-bounds elements with zeros. * ​eviction\_policy (`CacheEviction`): Cache eviction policy for the source data. Options include: * `CacheEviction.EVICT_NORMAL`: Normal eviction (default). * `CacheEviction.EVICT_FIRST`: Evict data after first use. * `CacheEviction.EVICT_LAST`: Keep data in cache until last use. * ​num\_threads (`Int`): Total number of threads participating in the copy operation. Defaults to the size of src\_thread\_layout. 
**Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in global or generic memory (DRAM). `copy_dram_to_sram_async[thread_layout: Layout, swizzle: Bool = False, masked: Bool = False, fill: Fill = Fill(0), eviction_policy: CacheEviction = CacheEviction(0), num_threads: Int = thread_layout.size()](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Asynchronous copy from DRAM to SRAM with thread affinity mapping. This function performs an asynchronous memory transfer from DRAM (global memory) to SRAM (shared memory) using the specified thread layout for distribution. Notes: This is a convenience wrapper around the more general `copy_dram_to_sram_async()` function, using the same thread layout for both source and destination. **Parameters:** * ​thread\_layout (`Layout`): The layout used to distribute work across threads. * ​swizzle (`Bool`): Whether to apply memory access swizzling for better performance. * ​masked (`Bool`): Whether the copy operation should use masking. * ​fill (`Fill`): Fill policy for uninitialized memory regions. * ​eviction\_policy (`CacheEviction`): Cache eviction policy to use during the transfer. * ​num\_threads (`Int`): Number of threads to use for the operation, defaults to the size of `thread_layout`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Destination tensor in SRAM. * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Source tensor in DRAM. --- ## copy_local_to_dram `copy_local_to_dram[dst_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Efficiently copy data from registers (LOCAL) to global memory (DRAM). This function implements a high-performance memory transfer operation from register memory to global memory. It distributes the copy operation across multiple threads for maximum throughput while handling bounds checking for safety. **Constraints:** * The source tensor must be in LOCAL address space (registers). * The destination tensor must be in GENERIC or GLOBAL address space (DRAM). 
* Both tensors must have compatible data types. **Parameters:** * ​dst\_thread\_layout (`Layout`): The layout used to distribute the destination tensor across threads. This determines how the workload is divided among participating threads. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in global memory (DRAM). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in register memory (LOCAL) to be copied. `copy_local_to_dram[dst_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dst_base: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Efficiently copy data from registers (LOCAL) to global memory (DRAM) on AMD GPUs. This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It utilizes the hardware's buffer\_store intrinsic to efficiently transfer data from registers to global memory while handling bounds checking. The function distributes the copy operation across multiple threads for maximum throughput. Notes: * This function is particularly useful for writing computed results from registers back to global memory with minimal latency. * The offset calculation is optimized for performance rather than flexibility. **Constraints:** * Only supported on AMD GPUs. * Destination tensor must be in GLOBAL address space. * Source tensor must be in LOCAL address space. * Data types must match between source and destination tensors. **Parameters:** * ​dst\_thread\_layout (`Layout`): The layout used to distribute the destination tensor across threads. This determines how the workload is divided among participating threads. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in global memory (DRAM). 
* ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in register memory (LOCAL address space) to be copied. * ​dst\_base (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The original global memory tensor from which dst is derived. This is used to construct the buffer descriptor required by AMD's `buffer_store` intrinsic. --- ## copy_local_to_local `copy_local_to_local(dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Synchronously copy data between local memory (register) tensors with type conversion. This function performs a synchronous copy operation between register tensors in a GPU context, with support for converting from float32 to half-precision formats (bfloat16/float16). It's particularly optimized for specific tensor layouts commonly used in matrix multiplication operations. Example: ```mojo from layout import LayoutTensor, Layout from layout.layout_tensor import copy_local_to_local from gpu.memory import AddressSpace fn kernel(): ... var src_reg = LayoutTensor[DType.float32, Layout.row_major(16, 8), MutableAnyOrigin, address_space = AddressSpace.LOCAL, ].stack_allocation().fill(1) var dst_reg = LayoutTensor[DType.bfloat16, Layout.row_major(16, 8), MutableAnyOrigin, address_space = AddressSpace.LOCAL, ].stack_allocation() # Process data in float32 registers # ... # Convert and copy to bfloat16 registers copy_local_to_local(dst_reg, src_reg) ``` Performance: * Optimized for specific 2D tensor layouts with contiguous inner dimensions. * Special fast path for 2D tensors with specific layouts used in matrix multiplication. * For MMA (Matrix Multiply-Accumulate) operations, efficiently handles the conversion between output fragments and input fragments with different layouts. * Falls back to element-wise copy for general cases. Notes: * Both source and destination tensors must be in `LOCAL` address space (registers). * This function currently only supports copying from float32 to half-precision formats. * For 2D tensors with stride\[1] == 1, a specialized fast path is used that's optimized for matrix multiplication patterns. * This function is particularly useful in GPU kernels for converting between different precision formats while keeping data in registers. **Constraints:** * Destination tensor must be in `LOCAL` address space. * Source tensor must be in `LOCAL` address space. * Destination tensor must have a half-precision floating-point data type. * Source tensor must have float32 data type. * Both tensors must have the same total size. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in local memory (registers) and have a half-precision floating-point data type (bfloat16 or float16). 
* ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in local memory (registers) and have float32 data type. --- ## copy_sram_to_dram `copy_sram_to_dram[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = thread_layout.size(), binary_op: OptionalReg[fn[DType, Int](lhs: SIMD[$0, $1], rhs: SIMD[$0, $1]) -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int](lhs: SIMD[$0, $1], rhs: SIMD[$0, $1]) -> SIMD[$0, $1]]({:i1 0, 1})](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Synchronously copy data from SRAM (shared memory) to DRAM (global memory). This function performs a synchronous memory transfer from SRAM (shared memory) to DRAM (global memory) using the specified thread layout for workload distribution. It supports optional swizzling for optimized memory access patterns and binary operations for combining data during the transfer. Performance: * Distributes the copy workload across multiple threads for parallel execution. * Supports vectorized loads and stores for better memory throughput. * Can use swizzling to optimize memory access patterns. * Supports binary operations to combine data during transfer (e.g., for reduction operations). Notes: * The source tensor must be in `SHARED` address space (SRAM). * The destination tensor must be in `GENERIC` or `GLOBAL` address space (DRAM). * Supports FP32 to half-precision downcast during copy if needed. * Handles masked tensors with proper bounds checking. * This function is synchronous, meaning all threads must complete their copy operations before proceeding. **Constraints:** * Source tensor must be in SHARED address space with a static layout. * Destination tensor must be in GENERIC or GLOBAL address space. * For type conversion, only FP32 to half-precision is supported. * For vectorized copy with type conversion, both tensors must have element layouts matching the SIMD width of the destination type. **Parameters:** * ​thread\_layout (`Layout`): Layout defining how threads are organized for both source and destination. This determines how the workload is distributed among threads. * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the source indices, which can improve memory access patterns and reduce bank conflicts. * ​num\_threads (`Int`): Total number of threads participating in the copy operation. Defaults to the size of thread\_layout. * ​binary\_op (`OptionalReg[fn[DType, Int](lhs: SIMD[$0, $1], rhs: SIMD[$0, $1]) -> SIMD[$0, $1]]`): Optional binary operation to apply during the copy, combining source data with existing destination data. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in global or generic memory (DRAM). 
* ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in shared memory (SRAM). --- ## copy_sram_to_local `copy_sram_to_local[src_warp_layout: Layout, axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1})](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Synchronously copy data from SRAM (shared memory) to local memory. This function performs a synchronous memory transfer from SRAM (shared memory) to local memory (registers) using the specified thread layout for workload distribution. Performance: * Distributes the copy workload across multiple threads for parallel execution. * Optimized for transferring data from shared memory to registers. * Supports optional axis-specific distribution for specialized access patterns. **Constraints:** * The source tensor must be in SHARED address space (SRAM). * The destination tensor must be in LOCAL address space (registers). * Both tensors must have the same data type. **Parameters:** * ​src\_warp\_layout (`Layout`): Layout defining how threads are organized for the source tensor. This determines how the workload is distributed among threads. * ​axis (`OptionalReg[Int]`): Optional parameter specifying which axis to distribute along. When provided, distribution happens along the specified axis. When None (default), distribution uses the standard layout pattern. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in local memory (registers). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in shared memory (SRAM). --- ## cp_async_k_major `cp_async_k_major[type: DType, eviction_policy: CacheEviction = CacheEviction(0)](dst: LayoutTensor[type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Asynchronously copy data from DRAM to SRAM using TMA (Tensor Memory Accelerator) with K-major layout. This function performs an asynchronous copy operation from global memory (DRAM) to shared memory (SRAM) using NVIDIA's Tensor Memory Accelerator (TMA) hardware. It optimizes for K-major memory access patterns, which is particularly beneficial for certain tensor operations like matrix multiplications where the inner dimension (K) is accessed contiguously. 
The function automatically determines the optimal tile size and thread distribution based on the tensor shapes and hardware capabilities, leveraging TMA's efficient memory transfer mechanisms. Performance: * Uses TMA hardware acceleration for optimal memory transfer performance. * Optimizes for K-major access patterns, which can significantly improve performance for certain tensor operations like matrix multiplications. * Performs asynchronous transfers, allowing computation to overlap with memory operations. * Automatically determines optimal tile sizes based on tensor dimensions. * Uses hardware-accelerated swizzling to reduce shared memory bank conflicts. Notes: * This function requires NVIDIA GPUs with TMA support (compute capability 9.0+). * The source tensor must be in GENERIC or GLOBAL address space (DRAM). * The destination tensor must be in SHARED address space (SRAM). * Both tensors must have the same data type. * This function is asynchronous, so you must call [`async_copy_wait_all()`](/mojo/stdlib/gpu/memory/async_copy_wait_all/) or [`async_copy_wait_group()`](/mojo/stdlib/gpu/memory/async_copy_wait_group/) to ensure the copy has completed before using the data. * K-major layout is particularly beneficial for matrix multiplication operations where the inner dimension (K) is accessed contiguously. **Constraints:** * Requires NVIDIA GPUs with TMA support (compute capability 9.0+). * Source tensor must be in GENERIC or GLOBAL address space. * Destination tensor must be in SHARED address space. * Both tensors must have the same data type. * Source and destination tensors must be 2D. **Parameters:** * ​type (`DType`): The data type of the tensor elements. * ​eviction\_policy (`CacheEviction`): The cache eviction policy to use. Default is `CacheEviction.EVICT_NORMAL`. **Args:** * ​dst (`LayoutTensor[type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM). * ​src (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in global or generic memory (DRAM). --- ## cp_async_mn_major `cp_async_mn_major[type: DType, eviction_policy: CacheEviction = CacheEviction(0)](dst: LayoutTensor[type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Asynchronously copy data from DRAM to SRAM using TMA (Tensor Memory Accelerator) with MN-major layout. This function performs an asynchronous copy operation from global memory (DRAM) to shared memory (SRAM) using NVIDIA's Tensor Memory Accelerator (TMA) hardware. It optimizes for MN-major memory access patterns, which is particularly beneficial for tensor operations where the outer dimensions (M, N) are accessed contiguously. The function automatically determines the optimal tile size and thread distribution based on the tensor shapes and hardware capabilities, leveraging TMA's efficient memory transfer mechanisms. Performance: * Uses TMA hardware acceleration for optimal memory transfer performance. 
* Optimizes for MN-major access patterns, which can significantly improve performance for certain tensor operations where outer dimensions are accessed contiguously. * Performs asynchronous transfers, allowing computation to overlap with memory operations. * Automatically determines optimal tile sizes based on tensor dimensions. * Uses hardware-accelerated swizzling to reduce shared memory bank conflicts. Notes: * This function requires NVIDIA GPUs with TMA support (compute capability 9.0+). * The source tensor must be in `GENERIC` or `GLOBAL` address space (DRAM). * The destination tensor must be in `SHARED` address space (SRAM). * Both tensors must have the same data type. * This function is asynchronous, so you must call [`async_copy_wait_all()`](/mojo/stdlib/gpu/memory/async_copy_wait_all/) or [`async_copy_wait_group()`](/mojo/stdlib/gpu/memory/async_copy_wait_group/) to ensure the copy has completed before using the data. * MN-major layout is particularly beneficial for operations where the outer dimensions are accessed contiguously, such as certain convolution operations. **Constraints:** * Requires NVIDIA GPUs with TMA support (compute capability 9.0+). * Source tensor must be in `GENERIC` or `GLOBAL` address space. * Destination tensor must be in `SHARED` address space. * Both tensors must have the same data type. * Source and destination tensors must be 2D. **Parameters:** * ​type (`DType`): The data type of the tensor elements. * ​eviction\_policy (`CacheEviction`): The cache eviction policy to use. Default is `CacheEviction.EVICT_NORMAL`. **Args:** * ​dst (`LayoutTensor[type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM). * ​src (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in global or generic memory (DRAM). --- ## layout_tensor Provides the `LayoutTensor` type for representing multidimensional data. ## Aliases ### `binary_op_type` `alias binary_op_type = fn[DType, Int](lhs: SIMD[$0, $1], rhs: SIMD[$0, $1]) -> SIMD[$0, $1]` Type alias for binary operations on SIMD vectors. This type represents a function that takes two SIMD vectors of the same type and width and returns a SIMD vector of the same type and width. **Args:** * ​type (`DType`): The data type of the SIMD vector elements. * ​width (`Int`): The width of the SIMD vector. * ​lhs (`SIMD[type, width]`): Left-hand side SIMD vector operand. * ​rhs (`SIMD[type, width]`): Right-hand side SIMD vector operand. **Returns:** A SIMD vector containing the result of the binary operation. ## Structs * [​`LayoutTensor`](./LayoutTensor): A high-performance tensor with explicit memory layout and hardware-optimized access patterns. * [​`LayoutTensorIter`](./LayoutTensorIter): Iterator for traversing a memory buffer with a specific layout. * [​`ThreadScope`](./ThreadScope): Represents the scope of thread operations in GPU programming. ## Functions * [​`copy`](./copy): Synchronously copy data from local memory (registers) to SRAM (shared memory). * [​`copy_dram_to_local`](./copy_dram_to_local): Efficiently copy data from global memory (DRAM) to registers for AMD GPUs. * [​`copy_dram_to_sram`](./copy_dram_to_sram): Synchronously copy data from DRAM (global memory) to SRAM (shared memory) in a GPU context.
* [​`copy_dram_to_sram_async`](./copy_dram_to_sram_async): Asynchronously copy data from DRAM (global memory) to SRAM (shared memory) in a GPU context. * [​`copy_local_to_dram`](./copy_local_to_dram): Efficiently copy data from registers (LOCAL) to global memory (DRAM). * [​`copy_local_to_local`](./copy_local_to_local): Synchronously copy data between local memory (register) tensors with type conversion. * [​`copy_sram_to_dram`](./copy_sram_to_dram): Synchronously copy data from SRAM (shared memory) to DRAM (global memory). * [​`copy_sram_to_local`](./copy_sram_to_local): Synchronously copy data from SRAM (shared memory) to local memory. * [​`cp_async_k_major`](./cp_async_k_major): Asynchronously copy data from DRAM to SRAM using TMA (Tensor Memory Accelerator) with K-major layout. * [​`cp_async_mn_major`](./cp_async_mn_major): Asynchronously copy data from DRAM to SRAM using TMA (Tensor Memory Accelerator) with MN-major layout. * [​`stack_allocation_like`](./stack_allocation_like): Create a stack-allocated tensor with the same layout as an existing tensor. --- ## stack_allocation_like `stack_allocation_like[layout: Layout, dtype: DType, *, address_space: AddressSpace, target_address_space: AddressSpace = AddressSpace(0)](in_tensor: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=target_address_space, masked=masked]` Create a stack-allocated tensor with the same layout as an existing tensor. This function creates a new tensor on the stack with the same layout, data type, and masking properties as the input tensor, but potentially with a different address space. This is useful for creating temporary tensors that match the structure of existing tensors. Example:

```mojo
from layout import LayoutTensor, Layout
from layout.layout_tensor import stack_allocation_like
from gpu.memory import AddressSpace

var global_tensor = LayoutTensor[
    DType.float32,
    Layout.row_major(10, 10),
    MutableAnyOrigin,
    address_space = AddressSpace.GLOBAL,
].stack_allocation()

var shared_tensor = stack_allocation_like[
    target_address_space = AddressSpace.SHARED
](global_tensor)
```

Performance: * Creates a tensor on the stack, which is typically faster to allocate and access than heap-allocated memory. * Stack allocations have automatic lifetime management, reducing memory management overhead. * Stack size is limited, so be cautious with large tensor allocations. Notes: * The new tensor will have the same layout, data type, and masking properties as the input tensor. * The address space can be changed, which is useful for moving data between different memory regions (e.g., from global to shared memory). * Stack allocations are automatically freed when they go out of scope. * The function uses the stack\_allocation method of the result tensor type. **Parameters:** * ​layout (`Layout`): The layout of the tensor to allocate. * ​dtype (`DType`): The data type of the tensor elements. * ​address\_space (`AddressSpace`): The address space of the input tensor. * ​target\_address\_space (`AddressSpace`): The address space for the new tensor. Defaults to GENERIC. **Args:** * ​in\_tensor (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor to match the layout of.
**Returns:** A new tensor allocated on the stack with the same layout as the input tensor. --- ## math Implements math methods that work on layout tensors. ## Functions * [​`max`](./max): Computes maximum reduction along specified axis. * [​`outer_product_acc`](./outer_product_acc): Updates result tensor with the outer product of two vectors. * [​`sum`](./sum): Computes sum reduction along specified axis. --- ## max `max[axis: Int](inp: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], outp: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Computes maximum reduction along specified axis. Reduces the input tensor by taking maximum elements along the specified axis and stores the result in the output tensor. **Constraints:** All tensors must have statically known shapes. `outp.rank` must equal `inp.rank - 1`. Non-reduction dimensions must match between `inp` and `outp`. Currently only supports rank-2 inputs. **Parameters:** * ​axis (`Int`): The axis to take maximum along. **Args:** * ​inp (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor to reduce. * ​outp (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor to store maximum results. `max[axis: Int](inp: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, _reduce_res_row_major_shape(axis, layout), MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Computes maximum reduction along specified axis, returning a new tensor. Reduces the input tensor by taking maximum elements along the specified axis and returns a new tensor with the results. **Constraints:** All tensors must have statically known shapes. Result will have rank equal to `inp.rank` - 1. Non-reduction dimensions in the result match the input. Currently only supports rank-2 inputs. **Parameters:** * ​axis (`Int`): The axis to take maximum along. **Args:** * ​inp (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor to reduce. **Returns:** A new tensor containing the maximum values along the specified axis. 
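For example, a minimal sketch of this reducing overload, mirroring the `sum` example later in this module (like that example, it relies on the prelude-provided `InlineArray`):

```mojo
from layout import LayoutTensor, Layout
from layout.math import max

fn main():
    var data = InlineArray[Int32, 6](0, 1, 2, 3, 4, 5)
    var tensor = LayoutTensor[DType.int32, Layout.row_major(2, 3)](data)
    # Reduce along axis 0 (down the columns) of the 2x3 tensor.
    print(max[0](tensor))  # expected: 3 4 5
```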
`max[dtype: DType, layout: Layout](x: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], y: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Computes element-wise maximum of two tensors. Returns a new tensor containing the element-wise maximum between the input tensors. **Constraints:** Input tensors must have statically known shapes and matching layouts. **Parameters:** * ​dtype (`DType`): The data type of the input tensors. * ​layout (`Layout`): The layout of the input tensors. **Args:** * ​x (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): First input tensor. * ​y (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Second input tensor. **Returns:** A new tensor containing the element-wise maximum. --- ## outer_product_acc `outer_product_acc(res: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], lhs: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], rhs: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Updates result tensor with the outer product of two vectors. Computes `res += outer(lhs, rhs)` where `lhs` and `rhs` are vectors and `res` is a matrix. **Constraints:** All tensors must have statically known shapes. `res` must be rank 2. `lhs` and `rhs` must be rank 1. `res.shape[0]` `==` `lhs.shape[0]` and `res.shape[1]` `==` `rhs.shape[0]`. **Args:** * ​res (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The result matrix to accumulate into, shape (M, N). * ​lhs (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The left-hand side vector, shape (M,). * ​rhs (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The right-hand side vector, shape (N,). 
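As a minimal sketch of the accumulation (the shapes, fill values, and stack-allocated tensors are illustrative, in the style of the other examples in this document):

```mojo
from layout import LayoutTensor, Layout
from layout.math import outer_product_acc

fn main():
    # res: 2x3 accumulator; lhs: length-2 vector; rhs: length-3 vector.
    var res = LayoutTensor[
        DType.float32, Layout.row_major(2, 3), MutableAnyOrigin
    ].stack_allocation().fill(0)
    var lhs = LayoutTensor[
        DType.float32, Layout.row_major(2), MutableAnyOrigin
    ].stack_allocation().fill(1)
    var rhs = LayoutTensor[
        DType.float32, Layout.row_major(3), MutableAnyOrigin
    ].stack_allocation().fill(2)
    # res += outer(lhs, rhs): each res[i, j] becomes 0 + 1 * 2 = 2.
    outer_product_acc(res, lhs, rhs)
    print(res)
```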
--- ## sum `sum[axis: Int](inp: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], outp: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Computes sum reduction along specified axis. Reduces the input tensor by summing elements along the specified axis and stores the result in the output tensor. Example:

```mojo
from layout import LayoutTensor, Layout
from layout.math import sum

data = InlineArray[Int32, 6](0, 1, 2, 3, 4, 5)
tensor = LayoutTensor[DType.int32, Layout.row_major(2, 3)](data)
print(tensor)
print("-----")
print(sum[0](tensor))
```

Output:

```plaintext
0 1 2
3 4 5
-----
3 5 7
```

**Constraints:** All tensors must have statically known shapes. `outp.rank` must equal `inp.rank - 1`. Non-reduction dimensions must match between `inp` and `outp`. Currently only supports rank-2 inputs. **Parameters:** * ​axis (`Int`): The axis to sum along. **Args:** * ​inp (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor to sum. * ​outp (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor to store sum results. `sum[axis: Int](inp: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, _reduce_res_row_major_shape(axis, layout), MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Computes sum reduction along specified axis, returning a new tensor. Reduces the input tensor by summing elements along the specified axis and returns a new tensor with the results. **Constraints:** All tensors must have statically known shapes. Result will have rank equal to `inp.rank` - 1. Non-reduction dimensions in the result match the input. Currently only supports rank-2 inputs. **Parameters:** * ​axis (`Int`): The axis to sum along. **Args:** * ​inp (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor to sum. **Returns:** A new tensor containing the sum values along the specified axis. --- ## RuntimeLayout `@register_passable(trivial)` `struct RuntimeLayout[layout: Layout, /, *, element_type: DType = int64, linear_idx_type: DType = int64]` A runtime-configurable layout that uses `RuntimeTuple` for storage. This struct provides a layout implementation that can be modified at runtime, unlike the static [`Layout`](/mojo/kernels/layout/layout/Layout) type. It uses [`RuntimeTuple`](/mojo/kernels/layout/runtime_tuple/RuntimeTuple) for shape and stride storage. The layout must have a statically known rank at compile time, but the actual shape and stride values can be modified during execution.
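For example, a sketch of binding a runtime shape to a rank-2 row-major layout (this assumes `UNKNOWN_VALUE` is importable from `layout` to mark dynamic dimensions, and `IndexList` from `utils`):

```mojo
from layout import Layout, UNKNOWN_VALUE
from layout.runtime_layout import RuntimeLayout
from utils import IndexList

fn main():
    # The rank (2-D, row-major) is static; the dimension values are not.
    alias static_layout = Layout.row_major(UNKNOWN_VALUE, UNKNOWN_VALUE)
    # Bind the actual 2 x 8 shape at run time.
    var layout = RuntimeLayout[static_layout].row_major(IndexList[2](2, 8))
    print(layout.size())  # 16
    print(layout.dim(1))  # 8
```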
## Parameters * ​layout (`Layout`): The static `Layout` type to base this runtime layout on. * ​element\_type (`DType`): The integer type of each dimension element. Must be signed. * ​linear\_idx\_type (`DType`): The integer type of the linear index into memory returned by `crd2idx`. Must be signed. ## Fields * ​shape (`RuntimeTuple[layout.shape, element_type=element_type]`): The shape of the layout as a runtime tuple. Stores the size of each dimension using the specified `element_type`. Must match the static layout's shape dimensions. * ​stride (`RuntimeTuple[layout.stride, element_type=linear_idx_type]`): The stride of the layout as a runtime tuple. Stores the stride (step size) for each dimension using the specified `linear_idx_type`, which is 64-bit by default since strides can be large values. Must match the static layout's stride dimensions. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Initialize a `RuntimeLayout` with default values. Creates a new `RuntimeLayout` instance with default shape and stride values. Requires that the static layout has known dimensions at compile time. **Constraints:** The static layout that this runtime layout is based on must have all dimensions known. `__init__(shape: RuntimeTuple[layout.shape, element_type=element_type], stride: RuntimeTuple[layout.stride, element_type=linear_idx_type]) -> Self` Initialize a `RuntimeLayout` with specified shape and stride. **Args:** * ​shape (`RuntimeTuple[layout.shape, element_type=element_type]`): A `RuntimeTuple` containing the dimensions of each axis. * ​stride (`RuntimeTuple[layout.stride, element_type=linear_idx_type]`): A `RuntimeTuple` containing the stride values for each axis. ### `__call__` `__call__(self, idx: Int) -> SIMD[linear_idx_type, 1]` Convert a single index to a flat linear index. **Args:** * ​idx (`Int`): The one-dimensional index to convert. **Returns:** The corresponding flat linear index in the layout. `__call__[: ImmutableOrigin, //, t: IntTuple[$0]](self, idx: RuntimeTuple[t, element_type=element_type]) -> SIMD[linear_idx_type, 1]` Convert a multi-dimensional index to a flat linear index. **Parameters:** * ​t (`IntTuple[$0]`): The `IntTuple` type for the index. **Args:** * ​idx (`RuntimeTuple[t, element_type=element_type]`): A `RuntimeTuple` containing the multi-dimensional coordinates. **Returns:** The corresponding flat linear index in the layout. ### `idx2crd` `idx2crd[: ImmutableOrigin, //, t: IntTuple[$0]](self, idx: RuntimeTuple[t, element_type=element_type]) -> RuntimeTuple[idx2crd[::Origin[::Bool(t, layout.shape, layout.stride), element_type=element_type]` Converts a linear index to logical coordinates. This is the inverse operation of the `__call__` method, mapping from a memory index back to the corresponding logical coordinates. **Parameters:** * ​t (`IntTuple[$0]`): The `IntTuple` type for the index. **Args:** * ​idx (`RuntimeTuple[t, element_type=element_type]`): The linear index to convert. **Returns:** The logical coordinates corresponding to the given index. ### `size` `size(self) -> Int` Calculate the total number of elements in the layout. **Returns:** The product of all dimensions in the shape, representing the total number of elements that can be addressed by this layout. ### `bound_check_required` `bound_check_required(self) -> Bool` Determine if bounds checking is required for this layout.
**Returns:** True if any dimension in the shape differs from the static layout's shape, False otherwise. ### `cast` `cast[element_type: DType, /, *, linear_idx_type: DType = linear_idx_type](self) -> RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]` Cast the layout to use a different element bitwidth. **Parameters:** * ​element\_type (`DType`): The target data type. * ​linear\_idx\_type (`DType`): The target linear idx type. **Returns:** A new `RuntimeLayout` with the shape cast to the specified type. ### `__str__` `__str__(self) -> String` Convert the layout to a string representation. **Returns:** A string representation of the layout. ### `row_major` `static row_major[rank: Int, //](shape: IndexList[rank, element_type=element_type]) -> Self` Create a row-major layout from the given shape. In row-major layout, elements with adjacent rightmost indices are adjacent in memory. **Parameters:** * ​rank (`Int`): The number of dimensions in the layout. **Args:** * ​shape (`IndexList[rank, element_type=element_type]`): An `IndexList` containing the dimensions of each axis. **Returns:** A `RuntimeLayout` with row-major stride ordering. ### `col_major` `static col_major[rank: Int, //](shape: IndexList[rank, element_type=element_type]) -> Self` Create a column-major layout from the given shape. In column-major layout, elements with adjacent leftmost indices are adjacent in memory. **Parameters:** * ​rank (`Int`): The number of dimensions in the layout. **Args:** * ​shape (`IndexList[rank, element_type=element_type]`): An `IndexList` containing the dimensions of each axis. **Returns:** A `RuntimeLayout` with column-major stride ordering. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Write a string representation of the layout to a writer. **Parameters:** * ​W (`Writer`): The `Writer` type. **Args:** * ​writer (`W`): The `Writer` object to write the layout representation to. ### `sublayout` `sublayout[i: Int](self) -> RuntimeLayout[layout[i], element_type=element_type, linear_idx_type=linear_idx_type]` Extract a nested sublayout at the specified index. **Parameters:** * ​i (`Int`): The index of the nested layout to extract. **Returns:** A `RuntimeLayout` representing the nested layout at index i. ### `dim` `dim(self, i: Int) -> Int` Get the size of the dimension at the specified index. **Args:** * ​i (`Int`): The index of the dimension to retrieve. **Returns:** The size of the dimension at index `i`. ### `__len__` `static __len__() -> Int` Get the number of dimensions in the layout. **Returns:** The number of dimensions (rank) of the layout. --- ## coalesce `coalesce[l: Layout, keep_rank: Bool = False](layout: RuntimeLayout[l, element_type=element_type, linear_idx_type=linear_idx_type]) -> RuntimeLayout[coalesce(l, keep_rank), element_type=element_type, linear_idx_type=linear_idx_type]` Coalesce adjacent dimensions in a runtime layout when possible. This optimizes the layout by merging adjacent dimensions when their relationship allows it, potentially reducing the number of dimensions. **Parameters:** * ​l (`Layout`): The static layout type to coalesce. * ​keep\_rank (`Bool`): Whether to maintain the original rank (currently unsupported). **Args:** * ​layout (`RuntimeLayout[l, element_type=element_type, linear_idx_type=linear_idx_type]`): The input `RuntimeLayout` to coalesce. **Returns:** A new `RuntimeLayout` with coalesced dimensions. --- ## runtime_layout Provides the `RuntimeLayout` type and functions for working with it. 
You can use `RuntimeLayout` to define a layout where the dimensions are not known at compile time. You can import these APIs from `layout.runtime_layout`.

```mojo
from layout.runtime_layout import RuntimeLayout, make_layout
```

## Structs * [​`RuntimeLayout`](./RuntimeLayout): A runtime-configurable layout that uses `RuntimeTuple` for storage. ## Functions * [​`coalesce`](./coalesce): Coalesce adjacent dimensions in a runtime layout when possible. * [​`make_layout`](./make_layout): Combine two runtime layouts into a single composite layout. --- ## make_layout `make_layout[l1: Layout, l2: Layout, /, *, linear_idx_type: DType = uint64](a: RuntimeLayout[l1, element_type=element_type, linear_idx_type=linear_idx_type], b: RuntimeLayout[l2, element_type=element_type, linear_idx_type=linear_idx_type]) -> RuntimeLayout[make_layout(l1, l2), element_type=element_type, linear_idx_type=linear_idx_type]` Combine two runtime layouts into a single composite layout. This creates a new layout by concatenating the dimensions and strides of the input layouts. **Parameters:** * ​l1 (`Layout`): The static layout type of `a`. * ​l2 (`Layout`): The static layout type of `b`. * ​linear\_idx\_type (`DType`): The integer type of all indices calculated by the returned runtime layout. **Args:** * ​a (`RuntimeLayout[l1, element_type=element_type, linear_idx_type=linear_idx_type]`): The first `RuntimeLayout` to combine. * ​b (`RuntimeLayout[l2, element_type=element_type, linear_idx_type=linear_idx_type]`): The second `RuntimeLayout` to combine. **Returns:** A new `RuntimeLayout` with dimensions from both input layouts. --- ## RuntimeTuple `@register_passable(trivial)` `struct RuntimeTuple[origin: ImmutableOrigin, //, S: IntTuple[origin] = IntTuple(-1), /, *, element_type: DType = int64]` A struct representing tuple-like data with compile-time and runtime elements. RuntimeTuple combines static (compile-time) and dynamic (runtime) handling of tuple-like data structures, typically used for tensor shapes, indices, and coordinates in high-performance computing contexts. This struct is optimized for parallel execution and hardware acceleration, allowing efficient manipulation of multi-dimensional data. It supports both known compile-time values and runtime-determined values. ## Parameters * ​origin (`ImmutableOrigin`): The origin corresponding to the `IntTuple`. * ​S (`IntTuple[origin]`): `IntTuple` with compile-time known values (or `UNKNOWN_VALUE` for runtime values). * ​element\_type (`DType`): Integer type of the underlying elements. ## Fields * ​value (`IndexList[len[::Sized](flatten[::Origin[::Bool(S)), element_type=element_type]`): Storage for the actual tuple values, implemented as an IndexList with the appropriate size and element type. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Intable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `scalar_length` `alias scalar_length = len[::Sized](flatten[::Origin[::Bool(S))` The total number of scalar elements in this RuntimeTuple after flattening nested tuples. ## Methods ### `__init__` `__init__() -> Self` Initialize a `RuntimeTuple` with default values. For dimensions with known compile-time values in S, uses those values. For unknown dimensions, initializes them to UNKNOWN\_VALUE. `@implicit` `__init__(*values: Int) -> Self` Initialize a `RuntimeTuple` with the provided values. **Args:** * ​\*values (`Int`): Variadic number of integer values to initialize the tuple with.
`@implicit` `__init__[l: Int](values: IndexList[l, element_type=element_type]) -> Self` Initialize a `RuntimeTuple` from an `IndexList`. **Parameters:** * ​l (`Int`): Compile-time length of the input `IndexList`. **Args:** * ​values (`IndexList[l, element_type=element_type]`): `IndexList` to initialize from. Must have same length as the `RuntimeTuple`. The values will be cast to the appropriate element type if needed. ### `__getitem__` `__getitem__[i: Int](self) -> RuntimeTuple[S[i], element_type=element_type]` Retrieves the element at the specified index in the tuple. This method provides array-like indexing for RuntimeTuple, allowing access to individual elements or sub-tuples. It handles the internal offset calculation to access the correct elements in the flattened storage array. **Parameters:** * ​i (`Int`): The index of the element to retrieve. **Returns:** A new `RuntimeTuple` containing the element or sub-tuple at the specified index. ### `__setitem__` `__setitem__[i: Int](mut self, val: SIMD[element_type, 1])` Sets the value of the element at the specified index in the tuple. This method enables array-like assignment for RuntimeTuple elements, handling the internal offset calculation to modify the correct element in the flattened storage array. **Parameters:** * ​i (`Int`): The index of the element to modify. **Args:** * ​val (`SIMD[element_type, 1]`): The new value to assign to the element. ### `offset_until` `static offset_until[i: Int]() -> Int` Calculates the offset in the flattened value array for a given tuple index. This method computes the sum of lengths of all flattened subtuple elements that come before the specified index, which is used for indexing into the internal storage. **Parameters:** * ​i (`Int`): The tuple index to calculate the offset for. **Returns:** The offset in the flattened array where the i-th element begins. ### `get_int` `get_int(self) -> SIMD[element_type, 1]` Returns the integer value of this RuntimeTuple. For tuples with a known compile-time value, returns that value. For tuples with a runtime value, returns the first element of the internal storage array. **Returns:** The integer value of this RuntimeTuple. ### `__str__` `__str__(self) -> String` Converts the RuntimeTuple to its string representation. This method provides a human-readable string representation of the tuple, which is useful for debugging and logging. **Returns:** A string representation of the `RuntimeTuple`. ### `concat` `concat[: ImmutableOrigin, //, R: IntTuple[$0]](self, rhs: RuntimeTuple[R, element_type=element_type]) -> RuntimeTuple[concat[::Origin[::Bool(S, R), element_type=element_type]` Concatenates two `RuntimeTuple`s together. This method combines the current `RuntimeTuple` with another one, preserving both compile-time and runtime values. It handles the complexity of merging the underlying storage arrays while maintaining the proper semantic structure. **Parameters:** * ​R (`IntTuple[$0]`): The `IntTuple` type parameter of the right-hand side RuntimeTuple. **Args:** * ​rhs (`RuntimeTuple[R, element_type=element_type]`): The `RuntimeTuple` to concatenate to the end of this one. **Returns:** A new `RuntimeTuple` containing all elements from both tuples in sequence. ### `flatten` `flatten(self) -> RuntimeTuple[flatten[::Origin[::Bool(S), element_type=element_type]` Flattens a potentially nested `RuntimeTuple` into a single-level tuple. 
This method converts a hierarchical structure of tuples into a flat representation, preserving all values but removing the nested structure. This is useful for operations that need to treat all elements uniformly. **Returns:** A new `RuntimeTuple` containing all elements in a flat (non-nested) structure. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the RuntimeTuple to a Writer object. This method is used by the string conversion system to generate a string representation of the RuntimeTuple. It handles both scalar values and nested tuple structures, producing a properly formatted output. **Parameters:** * ​W (`Writer`): The Writer type to use for output. **Args:** * ​writer (`W`): The Writer object to write the string representation to. ### `__len__` `__len__(self) -> Int` Returns the length (number of top-level elements) of the `RuntimeTuple`. This method provides the standard Python-like len() functionality, giving the number of elements at the top level of the tuple structure. For nested tuples, this returns the number of first-level entries, not the total number of scalar values. **Returns:** The number of top-level elements in the tuple. ### `cast` `cast[type: DType](self) -> RuntimeTuple[S, element_type=type]` Casts the RuntimeTuple to use a different numeric type. This method creates a new RuntimeTuple with the same structure and values but using a different underlying numeric type for storage. This is useful for changing precision or signedness of the data. **Parameters:** * ​type (`DType`): The target DType to cast the elements to. **Returns:** A new `RuntimeTuple` with elements cast to the specified type. ### `__int__` `__int__(self) -> Int` Converts the RuntimeTuple to an integer value. This method enables implicit conversion of a RuntimeTuple to an integer, but is constrained to only work on scalar tuples (those that contain a single value). **Returns:** The integer value of the tuple. --- ## concat `concat(owned lhs: IntTuple[origin], rhs: IntTuple[origin]) -> IntTuple` Concatenates two `IntTuple` instances into a single `IntTuple`. This function appends all elements from the right-hand side tuple to the left-hand side tuple, creating a new combined tuple. The operation preserves the hierarchical structure of both tuples. **Args:** * ​lhs (`IntTuple[origin]`): The left-hand side `IntTuple` that will be modified (owned parameter). * ​rhs (`IntTuple[origin]`): The right-hand side `IntTuple` whose elements will be appended. **Returns:** A new `IntTuple` containing all elements from both tuples in sequence. --- ## crd2idx `crd2idx[: ImmutableOrigin, : ImmutableOrigin, : ImmutableOrigin, //, crd_t: IntTuple[$2], shape_t: IntTuple[$1], stride_t: IntTuple[$0], out_type: DType = uint64](crd: RuntimeTuple[crd_t, element_type=element_type], shape: RuntimeTuple[shape_t, element_type=element_type], stride: RuntimeTuple[stride_t, element_type=element_type]) -> SIMD[out_type, 1]` Converts multi-dimensional coordinates to a linear index. This function is the inverse of idx2crd, transforming a set of coordinates into a flat index based on the provided shape and stride information. This is essential for mapping multi-dimensional tensor elements to linear memory. **Parameters:** * ​crd\_t (`IntTuple[$2]`): Type of the coordinates. * ​shape\_t (`IntTuple[$1]`): Type of the shape. * ​stride\_t (`IntTuple[$0]`): Type of the stride. * ​out\_type (`DType`): The output data type for the index (default: uint64). 
**Args:** * ​crd (`RuntimeTuple[crd_t, element_type=element_type]`): The coordinates to convert. * ​shape (`RuntimeTuple[shape_t, element_type=element_type]`): The shape of the multi-dimensional array. * ​stride (`RuntimeTuple[stride_t, element_type=element_type]`): The stride values for each dimension. **Returns:** A scalar value representing the linear index corresponding to the given coordinates. --- ## idx2crd `idx2crd[: ImmutableOrigin, : ImmutableOrigin, : ImmutableOrigin, //, idx_t: IntTuple[$2], shape_t: IntTuple[$1], stride_t: IntTuple[$0]](idx: RuntimeTuple[idx_t, element_type=element_type], shape: RuntimeTuple[shape_t, element_type=element_type], stride: RuntimeTuple[stride_t, element_type=element_type]) -> RuntimeTuple[idx2crd[::Origin[::Bool(idx_t, shape_t, stride_t), element_type=element_type]` Converts a linear index to multi-dimensional coordinates. This function transforms a flat index into coordinate values based on the provided shape and stride information. This is essential for mapping linear memory accesses to multi-dimensional tensor elements. **Constraints:** The index must be a scalar value (not a tuple). **Parameters:** * ​idx\_t (`IntTuple[$2]`): IntTuple type of the index. * ​shape\_t (`IntTuple[$1]`): IntTuple type of the shape. * ​stride\_t (`IntTuple[$0]`): IntTuple type of the stride. **Args:** * ​idx (`RuntimeTuple[idx_t, element_type=element_type]`): The linear index to convert. * ​shape (`RuntimeTuple[shape_t, element_type=element_type]`): The shape of the multi-dimensional array. * ​stride (`RuntimeTuple[stride_t, element_type=element_type]`): The stride values for each dimension. **Returns:** A `RuntimeTuple` containing the multi-dimensional coordinates. `idx2crd[: ImmutableOrigin, : ImmutableOrigin, //, idx_t: IntTuple[$1], shape_t: IntTuple[$0]](idx: RuntimeTuple[idx_t, element_type=element_type], shape: RuntimeTuple[shape_t, element_type=element_type]) -> RuntimeTuple[idx2crd[::Origin[::Bool(idx_t, shape_t, prefix_product[::Origin[::Bool(shape_t)), element_type=element_type]` Converts a linear index to multi-dimensional coordinates using shape-derived strides. This is a convenience overload of `idx2crd` that automatically calculates the stride values from the shape using `prefix_product`. This is the common case for row-major storage order tensors. **Parameters:** * ​idx\_t (`IntTuple[$1]`): IntTuple type of the index. * ​shape\_t (`IntTuple[$0]`): IntTuple type of the shape. **Args:** * ​idx (`RuntimeTuple[idx_t, element_type=element_type]`): The linear index to convert. * ​shape (`RuntimeTuple[shape_t, element_type=element_type]`): The shape of the multi-dimensional array. **Returns:** A `RuntimeTuple` containing the multi-dimensional coordinates calculated using automatically derived strides from the shape. --- ## runtime_tuple Provides the `RuntimeTuple` data structure and related utility functions for handling tuple-like data with both compile-time and runtime elements. `RuntimeTuple` is designed for high-performance tensor operations, supporting efficient manipulation of multi-dimensional data structures like shapes, indices, and coordinates. 
Key features: * Hybrid compile-time/runtime value handling * Optimized for parallel execution and hardware acceleration * Support for nested tuple structures * Efficient conversion between linear indices and multi-dimensional coordinates * Specialized operations for tensor shape calculations The module includes functions for tuple manipulation (concatenation, flattening), coordinate transformations (`idx2crd`, `crd2idx`), and specialized tensor operations like shape division and prefix products. ## Structs * [​`RuntimeTuple`](./RuntimeTuple): A struct representing tuple-like data with compile-time and runtime elements. RuntimeTuple combines static (compile-time) and dynamic (runtime) handling of tuple-like data structures, typically used for tensor shapes, indices, and coordinates in high-performance computing contexts. This struct is optimized for parallel execution and hardware acceleration, allowing efficient manipulation of multi-dimensional data. It supports both known compile-time values and runtime-determined values. ## Functions * [​`concat`](./concat): Concatenates two `IntTuple` instances into a single `IntTuple`. * [​`crd2idx`](./crd2idx): Converts multi-dimensional coordinates to a linear index. * [​`idx2crd`](./idx2crd): Converts a linear index to multi-dimensional coordinates. This function transforms a flat index into coordinate values based on the provided shape and stride information. This is essential for mapping linear memory accesses to multi-dimensional tensor elements. * [​`is_int`](./is_int): Determines if a `RuntimeTuple` represents a scalar integer value. * [​`is_tuple`](./is_tuple): Determines if a `RuntimeTuple` represents a tuple rather than a scalar value. * [​`prefix_product`](./prefix_product): Computes the prefix products of elements in the `RuntimeTuple`. * [​`product`](./product): Computes the product of all elements in the `RuntimeTuple`. * [​`shape_div`](./shape_div): Performs specialized shape division between `RuntimeTuple`s. * [​`signum`](./signum): Returns the sign of an integer value. --- ## is_int `is_int[: ImmutableOrigin, //, t: IntTuple[$0]](tuple: RuntimeTuple[t, element_type=element_type]) -> Bool` Determines if a `RuntimeTuple` represents a scalar integer value. This function checks if the `RuntimeTuple` holds a single scalar value rather than a tuple structure with multiple elements. **Parameters:** * ​t (`IntTuple[$0]`): The IntTuple type parameter of the RuntimeTuple. **Args:** * ​tuple (`RuntimeTuple[t, element_type=element_type]`): The `RuntimeTuple` to check. **Returns:** True if the `RuntimeTuple` represents a scalar integer, False otherwise. --- ## is_tuple `is_tuple[: ImmutableOrigin, //, t: IntTuple[$0]](tuple: RuntimeTuple[t, element_type=element_type]) -> Bool` Determines if a `RuntimeTuple` represents a tuple rather than a scalar value. This function checks the structure of the underlying IntTuple to determine if it represents a tuple with multiple elements or a single scalar value. **Parameters:** * ​t (`IntTuple[$0]`): The IntTuple type parameter of the RuntimeTuple. **Args:** * ​tuple (`RuntimeTuple[t, element_type=element_type]`): The `RuntimeTuple` to check. **Returns:** True if the `RuntimeTuple` represents a tuple, False if it represents a scalar. --- ## prefix_product `prefix_product[: ImmutableOrigin, //, t: IntTuple[$0]](tuple: RuntimeTuple[t, element_type=element_type]) -> RuntimeTuple[prefix_product[::Origin[::Bool(t)]` Computes the prefix products of elements in the `RuntimeTuple`. 
This function calculates the running product of elements, where each output element is the product of all previous elements in the input. This is commonly used in tensor computations to calculate stride values. **Parameters:** * ​t (`IntTuple[$0]`): The IntTuple type parameter of the input RuntimeTuple. **Args:** * ​tuple (`RuntimeTuple[t, element_type=element_type]`): The input `RuntimeTuple`. **Returns:** A new `RuntimeTuple` containing the prefix products of the input elements. --- ## product `product[: ImmutableOrigin, //, t: IntTuple[$0]](tuple: RuntimeTuple[t, element_type=element_type]) -> Int` Computes the product of all elements in the `RuntimeTuple`. This function multiplies all scalar values in the tuple, including those in nested tuples after flattening. This is commonly used to calculate the total size of a tensor from its shape. **Parameters:** * ​t (`IntTuple[$0]`): The IntTuple type parameter of the input RuntimeTuple. **Args:** * ​tuple (`RuntimeTuple[t, element_type=element_type]`): The input `RuntimeTuple`. **Returns:** The product of all scalar elements in the tuple. --- ## shape_div `shape_div[: ImmutableOrigin, : ImmutableOrigin, //, a_t: IntTuple[$1], b_t: IntTuple[$0]](a: RuntimeTuple[a_t, element_type=element_type], b: RuntimeTuple[b_t, element_type=element_type]) -> RuntimeTuple[shape_div[::Origin[::Bool(a_t, b_t)]` Performs specialized shape division between `RuntimeTuple`s. This function implements a special division operation specifically designed for tensor shape calculations. Unlike standard division, it handles special cases: 1. If shapes are directly divisible (a % b == 0), returns a standard division (a // b) 2. If shapes are inversely divisible (b % a == 0), returns the signed reciprocal 3. If shapes are incompatible, aborts with an error This operation is essential for transformations between tensor layouts and computing broadcasting semantics. **Parameters:** * ​a\_t (`IntTuple[$1]`): Type of the first operand. * ​b\_t (`IntTuple[$0]`): Type of the second operand. **Args:** * ​a (`RuntimeTuple[a_t, element_type=element_type]`): The dividend `RuntimeTuple`. * ​b (`RuntimeTuple[b_t, element_type=element_type]`): The divisor `RuntimeTuple`. **Returns:** A new `RuntimeTuple` containing the result of the shape division. --- ## signum `signum(a: Int) -> Int` Returns the sign of an integer value. This helper function determines whether a number is positive, negative, or zero, returning 1 for positive, -1 for negative, and 0 for zero. **Args:** * ​a (`Int`): The integer value to determine the sign of. **Returns:** 1 if a > 0, -1 if a < 0, 0 if a == 0. --- ## ComposedLayout `struct ComposedLayout[LayoutA: LayoutTrait, LayoutB: LayoutTrait, offset: OptionalReg[Int] = OptionalReg[Int]({:@stdlib::@builtin::@int::@Int {0}, 0})]` Layout composed of two layouts applied sequentially. Combines two layouts. Output of the first (`LayoutA`) is input to the second (`LayoutB`), with optional offset in between. ## Parameters * ​LayoutA (`LayoutTrait`): The first layout to apply. * ​LayoutB (`LayoutTrait`): The second layout to apply. * ​offset (`OptionalReg[Int]`): Optional offset between layouts (default: 0). ## Fields * ​layout\_a (`LayoutA`): The first layout to apply. * ​layout\_b (`LayoutB`): The second layout to apply.
## Implemented traits `AnyType`, `Copyable`, `LayoutTrait`, `UnknownDestructibility` ## Aliases ### `has_shape` `alias has_shape = get_vtable_entry(:trait LayoutA, "has_shape") if get_vtable_entry(:trait LayoutA, "has_shape") else get_vtable_entry(:trait LayoutB, "has_shape")` True if either layout has a shape. ## Methods ### `__init__` `__init__(out self, layout_a: LayoutA, layout_b: LayoutB)` Initialize ComposedLayout with two layouts. **Args:** * ​layout\_a (`LayoutA`): The first layout. * ​layout\_b (`LayoutB`): The second layout. ### `__copyinit__` `__copyinit__(out self, other: Self)` Copy constructor for ComposedLayout. **Args:** * ​other (`Self`): The ComposedLayout to copy from. ### `__call__` `__call__(self, idx: IntTuple[origin]) -> Int` Apply composed layout to an index. Applies `LayoutA`, then adds offset, then applies `LayoutB`. **Args:** * ​idx (`IntTuple[origin]`): The index to transform. **Returns:** The transformed index. `__call__(self, idx: IntTuple[origin], offset_val: Int) -> Int` Apply composed layout with runtime offset. Applies `LayoutA`, then adds runtime `offset_val`, then `LayoutB`. Static offset must not be set when using runtime offset. **Args:** * ​idx (`IntTuple[origin]`): The index to transform. * ​offset\_val (`Int`): Runtime offset to apply. **Returns:** The transformed index. ### `size` `size(self) -> Int` Get the size of the composed layout. Returns the size of the first layout (`LayoutA`). **Returns:** The size of the first layout. ### `cosize` `cosize(self) -> Int` Get the cosize of the composed layout. Returns the cosize of the second layout (`LayoutB`). **Returns:** The cosize of the second layout. --- ## Swizzle `@register_passable(trivial)` `struct Swizzle` Swizzle functor for memory access pattern optimization. Implements a swizzling pattern to reduce bank conflicts in shared memory accesses. It XORs specific bits of memory indices based on configurable parameters. Swizzle operation: Given index `i`, and Swizzle\[bits, base, shift]: 1. Extract `bits` number of bits from `i` starting from position `base + max(0, shift)`. Let's call this `YYY`. 2. Extract `bits` number of bits from `i` starting from position `base - min(0, shift)`. Let's call this `ZZZ`. 3. Result is `i ^ (YYY shifted by 'shift' positions)`. Example (Swizzle\[2, 0, 3]): Input index bits: `xxxxxxxxxxxxxxxxYYxxxxxxxxxZZxxxx` Output index bits: `xxxxxxxxxxxxxxxxYYxxxxxxxxxAAxxxx` where `AA = ZZ ^ YY`. Attributes: * bits (`Int`): Number of bits in the mask (YYY). * base (`Int`): Number of least significant bits to keep constant. * shift (`Int`): Shift distance for the mask (positive: right, negative: left). * yyy\_mask (`Int`): Mask for the bits to be shifted (YYY). * zzz\_mask (`Int`): Mask for the target bits (ZZZ). ## Fields * ​bits (`Int`): Number of bits in the mask. * ​base (`Int`): Number of least significant bits to keep constant. * ​shift (`Int`): Distance to shift the mask (pos right, neg left). * ​yyy\_mask (`Int`): Mask for the bits to be shifted. * ​zzz\_mask (`Int`): Mask for the target bits. ## Implemented traits `AnyType`, `Copyable`, `LayoutTrait`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `has_shape` `alias has_shape = False` Indicates if layout has shape. Swizzle always False. ## Methods ### `__init__` `__init__(bits: Int, base: Int, shift: Int) -> Self` Initialize a Swizzle object. Configures the swizzle operation based on bits, base, and shift parameters. **Args:** * ​bits (`Int`): Number of bits in the mask.
* ​base (`Int`): Least significant bits to keep constant. * ​shift (`Int`): Distance to shift the mask. ### `__call__` `__call__(self, index: IntTuple[origin]) -> Int` Apply swizzle to an IntTuple index. Unwraps the IntTuple and applies the swizzle to the integer value. **Args:** * ​index (`IntTuple[origin]`): The IntTuple index to swizzle. **Returns:** The swizzled index value. `__call__(self, offset: Int) -> Int` Apply swizzle to an integer offset. Performs the swizzle operation on an integer offset to rearrange memory access patterns. **Args:** * ​offset (`Int`): The integer offset to swizzle. **Returns:** The swizzled offset value. `__call__(self, offset: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Apply swizzle to a scalar offset. Scalar version of the swizzle operation. Applies swizzle to a scalar offset. **Args:** * ​offset (`SIMD[dtype, 1]`): The scalar offset to swizzle. **Returns:** The swizzled scalar value. ### `size` `size(self) -> Int` Get the size of the swizzle pattern. Calculates the size of the memory region affected by the swizzle pattern. **Returns:** The size of the swizzle pattern. ### `cosize` `cosize(self) -> Int` Get the cosize of the swizzle pattern. Cosize is the same as size for swizzle layouts, representing the output size. **Returns:** The cosize of the swizzle pattern (same as size). ### `write_to` `write_to[W: Writer](self, mut writer: W)` Write the swizzle parameters to a writer. Outputs the swizzle parameters (bits, base, shift) in a tuple format. **Parameters:** * ​W (`Writer`): The writer type that implements the Writer trait. **Args:** * ​writer (`W`): The writer to write to. ### `__str__` `__str__(self) -> String` Convert the swizzle to a string representation. **Returns:** String representation of the swizzle parameters. --- ## eval_composed `eval_composed[composed_layout: ComposedLayout[Layout, Swizzle]](idx: UInt, offset: UInt = UInt(0)) -> UInt` Evaluate a composed layout with swizzle. Evaluates a `ComposedLayout[Layout, Swizzle]`. Applies the base layout, adds an optional offset, and then applies the swizzle. **Parameters:** * ​composed\_layout (`ComposedLayout[Layout, Swizzle]`): The composed layout to evaluate, consisting of a base Layout and a Swizzle transformation. **Args:** * ​idx (`UInt`): The input index to transform. * ​offset (`UInt`): Optional offset to apply between layouts (default: 0). **Returns:** The transformed index after applying both layouts. --- ## swizzle Defines swizzle layouts for optimizing memory access patterns. This module is designed for use in shared memory, especially in GPU kernels, to reduce bank conflicts. It provides tools to create and apply swizzle transformations to memory indices. Swizzling rearranges memory access order to distribute accesses across different memory banks. This mitigates bank contention and improves memory access efficiency. Module components: * `Swizzle` struct: Represents a swizzle transformation with configurable bits, base, and shift parameters. * Helper functions: `make_ldmatrix_swizzle`, `make_swizzle` create predefined swizzle patterns. These are optimized for scenarios like `ldmatrix` instructions and general 2D memory access. * `ComposedLayout` struct: Combines a base layout with a swizzle layout for complex memory access optimizations. Primary use case: GPU kernel development where shared memory bank conflicts can degrade performance. Applying swizzle layouts optimizes memory access patterns for higher throughput. 
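For example, a minimal sketch that applies the hand-built `Swizzle[2, 0, 3]` pattern described above to a few integer offsets (the printed mappings follow from the documented XOR rule, e.g. 8 -> 9, 16 -> 18, 24 -> 27):

```mojo
from layout.swizzle import Swizzle

fn main():
    # bits=2, base=0, shift=3: XOR bits [3:5) of the index into bits [0:2).
    var sw = Swizzle(2, 0, 3)
    for i in range(4):
        var offset = i * 8
        print(offset, "->", sw(offset))
```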
## Structs * [​`ComposedLayout`](./ComposedLayout): Layout composed of two layouts applied sequentially. * [​`Swizzle`](./Swizzle): Swizzle functor for memory access pattern optimization. ## Functions * [​`eval_composed`](./eval_composed): Evaluate a composed layout with swizzle. * [​`make_ldmatrix_swizzle`](./make_ldmatrix_swizzle): Make swizzle to avoid bank conflict for ldmatrix ops. * [​`make_swizzle`](./make_swizzle): Create a 2D swizzle to avoid bank conflicts. * [​`shiftl`](./shiftl): Shift left or right based on sign of shift amount. * [​`shiftr`](./shiftr): Shift right or left based on sign of shift amount. --- ## make_ldmatrix_swizzle `make_ldmatrix_swizzle[type: DType, row_size: Int, log2_vector_width: Int = 0]() -> Swizzle` Make swizzle to avoid bank conflict for ldmatrix ops. Creates a swizzle pattern optimized for `ldmatrix` operations. Minimizes bank conflicts in shared memory for these operations. Calculates swizzle parameters based on data type and row size. **Parameters:** * ​type (`DType`): The data type of the elements. * ​row\_size (`Int`): Size of each row in elements. * ​log2\_vector\_width (`Int`): Log2 of the vector width (default: 0). **Returns:** A `Swizzle` object configured for `ldmatrix`. --- ## make_swizzle `make_swizzle[num_rows: Int, row_size: Int, access_size: Int]() -> Swizzle` Create a 2D swizzle to avoid bank conflicts. Generates a swizzle pattern for 2D memory layout to minimize bank conflicts in shared memory access. **Parameters:** * ​num\_rows (`Int`): Number of rows in the minimum access pattern. * ​row\_size (`Int`): Size of each row in elements. * ​access\_size (`Int`): Number of elements accessed at once. **Returns:** A `Swizzle` object for 2D memory access. `make_swizzle[type: DType, mode: TensorMapSwizzle]() -> Swizzle` Create swizzle based on predefined swizzle modes. Returns a swizzle pattern based on standard modes (32B, 64B, 128B, none), adjusted for data type. **Parameters:** * ​type (`DType`): The data type of the elements. * ​mode (`TensorMapSwizzle`): The swizzle mode to use (TensorMapSwizzle enum). **Returns:** A `Swizzle` object configured by the specified mode. --- ## shiftl `shiftl(a: Int, s: Int) -> Int` Shift left or right based on sign of shift amount. Performs a left shift if `s` is positive, or a right shift if `s` is negative. **Args:** * ​a (`Int`): The integer value to shift. * ​s (`Int`): The shift amount. Positive for left, negative for right. **Returns:** The shifted integer value. `shiftl(a: SIMD[dtype, 1], s: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Shift left/right based on sign of shift for scalars. Scalar version of `shiftl`. Left shift if `s` is positive, right shift if `s` is negative. **Args:** * ​a (`SIMD[dtype, 1]`): The scalar value to shift. * ​s (`SIMD[dtype, 1]`): The scalar shift amount. Positive for left, negative right. **Returns:** The shifted scalar value. --- ## shiftr `shiftr(a: Int, s: Int) -> Int` Shift right or left based on sign of shift amount. Performs a right shift if `s` is positive, or a left shift if `s` is negative. **Args:** * ​a (`Int`): The integer value to shift. * ​s (`Int`): The shift amount. Positive for right, negative for left. **Returns:** The shifted integer value. `shiftr(a: SIMD[dtype, 1], s: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Shift right/left based on sign of shift for scalars. Scalar version of `shiftr`. Right shift if `s` is positive, left shift if `s` is negative. **Args:** * ​a (`SIMD[dtype, 1]`): The scalar value to shift. * ​s (`SIMD[dtype, 1]`): The scalar shift amount. 
Positive for right, negative left. **Returns:** The shifted scalar value. --- ## LayoutTensorBuild `@register_passable(trivial)` `struct LayoutTensorBuild[dtype: DType, *, __layout: Layout = __init__[::Origin[::Bool(IntTuple(1)), __layout_init: Bool = False, __address_space: AddressSpace = AddressSpace(0), __layout_int_type: DType = _get_layout_type(__layout, __address_space), __index_type: DType = _get_index_type(__layout, __address_space), __circular: Bool = False]` Tensor layout builder providing a fluent interface for constructing tensors with various layouts. ## Parameters * ​dtype (`DType`): Data type of tensor elements. * ​\_\_layout (`Layout`): The tensor's memory layout. * ​\_\_layout\_init (`Bool`): Whether the layout has been initialized. * ​\_\_address\_space (`AddressSpace`): Memory space (generic, shared, local). * ​\_\_layout\_int\_type (`DType`): Layout index type. * ​\_\_index\_type (`DType`): Type used for indexing. * ​\_\_circular (`Bool`): Whether tensor has circular indexing semantics. ## Fields * ​runtime\_layout (`RuntimeLayout[__layout, element_type=__layout_int_type, linear_idx_type=__index_type]`): Runtime representation of the tensor's layout. This field stores the layout information that can be manipulated at runtime, particularly important for tensors with dynamic dimensions. It encapsulates: * The static layout template from `__layout` parameter * The bit width for index calculations * The appropriate index type based on address space ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Initializes a new `LayoutTensorBuild` instance with default values. ### `row_major` `row_major[*shapes: Int](self) -> LayoutTensorBuild[dtype, __layout=row_major[::Origin[::Bool(_to_int_tuple[::VariadicList[::Int]]()), __layout_init=True]` Creates a row-major layout using compile-time dimensions. **Parameters:** * ​\*shapes (`Int`): Variadic parameter specifying the dimensions of the tensor. Each value represents the size of a dimension. **Returns:** `LayoutTensorBuild` - A new builder with row-major layout. `row_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=row_major(dim, dim), __layout_init=True]` Creates a row-major 2D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. **Returns:** `LayoutTensorBuild` - A new builder with row-major layout. `row_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=row_major(dim, dim, dim), __layout_init=True]` Creates a row-major 3D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. * ​shape2 (`ValueOrUnknown[dim]`): Third dimension size. **Returns:** `LayoutTensorBuild` - A new builder with row-major layout. `row_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim], shape3: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=row_major(dim, dim, dim, dim), __layout_init=True]` Creates a row-major 4D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. * ​shape2 (`ValueOrUnknown[dim]`): Third dimension size. 
* ​shape3 (`ValueOrUnknown[dim]`): Fourth dimension size. **Returns:** `LayoutTensorBuild` - A new builder with row-major layout. `row_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim], shape3: ValueOrUnknown[dim], shape4: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=row_major(dim, dim, dim, dim, dim), __layout_init=True]` Creates a row-major 5D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. * ​shape2 (`ValueOrUnknown[dim]`): Third dimension size. * ​shape3 (`ValueOrUnknown[dim]`): Fourth dimension size. * ​shape4 (`ValueOrUnknown[dim]`): Fifth dimension size. **Returns:** `LayoutTensorBuild` - A new builder with row-major layout. ### `col_major` `col_major[*shapes: Int](self) -> LayoutTensorBuild[dtype, __layout=col_major[::Origin[::Bool(_to_int_tuple[::VariadicList[::Int]]()), __layout_init=True]` Creates a column-major layout using compile-time dimensions. **Parameters:** * ​\*shapes (`Int`): Variadic parameter specifying the dimensions of the tensor. Each value represents the size of a dimension. **Returns:** `LayoutTensorBuild` - A new builder with column-major layout. `col_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=col_major(dim, dim), __layout_init=True]` Creates a column-major 2D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. **Returns:** `LayoutTensorBuild` - A new builder with column-major layout. `col_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=col_major(dim, dim, dim), __layout_init=True]` Creates a column-major 3D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. * ​shape2 (`ValueOrUnknown[dim]`): Third dimension size. **Returns:** `LayoutTensorBuild` - A new builder with column-major layout. `col_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim], shape3: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=col_major(dim, dim, dim, dim), __layout_init=True]` Creates a column-major 4D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. * ​shape2 (`ValueOrUnknown[dim]`): Third dimension size. * ​shape3 (`ValueOrUnknown[dim]`): Fourth dimension size. **Returns:** `LayoutTensorBuild` - A new builder with column-major layout. `col_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim], shape3: ValueOrUnknown[dim], shape4: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=col_major(dim, dim, dim, dim, dim), __layout_init=True]` Creates a column-major 5D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. * ​shape2 (`ValueOrUnknown[dim]`): Third dimension size. * ​shape3 (`ValueOrUnknown[dim]`): Fourth dimension size. * ​shape4 (`ValueOrUnknown[dim]`): Fifth dimension size. **Returns:** `LayoutTensorBuild` - A new builder with column-major layout. 
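As a usage illustration of the fluent interface, here is a minimal sketch that builds one static and one dynamic layout (assuming the builder is importable from `layout.tensor_builder`; `alloc`, `view`, and `dynamic` are documented below):

```mojo
from layout.tensor_builder import LayoutTensorBuild, dynamic

fn builder_examples(n: Int):
    # Static 4x8 row-major layout: all dimensions are known at compile
    # time, so the tensor can be allocated directly.
    var static_tensor = LayoutTensorBuild[DType.float32]().row_major[4, 8]().alloc()

    # Runtime-sized 2D column-major layout: each dimension is wrapped
    # with dynamic(). Such a builder is typically paired with view()
    # over existing memory rather than alloc().
    var dyn_builder = LayoutTensorBuild[DType.float32]().col_major(dynamic(n), dynamic(8))
```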
### `layout` `layout[shape0: Int](self) -> LayoutTensorBuild[dtype, __layout=__init__[::Origin[::Bool(IntTuple(shape0)), __layout_init=True]` Creates a 1D layout with a compile-time dimension. **Parameters:** * ​shape0 (`Int`): Size of the single dimension. **Returns:** `LayoutTensorBuild` - A new builder with the specified layout. `layout[rank: Int, shape: IndexList[rank], stride: IndexList[rank]](self) -> LayoutTensorBuild[dtype, __layout=__init__[::Origin[::Bool(_to_int_tuple[::Int](shape), _to_int_tuple[::Int](stride)), __layout_init=True]` Creates a custom layout with compile-time dimensions and strides. **Parameters:** * ​rank (`Int`): Number of dimensions. * ​shape (`IndexList[rank]`): List of dimension sizes. * ​stride (`IndexList[rank]`): List of strides for each dimension. **Returns:** `LayoutTensorBuild` - A new builder with the specified custom layout. `layout[rank: Int](self, shape: IndexList[rank], stride: IndexList[rank]) -> LayoutTensorBuild[dtype, __layout=__init__[::Origin[::Bool(_to_int_tuple[::Int](-1), _to_int_tuple[::Int](-1)), __layout_init=True]` Creates a custom layout with runtime dimensions and strides. **Parameters:** * ​rank (`Int`): Number of dimensions. **Args:** * ​shape (`IndexList[rank]`): List of dimension sizes. * ​stride (`IndexList[rank]`): List of strides for each dimension. **Returns:** `LayoutTensorBuild` - A new builder with the specified custom layout. `layout(self, shape0: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=__init__[::Origin[::Bool(IntTuple(dim)), __layout_init=True]` Creates a 1D layout with a runtime dimension. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): Size of the single dimension. **Returns:** `LayoutTensorBuild` - A new builder with the specified layout. ### `shared` `shared(self) -> LayoutTensorBuild[dtype, __layout=__layout, __layout_init=__layout_init, __address_space=AddressSpace(3)]` Places the tensor in GPU shared memory. **Returns:** `LayoutTensorBuild` - A new builder with shared memory address space. ### `local` `local(self) -> LayoutTensorBuild[dtype, __layout=__layout, __layout_init=__layout_init, __address_space=AddressSpace(5)]` Places the tensor in GPU local memory. **Returns:** `LayoutTensorBuild` - A new builder with local memory address space. ### `alloc` `alloc(self) -> LayoutTensor[dtype, __layout, MutableAnyOrigin, address_space=__address_space]` Allocates a new tensor using the current layout. Note: Fails to compile if layout is not set, dimensions are not known, or tensor is circular. **Returns:** `LayoutTensor` - A newly allocated tensor with the specified layout ### `view` `view[address_space: AddressSpace](self, ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space]) -> LayoutTensor[dtype, __layout, MutableAnyOrigin, address_space=address_space, layout_int_type=__layout_int_type, linear_idx_type=__index_type]` Creates a tensor view over existing memory. Note: Fails to compile if layout is not set, address spaces don't match, or tensor is circular. **Parameters:** * ​address\_space (`AddressSpace`): Memory address space for the tensor (generic, shared, local). **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space]`): Pointer to memory region to create the view over. **Returns:** `LayoutTensor` - A tensor view over the specified memory region with the current layout. ### `circular` `circular(self) -> LayoutTensorBuild[dtype, __layout=__layout, __layout_init=__layout_init, __address_space=__address_space, __circular=True]` Enables circular indexing for the tensor. 
**Returns:** `LayoutTensorBuild` - A new builder with circular indexing enabled. ### `iter` `iter(self, ptr: UnsafePointer[SIMD[dtype, 1], address_space=__address_space], bound: Int) -> LayoutTensorIter[dtype, __layout, MutableAnyOrigin, address_space=__address_space, circular=__circular, layout_int_type=__layout_int_type, linear_idx_type=__index_type]` Creates an iterator over tensor elements. Note: Fails to compile if layout is not set or dimensions are not known. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=__address_space]`): Pointer to memory region. * ​bound (`Int`): Upper bound for iteration. **Returns:** `LayoutTensorIter` - An iterator over tensor elements. --- ## ValueOrUnknown `struct ValueOrUnknown[dim: Int = -1]` Represents either a static dimension (known at compile time) or a dynamic dimension (known at runtime). ## Parameters * ​dim (`Int`): Optional compile-time dimension value. Default is `UNKNOWN_VALUE` for dynamic dimensions. ## Fields * ​value (`Int`): The runtime value of the dimension. For static dimensions, this is set to the compile-time value. For dynamic dimensions, this is set at runtime. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initializes a static dimension with compile-time value. Note: Fails to compile if dim is `UNKNOWN_VALUE`, as dynamic dimensions require a runtime value. `@implicit` `__init__(out self, v: Int)` Initializes a dynamic dimension with runtime value. **Args:** * ​v (`Int`): Runtime value for the dimension. --- ## dynamic `dynamic(d: Int) -> ValueOrUnknown` Creates a dynamic dimension with runtime value. **Args:** * ​d (`Int`): Runtime dimension value. **Returns:** `ValueOrUnknown` - A dynamic dimension with the given value. --- ## tensor_builder Tensor Builder Module Provides a fluent interface for constructing tensors with various layouts and memory configurations. It includes utilities for creating both static (compile-time) and dynamic (runtime) tensor dimensions, supporting row-major, column-major, and custom layouts. The module enables memory placement in different address spaces (generic, shared, local) and supports features like circular indexing. Key components: * `ValueOrUnknown`: Represents static or dynamic tensor dimensions * `LayoutTensorBuild`: Builder class for tensor construction * Helper functions for dimension specification and layout creation ## Structs * [​`LayoutTensorBuild`](./LayoutTensorBuild): Tensor layout builder providing a fluent interface for constructing tensors with various layouts. * [​`ValueOrUnknown`](./ValueOrUnknown): Represents either a static dimension (known at compile time) or a dynamic dimension (known at runtime). ## Functions * [​`dynamic`](./dynamic): Creates a dynamic dimension with runtime value. * [​`static`](./static): Creates a static dimension with compile-time value. --- ## static `static[d: Int]() -> ValueOrUnknown[d]` Creates a static dimension with compile-time value. **Parameters:** * ​d (`Int`): The compile-time dimension value to use. **Returns:** `ValueOrUnknown[d]` - A static dimension with the given value. --- ## TensorCore `struct TensorCore[out_type: DType, in_type: DType, shape: IndexList[3], transpose_b: Bool = False]` TensorCore provides an abstraction for GPU tensor core hardware to perform optimized matrix operations. This struct encapsulates the functionality required to efficiently map matrix operations to Tensor Cores on NVIDIA and AMD GPUs. 
It handles loading matrix fragments, performing matrix multiply-accumulate operations, and storing results with hardware-specific optimizations. Note: Different shapes and data types are supported depending on the GPU hardware. For NVIDIA GPUs: * float32: 16×8×8 or 16×8×4 * half-precision: 16×8×16 * float8: 16×8×32 For AMD GPUs: * float32: 16×16×4 * half-precision: 16×16×16 or 32×32×8 ## Parameters * ​out\_type (`DType`): The data type for output/accumulation operations. * ​in\_type (`DType`): The data type for input matrix elements. * ​shape (`IndexList[3]`): The shape parameters for the matrix operation in the form \[M, N, K] where M×N is the output shape and K is the inner dimension. * ​transpose\_b (`Bool`): Whether to transpose the B matrix before multiplication. Defaults to False. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Aliases ### `a_reg_type` `alias a_reg_type = SIMD[in_type, num_matrix_reg[::Int,::Int]()]` ### `b_reg_type` `alias b_reg_type = SIMD[in_type, num_matrix_reg[::Int,::Int]()]` ### `c_reg_tile_type` `alias c_reg_tile_type = LayoutTensor[out_type, col_major(1, num_matrix_reg[::Int,::Int]()), MutableAnyOrigin, address_space=AddressSpace(5)]` ### `c_reg_type` `alias c_reg_type = SIMD[out_type, num_matrix_reg[::Int,::Int]()]` ### `supported_fp32` `alias supported_fp32 = (shape == IndexList(16, 8, 8, Tuple())) if is_nvidia_gpu() else (shape == IndexList(16, 16, 4, Tuple())) if (in_type is float32) else (in_type is float32)` ### `supported_fp8` `alias supported_fp8 = (shape == IndexList(16, 8, 32, Tuple())) if Tuple(VariadicPack(float8_e4m3fn, float8_e5m2)).__contains__[::EqualityComparable & ::Copyable & ::Movable](in_type) else Tuple(VariadicPack(float8_e4m3fn, float8_e5m2)).__contains__[::EqualityComparable & ::Copyable & ::Movable](in_type)` ### `supported_half` `alias supported_half = (shape == IndexList(16, 8, 16, Tuple())) if is_nvidia_gpu() else Tuple(VariadicPack(IndexList(16, 16, 16, Tuple()), IndexList(32, 32, 8, Tuple()))).__contains__[::EqualityComparable & ::Copyable & ::Movable](shape) if in_type.is_half_float() else in_type.is_half_float()` ## Methods ### `__init__` `__init__(out self)` Initialize a new TensorCore instance. ### `get_shapes` `static get_shapes[out_type: DType, in_type: DType]() -> List[IndexList[3]]` Get supported shapes for given data types. Returns a list of valid shapes for the specified output and input data types. Note: The returned shapes are hardware-dependent. Different shapes are supported for different combinations of input and output types. **Parameters:** * ​out\_type (`DType`): The output/accumulation data type. * ​in\_type (`DType`): The input matrix data type. **Returns:** List\[IndexList\[3]]: Valid shapes for the matrix operations given the specified types. ### `load_a` `load_a[swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1})](self, a: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[in_type, _get_a_reg_tile_layout[::Layout,::IndexList[::Int(), MutableAnyOrigin, address_space=AddressSpace(5)]` Load the A matrix fragments. Loads matrix A from memory into a LayoutTensor suitable for tensor core operations. **Parameters:** * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzle pattern for optimal memory access (AMD only). 
**Args:** * ​a (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source matrix A data. **Returns:** The loaded matrix fragments as a `LayoutTensor`. `load_a[swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1})](self, warp_tile: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], fragments: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], mma_tile_coord_k: UInt = UInt(0))` Load A matrix fragments from shared memory. An optimized variant that loads A matrix fragments from a warp tile in shared memory into register fragments. **Parameters:** * ​swizzle (`OptionalReg[Swizzle]`): Optional memory access pattern to optimize memory bandwidth. **Args:** * ​warp\_tile (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source data in shared memory. * ​fragments (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor for fragments. * ​mma\_tile\_coord\_k (`UInt`): The K coordinate of the MMA tile. Defaults to 0. ### `load_b` `load_b[swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1})](self, b: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[in_type, _get_b_reg_tile_layout[::Layout,::IndexList[::Int(), MutableAnyOrigin, address_space=AddressSpace(5)]` Load the B matrix fragments. Loads matrix B from memory into a `LayoutTensor` suitable for tensor core operations. The function handles different hardware architectures and memory access patterns. Note: If `transpose_b` is `True`, the B matrix will be transposed during loading. This is more efficient than transposing the matrix in memory. **Parameters:** * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzle pattern for optimal memory access (AMD only). Will cause an error if used with NVIDIA GPUs. **Args:** * ​b (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source matrix B data. **Returns:** The loaded matrix fragments as a `LayoutTensor`.
`load_b[swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1})](self, warp_tile: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], fragments: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], mma_tile_coord_k: UInt = UInt(0), warp_tile_coord_n: UInt = UInt(0))` Load B matrix fragments from shared memory into registers for tensor core operations. This function loads matrix B fragments from a warp tile in shared memory into register fragments for use in tensor core matrix multiply operations. It handles hardware-specific optimizations for both NVIDIA and AMD GPUs. Note: The `warp_tile` must be in shared memory. For NVIDIA GPUs, `swizzle` must be `None`. For AMD GPUs, providing an appropriate `swizzle` pattern can improve performance. **Parameters:** * ​swizzle (`OptionalReg[Swizzle]`): Optional memory access pattern for AMD GPUs to optimize memory bandwidth. Must be `None` on NVIDIA GPUs, where swizzling is always applied. **Args:** * ​warp\_tile (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Source `LayoutTensor` in shared memory containing the B matrix data. * ​fragments (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Destination `LayoutTensor` to store the loaded matrix fragments. * ​mma\_tile\_coord\_k (`UInt`): K-dimension coordinate within the warp tile. Defaults to 0. * ​warp\_tile\_coord\_n (`UInt`): N-dimension coordinate within the warp tile. Defaults to 0. `load_b(self, warp_tile: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], fragments: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], scales: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], mma_tile_coord_k: UInt = UInt(0))` Load quantized B matrix fragments from shared memory with dequantization. This function loads int4 quantized matrix B fragments from shared memory, dequantizes them using the provided scales, and stores the result in register fragments for tensor core operations. Notes: * The `warp_tile` must be in shared memory. * The `fragments` and `scales` must be in local memory. * This function only supports half-precision data types (bfloat16, float16). * The quantized data is stored as int4 values packed into int32 elements. * Each thread processes multiple fragments by unpacking and dequantizing the int4 values.
**Args:** * ​warp\_tile (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Source `LayoutTensor` in shared memory containing the quantized B matrix data. * ​fragments (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Destination `LayoutTensor` to store the dequantized matrix fragments. * ​scales (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): `LayoutTensor` containing the scaling factors for dequantization. * ​mma\_tile\_coord\_k (`UInt`): K-dimension coordinate within the warp tile. Defaults to 0. ### `load_c` `load_c(self, c: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[out_type, col_major(1, num_matrix_reg[::Int,::Int]()), MutableAnyOrigin, address_space=AddressSpace(5)]` Load the C matrix fragments. Loads matrix C from memory into a `LayoutTensor` suitable for tensor core operations. The function handles different hardware architectures and memory access patterns. **Args:** * ​c (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source matrix C data. **Returns:** The loaded matrix fragments as a `LayoutTensor`. ### `store_d` `store_d(self, d_dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], d_src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Store matrix D to destination memory. Stores the result matrix D from tensor core computation to the destination memory. **Args:** * ​d\_dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor to store the result. * ​d\_src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor containing the computed result. 
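The loaders above and `mma_op` (below) are typically combined into a single per-warp step. The following sketch assumes the NVIDIA float32 16×8×8 configuration and a `layout.tensor_core` import path; kernel scaffolding and layout setup are elided:

```mojo
from layout import Layout, LayoutTensor
from layout.tensor_core import TensorCore
from utils.index import Index

fn mma_step[
    a_layout: Layout, b_layout: Layout, c_layout: Layout
](
    a: LayoutTensor[DType.float32, a_layout, MutableAnyOrigin],
    b: LayoutTensor[DType.float32, b_layout, MutableAnyOrigin],
    c: LayoutTensor[DType.float32, c_layout, MutableAnyOrigin],
):
    # 16x8x8 MMA with float32 inputs accumulating into float32.
    var mma = TensorCore[DType.float32, DType.float32, Index(16, 8, 8)]()

    # Load per-thread fragments of A, B, and C, multiply-accumulate,
    # then write the result tile back out.
    var a_frag = mma.load_a(a)
    var b_frag = mma.load_b(b)
    var c_frag = mma.load_c(c)
    var d_frag = mma.mma_op(a_frag, b_frag, c_frag)
    mma.store_d(c, d_frag)
```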
### `mma_op` `mma_op(self, a: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], b: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[out_type, col_major(1, num_matrix_reg[::Int,::Int]()), MutableAnyOrigin, address_space=AddressSpace(5)]` Perform matrix multiply-accumulate operation (MMA). Executes `D = A * B + C` using tensor cores. **Args:** * ​a (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The A matrix input. * ​b (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The B matrix input. * ​c (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The C matrix input for accumulation. **Returns:** `Self.c_reg_tile_type`: The result of the MMA operation. ### `mma` `mma(self, a_frag: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], b_frag: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c_frag: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Perform matrix multiply-accumulate operation using tensor cores. Executes C = A \* B + C using tensor cores, where A, B, and C are matrix fragments stored in register memory. This function handles the mapping of fragments to hardware tensor core operations. Notes: * All fragments must be properly loaded using the corresponding load functions. * The function assumes fragments are vectorized layout tensors with dimensions num\_vectors x 1. * The c\_frag shape\[0] must equal num\_m\_mmas \* num\_n\_mmas. * The result is accumulated in-place in c\_frag. **Args:** * ​a\_frag (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix A fragments as a `LayoutTensor`. * ​b\_frag (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix B fragments as a `LayoutTensor`. 
* ​c\_frag (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix C fragments as a `LayoutTensor` for both input and output. --- ## get_fragment_size `get_fragment_size[mma_shape: IndexList[3]]() -> IndexList[3]` Calculates the fragment size per thread for a given MMA shape. For tensor core operations, each thread in a warp handles a portion of the computation. This function determines how many elements each thread needs to process for the A, B, and C/D matrices based on the MMA shape. **Parameters:** * ​mma\_shape (`IndexList[3]`): An `IndexList[3]` containing the MMA dimensions \[M, N, K]. **Returns:** An `IndexList[3]` containing the fragment sizes per thread for matrices A, B, and C/D respectively, calculated as: `[M*K/WARP_SIZE, N*K/WARP_SIZE, M*N/WARP_SIZE]`. --- ## get_mma_shape `get_mma_shape[input_type: DType, accum_type: DType, shape_id: Int = 0]() -> IndexList[3]` Returns the appropriate matrix multiply-accumulate (MMA) shape for tensor core operations. Selects the optimal MMA shape based on the GPU architecture, input data type, accumulation data type, and optional shape identifier. This function handles different configurations for both NVIDIA and AMD GPUs. **Parameters:** * ​input\_type (`DType`): The data type of the input matrices (A and B). * ​accum\_type (`DType`): The data type used for accumulation (C and D). * ​shape\_id (`Int`): Optional identifier to select between multiple valid shapes (default: 0). **Returns:** An `IndexList[3]` containing the MMA dimensions in the format `[M, N, K]`, where `M×N` is the output matrix size and `K` is the reduction dimension. --- ## tensor_core Tensor Core Module for High-Performance Matrix Operations Provides abstractions for using GPU Tensor Cores to perform optimized matrix operations. It supports both NVIDIA and AMD GPU architectures with hardware-specific optimizations. ## Key Components: * `TensorCore`: Core struct that encapsulates tensor core operations with support for various data types and matrix shapes. It handles loading matrix fragments, performing matrix multiply-accumulate operations, and storing results. * Matrix Fragment Management: Functions for loading and storing matrix fragments to/from shared memory with hardware-specific optimizations. * Matrix Multiply-Accumulate (MMA): Optimized implementations of matrix multiplication operations using tensor cores. 
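As a quick illustration of how `get_mma_shape` and `get_fragment_size` (documented above) compose, this sketch selects a hardware MMA shape for given data types and derives the per-thread fragment sizes (import path assumed; the printed values correspond to an NVIDIA GPU with a 32-thread warp):

```mojo
from layout.tensor_core import get_fragment_size, get_mma_shape

fn main():
    # Choose the MMA tile for bfloat16 inputs accumulating in float32.
    alias mma_shape = get_mma_shape[DType.bfloat16, DType.float32]()
    # Per-thread element counts for A, B, and C/D:
    # [M*K/32, N*K/32, M*N/32] = [8, 4, 4] for a 16x8x16 shape.
    alias frag_size = get_fragment_size[mma_shape]()
    print(mma_shape)  # e.g. [16, 8, 16]
    print(frag_size)  # e.g. [8, 4, 4]
```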
## Supported Operations: * Matrix loading with various layouts and swizzling patterns * Matrix multiply-accumulate (D = A \* B + C) * Matrix storing with hardware-specific optimizations ## Supported Data Types: * NVIDIA: float32, bfloat16, float16, float8\_e4m3fn, float8\_e5m2 * AMD: float32, bfloat16, float16 ## Supported Matrix Shapes: * NVIDIA: 16×8×8, 16×8×4, 16×8×16, 8×8×4, 16×8×32 * AMD: 16×16×4, 16×16×16, 32×32×8 ## Aliases ### `shape_16x16x16` `alias shape_16x16x16 = IndexList(16, 16, 16, Tuple())` ### `shape_16x16x4` `alias shape_16x16x4 = IndexList(16, 16, 4, Tuple())` ### `shape_16x8x16` `alias shape_16x8x16 = IndexList(16, 8, 16, Tuple())` ### `shape_16x8x32` `alias shape_16x8x32 = IndexList(16, 8, 32, Tuple())` ### `shape_16x8x4` `alias shape_16x8x4 = IndexList(16, 8, 4, Tuple())` ### `shape_16x8x8` `alias shape_16x8x8 = IndexList(16, 8, 8, Tuple())` ### `shape_32x32x8` `alias shape_32x32x8 = IndexList(32, 32, 8, Tuple())` ### `shape_8x8x4` `alias shape_8x8x4 = IndexList(8, 8, 4, Tuple())` ### `shape_null` `alias shape_null = IndexList(0, 0, 0, Tuple())` ## Structs * [​`TensorCore`](./TensorCore): TensorCore provides an abstraction for GPU tensor core hardware to perform optimized matrix operations. ## Functions * [​`get_fragment_size`](./get_fragment_size): Calculates the fragment size per thread for a given MMA shape. * [​`get_mma_shape`](./get_mma_shape): Returns the appropriate matrix multiply-accumulate (MMA) shape for tensor core operations. * [​`num_matrix_reg`](./num_matrix_reg): Calculates the number of matrix registers required per thread. --- ## num_matrix_reg `num_matrix_reg[dim_1: Int, dim_2: Int]() -> Int` Calculates the number of matrix registers required per thread. Determines how many registers each thread in a warp needs to store a matrix of the given dimensions. This is calculated by dividing the total number of elements (dim\_1 \* dim\_2) by the warp size, as the matrix is distributed across all threads in the warp. **Parameters:** * ​dim\_1 (`Int`): First dimension of the matrix. * ​dim\_2 (`Int`): Second dimension of the matrix. **Returns:** The number of matrix registers needed per thread. --- ## TensorCoreAsync `struct TensorCoreAsync[c_type: DType, a_type: DType, b_type: DType, mma_shape: IndexList[3], /, a_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), b_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), transpose_b: Bool = False]` High-performance asynchronous tensor core operations for matrix multiplication. This struct provides methods for utilizing NVIDIA's Tensor Cores for asynchronous matrix multiplication operations, with support for various data types and swizzling configurations. ## Parameters * ​c\_type (`DType`): Data type of the output matrix C. * ​a\_type (`DType`): Data type of the input matrix A. * ​b\_type (`DType`): Data type of the input matrix B. * ​mma\_shape (`IndexList[3]`): Dimensions for the matrix multiply-accumulate (MMA) operation as \[M, N, K]. * ​a\_swizzle (`TensorMapSwizzle`): Swizzling mode for matrix A (default: SWIZZLE\_NONE). * ​b\_swizzle (`TensorMapSwizzle`): Swizzling mode for matrix B (default: SWIZZLE\_NONE). * ​transpose\_b (`Bool`): Whether to transpose matrix B (default: False). ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initialize the `TensorCoreAsync` instance. Ensures that the provided MMA shape is supported. 
Note: Fails to compile if `mma_shape` is not supported. ### `wgmma` `static wgmma[num_warp_groups: Int = 1, scale_c: Int = 1, scale_a: Int = 1, scale_b: Int = 1](a_smem_tile: LayoutTensor[a_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], b_smem_tile: LayoutTensor[b_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c_reg_tile: LayoutTensor[c_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], wg_idx: Int = 0)` Perform asynchronous matrix multiplication using warp group matrix multiply-accumulate (WGMMA). This method handles the case where both A and B matrices are in shared memory. **Parameters:** * ​num\_warp\_groups (`Int`): Number of warp groups to distribute work across (default: 1). * ​scale\_c (`Int`): Scale factor for matrix C. Valid values are 1 or 0 (default: 1). * ​scale\_a (`Int`): Scale factor for matrix A. Valid values are 1 or -1 (default: 1). * ​scale\_b (`Int`): Scale factor for matrix B. Valid values are 1 or -1 (default: 1). **Args:** * ​a\_smem\_tile (`LayoutTensor[a_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix A in shared memory. * ​b\_smem\_tile (`LayoutTensor[b_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix B in shared memory. * ​c\_reg\_tile (`LayoutTensor[c_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Output matrix C in register memory. * ​wg\_idx (`Int`): Warp group index for multi-warp group scenarios (default: 0). `static wgmma(a_frag_tile: LayoutTensor[a_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], b_smem_tile: LayoutTensor[b_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c_reg_tile: LayoutTensor[c_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Perform asynchronous matrix multiplication using warp group matrix multiply-accumulate (WGMMA). This overloaded method handles the case where matrix A is in register memory and matrix B is in shared memory. **Args:** * ​a\_frag\_tile (`LayoutTensor[a_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix A in register memory. 
* ​b\_smem\_tile (`LayoutTensor[b_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix B in shared memory. * ​c\_reg\_tile (`LayoutTensor[c_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Output matrix C in register memory. ### `arrive` `static arrive()` Ensures memory consistency by creating a fence for WGMMA operations. This method should be called before committing a group to ensure all shared memory accesses are properly aligned and visible. ### `commit_group` `static commit_group()` Commits the current warp group for execution. This synchronizes the warp group and commits all pending WGMMA operations that have been previously issued. ### `wait_group` `static wait_group[group: Int = 0]()` Waits for the completion of a specific warp group's operations. This method blocks until all WGMMA operations from the specified group are complete. **Parameters:** * ​group (`Int`): The group ID to wait for (default: 0). --- ## tensor_core_async Tensor Core Async Module This module provides high-performance abstractions for utilizing NVIDIA's Tensor Cores to perform asynchronous matrix multiplication operations. It implements optimized memory layouts and access patterns for efficient tensor core computations. Key components: * Layout creation functions for K-major and MN-major memory arrangements * Swizzling support for improved memory access patterns * WGMMA (Warp Group Matrix Multiply-Accumulate) descriptor generation * TensorCoreAsync struct with methods for asynchronous matrix multiplication The module supports various data types, matrix dimensions, and memory configurations, enabling efficient implementation of deep learning primitives and other tensor operations that can leverage hardware acceleration. Performance features: * Asynchronous execution model to overlap computation and memory access * Support for different swizzling modes to optimize memory bandwidth * Efficient register and shared memory utilization * Support for multi-warp group execution This implementation is specifically optimized for NVIDIA GPUs with Tensor Core support. ## Aliases ### `WGMMA_K_BYTES` `alias WGMMA_K_BYTES = 32` ## Structs * [​`TensorCoreAsync`](./TensorCoreAsync): High-performance asynchronous tensor core operations for matrix multiplication. ## Functions * [​`select_k_atom`](./select_k_atom): Creates a core matrix layout for tensor core operations. * [​`st_matrix_n_atom`](./st_matrix_n_atom): Creates a layout for N-major `st_matrix` atom in the context of WGMMA C matrix. * [​`st_matrix_n_layout`](./st_matrix_n_layout): Creates a layout for N-major `st_matrix` in the context of WGMMA C matrix. * [​`tile_layout_k_major`](./tile_layout_k_major): Creates a K-major layout for tensor core operations. * [​`tile_layout_mn_major`](./tile_layout_mn_major): Creates an MN-major layout for tensor core operations. * [​`tile_to_descriptor`](./tile_to_descriptor): Transforms a layout into a WGMMA descriptor-compatible layout. * [​`wgmma_c_layout`](./wgmma_c_layout): Generates three layouts for mapping WGMMA C matrix coordinates. * [​`wgmma_c_thread_layout`](./wgmma_c_thread_layout): Returns the thread layout component for WGMMA C matrix. * [​`wgmma_output_layout`](./wgmma_output_layout): Returns the output layout component for WGMMA C matrix. 
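To tie the pieces together, here is a sketch of one consumer iteration of the issue/commit/wait pattern using `TensorCoreAsync`. The 64×64×16 bfloat16 shape, the import paths, and the elided producer/barrier code are all assumptions for illustration:

```mojo
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.tensor_core_async import TensorCoreAsync
from utils.index import Index

fn wgmma_step[
    a_layout: Layout, b_layout: Layout, c_layout: Layout
](
    a_smem: LayoutTensor[
        DType.bfloat16, a_layout, MutableAnyOrigin,
        address_space = AddressSpace.SHARED,
    ],
    b_smem: LayoutTensor[
        DType.bfloat16, b_layout, MutableAnyOrigin,
        address_space = AddressSpace.SHARED,
    ],
    c_reg: LayoutTensor[
        DType.float32, c_layout, MutableAnyOrigin,
        address_space = AddressSpace.LOCAL,
    ],
):
    alias mma = TensorCoreAsync[
        DType.float32, DType.bfloat16, DType.bfloat16, Index(64, 64, 16)
    ]
    mma.arrive()        # fence: make shared-memory writes visible to WGMMA
    mma.wgmma(a_smem, b_smem, c_reg)  # issue the async multiply-accumulate
    mma.commit_group()  # commit all pending WGMMA operations
    mma.wait_group()    # block until the committed group completes
```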
--- ## select_k_atom `select_k_atom[type: DType, swizzle_mode: TensorMapSwizzle]() -> Layout` Creates a core matrix layout for tensor core operations. Constructs the fundamental atomic layout for tensor core operations based on the specified data type and swizzle mode. This layout represents the minimal dense matrix structure that can be efficiently processed by tensor cores. **Parameters:** * ​type (`DType`): Element data type of the tensor. * ​swizzle\_mode (`TensorMapSwizzle`): Memory access pattern swizzling mode. **Returns:** `Layout` - A core matrix layout optimized for tensor core operations. --- ## st_matrix_n_atom `st_matrix_n_atom[num_stmatrix: Int]() -> Layout` Creates a layout for N-major `st_matrix` atom in the context of WGMMA C matrix. The domain of this layout is the warp group local thread index. Thus, the layout takes \[0, 128) as input and returns an offset for a logical array with an element size of 128-bit. **Parameters:** * ​num\_stmatrix (`Int`): Number of N-dimension tiles in the C matrix. **Returns:** `Layout` - A layout that maps warp group local thread index to an offset for a logical array with an element size of 128-bit. --- ## st_matrix_n_layout `st_matrix_n_layout[c_type: DType, WG_BN: Int, num_m_mmas: Int, num_consumer: Int]() -> Layout` Creates a layout for N-major `st_matrix` in the context of WGMMA C matrix. The layout modes are: the warp group local thread index, the N-dimension tiling size `WG_BN // 16`, the number of MMA tiles `num_m_mmas` in the M-dimension, and the number of consumers `num_consumer`. The output is an offset for a logical array with the element type `c_type`. **Parameters:** * ​c\_type (`DType`): Data type of the C matrix. * ​WG\_BN (`Int`): Size of the K dimension in the C matrix in shared memory. * ​num\_m\_mmas (`Int`): Number of MMA tiles in the M dimension. * ​num\_consumer (`Int`): Number of consumers. **Returns:** `Layout` - A layout that maps warp group local thread index to an offset for a logical array with the element type `c_type`. --- ## tile_layout_k_major `tile_layout_k_major[type: DType, BM: Int, BK: Int, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0))]() -> Layout` Creates a K-major layout for tensor core operations. Constructs a layout optimized for K-major access patterns in tensor core operations, with optional swizzling for improved memory access patterns. **Parameters:** * ​type (`DType`): Element data type of the tensor. * ​BM (`Int`): Size of the M dimension in the tile. * ​BK (`Int`): Size of the K dimension in the tile. * ​swizzle\_mode (`TensorMapSwizzle`): Memory access pattern swizzling mode (default: SWIZZLE\_NONE). **Returns:** `Layout` - A K-major layout configured for the specified dimensions and swizzle mode. --- ## tile_layout_mn_major `tile_layout_mn_major[type: DType, mn_dim: Int, k_dim: Int, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0))]() -> Layout` Creates an MN-major layout for tensor core operations. Constructs a unit layout optimized for MN-major access patterns in shared memory, with optional swizzling for improved memory access patterns. Note: This returns the "unit" layout; the actual shared memory layout can be a multiple of this unit. Currently only supports SWIZZLE\_NONE and SWIZZLE\_128B modes. **Parameters:** * ​type (`DType`): Element data type of the tensor. * ​mn\_dim (`Int`): Size of the MN dimension. * ​k\_dim (`Int`): Size of the K dimension. 
* ​swizzle\_mode (`TensorMapSwizzle`): Memory access pattern swizzling mode (default: SWIZZLE\_NONE). **Returns:** `Layout` - An MN-major layout configured for the specified dimensions and swizzle mode. --- ## tile_to_descriptor `tile_to_descriptor[type: DType, layout: Layout, is_k_major: Bool = True]() -> Layout` Transforms a layout into a WGMMA descriptor-compatible layout. Converts a standard layout into a form that can be used with WGMMA descriptors, handling both K-major and MN-major layouts differently. **Parameters:** * ​type (`DType`): Element data type of the tensor. * ​layout (`Layout`): Input layout to transform. * ​is\_k\_major (`Bool`): Whether the layout is K-major (True) or MN-major (False). **Returns:** `Layout` - A transformed layout compatible with WGMMA descriptors. --- ## wgmma_c_layout `wgmma_c_layout[mma_m: Int, mma_n: Int, C: Layout]() -> List[Layout]` Generates three layouts for mapping WGMMA C matrix coordinates. This function creates three layout mappings that are essential for working with WGMMA (Warp Group Matrix Multiply-Accumulate) operations: 1. A projection layout that maps linearized indices to row coordinates (i) 2. A projection layout that maps linearized indices to column coordinates (j) 3. A composite layout that maps thread and vector coordinates to linearized indices across multiple MMA tiles These layouts are particularly useful for operations like attention masking and matrix multiplication epilogues, where register values need to be mapped to the coordinate system of the C matrix. Note: This function enforces constraints on the WGMMA dimensions and ensures the C matrix dimensions are compatible with the WGMMA instruction size. **Parameters:** * ​mma\_m (`Int`): The M dimension (rows) of a single WGMMA instruction, must be 64. * ​mma\_n (`Int`): The N dimension (columns) of a single WGMMA instruction, must be a multiple of 8. * ​C (`Layout`): The layout of the C matrix within a thread block. **Returns:** `List[Layout]` - A list containing three layouts: 1. proj\_i: Maps linearized indices to row coordinates 2. proj\_j: Maps linearized indices to column coordinates 3. TV\_tile\_to\_idx: Maps thread/vector/tile coordinates to linearized indices --- ## wgmma_c_thread_layout `wgmma_c_thread_layout[C: Layout]() -> Layout` Returns the thread layout component for WGMMA C matrix. Generates the first mode of the WGMMA C layout, which maps thread coordinates to linearized indices in the output matrix. **Parameters:** * ​C (`Layout`): The layout of the C matrix. **Returns:** `Layout` - A layout mapping thread coordinates to linearized indices. --- ## wgmma_output_layout `wgmma_output_layout[mma_n: Int, C: Layout]() -> Layout` Returns the output layout component for WGMMA C matrix. Generates the second mode of the WGMMA C layout, which maps output vector coordinates to linearized indices in the output matrix. **Parameters:** * ​mma\_n (`Int`): The N dimension of the WGMMA instruction. * ​C (`Layout`): The layout of the C matrix. **Returns:** `Layout` - A layout mapping output vector coordinates to linearized indices. --- ## PipelineState `@register_passable(trivial)` `struct PipelineState[num_stages: Int]` Manages state for a multi-stage pipeline with circular buffer semantics. PipelineState provides a mechanism for tracking the current stage in a multi-stage pipeline, particularly useful for double or triple buffering in GPU tensor operations.
It maintains an index that cycles through the available stages, a phase bit that toggles when the index wraps around, and a monotonically increasing count. This struct is commonly used with TMA operations to coordinate the use of multiple buffers in a pipeline fashion, allowing for overlapping computation and data transfer. ## Parameters * ​num\_stages (`Int`): The number of stages in the pipeline (e.g., 2 for double buffering, 3 for triple buffering). ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Initialize a PipelineState with default values. Creates a new PipelineState with index 0, phase 0, and count 0. `__init__(index: Int, phase: Int, count: Int) -> Self` Initialize a PipelineState with specific values. Creates a new PipelineState with the specified index, phase, and count. **Args:** * ​index (`Int`): The initial stage index. * ​phase (`Int`): The initial phase value (0 or 1). * ​count (`Int`): The initial count value. ### `index` `index(self) -> Int` Get the current stage index. **Returns:** The current index value, which ranges from 0 to num\_stages-1. ### `phase` `phase(self) -> SIMD[uint32, 1]` Get the current phase bit. **Returns:** The current phase value (0 or 1), which toggles when the index wraps around. ### `step` `step(mut self)` Advance the pipeline state to the next stage. Increments the index and count. When the index reaches num\_stages, it wraps around to 0 and toggles the phase bit. This function is used to move to the next buffer in a multi-buffer pipeline, implementing circular buffer semantics. --- ## SharedMemBarrier `@register_passable(trivial)` `struct SharedMemBarrier` A hardware-accelerated synchronization primitive for GPU shared memory operations. This struct provides a barrier mechanism optimized for coordinating thread execution and memory transfers in GPU kernels, particularly for Tensor Memory Accelerator (TMA) operations. It enables efficient synchronization between threads and memory operations by leveraging hardware-specific barrier instructions. Key features: * Thread synchronization across thread blocks * Memory transfer completion tracking * Hardware-accelerated barrier operations * Support for phased synchronization This barrier is particularly useful for ensuring that shared memory operations complete before dependent computations begin, which is critical for maintaining data consistency in high-performance GPU kernels. ## Fields * ​mbar (`SIMD[int64, 1]`): Shared memory location used for the barrier state. This field stores an 8-byte aligned shared memory location that maintains the state of the barrier. The memory must be in shared address space to be accessible by all threads in a block. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `init` `init(ref [3] self, num_threads: SIMD[int32, 1] = __init__[__mlir_type.!pop.int_literal](1))` Initialize the barrier state with the expected number of threads. Sets up the barrier to expect arrivals from the specified number of threads before it can be satisfied. This is essential for coordinating thread synchronization in GPU kernels. **Args:** * ​num\_threads (`SIMD[int32, 1]`): Number of threads that must arrive at the barrier before it is satisfied. Defaults to 1. ### `expect_bytes` `expect_bytes(ref [3] self, bytes: SIMD[int32, 1])` Configure the barrier to expect a specific number of bytes to be transferred. 
Used with TMA operations to indicate the expected size of data transfer. The barrier will be satisfied when the specified number of bytes has been transferred, enabling efficient coordination of memory operations. **Args:** * ​bytes (`SIMD[int32, 1]`): Number of bytes expected to be transferred. ### `expect_bytes_relaxed` `expect_bytes_relaxed(ref [3] self, bytes: SIMD[int32, 1]) -> SIMD[uint64, 1]` Configure the barrier to expect a specific number of bytes to be transferred. Used with TMA operations to indicate the expected size of data transfer. The barrier will be satisfied when the specified number of bytes has been transferred, enabling efficient coordination of memory operations. **Args:** * ​bytes (`SIMD[int32, 1]`): Number of bytes expected to be transferred. **Returns:** The barrier's state value. ### `wait` `wait(ref [3] self, phase: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0))` Wait until the barrier is satisfied. Blocks the calling thread until the barrier is satisfied, either by the expected number of threads arriving or the expected data transfer completing. This method implements an efficient spin-wait mechanism optimized for GPU execution. Note: Minimizes thread divergence during synchronization by using hardware-accelerated barrier instructions. **Args:** * ​phase (`SIMD[uint32, 1]`): The phase value to check against. Defaults to 0. ### `wait_acquire` `wait_acquire[scope: Scope](ref [3] self, phase: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0))` Acquire and wait until the barrier is satisfied. Blocks the calling thread until the barrier is satisfied, either by the expected number of threads arriving or the expected data transfer completing. This method implements an efficient spin-wait mechanism optimized for GPU execution. Note: Minimizes thread divergence during synchronization by using hardware-accelerated barrier instructions. **Parameters:** * ​scope (`Scope`): The scope of the barrier. **Args:** * ​phase (`SIMD[uint32, 1]`): The phase value to check against. Defaults to 0. ### `wait_relaxed` `wait_relaxed[scope: Scope](ref [3] self, phase: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0))` Wait until the barrier is satisfied with relaxed ordering. Blocks the calling thread until the barrier is satisfied, either by the expected number of threads arriving or the expected data transfer completing. This method implements an efficient spin-wait mechanism optimized for GPU execution. Note: Minimizes thread divergence during synchronization by using hardware-accelerated barrier instructions. **Parameters:** * ​scope (`Scope`): The scope of the barrier. **Args:** * ​phase (`SIMD[uint32, 1]`): The phase value to check against. Defaults to 0. ### `unsafe_ptr` `unsafe_ptr(ref [3] self) -> UnsafePointer[SIMD[int64, 1], address_space=AddressSpace(3), alignment=8, mut=self_is_mut, origin=self_is_origin]` Get an unsafe pointer to the barrier's memory location. Provides low-level access to the shared memory location storing the barrier state. This method is primarily used internally by other barrier operations that need direct access to the underlying memory. **Returns:** An unsafe pointer to the barrier's memory location in shared memory, properly typed and aligned for barrier operations. ### `arrive_cluster` `arrive_cluster(ref [3] self, cta_id: SIMD[uint32, 1], count: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](1))` Signal arrival at the barrier from a specific CTA (Cooperative Thread Array) in a cluster. 
This method is used in multi-CTA scenarios to coordinate barrier arrivals across different CTAs within a cluster. It enables efficient synchronization across thread blocks in clustered execution models. **Args:** * ​cta\_id (`SIMD[uint32, 1]`): The ID of the CTA (Cooperative Thread Array) that is arriving. * ​count (`SIMD[uint32, 1]`): The number of arrivals to signal. Defaults to 1. ### `arrive` `arrive(ref [3] self) -> Int` Signal arrival at the barrier and return the arrival count. This method increments the arrival count at the barrier and returns the updated count. It's used to track how many threads have reached the synchronization point. **Returns:** The updated arrival count after this thread's arrival. --- ## TMATensorTile `struct TMATensorTile[dtype: DType, layout: Layout, desc_layout: Layout = layout]` A hardware-accelerated tensor memory access (TMA) tile for efficient asynchronous data movement. The TMATensorTile struct provides a high-performance interface for asynchronous data transfers between global memory and shared memory in GPU tensor operations. It encapsulates a TMA descriptor that defines the memory access pattern and provides methods for various asynchronous operations. Performance: * Hardware-accelerated memory transfers using TMA instructions * Supports prefetching of descriptors for latency hiding * Enforces 128-byte alignment requirements for optimal memory access ## Parameters * ​dtype (`DType`): The data type of the tensor elements. * ​layout (`Layout`): The layout of the tile in shared memory, typically specified as row\_major. * ​desc\_layout (`Layout`): The layout of the descriptor, which can be different from the shared memory layout to accommodate hardware requirements like WGMMA. Defaults to `layout`. ## Fields * ​descriptor (`TMADescriptor`): The TMA descriptor that defines the memory access pattern. This field stores the hardware descriptor that encodes information about: * The source tensor's memory layout and dimensions * The tile shape and access pattern * Swizzling configuration for optimal memory access The descriptor is used by the GPU's Tensor Memory Accelerator hardware to efficiently transfer data between global and shared memory. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, descriptor: TMADescriptor)` Initializes a new TMATensorTile with the provided TMA descriptor. **Args:** * ​descriptor (`TMADescriptor`): The TMA descriptor that defines the memory access pattern. ### `__copyinit__` `__copyinit__(out self, other: Self)` Copy initializes this `TMATensorTile` from another instance. **Args:** * ​other (`Self`): The other `TMATensorTile` instance to copy from. ### `prefetch_descriptor` `prefetch_descriptor(self)` Prefetches the TMA descriptor into cache to reduce latency. This method helps hide memory access latency by prefetching the descriptor before it's needed for actual data transfers. ### `async_copy` `async_copy[cta_group: Int = 1](self, dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ref [3] mem_barrier: SharedMemBarrier, coords: Tuple[UInt, UInt])` Schedules an asynchronous copy from global memory to shared memory at specified coordinates. This method initiates a hardware-accelerated asynchronous transfer of data from global memory to the specified destination in shared memory. 
The transfer is tracked by the provided memory barrier. **Constraints:** * The destination tensor must be 128-byte aligned in shared memory. * The descriptor layout may be smaller than the shared memory tile shape to accommodate hardware requirements. **Parameters:** * ​cta\_group (`Int`): If the TMA is issued with cta\_group == 2, only the leader CTA needs to be notified upon completion. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in shared memory where data will be copied. Must be 128-byte aligned. * ​mem\_barrier (`SharedMemBarrier`): The memory barrier used to track and synchronize the asynchronous transfer. * ​coords (`Tuple[UInt, UInt]`): The 2D coordinates in the source tensor from which to copy data. ### `async_copy_3d` `async_copy_3d(self, dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ref [3] mem_barrier: SharedMemBarrier, coords: Tuple[UInt, UInt, UInt])` Schedules an asynchronous copy from global memory to shared memory at specified 3D coordinates. This method initiates a hardware-accelerated asynchronous transfer of data from global memory to the specified destination in shared memory for 3D tensors. The transfer is tracked by the provided memory barrier. **Constraints:** * The destination tensor must be 128-byte aligned in shared memory. * The descriptor layout may be smaller than the shared memory tile shape to accommodate hardware requirements. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in shared memory where data will be copied. Must be 128-byte aligned. * ​mem\_barrier (`SharedMemBarrier`): The memory barrier used to track and synchronize the asynchronous transfer. * ​coords (`Tuple[UInt, UInt, UInt]`): The 3D coordinates in the source tensor from which to copy data. ### `async_multicast_load` `async_multicast_load[cta_group: Int = 1](self, dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ref [3] mem_barrier: SharedMemBarrier, coords: Tuple[UInt, UInt], multicast_mask: SIMD[uint16, 1])` Schedules an asynchronous multicast load from global memory to multiple shared memory locations. This method initiates a hardware-accelerated asynchronous transfer of data from global memory to multiple destination locations in shared memory across different CTAs (Cooperative Thread Arrays) as specified by the multicast mask. **Constraints:** The destination tensor must be 128-byte aligned in shared memory. **Parameters:** * ​cta\_group (`Int`): If the TMA is issued with cta\_group == 2, only the leader CTA needs to be notified upon completion. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in shared memory where data will be copied. Must be 128-byte aligned. 
* ​mem\_barrier (`SharedMemBarrier`): The memory barrier used to track and synchronize the asynchronous transfer. * ​coords (`Tuple[UInt, UInt]`): The 2D coordinates in the source tensor from which to copy data. * ​multicast\_mask (`SIMD[uint16, 1]`): A bit mask specifying which CTAs should receive the data. ### `async_store` `async_store(self, src: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], coords: Tuple[UInt, UInt])` Schedules an asynchronous store from shared memory to global memory. This method initiates a hardware-accelerated asynchronous transfer of data from shared memory to global memory at the specified coordinates. **Constraints:** The source tensor must be 128-byte aligned in shared memory. **Args:** * ​src (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in shared memory from which data will be copied. Must be 128-byte aligned. * ​coords (`Tuple[UInt, UInt]`): The 2D coordinates in the destination tensor where data will be stored. ### `async_reduce` `async_reduce[reduction_kind: ReduceOp](self, src: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], coords: Tuple[UInt, UInt])` Schedules an asynchronous reduction operation from shared memory to global memory. This method initiates a hardware-accelerated asynchronous reduction operation that combines data from shared memory with data in global memory using the specified reduction operation. The reduction is performed element-wise at the specified coordinates in the global tensor. **Constraints:** The source tensor must be 128-byte aligned in shared memory. **Parameters:** * ​reduction\_kind (`ReduceOp`): The type of reduction operation to perform (e.g., ADD, MIN, MAX). This determines how values are combined during the reduction. **Args:** * ​src (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in shared memory containing the data to be reduced. Must be 128-byte aligned. * ​coords (`Tuple[UInt, UInt]`): The 2D coordinates in the destination tensor where the reduction will be applied. ### `commit_group` `commit_group(self)` Commits all prior initiated but uncommitted TMA instructions into a group. This function behaves the same as `cp_async_bulk_commit_group`, which creates a synchronization point for bulk TMA transfer. ### `wait_group` `wait_group[n: Int = 0](self)` Wait for the completion of asynchronous copy until a specified number of groups are waiting. This function behaves the same as `cp_async_bulk_wait_group`, which causes the executing thread to wait until at most the specified number of the most recent TMA copies are pending. **Parameters:** * ​n (`Int`): The number of pending groups left. ### `smem_tensormap_init` `smem_tensormap_init(self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3)])` Initializes a TMA descriptor in shared memory from this tensor tile's descriptor. 
This method copies the TMA descriptor from global memory to shared memory, allowing for faster access during kernel execution. The descriptor is copied in 16-byte chunks using asynchronous copy operations for efficiency. Note: * Only one thread should call this method to avoid race conditions * The descriptor is copied in 8 chunks of 16 bytes each (total 128 bytes) **Args:** * ​smem\_tma\_descriptor\_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3)]`): Pointer to the location in shared memory where the descriptor will be stored. Must be properly aligned. ### `replace_tensormap_global_address_in_gmem` `replace_tensormap_global_address_in_gmem[dtype: DType](self, src_ptr: UnsafePointer[SIMD[dtype, 1]])` Replaces the global memory address in the TMA descriptor stored in global memory. This method allows dynamically changing the source tensor for TMA operations without recreating the entire descriptor, which is useful for reusing descriptors with different data sources. The operation modifies the descriptor in global memory directly. Note: A memory fence may be required after this operation to ensure visibility of the changes to other threads. **Parameters:** * ​dtype (`DType`): The data type of the new source tensor. **Args:** * ​src\_ptr (`UnsafePointer[SIMD[dtype, 1]]`): The new source tensor whose address will replace the current one in the descriptor. Must have compatible layout with the original tensor. ### `tensormap_fence_acquire` `tensormap_fence_acquire(self)` Establishes a memory fence for TMA operations with acquire semantics. This method ensures proper ordering of memory operations by creating a barrier that prevents subsequent TMA operations from executing before prior operations have completed. It is particularly important when reading from a descriptor that might have been modified by other threads or processes. The acquire semantics ensure that all memory operations after this fence will observe any modifications made to the descriptor before the fence. Notes: * The entire warp must call this function as the instruction is warp-aligned. * Typically used in pairs with `tensormap_fence_release` for proper synchronization. ### `tensormap_fence_release` `tensormap_fence_release(self)` Establishes a memory fence for TMA operations with release semantics. This method ensures proper ordering of memory operations by creating a barrier that ensures all prior memory operations are visible before subsequent operations can proceed. It is particularly important when modifying a TMA descriptor in global memory that might be read by other threads or processes. The release semantics ensure that all memory operations before this fence will be visible to any thread that observes operations after the fence. Notes: * Typically used after modifying a tensormap descriptor in global memory. * Often paired with `tensormap_fence_acquire` for proper synchronization. ### `replace_tensormap_global_address_in_shared_mem` `replace_tensormap_global_address_in_shared_mem[dtype: DType](self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], src_ptr: UnsafePointer[SIMD[dtype, 1]])` Replaces the global memory address in the TMA descriptor stored in shared memory. This method allows dynamically changing the source tensor for TMA operations without recreating the entire descriptor, which is useful for reusing descriptors with different data sources. 
The operation modifies a descriptor that has been previously copied to shared memory. Notes: * Only one thread should call this method to avoid race conditions. * A memory fence may be required after this operation to ensure visibility of the changes to other threads. * Typically used with descriptors previously initialized with `smem_tensormap_init`. **Parameters:** * ​dtype (`DType`): The data type of the new source tensor. **Args:** * ​smem\_tma\_descriptor\_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the TMA descriptor in shared memory that will be modified. * ​src\_ptr (`UnsafePointer[SIMD[dtype, 1]]`): The new source tensor whose address will replace the current one in the descriptor. ### `tensormap_cp_fence_release` `tensormap_cp_fence_release(self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3)])` Establishes a memory fence for TMA operations with release semantics for shared memory descriptors. This method ensures proper ordering of memory operations by creating a barrier that ensures all prior memory operations are visible before subsequent operations can proceed. It is specifically designed for synchronizing between global memory and shared memory TMA descriptors. The release semantics ensure that all memory operations before this fence will be visible to any thread that observes operations after the fence. Notes: * The entire warp must call this function as the instruction is warp-aligned. * Typically used after modifying a tensormap descriptor in shared memory. * More specialized than the general `tensormap_fence_release` for cross-memory space synchronization. **Args:** * ​smem\_tma\_descriptor\_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3)]`): Pointer to the TMA descriptor in shared memory that is being synchronized with the global memory descriptor. ### `replace_tensormap_global_dim_strides_in_shared_mem` `replace_tensormap_global_dim_strides_in_shared_mem[dtype: DType, only_update_dim_0: Bool, /, *, rank: Int](self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], gmem_dims: IndexList[rank], gmem_strides: IndexList[rank])` Replaces dimensions and strides in a TMA descriptor stored in shared memory. Note: This function is only supported for CUDA versions >= 12.5. This function allows dynamically modifying the dimensions and strides of a TMA descriptor that has been previously initialized in shared memory. If only the first dimension (dim 0) is updated, then updating strides can be skipped. Notes: * Only one thread should call this method to avoid race conditions. * A memory fence may be required after this operation to ensure visibility of the changes to other threads. **Parameters:** * ​dtype (`DType`): The data type of the new source tensor. * ​only\_update\_dim\_0 (`Bool`): If true, only the first dimension (dim 0) is updated and stride updates are skipped. * ​rank (`Int`): The rank of the tensor. **Args:** * ​smem\_tma\_descriptor\_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the TMA descriptor in shared memory that will be modified. * ​gmem\_dims (`IndexList[rank]`): The global dimensions of the tensor to be updated. * ​gmem\_strides (`IndexList[rank]`): The global strides of the tensor to be updated. 
`replace_tensormap_global_dim_strides_in_shared_mem[dtype: DType, tensor_rank: Int, dim_idx: Int](self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], dim_value: SIMD[uint32, 1], dim_stride: Optional[SIMD[uint64, 1]] = Optional(None))` Replaces dimensions and strides in a TMA descriptor stored in shared memory. Note: This function is only supported for CUDA versions >= 12.5. This function allows dynamically modifying the dimensions and strides of a TMA descriptor that has been previously initialized in shared memory. If only the first dimension is updated, then updating strides can be skipped. Notes: * Only one thread should call this method to avoid race conditions. * A memory fence may be required after this operation to ensure visibility of the changes to other threads. **Parameters:** * ​dtype (`DType`): The data type of the source tensor in GMEM. * ​tensor\_rank (`Int`): The rank of the source tensor in GMEM. * ​dim\_idx (`Int`): The index of the dimension to be updated in the TMA descriptor with the provided dimension and stride values at runtime. **Args:** * ​smem\_tma\_descriptor\_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the TMA descriptor in shared memory that will be modified. * ​dim\_value (`SIMD[uint32, 1]`): The new dimension value to be set. * ​dim\_stride (`Optional[SIMD[uint64, 1]]`): The new stride value to be set. --- ## TMATensorTileArray `@register_passable(trivial)` `struct TMATensorTileArray[num_of_tensormaps: Int, dtype: DType, cta_tile_layout: Layout, desc_layout: Layout]` An array of TMA descriptors. ## Parameters * ​num\_of\_tensormaps (`Int`): The number of TMA descriptors (tensor maps). * ​dtype (`DType`): The data type of the tensor elements. * ​cta\_tile\_layout (`Layout`): The layout of the tile in shared memory, typically specified as row\_major. * ​desc\_layout (`Layout`): The layout of the descriptor, which can be different from the shared memory layout to accommodate hardware requirements like WGMMA. ## Fields * ​tensormaps\_ptr (`UnsafePointer[SIMD[uint8, 1]]`): A static tuple of pointers to TMA descriptors. This field stores an array of pointers to `TMATensorTile` instances, where each pointer references a TMA descriptor in device memory. The array has a fixed size determined by the num\_of\_tensormaps parameter. The TMA descriptors are used by the GPU hardware to efficiently transfer data between global and shared memory with specific memory access patterns defined by the layouts. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `descriptor_bytes` `alias descriptor_bytes = 128` Size of the TMA descriptor in bytes. This is a constant value that represents the size of the TMA descriptor in bytes. It is used to calculate the offset of the TMA descriptor in the device memory. ## Methods ### `__init__` `__init__(out self, tensormaps_device: DeviceBuffer[uint8])` Initializes a new TMATensorTileArray. **Args:** * ​tensormaps\_device (`DeviceBuffer[uint8]`): Device buffer to store TMA descriptors. ### `__getitem__` `__getitem__(self, index: Int) -> UnsafePointer[TMATensorTile[dtype, cta_tile_layout, desc_layout]]` Retrieve a TMA descriptor. **Args:** * ​index (`Int`): Index of the TMA descriptor. **Returns:** `UnsafePointer` to the `TMATensorTile` at the specified index. 
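Because `descriptor_bytes` is 128, the backing device buffer must hold `num_of_tensormaps * 128` bytes. The following is a minimal host-side sketch of constructing and indexing such an array; the import paths and the `enqueue_create_buffer` call reflect common `DeviceContext` usage and are assumptions, not guarantees of this reference:

```mojo
from gpu.host import DeviceContext
from layout import Layout
from layout.tma_async import TMATensorTileArray  # assumed module path

def main():
    alias n = 4  # number of tensormaps
    alias tile_layout = Layout.row_major(64, 64)

    var ctx = DeviceContext()
    # One 128-byte descriptor slot per tensormap (descriptor_bytes == 128).
    var descriptors = ctx.enqueue_create_buffer[DType.uint8](n * 128)
    var tiles = TMATensorTileArray[
        n, DType.bfloat16, tile_layout, tile_layout
    ](descriptors)
    # Indexing yields an UnsafePointer to the TMATensorTile at that slot.
    var first_tile = tiles[0]
```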
--- ## create_tma_tile `create_tma_tile[*tile_sizes: Int, *, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0))](ctx: DeviceContext, tensor: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> TMATensorTile[dtype, row_major[::Origin[::Bool(_to_int_tuple[*::Int]())]` Creates a `TMATensorTile` with specified tile dimensions and swizzle mode. This function creates a hardware-accelerated Tensor Memory Access (TMA) descriptor for efficient asynchronous data transfers between global memory and shared memory. It configures the tile dimensions and memory access patterns based on the provided parameters. **Constraints:** * The last dimension's size in bytes must not exceed the swizzle mode's byte limit (32B for SWIZZLE\_32B, 64B for SWIZZLE\_64B, 128B for SWIZZLE\_128B). * Only supports 2D tensors in this overload. **Parameters:** * ​\*tile\_sizes (`Int`): The dimensions of the tile to be transferred. For 2D tensors, this should be \[height, width]. The dimensions determine the shape of data transferred in each TMA operation. * ​swizzle\_mode (`TensorMapSwizzle`): The swizzling mode to use for memory access optimization. Swizzling can improve memory access patterns for specific hardware configurations. **Args:** * ​ctx (`DeviceContext`): The CUDA device context used to create the TMA descriptor. * ​tensor (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor from which data will be transferred. This defines the global memory layout and data type. **Returns:** A `TMATensorTile` configured with the specified tile dimensions and swizzle mode, ready for use in asynchronous data transfer operations. `create_tma_tile[type: DType, rank: Int, tile_shape: IndexList[rank], /, is_k_major: Bool = True, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), *, __tile_layout: Layout = row_major(tile_shape.__getitem__[::Indexer](0), tile_shape.__getitem__[::Indexer](1)), __desc_layout: Layout = _tma_desc_tile_layout[::DType,::Int,::IndexList[$1, ::DType()](ctx: DeviceContext, tensor: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> TMATensorTile[type, __tile_layout, __desc_layout]` Creates a `TMATensorTile` with advanced configuration options for 2D or 3D tensors. This overload provides more control over the TMA descriptor creation, allowing specification of data type, rank, and layout orientation. It supports both 2D and 3D tensors and provides fine-grained control over the memory access patterns. **Constraints:** * Only supports 2D and 3D tensors (rank must be 2 or 3). * For non-SWIZZLE\_NONE modes, the K dimension size in bytes must be a multiple of the swizzle mode's byte size. * For MN-major layout, only SWIZZLE\_128B is supported. * For 3D tensors, only K-major layout is supported. **Parameters:** * ​type (`DType`): The data type of the tensor elements. * ​rank (`Int`): The dimensionality of the tensor (must be 2 or 3). * ​tile\_shape (`IndexList[rank]`): The shape of the tile to be transferred. 
* ​is\_k\_major (`Bool`): Whether the tensor layout is K-major (True) or MN-major (False). K-major is typically used for weight matrices, while MN-major is used for activation matrices in matrix multiplication operations. Defaults to True. * ​swizzle\_mode (`TensorMapSwizzle`): The swizzling mode to use for memory access optimization. Defaults to `TensorMapSwizzle.SWIZZLE_NONE`. * ​\_\_tile\_layout (`Layout`): Internal parameter for the tile layout in shared memory. Defaults to `Layout.row_major(tile_shape[0], tile_shape[1])`. * ​\_\_desc\_layout (`Layout`): Internal parameter for the descriptor layout, which may differ from the tile layout to accommodate hardware requirements. Defaults to `_tma_desc_tile_layout[...]`. **Args:** * ​ctx (`DeviceContext`): The CUDA device context used to create the TMA descriptor. * ​tensor (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor from which data will be transferred. This defines the global memory layout and must match the specified data type. **Returns:** A `TMATensorTile` configured with the specified parameters, ready for use in asynchronous data transfer operations. --- ## tma_async Tensor Memory Accelerator (TMA) Asynchronous Operations Module Provides high-performance abstractions for NVIDIA's Tensor Memory Accelerator (TMA), enabling efficient asynchronous data movement between global and shared memory in GPU kernels. It is designed for use with NVIDIA Hopper architecture and newer GPUs that support TMA instructions. ## Key Components: * `TMATensorTile`: Core struct that encapsulates a TMA descriptor for efficient data transfers between global and shared memory with various access patterns and optimizations. * `SharedMemBarrier`: Synchronization primitive for coordinating asynchronous TMA operations, ensuring data transfers complete before dependent operations begin. * `PipelineState`: Helper struct for managing multi-stage pipeline execution with circular buffer semantics, enabling efficient double or triple buffering techniques. * `create_tma_tile`: Factory functions for creating optimized `TMATensorTile` instances with various configurations for different tensor shapes and memory access patterns. ## Structs * [​`PipelineState`](./PipelineState): Manages state for a multi-stage pipeline with circular buffer semantics. * [​`SharedMemBarrier`](./SharedMemBarrier): A hardware-accelerated synchronization primitive for GPU shared memory operations. * [​`TMATensorTile`](./TMATensorTile): A hardware-accelerated tensor memory access (TMA) tile for efficient asynchronous data movement. * [​`TMATensorTileArray`](./TMATensorTileArray): An array of TMA descriptors. ## Functions * [​`create_tma_tile`](./create_tma_tile): Creates a `TMATensorTile` with specified tile dimensions and swizzle mode. 
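In practice these pieces are combined in a load pipeline: one thread arms the barrier with the expected byte count and issues the copy, and every thread then waits on the barrier before reading the tile. The following is a minimal single-stage sketch; the import paths (`layout.tma_async`, `gpu`, `memory`) and the `stack_allocation` idiom are assumptions based on common kernel code, not a verbatim example from this reference:

```mojo
from gpu import barrier, thread_idx
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.tma_async import PipelineState, SharedMemBarrier, TMATensorTile
from memory import stack_allocation
from sys import sizeof

fn tma_load_kernel[
    dtype: DType, tile_layout: Layout, desc_layout: Layout
](tma_tile: TMATensorTile[dtype, tile_layout, desc_layout]):
    alias M = tile_layout.shape[0].value()
    alias N = tile_layout.shape[1].value()

    # Destination tile in shared memory; TMA requires 128-byte alignment.
    var smem_tile = LayoutTensor[
        dtype,
        tile_layout,
        MutableAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # Barrier in shared memory that tracks the transfer.
    var mbar = stack_allocation[
        1, SharedMemBarrier, address_space = AddressSpace.SHARED, alignment=8
    ]()

    if thread_idx.x == 0:
        # One thread arms the barrier with the expected byte count,
        # then issues the asynchronous copy tracked by that barrier.
        mbar[0].init()
        mbar[0].expect_bytes(M * N * sizeof[dtype]())
        tma_tile.async_copy(smem_tile, mbar[0], (UInt(0), UInt(0)))

    # All threads block until the expected bytes have arrived (phase 0).
    mbar[0].wait()
    barrier()

    # For multi-stage pipelines, a PipelineState supplies the buffer index
    # and phase instead of the constants above:
    #   var state = PipelineState[2]()
    #   mbar[state.index()].wait(state.phase())
    #   state.step()
```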
--- ## accumulate --- ## apple_batched_matmul `apple_batched_matmul[*, transpose_b: Bool = False, elementwise_epilogue_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive])` --- ## apple_gemv `apple_gemv[*, b_packed: Bool, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape])` --- ## apple_matmul `apple_matmul[*, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](cblas_gemm_fn: fn(_CBLASOrder, _CBLASTranspose, _CBLASTranspose, SIMD[int32, 1], SIMD[int32, 1], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1]) -> None, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive])` `apple_matmul[*, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive])` --- ## get_cblas_f32_function `get_cblas_f32_function() -> fn(_CBLASOrder, _CBLASTranspose, _CBLASTranspose, SIMD[int32, 1], SIMD[int32, 1], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1]) -> None` --- ## apple_accelerate ## Aliases ### `APPLE_ACCELERATE` `alias APPLE_ACCELERATE = _Global[__init__[__mlir_type.!kgen.string]("APPLE_ACCELERATE"), _OwnedDLHandle, _init_dylib]` ### `cblas_gemm_type` `alias cblas_gemm_type = fn(_CBLASOrder, _CBLASTranspose, _CBLASTranspose, SIMD[int32, 1], SIMD[int32, 1], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1]) -> None` ### `LIB_ACC_PATH` `alias LIB_ACC_PATH = "/System/Library/Frameworks/Accelerate.framework/Accelerate"` ## 
Functions * [​`apple_batched_matmul`](./apple_batched_matmul): * [​`apple_gemv`](./apple_gemv): * [​`apple_matmul`](./apple_matmul): * [​`get_cblas_f32_function`](./get_cblas_f32_function): * [​`use_apple_accelerate_lib`](./use_apple_accelerate_lib): --- ## use_apple_accelerate_lib `use_apple_accelerate_lib[c_type: DType, a_type: DType, b_type: DType]() -> Bool` --- ## dot_at_b `dot_at_b(c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive])` --- ## dot_at_b_impl `dot_at_b_impl(c: NDBuffer[float32, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(16, 16)))], a: NDBuffer[float32, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(16, 16)))], b: NDBuffer[float32, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(16, 16)))])` `dot_at_b_impl(c: NDBuffer[float16, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(32, 32)))], a: NDBuffer[float16, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(32, 32)))], b: NDBuffer[float16, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(32, 32)))])` --- ## extrx `extrx(gpr: Int)` Extracts a row or moves it to x, result in amx0. --- ## extry `extry(gpr: Int)` Extracts a row or moves it to y, result in amx0. --- ## fma `fma[mode: StringSlice[StaticConstantOrigin], type: DType](z_row_index: Int, x_row_index: Int, y_row_index: Int, clear_z: Bool)` --- ## fma16 `fma16(gpr: Int)` Float16 matrix multiply and add. --- ## fma32 `fma32(gpr: Int)` Float32 matrix multiply and add. --- ## fma64 `fma64(gpr: Int)` Float64 matrix multiply and add. --- ## fms16 `fms16(gpr: Int)` Float16 matrix multiply and subtract. --- ## fsm32 `fsm32(gpr: Int)` Float32 matrix multiply and subtract. --- ## fsm64 `fsm64(gpr: Int)` Float64 matrix multiply and subtract. --- ## genlut `genlut(gpr: Int)` --- ## apple_amx_intrinsics ## Functions * [​`dot_at_b`](./dot_at_b): * [​`dot_at_b_impl`](./dot_at_b_impl): * [​`extrx`](./extrx): Extracts a row or moves it to x, result in amx0. * [​`extry`](./extry): Extracts a row or moves it to y, result in amx0. * [​`fma`](./fma): * [​`fma16`](./fma16): Float16 matrix multiply and add. * [​`fma32`](./fma32): Float32 matrix multiply and add. * [​`fma64`](./fma64): Float64 matrix multiply and add. * [​`fms16`](./fms16): Float16 matrix multiply and subtract. * [​`fsm32`](./fsm32): Float32 matrix multiply and subtract. * [​`fsm64`](./fsm64): Float64 matrix multiply and subtract. * [​`genlut`](./genlut): * [​`ldx`](./ldx): * [​`ldy`](./ldy): * [​`ldz`](./ldz): * [​`ldzi`](./ldzi): * [​`load_z`](./load_z): * [​`mac16`](./mac16): SI16 matrix multiply and add. * [​`matfp`](./matfp): Float16 matrix multiply. * [​`max_int__`](./max_int__): UI16 matrix multiply. * [​`read_x`](./read_x): * [​`read_y`](./read_y): * [​`store_x`](./store_x): * [​`store_y`](./store_y): * [​`store_z`](./store_z): * [​`stx`](./stx): * [​`sty`](./sty): * [​`stz`](./stz): * [​`stzi`](./stzi): * [​`transpose_z_to_x_or_y`](./transpose_z_to_x_or_y): * [​`vec_int__`](./vec_int__): Horizontal ui16 multiply `z0[i] += x0[i] * y0[i]`. * [​`vecfp`](./vecfp): Horizontal float16 multiply `z0[i] += x0[i] * y0[i]`. 
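As a plain-Mojo cross-check of the horizontal forms above, this scalar reference spells out the `z0[i] += x0[i] * y0[i]` semantics that the `vec_int__`/`vecfp` docstrings describe. It is illustrative only; the real intrinsics operate on AMX register state, not on SIMD values:

```mojo
# Reference for the horizontal multiply-accumulate described by the
# `vec_int__` / `vecfp` docstrings: z0[i] += x0[i] * y0[i], applied lanewise.
fn vec_mla_reference[
    dtype: DType, width: Int
](z: SIMD[dtype, width], x: SIMD[dtype, width], y: SIMD[dtype, width]) -> SIMD[
    dtype, width
]:
    return z + x * y
```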
--- ## ldx `ldx(gpr: Int)` --- ## ldy `ldy(gpr: Int)` --- ## ldz `ldz(gpr: Int)` --- ## ldzi `ldzi(gpr: Int)` --- ## load_z `load_z[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)` --- ## mac16 `mac16(gpr: Int)` SI16 matrix multiply and add. --- ## matfp `matfp(gpr: Int)` Float16 matrix multiply. --- ## max_int__ `max_int__(gpr: Int)` UI16 matrix multiply. --- ## read_x `read_x[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)` --- ## read_y `read_y[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)` --- ## store_x `store_x[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)` --- ## store_y `store_y[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)` --- ## store_z `store_z[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)` --- ## stx `stx(gpr: Int)` --- ## sty `sty(gpr: Int)` --- ## stz `stz(gpr: Int)` --- ## stzi `stzi(gpr: Int)` --- ## transpose_z_to_x_or_y `transpose_z_to_x_or_y[destination: StringSlice[StaticConstantOrigin], type: DType](z_col_index: Int, xy_row_index: Int, z_row_suboffset: Int)` --- ## vec_int__ `vec_int__(gpr: Int)` Horizontal ui16 multiply `z0[i] += x0[i] * y0[i]`. --- ## vecfp `vecfp(gpr: Int)` Horizontal float16 multiply `z0[i] += x0[i] * y0[i]`. --- ## batched_matmul `batched_matmul[rank: Int, a_type: DType, b_type: DType, c_type: DType, //, *, transpose_a: Bool, transpose_b: Bool, elementwise_epilogue_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), saturated_vnni: Bool = False, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](c_buf: NDBuffer[c_type, rank, origin], a_buf: NDBuffer[a_type, rank, origin], b_buf: NDBuffer[b_type, rank, origin], *, context: DeviceContextPtr = DeviceContextPtr())` `batched_matmul[rank: Int, a_type: DType, b_type: DType, c_type: DType, //, *, transpose_b: Bool, elementwise_epilogue_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), saturated_vnni: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](c_buf: NDBuffer[c_type, rank, origin], a_buf: NDBuffer[a_type, rank, origin], b_buf: NDBuffer[b_type, rank, origin], *, context: DeviceContextPtr = DeviceContextPtr())` --- ## batched_matmul_kernel `batched_matmul_kernel[rank: Int, c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), accum_type: DType = get_accum_type[::DType,::DType]()](c_buff: NDBuffer[c_type, 3, MutableAnyOrigin, c_shape], a_buff: NDBuffer[a_type, 3, MutableAnyOrigin, a_shape], b_buff: NDBuffer[b_type, 3, MutableAnyOrigin, b_shape], c_buff_nd_shape: IndexList[rank])` --- ## batched_matmul_shape `batched_matmul_shape[rank: Int, a_type: DType, b_type: DType, single_thread_blocking_override: Bool](a_buff: NDBuffer[a_type, rank, origin], b_buff: NDBuffer[b_type, rank, origin]) -> IndexList[rank]` Compute the output shape of a 
`batch_matmul` operation, and assert the inputs are compatible. **Parameters:** * ​rank (`Int`): Rank of the input and output tensors. * ​a\_type (`DType`): Type of the lhs input tensor. * ​b\_type (`DType`): Type of the rhs input tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​a\_buff (`NDBuffer[a_type, rank, origin]`): The lhs input tensor. * ​b\_buff (`NDBuffer[b_type, rank, origin]`): The rhs input tensor. **Returns:** The output shape. --- ## bmm ## Aliases ### `elementwise_epilogue_type` `alias elementwise_epilogue_type = fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None` ## Functions * [​`batched_matmul`](./batched_matmul): * [​`batched_matmul_kernel`](./batched_matmul_kernel): * [​`batched_matmul_shape`](./batched_matmul_shape): Compute the output shape of a `batch_matmul` operation, and assert the inputs are compatible. --- ## create_matmul_configs_ampere `create_matmul_configs_ampere[key: String, a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool]() -> MatmulConfig[a_type, b_type, c_type, transpose_b]` --- ## get_dispatch_table `get_dispatch_table[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool]() -> Dict[String, MatmulConfig[a_type, b_type, c_type, transpose_b]]` --- ## dispatch_table_a100_gpu ## Functions * [​`create_matmul_configs_ampere`](./create_matmul_configs_ampere): * [​`get_dispatch_table`](./get_dispatch_table): --- ## distributed_matmul ## Functions * [​`matmul_allreduce`](./matmul_allreduce): Performs C = matmul(A, B^T) followed by Out = allreduce(C) across multiple GPUs. Split the A or B and C matrices into `num_partitions` submatrices at dimension `partition_dim`. This way we can perform `num_partitions` independent matmul + allreduce kernels, and overlap some of the computation. --- ## matmul_allreduce `matmul_allreduce[ngpus: Int, partition_dim: Int, num_partitions: Int, outputs_lambda: fn[Int, DType, Int, Int, Int](IndexList[$2], SIMD[$1, $3]) capturing -> None, type: DType, a_static_shape: DimList, b_static_shape: DimList, c_static_shape: DimList, out_static_shape: DimList, overlap_with_dpl: Bool = True](a_buffers: InlineArray[NDBuffer[type, 2, MutableAnyOrigin, a_static_shape], ngpus], b_buffers: InlineArray[NDBuffer[type, 2, MutableAnyOrigin, b_static_shape], ngpus], c_temp_buffers: InlineArray[NDBuffer[type, 2, MutableAnyOrigin, c_static_shape], ngpus], output_buffers: InlineArray[NDBuffer[type, 2, MutableAnyOrigin, out_static_shape], ngpus], rank_sigs: InlineArray[UnsafePointer[Signal], 8], ctxs: List[DeviceContext])` Performs C = matmul(A, B^T) followed by Out = allreduce(C) across multiple GPUs. Split the A or B and C matrices into `num_partitions` submatrices at dimension `partition_dim`. This way we can perform `num_partitions` independent matmul + allreduce kernels, and overlap some of the computation. 
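To make the split concrete, the hypothetical helper below computes the half-open range that partition `p` covers along `partition_dim`. It illustrates the indexing only and is not taken from the kernel:

```mojo
# Hypothetical helper: the half-open range [start, end) that partition
# `p` of `num_partitions` covers along a dimension of size `dim_size`.
# The final partition absorbs any remainder from an uneven split.
fn partition_range(dim_size: Int, num_partitions: Int, p: Int) -> Tuple[Int, Int]:
    var chunk = dim_size // num_partitions
    var start = p * chunk
    var end = dim_size if p == num_partitions - 1 else start + chunk
    return (start, end)

def main():
    # Splitting C's rows (M = 1000) into 4 matmul + allreduce slices.
    for p in range(4):
        var bounds = partition_range(1000, 4, p)
        print("partition", p, "covers rows", bounds[0], "to", bounds[1])
```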
--- ## config_in_smem `config_in_smem[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool, //, max_smem: Int](config: MatmulConfig[a_type, b_type, c_type, transpose_b]) -> MatmulConfig[a_type, b_type, c_type, transpose_b]` --- ## dual_gemm `dual_gemm[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool, binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1] = swilu[::DType,::Int], config: OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]] = OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]]({:i1 0, 1}), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b0: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], b1: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], ctx: DeviceContext)` --- ## dual_gemv `dual_gemv[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1] = swilu[::DType,::Int], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b0: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], b1: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], ctx: DeviceContext)` --- ## dual_gemv_kernel `dual_gemv_kernel[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, simd_width: UInt, tile_m: UInt, tile_n: UInt, num_threads: UInt, binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b0: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], b1: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape])` --- ## dual_gemm ## Aliases ### `binary_fn_type` `alias binary_fn_type = fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1]` ## Functions * [​`config_in_smem`](./config_in_smem): * [​`dual_gemm`](./dual_gemm): * [​`dual_gemv`](./dual_gemv): * [​`dual_gemv_kernel`](./dual_gemv_kernel): * [​`multistage_dual_gemm`](./multistage_dual_gemm): * [​`multistage_dual_gemm_kernel`](./multistage_dual_gemm_kernel): * [​`multistage_dual_mma`](./multistage_dual_mma): * [​`swilu`](./swilu): * [​`swishGLU`](./swishGLU): Reference: GLU Variants Improve Transformer by Noam Shazeer. The implementation follows CUTLASS, using one kernel invocation and writing to the destination once. 
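For intuition about the `binary_lambda_fn` slot: assuming `swilu(x, y)` computes `silu(x) * y` as in the SwiGLU formulation (an assumption, since this reference leaves `swilu` undocumented), a scalar sketch looks like:

```mojo
from math import exp

# Assumed semantics of `swilu` per the SwiGLU formulation:
# swilu(x, y) = silu(x) * y, where silu(x) = x * sigmoid(x).
# Sketch only; not taken from the kernel source.
fn swilu_reference[
    dtype: DType, width: Int
](x: SIMD[dtype, width], y: SIMD[dtype, width]) -> SIMD[dtype, width]:
    var sigmoid = 1 / (1 + exp(-x))
    return x * sigmoid * y
```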
--- ## multistage_dual_gemm `multistage_dual_gemm[c_type: DType, c_layout: Layout, a_type: DType, a_layout: Layout, b_type: DType, b_layout: Layout, //, transpose_b: Bool, config: MatmulConfig[a_type, b_type, c_type, transpose_b], binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1] = swilu[::DType,::Int], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: LayoutTensor[c_type, c_layout, origin], a: LayoutTensor[a_type, a_layout, origin], b0: LayoutTensor[b_type, b_layout, origin], b1: LayoutTensor[b_type, b_layout, origin], ctx: DeviceContext)` `multistage_dual_gemm[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool, config: MatmulConfig[a_type, b_type, c_type, transpose_b], binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1] = swilu[::DType,::Int], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), num_k_partitions: Int = 1](c: NDBuffer[c_type, 2, origin, c_shape], a: NDBuffer[a_type, 2, origin, a_shape], b0: NDBuffer[b_type, 2, origin, b_shape], b1: NDBuffer[b_type, 2, origin, b_shape], ctx: DeviceContext)` --- ## multistage_dual_gemm_kernel `multistage_dual_gemm_kernel[c_type: DType, c_layout: Layout, a_type: DType, a_layout: Layout, b_type: DType, b_layout: Layout, transpose_b: Bool, config: MatmulConfig[a_type, b_type, c_type, transpose_b], binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: LayoutTensor[c_type, c_layout, MutableAnyOrigin], a: LayoutTensor[a_type, a_layout, MutableAnyOrigin], b0: LayoutTensor[b_type, b_layout, MutableAnyOrigin], b1: LayoutTensor[b_type, b_layout, MutableAnyOrigin])` --- ## multistage_dual_mma `multistage_dual_mma[c_type: DType, c_layout: Layout, a_type: DType, a_layout: Layout, a_smem_layout: Layout, b_type: DType, b_layout: Layout, b_smem_layout: Layout, //, BM: Int, BN: Int, BK: Int, WM: Int, WN: Int, num_threads: Int, num_pipeline_stages: Int, transpose_b: Bool, /, *, swizzle_a: Bool = True, static_num_iters: Dim = Dim(-31337), k_group_size: UInt = UInt(1)](c0: LayoutTensor[c_type, c_layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c1: LayoutTensor[c_type, c_layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], a_iter_arg: LayoutTensorIter[type, a_layout, MutableAnyOrigin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], b0_iter_arg: LayoutTensorIter[b_type, b_layout, MutableAnyOrigin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], b1_iter_arg: LayoutTensorIter[b_type, b_layout, MutableAnyOrigin, 
address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], a_smem_iter_arg: LayoutTensorIter[a_type, a_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], mut b0_smem_iter: LayoutTensorIter[b_type, b_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], mut b1_smem_iter: LayoutTensorIter[b_type, b_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], num_iters: Int, /, *, num_b_rows: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` --- ## swilu `swilu[type: DType, width: Int](x: SIMD[type, width], y: SIMD[type, width]) -> SIMD[type, width]` --- ## swishGLU `swishGLU[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b0: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], b1: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], ctx: DeviceContextPtr)` Reference: GLU Variants Improve Transformer by Noam Shazeer. The implementation follows CUTLASS, using one kernel invocation and writing to the destination once. --- ## FastDiv `@register_passable(trivial)` `struct FastDiv[type: DType]` Implements fast division for a given type. This struct provides optimized division by a constant divisor, replacing the division operation with a series of shifts and multiplications. This approach significantly improves performance, especially in scenarios where division is a frequent operation. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `uint_type` `alias uint_type = _uint_type_of_width[::Int]()` ## Methods ### `__init__` `@implicit` `__init__(divisor: Int = 1) -> Self` Initializes FastDiv with the divisor. **Constraints:** Fails with a constraint error if the bitwidth of the type is > 32. **Args:** * ​divisor (`Int`): The divisor to use for fast division. Defaults to 1. ### `__rtruediv__` `__rtruediv__(self, other: SIMD[_uint_type_of_width[::Int](), 1]) -> SIMD[_uint_type_of_width[::Int](), 1]` Divides the other scalar by the divisor (true division). Uses the fast division algorithm. **Args:** * ​other (`SIMD[_uint_type_of_width[::Int](), 1]`): The dividend. **Returns:** The result of the division. ### `__rmod__` `__rmod__(self, other: SIMD[_uint_type_of_width[::Int](), 1]) -> SIMD[_uint_type_of_width[::Int](), 1]` Computes the remainder of division. **Args:** * ​other (`SIMD[_uint_type_of_width[::Int](), 1]`): The dividend. **Returns:** The remainder. ### `__rdiv__` `__rdiv__(self, other: SIMD[_uint_type_of_width[::Int](), 1]) -> SIMD[_uint_type_of_width[::Int](), 1]` Divides the other scalar by the divisor. **Args:** * ​other (`SIMD[_uint_type_of_width[::Int](), 1]`): The dividend. **Returns:** The result of the division. 
### `__divmod__` `__divmod__(self, other: SIMD[_uint_type_of_width[::Int](), 1]) -> Tuple[SIMD[_uint_type_of_width[::Int](), 1], SIMD[_uint_type_of_width[::Int](), 1]]` Computes both quotient and remainder. **Args:** * ​other (`SIMD[_uint_type_of_width[::Int](), 1]`): The dividend. **Returns:** A tuple containing the quotient and remainder. --- ## fast_div Implements the fast division algorithm. It replaces division by a constant with a sequence of shifts and multiplications, significantly improving division performance. ## Structs * [​`FastDiv`](./FastDiv): Implements fast division for a given type. --- ## block_reduce `block_reduce[type: DType, //, warps_per_block: Int](val: SIMD[type, 1]) -> SIMD[type, 1]` --- ## fp8_quantization ## Functions * [​`block_reduce`](./block_reduce): * [​`matmul_dynamic_scaled_fp8`](./matmul_dynamic_scaled_fp8): * [​`quantize_dynamic_scaled_fp8`](./quantize_dynamic_scaled_fp8): * [​`quantize_fp8_kernel`](./quantize_fp8_kernel): * [​`quantize_static_scaled_fp8`](./quantize_static_scaled_fp8): --- ## matmul_dynamic_scaled_fp8 `matmul_dynamic_scaled_fp8[c_type: DType, a_type: DType, b_type: DType, a_scales_type: DType, b_scales_type: DType, //, transpose_b: Bool = False, config: OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]] = OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]]({:i1 0, 1}), target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](c: NDBuffer[c_type, 2, origin, shape], a: NDBuffer[a_type, 2, origin, shape], b: NDBuffer[b_type, 2, origin, shape], a_scales: NDBuffer[a_scales_type, 2, origin, shape], b_scales: NDBuffer[b_scales_type, 2, origin, shape], ctx: DeviceContext)` --- ## quantize_dynamic_scaled_fp8 `quantize_dynamic_scaled_fp8[out_dtype: DType, in_dtype: DType, scales_dtype: DType, //, group_size_or_per_token: Int](scaled_output: NDBuffer[out_dtype, 2, origin, shape, strides], scales: NDBuffer[scales_dtype, 2, origin, shape, strides], input: NDBuffer[in_dtype, 2, origin, shape, strides], scale_ub: SIMD[float32, 1], ctx: DeviceContext)` --- ## quantize_fp8_kernel `quantize_fp8_kernel[out_type: DType, scales_type: DType, in_type: DType, warps_per_block: Int, group_size: Int](output: NDBuffer[out_type, 2, MutableAnyOrigin], scales: NDBuffer[scales_type, 2, MutableAnyOrigin], input: NDBuffer[in_type, 2, MutableAnyOrigin], scale_ub: SIMD[scales_type, 1])` --- ## quantize_static_scaled_fp8 `quantize_static_scaled_fp8[out_dtype: DType, in_dtype: DType, is_scale_inverted: Bool = True](out_buffer: NDBuffer[out_dtype, 2, origin, shape, strides], in_buffer: NDBuffer[in_dtype, 2, origin, shape, strides], scale: SIMD[float32, 1], context: DeviceContext)` --- ## GEMVAlgorithm `struct GEMVAlgorithm` ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `GEMV_KERNEL` `alias GEMV_KERNEL = GEMVAlgorithm(0)` ### `GEMV_KERNEL_VECTOR` `alias GEMV_KERNEL_VECTOR = GEMVAlgorithm(1)` ### `GEMV_SPLIT_K` `alias GEMV_SPLIT_K = GEMVAlgorithm(2)` ### `GEVM_KERNEL` `alias GEVM_KERNEL = GEMVAlgorithm(4)` ### `GEVM_KERNEL_VECTOR` `alias GEVM_KERNEL_VECTOR = GEMVAlgorithm(3)` ### `MATMUL_NAIVE` `alias MATMUL_NAIVE = GEMVAlgorithm(5)` ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` ### `__is__` `__is__(self, other: Self) -> Bool` ### `__isnot__` `__isnot__(self, other: Self) -> Bool` --- ## gemv `gemv[parallelize: Bool, c_size: Dim, c_type: DType, a_shape: DimList, a_type: DType, b_size: Dim, b_type: DType, 
elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c_buf: NDBuffer[c_type, 1, origin, __init__[::Intable](c_size)], a_buf: NDBuffer[a_type, 2, origin, a_shape], b_buf: NDBuffer[b_type, 1, origin, __init__[::Intable](b_size)])` --- ## gemv_gpu `gemv_gpu[transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], ctx: DeviceContext)` --- ## gemv_gpu_dispatch `gemv_gpu_dispatch[transpose_b: Bool = False, reduction_method: ReductionMethod = ReductionMethod(1), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](kernel_func: GEMVAlgorithm, c: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], ctx: DeviceContext)` --- ## gemv_kernel `gemv_kernel[c_type: DType, a_type: DType, b_type: DType, *, reduction_method: ReductionMethod, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c: UnsafePointer[SIMD[c_type, 1]], a: UnsafePointer[SIMD[a_type, 1]], b: UnsafePointer[SIMD[b_type, 1]], m: Int, n: Int, k: Int)` --- ## gemv_kernel_vector `gemv_kernel_vector[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, *, reduction_method: ReductionMethod, simd_width: UInt, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], m: UInt, n: UInt, k: UInt)` --- ## gemv_split_k `gemv_split_k[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, simd_width: UInt, tile_m: UInt, tile_n: UInt, num_threads: UInt, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](output: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], act: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], weight: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], m: UInt, n: UInt, k: UInt)` GEMV with tiling in K 
dimension. Assuming the B (weight) matrix is transposed, i.e. row major N x K, this kernel implements a vector (1 x K) times a matrix (N x K). The implementation can handle M > 1, but it is only optimal for tiny M; we use it for M = 1 only. --- ## gevm_kernel `gevm_kernel[c_type: DType, a_type: DType, b_type: DType, *, tile_size: Int, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c: UnsafePointer[SIMD[c_type, 1]], a: UnsafePointer[SIMD[a_type, 1]], b: UnsafePointer[SIMD[b_type, 1]], m: Int, n: Int, k: Int)` --- ## gevm_tc_kernel_vector_8x `gevm_tc_kernel_vector_8x[c_type: DType, a_type: DType, b_type: DType, tile_size: Int, simd_width: Int, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c: NDBuffer[c_type, 2, MutableAnyOrigin], a: NDBuffer[a_type, 2, MutableAnyOrigin], b: NDBuffer[b_type, 2, MutableAnyOrigin], m: UInt, n: UInt, k: UInt)` --- ## gemv ## Structs * [​`GEMVAlgorithm`](./GEMVAlgorithm): ## Functions * [​`gemv`](./gemv): * [​`gemv_gpu`](./gemv_gpu): * [​`gemv_gpu_dispatch`](./gemv_gpu_dispatch): * [​`gemv_kernel`](./gemv_kernel): * [​`gemv_kernel_vector`](./gemv_kernel_vector): * [​`gemv_split_k`](./gemv_split_k): GEMV with tiling in K dimension. Assuming the B (weight) matrix is transposed, i.e. row major N x K, this kernel implements a vector (1 x K) times a matrix (N x K). * [​`gevm_kernel`](./gevm_kernel): * [​`gevm_tc_kernel_vector_8x`](./gevm_tc_kernel_vector_8x): * [​`naive_gemv`](./naive_gemv): * [​`reverse_idx`](./reverse_idx): --- ## naive_gemv `naive_gemv[c_size: Dim, a_shape: DimList, b_size: Dim, type: DType](c_buf: NDBuffer[type, 1, origin, __init__[::Intable](c_size)], a_buf: NDBuffer[type, 2, origin, a_shape], b_buf: NDBuffer[type, 1, origin, __init__[::Intable](b_size)])` --- ## reverse_idx `reverse_idx[transpose: Bool](x: Int, y: Int) -> IndexList[2]` --- ## default_config_sm90 `default_config_sm90[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool, wgmma_shape: IndexList[3]]() -> MatmulConfig[a_type, b_type, c_type, transpose_b, wgmma_shape]` --- ## grouped_matmul `grouped_matmul[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b: NDBuffer[b_type, 3, MutableAnyOrigin, b_shape], a_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], expert_ids: NDBuffer[uint32, 1, MutableAnyOrigin], max_num_tokens_per_expert: Int, num_active_experts: Int, ctx: DeviceContext)` --- ## grouped_matmul_kernel `grouped_matmul_kernel[a_type: DType, b_type: DType, c_type: DType, a_layout: Layout, b_layout: Layout, a_tile_layout: Layout, b_tile_layout: Layout, c_layout: Layout, block_tile_shape: IndexList[3], wgmma_shape: IndexList[3], a_desc_layout: Layout, b_desc_layout: Layout, c_desc_layout: Layout, c_smem_layout: Layout, cluster_shape: StaticTuple[SIMD[int32, 1], 3], a_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), b_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), c_swizzle: TensorMapSwizzle =
TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), transpose_b: Bool = True, num_threads: Int = 128, pipeline_stages: Int = 7, use_tma_store: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](a_tma_op: TMATensorTile[a_type, a_tile_layout, a_desc_layout], b_tma_op: TMATensorTile[b_type, b_tile_layout, b_desc_layout], c_tma_op: TMATensorTile[c_type, c_smem_layout, c_desc_layout], a_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], expert_ids: NDBuffer[uint32, 1, MutableAnyOrigin], c: LayoutTensor[c_type, c_layout, MutableAnyOrigin])` --- ## grouped_matmul_sm90 `grouped_matmul_sm90[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool = True, wgmma_shape: IndexList[3] = Index(64, 256, 16), config: MatmulConfig[a_type, b_type, c_type, transpose_b, wgmma_shape] = default_config_sm90[::DType,::DType,::DType,::Bool,::IndexList[::Int(), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], a_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], max_num_tokens_per_expert: Int, b: NDBuffer[b_type, 3, MutableAnyOrigin, b_shape], expert_ids: NDBuffer[uint32, 1, MutableAnyOrigin], num_active_experts: Int, ctx: DeviceContext)` --- ## grouped_matmul ## Aliases ### `NumWarpPerWarpGroup` `alias NumWarpPerWarpGroup = 4` ### `WARP_GROUP_SIZE` `alias WARP_GROUP_SIZE = 128` ## Functions * [​`default_config_sm90`](./default_config_sm90): * [​`grouped_matmul`](./grouped_matmul): * [​`grouped_matmul_kernel`](./grouped_matmul_kernel): * [​`grouped_matmul_sm90`](./grouped_matmul_sm90): * [​`naive_grouped_matmul`](./naive_grouped_matmul): * [​`naive_grouped_matmul_kernel`](./naive_grouped_matmul_kernel): --- ## naive_grouped_matmul `naive_grouped_matmul[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool = True](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b: NDBuffer[b_type, 3, MutableAnyOrigin, b_shape], a_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], expert_ids: NDBuffer[uint32, 1, MutableAnyOrigin], max_num_tokens_per_expert: Int, num_active_experts: Int, ctx: DeviceContext)` --- ## naive_grouped_matmul_kernel `naive_grouped_matmul_kernel[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b: NDBuffer[b_type, 3, MutableAnyOrigin, b_shape], a_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], expert_ids: NDBuffer[uint32, 1, MutableAnyOrigin])` --- ## linalg Provides CPU and GPU implementations of linear algebra functions. ## Modules * [​`accumulate`](./accumulate/): * [​`apple_accelerate`](./apple_accelerate/): * [​`apple_amx_intrinsics`](./apple_amx_intrinsics/): * [​`bmm`](./bmm/): * [​`dispatch_table_a100_gpu`](./dispatch_table_a100_gpu/): * [​`distributed_matmul`](./distributed_matmul/): * [​`dual_gemm`](./dual_gemm/): * [​`fast_div`](./fast_div/): Implements the fast division algorithm. 
* [​`fp8_quantization`](./fp8_quantization/): * [​`gemv`](./gemv/): * [​`grouped_matmul`](./grouped_matmul/): * [​`intel_amx_intrinsics`](./intel_amx_intrinsics/): * [​`matmul`](./matmul/): * [​`matmul_default`](./matmul_default/): * [​`matmul_gpu`](./matmul_gpu/): * [​`matmul_i8mm`](./matmul_i8mm/): * [​`matmul_neon`](./matmul_neon/): * [​`matmul_sm90`](./matmul_sm90/): * [​`matmul_tile_scheduler`](./matmul_tile_scheduler/): * [​`matmul_vendor`](./matmul_vendor/): * [​`matmul_vnni`](./matmul_vnni/): * [​`matrix_band_part`](./matrix_band_part/): The module implements matrix band part functions. * [​`neon_intrinsics`](./neon_intrinsics/): * [​`packing`](./packing/): * [​`qr_factorization`](./qr_factorization/): * [​`transpose`](./transpose/): The module implements Transpose functions. * [​`utils`](./utils/): * [​`utils_gpu`](./utils_gpu/): * [​`vendor_blas`](./vendor_blas/): * [​`vnni_intrinsics`](./vnni_intrinsics/): --- ## intel_amx_intrinsics ## Aliases ### `void` `alias void = invalid` ## Structs * [​`__tile`](./__tile): An AMX tile representation. * [​`tileconfig`](./tileconfig): ## Functions * [​`init_intel_amx`](./init_intel_amx): --- ## init_intel_amx `init_intel_amx() -> Bool` --- ## tileconfig `struct tileconfig` ## Fields * ​palette\_id (`SIMD[uint8, 1]`): * ​start\_row (`SIMD[uint8, 1]`): * ​reserved (`StaticTuple[scalar, 14]`): * ​colb (`StaticTuple[scalar, 16]`): * ​rows (`StaticTuple[scalar, 16]`): ## Implemented traits `AnyType`, `UnknownDestructibility` --- ## InnerMatmulKernel ## Implemented traits `AnyType`, `Copyable`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__inner_matmul__` `__inner_matmul__[kernel_rows: Int, kernel_cols: Int, simd_size: Int](self: _Self, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b_packed: NDBuffer[type, 3, origin, shape], global_offset: GemmShape, global_bound: GemmShape, tile_n_k: IndexList[2], skip_boundary_check: Bool)` --- ## TiledMatmul `struct TiledMatmul[a_mut: Bool, b_mut: Bool, //, config: KernelConfig, transpose_b: Bool, b_packed: Bool, elementwise_epilogue_enabled: Bool, kernel_id: InnerKernelID, a_type: DType, a_shape: DimList, a_origin: Origin[a_mut], b_type: DType, b_shape: DimList, b_origin: Origin[b_mut], c_type: DType, c_shape: DimList, c_origin: MutableOrigin, algorithm: InnerMatmulKernel]` Tiled matmul implementation integrating packing, inner loop and tile partitions. TODO: add tag based implementation dispatch. TODO: add fusion hooks.
## Fields * ​alg (`algorithm`): * ​c (`NDBuffer[c_type, 2, c_origin, c_shape]`): * ​a (`NDBuffer[a_type, 2, a_origin, a_shape]`): * ​b (`NDBuffer[b_type, 2, b_origin, b_shape]`): * ​tile\_n\_k (`IndexList[2]`): * ​global\_tile\_offset (`GemmShape`): * ​global\_tile\_shape (`GemmShape`): * ​b\_tile\_generator (`BTileGenerator[config, a_type, b_type, c_type, b_shape, transpose_b, b_packed, b_origin]`): * ​elementwise\_epilogue\_fn (`fn(GemmShape, GemmShape) escaping -> None`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` --- ## elementwise_epilogue_c_tile `elementwise_epilogue_c_tile[: origin.set, //, simd_width: Int, type: DType, origin: MutableOrigin, c_shape: DimList, func: fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None](offset: GemmShape, tile_len: GemmShape, c: NDBuffer[type, 2, origin, c_shape])` --- ## matmul ## Structs * [​`TiledMatmul`](./TiledMatmul): Tiled matmul implementation integrating packing, inner loop and tile partitions. ## Traits * [​`InnerMatmulKernel`](./InnerMatmulKernel): ## Functions * [​`elementwise_epilogue_c_tile`](./elementwise_epilogue_c_tile): * [​`matmul`](./matmul): * [​`tiled_matmul_run`](./tiled_matmul_run): Interface function to run tiled matmul on a given sub-tile. --- ## matmul `matmul[transpose_a: Bool = False, transpose_b: Bool = False, b_packed: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1}), saturated_vnni: Bool = False, single_thread_blocking_override: Bool = False, _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape], ctx: DeviceContextPtr = DeviceContextPtr())` `matmul[transpose_a: Bool = False, transpose_b: Bool = False, b_packed: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1}), saturated_vnni: Bool = False, single_thread_blocking_override: Bool = False, _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape], ctx: Optional[DeviceContext])` --- ## tiled_matmul_run `tiled_matmul_run[config: KernelConfig, transpose_b: Bool, b_packed: Bool, simd_size: Int, elementwise_epilogue_enabled: Bool, kernel_id: InnerKernelID, algorithm: InnerMatmulKernel](alg: algorithm, c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape], elementwise_epilogue_fn: fn(GemmShape, GemmShape) escaping -> None, global_tile_shape: GemmShape, 
global_tile_offset: GemmShape)` Interface function to run tiled matmul on a given sub-tile. **Args:** * ​alg (`algorithm`): InnerMatmulKernel algorithm for microkernel. * ​c (`NDBuffer[type, 2, origin, shape]`): Pre-allocated buffer space for result. * ​a (`NDBuffer[type, 2, origin, shape]`): Operand A of the matmul. * ​b (`NDBuffer[type, 2, origin, shape]`): Operand B of the matmul. * ​elementwise\_epilogue\_fn (`fn(GemmShape, GemmShape) escaping -> None`): The elementwise epilogue function. * ​global\_tile\_shape (`GemmShape`): Tile shape this call will process. * ​global\_tile\_offset (`GemmShape`): Tile offset on the original buffer. --- ## Inner_matmul_default `struct Inner_matmul_default` ## Implemented traits `AnyType`, `Copyable`, `InnerMatmulKernel`, `Movable`, `UnknownDestructibility` ## Methods ### `__inner_matmul__` `__inner_matmul__[kernel_rows: Int, kernel_cols: Int, simd_size: Int](self, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b_packed: NDBuffer[type, 3, origin, shape], global_offset: GemmShape, global_bound: GemmShape, tile_n_k: IndexList[2], skip_boundary_check: Bool)` Utility function on the inner loop. Run the inner kernel on the whole (kernel\_rows, TileN, TileK) tile. --- ## matmul_default ## Structs * [​`Inner_matmul_default`](./Inner_matmul_default): --- ## AMDSchedulerTuning `@register_passable(trivial)` `struct AMDSchedulerTuning` ## Fields * ​block\_shape (`IndexList[2]`): * ​tuning\_values (`IndexList[3]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` --- ## matmul_gpu ## Structs * [​`AMDSchedulerTuning`](./AMDSchedulerTuning): ## Functions * [​`__nvvm_ldg_f4`](./__nvvm_ldg_f4): * [​`matmul_kernel`](./matmul_kernel): Matrix Multiplication using shared memory. This version loads blocks of size tile\_size x tile\_size from A and B and updates a tile\_size x tile\_size tile in C. The thread block should have shape (tile\_size, tile\_size, 1). Each thread is mapped to one element in C. The grid should have shape (N/tile\_size, M/tile\_size, 1). N is the first dimension for coalesced access. * [​`matmul_kernel_naive`](./matmul_kernel_naive): * [​`multistage_gemm`](./multistage_gemm): * [​`split_k_reduce`](./split_k_reduce): --- ## matmul_kernel `matmul_kernel[c_type: DType, a_type: DType, b_type: DType, tile_size: Int, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c_ptr: UnsafePointer[SIMD[c_type, 1]], a_ptr: UnsafePointer[SIMD[a_type, 1]], b_ptr: UnsafePointer[SIMD[b_type, 1]], m: Int, n: Int, k: Int)` Matrix Multiplication using shared memory. This version loads blocks of size tile\_size x tile\_size from A and B and updates a tile\_size x tile\_size tile in C. The thread block should have shape (tile\_size, tile\_size, 1). Each thread is mapped to one element in C. The grid should have shape (N/tile\_size, M/tile\_size, 1). N is the first dimension for coalesced access.
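To make the launch geometry concrete, here is a minimal sketch of the grid and block shape calculation described above. It is an illustration, not part of the API: `tile_size`, `m`, and `n` are made-up example values, and rounding up with `ceildiv` is an assumption for shapes that are not multiples of the tile size.

```mojo
from math import ceildiv

fn main():
    # Hypothetical launch-shape calculation for the tiled kernel above.
    alias tile_size = 16
    var m = 512   # rows of C
    var n = 1024  # columns of C
    # Block shape is (tile_size, tile_size, 1): one thread per C element.
    var block_x = tile_size
    var block_y = tile_size
    # Grid shape is (N/tile_size, M/tile_size, 1); N comes first so that
    # consecutive blocks walk the contiguous dimension (coalesced access).
    var grid_x = ceildiv(n, tile_size)
    var grid_y = ceildiv(m, tile_size)
    print(grid_x, grid_y, block_x, block_y)  # 64 32 16 16
```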
--- ## matmul_kernel_naive `matmul_kernel_naive[c_type: DType, a_type: DType, b_type: DType, BLOCK_DIM: Int, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c_ptr: UnsafePointer[SIMD[c_type, 1]], a_ptr: UnsafePointer[SIMD[a_type, 1]], b_ptr: UnsafePointer[SIMD[b_type, 1]], m: Int, n: Int, k: Int)` --- ## multistage_gemm `multistage_gemm[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool, config: MatmulConfig[a_type, b_type, c_type, transpose_b], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), serial_reduction: Bool = False](c: NDBuffer[c_type, 2, origin, c_shape], a: NDBuffer[a_type, 2, origin, a_shape], b: NDBuffer[b_type, 2, origin, b_shape], runtime_config: MatmulConfig[a_type, b_type, c_type, transpose_b], ctx: DeviceContext)` --- ## split_k_reduce `split_k_reduce[c_type: DType, work_space_type: DType, c_shape: DimList, work_space_shape: DimList, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, origin, c_shape], work_space: NDBuffer[work_space_type, 3, origin, work_space_shape], ctx: DeviceContext)` --- ## Inner_matmul_i8mm `struct Inner_matmul_i8mm` ## Implemented traits `AnyType`, `Copyable`, `InnerMatmulKernel`, `Movable`, `UnknownDestructibility` ## Methods ### `__inner_matmul__` `__inner_matmul__[kernel_rows: Int, kernel_cols: Int, simd_size: Int](self, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b_packed: NDBuffer[type, 3, origin, shape], global_offset: GemmShape, global_bound: GemmShape, tile_n_k: IndexList[2], skip_boundary_check: Bool)` Utility function on the inner loop. Run the inner kernel on the whole (kernel\_rows2, TileN, TileK) tile. 
--- ## LoadStore_i8mm `struct LoadStore_i8mm[type: DType, simd_size: Int, single_row: Bool, tile_rows: Int, tile_columns: Int]` ## Fields * ​output\_tile (`_Accumulator[type, tile_rows, ceildiv(tile_columns, simd_size), simd_size]`): * ​skip\_boundary\_check (`Bool`): ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `num_simd_cols` `alias num_simd_cols = ceildiv(tile_columns, simd_size)` ## Methods ### `__init__` `@implicit` `__init__(out self, skip_boundary_check: Bool)` --- ## matmul_i8mm ## Structs * [​`Inner_matmul_i8mm`](./Inner_matmul_i8mm): * [​`LoadStore_i8mm`](./LoadStore_i8mm): --- ## Inner_matmul_neon `struct Inner_matmul_neon` ## Implemented traits `AnyType`, `Copyable`, `InnerMatmulKernel`, `Movable`, `UnknownDestructibility` ## Methods ### `__inner_matmul__` `__inner_matmul__[kernel_rows: Int, kernel_cols: Int, simd_size: Int](self, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b_packed: NDBuffer[type, 3, origin, shape], global_offset: GemmShape, global_bound: GemmShape, tile_n_k: IndexList[2], skip_boundary_check: Bool)` Utility function on the inner loop. Run the inner kernel on the whole (kernel\_rows, TileN, TileK) tile.
--- ## matmul_neon ## Structs * [​`Inner_matmul_neon`](./Inner_matmul_neon): --- ## cluster_size `cluster_size[cluster_shape: StaticTuple[SIMD[int32, 1], 3]]() -> SIMD[int32, 1]` --- ## consumer_main_loop `consumer_main_loop[accum_type: DType, a_type: DType, b_type: DType, c_reg_layout: Layout, a_smem_layout: Layout, b_smem_layout: Layout, wgmma_shape: IndexList[3], a_swizzle: TensorMapSwizzle, b_swizzle: TensorMapSwizzle, transpose_b: Bool, pipeline_stages: Int, /, *, num_k_iters: Int, cluster_shape: StaticTuple[SIMD[int32, 1], 3] = StaticTuple(__init__[__mlir_type.!pop.int_literal](1), __init__[__mlir_type.!pop.int_literal](1), __init__[__mlir_type.!pop.int_literal](1)), promotion_frequency: Int = 1, num_consumer: Int = 1](final_c_reg_tile: LayoutTensor[accum_type, c_reg_layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c_reg_tile: LayoutTensor[accum_type, c_reg_layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], a_smem_iter: LayoutTensorIter[a_type, a_smem_layout, origin, address_space=AddressSpace(3), alignment=128, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], b_smem_iter: LayoutTensorIter[b_type, b_smem_layout, origin, address_space=AddressSpace(3), alignment=128, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], mut read_pipeline_states: PipelineState[pipeline_stages], full: UnsafePointer[SharedMemBarrier, address_space=AddressSpace(3), alignment=8], empty: UnsafePointer[SharedMemBarrier, address_space=AddressSpace(3), alignment=8], wgmma_op: TensorCoreAsync[accum_type, a_type, b_type, wgmma_shape, a_swizzle, b_swizzle, transpose_b], local_warp_group_idx: UInt, warp_group_thread_idx: UInt)` --- ## hopper_matmul_tma_wgmma `hopper_matmul_tma_wgmma[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool, wgmma_shape: IndexList[3], block_tile_shape: IndexList[3]](c_device: NDBuffer[c_type, 2, origin, c_shape], a_device: NDBuffer[a_type, 2, origin, a_shape], b_device: NDBuffer[b_type, 2, origin, b_shape], M: Int, N: Int, K: Int, ctx: DeviceContext)` --- ## hopper_matmul_tma_wgmma_kernel `hopper_matmul_tma_wgmma_kernel[a_type: DType, b_type: DType, c_type: DType, a_layout: Layout, b_layout: Layout, a_tile_layout: Layout, b_tile_layout: Layout, c_layout: Layout, block_tile_shape: IndexList[3], wgmma_shape: IndexList[3], a_desc_layout: Layout, b_desc_layout: Layout, transpose_b: Bool = True, promotion_frequency: Int = 1](a_tma_op: TMATensorTile[a_type, a_tile_layout, a_desc_layout], b_tma_op: TMATensorTile[b_type, b_tile_layout, b_desc_layout], c: LayoutTensor[c_type, c_layout, MutableAnyOrigin])` --- ## matmul_sm90 ## Aliases ### `NumWarpPerWarpGroup` `alias NumWarpPerWarpGroup = 4` ### `WARP_GROUP_SIZE` `alias WARP_GROUP_SIZE = 128` ## Functions * [​`cluster_size`](./cluster_size): * [​`consumer_main_loop`](./consumer_main_loop): * [​`hopper_matmul_tma_wgmma`](./hopper_matmul_tma_wgmma): * [​`hopper_matmul_tma_wgmma_kernel`](./hopper_matmul_tma_wgmma_kernel): * [​`producer_main_loop`](./producer_main_loop): * [​`promote_to_cuda_cores`](./promote_to_cuda_cores): * 
[​`tma_wgmma_warp_specialized_gemm_kernel`](./tma_wgmma_warp_specialized_gemm_kernel): * [​`tma_wgmma_warp_specialized_gemm_kernel_persistent`](./tma_wgmma_warp_specialized_gemm_kernel_persistent): * [​`warp_specialize_gemm_with_multicasting`](./warp_specialize_gemm_with_multicasting): * [​`warp_specialized_gemm_output`](./warp_specialized_gemm_output): --- ## producer_main_loop `producer_main_loop[a_type: DType, b_type: DType, a_tile_layout: Layout, b_tile_layout: Layout, a_smem_layout: Layout, b_smem_layout: Layout, a_desc_layout: Layout, b_desc_layout: Layout, pipeline_stages: Int, /, *, num_k_iters: Int, block_tile_shape: IndexList[3], cluster_shape: StaticTuple[SIMD[int32, 1], 3] = StaticTuple(__init__[__mlir_type.!pop.int_literal](1), __init__[__mlir_type.!pop.int_literal](1), __init__[__mlir_type.!pop.int_literal](1)), partitioned_multicast: Bool = False](a_tma_op: TMATensorTile[a_type, a_tile_layout, a_desc_layout], b_tma_op: TMATensorTile[b_type, b_tile_layout, b_desc_layout], a_smem_iter: LayoutTensorIter[a_type, a_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=128, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], b_smem_iter: LayoutTensorIter[b_type, b_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=128, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], m_coord: UInt, n_coord: UInt, rank_n: UInt, rank_m: UInt, mut write_pipeline_states: PipelineState[pipeline_stages], empty_mbar: UnsafePointer[SharedMemBarrier, address_space=AddressSpace(3), alignment=8], full_mbar: UnsafePointer[SharedMemBarrier, address_space=AddressSpace(3), alignment=8])` --- ## promote_to_cuda_cores `promote_to_cuda_cores[accum_type: DType, layout: Layout](c_reg_tile: LayoutTensor[accum_type, layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], final_c_reg_tile: LayoutTensor[accum_type, layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` --- ## tma_wgmma_warp_specialized_gemm_kernel `tma_wgmma_warp_specialized_gemm_kernel[a_type: DType, b_type: DType, c_type: DType, a_layout: Layout, b_layout: Layout, a_tile_layout: Layout, b_tile_layout: Layout, c_layout: Layout, block_tile_shape: IndexList[3], wgmma_shape: IndexList[3], a_desc_layout: Layout, b_desc_layout: Layout, c_desc_layout: Layout, c_tma_layout: Layout, c_smem_layout: Layout, cluster_shape: StaticTuple[SIMD[int32, 1], 3], a_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), b_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), c_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), transpose_b: Bool = True, num_threads: Int = 128, pipeline_stages: Int = 7, partitioned_multicast: Bool = False, use_tma_store: Bool = False, promotion_frequency: Int = 1, pdl_level: PDLLevel = PDLLevel(), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = 
OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1}), hilbert_swizzle: Bool = False](a_tma_op: TMATensorTile[a_type, a_tile_layout, a_desc_layout], b_tma_op: TMATensorTile[b_type, b_tile_layout, b_desc_layout], c_tma_op: TMATensorTile[c_type, c_tma_layout, c_desc_layout], c: LayoutTensor[c_type, c_layout, MutableAnyOrigin], lut_ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(1)] = UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(1)](0))` --- ## tma_wgmma_warp_specialized_gemm_kernel_persistent `tma_wgmma_warp_specialized_gemm_kernel_persistent[a_type: DType, b_type: DType, c_type: DType, a_layout: Layout, b_layout: Layout, a_tile_layout: Layout, b_tile_layout: Layout, c_layout: Layout, block_tile_shape: IndexList[3], wgmma_shape: IndexList[3], a_desc_layout: Layout, b_desc_layout: Layout, c_desc_layout: Layout, c_tma_layout: Layout, c_smem_layout: Layout, cluster_shape: StaticTuple[SIMD[int32, 1], 3], grid_shape: IndexList[2], schedule: MatmulSchedule, a_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), b_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), c_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), transpose_b: Bool = True, num_threads: Int = 128, pipeline_stages: Int = 7, partitioned_multicast: Bool = False, use_tma_store: Bool = False, promotion_frequency: Int = 1, pdl_level: PDLLevel = PDLLevel(), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1})](a_tma_op: TMATensorTile[a_type, a_tile_layout, a_desc_layout], b_tma_op: TMATensorTile[b_type, b_tile_layout, b_desc_layout], c_tma_op: TMATensorTile[c_type, c_tma_layout, c_desc_layout], c: LayoutTensor[c_type, c_layout, MutableAnyOrigin], problem_shape: IndexList[3])` --- ## warp_specialize_gemm_with_multicasting `warp_specialize_gemm_with_multicasting[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool, wgmma_shape: IndexList[3], config: MatmulConfig[a_type, b_type, c_type, transpose_b, wgmma_shape], grid_shape: OptionalReg[IndexList[2]] = OptionalReg[IndexList[2]]({:i1 0, 1}), use_tma_store: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1}), schedule: MatmulSchedule = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](-1)), hilbert_swizzle: Bool = False](c_device: NDBuffer[c_type, 2, origin, c_shape], a_device: NDBuffer[a_type, 2, origin, a_shape], b_device: NDBuffer[b_type, 2, origin, b_shape], M: Int, N: Int, K: Int, ctx: DeviceContext)` --- ## warp_specialized_gemm_output `warp_specialized_gemm_output[c_type: DType, accum_type: DType, c_layout: Layout, c_smem_layout: Layout, c_tma_layout: Layout, c_reg_layout: Layout, c_desc_layout: Layout, /, 
*, c_tile_shape: IndexList[2], c_swizzle: TensorMapSwizzle, wgmma_shape: IndexList[3], num_consumer: Int = 1, use_tma_store: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1})](c_tma_op: TMATensorTile[c_type, c_tma_layout, c_desc_layout], c: LayoutTensor[c_type, c_layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c_smem_tile: LayoutTensor[c_type, c_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=128], c_reg_tile: LayoutTensor[accum_type, c_reg_layout, MutableAnyOrigin, address_space=AddressSpace(5)], warp_group_thread_idx: UInt, local_warp_group_idx: UInt, local_thread_idx: UInt, block_y: Int, block_x: Int)` --- ## MatmulSchedule `@register_passable(trivial)` `struct MatmulSchedule` ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `DS_SCHEDULER` `alias DS_SCHEDULER = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](2))` ### `NONE` `alias NONE = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](-1))` ### `TILE1D` `alias TILE1D = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](0))` ### `TILE2D` `alias TILE2D = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](1))` ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` --- ## TileScheduler `@register_passable(trivial)` `struct TileScheduler[problem_shape: IndexList[3], tile_shape: IndexList[3], grid_shape: IndexList[2], cluster: IndexList[3] = Index(1, 1, 1), raster_dim: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](1), schedule: MatmulSchedule = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](1))]` ## Fields * ​idx (`SIMD[uint32, 1]`): * ​prob\_shape (`IndexList[3]`): * ​num\_waves\_m (`SIMD[uint32, 1]`): * ​num\_waves\_n (`SIMD[uint32, 1]`): * ​log\_num\_waves\_n (`FastDiv[uint32]`): * ​current\_iter (`Int`): * ​num\_aligned\_m\_blocks (`SIMD[uint32, 1]`): * ​num\_blocks (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `kNum1DBlocksPerGroup` `alias kNum1DBlocksPerGroup = __init__[__mlir_type.!pop.int_literal](16)` ### `kNumNBlocks` `alias kNumNBlocks = SIMD(ceildiv[::CeilDivable](problem_shape.__getitem__[::Indexer](1), tile_shape.__getitem__[::Indexer](1)))` ### `num_grids` `alias num_grids = SIMD((grid_shape.__getitem__[::Indexer](0) * grid_shape.__getitem__[::Indexer](1)))` ### `wave_shape` `alias wave_shape = Index((grid_shape.__getitem__[::Indexer](1) * tile_shape.__getitem__[::Indexer](0)), (grid_shape.__getitem__[::Indexer](0) * tile_shape.__getitem__[::Indexer](1)))` ## Methods ### `__init__` `__init__(prob_shape: IndexList[3]) -> Self` ### `get_current_work_info` `get_current_work_info(mut self) -> WorkInfo` ### `advance` `advance(mut self)` ### `fetch_next_work` `fetch_next_work(mut self) -> WorkInfo` ### `num_output_tiles` `num_output_tiles(self) -> UInt` ### `fetch_next_work_ds` `fetch_next_work_ds(mut self) -> WorkInfo` --- ## WorkInfo `@register_passable(trivial)` `struct WorkInfo` ## Fields * 
​m (`SIMD[uint32, 1]`): * ​n (`SIMD[uint32, 1]`): * ​k\_start (`SIMD[uint32, 1]`): * ​num\_k\_tiles (`SIMD[uint32, 1]`): * ​is\_valid\_tile (`Bool`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `is_valid` `is_valid(self) -> Bool` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` --- ## matmul_tile_scheduler ## Structs * [​`MatmulSchedule`](./MatmulSchedule): * [​`TileScheduler`](./TileScheduler): * [​`WorkInfo`](./WorkInfo): --- ## matmul_vendor ## Functions * [​`matmul`](./matmul): This implements the matmul kernel for the Blackwell architecture. Note that we do not currently have pure Mojo kernels that target the Blackwell architecture, so in their place we call the cuBLAS library. --- ## matmul `matmul[c_type: DType, a_type: DType, b_type: DType, //, use_tensor_core: Bool = False, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), config: OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]] = OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]]({:i1 0, 1}), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](c: NDBuffer[c_type, 2, origin, shape], a: NDBuffer[a_type, 2, origin, shape], b: NDBuffer[b_type, 2, origin, shape], ctx: DeviceContext)` This implements the matmul kernel for the Blackwell architecture. Note that we do not currently have pure Mojo kernels that target the Blackwell architecture, so in their place we call the cuBLAS library. --- ## Inner_matmul_vnni `struct Inner_matmul_vnni[saturated_vnni: Bool]` ## Implemented traits `AnyType`, `Copyable`, `InnerMatmulKernel`, `Movable`, `UnknownDestructibility` ## Methods ### `__inner_matmul__` `__inner_matmul__[kernel_rows: Int, kernel_cols: Int, simd_size: Int](self, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b_packed: NDBuffer[type, 3, origin, shape], global_offset: GemmShape, global_bound: GemmShape, tile_n_k: IndexList[2], skip_boundary_check: Bool)` Utility function on the inner loop. Run the inner kernel on the whole (kernel\_rows, TileN, TileK) tile. --- ## matmul_vnni ## Structs * [​`Inner_matmul_vnni`](./Inner_matmul_vnni): --- ## matrix_band_part The module implements matrix band part functions.
## Functions * [​`matrix_band_part`](./matrix_band_part): --- ## matrix_band_part `matrix_band_part[: origin.set, //, type: DType, int_type: DType, cond_type: DType, rank: Int, input_0_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], simd_width: Int, single_thread_blocking_override: Bool, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[rank], num_lower: NDBuffer[int_type, 1, origin], num_upper: NDBuffer[int_type, 1, origin], exclude_buf: NDBuffer[cond_type, 1, origin], output: NDBuffer[type, rank, origin], ctx: DeviceContextPtr)` --- ## neon_intrinsics --- ## BTileGenerator `struct BTileGenerator[mut: Bool, //, config: KernelConfig, a_type: DType, b_type: DType, c_type: DType, shape: DimList, transpose_b: Bool, b_packed: Bool, origin: Origin[mut]]` Struct to encapsulate a tile of B that supports prepacking. If b\_packed is true, calls to get\_tile will return a buffer view from B. Otherwise, calls to get\_tile will copy a tile from B into a stack allocated scratch buffer and return a view of that. ## Fields * ​b (`NDBuffer[b_type, 2, origin, shape]`): * ​b\_tile\_stack\_ptr (`UnsafePointer[SIMD[b_type, 1]]`): * ​tile\_n\_k (`IndexList[2]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `get` `static get(b: NDBuffer[b_type, 2, origin, shape], tile_n_k: IndexList[2]) -> Self` ### `get_tile` `get_tile[inner_size: Int](self, global_offset: GemmShape, tile_dim_nk: IndexList[2], valid_data_dim_nk: IndexList[2]) -> NDBuffer[b_type, 3, MutableAnyOrigin, config.packed_shape]` Get a packed matrix (B) tile. valid\_data\_dim\_nk is ignored for pre-packing, where the tile is padded to have the shape of tile\_dim\_nk. **Args:** * ​global\_offset (`GemmShape`): Offset in the global M, N, K dimensions. * ​tile\_dim\_nk (`IndexList[2]`): Tile shape based on cache size and matrix dimensions. * ​valid\_data\_dim\_nk (`IndexList[2]`): The upper bounds for N and K dimensions. **Returns:** A view of the packed tile. --- ## PackMatrixCols `struct PackMatrixCols[original_mut: Bool, //, original_shape: DimList, packed_shape: DimList, type: DType, simd_size: Int, column_inner_size: Int, use_vnni: Bool, use_i8mm: Bool, packed_origin: MutableOrigin, original_origin: Origin[original_mut]]` Pack columns from a matrix into the mlas packed layout and extract inner vectors of columns into the packed inner dimension, e.g. extracts \[X, Y] and packs as \[Yo]\[X]\[Yi]. ## Fields * ​packed\_matrix (`NDBuffer[type, 3, packed_origin, packed_shape]`): * ​original\_matrix (`NDBuffer[type, 2, original_origin, original_shape]`): * ​global\_offset (`IndexList[2]`): * ​pack\_tile\_dim (`IndexList[2]`): * ​valid\_data\_dim (`IndexList[2]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `run` `static run(packed_matrix: NDBuffer[type, 3, MutableAnyOrigin, packed_shape], original_matrix: NDBuffer[type, 2, MutableAnyOrigin, original_shape], global_offset: IndexList[2], pack_tile_dim: IndexList[2], valid_data_dim: IndexList[2])` Interface function to run the packing routine. **Args:** * ​packed\_matrix (`NDBuffer`): Pre-allocated buffer space for packed data. * ​original\_matrix (`NDBuffer`): Data buffer containing the original matrix to pack. * ​global\_offset (`IndexList`): Offset to use when indexing the original matrix. * ​pack\_tile\_dim (`IndexList`): 2D dimension tuple describing the size of the packed tile. * ​valid\_data\_dim (`IndexList`): 2D dimension tuple describing the amount of valid data on the global buffer starting from the offset. --- ## PackMatrixRows `struct PackMatrixRows[original_mut: Bool, //, original_shape: DimList, packed_shape: DimList, type: DType, simd_size: Int, row_inner_size: Int, packed_origin: MutableOrigin, original_origin: Origin[original_mut]]` Pack rows from a matrix into the mlas packed layout and extract inner vectors of rows into the packed inner dimension, e.g. extract tile \[X, Y] and pack into \[Xo]\[Y]\[Xi]. ## Fields * ​packed\_matrix (`NDBuffer[type, 3, packed_origin, packed_shape]`): * ​original\_matrix (`NDBuffer[type, 2, original_origin, original_shape]`): * ​global\_offset (`IndexList[2]`): * ​pack\_tile\_dim (`IndexList[2]`): * ​valid\_data\_dim (`IndexList[2]`): * ​valid\_simd\_dim (`IndexList[2]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `run` `static run(packed_matrix: NDBuffer[type, 3, packed_origin, packed_shape], original_matrix: NDBuffer[type, 2, original_origin, original_shape], global_offset: IndexList[2], pack_tile_dim: IndexList[2], valid_data_dim: IndexList[2])` Interface function to run the packing routine. **Args:** * ​packed\_matrix (`NDBuffer`): Pre-allocated buffer space for packed data. * ​original\_matrix (`NDBuffer`): Data buffer containing the original matrix to pack. * ​global\_offset (`IndexList`): Offset to use when indexing the original matrix. * ​pack\_tile\_dim (`IndexList`): 2D dimension tuple describing the size of the packed tile. * ​valid\_data\_dim (`IndexList`): 2D dimension tuple describing the amount of valid data on the global buffer starting from the offset. --- ## packing ## Structs * [​`BTileGenerator`](./BTileGenerator): Struct to encapsulate a tile of B that supports prepacking. * [​`PackMatrixCols`](./PackMatrixCols): Pack columns from a matrix into the mlas packed layout and extract inner vectors of columns into the packed inner dimension, e.g. extracts \[X, Y] and packs as \[Yo]\[X]\[Yi]. * [​`PackMatrixRows`](./PackMatrixRows): Pack rows from a matrix into the mlas packed layout and extract inner vectors of rows into the packed inner dimension, e.g. extract tile \[X, Y] and pack into \[Xo]\[Y]\[Xi]. ## Functions * [​`pack_b`](./pack_b): Utility function to pack the entire B matrix, such that each \[tile\_n // inner\_size, tile\_k, inner\_size] tile of src is contiguous in dst. * [​`pack_b_ndbuffer`](./pack_b_ndbuffer): * [​`pack_matmul_b_shape_func`](./pack_matmul_b_shape_func): * [​`pack_transposed_b_ndbuffer`](./pack_transposed_b_ndbuffer): --- ## pack_b `pack_b[transpose_b: Bool, simd_size: Int, inner_size: Int, a_type: DType, b_type: DType, c_type: DType, src_shape: DimList, dst_shape: DimList](dst: NDBuffer[b_type, 2, origin, dst_shape], src: NDBuffer[b_type, 2, origin, src_shape], tile_n: Int, tile_k: Int)` Utility function to pack the entire B matrix, such that each \[tile\_n // inner\_size, tile\_k, inner\_size] tile of src is contiguous in dst. Tiles (not tile contents) are stored in row major order, so tile\[i, j] is tile\_n \* tile\_k bytes away from tile\[i, j+1].
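As a concrete reading of this layout, the sketch below computes where element `(n, k)` of `src` lands in `dst`. It is an illustration, not the library's packing code: `packed_offset` and `tiles_per_row` are hypothetical names, and it assumes tiles are visited in row-major order with the `n` dimension blocked by `inner_size` inside each tile, as described above.

```mojo
fn packed_offset(n: Int, k: Int, tile_n: Int, tile_k: Int,
                 inner_size: Int, tiles_per_row: Int) -> Int:
    # Which tile (i, j) the element falls in; tiles are row major, so
    # consecutive tiles are one tile footprint (tile_n * tile_k) apart.
    var tile_base = ((n // tile_n) * tiles_per_row + k // tile_k) * (tile_n * tile_k)
    # Position inside the [tile_n // inner_size, tile_k, inner_size] tile.
    var ni = n % tile_n
    var ki = k % tile_k
    return tile_base + (ni // inner_size) * (tile_k * inner_size) + ki * inner_size + (ni % inner_size)

fn main():
    # Example: a 4 x 8 tile with inner_size = 2; element (1, 3) of the
    # first tile maps to offset 3 * 2 + 1 = 7 within its [2, 8, 2] view.
    print(packed_offset(1, 3, 4, 8, 2, 1))  # 7
```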
--- ## pack_b_ndbuffer `pack_b_ndbuffer[b_mut: Bool, //, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, c_type: DType, c_shape: DimList, b_origin: Origin[b_mut], output_origin: MutableOrigin](b_input: NDBuffer[b_type, 2, b_origin, b_shape], output_buffer: NDBuffer[b_type, 2, output_origin])` --- ## pack_matmul_b_shape_func `pack_matmul_b_shape_func[a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, c_type: DType, c_shape: DimList, transpose_in_0: Bool, single_thread_blocking_override: Bool](b_input: NDBuffer[b_type, 2, origin, b_shape]) -> IndexList[2]` --- ## pack_transposed_b_ndbuffer `pack_transposed_b_ndbuffer[a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, c_type: DType, c_shape: DimList](b_input: NDBuffer[b_type, 2, origin, b_shape], output_buffer: NDBuffer[b_type, 2, origin])` --- ## apply_q `apply_q[dtype: DType, element_layout: Layout](sigma: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], A: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], X: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Applies the implicit Q factor stored in `A` and `sigma` after calling `qr_factorization` to the `X` matrix. See `qr_factorization` for more details on the construction of the Householder reflector. --- ## form_q `form_q[dtype: DType, element_layout: Layout](sigma: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], A: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], Q: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Forms the Q factor from the implicit Q factor stored in `A` and `sigma` after calling `qr_factorization` and stores the result in `Q`. --- ## qr_factorization ## Functions * [​`apply_q`](./apply_q): Applies the implicit Q factor stored in `A` and `sigma` after calling `qr_factorization` to the `X` matrix. * [​`form_q`](./form_q): Forms the Q factor from the implicit Q factor stored in `A` and `sigma` after calling `qr_factorization` and stores the result in `Q`. * [​`qr_factorization`](./qr_factorization): Performs QR factorization of a matrix `A` using the Householder reflector method. --- ## qr_factorization `qr_factorization[dtype: DType, element_layout: Layout](sigma: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], A: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Performs QR factorization of a matrix `A` using the Householder reflector method. 
This function computes the QR factorization of matrix `A` in-place using Householder reflections. The result is stored directly in the input matrix `A`, with scaling factors in `sigma`. The implementation follows the LAPACK algorithm for generating Householder reflectors in-place. Algorithm: The Householder reflector is defined as

```
U = I - σww^H
```

where

```
w = (x + νe₁)/ξ
σ = ξ/ν
ξ = x₀ + ν
ν = sign(x₀)‖x‖₂
```

This ensures that U^H x = -νe₁ and U^H U = I. References: \[1] Lehoucq, R. B. (1996). The computation of elementary unitary matrices. ACM Transactions on Mathematical Software, 22(4), 393-400. Note: There is a typo in reference \[lawn72]. The correct result is U^H x = -νe₁. --- ## transpose The module implements Transpose functions. ## Functions * [​`transpose`](./transpose): Permute the axes of `input` based on `perms`, and place the result in `output`. * [​`transpose_2d`](./transpose_2d): * [​`transpose_3d_swap_inner`](./transpose_3d_swap_inner): * [​`transpose_3d_swap_outer`](./transpose_3d_swap_outer): * [​`transpose_4d_swap_middle`](./transpose_4d_swap_middle): * [​`transpose_inplace`](./transpose_inplace): * [​`transpose_strided`](./transpose_strided): * [​`transpose_trivial_memcpy`](./transpose_trivial_memcpy): --- ## transpose `transpose[rank: Int, type: DType, //](output: NDBuffer[type, rank, origin, shape], input: NDBuffer[type, rank, origin, shape], perms: UnsafePointer[SIMD[index, 1]])` Permute the axes of `input` based on `perms`, and place the result in `output`. Example:

```mojo
transpose(output, input, [2, 0, 1]) # guarantees output[x, y, z] = input[z, x, y]
```

**Parameters:** * ​rank (`Int`): The rank of input and output buffers. * ​type (`DType`): The dtype of buffer elements. **Args:** * ​output (`NDBuffer[type, rank, origin, shape]`): The output buffer. * ​input (`NDBuffer[type, rank, origin, shape]`): The input buffer. * ​perms (`UnsafePointer[SIMD[index, 1]]`): Permutation of the input axes.
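To spell out the guarantee above with plain index arithmetic, here is a self-contained sketch of the rank-2 case, `perms = [1, 0]`. It uses flat `List`s rather than the `NDBuffer` API, purely to illustrate the index mapping:

```mojo
fn main():
    # A 2 x 3 row-major source; src[y, x] lives at src[y * 3 + x].
    var src = List[Int](0, 1, 2, 3, 4, 5)
    # The transposed destination is 3 x 2; dst[x, y] lives at dst[x * 2 + y].
    var dst = List[Int](0, 0, 0, 0, 0, 0)
    for x in range(3):
        for y in range(2):
            # perms = [1, 0] guarantees dst[x, y] = src[y, x].
            dst[x * 2 + y] = src[y * 3 + x]
    print(dst[1], "==", src[3])  # dst[0, 1] == src[1, 0] == 3
```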
--- ## transpose_2d `transpose_2d[rank: Int, output_shape: DimList, input_shape: DimList, type: DType](output: NDBuffer[type, rank, origin, output_shape], input: NDBuffer[type, rank, origin, input_shape], perms: UnsafePointer[SIMD[index, 1]], simplified_input_shape: IndexList[rank], simplified_rank: Int, offset: Int)` --- ## transpose_3d_swap_inner `transpose_3d_swap_inner[rank: Int, type: DType, //](output: NDBuffer[type, rank, origin, shape], input: NDBuffer[type, rank, origin, shape], perms: UnsafePointer[SIMD[index, 1]], simplified_input_shape: IndexList[rank], simplified_rank: Int)` --- ## transpose_3d_swap_outer `transpose_3d_swap_outer[rank: Int, output_shape: DimList, input_shape: DimList, type: DType](output: NDBuffer[type, rank, origin, output_shape], input: NDBuffer[type, rank, origin, input_shape], perms: UnsafePointer[SIMD[index, 1]], simplified_input_shape: IndexList[rank], simplified_rank: Int)` --- ## transpose_4d_swap_middle `transpose_4d_swap_middle[rank: Int, type: DType, //](output: NDBuffer[type, rank, origin, shape], input: NDBuffer[type, rank, origin, shape, strides], perms: UnsafePointer[SIMD[index, 1]], simplified_input_shape: IndexList[rank], simplified_rank: Int)` --- ## transpose_inplace `transpose_inplace[rows: Int, cols: Int, type: DType](buf: NDBuffer[type, 2, origin, __init__[::Indexer,::Indexer](rows, cols)])` --- ## transpose_strided `transpose_strided[rank: Int, type: DType, //](output: NDBuffer[type, rank, origin, shape], input: NDBuffer[type, rank, origin, shape], perms: UnsafePointer[SIMD[index, 1]])` --- ## transpose_trivial_memcpy `transpose_trivial_memcpy[rank: Int, output_shape: DimList, input_shape: DimList, type: DType](output: NDBuffer[type, rank, origin, output_shape], input: NDBuffer[type, rank, origin, input_shape])` --- ## GemmShape `@register_passable(trivial)` `struct GemmShape` Helper class to unpack gemm dimension and layout. ## Fields * ​M (`Int`): * ​N (`Int`): * ​K (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(index: IndexList[3]) -> Self` Constructor of a gemm shape record from an index tuple. **Args:** * ​index (`IndexList[3]`): The int tuple containing the index (m, n, k). ### `__getitem__` `__getitem__(self, idx: Int) -> Int` ### `__setitem__` `__setitem__(mut self, idx: Int, value: Int)` ### `__add__` `__add__(self, rhs: Self) -> Self` Coordinate-wise addition of two gemm shape records. **Args:** * ​rhs (`Self`): Another gemm shape record to add with. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Coordinate-wise subtraction of two gemm shape records. **Args:** * ​rhs (`Self`): Another gemm shape record to subtract with. ### `get` `static get[transpose_b: Bool](c: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> Self` Constructor of a gemm shape record from input buffers. M, N, and K are intentionally calculated using `a` and `c` ONLY. This is because `b` may be padded to a multiple of the tile size if it has been pre-packed. **Args:** * ​c (`NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): NDBuffer with allocated output space.
* ​a (`NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): NDBuffer containing matrix operand A. * ​b (`NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): NDBuffer containing matrix operand B. ### `as_index` `as_index(self) -> IndexList[3]` Utility to convert the underlying data to an index tuple, so that utilities such as elementwise add can be used. **Returns:** The constructed index tuple. --- ## InnerKernelID `@register_passable(trivial)` `struct InnerKernelID` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `DEFAULT` `alias DEFAULT = InnerKernelID(0)` ### `I8MM` `alias I8MM = InnerKernelID(3)` ### `NEON` `alias NEON = InnerKernelID(2)` ### `VNNI` `alias VNNI = InnerKernelID(1)` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` --- ## KernelConfig `struct KernelConfig` Static configuration of the matmul inner kernel. ## Fields * ​kernel\_rows (`Int`): * ​kernel\_cols (`Int`): * ​simd\_size (`Int`): * ​packed\_shape (`DimList`): ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, *, kernel_rows: Int, kernel_cols: Int, simd_size: Int, packed_shape: DimList)` --- ## MicroKernelShape `@register_passable(trivial)` `struct MicroKernelShape` Record describing the inner kernel shape. ## Fields * ​simd\_rows (`Int`): * ​simd\_cols (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(rows: Int, cols: Int) -> Self` --- ## SubMatmulConfig `struct SubMatmulConfig` Static configuration of sub-matrices in parallel matmul. ## Fields * ​offset (`IndexList[3]`): * ​shape (`IndexList[3]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `is_valid` `is_valid(self) -> Bool` --- ## apply_epilogue `apply_epilogue[elementwise_lambda: fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None, dst_layout: Layout, dst_element_layout: Layout = __init__[::Origin[::Bool(IntTuple(1), IntTuple(1))](src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], offset: Int)` --- ## calculate_tile_n_k `calculate_tile_n_k[a_type: DType, b_type: DType, c_type: DType, kernel_cols: Int](n: Int, k: Int) -> IndexList[2]` Helper heuristic function to decide on tile size to partition the matmul given the cache size and desired data layout. **Parameters:** * ​a\_type (`DType`): The type of the A tensor. * ​b\_type (`DType`): The type of the B tensor. * ​c\_type (`DType`): The type of the C tensor. * ​kernel\_cols (`Int`): The number of columns of the micro kernel. **Returns:** The calculated tile size to partition the matmul as (TileN, TileK).
`calculate_tile_n_k[a_type: DType, b_type: DType, c_type: DType, kernel_cols: Int](global_tile_shape: GemmShape) -> IndexList[2]` --- ## dispatch_get_kernel_type `dispatch_get_kernel_type[: origin.set, //, func: fn[Bool]() raises capturing -> None](m: Int, n: Int, k: Int)` `dispatch_get_kernel_type[: origin.set, //, func: fn[Bool]() capturing -> None](m: Int, n: Int, k: Int)` --- ## get_kernel_config `get_kernel_config[a_type: DType, b_type: DType, c_type: DType, *, kernel_type: Bool = False]() -> KernelConfig` Utility function to extract matmul configuration parameters for exported functions. TODO: Add target-dependent configuration parameters. --- ## get_kernel_type `get_kernel_type(m: Int, n: Int, k: Int) -> Bool` --- ## get_matmul_arch_factor `get_matmul_arch_factor[use_vnni: Bool, use_i8mm: Bool]() -> Int` --- ## get_matmul_kernel_shape `get_matmul_kernel_shape[a_type: DType, b_type: DType, c_type: DType, kernel_type: Bool]() -> MicroKernelShape` --- ## get_matmul_kernel_shape_ARM `get_matmul_kernel_shape_ARM[a_type: DType, b_type: DType, c_type: DType, kernel_type: Bool]() -> MicroKernelShape` --- ## get_matmul_kernel_shape_x86 `get_matmul_kernel_shape_x86[kernel_type: Bool]() -> MicroKernelShape` --- ## get_matmul_num_tasks `get_matmul_num_tasks[a_type: DType, b_type: DType, c_type: DType, simd_size: Int, kernel_type: Bool](m: Int, n: Int, k: Int, max_num_tasks: Int) -> Int` Compute the number of tasks for parallel matmul. The max number of tasks is typically the number of threads/cores. --- ## get_matmul_prefetch_b_distance_k `get_matmul_prefetch_b_distance_k() -> Int` --- ## get_min_task_size `get_min_task_size() -> Int` --- ## get_packB_unroll_factor `get_packB_unroll_factor() -> Int` --- ## get_pack_data_size `get_pack_data_size[type: DType]() -> Int` Utility to compute the number of elements to pack in each tile. **Returns:** The number of elements to pack. --- ## get_partitioned_matmul `get_partitioned_matmul[a_type: DType, b_type: DType, c_type: DType, kernel_rows: Int, kernel_cols: Int](m: Int, n: Int, k: Int, task_id: Int, num_tasks: Int) -> SubMatmulConfig` --- ## get_partitioned_matmul_mojo `get_partitioned_matmul_mojo[b_type: DType, kernel_rows: Int, kernel_cols: Int, use_i8mm: Bool = False](m: Int, n: Int, k: Int, task_id: Int, num_tasks: Int) -> SubMatmulConfig` --- ## get_partitioned_matmul_mojo_shape `get_partitioned_matmul_mojo_shape[b_type: DType, kernel_rows: Int, kernel_cols: Int, use_i8mm: Bool](m: Int, n: Int, k: Int, num_tasks: Int) -> IndexList[2]` --- ## utils ## Aliases ### `elementwise_compute_lambda_type` `alias elementwise_compute_lambda_type = fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]` ### `elementwise_epilogue_type` `alias elementwise_epilogue_type = fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None` ## Structs * [​`GemmShape`](./GemmShape): Helper class to unpack gemm dimensions and layout. * [​`InnerKernelID`](./InnerKernelID): * [​`KernelConfig`](./KernelConfig): Static configuration of the matmul inner kernel. * [​`MicroKernelShape`](./MicroKernelShape): Record describing the inner kernel shape. * [​`SubMatmulConfig`](./SubMatmulConfig): Static configuration of sub-matrices in parallel matmul. ## Functions * [​`apply_epilogue`](./apply_epilogue): * [​`calculate_tile_n_k`](./calculate_tile_n_k): Helper heuristic function to decide on tile size to partition the matmul given the cache size and desired data layout. 
* [​`dispatch_get_kernel_type`](./dispatch_get_kernel_type): * [​`get_kernel_config`](./get_kernel_config): Utility function to extract matmul configuration parameters for exported functions. * [​`get_kernel_type`](./get_kernel_type): * [​`get_matmul_arch_factor`](./get_matmul_arch_factor): * [​`get_matmul_kernel_shape`](./get_matmul_kernel_shape): * [​`get_matmul_kernel_shape_ARM`](./get_matmul_kernel_shape_ARM): * [​`get_matmul_kernel_shape_x86`](./get_matmul_kernel_shape_x86): * [​`get_matmul_num_tasks`](./get_matmul_num_tasks): Compute the number of tasks for parallel matmul. The max number of tasks is typically the number of threads/cores. * [​`get_matmul_prefetch_b_distance_k`](./get_matmul_prefetch_b_distance_k): * [​`get_min_task_size`](./get_min_task_size): * [​`get_pack_data_size`](./get_pack_data_size): Utility to compute the number of elements to pack in each tile. * [​`get_packB_unroll_factor`](./get_packB_unroll_factor): * [​`get_partitioned_matmul`](./get_partitioned_matmul): * [​`get_partitioned_matmul_mojo`](./get_partitioned_matmul_mojo): * [​`get_partitioned_matmul_mojo_shape`](./get_partitioned_matmul_mojo_shape): * [​`packA_i8mm`](./packA_i8mm): * [​`partition_work`](./partition_work): * [​`select_inner_kernel`](./select_inner_kernel): * [​`use_i8mm_fn`](./use_i8mm_fn): * [​`use_vnni_fn`](./use_vnni_fn): --- ## packA_i8mm `packA_i8mm[a_type: DType](t0: Int, t1: Int, k: Int, a_ptr: UnsafePointer[SIMD[a_type, 1]], a_packed_ptr: UnsafePointer[SIMD[a_type, 1]])` --- ## partition_work `partition_work(task_id: Int, num_tasks: Int, work: Int, work_block_size: Int) -> IndexList[2]` --- ## select_inner_kernel `select_inner_kernel[a_type: DType, b_type: DType, c_type: DType]() -> InnerKernelID` --- ## use_i8mm_fn `use_i8mm_fn[a_type: DType, b_type: DType, c_type: DType]() -> Bool` --- ## use_vnni_fn `use_vnni_fn[a_type: DType, b_type: DType, c_type: DType]() -> Bool` --- ## MatmulConfig `@register_passable(trivial)` `struct MatmulConfig[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool = False, mma_shape: IndexList[3] = get_mma_shape[::DType,::DType,::Int]()]` Static configuration of GPU matmul. 
## Fields * ​block\_tile\_shape (`IndexList[3]`): * ​warp\_tile\_shape (`IndexList[3]`): * ​num\_pipeline\_stages (`UInt`): * ​num\_k\_partitions (`UInt`): * ​k\_group\_size (`UInt`): * ​num\_warp\_k\_partitions (`UInt`): * ​cluster\_shape (`IndexList[3]`): * ​num\_consumer (`UInt`): * ​partitioned\_multicast (`Bool`): * ​scheduler\_hint (`IndexList[3]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `ACCUM_PRECISION` `alias ACCUM_PRECISION = 1` ### `accum_type` `alias accum_type = get_accum_type[::DType,::DType]()` ### `OUTPUT_PRECISION` `alias OUTPUT_PRECISION = 2` ### `split_k_reduction_scheme` `alias split_k_reduction_scheme = env_get_int[::StringSlice[::Bool()` ### `split_k_reduction_type` `alias split_k_reduction_type = c_type if (env_get_int[::StringSlice[::Bool() == 2) else get_accum_type[::DType,::DType]()` ## Methods ### `__init__` `__init__(block_tile_shape: IndexList[3] = Index(128, 128, 32), warp_tile_shape: IndexList[3] = Index(64, 64, 32), cluster_shape: IndexList[3] = Index(1, 1, 1), num_pipeline_stages: UInt = UInt(4), num_k_partitions: UInt = UInt(1), k_group_size: UInt = UInt(1), num_warp_k_partitions: UInt = UInt(1), num_consumer: UInt = UInt(1), partitioned_multicast: Bool = False, scheduler_hint: IndexList[3] = Index(2, 2, 2), pdl_level: PDLLevel = PDLLevel()) -> Self` ### `__eq__` `__eq__(self, rhs: MatmulConfig[a_type, b_type, c_type, transpose_b, mma_shape]) -> Bool` ### `num_warps_m` `num_warps_m(self) -> UInt` ### `num_warps_n` `num_warps_n(self) -> UInt` ### `num_threads` `num_threads(self) -> UInt` ### `shared_mem_usage` `shared_mem_usage(self) -> Int` ### `grid_dim` `grid_dim(self, m: UInt, n: UInt) -> IndexList[3]` ### `block_dim` `block_dim(self) -> IndexList[3]` ### `work_space_size` `work_space_size(self, M: UInt, N: UInt) -> UInt` ### `pdl_level` `pdl_level(self) -> PDLLevel` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` ### `__repr__` `__repr__(self) -> String` ### `__hash__` `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with the underlying bytes. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. --- ## MatmulKernels `@register_passable(trivial)` `struct MatmulKernels[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool = False]` Supported matmul kernels. The configurations are named `<arch>_<block tile shape>_<num pipeline stages>`, for example `ampere_128x128_4`. BK, the MMA shape, and the warp tile shape are decided internally. 
## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ampere_128x128_4` `alias ampere_128x128_4 = MatmulConfig(Index(128, 128, _bk_base[::DType,::Bool]()), Index(64, 64, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(4), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `ampere_256x128_3` `alias ampere_256x128_3 = MatmulConfig(Index(128, 256, (_bk_base[::DType,::Bool]() * 2)), Index(64, 64, (_bk_base[::DType,::Bool]() * 2)), Index(1, 1, 1), UInt(3), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `ampere_256x64_4` `alias ampere_256x64_4 = MatmulConfig(Index(64, 256, _bk_base[::DType,::Bool]()), Index(64, 64, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(4), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `hopper_128x128_4` `alias hopper_128x128_4 = MatmulConfig(Index(128, 128, _bk_base[::DType,::Bool]()), Index(64, 64, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(4), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `mi300x_128x128_1` `alias mi300x_128x128_1 = MatmulConfig(Index(128, 128, _bk_base[::DType,::Bool]()), Index(64, 64, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `mi300x_128x128_2` `alias mi300x_128x128_2 = MatmulConfig(Index(128, 128, _bk_base[::DType,::Bool]()), Index(64, 64, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(2), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `mi300x_128x256_1` `alias mi300x_128x256_1 = MatmulConfig(Index(128, 256, _bk_base[::DType,::Bool]()), Index(64, 128, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 4, 2), PDLLevel())` ### `mi300x_192x256_1` `alias mi300x_192x256_1 = MatmulConfig(Index(192, 256, _bk_base[::DType,::Bool]()), Index(96, 128, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(4, 6, 2), PDLLevel())` ### `mi300x_224x256_1` `alias mi300x_224x256_1 = MatmulConfig(Index(224, 256, _bk_base[::DType,::Bool]()), Index(112, 128, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(4, 7, 2), PDLLevel())` ### `mi300x_256x256_1` `alias mi300x_256x256_1 = MatmulConfig(Index(256, 256, _bk_base[::DType,::Bool]()), Index(128, 128, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(4, 8, 2), PDLLevel())` ### `mi300x_64x64_1` `alias mi300x_64x64_1 = MatmulConfig(Index(64, 64, _bk_base[::DType,::Bool]()), Index(32, 32, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `mi300x_64x64_splitk_1` `alias mi300x_64x64_splitk_1 = MatmulConfig(Index(64, 64, _bk_base[::DType,::Bool]()), Index(32, 32, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(4), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `tuning_config` `alias tuning_config = MatmulConfig(Index(env_get_int[::StringSlice[::Bool(), env_get_int[::StringSlice[::Bool(), env_get_int[::StringSlice[::Bool()), Index(env_get_int[::StringSlice[::Bool(), env_get_int[::StringSlice[::Bool(), env_get_int[::StringSlice[::Bool()), Index(1, 1, 1), UInt(env_get_int[::StringSlice[::Bool()), UInt(env_get_int[::StringSlice[::Bool()), UInt(1), UInt(env_get_int[::StringSlice[::Bool()), UInt(1), False, Index(2, 2, 2), PDLLevel())` 
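To make the naming concrete: in `ampere_128x128_4`, the block tile is BM x BN = 128 x 128 with four pipeline stages, and the alias pairs it with a 64 x 64 warp tile. The sketch below works out the derived quantities one would expect from such a tiling (warps per block, blocks per problem); it is an illustration under standard GEMM-tiling assumptions, not the library's actual `num_warps_m`/`grid_dim` implementation.

```mojo
fn ceil_div(a: Int, b: Int) -> Int:
    # Ceiling division: how many whole tiles are needed to cover a dimension.
    return (a + b - 1) // b

fn main():
    # Tile sizes taken from the ampere_128x128_4 alias above.
    alias BM = 128  # block tile rows
    alias BN = 128  # block tile cols
    alias WM = 64   # warp tile rows
    alias WN = 64   # warp tile cols
    var M = 4096
    var N = 4096
    print("warps per block:", (BM // WM) * (BN // WN))   # 4
    print("blocks:", ceil_div(M, BM) * ceil_div(N, BN))  # 32 * 32 = 1024
```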
--- ## block_swizzle `block_swizzle(block_idx: IndexList[2, element_type=element_type], grid_dim: IndexList[2, element_type=element_type]) -> IndexList[2, element_type=element_type]` --- ## create_hilbert_lut `create_hilbert_lut(ctx: DeviceContext, grid_x: Int, grid_y: Int) -> DeviceBuffer[uint32]` Precompute a Hilbert-curve block-swizzle lookup table for a rectangular grid. The returned device pointer refers to a 1-D UInt32 array of length grid\_x \* grid\_y. For linear (row-major) block id `id`, the packed value at `lut[id]` encodes the swizzled coordinates: upper 16 bits = y, lower 16 bits = x. --- ## get_config_from_shape `get_config_from_shape[a_type: DType, b_type: DType, c_type: DType, static_N: Int, static_K: Int, transpose_b: Bool = False, target: StringSlice[StaticConstantOrigin] = _accelerator_arch()](dyn_M: Int, ctx: DeviceContext) -> MatmulConfig[a_type, b_type, c_type, transpose_b]` --- ## get_hilbert_lut_with_cache `get_hilbert_lut_with_cache(ctx: DeviceContext, grid_x: Int, grid_y: Int) -> DeviceBuffer[uint32]` Get the Hilbert lookup table using a global cache (no struct needed). --- ## utils_gpu ## Structs * [​`MatmulConfig`](./MatmulConfig): Static configuration of GPU matmul. * [​`MatmulKernels`](./MatmulKernels): Supported matmul kernels. ## Functions * [​`block_swizzle`](./block_swizzle): * [​`create_hilbert_lut`](./create_hilbert_lut): Precompute a Hilbert-curve block-swizzle lookup table for a rectangular grid. * [​`get_config_from_shape`](./get_config_from_shape): * [​`get_hilbert_lut_with_cache`](./get_hilbert_lut_with_cache): Get the Hilbert lookup table using a global cache (no struct needed). * [​`select_config`](./select_config): --- ## select_config `select_config[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool = False](M: Int, N: Int, K: Int, ctx: DeviceContext) -> MatmulConfig[a_type, b_type, c_type, transpose_b]` --- ## Backend `@register_passable(trivial)` `struct Backend` ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility`, `Writable` ## Aliases ### `AUTOMATIC` `alias AUTOMATIC = Backend(0)` ### `CUBLAS` `alias CUBLAS = Backend(1)` ### `CUBLASLT` `alias CUBLASLT = Backend(2)` ### `HIPBLASLT` `alias HIPBLASLT = Backend(4)` ### `ROCBLAS` `alias ROCBLAS = Backend(3)` ## Methods ### `__init__` `@implicit` `__init__(value: Int) -> Self` ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` ### `__is__` `__is__(self, other: Self) -> Bool` ### `__isnot__` `__isnot__(self, other: Self) -> Bool` ### `__int__` `__int__(self) -> Int` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` --- ## Handle `struct Handle[backend: Backend = _resolve_backend[linalg::vendor_blas::Backend,::DType]()]` ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `resolved_backend` `alias resolved_backend = _resolve_backend[linalg::vendor_blas::Backend,::DType]()` ### `type` `alias type = Variant[UnsafePointer[NoneType], Handle, UnsafePointer[NoneType]]` ## Methods ### `__init__` `__init__(out self)` ### `__is__` `__is__(self, other: Backend) -> Bool` ### `__isnot__` `__isnot__(self, other: Backend) -> Bool` ### `__enter__` `__enter__(self) -> Self` ### `__exit__` `__exit__(mut self)` --- ## vendor_blas ## Structs * [​`Backend`](./Backend): * [​`Handle`](./Handle): ## Functions * [​`matmul`](./matmul): Matmul using the vendor BLAS library, with a global handle. 
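The packed entries returned by `create_hilbert_lut` (documented above) store the swizzled y coordinate in the upper 16 bits and x in the lower 16 bits. A minimal decoding sketch:

```mojo
fn main():
    # Pack y=3, x=17 the way the docs above describe, then decode it back.
    var packed: UInt32 = (UInt32(3) << 16) | UInt32(17)
    var y = packed >> 16
    var x = packed & 0xFFFF
    print("x =", x, "y =", y)  # x = 17 y = 3
```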
--- ## matmul `matmul[use_tf32: Bool = False](ctx: DeviceContext, c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape], *, c_row_major: Bool = False, transpose_a: Bool = False, transpose_b: Bool = False, alpha: SIMD[float32, 1] = __init__[__mlir_type.!pop.float_literal](1), beta: SIMD[float32, 1] = __init__[__mlir_type.!pop.float_literal](0))` Matmul using the vendor BLAS library, with a global handle. `matmul[use_tf32: Bool = False](ctx: DeviceContext, handle: Handle[backend], c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape], *, c_row_major: Bool = False, transpose_a: Bool = False, transpose_b: Bool = False, alpha: SIMD[float32, 1] = __init__[__mlir_type.!pop.float_literal](1), beta: SIMD[float32, 1] = __init__[__mlir_type.!pop.float_literal](0))` Matmul using the vendor BLAS library, with a global handle. --- ## dot_i16_to_i32_AVX2 `dot_i16_to_i32_AVX2[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` The dot product of the two words in each int32 element of a and b plus an int32 from src. **Constraints:** Requires AVX2. The size of the output vector must be 4, 8 or 16. **Parameters:** * ​width (`Int`): Size of the output SIMD vector. * ​a\_type (`DType`): The DType for a. * ​b\_type (`DType`): The DType for b. * ​c\_type (`DType`): The DType for c. **Args:** * ​src (`SIMD[c_type, width]`): An int32 SIMD vector. * ​a (`SIMD[a_type, width]`): An int16 SIMD vector. * ​b (`SIMD[b_type, width]`): An int16 SIMD vector. **Returns:** A SIMD vector of width elements. --- ## dot_i16_to_i32_x86 `dot_i16_to_i32_x86[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` The dot product of the two words in each int32 element of a and b plus an int32 from src using VNNI or AVX2. **Constraints:** Requires AVX512\_VNNI or AVX2. The size of the output vector must be 4, 8 or 16. **Parameters:** * ​width (`Int`): Size of the output SIMD vector. * ​a\_type (`DType`): The DType for a. * ​b\_type (`DType`): The DType for b. * ​c\_type (`DType`): The DType for c. **Args:** * ​src (`SIMD[c_type, width]`): An int32 SIMD vector. * ​a (`SIMD[a_type, width]`): An int16 SIMD vector. * ​b (`SIMD[b_type, width]`): An int16 SIMD vector. **Returns:** A SIMD vector of width elements. --- ## dot_i8_to_i32_AVX2 `dot_i8_to_i32_AVX2[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` The dot product of the four bytes in each int32 element of a and b plus an int32 from src. **Constraints:** Requires AVX2. The size of the output vector must be 4, 8 or 16. The a argument has range \[0, 255]. The b argument has range \[-128, 127]. **Parameters:** * ​width (`Int`): Size of the output SIMD vector. * ​a\_type (`DType`): The DType for a. * ​b\_type (`DType`): The DType for b. * ​c\_type (`DType`): The DType for c. **Args:** * ​src (`SIMD[c_type, width]`): An int32 SIMD vector. * ​a (`SIMD[a_type, width]`): A uint8 SIMD vector. * ​b (`SIMD[b_type, width]`): An int8 SIMD vector. **Returns:** A SIMD vector of width elements. 
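Each int32 lane of the `dot_i8_to_i32_*` result is `src + a0*b0 + a1*b1 + a2*b2 + a3*b3`, where the four uint8 bytes of `a` and the four int8 bytes of `b` are widened to int32 before multiplying. A scalar per-lane emulation of that semantics (an illustrative sketch, not the intrinsic itself):

```mojo
fn dot_4x_i8_lane(src: Int32, a: SIMD[DType.uint8, 4], b: SIMD[DType.int8, 4]) -> Int32:
    # Widen each byte to int32, multiply pairwise, and accumulate onto src.
    var acc = src
    for i in range(4):
        acc += a[i].cast[DType.int32]() * b[i].cast[DType.int32]()
    return acc

fn main():
    var a = SIMD[DType.uint8, 4](1, 2, 3, 4)
    var b = SIMD[DType.int8, 4](10, -10, 10, -10)
    print(dot_4x_i8_lane(100, a, b))  # 100 + 10 - 20 + 30 - 40 = 80
```

The `saturated` variants below differ in the accepted range of `a` ([0, 127] rather than [0, 255]), per their stated constraints.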
--- ## dot_i8_to_i32_saturated_AVX2 `dot_i8_to_i32_saturated_AVX2[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` The dot product of the four bytes in each int32 element of a and b plus an int32 from src. **Constraints:** Requires AVX2. The size of the output vector must be 4, 8 or 16. The a argument has range \[0, 127], not \[0, 255]. The b argument has range \[-128, 127]. **Parameters:** * ​width (`Int`): Size of the output SIMD vector. * ​a\_type (`DType`): The DType for a. * ​b\_type (`DType`): The DType for b. * ​c\_type (`DType`): The DType for c. **Args:** * ​src (`SIMD[c_type, width]`): An int32 SIMD vector. * ​a (`SIMD[a_type, width]`): A uint8 SIMD vector. * ​b (`SIMD[b_type, width]`): An int8 SIMD vector. **Returns:** A SIMD vector of width elements. --- ## dot_i8_to_i32_saturated_x86 `dot_i8_to_i32_saturated_x86[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` The dot product of the four bytes in each int32 element of a and b plus an int32 from src using VNNI or AVX2. **Constraints:** Requires AVX512\_VNNI or AVX2. The size of the output vector must be 4, 8 or 16. The a argument has range \[0, 127], not \[0, 255]. The b argument has range \[-128, 127]. **Parameters:** * ​width (`Int`): Size of the output SIMD vector. * ​a\_type (`DType`): The DType for a. * ​b\_type (`DType`): The DType for b. * ​c\_type (`DType`): The DType for c. **Args:** * ​src (`SIMD[c_type, width]`): An int32 SIMD vector. * ​a (`SIMD[a_type, width]`): A uint8 SIMD vector. * ​b (`SIMD[b_type, width]`): An int8 SIMD vector. **Returns:** A SIMD vector of width elements. --- ## dot_i8_to_i32_x86 `dot_i8_to_i32_x86[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` The dot product of the four bytes in each int32 element of a and b plus an int32 from src using VNNI or AVX2. **Constraints:** Requires AVX512\_VNNI or AVX2. The size of the output vector must be 4, 8 or 16. The a argument has range \[0, 255]. The b argument has range \[-128, 127]. **Parameters:** * ​width (`Int`): Size of the output SIMD vector. * ​a\_type (`DType`): The DType for a. * ​b\_type (`DType`): The DType for b. * ​c\_type (`DType`): The DType for c. **Args:** * ​src (`SIMD[c_type, width]`): An int32 SIMD vector. * ​a (`SIMD[a_type, width]`): A uint8 SIMD vector. * ​b (`SIMD[b_type, width]`): An int8 SIMD vector. **Returns:** A SIMD vector of width elements. --- ## vnni_intrinsics ## Functions * [​`dot_i16_to_i32_AVX2`](./dot_i16_to_i32_AVX2): The dot product of the two words in each int32 element of a and b plus an int32 from src. * [​`dot_i16_to_i32_x86`](./dot_i16_to_i32_x86): The dot product of the two words in each int32 element of a and b plus an int32 from src using VNNI or AVX2. * [​`dot_i8_to_i32_AVX2`](./dot_i8_to_i32_AVX2): The dot product of the four bytes in each int32 element of a and b plus an int32 from src. * [​`dot_i8_to_i32_saturated_AVX2`](./dot_i8_to_i32_saturated_AVX2): The dot product of the four bytes in each int32 element of a and b plus an int32 from src. * [​`dot_i8_to_i32_saturated_x86`](./dot_i8_to_i32_saturated_x86): The dot product of the four bytes in each int32 element of a and b plus an int32 from src using VNNI or AVX2. 
* [​`dot_i8_to_i32_x86`](./dot_i8_to_i32_x86): The dot product of the four bytes in each int32 element of a and b plus an int32 from src using VNNI or AVX2. * [​`pmaddubs`](./pmaddubs): * [​`pmaddw`](./pmaddw): * [​`vpdpbusd`](./vpdpbusd): * [​`vpdpbusds`](./vpdpbusds): * [​`vpdpwssd`](./vpdpwssd): * [​`vpdpwssds`](./vpdpwssds): --- ## pmaddubs `pmaddubs[width: Int](a: SIMD[int32, width], b: SIMD[int32, width]) -> SIMD[int32, width]` --- ## pmaddw `pmaddw[width: Int](a: SIMD[int32, width], b: SIMD[int32, width]) -> SIMD[int32, width]` --- ## vpdpbusd `vpdpbusd[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` --- ## vpdpbusds `vpdpbusds[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` --- ## vpdpwssd `vpdpwssd[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` --- ## vpdpwssds `vpdpwssds[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` --- ## elu `elu[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]` Compute the Elu Op using the equation $z \text{ if } z \ge 0 \text{ else } \alpha(e^z - 1)$. **Parameters:** * ​type (`DType`): DType used for the computation. * ​simd\_width (`Int`): SIMD width used for the computation. **Args:** * ​x (`SIMD[type, simd_width]`): The value to compute the ELU operation on. **Returns:** The result of the ELU operation. --- ## gelu `gelu[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]` Compute the GELU Op using the equation $0.5 * x * (1 + erf(x / sqrt(2)))$. **Constraints:** Type must be a floating point type. **Parameters:** * ​type (`DType`): DType used for the computation. * ​simd\_width (`Int`): SIMD width used for the computation. **Args:** * ​x (`SIMD[type, simd_width]`): The value to compute the GELU operation on. **Returns:** The result of the GELU operation. --- ## gelu_approximate `gelu_approximate[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]` Compute the approximate GELU Op using the equation $0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))$. **Constraints:** Type must be a floating point type. **Parameters:** * ​type (`DType`): The `DType` used for the computation. * ​simd\_width (`Int`): SIMD width used for the computation. **Args:** * ​x (`SIMD[type, simd_width]`): The value to compute the GELU operation on. **Returns:** The result of the approximate GELU operation. --- ## activations This module contains implementations of activation functions. ## Functions * [​`elu`](./elu): Compute the Elu Op using the equation $z \text{ if } z \ge 0 \text{ else } \alpha(e^z - 1)$. * [​`gelu`](./gelu): Compute the GELU Op using the equation $0.5 * x * (1 + erf(x / sqrt(2)))$. * [​`gelu_approximate`](./gelu_approximate): Compute the approximate GELU Op using the equation $0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))$. * [​`relu`](./relu): Compute the Relu Op using the equation $max(0, x)$. * [​`relu_n1`](./relu_n1): Compute the Relu N1 Op using the equation $max(min(x,1),-1)$. * [​`sign`](./sign): Compute the sign (0, 1) of the input value. 
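As a numeric cross-check of the two GELU formulas above, the exact (erf-based) form and the tanh approximation agree to about three decimal places near x = 1. A standalone sketch in plain Mojo arithmetic, independent of the kernels above:

```mojo
from math import erf, sqrt, tanh

fn gelu_exact(x: Float64) -> Float64:
    # 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + erf(x / sqrt(2.0)))

fn gelu_tanh(x: Float64) -> Float64:
    # 0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))
    var inner = sqrt(2.0 / 3.141592653589793) * (x + 0.044715 * x * x * x)
    return 0.5 * x * (1.0 + tanh(inner))

fn main():
    print(gelu_exact(1.0))  # ~0.84134
    print(gelu_tanh(1.0))   # ~0.84120
```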
--- ## relu `relu[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]` Compute the Relu Op using the equation $max(0, x)$. **Parameters:** * ​type (`DType`): DType used for the computation. * ​simd\_width (`Int`): SIMD width used for the computation. **Args:** * ​x (`SIMD[type, simd_width]`): The value to compute the RELU operation on. **Returns:** The result of the RELU operation. --- ## relu_n1 `relu_n1[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]` Compute the Relu N1 Op using the equation $max(min(x,1),-1)$. **Parameters:** * ​type (`DType`): DType used for the computation. * ​simd\_width (`Int`): SIMD width used for the computation. **Args:** * ​x (`SIMD[type, simd_width]`): The value to compute the RELU N1 operation on. **Returns:** The result of the RELU N1 operation. --- ## sign `sign[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]` Compute the sign (0, 1) of the input value. **Parameters:** * ​type (`DType`): DType used for the computation. * ​simd\_width (`Int`): SIMD width used for the computation. **Args:** * ​x (`SIMD[type, simd_width]`): The value to compute the sign operation on. **Returns:** The result of the sign operation. --- ## arange `arange[type: DType, simd_width: Int](start: SIMD[type, 1], stop: SIMD[type, 1], step: SIMD[type, 1], index: IndexList[1]) -> SIMD[type, simd_width]` --- ## arange_shape `arange_shape[type: DType, single_thread_blocking_override: Bool](start: SIMD[type, 1], stop: SIMD[type, 1], step: SIMD[type, 1]) -> IndexList[1]` --- ## arange ## Functions * [​`arange`](./arange): * [​`arange_shape`](./arange_shape): --- ## arg_nonzero `arg_nonzero[type: DType, output_type: DType](input_buffer: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output_buffer: LayoutTensor[output_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Gather the indices of all non-zero elements in the input buffer, storing the indices in the output\_buffer. **Parameters:** * ​type (`DType`): The element type. * ​output\_type (`DType`): The integer type to store the indices in. **Args:** * ​input\_buffer (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to count the non-zeros in. * ​output\_buffer (`LayoutTensor[output_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The indices of all non-zero elements. --- ## arg_nonzero_shape `arg_nonzero_shape[type: DType, single_thread_blocking_override: Bool](input_buffer: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> IndexList[2]` Return \[NumNonZeros, InputRank] where NumNonZeros is the number of non-zero elements in the input. **Parameters:** * ​type (`DType`): The element type. * ​single\_thread\_blocking\_override (`Bool`): This op can block. 
**Args:** * ​input\_buffer (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to count the non-zeros in. **Returns:** Shape of the arg\_nonzero kernel for this input \[NumNonZeros, InputRank]. --- ## arg_nonzero ## Functions * [​`arg_nonzero`](./arg_nonzero): Gather the indices of all non-zero elements in the input buffer, storing the indices in the output\_buffer. * [​`arg_nonzero_shape`](./arg_nonzero_shape): Return \[NumNonZeros, InputRank] where NumNonZeros is the number of non-zero elements in the input. --- ## argmax `argmax(input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], axis: Int, output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Finds the indices of the maximum element along the specified axis. **Args:** * ​input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * ​axis (`Int`): The axis. * ​output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor. `argmax(input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], axis_buf: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Finds the indices of the maximum element along the specified axis. **Args:** * ​input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * ​axis\_buf (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The axis tensor. * ​output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor. 
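To pin down the semantics: for each position along the non-reduced axes, `argmax` writes the index of the maximum element along the reduced axis into `output`. A scalar sketch over a row-major 2x4 buffer, reducing the innermost axis (illustrative only; the kernel above operates on `LayoutTensor`s):

```mojo
fn main():
    var data = List[Float32](1.0, 9.0, 3.0, 2.0,
                             7.0, 0.0, 7.5, -1.0)
    alias rows = 2
    alias cols = 4
    for r in range(rows):
        var best = 0
        for c in range(1, cols):
            # Keep the index of the largest value seen so far in this row.
            if data[r * cols + c] > data[r * cols + best]:
                best = c
        print("row", r, "argmax:", best)  # row 0 -> 1, row 1 -> 2
```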
--- ## argmin `argmin(input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], axis: Int, output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Finds the indices of the minimum element along the specified axis. **Args:** * ​input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * ​axis (`Int`): The axis. * ​output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor. `argmin(input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], axis_buf: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Finds the indices of the minimum element along the specified axis. **Args:** * ​input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * ​axis\_buf (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The axis tensor. * ​output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor. --- ## argmaxmin ## Functions * [​`argmax`](./argmax): Finds the indices of the maximum element along the specified axis. * [​`argmin`](./argmin): Finds the indices of the minimum element along the specified axis. --- ## argmax_gpu `argmax_gpu[type: DType, output_type: DType, rank: Int](ctx: DeviceContext, input: NDBuffer[type, rank, origin], output: NDBuffer[output_type, rank, origin])` --- ## argmaxmin_gpu `argmaxmin_gpu[type: DType, output_type: DType, rank: Int, largest: Bool](ctx: DeviceContext, input: NDBuffer[type, rank, origin], output: NDBuffer[output_type, rank, origin])` Wraps the Top-K GPU kernel with K=1 to perform argmax on the inner-most dimension. **Parameters:** * ​type (`DType`): DType - The data type of the input tensor. * ​output\_type (`DType`): DType - The data type of the output tensor. * ​rank (`Int`): Int - The rank of the input tensor. * ​largest (`Bool`): Bool - Whether to perform argmax or argmin. 
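`argmaxmin_gpu` folds both reductions into one kernel selected by the `largest` parameter; flipping the comparison is the only difference, as in this scalar sketch (illustrative, not the GPU kernel):

```mojo
fn top1[largest: Bool](xs: List[Float64]) -> Int:
    # Index of the maximum (largest=True) or minimum (largest=False) element.
    var best = 0
    for i in range(1, len(xs)):
        if (xs[i] > xs[best]) if largest else (xs[i] < xs[best]):
            best = i
    return best

fn main():
    var xs = List[Float64](3.0, -1.0, 8.0, 2.0)
    print(top1[True](xs))   # 2 (argmax)
    print(top1[False](xs))  # 1 (argmin)
```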
--- ## argmin_gpu `argmin_gpu[type: DType, output_type: DType, rank: Int](ctx: DeviceContext, input: NDBuffer[type, rank, origin], output: NDBuffer[output_type, rank, origin])` --- ## argmaxmin_gpu ## Functions * [​`argmax_gpu`](./argmax_gpu): * [​`argmaxmin_gpu`](./argmaxmin_gpu): Wraps the Top-K GPU kernel with K=1 to perform argmax on the inner-most dimension. * [​`argmin_gpu`](./argmin_gpu): --- ## argsort `argsort[*, ascending: Bool = True, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ctx: DeviceContext)` Performs argsort on input buffer, storing indices in output buffer. **Parameters:** * ​ascending (`Bool`): Sort direction (True for ascending, False for descending). * ​target (`StringSlice[StaticConstantOrigin]`): Target device ("cpu" or "gpu"). **Args:** * ​output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Buffer to store sorted indices. * ​input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Buffer containing values to sort. * ​ctx (`DeviceContext`): Device context for execution. `argsort[ascending: Bool = True](output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` CPU-only version of argsort. **Parameters:** * ​ascending (`Bool`): Sort direction (True for ascending, False for descending). **Args:** * ​output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Buffer to store sorted indices. * ​input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Buffer containing values to sort. --- ## argsort ## Functions * [​`argsort`](./argsort): Performs argsort on input buffer, storing indices in output buffer. --- ## cpu_bicubic_kernel `cpu_bicubic_kernel[type: DType, rank: Int, //](output_host: NDBuffer[type, rank, origin, shape, strides], input_host: NDBuffer[type, rank, origin, shape, strides])` Perform bicubic interpolation on an NDBuffer of form NCHW. **Args:** * ​output\_host (`NDBuffer[type, rank, origin, shape, strides]`): Output tensor with desired dimensions. * ​input\_host (`NDBuffer[type, rank, origin, shape, strides]`): Input tensor of shape \[B, C, H, W]. 
--- ## cubic_kernel `cubic_kernel(x: SIMD[float32, 1]) -> SIMD[float32, 1]` Cubic interpolation kernel matching PyTorch/torchvision's BICUBIC filter. This uses the Catmull-Rom variant (Robidoux cubic) with a = -0.75, which is what PyTorch uses in get\_cubic\_upsample\_coefficients. ([Source](https://github.com/pytorch/pytorch/blob/59eb61b2d1e4b64debbefa036acd0d8c7d55f0a3/aten/src/ATen/native/UpSample.h#L410-L423)). This also matches OpenCV's [interpolateCubic](https://github.com/opencv/opencv/blob/cf2a3c8e7430cc92569dd7f114609f9377b12d9e/modules/imgproc/src/resize.cpp#L907-L915). **Args:** * ​x (`SIMD[float32, 1]`): Distance from the center point. **Returns:** Weight contribution based on the distance. `cubic_kernel(x: SIMD[dtype, size]) -> SIMD[dtype, size]` Cubic interpolation kernel matching PyTorch/torchvision's BICUBIC filter. This uses the Catmull-Rom variant (Robidoux cubic) with a = -0.75, which is what PyTorch uses in get\_cubic\_upsample\_coefficients. ([Source](https://github.com/pytorch/pytorch/blob/59eb61b2d1e4b64debbefa036acd0d8c7d55f0a3/aten/src/ATen/native/UpSample.h#L410-L423)). This also matches OpenCV's [interpolateCubic](https://github.com/opencv/opencv/blob/cf2a3c8e7430cc92569dd7f114609f9377b12d9e/modules/imgproc/src/resize.cpp#L907-L915). **Args:** * ​x (`SIMD[dtype, size]`): Distance from the center point. **Returns:** Weight contribution based on the distance. --- ## gpu_bicubic_kernel `gpu_bicubic_kernel[type: DType, rank: Int](output: NDBuffer[type, rank, MutableAnyOrigin], input: NDBuffer[type, rank, MutableAnyOrigin])` Perform bicubic interpolation using GPU. **Args:** * ​output (`NDBuffer[type, rank, MutableAnyOrigin]`): Output tensor with desired dimensions on the device. * ​input (`NDBuffer[type, rank, MutableAnyOrigin]`): Input tensor of shape \[B, C, H, W] on the device. --- ## bicubic This module provides CPU and GPU implementations for bicubic interpolation. Bicubic interpolation is a 2D extension of cubic interpolation for resampling digital images. It uses the weighted average of the 4x4 neighborhood of pixels around the target location to compute the interpolated value. ## Functions * [​`cpu_bicubic_kernel`](./cpu_bicubic_kernel): Perform bicubic interpolation on an NDBuffer of form NCHW. * [​`cubic_kernel`](./cubic_kernel): Cubic interpolation kernel matching PyTorch/torchvision's BICUBIC filter. * [​`gpu_bicubic_kernel`](./gpu_bicubic_kernel): Perform bicubic interpolation using GPU. * [​`map_output_to_input_coord`](./map_output_to_input_coord): Map output pixel coordinate to input coordinate using center alignment. * [​`resize_bicubic`](./resize_bicubic): Perform bicubic interpolation. --- ## map_output_to_input_coord `map_output_to_input_coord(output_coord: Int, scale: SIMD[float32, 1]) -> SIMD[float32, 1]` Map output pixel coordinate to input coordinate using center alignment. This implements the standard coordinate mapping for image resizing: input\_coord = (output\_coord + 0.5) \* scale - 0.5. The +0.5 and -0.5 terms ensure pixel centers are aligned properly. **Args:** * ​output\_coord (`Int`): Output pixel coordinate. * ​scale (`SIMD[float32, 1]`): Scale factor (input\_size / output\_size). **Returns:** Corresponding input coordinate as a float. 
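The pieces above compose as follows: each output pixel is mapped back to a fractional input coordinate, and the nearest 4 samples per axis are blended with cubic weights. A standalone sketch of the mapping formula and the a = -0.75 cubic-convolution weight (a restatement under standard assumptions, not the library code):

```mojo
fn map_coord(output_coord: Int, scale: Float32) -> Float32:
    # input_coord = (output_coord + 0.5) * scale - 0.5
    return (Float32(output_coord) + 0.5) * scale - 0.5

fn cubic_weight(x: Float32) -> Float32:
    # Standard cubic convolution kernel with a = -0.75:
    #   (a+2)|x|^3 - (a+3)|x|^2 + 1      for |x| <= 1
    #   a|x|^3 - 5a|x|^2 + 8a|x| - 4a    for 1 < |x| < 2
    #   0                                otherwise
    var a: Float32 = -0.75
    var ax = abs(x)
    if ax <= 1.0:
        return ((a + 2.0) * ax - (a + 3.0)) * ax * ax + 1.0
    elif ax < 2.0:
        return (((ax - 5.0) * ax + 8.0) * ax - 4.0) * a
    return 0.0

fn main():
    # Halving the resolution (scale = 2): output pixel 3 reads around input 6.5.
    print(map_coord(3, 2.0))  # 6.5
    print(cubic_weight(0.0))  # 1.0
    print(cubic_weight(1.0))  # 0.0 (the weight is continuous at the knots)
```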
--- ## resize_bicubic `resize_bicubic[type: DType, rank: Int, //, target: StringSlice[StaticConstantOrigin]](output: NDBuffer[type, rank, origin, shape, strides], input: NDBuffer[type, rank, origin, shape, strides], ctx: DeviceContextPtr)` Perform bicubic interpolation. **Args:** * ​output (`NDBuffer[type, rank, origin, shape, strides]`): Output tensor with desired dimensions on host or device. * ​input (`NDBuffer[type, rank, origin, shape, strides]`): Input tensor of shape \[B, C, H, W] on host or device. * ​ctx (`DeviceContextPtr`): Device context to enqueue GPU kernels on. --- ## broadcast `broadcast[type: DType](output: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` For each axis of `input`, if the dimension is 1, duplicate the data at each index of the corresponding axis in `output`, otherwise copy over the entire axis to the corresponding axis in `output`. **Args:** * ​output (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output buffer. * ​input (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input buffer. --- ## broadcast_impl `broadcast_impl[type: DType](axis: Int, output: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input_prev_axis_stride: Int, output_prev_axis_stride: Int, input_offset: Int, output_offset: Int, rightmost_broadcast_axis: Int)` For each axis of `input` ∈ \[axis, rank), if the dimension is 1, duplicate the data at each index of the corresponding axis in `output`, otherwise copy over the entire axis to the corresponding axis in `output`. **Args:** * ​axis (`Int`): The axis value. * ​output (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output buffer. * ​input (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input buffer. * ​input\_prev\_axis\_stride (`Int`): The stride at axis `axis - 1` for input. * ​output\_prev\_axis\_stride (`Int`): The stride at axis `axis - 1` for output. * ​input\_offset (`Int`): The offset at which we start copying data from. * ​output\_offset (`Int`): The offset at which we start copying data to. * ​rightmost\_broadcast\_axis (`Int`): The largest axis at which we need to duplicate `input` data. --- ## broadcast ## Functions * [​`broadcast`](./broadcast): For each axis of `input`, if the dimension is 1, duplicate the data at each index of the corresponding axis in `output`, otherwise copy over the entire axis to the corresponding axis in `output`. 
* [​`broadcast_impl`](./broadcast_impl): For each axis of `input` ∈ \[axis, rank), if the dimension is 1, duplicate the data at each index of the corresponding axis in `output`, otherwise copy over the entire axis to the corresponding axis in `output`. --- ## concat `concat[rank: Int, type: DType, single_thread_blocking_override: Bool, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), epilogue_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None] = OptionalReg[fn[DType, Int, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]({:i1 0, 1})](output: NDBuffer[type, rank, origin], axis: Int, inputs: StaticTuple[NDBuffer[type, rank, MutableAnyOrigin], size], context: DeviceContextPtr = DeviceContextPtr())` --- ## concat_shape `concat_shape[input_rank: Int, input_type: DType, single_thread_blocking_override: Bool](input_bufs: List[NDBuffer[input_type, input_rank, MutableAnyOrigin]], axis: Int) -> IndexList[input_rank]` Compute the output shape of a `concat` operation, and assert the inputs are compatible. **Parameters:** * ​input\_rank (`Int`): Input\_rank of the input tensor. * ​input\_type (`DType`): Type of the input tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input\_bufs (`List[NDBuffer[input_type, input_rank, MutableAnyOrigin]]`): The input tensors list. * ​axis (`Int`): The axis. **Returns:** The output shape. --- ## fused_concat `fused_concat[type: DType, rank: Int, single_thread_blocking_override: Bool, input_fn: fn[Int, Int, Int](IndexList[$2]) capturing -> SIMD[type, $1], output_0_fn: fn[DType, Int, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](axis: Int, input_shapes: StaticTuple[IndexList[rank], size], output: NDBuffer[type, rank, origin], ctx: DeviceContextPtr)` --- ## concat ## Aliases ### `elementwise_epilogue_type` `alias elementwise_epilogue_type = fn[DType, Int, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None` ## Functions * [​`concat`](./concat): * [​`concat_shape`](./concat_shape): Compute the output shape of a `concat` operation, and assert the inputs are compatible. * [​`fused_concat`](./fused_concat): * [​`memcpy_or_fuse`](./memcpy_or_fuse): --- ## memcpy_or_fuse `memcpy_or_fuse[rank: Int, type: DType, epilogue_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]](dest_data: UnsafePointer[SIMD[int8, 1]], out_byte_offset: Int, src_data: UnsafePointer[SIMD[int8, 1]], n: Int, out_shape: IndexList[rank, element_type=element_type])` --- ## ConvDirectNHWC `struct ConvDirectNHWC[input_mut: Bool, filter_mut: Bool, //, input_rank: Int, filter_rank: Int, output_rank: Int, input_origin: Origin[input_mut], filter_origin: Origin[filter_mut], output_origin: MutableOrigin, input_shape: DimList, filter_shape: DimList, output_shape: DimList, input_type: DType, filter_type: DType, output_type: DType, filter_packed: Bool, conv_attr: ConvInfoStatic[(input_rank + -2)], elementwise_epilogue: OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None] = OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None]({:i1 0, 1})]` Implement the outer loops for direct convolution. Collapse N, HO, WO into one dimension n\_ho\_wo. Tile n\_ho\_wo, C, and F. The tile factors for C and F are chosen by a heuristic prioritizing C. 
n\_ho\_wo is tiled by the micro kernel's height. If n\_ho\_wo is large enough to spill the LLC, we may need to tile n\_ho\_wo as the outermost loop with a factor that fits in the LLC. Assume F is divisible at least by simd\_size. ## Fields * ​output (`NDBuffer[output_type, output_rank, output_origin, output_shape]`): * ​input (`NDBuffer[input_type, input_rank, input_origin, input_shape]`): * ​filter (`NDBuffer[filter_type, filter_rank, filter_origin, filter_shape]`): * ​conv\_shape (`ConvShape[(input_rank + -2)]`): * ​partition (`ConvPartition`): * ​cf\_tile\_size (`IndexList[2]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `packed_and_fully_static` `alias packed_and_fully_static = filter_packed if filter_shape.all_known[::Int]() if output_shape.all_known[::Int,::Int]() if input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else output_shape.all_known[::Int,::Int]() if input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else filter_shape.all_known[::Int]() if output_shape.all_known[::Int,::Int]() if input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else output_shape.all_known[::Int,::Int]() if input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known()` ## Methods ### `run` `static run(output: NDBuffer[output_type, output_rank, output_origin, output_shape], input: NDBuffer[input_type, input_rank, input_origin, input_shape], filter: NDBuffer[filter_type, filter_rank, filter_origin, filter_shape], conv_shape: ConvShape[(input_rank + -2)])` ### `is_new_c_accum` `is_new_c_accum(self, c_idx: Int) -> Bool` ### `update_output_tile_no_padding` `update_output_tile_no_padding[micro_kernel_height: Int, micro_kernel_width: Int, c_fully_cached: Bool, has_residual: Bool, last_c_tile: Bool](self, n: Int, f_tile_offset: Int, f_tile_size: Int, c_tile_offset: Int, c_tile_size: Int, output_flat_coord: Int)` ### `output_space_flat_loop` `output_space_flat_loop[micro_kernel_f_size: Int, has_residual: Bool, last_c_tile: Bool](self, n: Int, f_tile_offset: Int, f_tile_size: Int, c_tile_offset: Int, c_tile_size: Int)` ### `output_space_loop` `output_space_loop[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool](self, n: Int, f_tile_offset: Int, f_tile_size: Int, c_tile_offset: Int, c_tile_size: Int)` ### `output_space_loop_1d` `output_space_loop_1d[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](self, output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], n: Int, first_c_tile_in_group: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, left_pad_impact_end: Int, right_pad_impact_start: Int)` ### `output_space_loop_2d` `output_space_loop_2d[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](self, output: UnsafePointer[SIMD[output_dt, 1]], input: 
UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], n: Int, first_c_tile_in_group: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, left_pad_impact_end: Int, right_pad_impact_start: Int)` ### `output_space_loop_3d` `output_space_loop_3d[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](self, output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], n: Int, first_c_tile_in_group: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, left_pad_impact_end: Int, right_pad_impact_start: Int)` --- ## CuDNNConvMeta `@register_passable` `struct CuDNNConvMeta` ## Fields * ​ptr\_handle (`UnsafePointer[UnsafePointer[NoneType]]`): * ​ptr\_input\_desc (`UnsafePointer[UnsafePointer[NoneType]]`): * ​ptr\_filter\_desc (`UnsafePointer[UnsafePointer[NoneType]]`): * ​ptr\_conv\_desc (`UnsafePointer[UnsafePointer[NoneType]]`): * ​ptr\_output\_desc (`UnsafePointer[UnsafePointer[NoneType]]`): ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` ### `__del__` `__del__(owned self)` --- ## Naive2dConvolution `struct Naive2dConvolution[output_type: DType, input_type: DType, filter_type: DType]` Struct wrapper for naive 2d convolution implementation. ## Fields * ​output (`UnsafePointer[SIMD[output_type, 1]]`): * ​input (`UnsafePointer[SIMD[input_type, 1]]`): * ​filter (`UnsafePointer[SIMD[filter_type, 1]]`): * ​pad\_d (`IndexList[2]`): * ​pad\_h (`IndexList[2]`): * ​pad\_w (`IndexList[2]`): * ​stride (`IndexList[3]`): * ​dilation (`IndexList[3]`): * ​num\_groups (`Int`): * ​output\_shape (`IndexList[5]`): * ​input\_shape (`IndexList[5]`): * ​filter\_shape (`IndexList[5]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, output: UnsafePointer[SIMD[output_type, 1]], input: UnsafePointer[SIMD[input_type, 1]], filter: UnsafePointer[SIMD[filter_type, 1]], output_shape: IndexList[5], input_shape: IndexList[5], filter_shape: IndexList[5], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2], stride: IndexList[3], dilation: IndexList[3], num_groups: Int)` ### `run` `static run(output: UnsafePointer[SIMD[output_type, 1]], input: UnsafePointer[SIMD[input_type, 1]], filter: UnsafePointer[SIMD[filter_type, 1]], output_shape: IndexList[5], input_shape: IndexList[5], filter_shape: IndexList[5], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2], stride: IndexList[3], dilation: IndexList[3], num_groups: Int)` --- ## accumulate_wo_tile_1d `accumulate_wo_tile_1d[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, partial_load_filter: Bool, effected_by_padding: Bool, input_dt: DType, filter_dt: DType](c_tile_size: Int, S: Int, mut acc: _Accumulator[type, num_rows, num_cols, simd_width, row_start, row_stop], input: UnsafePointer[SIMD[input_dt, 1]], input_stride: Int, input_stride_to_nbr: Int, filter: UnsafePointer[SIMD[filter_dt, 1]], filter_stride: Int, filter_stride_to_nbr: Int, partial_load_filter_size: Int, w: Int, W: Int, dilation: Int)` Update one row in the output for a given (c, f) tile. **Parameters:** * ​micro\_kernel\_height (`Int`): Number of input points in register tiling. * ​micro\_kernel\_width (`Int`): Number of SIMD registers assigned to F. * ​simd\_size (`Int`): Number of elements in a SIMD register. 
* ​partial\_load\_filter (`Bool`): Whether to use partial load for the filter. * ​effected\_by\_padding (`Bool`): Whether the tile is affected by padding. * ​input\_dt (`DType`): DType of input. * ​filter\_dt (`DType`): DType of filter. **Args:** * ​c\_tile\_size (`Int`): Tile size in input channel. * ​S (`Int`): Filter window width. * ​acc (`_Accumulator[type, num_rows, num_cols, simd_width, row_start, row_stop]`): Pointer to register tile accumulator. * ​input (`UnsafePointer[SIMD[input_dt, 1]]`): Pointer to the first input point in WO tile. * ​input\_stride (`Int`): Stride between two input points, i.e., C w/ NHWC layout. * ​input\_stride\_to\_nbr (`Int`): Stride between an input point and its neighbor. * ​filter (`UnsafePointer[SIMD[filter_dt, 1]]`): Pointer to the first coef in the filter window. * ​filter\_stride (`Int`): Stride between two segments of size `micro_kernel_width * simd_size`. * ​filter\_stride\_to\_nbr (`Int`): Stride between two neighbor coefs, i.e., CF w/ RSCF layout. * ​partial\_load\_filter\_size (`Int`): Size of partial load for filter. * ​w (`Int`): Coordinate in an input row. * ​W (`Int`): Input width. * ​dilation (`Int`): Convolution dilation. --- ## accumulate_wo_tile_2d `accumulate_wo_tile_2d[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, partial_load_filter: Bool, effected_by_padding: Bool, input_dt: DType, filter_dt: DType](c_tile_size: Int, RS: IndexList[2], mut acc: _Accumulator[type, num_rows, num_cols, simd_width, row_start, row_stop], input: UnsafePointer[SIMD[input_dt, 1]], input_stride: Int, input_stride_to_nbr: IndexList[2], filter: UnsafePointer[SIMD[filter_dt, 1]], filter_stride: Int, filter_stride_to_nbr: IndexList[2], partial_load_filter_size: Int, hw: IndexList[2], HW: IndexList[2], dilation: IndexList[2])` --- ## accumulate_wo_tile_3d `accumulate_wo_tile_3d[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, partial_load_filter: Bool, effected_by_padding: Bool, input_dt: DType, filter_dt: DType](c_tile_size: Int, QRS: IndexList[3], mut acc: _Accumulator[type, num_rows, num_cols, simd_width, row_start, row_stop], input: UnsafePointer[SIMD[input_dt, 1]], input_stride: Int, input_stride_to_nbr: IndexList[3], filter: UnsafePointer[SIMD[filter_dt, 1]], filter_stride: Int, filter_stride_to_nbr: IndexList[3], partial_load_filter_size: Int, dhw: IndexList[3], DHW: IndexList[3], dilation: IndexList[3])` --- ## check_cudnn_error `check_cudnn_error(stat: cudnnStatus_t)` --- ## conv1d_update_wo_tile `conv1d_update_wo_tile[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, filter_packed: Bool, effected_by_padding: Bool, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType, elementwise_epilogue: OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None] = OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None]({:i1 0, 1})](output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], first_c_tile: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, conv_shape: ConvShape[rank], n: Int, wo: Int)` --- ## conv2d_gpu_naive_nhwc_rscf `conv2d_gpu_naive_nhwc_rscf[input_dim: DimList, filter_dim: DimList, output_dim: DimList, input_type: DType, filter_type: DType, output_type: DType, block_size: Int, maybe_epilogue_func: OptionalReg[fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]](input: NDBuffer[input_type, 4, MutableAnyOrigin, input_dim], filter: 
NDBuffer[filter_type, 4, MutableAnyOrigin, filter_dim], output: NDBuffer[output_type, 4, MutableAnyOrigin, output_dim], stride: IndexList[2], dilation: IndexList[2], padding: IndexList[2])` --- ## conv2d_update_wo_tile `conv2d_update_wo_tile[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, filter_packed: Bool, effected_by_padding: Bool, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType, elementwise_epilogue: OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None] = OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None]({:i1 0, 1})](output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], first_c_tile: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, conv_shape: ConvShape[2], n: Int, howo: IndexList[2])` --- ## conv3d_gpu_naive_ndhwc_qrscf `conv3d_gpu_naive_ndhwc_qrscf[input_dim: DimList, filter_dim: DimList, output_dim: DimList, input_type: DType, filter_type: DType, output_type: DType, block_size: Int, maybe_epilogue_func: OptionalReg[fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]](input: NDBuffer[input_type, 5, MutableAnyOrigin, input_dim], filter: NDBuffer[filter_type, 5, MutableAnyOrigin, filter_dim], output: NDBuffer[output_type, 5, MutableAnyOrigin, output_dim], stride: IndexList[3], dilation: IndexList[3], padding: IndexList[3])` --- ## conv3d_update_wo_tile `conv3d_update_wo_tile[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, filter_packed: Bool, effected_by_padding: Bool, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType, elementwise_epilogue: OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None] = OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None]({:i1 0, 1})](output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], first_c_tile: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, conv_shape: ConvShape[3], n: Int, dohowo: IndexList[3])` --- ## conv_cudnn `conv_cudnn[input_type: DType, filter_type: DType, output_type: DType](input: NDBuffer[input_type, 4, MutableAnyOrigin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], filter: NDBuffer[filter_type, 4, MutableAnyOrigin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], output: NDBuffer[output_type, 4, MutableAnyOrigin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], stride: IndexList[2], dilation: IndexList[2], padding: IndexList[2], num_groups: Int, ctx: DeviceContext)` --- ## conv_gpu `conv_gpu[input_rank: Int, filter_rank: Int, input_dim: DimList, filter_dim: DimList, output_dim: DimList, input_type: DType, filter_type: DType, output_type: DType, maybe_epilogue_func: OptionalReg[fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]({:i1 0, 1}), filter_is_fcrs: Bool = False](input: NDBuffer[input_type, input_rank, MutableAnyOrigin, input_dim], filter: NDBuffer[filter_type, filter_rank, MutableAnyOrigin, filter_dim], output: NDBuffer[output_type, input_rank, MutableAnyOrigin, output_dim], stride: IndexList[(input_rank + -2)], dilation: IndexList[(input_rank + -2)], padding: IndexList[(input_rank + -2)], num_groups: Int, ctx: 
DeviceContext)` --- ## conv_nhwc_direct `conv_nhwc_direct[input_rank: Int, filter_rank: Int, input_shape: DimList, filter_shape: DimList, output_shape: DimList, input_type: DType, filter_type: DType, output_type: DType, filter_packed: Bool, conv_info_static: ConvInfoStatic[(input_rank + -2)], lambdas_have_fusion: Bool, elementwise_lambda: fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None](input: NDBuffer[input_type, input_rank, origin, input_shape], filter: NDBuffer[filter_type, filter_rank, origin, filter_shape], output: NDBuffer[output_type, input_rank, origin, output_shape], stride: IndexList[(input_rank + -2)], dilation: IndexList[(input_rank + -2)], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2], num_groups: Int)` --- ## conv_shape `conv_shape[input_rank: Int, filter_rank: Int, input_type: DType, filter_type: DType, strides_type: DType, dilations_type: DType, paddings_type: DType, single_thread_blocking_override: Bool](input_buf: NDBuffer[input_type, input_rank, origin], filter_buf: NDBuffer[filter_type, filter_rank, origin], strides_buf: NDBuffer[strides_type, 1, origin], dilations_buf: NDBuffer[dilations_type, 1, origin], paddings_buf: NDBuffer[paddings_type, 1, origin], num_groups_scalar: SIMD[dtype, 1]) -> IndexList[input_rank]` Compute the output shape of a `conv` operation, and assert the inputs are compatible. **Parameters:** * ​input\_rank (`Int`): Rank of the input tensor. * ​filter\_rank (`Int`): Rank of the filter tensor. * ​input\_type (`DType`): Type of the input tensor. * ​filter\_type (`DType`): Type of the filter tensor. * ​strides\_type (`DType`): Type of the strides tensor. * ​dilations\_type (`DType`): Type of the dilations tensor. * ​paddings\_type (`DType`): Type of the paddings tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input\_buf (`NDBuffer[input_type, input_rank, origin]`): The input tensor. * ​filter\_buf (`NDBuffer[filter_type, filter_rank, origin]`): The filter tensor. * ​strides\_buf (`NDBuffer[strides_type, 1, origin]`): The strides tensor. * ​dilations\_buf (`NDBuffer[dilations_type, 1, origin]`): The dilations tensor. * ​paddings\_buf (`NDBuffer[paddings_type, 1, origin]`): The paddings tensor. * ​num\_groups\_scalar (`SIMD[dtype, 1]`): The num\_groups scalar. **Returns:** The output shape.
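The per-dimension size computed here follows the standard convolution shape rule. The helper below is a minimal sketch of that rule only (an illustrative helper, not the library routine). For example:

```mojo
# Minimal sketch of the standard convolution output-size rule (illustrative
# helper, not part of this API). For each spatial dimension:
#   out = (in + pad_begin + pad_end - dilation * (window - 1) - 1) // stride + 1
fn conv_out_dim(in_dim: Int, window: Int, stride: Int, dilation: Int, pad_begin: Int, pad_end: Int) -> Int:
    # Dilation inflates the effective filter window.
    var effective_window = dilation * (window - 1) + 1
    return (in_dim + pad_begin + pad_end - effective_window) // stride + 1

fn main():
    # A 224-wide input, 3-wide window, stride 2, dilation 1, padding 1 + 1.
    print(conv_out_dim(224, 3, 2, 1, 1, 1))  # prints 112
```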
--- ## get_cudnn_dtype `get_cudnn_dtype[dtype: DType]() -> cudnnDataType_t` Map Mojo DType to cuDNN data type. Supports only floating-point dtypes for now. --- ## conv ## Structs * [​`ConvDirectNHWC`](./ConvDirectNHWC): Implement the outer loops for direct convolution. Collapse N, HO, WO into one dimension n\_ho\_wo. Tile n\_ho\_wo, C, and F. The tile factors for C and F are chosen by a heuristic prioritizing C. n\_ho\_wo is tiled by the micro kernel's height. * [​`CuDNNConvMeta`](./CuDNNConvMeta): * [​`Naive2dConvolution`](./Naive2dConvolution): Struct wrapper for naive 2d convolution implementation. ## Functions * [​`accumulate_wo_tile_1d`](./accumulate_wo_tile_1d): Update one row in the output for a given (c, f) tile. * [​`accumulate_wo_tile_2d`](./accumulate_wo_tile_2d): * [​`accumulate_wo_tile_3d`](./accumulate_wo_tile_3d): * [​`check_cudnn_error`](./check_cudnn_error): * [​`conv1d_update_wo_tile`](./conv1d_update_wo_tile): * [​`conv2d_gpu_naive_nhwc_rscf`](./conv2d_gpu_naive_nhwc_rscf): * [​`conv2d_update_wo_tile`](./conv2d_update_wo_tile): * [​`conv3d_gpu_naive_ndhwc_qrscf`](./conv3d_gpu_naive_ndhwc_qrscf): * [​`conv3d_update_wo_tile`](./conv3d_update_wo_tile): * [​`conv_cudnn`](./conv_cudnn): * [​`conv_gpu`](./conv_gpu): * [​`conv_nhwc_direct`](./conv_nhwc_direct): * [​`conv_shape`](./conv_shape): Compute the output shape of a `conv` operation, and assert the inputs are compatible. * [​`get_cudnn_dtype`](./get_cudnn_dtype): Map Mojo DType to cuDNN data type. * [​`pack_conv_filter_shape`](./pack_conv_filter_shape): Compute the output shape of convolution filter packing. * [​`pack_filter`](./pack_filter): This packs the filter from RSCF to FRSCf. Use the default micro kernel size for dynamic shapes. * [​`pack_filter_shape`](./pack_filter_shape): Compute the shape of packed filter. The packed layout is FRSCf. shape\_ref should be allocated with size 5 outside this kernel. * [​`pack_filter_shape_impl`](./pack_filter_shape_impl): Compute the shape of packed filter. The packed layout is FRSCf. shape\_ref should be allocated with size 5 outside this kernel. --- ## pack_conv_filter_shape `pack_conv_filter_shape[single_thread_blocking_override: Bool](filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], num_groups: Int) -> IndexList[(rank + 1)]` Compute the output shape of convolution filter packing. **Parameters:** * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​filter (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The filter to be packed. * ​num\_groups (`Int`): The number of groups in the convolution. **Returns:** The output shape. --- ## pack_filter `pack_filter(filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], packed_filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], num_groups: Int)` This packs the filter from RSCF to FRSCf. Use the default micro kernel size for dynamic shapes. `pack_filter[simd_size: Int, micro_kernel_f_size: Int](filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], packed_filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], num_groups: Int)` This packs the filter from RSCF to FRSCf. F is first broken down to segments of size micro\_kernel\_f\_size, then the remainder is further divided by simd\_size. The last residual elements, if any, are padded with zeros to fill simd\_size. **Parameters:** * ​simd\_size (`Int`): Can differ from the simd size of the input type. * ​micro\_kernel\_f\_size (`Int`): The size of the last dimension in FRSCf, which equals the size of the micro kernel's F dimension. **Args:** * ​filter (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): Filter in RSCF layout (if 2D). 
* ​packed\_filter (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): Packed filter in FRSCf layout (if 2D). F - the index of continuous segments in micro kernel. R, S, C - original R, S, C. f - the index within a continuous segment. * ​num\_groups (`Int`): The number of groups in the convolution.
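The segmentation described above fixes the packed F extent. A minimal sketch of that arithmetic (an illustrative helper, not the library routine):

```mojo
# Illustrative sketch of how F is segmented for the FRSCf packed layout:
# whole micro-kernel-F segments first, then simd_size chunks over the
# remainder, zero-padding the final chunk.
fn packed_f_extent(F: Int, micro_kernel_f_size: Int, simd_size: Int) -> Int:
    var num_full = F // micro_kernel_f_size
    var packed = num_full * micro_kernel_f_size
    var residual = F - packed
    # Round the residual up to a whole number of simd_size chunks.
    var num_simd_chunks = (residual + simd_size - 1) // simd_size
    return packed + num_simd_chunks * simd_size

fn main():
    # F = 70 with micro kernel F = 16 and simd_size = 4: 4 full segments (64)
    # plus ceil(6 / 4) = 2 simd chunks (8), so the packed extent is 72.
    print(packed_f_extent(70, 16, 4))  # prints 72
```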
--- ## pack_filter_shape `pack_filter_shape[filter_type: DType, input_shape: DimList, filter_shape: DimList, output_shape: DimList, strides: DimList, dilations: DimList, paddings: DimList, num_groups: Int, single_thread_blocking_override: Bool](filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> IndexList[(rank + 1)]` Compute the shape of packed filter. The packed layout is FRSCf. shape\_ref should be allocated with size 5 outside this kernel. **Returns:** The output shape. --- ## pack_filter_shape_impl `pack_filter_shape_impl[filter_type: DType](Q: Int, R: Int, S: Int, C: Int, F: Int, num_groups: Int) -> IndexList[6]` Compute the shape of packed filter. The packed layout is FRSCf. shape\_ref should be allocated with size 5 outside this kernel. **Args:** * ​Q (`Int`): Original Q filter dimension. * ​R (`Int`): Original R filter dimension. * ​S (`Int`): Original S filter dimension. * ​C (`Int`): Original C filter dimension. * ​F (`Int`): Original F filter dimension. * ​num\_groups (`Int`): Number of groups in the convolution. **Returns:** The output shape. --- ## ConvTransposedPacked `struct ConvTransposedPacked[input_mut: Bool, filter_mut: Bool, //, input_rank: Int, filter_rank: Int, output_rank: Int, input_origin: Origin[input_mut], filter_origin: Origin[filter_mut], output_origin: MutableOrigin, input_shape: DimList, filter_shape: DimList, output_shape: DimList, input_type: DType, filter_type: DType, output_type: DType, conv_attr: ConvInfoStatic[(input_rank + -2)], elementwise_epilogue: OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None] = OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None]({:i1 0, 1})]` ## Fields * ​output (`NDBuffer[output_type, output_rank, output_origin, output_shape]`): * ​input (`NDBuffer[input_type, input_rank, input_origin, input_shape]`): * ​filter (`NDBuffer[filter_type, filter_rank, filter_origin, filter_shape]`): * ​conv\_shape (`ConvShape[(input_rank + -2)]`): * ​partition (`ConvPartition`): * ​cf\_tile\_size (`IndexList[2]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `run` `static run(output: NDBuffer[output_type, output_rank, output_origin, output_shape], input: NDBuffer[input_type, input_rank, input_origin, input_shape], filter: NDBuffer[filter_type, filter_rank, filter_origin, filter_shape], conv_shape: ConvShape[(input_rank + -2)])` ### `input_space_loop` `input_space_loop[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool](self, n: Int, f_tile_offset: Int, f_tile_size: Int, c_tile_offset: Int, c_tile_size: Int)` ### `input_space_loop_2d` `input_space_loop_2d[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](self, output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], n: Int, first_c_tile_in_group: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, left_pad_impact_end: Int, right_pad_impact_start: Int)` ### `input_space_loop_3d` `input_space_loop_3d[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](self, output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], n: Int, first_c_tile_in_group: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, left_pad_impact_end: Int, right_pad_impact_start: Int)` ### `apply_epilogue` `apply_epilogue(self, n: Int, g: Int)` --- ## accumulate_wo_tile `accumulate_wo_tile[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, partial_load: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](c_tile_size: Int, output: UnsafePointer[SIMD[output_dt, 1]], output_stride: Int, input: UnsafePointer[SIMD[input_dt, 1]], input_stride: Int, filter: UnsafePointer[SIMD[filter_dt, 1]], filter_stride: Int, partial_load_size: Int)` --- ## conv_transpose_naive `conv_transpose_naive[type: DType](output: NDBuffer[type, 5, MutableAnyOrigin], input: NDBuffer[type, 5, MutableAnyOrigin], filter: NDBuffer[type, 5, MutableAnyOrigin], stride: IndexList[3], dilation: IndexList[3], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2])` Implements the ConvTranspose operator from the MO spec. **Parameters:** * ​type (`DType`): Type of the input, output, and kernel tensors. **Args:** * ​output (`NDBuffer[type, 5, MutableAnyOrigin]`): Output data tensor that contains the result of the convolution. * ​input (`NDBuffer[type, 5, MutableAnyOrigin]`): Input data tensor from previous layer, with size of (N x H x W x C), where N is the batch size, C is the number of channels, and H and W are the height and width. * ​filter (`NDBuffer[type, 5, MutableAnyOrigin]`): The weight (kernel) tensor, with size of (kH x kW x M/groups x C), where C is the number of channels, kH and kW are the height and width of the kernel, and M is the number of feature maps. * ​stride (`IndexList[3]`): Stride along each spatial axis. * ​dilation (`IndexList[3]`): Dilation value along each spatial axis of the filter. * ​pad\_d (`IndexList[2]`): Padding in depth dimension. * ​pad\_h (`IndexList[2]`): Padding in height dimension. * ​pad\_w (`IndexList[2]`): Padding in width dimension. --- ## conv_transpose_shape `conv_transpose_shape[input_rank: Int, kernel_rank: Int, type: DType, strides_type: DType, dilations_type: DType, pads_type: DType, output_pads_type: DType, single_thread_blocking_override: Bool](input: NDBuffer[type, input_rank, origin], kernel: NDBuffer[type, kernel_rank, origin], strides: NDBuffer[strides_type, 1, origin], dilations: NDBuffer[dilations_type, 1, origin], pads: NDBuffer[pads_type, 1, origin], output_pads: NDBuffer[output_pads_type, 1, origin]) -> IndexList[input_rank]` Compute the output shape of a `conv-transpose` operation, and assert the inputs are compatible. **Parameters:** * ​input\_rank (`Int`): Rank of the input tensor. * ​kernel\_rank (`Int`): Rank of the kernel tensor. * ​type (`DType`): Element type of the input and kernel tensor. * ​strides\_type (`DType`): Element type of the strides tensor. * ​dilations\_type (`DType`): Element type of the dilations tensor. * ​pads\_type (`DType`): Element type of the pads tensor. * ​output\_pads\_type (`DType`): Element type of the output\_pads tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input (`NDBuffer[type, input_rank, origin]`): The input tensor. 
* ​kernel (`NDBuffer[type, kernel_rank, origin]`): The kernel tensor. * ​strides (`NDBuffer[strides_type, 1, origin]`): The strides tensor. * ​dilations (`NDBuffer[dilations_type, 1, origin]`): The dilations tensor. * ​pads (`NDBuffer[pads_type, 1, origin]`): The paddings tensor. * ​output\_pads (`NDBuffer[output_pads_type, 1, origin]`): The output paddings tensor. **Returns:** The output shape.
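The per-dimension size follows the standard transposed-convolution shape rule (as in the ONNX ConvTranspose spec). A minimal sketch of that rule (an illustrative helper, not the library routine):

```mojo
# Minimal sketch of the standard transposed-convolution output-size rule
# (illustrative helper, not part of this API). For each spatial dimension:
#   out = stride * (in - 1) + dilation * (window - 1) + 1 + output_pad
#         - pad_begin - pad_end
fn conv_transpose_out_dim(in_dim: Int, window: Int, stride: Int, dilation: Int, pad_begin: Int, pad_end: Int, output_pad: Int) -> Int:
    return stride * (in_dim - 1) + dilation * (window - 1) + 1 + output_pad - pad_begin - pad_end

fn main():
    # Upsampling a 56-wide input with a 4-wide window, stride 2, padding 1 + 1.
    print(conv_transpose_out_dim(56, 4, 2, 1, 1, 1, 0))  # prints 112
```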
--- ## conv_transposed_cpu `conv_transposed_cpu[input_rank: Int, filter_rank: Int, input_shape: DimList, filter_shape: DimList, output_shape: DimList, input_type: DType, filter_type: DType, output_type: DType, filter_packed: Bool, filter_is_cfrs: Bool, lambdas_have_fusion: Bool, elementwise_lambda: fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None](output: NDBuffer[output_type, input_rank, origin, output_shape], input: NDBuffer[input_type, input_rank, origin, input_shape], filter: NDBuffer[filter_type, filter_rank, origin, filter_shape], stride: IndexList[(input_rank + -2)], dilation: IndexList[(input_rank + -2)], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2])` --- ## conv_transposed_cudnn `conv_transposed_cudnn[input_type: DType, filter_type: DType, output_type: DType](input: NDBuffer[input_type, 4, MutableAnyOrigin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], filter: NDBuffer[filter_type, 4, MutableAnyOrigin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], output: NDBuffer[output_type, 4, MutableAnyOrigin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], stride: IndexList[2], dilation: IndexList[2], padding: IndexList[2], ctx: DeviceContext)` --- ## conv_transposed_gpu `conv_transposed_gpu[input_rank: Int, filter_rank: Int, input_shape: DimList, filter_shape: DimList, output_shape: DimList, input_type: DType, filter_type: DType, output_type: DType, elementwise_epilogue: OptionalReg[fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]({:i1 0, 1})](output: NDBuffer[output_type, input_rank, origin, output_shape], input: NDBuffer[input_type, input_rank, origin, input_shape], filter: NDBuffer[filter_type, filter_rank, origin, filter_shape], stride: IndexList[(input_rank + -2)], dilation: IndexList[(input_rank + -2)], padding: IndexList[(input_rank + -2)], ctx: DeviceContext)` --- ## get_num_partitions `get_num_partitions[micro_kernel_height: Int, micro_kernel_f_size: Int](num_threads: Int, conv_shape: ConvShape[rank]) -> IndexList[4]` Partition the workload in (batch\&group, C, F, H) dimensions. HOWO is the combination of HO and WO dimensions. The actual number of tasks is the product of the returned num\_partitions. --- ## get_partition `get_partition(task_id: Int, num_partitions: IndexList[4], conv_shape: ConvShape[rank], micro_kernel_height: Int, micro_kernel_f_size: Int) -> ConvPartition` --- ## conv_transpose ## Structs * [​`ConvTransposedPacked`](./ConvTransposedPacked): ## Functions * [​`accumulate_wo_tile`](./accumulate_wo_tile): * [​`conv_transpose_naive`](./conv_transpose_naive): Implements the ConvTranspose operator from the MO spec. * [​`conv_transpose_shape`](./conv_transpose_shape): Compute the output shape of a `conv-transpose` operation, and assert the inputs are compatible. * [​`conv_transposed_cpu`](./conv_transposed_cpu): * [​`conv_transposed_cudnn`](./conv_transposed_cudnn): * [​`conv_transposed_gpu`](./conv_transposed_gpu): * [​`get_num_partitions`](./get_num_partitions): Partition the workload in (batch\&group, C, F, H) dimensions. HOWO is the combination of HO and WO dimensions. The actual number of tasks is the product of the returned num\_partitions. * [​`get_partition`](./get_partition): * [​`pack_filter`](./pack_filter): This packs the filter from RSFC to FRSCf. * [​`pack_filter_shape`](./pack_filter_shape): Compute the output shape of transposed convolution filter packing. * [​`update_w_tile_2d`](./update_w_tile_2d): * [​`update_w_tile_3d`](./update_w_tile_3d): --- ## pack_filter `pack_filter(filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], packed_filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], num_groups: Int)` This packs the filter from RSFC to FRSCf. --- ## pack_filter_shape `pack_filter_shape(filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], num_groups: Int) -> IndexList[(rank + 1)]` Compute the output shape of transposed convolution filter packing. **Args:** * ​filter (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The filter to be packed. * ​num\_groups (`Int`): The number of groups in the convolution. **Returns:** The output shape. --- ## update_w_tile_2d `update_w_tile_2d[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, effected_by_padding: Bool, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], _init_output: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, conv_shape: ConvShape[2], n: Int, hw: IndexList[2])` --- ## update_w_tile_3d `update_w_tile_3d[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, effected_by_padding: Bool, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], _init_output: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, conv_shape: ConvShape[3], n: Int, hw: IndexList[3])` --- ## ConvAlgorithm `@register_passable(trivial)` `struct ConvAlgorithm` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `Default` `alias Default = ConvAlgorithm(0)` ### `Direct` `alias Direct = ConvAlgorithm(2)` ### `Im2Col` `alias Im2Col = ConvAlgorithm(1)` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` --- ## ConvInfoStatic `struct ConvInfoStatic[rank: Int]` ## Fields * ​pad (`DimList`): * ​stride (`DimList`): * ​dilation (`DimList`): * ​num\_groups (`Dim`): ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` `__init__(out self, pad: DimList, stride: DimList, dilation: DimList, num_groups: Dim)` `__init__(out self, pad: DimList, stride: DimList, dilation: DimList, input_c: Dim, filter_c: Dim)` ### `all_known` `all_known(self) -> Bool` ### `pad_left` `pad_left(self) -> Int` 
### `pad_bottom` `pad_bottom(self) -> Int` ### `strides` `strides(self) -> IndexList[2]` ### `dilations` `dilations(self) -> IndexList[2]` --- ## ConvPartition `@register_passable(trivial)` `struct ConvPartition` Work range for a partition. ## Fields * ​ng\_offset (`Int`): * ​ng\_size (`Int`): * ​f\_offset (`Int`): * ​f\_size (`Int`): * ​ho\_or\_howo\_offset (`Int`): * ​ho\_or\_howo\_size (`Int`): * ​c\_offset (`Int`): * ​c\_size (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `empty` `empty(self) -> Bool` --- ## ConvShape `@register_passable(trivial)` `struct ConvShape[rank: Int]` A shape struct describing the convolution dimensions. ## Fields * ​n (`Int`): * ​input\_dims (`IndexList[rank]`): * ​output\_dims (`IndexList[rank]`): * ​filter\_dims (`IndexList[rank]`): * ​c (`Int`): * ​f (`Int`): * ​stride (`IndexList[rank]`): * ​dilation (`IndexList[rank]`): * ​pad\_d (`IndexList[2]`): * ​pad\_h (`IndexList[2]`): * ​pad\_w (`IndexList[2]`): * ​num\_groups (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `d` `d(self) -> Int` Input depth. ### `h` `h(self) -> Int` Input height. ### `w` `w(self) -> Int` Input width. ### `do` `do(self) -> Int` Output depth. ### `ho` `ho(self) -> Int` Output height. ### `wo` `wo(self) -> Int` Output width. ### `q` `q(self) -> Int` Filter window depth. ### `r` `r(self) -> Int` Filter window height. ### `s` `s(self) -> Int` Filter window width. ### `filter_window_flat_size` `filter_window_flat_size(self) -> Int` ### `input_image_flat_size` `input_image_flat_size(self) -> Int` ### `output_image_flat_size` `output_image_flat_size(self) -> Int` ### `output_space_dims` `output_space_dims(self) -> IndexList[rank]` ### `output_flat_coord_to_input_offset` `output_flat_coord_to_input_offset(self, n: Int, output_flat_coord: Int) -> Int` ### `matmul_M` `matmul_M(self) -> Int` ### `matmul_N` `matmul_N(self) -> Int` ### `matmul_K` `matmul_K(self) -> Int` ### `padded` `padded(self) -> Bool` ### `c_per_group` `c_per_group(self) -> Int` Returns the number of channels per group. Channel count must be divisible by group size. ### `f_per_group` `f_per_group(self) -> Int` Returns the number of filters per group. Filter count must be divisible by group size. ### `f_to_group` `f_to_group(self, f_idx: Int) -> Int` Given a global filter idx, returns the group idx of the group the filter belongs to. ### `c_to_group` `c_to_group(self, c_idx: Int) -> Int` Given a global channel idx, returns the group idx of the group the channel belongs to. ### `f_in_group` `f_in_group(self, f_idx: Int) -> Int` Given a global filter idx, returns the offset of the filter in its group. ### `c_in_group` `c_in_group(self, c_idx: Int) -> Int` Given a global channel idx, returns the offset of the channel in its group.
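The group helpers above reduce to simple integer arithmetic. A minimal sketch, assuming the usual convention that the filter count F divides evenly into num\_groups (illustrative helpers, not the library implementation):

```mojo
# Illustrative grouped-convolution index arithmetic, assuming the filter
# count F divides evenly into num_groups (names are illustrative only).
fn f_per_group(F: Int, num_groups: Int) -> Int:
    return F // num_groups

fn f_to_group(f_idx: Int, F: Int, num_groups: Int) -> Int:
    # Group that global filter index f_idx belongs to.
    return f_idx // f_per_group(F, num_groups)

fn f_in_group(f_idx: Int, F: Int, num_groups: Int) -> Int:
    # Offset of global filter index f_idx within its group.
    return f_idx % f_per_group(F, num_groups)

fn main():
    # 64 filters in 4 groups of 16: filter 37 is filter 5 of group 2.
    print(f_to_group(37, 64, 4), f_in_group(37, 64, 4))  # prints 2 5
```

The channel-side helpers (`c_per_group`, `c_to_group`, `c_in_group`) follow the same pattern with C in place of F.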
--- ## align_down_residual `align_down_residual(value: Int, alignment: Int) -> Int` Returns the remainder after aligning down value to alignment. **Args:** * ​value (`Int`): The value to align. * ​alignment (`Int`): Value to align to. **Returns:** The remainder after aligning down value to the closest multiple of alignment. In other words, value - align\_down(value, alignment). --- ## append_shape `append_shape[rank: Int](in_shape: IndexList[rank], last2nd: Int, last: Int) -> IndexList[(rank + 2)]` Append input shape by inserting `last2nd` and `last` at the end. --- ## extend_shape `extend_shape[rank: Int](in_shape: IndexList[rank], first: Int, last: Int) -> IndexList[(rank + 2)]` Extend input shape by inserting `first` and `last` at both ends. --- ## get_conv2d_shape `get_conv2d_shape[output_shape: DimList, input_shape: DimList, filter_shape: DimList, type: DType, data_layout: Image2DLayout, filter_layout: Image2DLayout](output: NDBuffer[type, 4, origin, output_shape], input: NDBuffer[type, 4, origin, input_shape], filter: NDBuffer[type, 4, origin, filter_shape], pad_h: IndexList[2], pad_w: IndexList[2], stride: IndexList[2], dilation: IndexList[2], num_groups: Int) -> ConvShape[2]` `get_conv2d_shape[filter_rank: Int, output_shape: DimList, input_shape: DimList, filter_shape: DimList, type: DType, data_layout: Image2DLayout, filter_layout: Image2DLayout](output: NDBuffer[type, 4, origin, output_shape], input: NDBuffer[type, 4, origin, input_shape], filter: NDBuffer[type, filter_rank, origin, filter_shape], pad_h: IndexList[2], pad_w: IndexList[2], stride: IndexList[2], dilation: IndexList[2], num_groups: Int) -> ConvShape[2]` --- ## get_conv_num_partitions `get_conv_num_partitions[micro_kernel_w: Int, micro_kernel_f: Int](num_threads: Int, conv_shape: ConvShape[rank]) -> IndexList[4]` Partition the workload in (batch, C, F, HOWO) dimensions. HOWO is the combination of HO and WO dimensions. The actual number of tasks is the product of the returned num\_partitions. --- ## get_conv_num_tasks `get_conv_num_tasks(num_threads: Int, conv_shape: ConvShape[rank]) -> Int` --- ## get_conv_shape `get_conv_shape[rank: Int, filter_packed: Bool](output: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], input: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], stride: IndexList[rank], dilation: IndexList[rank], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2], num_groups: Int) -> ConvShape[rank]` --- ## get_conv_tile_shape `get_conv_tile_shape[type: DType](c: Int, filter_window_size: Int, micro_kernel_width: Int) -> IndexList[2]` Compute the (c, f) tile shape in L2. Assuming NHWC layout, the tile shape is (R, S, c\_tile, f\_tile). R and S are by default fully covered. The heuristic tries to block in C as much as possible; if C is small, it starts to block F. --- ## get_conv_tile_size `get_conv_tile_size[type: DType]() -> Int` --- ## get_direct_conv_micro_kernel_height `get_direct_conv_micro_kernel_height() -> Int` --- ## get_direct_conv_micro_kernel_width `get_direct_conv_micro_kernel_width() -> Int` --- ## get_micro_kernel_shape `get_micro_kernel_shape[rank: Int, WO: Dim, F: Dim, conv_attr: ConvInfoStatic[rank], simd_size: Int]() -> IndexList[2]` --- ## get_partition `get_partition(task_id: Int, num_partitions: IndexList[4], conv_shape: ConvShape[rank], micro_kernel_height: Int, micro_kernel_f_size: Int) -> ConvPartition` --- ## conv_utils ## Aliases ### `elementwise_epilogue_type` `alias elementwise_epilogue_type = fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None` ### `elementwise_simd_epilogue_type` `alias elementwise_simd_epilogue_type = fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None` ## Structs * [​`ConvAlgorithm`](./ConvAlgorithm): * [​`ConvInfoStatic`](./ConvInfoStatic): * [​`ConvPartition`](./ConvPartition): Work range for a partition. 
* [​`ConvShape`](./ConvShape): A shape struct describing the convolution dimensions. ## Functions * [​`align_down_residual`](./align_down_residual): Returns the remainder after aligning down value to alignment. * [​`append_shape`](./append_shape): Append input shape by inserting `last2nd` and `last` at the end. * [​`extend_shape`](./extend_shape): Extend input shape by inserting `first` and `last` at both ends. * [​`get_conv2d_shape`](./get_conv2d_shape): * [​`get_conv_num_partitions`](./get_conv_num_partitions): Partition the workload in (batch, C, F, HOWO) dimensions. HOWO is the combination of HO and WO dimensions. The actual number of tasks is the product of the returned num\_partitions. * [​`get_conv_num_tasks`](./get_conv_num_tasks): * [​`get_conv_shape`](./get_conv_shape): * [​`get_conv_tile_shape`](./get_conv_tile_shape): Compute the (c, f) tile shape in L2. Assuming NHWC layout, the tile shape is (R, S, c\_tile, f\_tile). R and S are by default fully covered. The heuristic tries to block in C as much as possible; if C is small, it starts to block F. * [​`get_direct_conv_micro_kernel_height`](./get_direct_conv_micro_kernel_height): * [​`get_direct_conv_micro_kernel_width`](./get_direct_conv_micro_kernel_width): * [​`get_micro_kernel_shape`](./get_micro_kernel_shape): * [​`get_partition`](./get_partition): * [​`reorder_padding`](./reorder_padding): --- ## reorder_padding `reorder_padding[rank: Int](pad: DimList) -> DimList` --- ## cumsum `cumsum[rank: Int, type: DType, exclusive: Bool, reverse: Bool](output: NDBuffer[type, rank, origin], input: NDBuffer[type, rank, origin], axis: Int)` Implements the CumSum operator from the ONNX spec: Computes cumulative sum of the input elements along the given axis. Cumulative sum can be inclusive or exclusive of the top element, and normal or reverse (direction along a given axis). **Parameters:** * ​rank (`Int`): Rank of the input and output tensors. * ​type (`DType`): Type of the input and output tensors. * ​exclusive (`Bool`): If set to True, return exclusive sum (top element not included). * ​reverse (`Bool`): If set to True, perform cumsum operation in reverse direction. **Args:** * ​output (`NDBuffer[type, rank, origin]`): The output tensor. * ​input (`NDBuffer[type, rank, origin]`): The input tensor. * ​axis (`Int`): The axis on which to perform the cumsum operation. --- ## cumsum ## Functions * [​`cumsum`](./cumsum): Implements the CumSum operator from the ONNX spec: Computes cumulative sum of the input elements along the given axis. Cumulative sum can be inclusive or exclusive of the top element, and normal or reverse (direction along a given axis).
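To make the four variants concrete, the following is a minimal 1-D sketch of the semantics (illustrative only; the kernel applies this along one axis of an `NDBuffer`):

```mojo
# Illustrative 1-D cumsum semantics: inclusive vs. exclusive, forward vs.
# reverse. The real kernel applies this along one axis of an NDBuffer.
fn cumsum_1d(values: List[Int], exclusive: Bool, reverse: Bool) -> List[Int]:
    var n = len(values)
    var out = List[Int](capacity=n)
    for _ in range(n):
        out.append(0)
    var total = 0
    for step in range(n):
        # Walk backward when reverse is set.
        var i = n - 1 - step if reverse else step
        if exclusive:
            out[i] = total  # top element excluded
            total += values[i]
        else:
            total += values[i]
            out[i] = total  # top element included
    return out^

fn main():
    var x = List[Int](1, 2, 3, 4)
    var inclusive = cumsum_1d(x, False, False)  # [1, 3, 6, 10]
    var exclusive = cumsum_1d(x, True, False)   # [0, 1, 3, 6]
    for i in range(len(inclusive)):
        print(inclusive[i], exclusive[i])
```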
--- ## flash_attention `flash_attention[type: DType, rank: Int, mask_rank: Int, //, input_k_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_v_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_mask_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](q: NDBuffer[type, rank, origin, shape, strides], k_shape: IndexList[rank], v_shape: IndexList[rank], mask_shape: IndexList[mask_rank], output: NDBuffer[type, rank, origin, shape, strides], scale: SIMD[float32, 1])` --- ## flash_attention_kv_cache `flash_attention_kv_cache[type: DType, cache_t: KVCacheT, //](q: NDBuffer[type, 4, origin, shape, strides], k: cache_t, v: cache_t, mask: NDBuffer[type, rank, origin, shape, strides], scale: SIMD[float32, 1], output: NDBuffer[type, 4, origin, shape, strides])` `flash_attention_kv_cache[type: DType, cache_t: KVCacheT, mask_t: MHAMask, //](q: NDBuffer[type, 4, origin, shape, strides], k: cache_t, v: cache_t, mask: mask_t, scale: SIMD[float32, 1], output: NDBuffer[type, 4, origin, shape, strides])` `flash_attention_kv_cache[type: DType, cache_t: KVCacheT, mask_t: MHAMask, //](q: NDBuffer[type, 3, origin, shape, strides], q_input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], k: cache_t, v: cache_t, mask: mask_t, scale: SIMD[float32, 1], output: NDBuffer[type, 3, origin, shape, strides])` Entrypoint for ragged tensors. --- ## flash_attention_split_kv `flash_attention_split_kv[type: DType, rank: Int, mask_rank: Int, //, input_k_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_v_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_k_cache_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_v_cache_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_mask_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](q: NDBuffer[type, rank, origin, shape, strides], k_shape: IndexList[rank], v_shape: IndexList[rank], k_cache_shape: IndexList[(rank + 1)], v_cache_shape: IndexList[(rank + 1)], mask_shape: IndexList[mask_rank], output: NDBuffer[type, rank, origin, shape, strides], scale: SIMD[float32, 1])` Variant of flash attention that takes the previous KV cache `input_{k,v}_cache_fn` and the current KV tensors `input_k_fn` and `input_v_fn` as separate arguments. This works around the fact that fusion can't currently look through concat. So this kernel does an in-place concat fusion by changing the input lambdas `input_{k,v}_cache_fn_wrapper` to take previous sequence KV elements from the KV cache, and current KV elements from tensors `k` and `v`. --- ## flash_attention ## Functions * [​`flash_attention`](./flash_attention): * [​`flash_attention_kv_cache`](./flash_attention_kv_cache): * [​`flash_attention_split_kv`](./flash_attention_split_kv): Variant of flash attention that takes the previous KV cache `input_{k,v}_cache_fn` and the current KV tensors `input_k_fn` and `input_v_fn` as separate arguments.
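These entry points compute scaled-dot-product attention, `softmax(scale * Q @ K^T + mask) @ V`, with the flash-attention algorithm tiling the work so the full score matrix is never materialized. For orientation only, here is a minimal, non-tiled sketch of that math for a single query vector (illustrative helper names; not how the kernels are implemented):

```mojo
from math import exp

# Reference semantics for one query row: softmax over scale * q.k_i + mask_i,
# then a weighted sum of the value vectors. Illustrative only; the real
# kernels tile this computation and never materialize the full score vector.
fn attend_one_query(q: List[Float64], keys: List[List[Float64]], vals: List[List[Float64]], mask: List[Float64], scale: Float64) -> List[Float64]:
    var n = len(keys)
    var scores = List[Float64](capacity=n)
    var max_score: Float64 = -1e30
    for i in range(n):
        var dot = 0.0
        for d in range(len(q)):
            dot += q[d] * keys[i][d]
        var s = scale * dot + mask[i]
        scores.append(s)
        if s > max_score:
            max_score = s
    # Numerically stable softmax: subtract the max before exponentiating.
    var denom = 0.0
    for i in range(n):
        scores[i] = exp(scores[i] - max_score)
        denom += scores[i]
    var out = List[Float64]()
    for d in range(len(vals[0])):
        var acc = 0.0
        for i in range(n):
            acc += (scores[i] / denom) * vals[i][d]
        out.append(acc)
    return out^

fn main():
    var q = List[Float64](1.0, 0.0)
    var keys = List[List[Float64]]()
    keys.append(List[Float64](1.0, 0.0))
    keys.append(List[Float64](0.0, 1.0))
    var vals = List[List[Float64]]()
    vals.append(List[Float64](10.0, 0.0))
    vals.append(List[Float64](0.0, 10.0))
    var mask = List[Float64](0.0, 0.0)
    var out = attend_one_query(q, keys, vals, mask, 1.0)
    print(out[0], out[1])  # roughly 7.31 and 2.69
```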
--- ## fold `fold[dtype: DType, input_dim: DimList, output_dim: DimList, //, stride: Tuple[Int, Int], dilation: Tuple[Int, Int], padding: Tuple[Int, Int], target: StringSlice[StaticConstantOrigin]](input: NDBuffer[dtype, 3, MutableAnyOrigin, input_dim], output: NDBuffer[dtype, 4, MutableAnyOrigin, output_dim], output_size: IndexList[2], kernel_size: IndexList[2], ctx: DeviceContextPtr)` Folds an array of sliding local blocks into a single output tensor. **Parameters:** * ​dtype (`DType`): The data type for the input and output. * ​input\_dim (`DimList`): The static shape of the input NDBuffer. * ​output\_dim (`DimList`): The static shape of the output NDBuffer. * ​stride (`Tuple[Int, Int]`): Stride of the sliding blocks. * ​dilation (`Tuple[Int, Int]`): Dilation of the sliding blocks. * ​padding (`Tuple[Int, Int]`): Zero-padding to be added on both sides of the inputs. * ​target (`StringSlice[StaticConstantOrigin]`): The target architecture to compile for. **Args:** * ​input (`NDBuffer[dtype, 3, MutableAnyOrigin, input_dim]`): Input tensor to fold, shape \[N, C x kernel size, num\_blocks]. * ​output (`NDBuffer[dtype, 4, MutableAnyOrigin, output_dim]`): Output tensor to write to, shape \[N, C, H, W]. * ​output\_size (`IndexList[2]`): Spatial shape of the output tensor (H, W). * ​kernel\_size (`IndexList[2]`): Size of the sliding blocks. * ​ctx (`DeviceContextPtr`): DeviceContextPtr. --- ## fold_shape `fold_shape[dtype: DType, input_dim: DimList](input: NDBuffer[dtype, 3, MutableAnyOrigin, input_dim], output_size: IndexList[2], kernel_size: IndexList[2]) -> IndexList[4]` Returns the shape of the output tensor of the fold operation. --- ## fold Implements the fold operation. ## Functions * [​`fold`](./fold): Folds an array of sliding local blocks into a single output tensor. * [​`fold_shape`](./fold_shape): Returns the shape of the output tensor of the fold operation.
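The `[N, C x kernel size, num_blocks]` input layout implies a fixed relationship between `output_size`, `kernel_size`, `stride`, `dilation`, `padding`, and `num_blocks`. A minimal sketch of that relationship, following the usual unfold/fold convention (illustrative helper, not the library routine):

```mojo
# Illustrative: number of sliding-block positions along one spatial dimension,
# following the usual unfold/fold convention. The input's num_blocks dimension
# must equal the product of these counts over H and W.
fn num_blocks_1d(out_size: Int, kernel: Int, stride: Int, dilation: Int, padding: Int) -> Int:
    return (out_size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

fn main():
    # An 8x8 output folded from 2x2 blocks at stride 2 with no padding:
    # 4 positions per axis, so num_blocks must be 16.
    var nh = num_blocks_1d(8, 2, 2, 1, 0)
    var nw = num_blocks_1d(8, 2, 2, 1, 0)
    print(nh * nw)  # prints 16
```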
--- ## fused_qk_rope `fused_qk_rope[type: DType, collection_t: KVCollectionT, //, cache_t: KVCacheT, *, interleaved: Bool, target: StringSlice[StaticConstantOrigin]](q_proj: NDBuffer[type, 4, origin, shape, strides], kv_collection: collection_t, freqs_cis: NDBuffer[type, 2, origin, shape, strides], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 4, origin, shape, strides], context: Optional[DeviceContext])` --- ## fused_qk_rope_ragged `fused_qk_rope_ragged[type: DType, collection_t: KVCollectionT, //, cache_t: KVCacheT, *, interleaved: Bool, target: StringSlice[StaticConstantOrigin]](q_proj: NDBuffer[type, 3, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: collection_t, freqs_cis: NDBuffer[type, 2, origin, shape, strides], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: Optional[DeviceContext])` Applies RoPE (Rotary Position Embedding) to query and key tensors. This function applies RoPE only to the last `rope_dim` elements of each head, leaving the first `unroped_dim` elements unchanged. This is required for DeepSeek models where only part of each head undergoes rotary transformation. --- ## get_identity_rope_coeff `get_identity_rope_coeff[width: Int, type: DType]() -> SIMD[type, width]` --- ## get_safetensors_idx `get_safetensors_idx(head_dim_idx: Int, head_size: Int) -> Tuple[Int, Int]` --- ## fused_qk_rope ## Functions * [​`fused_qk_rope`](./fused_qk_rope): * [​`fused_qk_rope_ragged`](./fused_qk_rope_ragged): Applies RoPE (Rotary Position Embedding) to query and key tensors. * [​`get_identity_rope_coeff`](./get_identity_rope_coeff): * [​`get_safetensors_idx`](./get_safetensors_idx): * [​`rope_k_cache`](./rope_k_cache): * [​`rope_q_proj`](./rope_q_proj): --- ## rope_k_cache `rope_k_cache[type: DType, cache_t: KVCacheT, width: Int, //, *, interleaved: Bool](k_cache: cache_t, b_idx: Int, h_idx: Int, s_idx: Int, d_idx: Int, freq_val: SIMD[type, width], head_size: Int)` --- ## rope_q_proj `rope_q_proj[type: DType, rank: Int, width: Int, //, *, interleaved: Bool](q_proj: NDBuffer[type, rank, origin, shape, strides], output: NDBuffer[type, rank, origin, shape, strides], idx: IndexList[rank], freq_val: SIMD[type, width], head_size: Int)`
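Both `rope_q_proj` and `rope_k_cache` apply the same underlying transformation: each pair of elements in a head is rotated by a position-dependent angle derived from `freqs_cis`. A minimal sketch of rotating one pair (illustrative only; the `interleaved` parameter controls whether pairs are adjacent elements or split across the two halves of the head):

```mojo
from math import cos, sin

# Illustrative RoPE rotation of one (x1, x2) pair by angle theta. In an
# interleaved layout the pair is (head[2*i], head[2*i + 1]); otherwise it is
# (head[i], head[i + head_size // 2]).
fn rope_rotate_pair(x1: Float64, x2: Float64, theta: Float64) -> Tuple[Float64, Float64]:
    var c = cos(theta)
    var s = sin(theta)
    return (x1 * c - x2 * s, x1 * s + x2 * c)

fn main():
    # Rotating (1, 0) by pi/2 gives approximately (0, 1).
    var r = rope_rotate_pair(1.0, 0.0, 1.5707963267948966)
    print(r[0], r[1])
```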
--- ## Axis `@register_passable(trivial)` `struct Axis` ## Fields * ​axis (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Indexer`, `Intable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(axis: Int) -> Self` `__init__(out self, axis: Int, rank: Int)` ### `__int__` `__int__(self) -> Int` ### `__index__` `__index__(self) -> index` Convert to index. **Returns:** The corresponding \_\_mlir\_type.index value. --- ## gather `gather[type: DType, indices_type: DType, //, *, axis: Int, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](output: NDBuffer[type, rank, origin, shape, strides], input: NDBuffer[type, rank, origin, shape, strides], indices: NDBuffer[indices_type, rank, origin, shape, strides], *, context: DeviceContext)` Gather operation as defined in the ONNX spec. Note that this is NOT the same as the default PyTorch gather (which is equivalent to ONNX GatherElements). `gather[type: DType, indices_type: DType, //, *, axis: Int, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](output: NDBuffer[type, rank, origin, shape, strides], input: NDBuffer[type, rank, origin, shape, strides], indices: NDBuffer[indices_type, rank, origin, shape, strides], *, context: DeviceContextPtr = DeviceContextPtr())` Gather operation as defined in the ONNX spec. Note that this is NOT the same as the default PyTorch gather (which is equivalent to ONNX GatherElements). `gather[*, type: DType, indices_type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], indices_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[indices_type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, prefetch_fn: OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None] = OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None]({:i1 0, 1}), target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), single_thread_blocking_override: Bool = False](axis: Axis, input_shape: IndexList[size, element_type=element_type], indices_shape: IndexList[size, element_type=element_type], output_shape: IndexList[size, element_type=element_type], *, context: DeviceContext)` Gather operation as defined in the ONNX spec. Note that this is NOT the same as the default PyTorch gather (which is equivalent to ONNX GatherElements). `gather[*, type: DType, indices_type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], indices_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[indices_type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, prefetch_fn: OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None] = OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None]({:i1 0, 1}), target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), single_thread_blocking_override: Bool = False](axis: Axis, input_shape: IndexList[size, element_type=element_type], indices_shape: IndexList[size, element_type=element_type], output_shape: IndexList[size, element_type=element_type], *, context: DeviceContextPtr = DeviceContextPtr())` Gather operation as defined in the ONNX spec. Note that this is NOT the same as the default PyTorch gather (which is equivalent to ONNX GatherElements). --- ## gather_elements `gather_elements[rank: Int, input_type: DType, indices_type: DType](input: NDBuffer[input_type, rank, origin], indices: NDBuffer[indices_type, rank, origin], _axis: Int, output: NDBuffer[input_type, rank, origin])` Implements ONNX GatherElements op which is equivalent to Pytorch gather. --- ## gather_elementwise_fn_wrapper `gather_elementwise_fn_wrapper[*, type: DType, indices_type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], indices_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[indices_type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, simd_width: Int, prefetch_fn: OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None] = OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None]({:i1 0, 1})](axis: Axis, input_shape: IndexList[size, element_type=element_type], indices_shape: IndexList[size, element_type=element_type], output_shape: IndexList[size, element_type=element_type], coords: IndexList[size, element_type=element_type])` --- ## gather_guards `gather_guards(axis: Axis, input_shape: IndexList[size, element_type=element_type], indices_shape: IndexList[size, element_type=element_type], output_shape: IndexList[size, element_type=element_type])` --- ## gather_nd `gather_nd[type: DType, indices_type: DType, data_rank: Int, indices_rank: Int, output_rank: Int, batch_dims: Int, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), single_thread_blocking_override: Bool = False](data: NDBuffer[type, data_rank, origin], indices: NDBuffer[indices_type, indices_rank, origin], output: NDBuffer[type, output_rank, origin], ctx: DeviceContextPtr)` GatherND operation as defined in the ONNX spec. **Parameters:** * ​type (`DType`): Type of data tensor. * ​indices\_type (`DType`): Type of indices tensor. * ​data\_rank (`Int`): Rank of data tensor (data\_rank >= 1). * ​indices\_rank (`Int`): Rank of indices tensor (indices\_rank >= 1). * ​output\_rank (`Int`): Rank of output tensor. * ​batch\_dims (`Int`): Number of batch dimensions. Gather indexing starts from the dimensions of data\[batch\_dims:]. * ​target (`StringSlice[StaticConstantOrigin]`): The target architecture to execute on. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​data (`NDBuffer[type, data_rank, origin]`): Tensor of rank data\_rank >= 1. * ​indices (`NDBuffer[indices_type, indices_rank, origin]`): Tensor of rank indices\_rank >= 1. 
All index values are expected to be within bounds \[-s, s-1] along axis of size s. It is an error if any of the index values are out of bounds. * ​output (`NDBuffer[type, output_rank, origin]`): Tensor of rank data\_rank + indices\_rank - indices\_shape\[-1] - 1 - batch\_dims. * ​ctx (`DeviceContextPtr`): The DeviceContextPtr as prepared by the graph compiler. --- ## gather_nd_shape `gather_nd_shape[input_rank: Int, indices_rank: Int, output_rank: Int, input_type: DType, indices_type: DType, batch_dims: Int, single_thread_blocking_override: Bool = True](input_buf: NDBuffer[input_type, input_rank, origin], indices_buf: NDBuffer[indices_type, indices_rank, origin]) -> IndexList[output_rank]` Compute the output shape of a `gather` operation, and assert the inputs are compatible. **Parameters:** * ​input\_rank (`Int`): Rank of the input tensor. * ​indices\_rank (`Int`): Rank of the indices tensor. * ​output\_rank (`Int`): Rank of the output tensor. * ​input\_type (`DType`): Type of the input tensor. * ​indices\_type (`DType`): Type of the indices tensor. * ​batch\_dims (`Int`): Batch dimensions. * ​single\_thread\_blocking\_override (`Bool`): If True, then reduction is run synchronously using a single thread. **Args:** * ​input\_buf (`NDBuffer[input_type, input_rank, origin]`): The input tensor. * ​indices\_buf (`NDBuffer[indices_type, indices_rank, origin]`): The indices tensor. **Returns:** The output shape.
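The rank expression above is the usual GatherND shape rule: the output shape is `indices.shape[:-1]` followed by whatever data dimensions remain after the batch dimensions and the indexed dimensions. A minimal sketch (illustrative helper, not the library routine):

```mojo
# Illustrative GatherND output-shape rule: indices.shape[:-1] followed by the
# data dims left over after the batch dims and the m indexed dims, where m is
# indices.shape[-1].
fn gather_nd_output_shape(data_shape: List[Int], indices_shape: List[Int], batch_dims: Int) -> List[Int]:
    var out = List[Int]()
    for i in range(len(indices_shape) - 1):
        out.append(indices_shape[i])
    var m = indices_shape[len(indices_shape) - 1]
    for i in range(batch_dims + m, len(data_shape)):
        out.append(data_shape[i])
    return out^

fn main():
    # data [2, 3, 4] gathered with indices [5, 2] and batch_dims = 0:
    # output shape is [5] ++ [4] = [5, 4].
    var out = gather_nd_output_shape(List[Int](2, 3, 4), List[Int](5, 2), 0)
    for i in range(len(out)):
        print(out[i])
```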
--- ## gather_reduce `gather_reduce[type: DType, gather_axis: Int, reduce_axis: Int, simd_width: Int, reduce_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1], output_rank: Int, output_shape: DimList, input_rank: Int, input_shape: DimList, indices_rank: Int, indices_shape: DimList](output: NDBuffer[type, output_rank, origin, output_shape], input: NDBuffer[type, input_rank, origin, input_shape], indices: NDBuffer[int32, indices_rank, origin, indices_shape], reduce_init: SIMD[type, 1])` Computes output\[i, j, k] = input\[indices\[i, j], k] and simultaneously reduces the output across axis 1 to produce output\[i, k]. The motivating use-case for this is multi-hot embeddings in recommender models. This provides similar functionality to Torch's EmbeddingBag layer. In that context, i is the batch dimension, j is the multi-hot dimension, and k is the embedding dimension. --- ## gather_shape `gather_shape[output_rank: Int, input_rank: Int, indices_rank: Int, input_type: DType, indices_type: DType, single_thread_blocking_override: Bool = False](input_buf: NDBuffer[input_type, input_rank, origin], indices_buf: NDBuffer[indices_type, indices_rank, origin], axis: Int) -> IndexList[output_rank]` Compute the output shape of a `gather` operation, and assert the inputs are compatible. **Parameters:** * ​output\_rank (`Int`): Rank of the output tensor. * ​input\_rank (`Int`): Rank of the input tensor. * ​indices\_rank (`Int`): Rank of the indices tensor. * ​input\_type (`DType`): Type of the input tensor. * ​indices\_type (`DType`): Type of the indices tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input\_buf (`NDBuffer[input_type, input_rank, origin]`): The input tensor. * ​indices\_buf (`NDBuffer[indices_type, indices_rank, origin]`): The indices tensor. * ​axis (`Int`): The axis. **Returns:** The output shape. --- ## gather_scatter ## Structs * [​`Axis`](./Axis): ## Functions * [​`gather`](./gather): Gather operation as defined in the ONNX spec. * [​`gather_elements`](./gather_elements): Implements ONNX GatherElements op which is equivalent to Pytorch gather. * [​`gather_elementwise_fn_wrapper`](./gather_elementwise_fn_wrapper): * [​`gather_guards`](./gather_guards): * [​`gather_nd`](./gather_nd): GatherND operation as defined in the ONNX spec. * [​`gather_nd_shape`](./gather_nd_shape): Compute the output shape of a `gather` operation, and assert the inputs are compatible. * [​`gather_reduce`](./gather_reduce): Computes output\[i, j, k] = input\[indices\[i, j], k] and simultaneously reduces the output across axis 1 to produce output\[i, k]. * [​`gather_shape`](./gather_shape): Compute the output shape of a `gather` operation, and assert the inputs are compatible. * [​`normalize_neg_index`](./normalize_neg_index): Indices passed to gather and scatter ops may be negative. This performs a normalization so that they can be used to index into a buffer. * [​`scatter_elements`](./scatter_elements): Implements ONNX ScatterElements op which is equivalent to Pytorch scatter. * [​`scatter_elements_shape`](./scatter_elements_shape): Compute the output shape of a `scatter_elements` operation, and assert the inputs are compatible. * [​`scatter_nd`](./scatter_nd): Scatter\_nd operation without any reduction. * [​`scatter_nd_generator`](./scatter_nd_generator): Implements the ONNX ScatterND operation. * [​`scatter_nd_shape`](./scatter_nd_shape): Compute the output shape of a `scatter_nd` operation, and assert the inputs are compatible. * [​`scatter_set_constant`](./scatter_set_constant): Scatter the fill\_value into the data at the specified indices. --- ## normalize_neg_index `normalize_neg_index(idx: Int, dim_size: Int) -> Int` Indices passed to gather and scatter ops may be negative. This performs a normalization so that they can be used to index into a buffer. Returns val + dim if val < 0, else val. `normalize_neg_index[type: DType, width: Int, out_type: DType = index](idx: SIMD[type, width], dim_size: Int) -> SIMD[out_type, width]` Indices passed to gather and scatter ops may be negative. This performs a normalization so that they can be used to index into a buffer. Returns val + dim if val < 0, else val.
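A minimal sketch of the normalization (illustrative helper, not the library routine):

```mojo
# Illustrative negative-index normalization: map idx in [-dim, dim) into
# [0, dim), the way gather and scatter ops expect.
fn normalize_neg_index_sketch(idx: Int, dim_size: Int) -> Int:
    return idx + dim_size if idx < 0 else idx

fn main():
    print(normalize_neg_index_sketch(-1, 4))  # prints 3
    print(normalize_neg_index_sketch(2, 4))   # prints 2
```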
--- ## scatter_elements `scatter_elements[reduce_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1], rank: Int, input_type: DType, indices_type: DType](input: ManagedTensorSlice[io_spec, static_spec=static_spec], indices: ManagedTensorSlice[io_spec, static_spec=static_spec], updates: ManagedTensorSlice[io_spec, static_spec=static_spec], _axis: Int, output: ManagedTensorSlice[io_spec, static_spec=static_spec])` Implements ONNX ScatterElements op which is equivalent to Pytorch scatter. --- ## scatter_elements_shape `scatter_elements_shape[rank: Int, input_type: DType, indices_type: DType, //, *, single_thread_blocking_override: Bool](input: NDBuffer[input_type, rank, origin], updates: NDBuffer[input_type, rank, origin], indices: NDBuffer[indices_type, rank, origin], axis: Int) -> IndexList[rank]` Compute the output shape of a `scatter_elements` operation, and assert the inputs are compatible. **Parameters:** * ​rank (`Int`): Rank of the input tensor. * ​input\_type (`DType`): Type of the input tensor. * ​indices\_type (`DType`): Type of the indices tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input (`NDBuffer[input_type, rank, origin]`): The input tensor. * ​updates (`NDBuffer[input_type, rank, origin]`): The updates tensor. * ​indices (`NDBuffer[indices_type, rank, origin]`): The indices tensor. * ​axis (`Int`): The axis. **Returns:** The output shape. --- ## scatter_nd `scatter_nd[output_type: DType, indices_type: DType, data_rank: Int, indices_rank: Int, updates_rank: Int, single_thread_blocking_override: Bool, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](data: NDBuffer[output_type, data_rank, origin], indices: NDBuffer[indices_type, indices_rank, origin], updates: NDBuffer[output_type, updates_rank, origin], output: NDBuffer[output_type, data_rank, origin], context: DeviceContextPtr = DeviceContextPtr())` Scatter\_nd operation without any reduction. --- ## scatter_nd_generator `scatter_nd_generator[output_type: DType, indices_type: DType, data_rank: Int, indices_rank: Int, updates_rank: Int, single_thread_blocking_override: Bool, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), /, reduce_fn: OptionalReg[fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1}), *, _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("scatter_nd")](data: NDBuffer[output_type, data_rank, origin], indices: NDBuffer[indices_type, indices_rank, origin], updates: NDBuffer[output_type, updates_rank, origin], output: NDBuffer[output_type, data_rank, origin], context: DeviceContextPtr = DeviceContextPtr())` Implements the ONNX ScatterND operation. **Parameters:** * ​output\_type (`DType`): Type of data, updates, and output tensors. * ​indices\_type (`DType`): Type of the indices tensor. * ​data\_rank (`Int`): Rank of input (data) tensor (data\_rank >= 1). * ​indices\_rank (`Int`): Rank of the indices tensor (indices\_rank >= 1). * ​updates\_rank (`Int`): Rank of updates tensor (updates\_rank = data\_rank + indices\_rank - indices\_shape\[-1] - 1). * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​target (`StringSlice[StaticConstantOrigin]`): Target cpu or cuda. * ​reduce\_fn (`OptionalReg[fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]`): Reduction function to apply: none (default), add, mul, max, min. * ​\_trace\_description (`StringSlice[StaticConstantOrigin]`): A description of the function, used for profiling and tracing. **Args:** * ​data (`NDBuffer[output_type, data_rank, origin]`): Tensor of rank data\_rank >= 1. * ​indices (`NDBuffer[indices_type, indices_rank, origin]`): Tensor of rank indices\_rank containing indices for the scatter operation. * ​updates (`NDBuffer[output_type, updates_rank, origin]`): Tensor containing values to update output tensor based on indices tensor. * ​output (`NDBuffer[output_type, data_rank, origin]`): Tensor of rank data\_rank, shaped the same as data tensor. * ​context (`DeviceContextPtr`): Pointer to DeviceContext.
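To make the reduction behavior concrete, here is a minimal 1-D sketch of the scatter semantics with an optional add reduction (illustrative only; the kernel generalizes this over arbitrary ranks and index tuples):

```mojo
# Illustrative 1-D ScatterND semantics: copy data, then scatter updates at
# the given indices. Without a reduction the update overwrites; with an add
# reduction it accumulates.
fn scatter_1d(data: List[Int], indices: List[Int], updates: List[Int], use_add: Bool) -> List[Int]:
    var out = List[Int](capacity=len(data))
    for i in range(len(data)):
        out.append(data[i])
    for j in range(len(indices)):
        var idx = indices[j]
        if use_add:
            out[idx] = out[idx] + updates[j]
        else:
            out[idx] = updates[j]
    return out^

fn main():
    var result = scatter_1d(List[Int](1, 1, 1, 1), List[Int](1, 3), List[Int](5, 7), True)
    for i in range(len(result)):
        print(result[i])  # prints 1, 6, 1, 8
```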
**Parameters:** * ​input\_rank (`Int`): Rank of the input tensor. * ​updates\_rank (`Int`): Rank of the updates tensor. * ​indices\_rank (`Int`): Rank of the indices tensor. * ​input\_type (`DType`): Type of the input tensor. * ​indices\_type (`DType`): Type of the indices tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input (`NDBuffer[input_type, input_rank, origin]`): The input tensor. * ​updates (`NDBuffer[input_type, updates_rank, origin]`): The updates tensor. * ​indices (`NDBuffer[indices_type, indices_rank, origin]`): The indices tensor. **Returns:** The output shape. --- ## scatter_set_constant `scatter_set_constant[data_type: DType, index_type: DType, //, target: StringSlice[StaticConstantOrigin], single_thread_blocking_override: Bool = False](data: LayoutTensor[data_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], indices: LayoutTensor[index_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], fill_value: SIMD[data_type, 1], ctx: DeviceContextPtr)` Scatter the fill\_value into the data at the specified indices. Example: Suppose we have a 3x3 matrix `data` initialized to zeros: data = [\[0, 0, 0], \[0, 0, 0], \[0, 0, 0]] And `indices` is a 2D tensor with shape \[2, 2]: indices = [\[0, 1], \[2, 0]] If `fill_value` is 5, after calling `scatter_set_constant`, `data` will be: data = [\[0, 5, 0], \[0, 0, 0], \[5, 0, 0]] **Args:** * ​data: The data to scatter the updates into. * ​indices: The indices to scatter the updates into. * ​fill\_value: The value to fill the data with. * ​ctx: The device context. --- ## Image2DLayout `@register_passable(trivial)` `struct Image2DLayout` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `FRSCf` `alias FRSCf = Image2DLayout(3)` ### `NCHW` `alias NCHW = Image2DLayout(1)` ### `NHWC` `alias NHWC = Image2DLayout(0)` ### `RSCF` `alias RSCF = Image2DLayout(2)` ### `UNKNOWN` `alias UNKNOWN = Image2DLayout(-1)` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` --- ## ImageData `@register_passable(trivial)` `struct ImageData[shape: DimList, type: DType, static_layout: Image2DLayout, origin: MutableOrigin]` Utility class that generalizes conv2d data and filter tensors with a given data layout. ## Fields * ​data (`NDBuffer[type, 4, origin, shape]`): * ​dynamic\_layout (`Image2DLayout`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(data: NDBuffer[type, 4, origin, shape], layout: Image2DLayout) -> Self` Constructs an image data instance with a dynamic layout parameter. **Args:** * ​data (`NDBuffer[type, 4, origin, shape]`): A 4d buffer containing the actual data. * ​layout (`Image2DLayout`): Data layout tag. `@implicit` `__init__(data: NDBuffer[type, 4, origin, shape]) -> Self` ### `__getitem__` `__getitem__(self, n: Int, c: Int, h: Int, w: Int) -> SIMD[type, 1]` Reads the underlying data buffer based on the tensor index and underlying data layout. **Args:** * ​n (`Int`): Index on the batch dimension. * ​c (`Int`): Index on the channel dimension. * ​h (`Int`): Index on the height dimension.
* ​w (`Int`): Index on the width dimension. **Returns:** The value stored at the given index position. ### `__setitem__` `__setitem__(self, n: Int, c: Int, h: Int, w: Int, value: SIMD[type, 1])` Writes the underlying data buffer based on the tensor index and underlying data layout. **Args:** * ​n (`Int`): Index on the batch dimension. * ​c (`Int`): Index on the channel dimension. * ​h (`Int`): Index on the height dimension. * ​w (`Int`): Index on the width dimension. * ​value (`SIMD[type, 1]`): The value to store at the given index position. ### `to_static_layout` `to_static_layout[new_static_layout: Image2DLayout](self) -> ImageData[shape, type, new_static_layout, origin]` Conversion utility from a fully dynamic data structure (e.g., from a C shim) to one with a compile-time-known data layout. **Returns:** The image data with static data layout. ### `get_layout` `get_layout(self) -> Image2DLayout` Returns the underlying data layout, resolved from either statically or dynamically provided information. **Returns:** The resolved data layout tag for this image instance. ### `get_flat_index` `get_flat_index(self, n: Int, c: Int, h: Int, w: Int) -> Int` Converts the dimension index to the flat index of the underlying data based on the tensor layout. **Args:** * ​n (`Int`): Index on the batch dimension. * ​c (`Int`): Index on the channel dimension. * ​h (`Int`): Index on the height dimension. * ​w (`Int`): Index on the width dimension. **Returns:** An integer containing the index based on the underlying data layout. ### `get_tuple_index` `get_tuple_index(self, idx: Int) -> IndexList[4]` Converts the flat index to the dimension index of the underlying data based on the tensor layout. **Args:** * ​idx (`Int`): Flat index. **Returns:** An IndexList containing the index in NCHW order. ### `num_elements` `num_elements(self) -> Int` --- ## ImageShape `@register_passable(trivial)` `struct ImageShape` A data-layout agnostic representation of tensor shapes used in conv2d. ## Fields * ​N (`Int`): * ​C (`Int`): * ​H (`Int`): * ​W (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__[shape: DimList, type: DType, layout: Image2DLayout](image_data: ImageData[shape, type, layout, origin]) -> Self` Constructs an ImageShape instance from an ImageData. **Args:** * ​image\_data (`ImageData[shape, type, layout, origin]`): The image data instance to extract shape info from. --- ## PadHandling `@register_passable(trivial)` `struct PadHandling` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `EXCLUDE_PAD` `alias EXCLUDE_PAD = PadHandling(0)` ### `INCLUDE_PAD` `alias INCLUDE_PAD = PadHandling(2)` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` --- ## image ## Structs * [​`Image2DLayout`](./Image2DLayout): * [​`ImageData`](./ImageData): Utility class that generalizes conv2d data and filter tensors with a given data layout. * [​`ImageShape`](./ImageShape): A data-layout agnostic representation of tensor shapes used in conv2d. * [​`PadHandling`](./PadHandling): --- ## nn Provides neural network operators for deep learning models. ## Modules * [​`activations`](./activations/): The module contains implementations of activation functions.
* [​`arange`](./arange/): * [​`arg_nonzero`](./arg_nonzero/): * [​`argmaxmin`](./argmaxmin/): * [​`argmaxmin_gpu`](./argmaxmin_gpu/): * [​`argsort`](./argsort/): * [​`bicubic`](./bicubic/): This module provides CPU and GPU implementations for bicubic interpolation. * [​`broadcast`](./broadcast/): * [​`concat`](./concat/): * [​`conv`](./conv/): * [​`conv_transpose`](./conv_transpose/): * [​`conv_utils`](./conv_utils/): * [​`cumsum`](./cumsum/): * [​`flash_attention`](./flash_attention/): * [​`fold`](./fold/): Implements the fold operation. * [​`fused_qk_rope`](./fused_qk_rope/): * [​`gather_scatter`](./gather_scatter/): * [​`image`](./image/): * [​`index_tensor`](./index_tensor/): * [​`irfft`](./irfft/): Inverse real FFT kernel using cuFFT. * [​`kv_cache`](./kv_cache/): * [​`kv_cache_ragged`](./kv_cache_ragged/): * [​`mha`](./mha/): * [​`mha_cross`](./mha_cross/): * [​`mha_mask`](./mha_mask/): * [​`mha_operand`](./mha_operand/): * [​`mha_score_mod`](./mha_score_mod/): * [​`mha_sm90`](./mha_sm90/): * [​`mha_tile_scheduler`](./mha_tile_scheduler/): * [​`mha_utils`](./mha_utils/): * [​`mla`](./mla/): * [​`moe`](./moe/): * [​`nms`](./nms/): * [​`normalization`](./normalization/): * [​`pad`](./pad/): * [​`pad_gpu`](./pad_gpu/): * [​`pool`](./pool/): * [​`rand_uniform`](./rand_uniform/): * [​`randn`](./randn/): * [​`repeat_interleave`](./repeat_interleave/): * [​`reshape`](./reshape/): * [​`resize`](./resize/): * [​`roi_align`](./roi_align/): * [​`sampling`](./sampling/): * [​`shapes`](./shapes/): * [​`slice`](./slice/): * [​`softmax`](./softmax/): * [​`split`](./split/): * [​`tile`](./tile/): * [​`topk`](./topk/): * [​`toppminp`](./toppminp/): * [​`toppminp_gpu`](./toppminp_gpu/): --- ## advanced_indexing_getitem `advanced_indexing_getitem[input_rank: Int, index_rank: Int, input_type: DType, index_type: DType, //, start_axis: Int, num_index_tensors: Int, target: StringSlice[StaticConstantOrigin], single_thread_blocking_override: Bool, trace_description: StringSlice[StaticConstantOrigin], input_tensor_fn: fn[Int](IndexList[input_rank]) capturing -> SIMD[input_type, $0], indices_fn: fn[Int](IndexList[index_rank]) capturing -> SIMD[index_type, 1]](out_tensor: NDBuffer[input_type, ((num_index_tensors * -1) + index_rank + input_rank), origin], in_tensor_strides: IndexList[input_rank], ctx: DeviceContextPtr)` Implements basic numpy-style advanced indexing. This is designed to be fused with other view-producing operations to implement full numpy-indexing semantics. This assumes the dimensions in `input_tensor` not indexed by index tensors are ":", i.e., selecting all indices along the slice. For example in numpy: ``` # rank(indices1) == 3 # rank(indices2) == 3 out_tensor = input_tensor[:, :, :, indices1, indices2, :, :] ``` We calculate the following for all valid values of the indexing variables: ``` out_tensor[a, b, c, i, j, k, d, e] = input_tensor[ a, b, c, indices1[i, j, k], indices2[i, j, k], d, e ] ``` In this example `start_axis = 3` and `num_index_tensors = 2`. TODO(GEX-1951): Support boolean tensor masks. TODO(GEX-1952): Support the non-contiguous indexing tensor case. TODO(GEX-1953): Support fusion (especially view-fusion). **Parameters:** * ​input\_rank (`Int`): The rank of the input tensor. * ​index\_rank (`Int`): The rank of the indexing tensors. * ​input\_type (`DType`): The dtype of the input tensor. * ​index\_type (`DType`): The dtype of the indexing tensors. * ​start\_axis (`Int`): The first dimension in input where the indexing tensors are applied.
It is assumed the indexing tensors are applied in consecutive dimensions. * ​num\_index\_tensors (`Int`): The number of indexing tensors. * ​target (`StringSlice[StaticConstantOrigin]`): The target architecture to operate on. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​trace\_description (`StringSlice[StaticConstantOrigin]`): For profiling, the trace name the operation will appear under. * ​input\_tensor\_fn (`fn[Int](IndexList[input_rank]) capturing -> SIMD[input_type, $0]`): Fusion lambda for the input tensor. * ​indices\_fn (`fn[Int](IndexList[index_rank]) capturing -> SIMD[index_type, 1]`): Fusion lambda for the indices tensors. **Args:** * ​out\_tensor (`NDBuffer[input_type, ((num_index_tensors * -1) + index_rank + input_rank), origin]`): The output tensor to write to. * ​in\_tensor\_strides (`IndexList[input_rank]`): The strides of the input tensor. * ​ctx (`DeviceContextPtr`): The DeviceContextPtr as prepared by the graph compiler. --- ## advanced_indexing_getitem_shape `advanced_indexing_getitem_shape[input_rank: Int, index_rank: Int, //, start_axis: Int, num_index_tensors: Int](input_shape: IndexList[input_rank], index_shape: IndexList[index_rank]) -> IndexList[((num_index_tensors * -1) + index_rank + input_rank)]` Calculate the output shape from advanced indexing. **Parameters:** * ​input\_rank (`Int`): The rank of the input tensor. * ​index\_rank (`Int`): The rank of the indexing tensors. * ​start\_axis (`Int`): The first dimension in input where the indexing tensors are applied. It is assumed the indexing tensors are applied in consecutive dimensions. * ​num\_index\_tensors (`Int`): The number of indexing tensors. **Args:** * ​input\_shape (`IndexList[input_rank]`): The shape of the input tensor in the operation. * ​index\_shape (`IndexList[index_rank]`): The shape of the indexing tensors in the operation. **Returns:** The shape of the output tensor. --- ## advanced_indexing_setitem_inplace `advanced_indexing_setitem_inplace[input_rank: Int, index_rank: Int, updates_rank: Int, input_type: DType, index_type: DType, //, start_axis: Int, num_index_tensors: Int, target: StringSlice[StaticConstantOrigin], single_thread_blocking_override: Bool, trace_description: StringSlice[StaticConstantOrigin], updates_tensor_fn: fn[Int](IndexList[updates_rank]) capturing -> SIMD[input_type, $0], indices_fn: fn[Int](IndexList[index_rank]) capturing -> SIMD[index_type, 1]](input_tensor: NDBuffer[input_type, input_rank, origin], index_tensor_shape: IndexList[index_rank, element_type=element_type], updates_tensor_strides: IndexList[updates_rank], ctx: DeviceContextPtr)` Implements basic numpy-style advanced indexing with assignment. This is designed to be fused with other view-producing operations to implement full numpy-indexing semantics. This assumes the dimensions in `input_tensor` not indexed by index tensors are ":", i.e., selecting all indices along the slice. For example in numpy: ``` # rank(indices1) == 2 # rank(indices2) == 2 # rank(updates) == 2 input_tensor[:, :, :, indices1, indices2, :, :] = updates ``` We calculate the following for all valid values of the indexing variables: ``` input_tensor[ a, b, c, indices1[i, j], indices2[i, j], d, e ] = updates[i, j] ``` In this example `start_axis = 3` and `num_index_tensors = 2`. In terms of implementation details, our strategy is to iterate over all indices in a common iteration range. The idea is that we can map indices in this range to the write location in `input_tensor` as well as the data location in `updates`.
An example illustrates this best: Imagine the `input_tensor` shape is \[A, B, C, D] and we have indexing tensors I1 and I2 with shape \[M, N, K]. Assume I1 and I2 are applied to dimensions 1 and 2. An appropriate common iteration range is then (A, M, N, K, D). Note we expect `updates` to have the shape \[A, M, N, K, D]. We can show this by providing the mappings into `updates` and `input_tensor`: Consider an arbitrary set of indices in this range (a, m, n, k, d): \- The index into `updates` is (a, m, n, k, d). \- The index into `input_tensor` is (a, I1\[m, n, k], I2\[m, n, k], d). TODO(GEX-1951): Support boolean tensor masks. TODO(GEX-1952): Support the non-contiguous indexing tensor case. TODO(GEX-1953): Support fusion (especially view-fusion). TODO(GEX-1954): Unify getitem and setitem using generic views. (Requires non-strided view functions). **Parameters:** * ​input\_rank (`Int`): The rank of the input tensor. * ​index\_rank (`Int`): The rank of the indexing tensors. * ​updates\_rank (`Int`): The rank of the updates tensor. * ​input\_type (`DType`): The dtype of the input tensor. * ​index\_type (`DType`): The dtype of the indexing tensors. * ​start\_axis (`Int`): The first dimension in input where the indexing tensors are applied. It is assumed the indexing tensors are applied in consecutive dimensions. * ​num\_index\_tensors (`Int`): The number of indexing tensors. * ​target (`StringSlice[StaticConstantOrigin]`): The target architecture to operate on. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​trace\_description (`StringSlice[StaticConstantOrigin]`): For profiling, the trace name the operation will appear under. * ​updates\_tensor\_fn (`fn[Int](IndexList[updates_rank]) capturing -> SIMD[input_type, $0]`): Fusion lambda for the update tensor. * ​indices\_fn (`fn[Int](IndexList[index_rank]) capturing -> SIMD[index_type, 1]`): Fusion lambda for the indices tensors. **Args:** * ​input\_tensor (`NDBuffer[input_type, input_rank, origin]`): The input tensor being indexed into and modified in-place. * ​index\_tensor\_shape (`IndexList[index_rank, element_type=element_type]`): The shape of each index tensor. * ​updates\_tensor\_strides (`IndexList[updates_rank]`): The strides of the update tensor. * ​ctx (`DeviceContextPtr`): The DeviceContextPtr as prepared by the graph compiler. --- ## index_tensor ## Functions * [​`advanced_indexing_getitem`](./advanced_indexing_getitem): Implements basic numpy-style advanced indexing. * [​`advanced_indexing_getitem_shape`](./advanced_indexing_getitem_shape): Calculate the output shape from advanced indexing (see the sketch after this list). * [​`advanced_indexing_setitem_inplace`](./advanced_indexing_setitem_inplace): Implements basic numpy-style advanced indexing with assignment. * [​`index_tensor`](./index_tensor): Index\_tensor operation, based on a modified implementation of gather\_nd. * [​`index_tensor_shape`](./index_tensor_shape): Compute the output shape of an `index_tensor` operation, and assert the inputs are compatible.
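The output-rank rule used by the advanced indexing functions above is visible in their signatures: `(num_index_tensors * -1) + index_rank + input_rank`. A minimal, self-contained Mojo sketch of that arithmetic (not part of this API; `expected_out_rank` is a hypothetical helper):

```mojo
fn expected_out_rank(input_rank: Int, index_rank: Int, num_index_tensors: Int) -> Int:
    # The `num_index_tensors` indexed dimensions collapse into the
    # `index_rank` dimensions of the broadcast index tensors:
    # out_rank = input_rank + index_rank - num_index_tensors.
    return input_rank + index_rank - num_index_tensors

fn main():
    # The getitem docstring example: a rank-7 input indexed by two rank-3
    # index tensors yields out_tensor[a, b, c, i, j, k, d, e], i.e. rank 8.
    print(expected_out_rank(7, 3, 2))  # prints 8
```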
--- ## index_tensor `index_tensor[type: DType, indices_type: DType, data_rank: Int, indices_rank: Int, output_rank: Int, batch_dims: Int, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), single_thread_blocking_override: Bool = False](data: NDBuffer[type, data_rank, origin], indices: NDBuffer[indices_type, indices_rank, origin], output: NDBuffer[type, output_rank, origin], ctx: DeviceContextPtr)` Index\_tensor operation, based on a modified implementation of gather\_nd. **Parameters:** * ​type (`DType`): Type of data tensor. * ​indices\_type (`DType`): Type of indices tensor. * ​data\_rank (`Int`): Rank of data tensor (data\_rank >= 1). * ​indices\_rank (`Int`): Rank of indices tensor (indices\_rank >= 1). * ​output\_rank (`Int`): Rank of output tensor. * ​batch\_dims (`Int`): Number of batch dimensions. Indexing starts from the dimensions of data\[batch\_dims:]. * ​target (`StringSlice[StaticConstantOrigin]`): The target architecture to execute on. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​data (`NDBuffer[type, data_rank, origin]`): Tensor of rank data\_rank >= 1. * ​indices (`NDBuffer[indices_type, indices_rank, origin]`): Tensor of rank indices\_rank >= 1. All index values are expected to be within bounds \[-s, s-1] along axis of size s. It is an error if any of the index values are out of bounds. * ​output (`NDBuffer[type, output_rank, origin]`): Tensor of rank data\_rank + indices\_rank - indices\_shape\[-1] - 1 - batch\_dims. * ​ctx (`DeviceContextPtr`): The DeviceContextPtr as prepared by the graph compiler. --- ## index_tensor_shape `index_tensor_shape[input_rank: Int, indices_rank: Int, output_rank: Int, input_type: DType, indices_type: DType, batch_dims: Int, single_thread_blocking_override: Bool = True](input_buf: NDBuffer[input_type, input_rank, origin], indices_buf: NDBuffer[indices_type, indices_rank, origin]) -> IndexList[output_rank]` Compute the output shape of an `index_tensor` operation, and assert the inputs are compatible. **Parameters:** * ​input\_rank (`Int`): Rank of the input tensor. * ​indices\_rank (`Int`): Rank of the indices tensor. * ​output\_rank (`Int`): Rank of the output tensor. * ​input\_type (`DType`): Type of the input tensor. * ​indices\_type (`DType`): Type of the indices tensor. * ​batch\_dims (`Int`): Batch dimensions. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input\_buf (`NDBuffer[input_type, input_rank, origin]`): The input tensor. * ​indices\_buf (`NDBuffer[indices_type, indices_rank, origin]`): The indices tensor. **Returns:** The output shape. --- ## global_cache_insert `global_cache_insert(key: String, value: UnsafePointer[NoneType])` --- ## global_cache_lookup `global_cache_lookup(key: String) -> UnsafePointer[NoneType]` --- ## irfft Inverse real FFT kernel using cuFFT. ## Functions * [​`global_cache_insert`](./global_cache_insert): * [​`global_cache_lookup`](./global_cache_lookup): * [​`irfft`](./irfft): Compute the inverse real FFT of the input tensor. --- ## irfft `irfft[input_rank: Int, input_type: DType, output_type: DType](input: NDBuffer[input_type, input_rank, origin], output: NDBuffer[output_type, input_rank, origin], n: Int, ctx: DeviceContext)` Compute the inverse real FFT of the input tensor. Currently, the transform is only applied to the last dimension.
**Args:** * ​input (`NDBuffer[input_type, input_rank, origin]`): Complex input tensor (NDBuffer). * ​output (`NDBuffer[output_type, input_rank, origin]`): Real output tensor (NDBuffer). * ​n (`Int`): Output signal size. * ​ctx (`DeviceContext`): Device context. --- ## generic_flash_attention_kv_cache_padded `generic_flash_attention_kv_cache_padded[collection_t: KVCollectionT, type: DType, //, *, target: StringSlice[StaticConstantOrigin], mask_str: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], local_window_size: Int = -1, num_heads: Int = -1](q: NDBuffer[type, 4, origin, shape, strides], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], valid_lengths: ManagedTensorSlice[io_spec, static_spec=static_spec], scale: SIMD[float32, 1], output: NDBuffer[type, 4, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_flash_attention_kv_cache_padded_materialized_mask `generic_flash_attention_kv_cache_padded_materialized_mask[collection_t: KVCollectionT, type: DType, //, *, target: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], local_window_size: Int = -1, num_heads: Int = -1](q: NDBuffer[type, 4, origin, shape, strides], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], mask: NDBuffer[type, rank, origin, shape, strides], valid_lengths: ManagedTensorSlice[io_spec, static_spec=static_spec], scale: SIMD[float32, 1], output: NDBuffer[type, 4, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_fused_qk_rope_bshd_continuous_batch `generic_fused_qk_rope_bshd_continuous_batch[type: DType, //, *, interleaved: Bool, target: StringSlice[StaticConstantOrigin]](q_proj: NDBuffer[type, 4, origin, shape, strides], kv_collection: ContinuousBatchingKVCacheCollection[type_, kv_params_], freqs_cis: NDBuffer[type, 2, origin, shape, strides], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 4, origin, shape, strides], context: DeviceContextPtr = DeviceContextPtr())` Performs a fused RoPE projection for Q and K projections. We have a manually fused QKV projection with mo.opaque types in our Llama model. Due to a limitation in custom op definitions, we can't declare both a tensor and opaque type as output from a custom kernel. This requires us to only note Q\_proj as an output from the QKV projection. If we immediately follow the QKV proj kernel with a RoPE kernel applied to K, we'll get a race condition because the graph compiler doesn't know about the dependency between these kernels in the graph definition. Here we fuse the RoPE kernel applied to Q\_proj with K\_proj, so K\_proj RoPE is only executed after QKV completes. --- ## generic_fused_qkv_matmul_kv_cache_bshd_continuous_batch `generic_fused_qkv_matmul_kv_cache_bshd_continuous_batch[type: DType, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](hidden_state: NDBuffer[type, 3, origin, shape], weight: NDBuffer[type, 2, origin, shape], kv_collection: ContinuousBatchingKVCacheCollection[type_, kv_params_], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 3, origin, shape], ctx: DeviceContextPtr)` Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. **Args:** * ​hidden\_state (`NDBuffer[type, 3, origin, shape]`): Tensor with shape (batch\_size, seq\_len, num\_heads \* head\_size). * ​weight (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size).
* ​kv\_collection (`ContinuousBatchingKVCacheCollection[type_, kv_params_]`): The historical KVCache for keys and values. The KVCache for this layer is retrieved via layer\_idx. * ​layer\_idx (`SIMD[uint32, 1]`): The index of the layer being executed. Used to retrieve the KVCache for the given layer from kv\_collection. * ​output (`NDBuffer[type, 3, origin, shape]`): The pre-allocated output buffer for Q projections. K and V projections are written in-place to k\_cache and v\_cache. * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## generic_get_continuous_cache `generic_get_continuous_cache[type: DType, kv_params: KVCacheStaticParams](blocks: NDBuffer[type, 6, origin], cache_lengths: NDBuffer[uint32, 1, origin], lookup_table: NDBuffer[uint32, 1, origin], max_lengths: NDBuffer[uint32, 2, origin]) -> ContinuousBatchingKVCacheCollection[type, kv_params]` --- ## generic_get_paged_cache `generic_get_paged_cache[type: DType, kv_params: KVCacheStaticParams, page_size: Int](blocks: NDBuffer[type, 6, origin], cache_lengths: NDBuffer[uint32, 1, origin], lookup_table: NDBuffer[uint32, 2, origin], max_lengths: NDBuffer[uint32, 2, origin], out result: PagedKVCacheCollection[type, kv_params, page_size])` --- ## kv_cache ## Aliases ### `embed_fn_type` `alias embed_fn_type = fn[DType, Int](IndexList[4], SIMD[$0, $1]) capturing -> SIMD[$0, $1]` ## Functions * [​`generic_flash_attention_kv_cache_padded`](./generic_flash_attention_kv_cache_padded): * [​`generic_flash_attention_kv_cache_padded_materialized_mask`](./generic_flash_attention_kv_cache_padded_materialized_mask): * [​`generic_fused_qk_rope_bshd_continuous_batch`](./generic_fused_qk_rope_bshd_continuous_batch): Performs a fused RoPE projection for Q and K projections. * [​`generic_fused_qkv_matmul_kv_cache_bshd_continuous_batch`](./generic_fused_qkv_matmul_kv_cache_bshd_continuous_batch): Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. * [​`generic_get_continuous_cache`](./generic_get_continuous_cache): * [​`generic_get_paged_cache`](./generic_get_paged_cache): * [​`managed_tensor_slice_to_ndbuffer`](./managed_tensor_slice_to_ndbuffer): * [​`print_kv_cache_cont_batch_generic_cpu`](./print_kv_cache_cont_batch_generic_cpu): * [​`print_kv_cache_cont_batch_generic_gpu`](./print_kv_cache_cont_batch_generic_gpu): * [​`print_kv_cache_paged_generic_cpu`](./print_kv_cache_paged_generic_cpu): * [​`print_kv_cache_paged_generic_gpu`](./print_kv_cache_paged_generic_gpu): * [​`rms_norm_kv_cache_ragged_continuous_batching`](./rms_norm_kv_cache_ragged_continuous_batching): Performs RMSNorm in place on new entries in the key cache. * [​`rms_norm_kv_cache_ragged_paged`](./rms_norm_kv_cache_ragged_paged): Performs RMSNorm in place on new entries in the key cache. 
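The padded flash-attention entry points listed above take a `valid_lengths` argument with one entry per batch row, giving each sequence's un-padded length. A minimal sketch of that bookkeeping, with hypothetical values and a plain `List` standing in for the real tensor types:

```mojo
def main():
    # Hypothetical valid lengths for a batch of 3 sequences,
    # padded out to max_seq_len = 5.
    var valid_lengths = List[Int](2, 5, 3)
    var max_seq_len = 5
    for b in range(len(valid_lengths)):
        # Positions [0, valid_lengths[b]) hold real tokens; the remainder
        # is padding that the kernel must not attend to.
        var pad = max_seq_len - valid_lengths[b]
        print("batch", b, "real tokens:", valid_lengths[b], "padding:", pad)
```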
--- ## managed_tensor_slice_to_ndbuffer `managed_tensor_slice_to_ndbuffer[: DType, : Int, spec: StaticTensorSpec[$0, $1], //](tensor: ManagedTensorSlice[io_spec, static_spec=spec]) -> NDBuffer[dtype, rank, MutableAnyOrigin, spec.shape, spec.strides, alignment=spec.alignment, address_space=spec.address_space, exclusive=spec.exclusive]` --- ## print_kv_cache_cont_batch_generic_cpu `print_kv_cache_cont_batch_generic_cpu[target: StringSlice[StaticConstantOrigin], type: DType, kv_params: KVCacheStaticParams](valid_lengths: NDBuffer[uint32, 1, origin], kv_collection: ContinuousBatchingKVCacheCollection[type, kv_params], layer_idx: SIMD[uint32, 1], is_print_compact: Bool, context: DeviceContextPtr)` --- ## print_kv_cache_cont_batch_generic_gpu `print_kv_cache_cont_batch_generic_gpu[target: StringSlice[StaticConstantOrigin], type: DType, kv_params: KVCacheStaticParams](valid_lengths: NDBuffer[uint32, 1, origin], kv_collection: ContinuousBatchingKVCacheCollection[type, kv_params], layer_idx: SIMD[uint32, 1], is_print_compact: Bool, context: DeviceContextPtr)` --- ## print_kv_cache_paged_generic_cpu `print_kv_cache_paged_generic_cpu[target: StringSlice[StaticConstantOrigin], type: DType, kv_params: KVCacheStaticParams, page_size: Int](valid_lengths: NDBuffer[uint32, 1, origin], kv_collection: PagedKVCacheCollection[type, kv_params, page_size], layer_idx: SIMD[uint32, 1], is_print_compact: Bool, context: DeviceContextPtr)` --- ## print_kv_cache_paged_generic_gpu `print_kv_cache_paged_generic_gpu[target: StringSlice[StaticConstantOrigin], type: DType, kv_params: KVCacheStaticParams, page_size: Int](valid_lengths: NDBuffer[uint32, 1, origin], kv_collection: PagedKVCacheCollection[type, kv_params, page_size], layer_idx: SIMD[uint32, 1], is_print_compact: Bool, context: DeviceContextPtr)` --- ## rms_norm_kv_cache_ragged_continuous_batching `rms_norm_kv_cache_ragged_continuous_batching[type: DType, num_heads: Int, head_dim: Int, //, target: StringSlice[StaticConstantOrigin], multiply_before_cast: Bool, per_head_norm: Bool](kv_collection: ContinuousBatchingKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim))], gamma: NDBuffer[type, 1, origin, shape, strides], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], layer_idx: SIMD[uint32, 1], total_seq_len: SIMD[uint32, 1], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], context: DeviceContextPtr)` Performs RMSNorm in place on new entries in the key cache. This is done by first creating a ragged tensor with shape (total\_seq\_len, num\_heads, head\_dim) over the new token entries. To do this we need to pass in `total_seq_len` on the host. Then, using `input_row_offsets` we find the corresponding batch and token index, and use that together with the static head and channel indices to store to/load from the key cache. This uses the input/output lambdas on the RMSNorm kernel. This function can apply RMSNorm to a subset of dimensions in each head, determined by the size of the gamma tensor.
In this case, it operates on a ragged tensor view of the key cache with shape (total\_seq\_len, num\_heads, rms\_norm\_cols), where rms\_norm\_cols is the length of gamma and must be no larger than head\_dim. --- ## rms_norm_kv_cache_ragged_paged `rms_norm_kv_cache_ragged_paged[type: DType, num_heads: Int, head_dim: Int, //, target: StringSlice[StaticConstantOrigin], multiply_before_cast: Bool, per_head_norm: Bool](kv_collection: PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size], gamma: NDBuffer[type, 1, origin, shape, strides], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], layer_idx: SIMD[uint32, 1], total_seq_len: SIMD[uint32, 1], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], context: DeviceContextPtr)` Performs RMSNorm in place on new entries in the key cache. This is done by first creating a ragged tensor with shape (total\_seq\_len, num\_heads, head\_dim) over the new token entries. To do this we need to pass in `total_seq_len` on the host. Then, using `input_row_offsets` we find the corresponding batch and token index, and use that together with the static head and channel indices to store to/load from the key cache. This uses the input/output lambdas on the RMSNorm kernel. This function can apply RMSNorm to a subset of dimensions in each head, determined by the size of the gamma tensor. In this case, it operates on a ragged tensor view of the key cache with shape (total\_seq\_len, num\_heads, rms\_norm\_cols), where rms\_norm\_cols is the length of gamma and must be no larger than head\_dim. --- ## generic_cross_attention_kv_cache `generic_cross_attention_kv_cache[collection_t: KVCollectionT, type: DType, //, target: StringSlice[StaticConstantOrigin], mask_str: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], local_window_size: Int = -1](q: NDBuffer[type, 3, origin, shape, strides], q_input_row_offsets: ManagedTensorSlice[io_spec, static_spec=static_spec], q_max_seq_len: NDBuffer[uint32, 1, origin, shape, strides], kv_input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], scale: SIMD[float32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_flare_mla_decode_kv_cache_ragged `generic_flare_mla_decode_kv_cache_ragged[collection_t: KVCollectionT, type: DType, //, mask_str: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], target: StringSlice[StaticConstantOrigin], local_window_size: Int = -1](q: NDBuffer[type, 3, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], scale: SIMD[float32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_flare_mla_decompress_k_cache_ragged_paged `generic_flare_mla_decompress_k_cache_ragged_paged[target: StringSlice[StaticConstantOrigin], type: DType](buffer_row_offsets_1d: NDBuffer[uint32, 1, origin, shape, strides], cache_offsets_1d: NDBuffer[uint32, 1, origin, shape, strides], buffer_length: SIMD[int32, 1], weight: NDBuffer[type, 2, origin, shape, strides], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size], layer_idx: SIMD[uint32, 1], k_latent_buffer: NDBuffer[type, 2, origin, shape, strides], k_buffer: NDBuffer[type, 2, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_flare_mla_prefill_kv_cache_ragged `generic_flare_mla_prefill_kv_cache_ragged[collection_t: KVCollectionT, type:
DType, //, softmax_type: DType, write_softmax_info: Bool, use_cascade_attention: Bool, mask_str: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], target: StringSlice[StaticConstantOrigin], local_window_size: Int = -1](q: NDBuffer[type, 3, origin, shape, strides], k: NDBuffer[type, 3, origin, shape, strides], v: NDBuffer[type, 3, origin, shape, strides], buffer_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], cache_offsets: NDBuffer[uint32, 1, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], scale: SIMD[float32, 1], output: NDBuffer[type, 3, origin, shape, strides], softmax_info: NDBuffer[softmax_type, 3, MutableAnyOrigin], context: DeviceContextPtr, prev_output: OptionalReg[NDBuffer[type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[type, 3, MutableAnyOrigin]]({:i1 0, 1}), prev_softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}))` --- ## generic_flare_mla_prefill_ragged_paged_plan `generic_flare_mla_prefill_ragged_paged_plan[target: StringSlice[StaticConstantOrigin]](input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size], layer_idx: SIMD[uint32, 1], buffer_token_size: SIMD[uint32, 1], buffer_row_offsets: NDBuffer[uint32, 2, origin, shape, strides], cache_offsets: NDBuffer[uint32, 2, origin, shape, strides], buffer_lengths: NDBuffer[int32, 1, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_flash_attention_kv_cache_ragged `generic_flash_attention_kv_cache_ragged[collection_t: KVCollectionT, type: DType, //, *, target: StringSlice[StaticConstantOrigin], mask_str: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], local_window_size: Int = -1](q: NDBuffer[type, 3, origin, shape, strides], input_row_offsets: ManagedTensorSlice[io_spec, static_spec=static_spec], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], scale: SIMD[float32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_fused_qk_rope_bshd_continuous_batch_ragged `generic_fused_qk_rope_bshd_continuous_batch_ragged[type: DType, //, *, interleaved: Bool, target: StringSlice[StaticConstantOrigin]](q_proj: NDBuffer[type, 3, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: ContinuousBatchingKVCacheCollection[type_, kv_params_], freqs_cis: NDBuffer[type, 2, origin, shape, strides], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_fused_qk_rope_bshd_paged_ragged `generic_fused_qk_rope_bshd_paged_ragged[type: DType, //, *, interleaved: Bool, target: StringSlice[StaticConstantOrigin]](q_proj: NDBuffer[type, 3, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size], freqs_cis: NDBuffer[type, 2, origin, shape, strides], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: DeviceContextPtr = DeviceContextPtr())` Performs a fused RoPE projection for Q and K projections. We have a manually fused QKV projection with mo.opaque types in our Llama model. Due to a limitation in custom op definitions, we can't declare both a tensor and opaque type as output from a custom kernel. 
This requires us to only note Q\_proj as an output from the QKV projection. If we immediately follow the QKV proj kernel with a RoPE kernel applied to K, we'll get a race condition because the graph compiler doesn't know about the dependency between these kernels in the graph definition. Here we fuse the RoPE kernel applied to Q\_proj with K\_proj, so K\_proj RoPE is only executed after QKV completes. --- ## generic_fused_qkv_matmul_kv_cache_cont_batch_ragged `generic_fused_qkv_matmul_kv_cache_cont_batch_ragged[type: DType, //, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](hidden_state: NDBuffer[type, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[type, 2, origin, shape], kv_collection: ContinuousBatchingKVCacheCollection[type_, kv_params_], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 2, origin, shape], ctx: DeviceContextPtr)` Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,). The value at each index is the start\_idx of the corresponding batch in hidden\_state. * ​weight (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`ContinuousBatchingKVCacheCollection[type_, kv_params_]`): The object storing the KVCache for this layer. * ​layer\_idx (`SIMD[uint32, 1]`): The current layer, used to retrieve the KVCache object from kv\_collection. * ​output (`NDBuffer[type, 2, origin, shape]`): The pre-allocated output buffer for Q projections. K and V projections are written in-place to k\_cache and v\_cache. Shape: (sum(seq\_lens), num\_heads \* head\_size). * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## generic_fused_qkv_matmul_kv_cache_paged_ragged `generic_fused_qkv_matmul_kv_cache_paged_ragged[type: DType, weight_type: DType, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), group_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), has_zp: OptionalReg[Bool] = OptionalReg[Bool]({:i1 0, 1})](hidden_state: NDBuffer[type, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[weight_type, 2, origin, shape], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 2, origin, shape], ctx: DeviceContextPtr)` Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,). The value at each index is the start\_idx of the corresponding batch in hidden\_state. * ​weight (`NDBuffer[weight_type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`PagedKVCacheCollection[type_, kv_params_, page_size]`): The object storing the KVCache for this layer. 
* ​layer\_idx (`SIMD[uint32, 1]`): The current layer, used to retrieve the KVCache object from kv\_collection. * ​output (`NDBuffer[type, 2, origin, shape]`): The pre-allocated output buffer for Q projections. K and V projections are written in-place to k\_cache and v\_cache. Shape: (sum(seq\_lens), num\_heads \* head\_size). * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## generic_fused_qkv_matmul_kv_cache_paged_ragged_bias `generic_fused_qkv_matmul_kv_cache_paged_ragged_bias[type: DType, weight_type: DType, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), group_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), has_zp: OptionalReg[Bool] = OptionalReg[Bool]({:i1 0, 1})](hidden_state: NDBuffer[type, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[weight_type, 2, origin, shape], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 2, origin, shape], bias: NDBuffer[type, 1, origin], ctx: DeviceContextPtr)` Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,). The value at each index is the start\_idx of the corresponding batch in hidden\_state. * ​weight (`NDBuffer[weight_type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`PagedKVCacheCollection[type_, kv_params_, page_size]`): The object storing the KVCache for this layer. * ​layer\_idx (`SIMD[uint32, 1]`): The current layer, used to retrieve the KVCache object from kv\_collection. * ​output (`NDBuffer[type, 2, origin, shape]`): The pre-allocated output buffer for Q projections. K and V projections are written in-place to k\_cache and v\_cache. Shape: (sum(seq\_lens), num\_heads \* head\_size). * ​bias (`NDBuffer[type, 1, origin]`): Bias to be added to the QKV tensor; the bias is the concatenation of the q, k, and v biases. Rank 1. * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## generic_fused_qkv_matmul_kv_cache_paged_ragged_scale `generic_fused_qkv_matmul_kv_cache_paged_ragged_scale[type: DType, weight_type: DType, output_type: DType, scale_type: DType, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](hidden_state: NDBuffer[type, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[weight_type, 2, origin, shape], input_scale: NDBuffer[scale_type, 2, origin, shape], weight_scale: NDBuffer[scale_type, 2, origin, shape], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size], layer_idx: SIMD[uint32, 1], output: NDBuffer[output_type, 2, origin, shape], ctx: DeviceContextPtr)` Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,). The value at each index is the start\_idx of the corresponding batch in hidden\_state.
* ​weight (`NDBuffer[weight_type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​input\_scale (`NDBuffer[scale_type, 2, origin, shape]`): Scale to be multiplied with the input tensor. * ​weight\_scale (`NDBuffer[scale_type, 2, origin, shape]`): Scale to be multiplied with the weight tensor. * ​kv\_collection (`PagedKVCacheCollection[type_, kv_params_, page_size]`): The object storing the KVCache for this layer. * ​layer\_idx (`SIMD[uint32, 1]`): The current layer, used to retrieve the KVCache object from kv\_collection. * ​output (`NDBuffer[output_type, 2, origin, shape]`): The pre-allocated output buffer for Q projections. K and V projections are written in-place to k\_cache and v\_cache. Shape: (sum(seq\_lens), num\_heads \* head\_size). * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## kv_cache_ragged ## Functions * [​`generic_cross_attention_kv_cache`](./generic_cross_attention_kv_cache): * [​`generic_flare_mla_decode_kv_cache_ragged`](./generic_flare_mla_decode_kv_cache_ragged): * [​`generic_flare_mla_decompress_k_cache_ragged_paged`](./generic_flare_mla_decompress_k_cache_ragged_paged): * [​`generic_flare_mla_prefill_kv_cache_ragged`](./generic_flare_mla_prefill_kv_cache_ragged): * [​`generic_flare_mla_prefill_ragged_paged_plan`](./generic_flare_mla_prefill_ragged_paged_plan): * [​`generic_flash_attention_kv_cache_ragged`](./generic_flash_attention_kv_cache_ragged): * [​`generic_fused_qk_rope_bshd_continuous_batch_ragged`](./generic_fused_qk_rope_bshd_continuous_batch_ragged): * [​`generic_fused_qk_rope_bshd_paged_ragged`](./generic_fused_qk_rope_bshd_paged_ragged): Performs a fused RoPE projection for Q and K projections. * [​`generic_fused_qkv_matmul_kv_cache_cont_batch_ragged`](./generic_fused_qkv_matmul_kv_cache_cont_batch_ragged): Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. * [​`generic_fused_qkv_matmul_kv_cache_paged_ragged`](./generic_fused_qkv_matmul_kv_cache_paged_ragged): Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. * [​`generic_fused_qkv_matmul_kv_cache_paged_ragged_bias`](./generic_fused_qkv_matmul_kv_cache_paged_ragged_bias): Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. * [​`generic_fused_qkv_matmul_kv_cache_paged_ragged_scale`](./generic_fused_qkv_matmul_kv_cache_paged_ragged_scale): Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. * [​`k_matmul_ragged_paged`](./k_matmul_ragged_paged): Performs a matmul, writing the output into a mutable PagedKVCacheCollection object. * [​`kv_matmul_ragged_paged`](./kv_matmul_ragged_paged): Performs a matmul, writing the output into a mutable PagedKVCacheCollection object. * [​`unfused_qkv_matmul_ragged_paged_gguf_quantized`](./unfused_qkv_matmul_ragged_paged_gguf_quantized): Performs a quantized matmul, writing the output into a mutable PagedKVCacheCollection object.
* [​`valid_length_managed_tensor_slice_to_ndbuffer`](./valid_length_managed_tensor_slice_to_ndbuffer): --- ## k_matmul_ragged_paged `k_matmul_ragged_paged[type: DType, num_heads: Int, head_dim: Int, page_size: Int, //, target: StringSlice[StaticConstantOrigin]](hidden_state: NDBuffer[type, 2, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[type, 2, origin, shape, strides], kv_collection: PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size], layer_idx: SIMD[uint32, 1], ctx: DeviceContextPtr)` Performs a matmul, writing the output into a mutable PagedKVCacheCollection object. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape, strides]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,) denoting the start of each sequence along the seq\_len dimension. * ​weight (`NDBuffer[type, 2, origin, shape, strides]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size]`): The historical KVCache for keys and values. The KVCache for this layer is retrieved via layer\_idx. * ​layer\_idx (`SIMD[uint32, 1]`): The index of the layer being executed. Used to retrieve the KVCache for the given layer from kv\_collection. * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## kv_matmul_ragged_paged `kv_matmul_ragged_paged[type: DType, num_heads: Int, head_dim: Int, page_size: Int, //, target: StringSlice[StaticConstantOrigin]](hidden_state: NDBuffer[type, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[type, 2, origin, shape], kv_collection: PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size], layer_idx: SIMD[uint32, 1], ctx: DeviceContextPtr)` Performs a matmul, writing the output into a mutable PagedKVCacheCollection object. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,) denoting the start of each sequence along the seq\_len dimension. * ​weight (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size]`): The historical KVCache for keys and values. The KVCache for this layer is retrieved via layer\_idx. * ​layer\_idx (`SIMD[uint32, 1]`): The index of the layer being executed. Used to retrieve the KVCache for the given layer from kv\_collection. * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler.
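As documented above, `input_row_offsets` has shape (batch\_size + 1,) and stores where each sequence starts in the ragged `hidden_state`. A minimal sketch (hypothetical values; a plain `List` stands in for the real `NDBuffer`) of how per-sequence lengths fall out of the offsets:

```mojo
def main():
    # Ragged batch of 3 sequences with lengths 4, 1, and 7,
    # so sum(seq_lens) == 12 and the final offset is 12.
    var input_row_offsets = List[Int](0, 4, 5, 12)
    for i in range(len(input_row_offsets) - 1):
        # Sequence i occupies rows [offsets[i], offsets[i + 1]) of hidden_state.
        var start = input_row_offsets[i]
        var seq_len = input_row_offsets[i + 1] - input_row_offsets[i]
        print("sequence", i, "starts at row", start, "and spans", seq_len, "rows")
```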
--- ## unfused_qkv_matmul_ragged_paged_gguf_quantized `unfused_qkv_matmul_ragged_paged_gguf_quantized[type: DType, num_heads: Int, head_dim: Int, page_size: Int, //, quantization_encoding_q: StringSlice[StaticConstantOrigin], quantization_encoding_k: StringSlice[StaticConstantOrigin], quantization_encoding_v: StringSlice[StaticConstantOrigin]](hidden_state: NDBuffer[float32, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], q_weight: NDBuffer[uint8, 2, origin, shape], k_weight: NDBuffer[uint8, 2, origin, shape], v_weight: NDBuffer[uint8, 2, origin, shape], kv_collection: PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size], layer_idx: SIMD[uint32, 1], output: NDBuffer[float32, 2, origin, shape], ctx: DeviceContextPtr)` Performs a quantized matmul, writing the output into a mutable PagedKVCacheCollection object. Unlike the un-quantized version (kv\_matmul\_ragged\_continuous\_batching), this implementation does not concatenate the q, k, and v weights. Instead, it performs three matmuls. This allows the q, k, and v weights to have different quantization encodings. This is only supported on CPU. **Args:** * ​hidden\_state (`NDBuffer[float32, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,) denoting the start of each sequence along the seq\_len dimension. * ​q\_weight (`NDBuffer[uint8, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​k\_weight (`NDBuffer[uint8, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​v\_weight (`NDBuffer[uint8, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size]`): The Collection object storing KVCache entries. * ​layer\_idx (`SIMD[uint32, 1]`): The index of the layer being executed. Used to retrieve the KVCache for the given layer from kv\_collection. * ​output (`NDBuffer[float32, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_kv\_heads \* head\_size). This is the output buffer for the Q matmul. * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler.
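Since the q, k, and v weights stay separate here (so each can carry its own GGUF encoding), the kernel amounts to three independent matmuls over the same ragged token rows. A rough bookkeeping sketch, with every dimension hypothetical rather than taken from this API:

```mojo
def main():
    # Hypothetical sizes; the real shapes come from the kernel's arguments.
    var num_tokens = 12   # sum(seq_lens) across the ragged batch
    var q_cols = 8 * 64   # columns of the Q projection result
    var kv_cols = 2 * 64  # columns of the K and V projection results
    # One matmul per projection: Q lands in `output`, K and V are written
    # into the paged KV cache rather than returned.
    print("q matmul result:", num_tokens, "x", q_cols, "-> output buffer")
    print("k matmul result:", num_tokens, "x", kv_cols, "-> key cache")
    print("v matmul result:", num_tokens, "x", kv_cols, "-> value cache")
```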
--- ## valid_length_managed_tensor_slice_to_ndbuffer `valid_length_managed_tensor_slice_to_ndbuffer(tensor: ManagedTensorSlice[io_spec, static_spec=static_spec]) -> NDBuffer[uint32, 1, MutableAnyOrigin]` --- ## flash_attention `flash_attention[rank: Int, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), UInt(1), FlashAttentionAlgorithm()), decoding_warp_split_k: Bool = False, naive_kernel: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: NDBuffer[type, rank, origin, shape, strides], v: NDBuffer[type, rank, origin, shape, strides], mask: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], scale: SIMD[float32, 1], context: DeviceContextPtr = DeviceContextPtr(), num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` `flash_attention[rank: Int, cache_t: KVCacheT, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), UInt(1), FlashAttentionAlgorithm()), ragged: Bool = False, decoding_warp_split_k: Bool = False, naive_kernel: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: cache_t, v: cache_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: ManagedTensorSlice[io_spec, static_spec=static_spec], scale: SIMD[float32, 1], ctx: DeviceContext, q_max_seq_len: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` Flash attention 2 algorithm. Compute: (1) Transpose (Q) BSHD -> BHSD; (2) Transpose (K) BSHD -> BHSD; (3) Transpose (V) BSHD -> BHSD; (4) P = Bmm(Q, K), P is also called "score"; (5) P = P \* scale + mask; (6) P = softmax(P); (7) O = Bmm(P, V); (8) Output = Transpose(O). B, S, H, D denote batch size, sequence length, head count and depth, respectively. (1), (2), (3) happen while loading the data into shared memory. (8) happens when writing output to global memory. All inputs (query, key, and value) must have BSHD layout. The mask can be BSS or BHSS. This kernel also handles the grouped attention optimization. In this case the shapes of K and V are BShD where h = H / num\_groups.
This kernel handles batches with different valid lengths (i.e., before padding). Such lengths are passed in the valid\_length argument. `flash_attention[rank: Int, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), UInt(1), FlashAttentionAlgorithm()), decoding_warp_split_k: Bool = False, _use_valid_length: Bool = False, _padded_ndbuffer: Bool = False, naive_kernel: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: NDBuffer[type, rank, origin, shape, strides], v: NDBuffer[type, rank, origin, shape, strides], mask_functor: mask_t, score_mod_functor: score_mod_t, scale: SIMD[float32, 1], ctx: DeviceContext, num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), valid_length: OptionalReg[ManagedTensorSlice[IOSpec(), static_spec=create_unknown()]] = OptionalReg[ManagedTensorSlice[IOSpec(), static_spec=create_unknown()]]({:i1 0, 1}))` --- ## flash_attention_dispatch `flash_attention_dispatch[rank: Int, k_t: MHAOperand, v_t: MHAOperand, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, kv_num_heads: Int, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), UInt(1), FlashAttentionAlgorithm()), ragged: Bool = False, _is_flash_attention_applicable: Bool = True, _is_cache_length_accurate: Bool = False, _use_valid_length: Bool = True, _padded_ndbuffer: Bool = False, decoding_warp_split_k: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: k_t, v: v_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: ManagedTensorSlice[io_spec, static_spec=static_spec], max_prompt_len: Int, max_cache_valid_length: Int, scale: SIMD[float32, 1], is_token_generation: Bool, ctx: DeviceContext, kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` --- ## flash_attention_hw_supported `flash_attention_hw_supported[qkv_type: DType]() -> Bool` --- ## get_mha_decoding_num_partitions `get_mha_decoding_num_partitions[num_heads: Int, group: Int](batch_size: Int, num_keys: Int, ctx: DeviceContext) -> Int` --- ## mha ## Functions * [​`flash_attention`](./flash_attention): * [​`flash_attention_dispatch`](./flash_attention_dispatch): *
[​`flash_attention_hw_supported`](./flash_attention_hw_supported): * [​`get_mha_decoding_num_partitions`](./get_mha_decoding_num_partitions): * [​`managed_tensor_slice_to_ndbuffer`](./managed_tensor_slice_to_ndbuffer): * [​`mha`](./mha): * [​`mha_decoding`](./mha_decoding): * [​`mha_decoding_single_batch`](./mha_decoding_single_batch): Flash attention v2 algorithm. * [​`mha_decoding_single_batch_pipelined`](./mha_decoding_single_batch_pipelined): Flash attention v2 algorithm. * [​`mha_gpu_naive`](./mha_gpu_naive): * [​`mha_single_batch`](./mha_single_batch): MHA for token gen where seqlen = 1 and num\_keys >= 1. * [​`mha_single_batch_pipelined`](./mha_single_batch_pipelined): MHA for token gen where seqlen = 1 and num\_keys >= 1. * [​`mha_splitk_reduce`](./mha_splitk_reduce): * [​`scale_and_mask_helper`](./scale_and_mask_helper): --- ## managed_tensor_slice_to_ndbuffer `managed_tensor_slice_to_ndbuffer[: DType, : Int, spec: StaticTensorSpec[$0, $1], //](tensor: ManagedTensorSlice[io_spec, static_spec=spec]) -> NDBuffer[dtype, rank, MutableAnyOrigin, spec.shape, spec.strides, alignment=spec.alignment, address_space=spec.address_space, exclusive=spec.exclusive]` --- ## mha `mha[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, config: MHAConfig, group: Int = 1, use_score_mod: Bool = False, ragged: Bool = False, is_shared_kv: Bool = False, _use_valid_length: Bool = False, _is_cache_length_accurate: Bool = False, _padded_ndbuffer: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], scale: SIMD[float32, 1], batch_size: Int, seq_len_arg: Int, num_keys_arg: Int, valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]], mask: mask_t, score_mod: score_mod_t)` --- ## mha_decoding `mha_decoding[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, BM: UInt, BN: UInt, BK: UInt, WM: UInt, WN: UInt, depth: UInt, num_heads: UInt, num_threads: UInt, num_pipeline_stages: UInt, group: UInt = UInt(1), use_score_mod: Bool = False, ragged: Bool = False, is_shared_kv: Bool = False, _use_valid_length: Bool = False, _is_cache_length_accurate: Bool = False, decoding_warp_split_k: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], batch_size: Int, num_partitions: Int, max_cache_valid_length: Int, valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], mask: mask_t, score_mod: score_mod_t)` --- ## mha_decoding_single_batch `mha_decoding_single_batch[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, BM: UInt, BN: UInt, BK: UInt, WM: UInt, WN: UInt, depth: UInt, num_heads: UInt, num_threads: UInt, num_pipeline_stages: UInt, group: UInt = UInt(1), use_score_mod: Bool = False, decoding_warp_split_k: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], num_keys: UInt, num_partitions: UInt, max_cache_valid_length: UInt, mask: mask_t, score_mod: 
score_mod_t, batch_idx: Int)` Flash attention v2 algorithm. --- ## mha_decoding_single_batch_pipelined `mha_decoding_single_batch_pipelined[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, BM: UInt, BN: UInt, BK: UInt, WM: UInt, WN: UInt, depth: UInt, num_heads: UInt, num_threads: UInt, num_pipeline_stages: UInt, group: UInt = UInt(1), use_score_mod: Bool = False, decoding_warp_split_k: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], num_keys: UInt, num_partitions: UInt, max_cache_valid_length: UInt, mask: mask_t, score_mod: score_mod_t, batch_idx: Int)` Flash attention v2 algorithm. --- ## mha_gpu_naive `mha_gpu_naive[output_type: DType, k_t: MHAOperand, v_t: MHAOperand, mask_t: MHAMask, rank: Int, //, ragged: Bool = False, _use_valid_length: Bool = False, _is_cache_length_accurate: Bool = False](q: NDBuffer[type, rank, origin, shape, strides], k: k_t, v: v_t, mask_functor: mask_t, output: NDBuffer[output_type, rank, origin, shape, strides], valid_length: ManagedTensorSlice[io_spec, static_spec=static_spec], scale: SIMD[float32, 1], batch_size: Int, max_prompt_len: Int, max_cache_size: Int, num_heads: Int, depth: Int, group: Int, ctx: DeviceContext)` `mha_gpu_naive[q_type: DType, k_type: DType, v_type: DType, output_type: DType, rank: Int, mask_type: DType, mask_rank: Int, //](q: NDBuffer[q_type, rank, origin, shape, strides], k: NDBuffer[k_type, rank, origin, shape, strides], v: NDBuffer[v_type, rank, origin, shape, strides], mask: NDBuffer[mask_type, mask_rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], output: NDBuffer[output_type, rank, origin, shape, strides], scale: SIMD[float32, 1], batch_size: Int, seq_len: Int, num_keys: Int, num_heads: Int, depth: Int, group: Int, ctx: DeviceContext)` `mha_gpu_naive[q_type: DType, output_type: DType, cache_t: KVCacheT, mask_t: MHAMask, rank: Int, //, ragged: Bool = False](q: NDBuffer[q_type, rank, origin, shape, strides], k: cache_t, v: cache_t, mask_functor: mask_t, output: NDBuffer[output_type, rank, origin, shape, strides], valid_length: ManagedTensorSlice[io_spec, static_spec=static_spec], scale: SIMD[float32, 1], batch_size: Int, max_prompt_len: Int, max_cache_size: Int, num_heads: Int, depth: Int, group: Int, ctx: DeviceContext)` --- ## mha_single_batch `mha_single_batch[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, config: MHAConfig, group: Int = 1, use_score_mod: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], scale: SIMD[float32, 1], seq_len: Int, max_seq_len: Int, start_pos: SIMD[uint32, 1], num_keys: Int, mask_tensor_col: Int, mask: mask_t, score_mod: score_mod_t, batch_idx: Int)` MHA for token gen where seqlen = 1 and num\_keys >= 1. The general data layout and steps conform to flash attention, with two exceptions: (1) partitioning is across B, H, and num\_keys (TODO; the last one is split-K and will need a separate reduction kernel at the end); (2) the first bmm becomes a gemv and the second bmm becomes a gevm. TODO: use more optimized kernels for them.
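To see why the two batched matmuls degenerate in the token-generation case, note that each head's query is a single row when seqlen = 1. With $Q \in \mathbb{R}^{1 \times D}$ and $K, V \in \mathbb{R}^{S \times D}$ (where $S$ = num\_keys):

$$
P = Q K^{\top} \in \mathbb{R}^{1 \times S} \;\;\text{(gemv)}, \qquad O = \operatorname{softmax}(P)\, V \in \mathbb{R}^{1 \times D} \;\;\text{(gevm)}.
$$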
--- ## mha_single_batch_pipelined `mha_single_batch_pipelined[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, config: MHAConfig, group: Int = 1, use_score_mod: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], scale: SIMD[float32, 1], seq_len: Int, max_seq_len: Int, start_pos: SIMD[uint32, 1], num_keys: Int, mask_tensor_col: Int, mask: mask_t, score_mod: score_mod_t, batch_idx: Int)` MHA for token gen where seqlen = 1 and num\_keys >= 1. The general data layout and steps conform to flash attention, with two exceptions: (1) partitioning is across B, H, and num\_keys (TODO; the last one is split-K and will need a separate reduction kernel at the end); (2) the first bmm becomes a gemv and the second bmm becomes a gevm. TODO: use more optimized kernels for them. --- ## mha_splitk_reduce `mha_splitk_reduce[output_type: DType, depth: UInt, num_heads: UInt, num_threads: UInt, group: UInt = UInt(1), use_exp2: Bool = False](intermediate_ptr: UnsafePointer[SIMD[output_type, 1]], output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], batch_size: Int, num_partitions: Int)` --- ## scale_and_mask_helper `scale_and_mask_helper[p_type: DType, p_layout: Layout, mask_t: MHAMask, score_mod_t: ScoreModTrait, group: Int, num_n_mmas: Int, WN: Int, MMA_N: Int, simd_width: Int, use_score_mod: Bool = False](p_reg_tile: LayoutTensor[p_type, p_layout, origin, address_space=AddressSpace(5)], scale: SIMD[float32, 1], num_keys: UInt, bound: UInt, lane: UInt, warp: UInt, mask: mask_t, score_mod: score_mod_t, kv_tile_start_row: Int, mask_stride: UInt, max_seq_len: Int)` --- ## mha_cross ## Functions * [​`mha_cross_gpu_naive`](./mha_cross_gpu_naive): Naive cross attention on GPU. --- ## mha_cross_gpu_naive `mha_cross_gpu_naive[cache_t: KVCacheT, mask_t: MHAMask, type: DType, q_shape: DimList, //, rank: Int](output: NDBuffer[type, rank, MutableAnyOrigin, shape, strides], q: NDBuffer[type, rank, MutableAnyOrigin, q_shape, strides], q_input_row_offsets: NDBuffer[uint32, 1, MutableAnyOrigin, shape, strides], q_max_seq_len: Int, k: cache_t, v: cache_t, kv_input_row_offsets: NDBuffer[uint32, 1, MutableAnyOrigin, shape, strides], mask_functor: mask_t, scale: SIMD[float32, 1], ctx: DeviceContext)` Naive cross attention on GPU. Note that this assumes ragged tensor inputs and uses a mask functor. Computes: (1) Transpose (Q) BSHD -> BHSD; (2) Transpose (K) BSHD -> BHSD; (3) Transpose (V) BSHD -> BHSD; (4) P = Bmm(Q, K), P is also called "score"; (5) P = P \* scale + mask; (6) P = softmax(P); (7) O = Bmm(P, V); (8) Output = Transpose(O). B, S, H, D denote batch size, sequence length, head count, and depth, respectively. Steps (1), (2), and (3) happen while loading the data into shared memory; step (8) happens when writing the output to global memory. All inputs (query, key, and value) must have BSHD layout. The mask can be BSS or BHSS. This kernel also handles the grouped-attention optimization, in which case the shapes of K and V are BShD, where h = H / num\_groups. --- ## AndMask `@register_passable(trivial)` `struct AndMask[T: MHAMask, S: MHAMask, //, lhs: T, rhs: S]` Mask that's the AND of two masks.
## Implemented traits `AnyType`, `Copyable`, `MHAMask`, `Movable`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask = get_vtable_entry(:trait T, "apply_log2e_after_mask") if get_vtable_entry(:trait T, "apply_log2e_after_mask") else get_vtable_entry(:trait S, "apply_log2e_after_mask")` ### `mask_out_of_bound` `alias mask_out_of_bound = get_vtable_entry(:trait T, "mask_out_of_bound") if get_vtable_entry(:trait T, "mask_out_of_bound") else get_vtable_entry(:trait S, "mask_out_of_bound")` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds = get_vtable_entry(:trait S, "mask_safe_out_of_bounds") if get_vtable_entry(:trait T, "mask_safe_out_of_bounds") else get_vtable_entry(:trait T, "mask_safe_out_of_bounds")` ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` ### `status` `status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` --- ## CausalMask `@register_passable(trivial)` `struct CausalMask` MHA causal mask ensures a token is only affected by previous tokens. ## Implemented traits `AnyType`, `Copyable`, `MHAMask`, `Movable`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask = False` ### `mask_out_of_bound` `alias mask_out_of_bound = is_nvidia_gpu()` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds = True` ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` ### `status` `status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` --- ## ChunkedCausalMask `ChunkedCausalMask[local_window_size: Int]() -> OrMask[CausalMask(), ChunkedMask()]` Mask implementing Chunked Causal attention for Llama4 models. This groups the mask into chunks of size `local_window_size` and performs causal attention within each local chunk. Considering the following case: * Q\_len = 7 * K\_len = 10 * start\_pos = 3 * local\_window\_size = 4 The mask will be applied as follows:

```
K > 0 1 2 3 4 5 6 7 8 9
Q v x--------------------x
0 | 1 1 1 1 0 0 0 0 0 0
1 | 0 0 0 0 1 0 0 0 0 0
2 | 0 0 0 0 1 1 0 0 0 0
3 | 0 0 0 0 1 1 1 0 0 0
4 | 0 0 0 0 1 1 1 1 0 0
5 | 0 0 0 0 0 0 0 0 1 0
6 | 0 0 0 0 0 0 0 0 1 1
```
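A minimal usage sketch (hypothetical: the import path for the mask types depends on where the kernels package lives in your checkout, and in practice the surrounding kernel constructs the mask for you):

```mojo
from utils.index import IndexList

# Hypothetical helper; any kernel parameterized on MHAMask could do this.
fn apply_chunked_causal(scores: SIMD[DType.float32, 4]) -> SIMD[DType.float32, 4]:
    var m = ChunkedCausalMask[local_window_size=4]()
    # coord is (batch, head, q_idx, k_idx); the 4 lanes cover k_idx..k_idx+3.
    # For q_idx = 6 and k_idx = 4, keys 4..6 share the query's chunk
    # (positions 4-7) and are causally visible, while key 7 is masked out.
    return m.mask(IndexList[4, element_type = DType.uint32](0, 0, 6, 4), scores)
```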
--- ## ChunkedMask `@register_passable(trivial)` `struct ChunkedMask[local_window_size: Int]` Mask implementing Chunked attention. This groups the mask into chunks of size `local_window_size`. Considering the following case: * Q\_len = 7 * K\_len = 10 * local\_window\_size = 4 The mask will be applied as follows:

```
K > 0 1 2 3 4 5 6 7 8 9
Q v x--------------------x
0 | 1 1 1 1 0 0 0 0 0 0
1 | 0 0 0 0 1 1 1 1 0 0
2 | 0 0 0 0 1 1 1 1 0 0
3 | 0 0 0 0 1 1 1 1 0 0
4 | 0 0 0 0 1 1 1 1 0 0
5 | 0 0 0 0 0 0 0 0 1 1
6 | 0 0 0 0 0 0 0 0 1 1
```

## Implemented traits `AnyType`, `Copyable`, `MHAMask`, `Movable`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask = False` ### `mask_out_of_bound` `alias mask_out_of_bound = True` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds = True` ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` ### `status` `status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` --- ## MHAMask The MHAMask trait describes masks for MHA kernels, such as the causal mask. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask` Does the mask require `log2e` to be applied after the mask, or can it be fused with the scaling? ### `mask_out_of_bound` `alias mask_out_of_bound` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds` Is the mask safe to read out of bounds? ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self: _Self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` Return the mask vector at the given coordinates. Arguments: coord is (seq\_id, head, q\_idx, k\_idx); score\_vec is the vector at `coord` of the score matrix. The functor could capture a mask tensor and add it to the score, e.g. for Replit. ### `status` `status[*, element_type: DType = uint32](self: _Self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` Given a tile's index range, return its masking status. --- ## MaskName `struct MaskName` The name of an attention mask variant. ## Fields * ​name (`String`): ## Implemented traits `AnyType`, `Stringable`, `UnknownDestructibility` ## Aliases ### `CAUSAL` `alias CAUSAL = MaskName(__init__[__mlir_type.!kgen.string]("causal"))` ### `CHUNKED` `alias CHUNKED = MaskName(__init__[__mlir_type.!kgen.string]("chunked"))` ### `CHUNKED_CAUSAL` `alias CHUNKED_CAUSAL = MaskName(__init__[__mlir_type.!kgen.string]("chunked_causal"))` ### `MATERIALIZED` `alias MATERIALIZED = MaskName(__init__[__mlir_type.!kgen.string]("materialized"))` ### `NULL` `alias NULL = MaskName(__init__[__mlir_type.!kgen.string]("null"))` ### `SLIDING_WINDOW_CAUSAL` `alias SLIDING_WINDOW_CAUSAL = MaskName(__init__[__mlir_type.!kgen.string]("sliding_window_causal"))` ## Methods ### `__init__` `__init__(out self, name: String)` ### `__eq__` `__eq__(self, rhs: Self) -> Bool` `__eq__(self, rhs: String) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` ### `__str__` `__str__(self) -> String` --- ## MaterializedMask `@register_passable(trivial)` `struct MaterializedMask[type_: DType, rank_: Int, shape_: DimList]` Mask that's backed by a materialized tensor.
## Fields * ​mask\_tensor (`NDBuffer[type_, rank_, MutableAnyOrigin, shape_]`): * ​start\_pos (`OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]`): * ​is\_multiple\_of\_2 (`Bool`): ## Implemented traits `AnyType`, `Copyable`, `MHAMask`, `Movable`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask = True` ### `mask_out_of_bound` `alias mask_out_of_bound = True` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds = False` ### `MaskType` `alias MaskType = NDBuffer[type_, rank_, MutableAnyOrigin, shape_]` ## Methods ### `__init__` `__init__(mask_tensor: NDBuffer[type_, rank_, MutableAnyOrigin, shape_], start_pos: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1})) -> Self` ### `get_start_pos` `get_start_pos(self, batch_idx: Int) -> Int` ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` ### `status` `status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` --- ## NullMask `@register_passable(trivial)` `struct NullMask` Mask that's effectively a noop. ## Implemented traits `AnyType`, `Copyable`, `MHAMask`, `Movable`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask = False` ### `mask_out_of_bound` `alias mask_out_of_bound = True` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds = True` ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` ### `status` `status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` --- ## OrMask `@register_passable(trivial)` `struct OrMask[T: MHAMask, S: MHAMask, //, lhs: T, rhs: S]` Mask that's the OR of two masks. ## Implemented traits `AnyType`, `Copyable`, `MHAMask`, `Movable`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask = get_vtable_entry(:trait T, "apply_log2e_after_mask") if get_vtable_entry(:trait T, "apply_log2e_after_mask") else get_vtable_entry(:trait S, "apply_log2e_after_mask")` ### `mask_out_of_bound` `alias mask_out_of_bound = get_vtable_entry(:trait S, "mask_out_of_bound") if get_vtable_entry(:trait T, "mask_out_of_bound") else get_vtable_entry(:trait T, "mask_out_of_bound")` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds = get_vtable_entry(:trait S, "mask_safe_out_of_bounds") if get_vtable_entry(:trait T, "mask_safe_out_of_bounds") else get_vtable_entry(:trait T, "mask_safe_out_of_bounds")` ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` ### `status` `status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` --- ## SlidingWindowCausalMask `@register_passable(trivial)` `struct SlidingWindowCausalMask[window_size: Int]` Mask implementing Sliding Window attention. 
Considering the following case: * Q\_len = 7 * K\_len = 7 * window\_size = 3 The mask will be applied as follows:

```
K > 0 1 2 3 4 5 6
Q v x------------x
0 | 1 0 0 0 0 0 0
1 | 1 1 0 0 0 0 0
2 | 1 1 1 0 0 0 0
3 | 0 1 1 1 0 0 0
4 | 0 0 1 1 1 0 0
5 | 0 0 0 1 1 1 0
6 | 0 0 0 0 1 1 1
```

## Implemented traits `AnyType`, `Copyable`, `MHAMask`, `Movable`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask = False` ### `mask_out_of_bound` `alias mask_out_of_bound = True` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds = True` ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` ### `status` `status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` --- ## TileMaskStatus `@register_passable(trivial)` `struct TileMaskStatus` A tile's masking status. ## Fields * ​status (`SIMD[uint8, 1]`): ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `FULL_MASK` `alias FULL_MASK = TileMaskStatus(__init__[__mlir_type.!pop.int_literal](3))` ### `NO_MASK` `alias NO_MASK = TileMaskStatus(__init__[__mlir_type.!pop.int_literal](0))` ### `PARTIAL_MASK` `alias PARTIAL_MASK = TileMaskStatus(__init__[__mlir_type.!pop.int_literal](1))` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` ### `__is__` `__is__(self, rhs: Self) -> Bool` ### `__and__` `__and__(self, rhs: Self) -> Self` ### `__or__` `__or__(self, rhs: Self) -> Self` ### `__is_not__` `__is_not__(self, rhs: Self) -> Bool` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` --- ## mha_mask ## Aliases ### `MASK_VALUE` `alias MASK_VALUE = -10000` ## Structs * [​`AndMask`](./AndMask): Mask that's the AND of two masks. * [​`CausalMask`](./CausalMask): MHA causal mask ensures a token is only affected by previous tokens. * [​`ChunkedMask`](./ChunkedMask): Mask implementing Chunked attention. * [​`MaskName`](./MaskName): The name of an attention mask variant. * [​`MaterializedMask`](./MaterializedMask): Mask that's backed by a materialized tensor. * [​`NullMask`](./NullMask): Mask that's effectively a noop. * [​`OrMask`](./OrMask): Mask that's the OR of two masks. * [​`SlidingWindowCausalMask`](./SlidingWindowCausalMask): Mask implementing Sliding Window attention. * [​`TileMaskStatus`](./TileMaskStatus): A tile's masking status. ## Traits * [​`MHAMask`](./MHAMask): The MHAMask trait describes masks for MHA kernels, such as the causal mask. ## Functions * [​`ChunkedCausalMask`](./ChunkedCausalMask): Mask implementing Chunked Causal attention for Llama4 models.
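The `status` API exists so kernels can skip whole score tiles. A sketch of that pattern follows (hypothetical helper; a (q, k) ordering for `tile_offset` is assumed here, and imports from the kernels package are omitted):

```mojo
from utils.index import IndexList

# Hypothetical tile loop over the KV dimension for one query tile.
fn process_kv_tiles[M: MHAMask](m: M, num_keys: Int, q_start: Int):
    alias BN = 64  # assumed KV tile width
    var kv_start = 0
    while kv_start < num_keys:
        var st = m.status(
            IndexList[2, element_type = DType.uint32](q_start, kv_start),
            IndexList[2, element_type = DType.uint32](1, BN),
        )
        if st is TileMaskStatus.FULL_MASK:
            kv_start += BN  # fully masked: skip this tile entirely
            continue
        # ... compute scores; apply m.mask(...) only when st is
        # TileMaskStatus.PARTIAL_MASK ...
        kv_start += BN
```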
--- ## KVCacheMHAOperand `@register_passable(trivial)` `struct KVCacheMHAOperand[cache_t: KVCacheT]` An implementation for `mo.opaque` KVCacheT arguments to MHA kernels. We can eventually remove this trait and just add it as a sub-trait in the KVCacheT type, but we need to solve some cyclic dependencies first. ## Fields * ​cache (`cache_t`): ## Implemented traits `AnyType`, `Copyable`, `MHAOperand`, `Movable`, `UnknownDestructibility` ## Aliases ### `type` `alias type = get_vtable_entry(:trait cache_t, "type")` ## Methods ### `__init__` `__init__(cache: cache_t) -> Self` ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self, batch_idx: SIMD[uint32, 1], start_tok_idx: SIMD[uint32, 1], head_idx: SIMD[uint32, 1], head_dim_idx: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> UnsafePointer[SIMD[get_vtable_entry(:trait cache_t, "type"), 1]]` ### `cache_length` `cache_length(self, batch_idx: Int) -> Int` ### `max_context_length` `max_context_length(self) -> SIMD[uint32, 1]` --- ## MHAOperand This serves as the trait to support arguments to our MHA kernel. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `type` `alias type` ## Methods ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self: _Self, batch_idx: SIMD[uint32, 1], start_tok_idx: SIMD[uint32, 1], head_idx: SIMD[uint32, 1], head_dim_idx: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> UnsafePointer[SIMD[get_vtable_entry(:trait _Self, "type"), 1]]` ### `cache_length` `cache_length(self: _Self, batch_idx: Int) -> Int` Returns the length of the cache for a given batch index. ### `max_context_length` `max_context_length(self: _Self) -> SIMD[uint32, 1]` Returns the maximum cache length across the batch. --- ## NDBufferMHAOperand `@register_passable(trivial)` `struct NDBufferMHAOperand[type_: DType, rank: Int, shape: DimList, stride: DimList]` An implementation for NDBuffer arguments to MHA kernels. ## Fields * ​buffer (`NDBuffer[type_, rank, MutableAnyOrigin, shape, stride]`): ## Implemented traits `AnyType`, `Copyable`, `MHAOperand`, `Movable`, `UnknownDestructibility` ## Aliases ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(buffer: NDBuffer[type_, rank, MutableAnyOrigin, shape, stride]) -> Self` ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self, batch_idx: SIMD[uint32, 1], start_tok_idx: SIMD[uint32, 1], head_idx: SIMD[uint32, 1], head_dim_idx: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> UnsafePointer[SIMD[type_, 1]]` ### `cache_length` `cache_length(self, batch_idx: Int) -> Int` ### `max_context_length` `max_context_length(self) -> SIMD[uint32, 1]` --- ## RaggedMHAOperand `@register_passable(trivial)` `struct RaggedMHAOperand[type_: DType, shape: DimList, stride: DimList]` An implementation for ragged NDBuffer arguments to MHA kernels. ## Fields * ​buffer (`NDBuffer[type_, 3, MutableAnyOrigin, shape, stride]`): * ​cache\_row\_offsets (`NDBuffer[uint32, 1, MutableAnyOrigin]`): ## Implemented traits `AnyType`, `Copyable`, `MHAOperand`, `Movable`, `UnknownDestructibility` ## Aliases ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(buffer: NDBuffer[type_, 3, MutableAnyOrigin, shape, stride], cache_row_offsets: NDBuffer[uint32, 1, MutableAnyOrigin, shape, strides]) -> Self` ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self, batch_idx: SIMD[uint32, 1], start_tok_idx: SIMD[uint32, 1], head_idx: SIMD[uint32, 1], head_dim_idx: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> UnsafePointer[SIMD[type_, 1]]` ### `cache_length` `cache_length(self, batch_idx: Int) -> Int` ### `max_context_length` `max_context_length(self) -> SIMD[uint32, 1]` --- ## mha_operand ## Structs * [​`KVCacheMHAOperand`](./KVCacheMHAOperand): An implementation for `mo.opaque` KVCacheT arguments to MHA kernels. * [​`NDBufferMHAOperand`](./NDBufferMHAOperand): An implementation for NDBuffer arguments to MHA kernels. * [​`RaggedMHAOperand`](./RaggedMHAOperand): An implementation for ragged NDBuffer arguments to MHA kernels. ## Traits * [​`MHAOperand`](./MHAOperand): This serves as the trait to support arguments to our MHA kernel.
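A sketch of how a kernel can consume any MHAOperand, walking one batch/head of the KV cache in fixed-size tiles (hypothetical helper; imports from the kernels package omitted):

```mojo
fn walk_keys[T: MHAOperand](kv: T, batch_idx: UInt32, head_idx: UInt32):
    alias tile_size = 64  # assumed tile width
    var num_keys = kv.cache_length(Int(batch_idx))
    var tok: UInt32 = 0
    while Int(tok) < num_keys:
        # Pointer to the first element of this tile for the given head.
        var ptr = kv.block_paged_ptr[tile_size](batch_idx, tok, head_idx)
        # ... consume up to tile_size keys starting at ptr ...
        tok += tile_size
```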
--- ## AlibiScoreMod `@register_passable(trivial)` `struct AlibiScoreMod[num_heads: Int]` AlibiScoreMod adds the appropriate ALiBi constant bias to the attention score. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `ScoreModTrait`, `UnknownDestructibility` ## Aliases ### `name_str` `alias name_str = __init__[__mlir_type.!kgen.string]("alibi")` ## Methods ### `score_mod` `score_mod[type: DType, width: Int, //, *, element_type: DType = int32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width], max_prompt_len: Int) -> SIMD[type, width]` --- ## IdentityScoreMod `@register_passable(trivial)` `struct IdentityScoreMod` IdentityScoreMod simply returns the attention score. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `ScoreModTrait`, `UnknownDestructibility` ## Aliases ### `name_str` `alias name_str = __init__[__mlir_type.!kgen.string]("no_pos")` ## Methods ### `score_mod` `score_mod[type: DType, width: Int, //, *, element_type: DType = int32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width], max_prompt_len: Int = 0) -> SIMD[type, width]` --- ## ScoreModTrait The ScoreModTrait trait describes a score\_mod for MHA kernels, such as the ALiBi bias. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `name_str` `alias name_str` ## Methods ### `score_mod` `score_mod[type: DType, width: Int, //, *, element_type: DType = int32](self: _Self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width], max_prompt_len: Int = 0) -> SIMD[type, width]` Return the score vector at the given coordinates after applying the score\_mod. Arguments: coord is (seq\_id, head, q\_idx, k\_idx); score\_vec is the vector at `coord` of the score matrix. The score\_mod functor computes a bias tensor and adds it to score\_vec. --- ## mha_score_mod ## Structs * [​`AlibiScoreMod`](./AlibiScoreMod): AlibiScoreMod adds the appropriate ALiBi constant bias to the attention score. * [​`IdentityScoreMod`](./IdentityScoreMod): IdentityScoreMod simply returns the attention score. ## Traits * [​`ScoreModTrait`](./ScoreModTrait): The ScoreModTrait trait describes a score\_mod for MHA kernels, such as the ALiBi bias.
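For reference, ALiBi in its standard formulation adds a head-specific linear distance penalty to each score, which is the bias AlibiScoreMod contributes:

$$
\text{score}'_{h,i,j} = \text{score}_{h,i,j} - m_h \,(i - j),
$$

where $i$ and $j$ are the query and key positions and the slope $m_h$ follows a geometric sequence in the head index $h$.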
--- ## MHAPosition `@register_passable(trivial)` `struct MHAPosition[BM: Int, BN: Int, depth: Int, num_heads: Int, group: Int, decoding: Bool]` Position of the MHA kernel. When `decoding=False`, `q_head_stride == num_heads`. When `decoding=True`, `q_head_stride == 1`. ## Fields * ​q\_out\_offset (`Int`): * ​num\_keys (`SIMD[uint32, 1]`): * ​start\_pos (`SIMD[uint32, 1]`): * ​seq\_len (`SIMD[uint32, 1]`): * ​head\_idx (`SIMD[uint32, 1]`): * ​prompt\_offset (`SIMD[uint32, 1]`): * ​prompt\_idx (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `q_output_gmem_layout` `alias q_output_gmem_layout = __init__[::Origin[::Bool(IntTuple(BM, depth), IntTuple(depth if decoding else (depth * num_heads), 1))` ### `q_stride` `alias q_stride = depth if decoding else (depth * num_heads)` ## Methods ### `__init__` `__init__(q_out_offset: Int, num_keys: SIMD[uint32, 1], start_pos: SIMD[uint32, 1], seq_info: SeqInfo) -> Self` ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` ### `q_head_idx` `q_head_idx(self) -> SIMD[uint32, 1]` ### `kv_head_idx` `kv_head_idx(self) -> SIMD[uint32, 1]` ### `write_to` `write_to[W: Writer](self, mut writer: W)` ### `q_tile_num_rows` `q_tile_num_rows(self) -> SIMD[uint32, 1]` ### `q_out_gmem_tensor` `q_out_gmem_tensor[dtype: DType](self, ptr: UnsafePointer[SIMD[dtype, 1]]) -> LayoutTensor[dtype, __init__[::Origin[::Bool(IntTuple(BM, depth), IntTuple(depth if decoding else (depth * num_heads), 1)), MutableAnyOrigin, layout_int_type=int32, linear_idx_type=int32, masked=True]` ### `mask_status` `mask_status[mask_t: MHAMask](self, mask: mask_t, kv_tile_start_row: SIMD[uint32, 1]) -> TileMaskStatus` ### `exp_sum_qk_max_ptr` `exp_sum_qk_max_ptr[partition_t: MHAPartitionScheme](self, partition: partition_t, batch_size: SIMD[uint32, 1]) -> Tuple[UnsafePointer[SIMD[get_vtable_entry(:trait partition_t, "accum_dtype"), 1]], UnsafePointer[SIMD[get_vtable_entry(:trait partition_t, "accum_dtype"), 1]]]` ### `get_start_and_end_for_partitions` `get_start_and_end_for_partitions[partition_t: MHAPartitionScheme, //, BN: Int](self, partition: partition_t) -> Tuple[SIMD[uint32, 1], SIMD[uint32, 1]]` --- ## mha_sm90 ## Structs * [​`MHAPosition`](./MHAPosition): Position of the MHA kernel. When `decoding=False`, `q_head_stride == num_heads`. When `decoding=True`, `q_head_stride == 1`.
## Functions * [​`mha_sm90_dispatch`](./mha_sm90_dispatch): * [​`valid_length_managed_tensor_slice_to_ndbuffer`](./valid_length_managed_tensor_slice_to_ndbuffer): --- ## mha_sm90_dispatch `mha_sm90_dispatch[k_t: MHAOperand, v_t: MHAOperand, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, output_type: DType, max_prompt_len_t: OptionallyStaticInt, partition_t: MHAPartitionScheme, //, config: MHAConfig, group: Int, use_score_mod: Bool, ragged: Bool, _is_cache_length_accurate: Bool](output: UnsafePointer[SIMD[output_type, 1]], q: UnsafePointer[SIMD[type, 1]], k: k_t, v: v_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: ManagedTensorSlice[io_spec, static_spec=static_spec], max_prompt_len_arg: max_prompt_len_t, max_cache_valid_length_arg: Int, scale: SIMD[float32, 1], kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]], batch_size_arg: Int, partition: partition_t, ctx: DeviceContext)` --- ## valid_length_managed_tensor_slice_to_ndbuffer `valid_length_managed_tensor_slice_to_ndbuffer(tensor: ManagedTensorSlice[io_spec, static_spec=static_spec]) -> NDBuffer[uint32, 1, MutableAnyOrigin]` --- ## MHASchedule `@register_passable(trivial)` `struct MHASchedule` ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `DEFAULT` `alias DEFAULT = MHASchedule(__init__[__mlir_type.!pop.int_literal](0))` ### `PROMPT_ROTATE` `alias PROMPT_ROTATE = MHASchedule(__init__[__mlir_type.!pop.int_literal](1))` ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` --- ## MHASchedulerSynchronization `@register_passable(trivial)` `struct MHASchedulerSynchronization` ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ALL` `alias ALL = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](2))` ### `DEFAULT` `alias DEFAULT = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))` ### `NONE` `alias NONE = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](0))` ### `PRODUCER` `alias PRODUCER = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))` ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` --- ## MHATileScheduler The MHATileScheduler trait describes a schedule for the persistent kernel. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `may_advance` `alias may_advance` ### `mha_schedule` `alias mha_schedule` ## Methods ### `get_current_work_info` `get_current_work_info(self: _Self, ts: MHATileSummary, state: MHATileState) -> WorkInfo` Returns the current `WorkInfo`. ### `advance` `advance[ragged: Bool, producer: Bool, sync: MHASchedulerSynchronization = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))](self: _Self, ts: MHATileSummary, mut state: MHATileState, pipeline_idx: SIMD[uint32, 1]) -> OptionalReg[SeqInfo]` Advance the state to the next work item. Returns a `SeqInfo` if there is more work, or an empty optional otherwise. ### `grid_dim` `static grid_dim(batch_size: SIMD[uint32, 1], max_num_prompt_tiles: SIMD[uint32, 1]) -> Tuple[Int, Int, Int]` Return the grid\_dim required for the kernel. ### `initial_state` `initial_state(self: _Self, ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)], tile_summary: MHATileSummary) -> MHATileState` Create the initial state object. ### `unsafe_seq_info` `unsafe_seq_info[ragged: Bool](self: _Self, ts: MHATileSummary, state: MHATileState) -> SeqInfo`
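A sketch of the persistent-kernel pattern these methods support (hypothetical wrapper; the shared-memory pointer and MHATileSummary are assumed to be set up by the launching kernel, and imports are omitted):

```mojo
fn work_loop[S: MHATileScheduler](
    scheduler: S,
    ts: MHATileSummary,
    smem: UnsafePointer[UInt32, address_space = AddressSpace(3)],
):
    var state = scheduler.initial_state(smem, ts)
    while True:
        var work = scheduler.get_current_work_info(ts, state)
        if not work.is_valid():
            break
        # ... compute one (prompt_offset, head_idx, prompt_idx) work item ...
        @parameter
        if not S.may_advance:
            break  # one work item per block, e.g. TransientScheduler
        _ = scheduler.advance[ragged=False, producer=True](ts, state, 0)
```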
--- ## MHATileState `@register_passable(trivial)` `struct MHATileState` ## Fields * ​idx (`SIMD[uint32, 1]`): * ​sidx\_ptr (`UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)]`): * ​max\_idx (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(idx: SIMD[uint32, 1], sidx_ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)], max_idx: SIMD[uint32, 1]) -> Self` ### `is_valid` `is_valid(self, idx: SIMD[uint32, 1]) -> Bool` `is_valid(self) -> Bool` --- ## MHATileSummary `@register_passable(trivial)` `struct MHATileSummary` ## Fields * ​batch\_size (`SIMD[uint32, 1]`): * ​max\_num\_prompt\_tiles (`SIMD[uint32, 1]`): * ​valid\_length (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​max\_seq\_len (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(batch_size: SIMD[uint32, 1], max_num_prompt_tiles: SIMD[uint32, 1], valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], max_seq_len: SIMD[uint32, 1]) -> Self` ### `get_current_work_info` `get_current_work_info[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], schedule: MHASchedule](self, idx: SIMD[uint32, 1]) -> WorkInfo` `get_current_work_info[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], schedule: MHASchedule](self, idx: MHATileState) -> WorkInfo` ### `unsafe_get_current_work_info` `unsafe_get_current_work_info[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], schedule: MHASchedule](self, idx: SIMD[uint32, 1]) -> WorkInfo` ### `max_idx` `max_idx(self, num_heads: SIMD[uint32, 1]) -> SIMD[uint32, 1]` ### `grid_dim` `static grid_dim[num_heads: SIMD[uint32, 1]](max_num_prompt_tiles: SIMD[uint32, 1], batch_size: SIMD[uint32, 1]) -> Tuple[Int, Int, Int]` ### `seq_info` `seq_info[ragged: Bool](self, work: WorkInfo) -> SeqInfo` ### `unsafe_seq_info` `unsafe_seq_info[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], ragged: Bool, schedule: MHASchedule](self, idx: SIMD[uint32, 1]) -> SeqInfo` `unsafe_seq_info[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], ragged: Bool, schedule: MHASchedule](self, state: MHATileState) -> SeqInfo` --- ## QueuedTileScheduler `@register_passable(trivial)` `struct QueuedTileScheduler[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], /, decoding: Bool, num_ctas: SIMD[uint32, 1] = SIMD(Info(__init__[__mlir_type.!kgen.string]("H100"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("hopper"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](9), __init__[__mlir_type.!kgen.string]("sm_90a"), 132, 32, 2048, 32, 64, 2048, 32, 233472, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)), schedule: MHASchedule = MHASchedule(__init__[__mlir_type.!pop.int_literal](0))]` If `decoding == False`, then `num_heads` is `q_num_heads`. If `decoding == True`, then `num_heads` is `kv_num_heads`.
## Fields * ​gidx\_ptr (`UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(1)]`): ## Implemented traits `AnyType`, `Copyable`, `MHATileScheduler`, `Movable`, `UnknownDestructibility` ## Aliases ### `may_advance` `alias may_advance = True` ### `mha_schedule` `alias mha_schedule = schedule` ## Methods ### `__init__` `__init__(gidx_ptr: UnsafePointer[SIMD[uint32, 1]]) -> Self` ### `get_current_work_info` `get_current_work_info(self, ts: MHATileSummary, state: MHATileState) -> WorkInfo` ### `advance` `advance[ragged: Bool, producer: Bool, sync: MHASchedulerSynchronization = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))](self, ts: MHATileSummary, mut state: MHATileState, pipeline_idx: SIMD[uint32, 1]) -> OptionalReg[SeqInfo]` Advances to the next work item. Returns a `SeqInfo` when the new index corresponds to a valid `WorkInfo`, or an empty optional when there is no more work. Note that if `MHASchedulerSynchronization` is `NONE`, then we assume it is only called by `thread_idx.x==0`. ### `grid_dim` `static grid_dim(batch_size: SIMD[uint32, 1], max_num_prompt_tiles: SIMD[uint32, 1]) -> Tuple[Int, Int, Int]` ### `initial_state` `initial_state(self, ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)], tile_summary: MHATileSummary) -> MHATileState` ### `unsafe_seq_info` `unsafe_seq_info[ragged: Bool](self, ts: MHATileSummary, state: MHATileState) -> SeqInfo` --- ## SeqInfo `@register_passable(trivial)` `struct SeqInfo` ## Fields * ​seq\_len (`SIMD[uint32, 1]`): * ​start\_of\_seq (`SIMD[uint32, 1]`): * ​prompt\_offset (`SIMD[uint32, 1]`): * ​head\_idx (`SIMD[uint32, 1]`): * ​prompt\_idx (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(seq_len: SIMD[uint32, 1], start_of_seq: SIMD[uint32, 1], work: WorkInfo) -> Self` ### `is_valid` `is_valid(self) -> Bool` ### `create` `static create[ragged: Bool](work: WorkInfo, valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], max_seq_len: SIMD[uint32, 1]) -> Self` --- ## TileScheduler `@register_passable(trivial)` `struct TileScheduler[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], /, num_ctas: SIMD[uint32, 1] = SIMD(Info(__init__[__mlir_type.!kgen.string]("H100"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("hopper"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](9), __init__[__mlir_type.!kgen.string]("sm_90a"), 132, 32, 2048, 32, 64, 2048, 32, 233472, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)), schedule: MHASchedule = MHASchedule(__init__[__mlir_type.!pop.int_literal](0))]` ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `MHATileScheduler`, `Movable`, `UnknownDestructibility` ## Aliases ### `may_advance` `alias may_advance = True` ### `mha_schedule` `alias mha_schedule = schedule` ## Methods ### `__init__` `__init__() -> Self` ### `get_current_work_info` `get_current_work_info(self, ts: MHATileSummary, state: MHATileState) -> WorkInfo` ### `fetch_next_work` `fetch_next_work(self, ts: MHATileSummary, mut state: MHATileState) -> WorkInfo` ### `advance` `advance[ragged: Bool, producer: Bool, sync: MHASchedulerSynchronization = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))](self, ts: MHATileSummary, mut state: MHATileState, pipeline_idx: SIMD[uint32, 1]) -> OptionalReg[SeqInfo]` ### 
`grid_dim` `static grid_dim(batch_size: SIMD[uint32, 1], max_num_prompt_tiles: SIMD[uint32, 1]) -> Tuple[Int, Int, Int]` ### `initial_state` `initial_state(self, ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)], tile_summary: MHATileSummary) -> MHATileState` ### `unsafe_seq_info` `unsafe_seq_info[ragged: Bool](self, ts: MHATileSummary, state: MHATileState) -> SeqInfo` --- ## TransientScheduler `@register_passable(trivial)` `struct TransientScheduler[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1]]` ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `MHATileScheduler`, `Movable`, `UnknownDestructibility` ## Aliases ### `may_advance` `alias may_advance = False` ### `mha_schedule` `alias mha_schedule = MHASchedule(__init__[__mlir_type.!pop.int_literal](0))` ## Methods ### `__init__` `__init__() -> Self` ### `get_current_work_info` `get_current_work_info(self) -> WorkInfo` `get_current_work_info(self, ts: MHATileSummary, state: MHATileState) -> WorkInfo` ### `advance` `advance[ragged: Bool, producer: Bool, sync: MHASchedulerSynchronization = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))](self, ts: MHATileSummary, mut state: MHATileState, pipeline_idx: SIMD[uint32, 1]) -> OptionalReg[SeqInfo]` ### `grid_dim` `static grid_dim(batch_size: SIMD[uint32, 1], max_num_prompt_tiles: SIMD[uint32, 1]) -> Tuple[Int, Int, Int]` ### `initial_state` `initial_state(self, ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)], tile_summary: MHATileSummary) -> MHATileState` ### `unsafe_seq_info` `unsafe_seq_info[ragged: Bool](self, ts: MHATileSummary, state: MHATileState) -> SeqInfo` --- ## WorkInfo `@register_passable(trivial)` `struct WorkInfo` ## Fields * ​prompt\_offset (`SIMD[uint32, 1]`): * ​head\_idx (`SIMD[uint32, 1]`): * ​prompt\_idx (`SIMD[uint32, 1]`): * ​is\_valid\_tile (`Bool`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `is_valid` `is_valid(self) -> Bool` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` --- ## mha_tile_scheduler ## Structs * [​`MHASchedule`](./MHASchedule): * [​`MHASchedulerSynchronization`](./MHASchedulerSynchronization): * [​`MHATileState`](./MHATileState): * [​`MHATileSummary`](./MHATileSummary): * [​`QueuedTileScheduler`](./QueuedTileScheduler): If `decoding == False`, then `num_heads` is `q_num_heads`. If `decoding == True`, then `num_heads` is `kv_num_heads`. 
* [​`SeqInfo`](./SeqInfo): * [​`TileScheduler`](./TileScheduler): * [​`TransientScheduler`](./TransientScheduler): * [​`WorkInfo`](./WorkInfo): ## Traits * [​`MHATileScheduler`](./MHATileScheduler): --- ## DynamicInt `@register_passable(trivial)` `struct DynamicInt` ## Fields * ​value (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Intable`, `Movable`, `OptionallyStaticInt`, `UnknownDestructibility` ## Aliases ### `static_value` `alias static_value = OptionalReg[Int]({:i1 0, 1})` ## Methods ### `__init__` `__init__(value: Int) -> Self` ### `__int__` `__int__(self) -> Int` ### `as_uint32` `as_uint32(self) -> SIMD[uint32, 1]` --- ## FlashAttentionAlgorithm `@register_passable(trivial)` `struct FlashAttentionAlgorithm` ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `FLASH_ATTENTION_1` `alias FLASH_ATTENTION_1 = FlashAttentionAlgorithm(1)` ### `FLASH_ATTENTION_2` `alias FLASH_ATTENTION_2 = FlashAttentionAlgorithm(2)` ### `FLASH_ATTENTION_3` `alias FLASH_ATTENTION_3 = FlashAttentionAlgorithm(3)` ### `NAIVE` `alias NAIVE = FlashAttentionAlgorithm(0)` ## Methods ### `__init__` `__init__() -> Self` `@implicit` `__init__(value: Int) -> Self` ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` --- ## MHAConfig `@register_passable(trivial)` `struct MHAConfig` ## Fields * ​type (`DType`): * ​num\_heads (`UInt`): * ​depth (`UInt`): * ​num\_queries\_per\_block (`UInt`): * ​num\_keys\_per\_block (`UInt`): * ​BK (`UInt`): * ​WM (`UInt`): * ​WN (`UInt`): * ​num\_pipeline\_stages (`UInt`): * ​k\_group\_size (`UInt`): * ​algorithm (`FlashAttentionAlgorithm`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(type: DType, num_heads: UInt, depth: UInt, num_queries_per_block: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}), num_keys_per_block: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}), BK: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}), WM: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}), WN: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}), num_pipeline_stages: UInt = UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), k_group_size: UInt = UInt(1), algorithm: FlashAttentionAlgorithm = FlashAttentionAlgorithm()) -> Self` ### `block_m` `block_m(self) -> UInt` ### `block_n` `block_n(self) -> UInt` ### `block_k` `block_k(self) -> UInt` ### `warp_m` `warp_m(self) -> UInt` ### `warp_n` `warp_n(self) -> UInt` ### `num_warps_m` `num_warps_m(self) -> UInt` ### `num_warps_n` `num_warps_n(self) -> UInt` ### `num_consumer_threads` `num_consumer_threads(self) -> UInt` ### `num_producer_threads` `num_producer_threads[producer_consumer_kernel: Bool = False](self) -> UInt` ### `num_threads` `num_threads[producer_consumer_kernel: Bool = False](self) -> UInt` ### `q_smem_size` `q_smem_size(self, fa3: Bool = False) -> UInt` ### `kv_smem_size` `kv_smem_size(self, fa3: Bool = False) -> UInt` ### `k_smem_size` `k_smem_size(self, sm_90: Bool = False) -> UInt` ### `v_smem_size` `v_smem_size(self, sm_90: Bool = False) -> UInt` ### 
`p_smem_size` `p_smem_size(self) -> UInt` ### `warp_scratch_smem_size` `warp_scratch_smem_size(self) -> UInt` ### `shared_mem_bytes` `shared_mem_bytes[shared_kv: Bool = False, sm_90: Bool = False](self) -> UInt` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` --- ## MHAPartitionScheme ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `accum_dtype` `alias accum_dtype` ### `do_partition` `alias do_partition` ## Methods ### `num_partitions` `num_partitions(self: _Self) -> SIMD[uint32, 1]` ### `get_exp_sum_qk_max_pointer` `get_exp_sum_qk_max_pointer(self: _Self) -> UnsafePointer[SIMD[get_vtable_entry(:trait _Self, "accum_dtype"), 1]]` --- ## NoPartition `@register_passable(trivial)` `struct NoPartition[dtype: DType]` ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `MHAPartitionScheme`, `Movable`, `UnknownDestructibility` ## Aliases ### `accum_dtype` `alias accum_dtype = dtype` ### `do_partition` `alias do_partition = False` ## Methods ### `__init__` `__init__() -> Self` ### `num_partitions` `num_partitions(self) -> SIMD[uint32, 1]` ### `get_exp_sum_qk_max_pointer` `get_exp_sum_qk_max_pointer(self) -> UnsafePointer[SIMD[dtype, 1]]` --- ## OptionallyStaticInt ## Implemented traits `AnyType`, `Copyable`, `Intable`, `Movable`, `UnknownDestructibility` ## Aliases ### `static_value` `alias static_value` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `as_uint32` `as_uint32(self: _Self) -> SIMD[uint32, 1]` ### `__int__` `__int__(self: _Self) -> Int` Get the integral representation of the value. **Returns:** The integral representation of the value. 
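A minimal sketch of how code can be generic over OptionallyStaticInt, specializing when the value is known at compile time (hypothetical function; StaticInt and DynamicInt are the two implementations documented here):

```mojo
fn describe[T: OptionallyStaticInt](max_prompt_len: T):
    @parameter
    if T.static_value:
        # The length is a compile-time constant: specialize on it.
        alias n = T.static_value.value()
    else:
        # Fall back to the runtime value via the Intable conformance.
        var n = Int(max_prompt_len)

# describe(StaticInt[128]())  # static_value holds 128
# describe(DynamicInt(512))   # static_value is empty
```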
--- ## SplitKPartition `@register_passable(trivial)` `struct SplitKPartition[dtype: DType]` ## Fields * ​ptr (`UnsafePointer[SIMD[dtype, 1]]`): * ​num\_partitions\_value (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `MHAPartitionScheme`, `Movable`, `UnknownDestructibility` ## Aliases ### `accum_dtype` `alias accum_dtype = dtype` ### `do_partition` `alias do_partition = True` ## Methods ### `__init__` `__init__(ptr: UnsafePointer[SIMD[dtype, 1]], num_partitions_value: SIMD[uint32, 1]) -> Self` ### `num_partitions` `num_partitions(self) -> SIMD[uint32, 1]` ### `get_exp_sum_qk_max_pointer` `get_exp_sum_qk_max_pointer(self) -> UnsafePointer[SIMD[dtype, 1]]` --- ## StaticInt `@register_passable(trivial)` `struct StaticInt[value: Int]` ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Intable`, `Movable`, `OptionallyStaticInt`, `UnknownDestructibility` ## Aliases ### `static_value` `alias static_value = OptionalReg[Int]({:@stdlib::@builtin::@int::@Int value, 0})` ## Methods ### `__init__` `__init__() -> Self` ### `__int__` `__int__(self) -> Int` ### `as_uint32` `as_uint32(self) -> SIMD[uint32, 1]` --- ## dispatch_mask_and_score_mod `dispatch_mask_and_score_mod[mask_type: String, score_mod_type: String, callback_fn: fn[MHAMask, ScoreModTrait](mask: $0, score_mod: $1) raises capturing -> None, local_window_size: Int = -1, num_heads: Int = -1]()` --- ## dispatch_materialized_mask_and_score_mod `dispatch_materialized_mask_and_score_mod[score_mod_type: String, callback_fn: fn[MHAMask, ScoreModTrait](mask: $0, score_mod: $1) raises capturing -> None, num_heads: Int = -1](mask_nd: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], start_pos_nd: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}))` --- ## get_start_and_end_for_partitions `get_start_and_end_for_partitions[tile_size: Int](num_keys: Int, num_partitions: Int, partition_idx: Int) -> Tuple[Int, Int]` Calculate start and end indices for a partition. **Args:** * ​num\_keys (`Int`): Total number of keys (sequence length). * ​num\_partitions (`Int`): Number of partitions to split keys into. * ​partition\_idx (`Int`): Index of current partition (0 to num\_partitions-1). **Returns:** Tuple of (start\_idx, end\_idx) for the partition, aligned to tile\_size. 
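A quick worked example (the exact assignment of tiles to partitions is implementation-defined; the bounds are tile\_size-aligned except possibly at num\_keys):

```mojo
fn example():
    # 1000 keys split across 4 partitions with tile_size = 128:
    # ceil(1000 / 128) = 8 tiles, so roughly two tiles (256 keys) each.
    var bounds = get_start_and_end_for_partitions[128](1000, 4, 0)
    var start = bounds[0]  # e.g. 0
    var end = bounds[1]    # e.g. 256 -- a multiple of tile_size
```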
--- ## mha_utils ## Aliases ### `callback_fn_type` `alias callback_fn_type = fn[MHAMask, ScoreModTrait](mask: $0, score_mod: $1) raises capturing -> None` ### `is_sm100` `alias is_sm100 = _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100"))` ### `is_sm90` `alias is_sm90 = _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90"))` ### `is_sm90or100` `alias is_sm90or100 = _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100"))` ## Structs * [​`DynamicInt`](./DynamicInt): * [​`FlashAttentionAlgorithm`](./FlashAttentionAlgorithm): * [​`MHAConfig`](./MHAConfig): * [​`NoPartition`](./NoPartition): * [​`SplitKPartition`](./SplitKPartition): * [​`StaticInt`](./StaticInt): ## Traits * [​`MHAPartitionScheme`](./MHAPartitionScheme): * [​`OptionallyStaticInt`](./OptionallyStaticInt): ## Functions * [​`dispatch_mask_and_score_mod`](./dispatch_mask_and_score_mod): * [​`dispatch_materialized_mask_and_score_mod`](./dispatch_materialized_mask_and_score_mod): * [​`get_start_and_end_for_partitions`](./get_start_and_end_for_partitions): Calculate start and end indices for a partition. --- ## flare_mla_decoding `flare_mla_decoding[rank: Int, cache_t: KVCacheT, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), UInt(1), FlashAttentionAlgorithm()), ragged: Bool = False, decoding_warp_split_k: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: cache_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: NDBuffer[uint32, 1, origin, shape, strides], scale: SIMD[float32, 1], ctx: DeviceContext, q_max_seq_len: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` MLA decoding kernel that is only called in the optimized compute graph. The Q input has a shape of \[seq\_len, num\_heads, depth]. The K input has a shape of \[seq\_len, 1, depth]. The V tensor is derived by reusing K, where V = K\[:, :, :depth\_v]. Specifically, for DeepSeek V2/3, depth = 576 and depth\_v = 512. This kernel computes attention without needing to load V twice. This kernel only handles decoding requests, in which case q\_max\_seq\_len = 1. This kernel handles batches with different valid lengths (i.e., before the padding); such lengths are passed in the valid\_length argument.
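A sketch of the shapes involved for the DeepSeek V2/3 case described above ($d = 576$, $d_v = 512$, $S$ = sequence length, $H$ = num\_heads):

$$
Q \in \mathbb{R}^{1 \times H \times 576}, \qquad K \in \mathbb{R}^{S \times 1 \times 576}, \qquad V = K[:, :, :512] \in \mathbb{R}^{S \times 1 \times 512},
$$

so the score $QK^{\top}$ consumes all 576 channels of K while $\operatorname{softmax}(\cdot)\,V$ reads only the first 512, letting the kernel load K once for both roles.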
`flare_mla_decoding[rank: Int, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), UInt(1), FlashAttentionAlgorithm()), decoding_warp_split_k: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: NDBuffer[type, rank, origin, shape, strides], mask_functor: mask_t, score_mod_functor: score_mod_t, scale: SIMD[float32, 1], ctx: DeviceContext, num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` --- ## flare_mla_decoding_dispatch `flare_mla_decoding_dispatch[rank: Int, k_t: MHAOperand, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, kv_num_heads: Int, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), UInt(1), FlashAttentionAlgorithm()), ragged: Bool = False, _is_cache_length_accurate: Bool = False, _use_valid_length: Bool = True, decoding_warp_split_k: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: k_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: NDBuffer[uint32, 1, origin, shape, strides], max_prompt_len: Int, max_cache_valid_length: Int, scale: SIMD[float32, 1], ctx: DeviceContext, kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` --- ## flare_mla_prefill `flare_mla_prefill[rank: Int, cache_t: KVCacheT, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, output_type: DType, softmax_type: DType, q_shape: DimList, //, use_score_mod: Bool = False, write_softmax_info: Bool = False, use_cascade_attention: Bool = False](output: NDBuffer[output_type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: NDBuffer[type, 3, origin, shape, strides], v: NDBuffer[type, 3, origin, shape, strides], k_rope: cache_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: NDBuffer[uint32, 1, origin, shape, strides], cache_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], scale: SIMD[float32, 1], ctx: DeviceContext, q_max_seq_len: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}), cache_offsets: 
OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), prev_output: OptionalReg[NDBuffer[output_type, rank, MutableAnyOrigin]] = OptionalReg[NDBuffer[output_type, rank, MutableAnyOrigin]]({:i1 0, 1}), prev_softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}))` MLA prefill kernel that would only be called in the optimized compute graph. Only supports ragged Q/K/V inputs. The Q input has a shape of \[seq\_len, num\_heads, q\_depth]. The K and V inputs each have a shape of \[cache\_len, num\_heads, depth]. The K\_rope input is retrieved from the KV cache, with a shape of \[cache\_len, 1, q\_depth - depth]. Specifically, for DeepSeek V2/3, depth = 128 and q\_depth = 192. When computing attention scores (Q @ K), each head of K is smaller than each head of Q. The missing 64 elements of K are retrieved from the K cache, and broadcast to all the heads. This kernel also handles the case where the output has a reduced dimension compared to the input Q. This kernel handles batches with different valid lengths (i.e., before the padding). Such lengths are passed in the valid\_length argument. `flare_mla_prefill[rank: Int, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, softmax_type: DType, q_shape: DimList, //, use_score_mod: Bool = False, write_softmax_info: Bool = False, use_cascade_attention: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: NDBuffer[type, 3, origin, shape, strides], v: NDBuffer[type, 3, origin, shape, strides], k_rope: NDBuffer[type, 4, origin, shape, strides], mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: NDBuffer[uint32, 1, origin, shape, strides], cache_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], scale: SIMD[float32, 1], ctx: DeviceContext, q_max_seq_len: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}), cache_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}))` --- ## flare_mla_prefill_dispatch `flare_mla_prefill_dispatch[rank: Int, k_t: MHAOperand, v_t: MHAOperand, k_rope_t: MHAOperand, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, output_type: DType, softmax_type: DType, q_shape: DimList, //, kv_num_heads: Int, use_score_mod: Bool = False, write_softmax_info: Bool = False, use_cascade_attention: Bool = False, q_depth: Int = 192, cache_depth: Int = 576, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), UInt(1), FlashAttentionAlgorithm()), _ndbuffer_mha_operand: Bool = False](output: NDBuffer[output_type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: k_t, v: v_t, k_rope: k_rope_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: NDBuffer[uint32, 1, origin, shape,
strides], max_prompt_len: Int, scale: SIMD[float32, 1], ctx: DeviceContext, softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}), cache_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), prev_output: OptionalReg[NDBuffer[output_type, rank, MutableAnyOrigin]] = OptionalReg[NDBuffer[output_type, rank, MutableAnyOrigin]]({:i1 0, 1}), prev_softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}))` --- ## mla ## Functions * [​`flare_mla_decoding`](./flare_mla_decoding): MLA decoding kernel that would only be called in the optimized compute graph. * [​`flare_mla_decoding_dispatch`](./flare_mla_decoding_dispatch): * [​`flare_mla_prefill`](./flare_mla_prefill): MLA prefill kernel that would only be called in the optimized compute graph. Only supports ragged Q/K/V inputs. * [​`flare_mla_prefill_dispatch`](./flare_mla_prefill_dispatch): * [​`mla_decoding`](./mla_decoding): * [​`mla_decoding_single_batch`](./mla_decoding_single_batch): Flash attention v2 algorithm. * [​`mla_prefill`](./mla_prefill): * [​`mla_prefill_plan`](./mla_prefill_plan): This calls a GPU kernel that plans how to process a batch of sequences with varying lengths using a fixed-size buffer. * [​`mla_prefill_plan_kernel`](./mla_prefill_plan_kernel): * [​`mla_prefill_single_batch`](./mla_prefill_single_batch): MLA for encoding where seqlen > 1. --- ## mla_decoding `mla_decoding[q_type: DType, k_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, BM: UInt, BN: UInt, BK: UInt, WM: UInt, WN: UInt, depth: UInt, num_heads: UInt, num_threads: UInt, num_pipeline_stages: UInt, group: UInt = UInt(1), use_score_mod: Bool = False, ragged: Bool = False, _use_valid_length: Bool = False, _is_cache_length_accurate: Bool = False, decoding_warp_split_k: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], batch_size: Int, num_partitions: Int, max_cache_valid_length: Int, valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], mask: mask_t, score_mod: score_mod_t)` --- ## mla_decoding_single_batch `mla_decoding_single_batch[q_type: DType, k_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, BM: UInt, BN: UInt, BK: UInt, WM: UInt, WN: UInt, depth: UInt, depth_v: UInt, num_heads: UInt, num_threads: UInt, num_pipeline_stages: UInt, group: UInt = UInt(1), use_score_mod: Bool = False, decoding_warp_split_k: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], num_keys: UInt, num_partitions: UInt, max_cache_valid_length: UInt, mask: mask_t, score_mod: score_mod_t, batch_idx: Int)` Flash attention v2 algorithm. 
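As a rough illustration of the flash-attention-style computation above, here is a scalar sketch of the online-softmax update such kernels apply per tile; the function and variable names are illustrative only, and real kernels operate on SIMD vectors with warp reductions:

```mojo
from math import exp

fn online_softmax_update(
    mut running_max: Float32,
    mut exp_sum: Float32,
    mut acc: Float32,  # running exp-weighted sum of V for one output element
    score: Float32,    # newly computed attention score
    v: Float32,        # V value paired with that score
):
    var new_max = max(running_max, score)
    var correction = exp(running_max - new_max)
    # Rescale previously accumulated state to the new running maximum, then
    # fold in the new score; softmax stays stable without storing all scores.
    exp_sum = exp_sum * correction + exp(score - new_max)
    acc = acc * correction + exp(score - new_max) * v
    running_max = new_max
```

The `exp_sum_ptr` and `qk_max_ptr` arguments above appear to carry this kind of per-partition state so that split-k partitions can be merged afterwards.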
--- ## mla_prefill `mla_prefill[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, k_rope_t: MHAOperand, output_type: DType, softmax_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, config: MHAConfig, group: Int = 128, q_depth: Int = 192, cache_depth: Int = 576, use_score_mod: Bool = False, write_softmax_info: Bool = False, use_cascade_attention: Bool = False, _ndbuffer_mha_operand: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, k_rope: k_rope_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], softmax_info_ptr: UnsafePointer[SIMD[softmax_type, 1]], prev_output_ptr: UnsafePointer[SIMD[output_type, 1]], prev_softmax_info_ptr: UnsafePointer[SIMD[softmax_type, 1]], scale: SIMD[float32, 1], batch_size: Int, seq_len_arg: Int, valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], cache_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]], mask: mask_t, score_mod: score_mod_t)` --- ## mla_prefill_plan `mla_prefill_plan[cache_t: KVCacheT](buffer_row_offsets: NDBuffer[uint32, 2, origin, shape, strides], cache_offsets: NDBuffer[uint32, 2, origin, shape, strides], buffer_lengths: NDBuffer[int32, 1, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], k_cache: cache_t, buffer_token_size: SIMD[uint32, 1], ctx: DeviceContext)` This calls a GPU kernel that plans how to process a batch of sequences with varying lengths using a fixed-size buffer. Each sequence in the batch has some existing cached tokens and new input tokens. The kernel divides the total tokens into chunks of buffer\_token\_size. For each chunk (iteration), it calculates:

1. Buffer offsets for each sequence in each chunk.
2. Cache offsets for each sequence in each chunk.
3. Total buffer lengths for each processing iteration.

--- ## mla_prefill_plan_kernel `mla_prefill_plan_kernel[buffer_lengths_shape: DimList, cache_t: KVCacheT](buffer_row_offsets: NDBuffer[uint32, 2, MutableAnyOrigin], cache_offsets: NDBuffer[uint32, 2, MutableAnyOrigin], buffer_lengths: NDBuffer[int32, 1, MutableAnyOrigin, buffer_lengths_shape], input_row_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], k_cache: cache_t, buffer_token_size: SIMD[uint32, 1])` --- ## mla_prefill_single_batch `mla_prefill_single_batch[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, k_rope_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, config: MHAConfig, group: Int = 1, q_depth: Int = 192, cache_depth: Int = 576, use_score_mod: Bool = False, write_softmax_info: Bool = False, use_cascade_attention: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, k_rope: k_rope_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], softmax_info_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], prev_output_ptr: UnsafePointer[SIMD[output_type, 1]], prev_softmax_info_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], seq_len: Int, max_seq_len: Int, start_pos: SIMD[uint32, 1], cache_start_pos: SIMD[uint32, 1], num_keys: Int, mask: mask_t, score_mod: score_mod_t, batch_idx: Int)` MLA for encoding where seqlen > 1.
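A toy illustration of the chunking arithmetic described for `mla_prefill_plan` above; all names and sizes here are made up, and the real kernel additionally tracks per-sequence buffer and cache offsets:

```mojo
from math import ceildiv

def main():
    var total_tokens = 10000
    var buffer_token_size = 4096
    var iterations = ceildiv(total_tokens, buffer_token_size)  # 3 fixed-size chunks
    for it in range(iterations):
        var start = it * buffer_token_size
        var length = min(buffer_token_size, total_tokens - start)
        print(it, start, length)  # chunk index, first token, tokens covered
```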
--- ## moe ## Functions * [​`moe_create_indices`](./moe_create_indices): * [​`moe_create_indices_kernel`](./moe_create_indices_kernel): --- ## moe_create_indices `moe_create_indices[input_type: DType, //, target: StringSlice[StaticConstantOrigin]](token_expert_order: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], expert_start_indices: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], restore_token_order: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], expert_ids: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], expert_usage_stats: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], topk_ids: LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], context: DeviceContextPtr)` --- ## moe_create_indices_kernel `moe_create_indices_kernel[input_type: DType, num_threads: Int, token_expert_order_layout: Layout, expert_start_indices_layout: Layout, restore_token_order_layout: Layout, expert_ids_layout: Layout, expert_usage_stats_layout: Layout, indices_padded_layout: Layout, padded_input_layout: Layout, topk_ids_layout: Layout](token_expert_order: LayoutTensor[uint32, token_expert_order_layout, MutableAnyOrigin], expert_start_indices: LayoutTensor[uint32, expert_start_indices_layout, MutableAnyOrigin], restore_token_order: LayoutTensor[uint32, restore_token_order_layout, MutableAnyOrigin], expert_ids: LayoutTensor[uint32, expert_ids_layout, MutableAnyOrigin], expert_usage_stats: LayoutTensor[uint32, expert_usage_stats_layout, MutableAnyOrigin], indices_padded: LayoutTensor[uint32, indices_padded_layout, MutableAnyOrigin], topk_ids_padded: LayoutTensor[input_type, padded_input_layout, MutableAnyOrigin], topk_ids: LayoutTensor[input_type, topk_ids_layout, MutableAnyOrigin])` --- ## BoundingBox `struct BoundingBox[type: DType]` ## Fields * ​nw (`SIMD[type, 2]`): * ​se (`SIMD[type, 2]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, y1: SIMD[type, 1], x1: SIMD[type, 1], y2: SIMD[type, 1], x2: SIMD[type, 1])` ### `iou` `iou(self, other: Self) -> SIMD[type, 1]` ### `intersection_area` `intersection_area(self, other: Self) -> SIMD[type, 1]` ### `area` `area(self) -> SIMD[type, 1]` --- ## nms ## Structs * [​`BoundingBox`](./BoundingBox): ## Functions * [​`non_max_suppression`](./non_max_suppression): Buffer semantic overload. * [​`non_max_suppression_shape_func`](./non_max_suppression_shape_func): Overload to compute the output shape. Can be removed once the graph compiler supports value semantic kernels that allocate their own output. 
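A small usage sketch of `BoundingBox` with made-up coordinates; boxes are given as (y1, x1, y2, x2) corners, and the import path depends on how the package is laid out:

```mojo
# from ...nms import BoundingBox  # adjust to your package layout

def main():
    var a = BoundingBox[DType.float32](0.0, 0.0, 2.0, 2.0)
    var b = BoundingBox[DType.float32](1.0, 1.0, 3.0, 3.0)
    print(a.area())                # 4.0
    print(a.intersection_area(b))  # 1.0
    print(a.iou(b))                # 1.0 / (4.0 + 4.0 - 1.0) ≈ 0.143
```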
--- ## non_max_suppression `non_max_suppression[type: DType](boxes: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], scores: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[int64, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], max_output_boxes_per_class: Int, iou_threshold: SIMD[float32, 1], score_threshold: SIMD[float32, 1])` Buffer semantic overload. `non_max_suppression[: origin.set, //, type: DType, func: fn(SIMD[int64, 1], SIMD[int64, 1], SIMD[int64, 1]) capturing -> None](boxes: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], scores: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], max_output_boxes_per_class: Int, iou_threshold: SIMD[float32, 1], score_threshold: SIMD[float32, 1])` Implements the NonMaxSuppression operator from the ONNX spec. --- ## non_max_suppression_shape_func `non_max_suppression_shape_func[type: DType](boxes: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], scores: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], max_output_boxes_per_class: Int, iou_threshold: SIMD[float32, 1], score_threshold: SIMD[float32, 1]) -> IndexList[2]` Overload to compute the output shape. Can be removed once the graph compiler supports value semantic kernels that allocate their own output.
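For intuition, here is a schematic CPU version of the greedy per-class selection loop that NonMaxSuppression performs; this is an illustration of the algorithm under simplified assumptions (single class, no score threshold), not the kernel's implementation:

```mojo
fn greedy_nms(
    boxes: List[BoundingBox[DType.float32]],
    scores: List[Float32],
    iou_threshold: Float32,
    max_boxes: Int,
) -> List[Int]:
    var suppressed = List[Bool]()
    for _ in range(len(boxes)):
        suppressed.append(False)
    var kept = List[Int]()
    while len(kept) < max_boxes:
        # Pick the highest-scoring surviving box.
        var best = -1
        var best_score: Float32 = -1e30
        for i in range(len(boxes)):
            if not suppressed[i] and scores[i] > best_score:
                best = i
                best_score = scores[i]
        if best < 0:
            break
        kept.append(best)
        suppressed[best] = True
        # Drop every remaining box that overlaps the kept box too much.
        for i in range(len(boxes)):
            if not suppressed[i] and boxes[best].iou(boxes[i]) > iou_threshold:
                suppressed[i] = True
    return kept
```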
--- ## block_reduce `block_reduce[type: DType, max_warps_per_block: Int](val: SIMD[type, 1]) -> SIMD[type, 1]` --- ## group_norm `group_norm[type: DType, rank: Int, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], gamma_fn: fn[Int](IndexList[1]) capturing -> SIMD[type, $0], beta_fn: fn[Int](IndexList[1]) capturing -> SIMD[type, $0], /, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("gpu")](shape: IndexList[rank], epsilon: SIMD[type, 1], groups: SIMD[int32, 1], output: NDBuffer[type, rank, origin, shape, strides], ctx: DeviceContextPtr)` --- ## group_norm_gpu `group_norm_gpu[type: DType, rank: Int, //, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], gamma_fn: fn[Int](IndexList[1]) capturing -> SIMD[type, $0], beta_fn: fn[Int](IndexList[1]) capturing -> SIMD[type, $0]](shape: IndexList[rank, element_type=element_type], epsilon: SIMD[type, 1], output: NDBuffer[type, rank, origin, shape, strides], num_groups: Int, ctx: DeviceContext)` --- ## group_norm_gpu_block `group_norm_gpu_block[type: DType, simd_width: UInt, input_fn: fn[Int](row: Int, col: Int) capturing -> SIMD[type, $0], gamma_fn: fn[Int](IndexList[1]) capturing -> SIMD[type, $0], beta_fn: fn[Int](IndexList[1]) capturing -> SIMD[type, $0]](output: NDBuffer[type, 2, MutableAnyOrigin], epsilon: SIMD[type, 1], num_groups: Int, channels_per_group: Int, spatial: Int)` --- ## group_norm_gpu_warp_tiling `group_norm_gpu_warp_tiling[type: DType, simd_width: UInt, input_fn: fn[Int](row: Int, col: Int) capturing -> SIMD[type, $0], gamma_fn: fn[Int](IndexList[1]) capturing -> SIMD[type, $0], beta_fn: fn[Int](IndexList[1]) capturing -> SIMD[type, $0]](output: NDBuffer[type, 2, MutableAnyOrigin], epsilon: SIMD[type, 1], num_groups: Int, channels_per_group: Int, spatial: Int)` --- ## group_norm_reshape `group_norm_reshape[type: DType, rank: Int](shape: IndexList[rank, element_type=element_type], buf: NDBuffer[type, rank, origin, shape, strides], channels_per_group: Int, spatial: Int) -> NDBuffer[type, 2, origin]` Reshapes an input buffer for group normalization by flattening all dimensions except the group dimension. Returns a 2D buffer of shape (num\_groups \* N, group\_size), where group\_size is the product of channels\_per\_group and spatial. --- ## group_norm_shape `group_norm_shape[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], gamma: NDBuffer[type, 1, origin], beta: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], num_groups: SIMD[int32, 1]) -> IndexList[rank]` --- ## normalization ## Functions * [​`block_reduce`](./block_reduce): * [​`group_norm`](./group_norm): * [​`group_norm_gpu`](./group_norm_gpu): * [​`group_norm_gpu_block`](./group_norm_gpu_block): * [​`group_norm_gpu_warp_tiling`](./group_norm_gpu_warp_tiling): * [​`group_norm_reshape`](./group_norm_reshape): Reshapes an input buffer for group normalization by flattening all dimensions except the group dimension. Returns a 2D buffer of shape (num\_groups \* N, group\_size), where group\_size is the product of channels\_per\_group and spatial. * [​`group_norm_shape`](./group_norm_shape): * [​`layer_norm`](./layer_norm): * [​`layer_norm_cpu`](./layer_norm_cpu): Computes layernorm(elementwise\_fn(x)) across the last dimension of x, where layernorm is defined as $(x - \text{mean}(x)) / \sqrt{\text{var}(x) + \epsilon} \cdot \gamma + \beta$.
* [​`layer_norm_gpu`](./layer_norm_gpu): * [​`layer_norm_gpu_block`](./layer_norm_gpu_block): * [​`layer_norm_gpu_warp_tiling`](./layer_norm_gpu_warp_tiling): * [​`layer_norm_reshape`](./layer_norm_reshape): * [​`layer_norm_shape`](./layer_norm_shape): Compute the output shape of a `layer_norm` operation. * [​`rms_norm`](./rms_norm): * [​`rms_norm_cpu`](./rms_norm_cpu): * [​`rms_norm_gpu`](./rms_norm_gpu): * [​`rms_norm_gpu_block`](./rms_norm_gpu_block): * [​`rms_norm_gpu_warp_tiling`](./rms_norm_gpu_warp_tiling): * [​`rms_norm_shape`](./rms_norm_shape): * [​`welford_block_all_reduce`](./welford_block_all_reduce): * [​`welford_combine`](./welford_combine): * [​`welford_update`](./welford_update): * [​`welford_warp_all_reduce`](./welford_warp_all_reduce): * [​`welford_warp_reduce`](./welford_warp_reduce): --- ## layer_norm `layer_norm[type: DType, rank: Int, input_0_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_1_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], /, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](shape: IndexList[rank], gamma_shape: IndexList[1], beta: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], output: NDBuffer[type, rank, origin, shape, strides], ctx: DeviceContextPtr)` --- ## layer_norm_cpu `layer_norm_cpu[type: DType, //, input_fn: fn[Int](Int, Int) capturing -> SIMD[type, $0], gamma_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](out_buf: NDBuffer[type, 2, origin, shape], beta: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1])` Computes layernorm(elementwise\_fn(x)) across the last dimension of x, where layernorm is defined as $(x - \text{mean}(x)) / \sqrt{\text{var}(x) + \epsilon} \cdot \gamma + \beta$. Currently performs 3 passes over the input data. This can be reduced to 2 by fusing the add, mean, and variance loops using Welford's algorithm. **Parameters:** * ​type (`DType`): The element dtype of the x and out buffers. * ​input\_fn (`fn[Int](Int, Int) capturing -> SIMD[type, $0]`): Function called to generate an input value. * ​gamma\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): Function called to generate a gamma value. **Args:** * ​out\_buf (`NDBuffer[type, 2, origin, shape]`): The output buffer. * ​beta (`NDBuffer[type, 1, origin]`): The beta value to use in the layernorm calculation. * ​epsilon (`SIMD[type, 1]`): The eps value to use in the layernorm calculation.
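Since the note above mentions Welford's algorithm (which the `welford_update`/`welford_combine` helpers implement in SIMD and warp-shuffle form), here is a scalar sketch of the single-pass update; the names are illustrative:

```mojo
fn welford_update_scalar(
    val: Float64,
    mut mean: Float64,
    mut m2: Float64,   # running sum of squared deviations
    mut count: Float64,
):
    count += 1.0
    var delta = val - mean
    mean += delta / count
    m2 += delta * (val - mean)  # uses the already-updated mean

def main():
    var mean: Float64 = 0.0
    var m2: Float64 = 0.0
    var count: Float64 = 0.0
    var data = List[Float64](1.0, 2.0, 3.0, 4.0)
    for i in range(len(data)):
        welford_update_scalar(data[i], mean, m2, count)
    print(mean)        # 2.5
    print(m2 / count)  # population variance: 1.25
```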
`layer_norm_cpu[type: DType, rank: Int, //, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], gamma_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](shape: IndexList[rank, element_type=element_type], beta: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], output: NDBuffer[type, rank, origin, shape, strides])` --- ## layer_norm_gpu `layer_norm_gpu[type: DType, rank: Int, //, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], gamma_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](shape: IndexList[rank, element_type=element_type], beta: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], output: NDBuffer[type, rank, origin, shape, strides], *, ctx: DeviceContext)` --- ## layer_norm_gpu_block `layer_norm_gpu_block[type: DType, //, simd_width: UInt, input_fn: fn[Int](row: Int, col: Int) capturing -> SIMD[type, $0], gamma_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](output: NDBuffer[type, 2, MutableAnyOrigin], beta: NDBuffer[type, 1, MutableAnyOrigin], epsilon: SIMD[type, 1])` --- ## layer_norm_gpu_warp_tiling `layer_norm_gpu_warp_tiling[type: DType, //, simd_width: UInt, input_fn: fn[Int](row: Int, col: Int) capturing -> SIMD[type, $0], gamma_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](output: NDBuffer[type, 2, MutableAnyOrigin], beta: NDBuffer[type, 1, MutableAnyOrigin], epsilon: SIMD[type, 1])` --- ## layer_norm_reshape `layer_norm_reshape[type: DType, rank: Int, //, output_rank: Int](shape: IndexList[rank, element_type=element_type], buf: NDBuffer[type, rank, origin, shape, strides]) -> NDBuffer[type, output_rank, origin]` --- ## layer_norm_shape `layer_norm_shape[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], gamma: NDBuffer[type, 1, origin, __init__[::Intable](1)], beta: NDBuffer[type, 1, origin, __init__[::Intable](1)], epsilon: SIMD[type, 1]) -> IndexList[rank]` Compute the output shape of a `layer_norm` operation. **Parameters:** * ​type (`DType`): Type of the input tensors. * ​rank (`Int`): Rank of the input tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input (`NDBuffer[type, rank, origin]`): The input tensor. * ​gamma (`NDBuffer[type, 1, origin, __init__[::Intable](1)]`): The tensor for gamma coefficient. * ​beta (`NDBuffer[type, 1, origin, __init__[::Intable](1)]`): The tensor for beta coefficient. * ​epsilon (`SIMD[type, 1]`): The epsilon coefficient. **Returns:** The output shape.
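To make the layernorm definition above concrete, here is a naive single-row reference; for brevity, gamma and beta are scalars here, whereas the kernels above read per-element gamma values through `gamma_fn`:

```mojo
from math import sqrt

fn layer_norm_row(
    mut row: List[Float32],
    gamma: Float32,
    beta: Float32,
    eps: Float32,
):
    var n = Float32(len(row))
    var mean: Float32 = 0.0
    for i in range(len(row)):
        mean += row[i]
    mean /= n
    var variance: Float32 = 0.0
    for i in range(len(row)):
        var d = row[i] - mean
        variance += d * d
    variance /= n
    var inv_std = 1.0 / sqrt(variance + eps)
    for i in range(len(row)):
        row[i] = (row[i] - mean) * inv_std * gamma + beta
```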
--- ## rms_norm `rms_norm[type: DType, rank: Int, input_0_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], /, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), multiply_before_cast: Bool = True](shape: IndexList[rank], gamma: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], output: NDBuffer[type, rank, origin], ctx: DeviceContextPtr)` --- ## rms_norm_cpu `rms_norm_cpu[type: DType, //, input_fn: fn[Int](Int, Int) capturing -> SIMD[type, $0], output_fn: fn[Int](Int, Int, SIMD[type, $0]) capturing -> None, multiply_before_cast: Bool](gamma: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], out_shape: IndexList[2])` `rms_norm_cpu[type: DType, rank: Int, //, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int](IndexList[rank], SIMD[type, $0]) capturing -> None, multiply_before_cast: Bool](shape: IndexList[rank], gamma: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1])` --- ## rms_norm_gpu `rms_norm_gpu[type: DType, rank: Int, //, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int](IndexList[rank], SIMD[type, $0]) capturing -> None, multiply_before_cast: Bool](shape: IndexList[rank, element_type=element_type], gamma: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], ctx: DeviceContext)` --- ## rms_norm_gpu_block `rms_norm_gpu_block[type: DType, //, simd_width: Int, max_warps_per_block: Int, input_fn: fn[Int](row: Int, col: Int) capturing -> SIMD[type, $0], output_fn: fn[Int](row: Int, col: Int, val: SIMD[type, $0]) capturing -> None, multiply_before_cast: Bool](gamma: NDBuffer[type, 1, MutableAnyOrigin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], num_cols: Int)` --- ## rms_norm_gpu_warp_tiling `rms_norm_gpu_warp_tiling[type: DType, //, simd_width: Int, max_warps_per_block: Int, input_fn: fn[Int](row: Int, col: Int) capturing -> SIMD[type, $0], output_fn: fn[Int](row: Int, col: Int, val: SIMD[type, $0]) capturing -> None, multiply_before_cast: Bool](gamma: NDBuffer[type, 1, MutableAnyOrigin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], num_cols: Int)` --- ## rms_norm_shape `rms_norm_shape[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], gamma: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1]) -> IndexList[rank]` --- ## welford_block_all_reduce `welford_block_all_reduce[type: DType, //](thread_mean: SIMD[type, 1], thread_m2: SIMD[type, 1], thread_count: SIMD[type, 1], mut res_mean: SIMD[type, 1], mut res_m2: SIMD[type, 1], mut res_count: SIMD[type, 1])` --- ## welford_combine `welford_combine[type: DType, //](mean: SIMD[type, 1], m2: SIMD[type, 1], count: SIMD[type, 1], mut res_mean: SIMD[type, 1], mut res_m2: SIMD[type, 1], mut res_count: SIMD[type, 1])` --- ## welford_update `welford_update[type: DType, //](val: SIMD[type, 1], mut mean: SIMD[type, 1], mut m2: SIMD[type, 1], mut count: SIMD[type, 1])` --- ## welford_warp_all_reduce `welford_warp_all_reduce[type: DType, //](thread_mean: SIMD[type, 1], thread_m2: SIMD[type, 1], thread_count: SIMD[type, 1], mut res_mean: SIMD[type, 1], mut res_m2: SIMD[type, 1], mut res_count: SIMD[type, 1])` --- ## welford_warp_reduce `welford_warp_reduce[type: DType, //](thread_mean: SIMD[type, 1], thread_m2: SIMD[type, 1], thread_count: SIMD[type, 1], mut res_mean: SIMD[type, 1], mut res_m2: SIMD[type, 1], mut res_count: 
SIMD[type, 1])` --- ## pad ## Functions * [​`pad_constant`](./pad_constant): Fill `output` with values from `input`, and edges padded with `constant` based on `paddings`. * [​`pad_reflect`](./pad_reflect): Fill `output` with values from `input`, and edges padded with reflected values from the unpadded region. * [​`pad_repeat`](./pad_repeat): Fill `output` with values from `input`, and edges padded with boundary values from the unpadded region. * [​`pad_shape`](./pad_shape): Compute the output shape of a `pad` operation, and assert the inputs are compatible. --- ## pad_constant `pad_constant[output_layout: Layout, input_layout: Layout, type: DType, paddings_type: DType, constant_type: DType](output: LayoutTensor[type, output_layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[type, input_layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings: UnsafePointer[SIMD[paddings_type, 1]], constant: SIMD[constant_type, 1])` Fill `output` with values from `input`, and edges padded with `constant` based on `paddings`. Example:

```mojo
var input_shape = (X, Y, Z)
var paddings = [x0, x1, y0, y1, z0, z1]

out[x, y, z] = input[x - x0, y - y0, z - z0] if x ∈ [x0, x0 + X] && y ∈ [y0, y0 + Y] && z ∈ [z0, z0 + Z] else constant
```

**Args:** * ​output (`LayoutTensor[type, output_layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output buffer. * ​input (`LayoutTensor[type, input_layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input buffer. * ​paddings (`UnsafePointer[SIMD[paddings_type, 1]]`): Ordered (before, after) padding sizes for each axis. * ​constant (`SIMD[constant_type, 1]`): The constant to pad output with. --- ## pad_reflect `pad_reflect[output_layout: Layout, input_layout: Layout, type: DType, paddings_type: DType](output: LayoutTensor[type, output_layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[type, input_layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings: UnsafePointer[SIMD[paddings_type, 1]])` Fill `output` with values from `input`, and edges padded with reflected values from the unpadded region. Example:

```mojo
var input = [[1, 2], [3, 4]]
var paddings = [2, 2, 1, 0]
```

Yields:

```mojo
output = [[2, 1, 2], [4, 3, 4], [2, 1, 2], [4, 3, 4], [2, 1, 2], [4, 3, 4]]
```

**Args:** * ​output (`LayoutTensor[type, output_layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output buffer. * ​input (`LayoutTensor[type, input_layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input buffer. * ​paddings (`UnsafePointer[SIMD[paddings_type, 1]]`): Ordered (before, after) padding sizes for each axis.
--- ## pad_repeat `pad_repeat[output_layout: Layout, input_layout: Layout, type: DType, paddings_type: DType](output: LayoutTensor[type, output_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[type, input_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings: UnsafePointer[SIMD[paddings_type, 1]])` Fill `output` with values from `input`, and edges padded with boundary values from the unpadded region. Example:

```mojo
var input = [[1, 2], [3, 4]]
var paddings = [2, 2, 1, 0]
```

Yields:

```mojo
output = [[1, 1, 2], [1, 1, 2], [1, 1, 2], [3, 3, 4], [3, 3, 4], [3, 3, 4]]
```

**Parameters:** * ​output\_layout (`Layout`): Layout of the output buffer. * ​input\_layout (`Layout`): Layout of the input buffer. * ​type (`DType`): DType of the input/output buffer. * ​paddings\_type (`DType`): DType of the paddings buffer. **Args:** * ​output (`LayoutTensor[type, output_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output buffer. * ​input (`LayoutTensor[type, input_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input buffer. * ​paddings (`UnsafePointer[SIMD[paddings_type, 1]]`): Ordered (before, after) padding sizes for each axis. --- ## pad_shape `pad_shape[input_type: DType, paddings_type: DType, single_thread_blocking_override: Bool](input_buf: LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings_buf: LayoutTensor[paddings_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> IndexList[layout.rank()]` Compute the output shape of a `pad` operation, and assert the inputs are compatible. **Parameters:** * ​input\_type (`DType`): Type of the input tensor. * ​paddings\_type (`DType`): Type of the padding tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input\_buf (`LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to pad. * ​paddings\_buf (`LayoutTensor[paddings_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The paddings tensor, of shape (input\_rank, 2). **Returns:** The output shape.
--- ## get_padding_output_shape `get_padding_output_shape[rank: Int](input_shape: IndexList[rank], paddings: LayoutTensor[index, __init__[::Origin[::Bool(IntTuple((rank * 2))), origin]) -> IndexList[rank]` --- ## pad_gpu ## Functions * [​`get_padding_output_shape`](./get_padding_output_shape): * [​`pad_constant`](./pad_constant): Fill `output` with values from `input`, and edges padded with `constant` based on `paddings`. --- ## pad_constant `pad_constant[rank: Int, type: DType, padding_type: DType](output: UnsafePointer[SIMD[type, 1]], output_shape: IndexList[rank], input: UnsafePointer[SIMD[type, 1]], input_shape: IndexList[rank], paddings: UnsafePointer[SIMD[padding_type, 1]], constant: SIMD[type, 1], ctx: DeviceContext)` Fill `output` with values from `input`, and edges padded with `constant` based on `paddings`. Example:

```mojo
var input_shape = (X, Y, Z)
var paddings = [x0, x1, y0, y1, z0, z1]

out[x, y, z] = input[x - x0, y - y0, z - z0] if x ∈ [x0, x0 + X] && y ∈ [y0, y0 + Y] && z ∈ [z0, z0 + Z] else constant
```

**Args:** * ​output (`UnsafePointer[SIMD[type, 1]]`): The output buffer. * ​output\_shape (`IndexList[rank]`): The output shape. * ​input (`UnsafePointer[SIMD[type, 1]]`): The input buffer. * ​input\_shape (`IndexList[rank]`): The input shape. * ​paddings (`UnsafePointer[SIMD[padding_type, 1]]`): Ordered (before, after) padding sizes for each axis. * ​constant (`SIMD[type, 1]`): The constant to pad output with. * ​ctx (`DeviceContext`): Device context for the participating GPU. --- ## PoolMethod `@register_passable(trivial)` `struct PoolMethod` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `AVG` `alias AVG = PoolMethod(1)` ### `MAX` `alias MAX = PoolMethod(0)` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` --- ## avg_pool `avg_pool[type: DType, int_type: DType, rank: Int = 4, count_boundary: Bool = False](input: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], filter: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], strides: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dilations: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ceil_mode: Bool = False)` Computes the average pool. **Parameters:** * ​count\_boundary (`Bool`): Whether to count the boundary in the average computation.
**Args:** * ​input (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Batched image input to the pool2d operator. * ​filter (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Filter size on height and width dimensions with assumed tuple def (filter\_h, filter\_w). * ​strides (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Strides on height and width dimensions with assumed tuple def (stride\_h, stride\_w). * ​dilations (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Dilations on height and width dimensions with assumed tuple def (dilation\_h, dilation\_w). * ​paddings (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Paddings on height and width dimensions with assumed tuple def (pad\_h\_before, pad\_h\_after, pad\_w\_before, pad\_w\_after). * ​output (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Pre-allocated output tensor space. * ​ceil\_mode (`Bool`): Ceiling mode defines the output shape and implicit padding. --- ## avg_pool_gpu `avg_pool_gpu[type: DType, int_type: DType, count_boundary: Bool = False](ctx: DeviceContext, input: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], filter: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], strides: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dilations: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ceil_mode: Bool = False)` Computes the average pool on GPU. **Parameters:** * ​count\_boundary (`Bool`): Whether to count the boundary in the average computation. **Args:** * ​ctx (`DeviceContext`): The DeviceContext to use for GPU execution.
* ​input (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On device) Batched image input to the pool2d operator. * ​filter (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On host) Filter size on height and width dimensions with assumed tuple def (filter\_h, filter\_w). * ​strides (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On host) Strides on height and width dimensions with assumed tuple def (stride\_h, stride\_w). * ​dilations (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On host) Dilations on height and width dimensions with assumed tuple def (dilation\_h, dilation\_w). * ​paddings (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On host) Paddings on height and width dimensions with assumed tuple def (pad\_h\_before, pad\_h\_after, pad\_w\_before, pad\_w\_after). * ​output (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On device) Pre-allocated output tensor space. * ​ceil\_mode (`Bool`): Ceiling mode defines the output shape and implicit padding. --- ## pool ## Structs * [​`PoolMethod`](./PoolMethod): ## Functions * [​`avg_pool`](./avg_pool): Computes the average pool. * [​`avg_pool_gpu`](./avg_pool_gpu): Computes the average pool on GPU. * [​`max_pool`](./max_pool): Computes fp32 pooling. * [​`max_pool_gpu`](./max_pool_gpu): Computes max pooling on GPU. * [​`pool_shape`](./pool_shape): * [​`pool_shape_ceil`](./pool_shape_ceil): * [​`pool_shape_impl`](./pool_shape_impl): Compute the output shape of a pooling operation, and assert the inputs are compatible. Works for 2D pool operations only in the NHWC format.
--- ## max_pool `max_pool[type: DType, int_type: DType](input: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], filter: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], strides: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dilations: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ceil_mode: Bool = False)` Computes fp32 pooling. **Args:** * ​input (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Batched image input to the pool2d operator. * ​filter (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Filter size on height and width dimensions with assumed tuple def (filter\_h, filter\_w). * ​strides (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Strides on height and width dimensions with assumed tuple def (stride\_h, stride\_w). * ​dilations (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Dilations on height and width dimensions with assumed tuple def (dilation\_h, dilation\_w). * ​paddings (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Paddings on height and width dimensions with assumed tuple def (pad\_h\_before, pad\_h\_after, pad\_w\_before, pad\_w\_after). * ​output (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Pre-allocated output tensor space. * ​ceil\_mode (`Bool`): Ceiling mode defines the output shape and implicit padding.
--- ## max_pool_gpu `max_pool_gpu[type: DType, int_type: DType](ctx: DeviceContext, input: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], filter: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], strides: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dilations: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ceil_mode: Bool = False)` Computes max pooling on GPU. **Args:** * ​ctx (`DeviceContext`): The DeviceContext to use for GPU execution. * ​input (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On device) Batched image input to the pool2d operator. * ​filter (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On host) Filter size on height and width dimensions with assumed tuple def (filter\_h, filter\_w). * ​strides (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On host) Strides on height and width dimensions with assumed tuple def (stride\_h, stride\_w). * ​dilations (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On host) Dilations on height and width dimensions with assumed tuple def (dilation\_h, dilation\_w). * ​paddings (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On host) Paddings on height and width dimensions with assumed tuple def (pad\_h\_before, pad\_h\_after, pad\_w\_before, pad\_w\_after). * ​output (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On device) Pre-allocated output tensor space. * ​ceil\_mode (`Bool`): Ceiling mode defines the output shape and implicit padding.
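Before the `pool_shape` helpers below, it may help to see the standard 2D pooling output-shape arithmetic they compute per spatial dimension; this sketch assumes the usual convention (effective filter span with dilation, floor or ceil rounding per `ceil_mode`), and its names are illustrative rather than part of this API:

```mojo
fn pool_out_dim(
    in_dim: Int,
    filter: Int,
    stride: Int,
    dilation: Int,
    pad_before: Int,
    pad_after: Int,
    ceil_mode: Bool,
) -> Int:
    var effective_filter = dilation * (filter - 1) + 1
    var span = in_dim + pad_before + pad_after - effective_filter
    if ceil_mode:
        return (span + stride - 1) // stride + 1  # round up
    return span // stride + 1  # round down

def main():
    # 224x224 input, 3x3 filter, stride 2, dilation 1, padding 1 on each side.
    print(pool_out_dim(224, 3, 2, 1, 1, 1, False))  # 112
```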
--- ## pool_shape `pool_shape[input_type: DType, filter_type: DType, strides_type: DType, dilations_type: DType, paddings_type: DType, single_thread_blocking_override: Bool](input_buf: LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], filter_buf: LayoutTensor[filter_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], strides_buf: LayoutTensor[strides_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dilations_buf: LayoutTensor[dilations_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings_buf: LayoutTensor[paddings_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> IndexList[layout.rank()]` --- ## pool_shape_ceil `pool_shape_ceil[input_type: DType, filter_type: DType, strides_type: DType, dilations_type: DType, paddings_type: DType, single_thread_blocking_override: Bool](input_buf: LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], filter_buf: LayoutTensor[filter_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], strides_buf: LayoutTensor[strides_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dilations_buf: LayoutTensor[dilations_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings_buf: LayoutTensor[paddings_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> IndexList[layout.rank()]` --- ## pool_shape_impl `pool_shape_impl[input_type: DType, filter_type: DType, strides_type: DType, dilations_type: DType, paddings_type: DType, single_thread_blocking_override: Bool, ceil_mode: Bool](input_buf: LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], filter_buf: LayoutTensor[filter_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], strides_buf: LayoutTensor[strides_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dilations_buf: LayoutTensor[dilations_type, layout, origin, address_space=address_space, 
element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings_buf: LayoutTensor[paddings_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> IndexList[layout.rank()]` Compute the output shape of a pooling operation, and assert the inputs are compatible. Works for 2D pool operations only in the NHWC format. **Parameters:** * ​input\_type (`DType`): Type of the input tensor. * ​filter\_type (`DType`): Type of the filter tensor. * ​strides\_type (`DType`): Type of the strides tensor. * ​dilations\_type (`DType`): Type of the dilations tensor. * ​paddings\_type (`DType`): Type of the paddings tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​ceil\_mode (`Bool`): Defines the rounding mode for the shape calculation. **Args:** * ​input\_buf (`LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * ​filter\_buf (`LayoutTensor[filter_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The filter size buffer. * ​strides\_buf (`LayoutTensor[strides_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The strides size buffer. * ​dilations\_buf (`LayoutTensor[dilations_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The dilations size buffer. * ​paddings\_buf (`LayoutTensor[paddings_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The paddings size buffer. **Returns:** The output shape. --- ## rand_uniform ## Functions * [​`random_uniform`](./random_uniform): Call `output_fn` with values generated from a uniform distribution on \[lower\_bound, upper\_bound] for floating-point types or \[lower\_bound, upper\_bound) for integer types. --- ## random_uniform `random_uniform[: origin.set, dtype: DType, rank: Int, //, output_fn: fn[Int, Int](idx: IndexList[$1], val: SIMD[dtype, $0]) capturing -> None, target: StringSlice[StaticConstantOrigin]](shape: IndexList[rank], lower_bound: SIMD[dtype, 1], upper_bound: SIMD[dtype, 1], seed_value: SIMD[uint64, 1], ctx: DeviceContextPtr)` Call `output_fn` with values generated from a uniform distribution on \[lower\_bound, upper\_bound] for floating-point types or \[lower\_bound, upper\_bound) for integer types. **Parameters:** * ​dtype (`DType`): The data type to generate. * ​rank (`Int`): The rank of the underlying buffer. * ​output\_fn (`fn[Int, Int](idx: IndexList[$1], val: SIMD[dtype, $0]) capturing -> None`): The function which stores the generated values. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * ​shape (`IndexList[rank]`): The shape of the output being stored into by output\_fn.
* ​lower\_bound (`SIMD[dtype, 1]`): The lower bound on the uniform range. * ​upper\_bound (`SIMD[dtype, 1]`): The upper bound on the uniform range. * ​seed\_value (`SIMD[uint64, 1]`): Seed value used to initialize the random number generator. * ​ctx (`DeviceContextPtr`): The device context. --- ## randn ## Functions * [​`random_normal`](./random_normal): Fill `output` with values generated from a Normal(mean, variance) distribution. --- ## random_normal `random_normal[type: DType, mean: SIMD[float64, 1], variance: SIMD[float64, 1]](output: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Fill `output` with values generated from a Normal(mean, variance) distribution. **Args:** * ​output (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output buffer. --- ## repeat_interleave ## Functions * [​`repeat_interleave`](./repeat_interleave): Fill `output` by repeating values from `input` along `axis` based on the values in the `repeats` buffer. * [​`repeat_interleave_shape`](./repeat_interleave_shape): --- ## repeat_interleave `repeat_interleave[type: DType, rank: Int, type_repeats: DType](input: NDBuffer[type, rank, origin], repeats: NDBuffer[type_repeats, 1, origin], axis: Int, output: NDBuffer[type, rank, origin])` Fill `output` by repeating values from `input` along `axis` based on the values in the `repeats` buffer. This is intended to implement the same functionality as `torch.repeat_interleave`. **Args:** * ​input (`NDBuffer[type, rank, origin]`): The input buffer. * ​repeats (`NDBuffer[type_repeats, 1, origin]`): The number of repetitions for each element in input. * ​axis (`Int`): The axis along which to repeat values. * ​output (`NDBuffer[type, rank, origin]`): The output buffer.
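To make the semantics concrete, here is a minimal pure-Mojo sketch of the same per-element repetition over a single axis; `repeat_interleave_1d` is a hypothetical helper, not part of this API, and it ignores the buffer and device machinery of the real kernel.

```mojo
fn repeat_interleave_1d(input: List[Int], repeats: List[Int]) -> List[Int]:
    # Each input[i] is copied repeats[i] times, preserving order, which
    # mirrors torch.repeat_interleave along a single axis. The output
    # length is the sum of the entries in repeats.
    var out = List[Int]()
    for i in range(len(input)):
        for _ in range(repeats[i]):
            out.append(input[i])
    return out
```

For example, an input of `[4, 7]` with repeats `[2, 3]` yields `[4, 4, 7, 7, 7]`.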
--- ## repeat_interleave_shape `repeat_interleave_shape[type_repeats: DType](input: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], repeats: NDBuffer[type_repeats, 1, origin], axis: Int) -> IndexList[rank]` --- ## reshape ## Functions * [​`ndbuffer_reshape`](./ndbuffer_reshape): * [​`reshape`](./reshape): * [​`reshape_shape`](./reshape_shape): --- ## ndbuffer_reshape `ndbuffer_reshape[rank: Int, output_rank: Int, type: DType, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], new_shape: IndexList[output_rank]) -> NDBuffer[type, output_rank, origin]` --- ## reshape `reshape[rank: Int, type: DType, //, output_rank: Int, single_thread_blocking_override: Bool = True](input: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], new_shape: IndexList[output_rank]) -> NDBuffer[type, output_rank, origin]` --- ## reshape_shape `reshape_shape[input_rank: Int, output_rank: Int, input_type: DType, target_shape_type: DType, single_thread_blocking_override: Bool](input_buf: NDBuffer[input_type, input_rank, origin], target_shape_buf: NDBuffer[target_shape_type, 1, origin]) -> IndexList[output_rank]` --- ## CoordinateTransformationMode `struct CoordinateTransformationMode` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `AlignCorners` `alias AlignCorners = CoordinateTransformationMode(1)` ### `Asymmetric` `alias Asymmetric = CoordinateTransformationMode(2)` ### `HalfPixel` `alias HalfPixel = CoordinateTransformationMode(0)` ### `HalfPixel1D` `alias HalfPixel1D = CoordinateTransformationMode(3)` ## Methods ### `__init__` `@implicit` `__init__(out self, value: Int)` ### `__eq__` `__eq__(self, other: Self) -> Bool` --- ## InterpolationMode `struct InterpolationMode` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `Linear` `alias Linear = InterpolationMode(0)` ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` --- ## Interpolator `@register_passable(trivial)` `struct Interpolator[mode: InterpolationMode]` ## Fields * ​cubic\_coeff (`SIMD[float32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(cubic_coeff: SIMD[float32, 1]) -> Self` `__init__() -> Self` ### `filter_length` `static filter_length() -> Int` ### `filter` `filter(self, x: SIMD[float32, 1]) -> SIMD[float32, 1]` --- ## RoundMode `struct RoundMode` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `Ceil` `alias Ceil = RoundMode(3)` ### `Floor` `alias Floor = RoundMode(2)` ### `HalfDown` `alias HalfDown = RoundMode(0)` ### `HalfUp` `alias HalfUp = RoundMode(1)` ## Methods ### `__init__` `@implicit` `__init__(out self, value: Int)` ### `__eq__` `__eq__(self, other: Self) -> Bool` --- ## coord_transform `coord_transform[mode: CoordinateTransformationMode](out_coord: Int, in_dim: Int, out_dim: Int, scale: SIMD[float32, 1]) -> SIMD[float32, 1]` --- ## resize ## Structs * [​`CoordinateTransformationMode`](./CoordinateTransformationMode): * [​`InterpolationMode`](./InterpolationMode): * [​`Interpolator`](./Interpolator): * [​`RoundMode`](./RoundMode): ## Functions * [​`coord_transform`](./coord_transform): * [​`interpolate_point_1d`](./interpolate_point_1d): * 
[​`linear_filter`](./linear_filter): This is a tent filter. * [​`resize_linear`](./resize_linear): Resizes input to output shape using linear interpolation. * [​`resize_nearest_neighbor`](./resize_nearest_neighbor): --- ## interpolate_point_1d `interpolate_point_1d[coordinate_transformation_mode: CoordinateTransformationMode, antialias: Bool, rank: Int, type: DType, interpolation_mode: InterpolationMode](interpolator: Interpolator[interpolation_mode], dim: Int, out_coords: IndexList[rank], scale: SIMD[float32, 1], input: NDBuffer[type, rank, origin], output: NDBuffer[type, rank, origin])` --- ## linear_filter `linear_filter(x: SIMD[float32, 1]) -> SIMD[float32, 1]` This is a tent filter: f(x) = 1 + x for -1 <= x < 0, f(x) = 1 - x for 0 <= x <= 1, and f(x) = 0 otherwise. --- ## resize_linear `resize_linear[coordinate_transformation_mode: CoordinateTransformationMode, antialias: Bool, rank: Int, type: DType](input: NDBuffer[type, rank, origin], output: NDBuffer[type, rank, origin])` Resizes input to output shape using linear interpolation. Does not use an anti-aliasing filter for downsampling (coming soon). **Parameters:** * ​coordinate\_transformation\_mode (`CoordinateTransformationMode`): How to map a coordinate in output to a coordinate in input. * ​antialias (`Bool`): Whether or not to use an antialiasing linear/cubic filter, which, when downsampling, uses more points to avoid aliasing artifacts. Effectively stretches the filter by a factor of 1 / scale. * ​rank (`Int`): Rank of the input and output. * ​type (`DType`): Type of input and output. **Args:** * ​input (`NDBuffer[type, rank, origin]`): The input to be resized. * ​output (`NDBuffer[type, rank, origin]`): The output containing the resized input. --- ## resize_nearest_neighbor `resize_nearest_neighbor[coordinate_transformation_mode: CoordinateTransformationMode, round_mode: RoundMode, rank: Int, type: DType](input: NDBuffer[type, rank, origin], output: NDBuffer[type, rank, origin])` --- ## Weighted2DPoint `@register_passable(trivial)` `struct Weighted2DPoint[type: DType]` Utility class to wrap 2-d point coordinates and a floating point weight for bilinear interpolation. ## Fields * ​y (`Int`): * ​x (`Int`): * ​w (`SIMD[type, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(y: Int, x: Int, weight: SIMD[type, 1]) -> Self` --- ## roi_align ## Structs * [​`Weighted2DPoint`](./Weighted2DPoint): Utility class to wrap 2-d point coordinates and a floating point weight for bilinear interpolation. ## Functions * [​`roi_align_nhwc`](./roi_align_nhwc): Compute ROIAlign over a batch of ROIs of shape \[M, 5], where the first element is the batch index, followed by the region box coordinates (y0, x0), (y1, x1). For inputs in NHWC format. The output shape is \[M, output\_height, output\_width, C].
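The bilinear weights that `Weighted2DPoint` carries follow the standard scheme: a continuous sample point contributes to its four integer neighbors in proportion to the opposite fractional areas. A minimal sketch of that weight computation (a hypothetical free function, not part of this module):

```mojo
from math import floor

fn bilinear_corner_weights(y: Float64, x: Float64) -> SIMD[DType.float64, 4]:
    # Fractional offsets of the sample point inside its pixel cell.
    var dy = y - floor(y)
    var dx = x - floor(x)
    # Weights for (top-left, top-right, bottom-left, bottom-right);
    # they are non-negative and always sum to 1.
    return SIMD[DType.float64, 4](
        (1.0 - dy) * (1.0 - dx),
        (1.0 - dy) * dx,
        dy * (1.0 - dx),
        dy * dx,
    )
```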
--- ## roi_align_nhwc `roi_align_nhwc[type: DType, output_layout: Layout, input_layout: Layout, roi_layout: Layout, //, aligned: Bool, mode: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("AVG")](output: LayoutTensor[type, output_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[type, input_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], rois: LayoutTensor[type, roi_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output_height: Int, output_width: Int, in_spatial_scale: SIMD[dtype, 1], in_sampling_ratio: SIMD[dtype, 1])` Compute ROIAlign over a batch of ROIs of shape \[M, 5], where the first element is the batch index, followed by the region box coordinates (y0, x0), (y1, x1). For inputs in NHWC format. The output shape is \[M, output\_height, output\_width, C]. **Parameters:** * ​type (`DType`): Type of the input tensor. * ​output\_layout (`Layout`): The output layout. * ​input\_layout (`Layout`): The input layout. * ​roi\_layout (`Layout`): The layout of the regions of interest (ROI). * ​aligned (`Bool`): If not true, offset the ROIs by 0.5. * ​mode (`StringSlice[StaticConstantOrigin]`): The pooling mode: "AVG" for average and "MAX" for max pooling. --- ## apply_penalties_to_logits `apply_penalties_to_logits[logit_type: DType, penalty_type: DType, //, target: StringSlice[StaticConstantOrigin]](logits: LayoutTensor[logit_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], compressed_frequency_data: LayoutTensor[int32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], frequency_offsets: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], frequency_penalty: LayoutTensor[penalty_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], presence_penalty: LayoutTensor[penalty_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], repetition_penalty: LayoutTensor[penalty_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ctx: DeviceContextPtr)` Apply penalties to the logits based on the frequency of the tokens in the batch. The frequency data is stored in a CSR format, where frequency\_offsets holds the starting index of each sequence in the frequency\_data array.
The frequency\_data array is a 2D array, where: * frequency\_data\[i, 0] is the token id * frequency\_data\[i, 1] is the frequency of the token in the sequence --- ## sampling ## Functions * [​`apply_penalties_to_logits`](./apply_penalties_to_logits): Apply penalties to the logits based on the frequency of the tokens in the batch. * [​`update_frequency_data`](./update_frequency_data): Update the frequency data for the given new tokens. --- ## update_frequency_data `update_frequency_data[token_type: DType, //, target: StringSlice[StaticConstantOrigin]](compressed_frequency_data: LayoutTensor[int32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], frequency_offsets: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], new_tokens: LayoutTensor[token_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ctx: DeviceContextPtr)` Update the frequency data for the given new tokens. The frequency data is stored in a CSR format. This kernel expects there will be enough padding for each sequence to store the new tokens. --- ## get_sliding_window_out_dim `get_sliding_window_out_dim[ceil_mode: Bool = False](in_dim: Int, ft_dim: Int, dilation: Int, stride: Int, pad: Int) -> Int` Return output dimension for a sliding window operation along some dimension. **Parameters:** * ​ceil\_mode (`Bool`): Define rounding mode for shape calculation. **Args:** * ​in\_dim (`Int`): The size of the input dimension. * ​ft\_dim (`Int`): The size of the corresponding filter dimension. * ​dilation (`Int`): The dilation for the sliding window operation. * ​stride (`Int`): The stride for the sliding window operation. * ​pad (`Int`): The total padding for the sliding window operation. **Returns:** The size of the output dimension. --- ## shapes ## Functions * [​`get_sliding_window_out_dim`](./get_sliding_window_out_dim): Return output dimension for a sliding window operation along some dimension. 
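The arithmetic behind `get_sliding_window_out_dim` is the usual convolution/pooling shape formula; below is a minimal runnable sketch under that assumption, with `ceil_mode` as a runtime flag rather than a compile-time parameter and a hypothetical function name.

```mojo
fn sliding_window_out_dim(
    in_dim: Int, ft_dim: Int, dilation: Int, stride: Int, pad: Int, ceil_mode: Bool
) -> Int:
    # Extent of the filter once dilation spreads its taps out.
    var effective_filter = dilation * (ft_dim - 1) + 1
    # Number of valid starting positions, before dividing by the stride.
    var span = in_dim + pad - effective_filter
    if ceil_mode:
        # Round the division up so a trailing partial window still yields an output.
        return (span + stride - 1) // stride + 1
    return span // stride + 1

fn main():
    # A 3-wide filter with stride 2 over 10 elements and no padding: 4 outputs.
    print(sliding_window_out_dim(10, 3, 1, 2, 0, False))
```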
--- ## copy_to_slice `copy_to_slice[type: DType, start_type: DType, end_type: DType, step_type: DType, in_rank: Int, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](buffer: NDBuffer[type, in_rank, origin], in_slice: NDBuffer[type, in_rank, origin], start: NDBuffer[start_type, 1, origin], end: NDBuffer[end_type, 1, origin], step: NDBuffer[step_type, 1, origin], context: DeviceContextPtr = DeviceContextPtr())` --- ## slice ## Functions * [​`copy_to_slice`](./copy_to_slice): * [​`slice_as_copy`](./slice_as_copy): * [​`slice_as_view`](./slice_as_view): * [​`slice_dim_as_view`](./slice_dim_as_view): * [​`slice_shape`](./slice_shape): --- ## slice_as_copy `slice_as_copy[type: DType, index_type: DType, in_rank: Int](output: NDBuffer[type, in_rank, origin], tensor: NDBuffer[type, in_rank, origin], start: NDBuffer[index_type, 1, origin], end: NDBuffer[index_type, 1, origin], step: NDBuffer[index_type, 1, origin])` --- ## slice_as_view `slice_as_view[type: DType, start_type: DType, end_type: DType, step_type: DType, rank: Int](tensor: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], starts: NDBuffer[start_type, 1, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], ends: NDBuffer[end_type, 1, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], steps: NDBuffer[step_type, 1, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> NDBuffer[type, rank, origin]` --- ## slice_dim_as_view `slice_dim_as_view[type: DType, rank: Int, dim: Int](tensor: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], start: Int, end: Int, step: Int) -> NDBuffer[type, rank, origin]` --- ## slice_shape `slice_shape[input_rank: Int, input_type: DType, start_type: DType, stop_type: DType, step_type: DType, single_thread_blocking_override: Bool](input_buf: NDBuffer[input_type, input_rank, origin], start_buf: NDBuffer[start_type, 1, origin], stop_buf: NDBuffer[stop_type, 1, origin], step_buf: NDBuffer[step_type, 1, origin]) -> IndexList[input_rank]` --- ## identity `identity(x: SIMD[dtype, size]) -> SIMD[dtype, size]` --- ## softmax ## Functions * [​`identity`](./identity): * [​`logsoftmax`](./logsoftmax): Performs an unbatched logsoftmax on an input tensor using the three-pass algorithm. * [​`mul`](./mul): * [​`reciprocal`](./reciprocal): * [​`reduce_add_simd`](./reduce_add_simd): This function adds val to either the scalar value or the vector value depending on the step\_simd\_width. This is useful when the simd\_width varies between iterations as in vectorize. * [​`softmax`](./softmax): * [​`softmax_2_pass`](./softmax_2_pass): Performs an unbatched softmax on an input tensor using the two-pass online algorithm. * [​`softmax_3_pass`](./softmax_3_pass): Performs an unbatched softmax on an input tensor using the three-pass algorithm. * [​`softmax_kernel`](./softmax_kernel): * [​`sub`](./sub): --- ## logsoftmax `logsoftmax[simd_width: Int, buffer_size: Dim, type: DType, origins: origin.set, input_fn_1d: fn[Int](Int) capturing -> SIMD[type, $0]](output: NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)])` Performs an unbatched logsoftmax on an input tensor using the three-pass algorithm.
The unbatched three-pass softmax is defined as: procedure SoftmaxUnbatched(Input) maxVal = -∞ accum = 0 STEP 1: find the max value in each batch for i = 0 to N do maxVal = max(maxVal, Input\[b, i]) end for STEP 2: compute the sum of exponentials of each batch for i = 0 to N do Output\[b, i] = Input\[b, i] - maxVal accum += exp(Output\[b, i]) end for STEP 3: normalize each batch for i = 0 to N do Output\[b, i] -= log(accum) end for **Parameters:** * ​simd\_width (`Int`): The simd\_width to use in vectorization. * ​buffer\_size (`Dim`): The size of the input and output buffers. * ​type (`DType`): The type of the input and output buffers. * ​origins (`origin.set`): The OriginSet of captured arguments by the input\_fn\_1d. * ​input\_fn\_1d (`fn[Int](Int) capturing -> SIMD[type, $0]`): The elementwise input lambda. **Args:** * ​output (`NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)]`): The output buffer in which to store the softmax values. `logsoftmax[: origin.set, //, type: DType, simd_width: Int, rank: Int, static_shape: DimList, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](shape: IndexList[rank], output: NDBuffer[type, rank, origin, static_shape], axis: Int)` `logsoftmax[type: DType, simd_width: Int, rank: Int, static_shape: DimList](input: NDBuffer[type, rank, origin, static_shape], output: NDBuffer[type, rank, origin, static_shape], axis: Int)` --- ## mul `mul(x: SIMD[dtype, size], y: SIMD[dtype, size]) -> SIMD[dtype, size]` --- ## reciprocal `reciprocal(x: SIMD[dtype, size]) -> SIMD[dtype, size]` --- ## reduce_add_simd `reduce_add_simd[simd_width: Int, step_simd_width: Int, type: DType](mut scalar: SIMD[type, 1], mut vector: SIMD[type, simd_width], val: SIMD[type, step_simd_width])` This function adds val to either the scalar value or the vector value depending on the step\_simd\_width. This is useful when the simd\_width varies between iterations as in vectorize. --- ## softmax `softmax[type: DType, simd_width: Int, rank: Int, static_shape: DimList](input: NDBuffer[type, rank, origin, static_shape], output: NDBuffer[type, rank, origin, static_shape], axis: Int)` `softmax[: origin.set, //, type: DType, simd_width: Int, rank: Int, static_shape: DimList, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](shape: IndexList[rank], output: NDBuffer[type, rank, origin, static_shape], axis: Int, context: DeviceContextPtr = DeviceContextPtr())` --- ## softmax_2_pass `softmax_2_pass[simd_width: Int, buffer_size: Dim, type: DType](output: NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)], input: NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)])` Performs an unbatched softmax on an input tensor using the two-pass online algorithm. The unbatched two-pass online softmax is described in "Online normalizer calculation for softmax" and "A full-stack search technique for domain optimized deep learning accelerators", and is defined as: procedure SoftmaxUnbatched(Input) runningMax = -∞ runningSum = 0 STAGE 1: for i = 0 to N do newMax = max(runningMax, Input\[i]) runningSum = runningSum\*exp(runningMax-newMax) + exp(Input\[i]-newMax) runningMax = newMax end for for i = 0 to N do Output\[i] = exp(Input\[i] - runningMax) / runningSum end for **Parameters:** * ​simd\_width (`Int`): The simd\_width to use in vectorization. * ​buffer\_size (`Dim`): The size of the input and output buffers.
* ​type (`DType`): The type of the input and output buffers. **Args:** * ​output (`NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)]`): The output buffer in which to store the softmax values. * ​input (`NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)]`): The input buffer used to compute the softmax. --- ## softmax_3_pass `softmax_3_pass[simd_width: Int, buffer_size: Dim, type: DType, origins: origin.set, input_fn_1d: fn[Int](Int) capturing -> SIMD[type, $0]](output: NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)])` Performs an unbatched softmax on an input tensor using the three-pass algorithm. The unbatched three-pass softmax is defined as: procedure SoftmaxUnbatched(Input) maxVal = -∞ denom = 0 STEP 1: find the max value in each batch for i = 0 to N do maxVal = max(maxVal, Input\[b, i]) end for STEP 2: compute the exponential for each batch for i = 0 to N do Output\[b, i] = exp(Input\[b, i] - maxVal) denom += Output\[b, i] end for STEP 3: normalize each batch for i = 0 to N do Output\[b, i] /= denom end for **Parameters:** * ​simd\_width (`Int`): The simd\_width to use in vectorization. * ​buffer\_size (`Dim`): The size of the input and output buffers. * ​type (`DType`): The type of the input and output buffers. * ​origins (`origin.set`): The OriginSet of captured arguments by the input\_fn\_1d. * ​input\_fn\_1d (`fn[Int](Int) capturing -> SIMD[type, $0]`): The elementwise input lambda. **Args:** * ​output (`NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)]`): The output buffer in which to store the softmax values. --- ## softmax_kernel `softmax_kernel[: origin.set, //, BLOCK_SIZE: Int, input_fn: fn[DType, Int, Int](IndexList[$2]) capturing -> SIMD[$0, $1], type: DType, rank: Int, accum_type: DType = get_accum_type[::DType,::DType]()](shape: IndexList[rank], output: NDBuffer[type, rank, MutableAnyOrigin])` --- ## sub `sub(x: SIMD[dtype, size], y: SIMD[dtype, size]) -> SIMD[dtype, size]` --- ## split ## Functions * [​`split`](./split): --- ## split `split[type: DType, num_outputs: Int, target: StringSlice[StaticConstantOrigin], trace_description: StringSlice[StaticConstantOrigin], outputs_origin: MutableOrigin, outputs_layout: Layout](input: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], axis: Int, outputs: StaticTuple[LayoutTensor[type, outputs_layout, outputs_origin], num_outputs], ctx: DeviceContext)` --- ## tile ## Functions * [​`tile`](./tile): Implements the `Tile` operator from the ONNX spec. This behaves like NumPy tile, but without broadcast. * [​`tile_shape`](./tile_shape): Compute the output shape of a `tile` operation, and assert the inputs are compatible. --- ## tile `tile[type: DType, type_repeats: DType](input: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], repeats: LayoutTensor[type_repeats, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Implements the `Tile` operator from the ONNX spec. This behaves like NumPy tile, but without broadcast.
**Parameters:** * ​type (`DType`): Type of the input and output tensors. * ​type\_repeats (`DType`): Type of the repeats tensor. **Args:** * ​input (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * ​repeats (`LayoutTensor[type_repeats, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): One-dimensional tensor that specifies the number of repeated copies along each of the input's dimensions. Length equals input tensor rank. * ​output (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor. Has the same dimensions and type as input. --- ## tile_shape `tile_shape[input_type: DType, repeats_type: DType, single_thread_blocking_override: Bool](input_buf: LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], repeats_buf: LayoutTensor[repeats_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> IndexList[layout.rank()]` Compute the output shape of a `tile` operation, and assert the inputs are compatible. **Parameters:** * ​input\_type (`DType`): Type of the input tensor. * ​repeats\_type (`DType`): Type of the repeats tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input\_buf (`LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * ​repeats\_buf (`LayoutTensor[repeats_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The repeats tensor. **Returns:** The output shape.
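The shape rule that `tile_shape` enforces is simply an elementwise product of the input shape and the repeats vector; a small list-based sketch of that rule (`tiled_out_shape` is a hypothetical helper, not part of this module):

```mojo
fn tiled_out_shape(input_shape: List[Int], repeats: List[Int]) -> List[Int]:
    # Output dim i is input dim i multiplied by repeats[i]; the two
    # lists must have the same length (the input rank).
    var out = List[Int]()
    for i in range(len(input_shape)):
        out.append(input_shape[i] * repeats[i])
    return out
```

So a `[2, 3]` input tiled with repeats `[2, 1]` produces a `[4, 3]` output.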
--- ## TopK_2 `@register_passable(trivial)` `struct TopK_2[T: DType, largest: Bool = True]` ## Fields * ​p (`Int`): * ​u (`SIMD[T, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` ### `insert` `insert(mut self, elem: SIMD[T, 1], elem_id: Int)` --- ## bottom_k_shape `bottom_k_shape[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], max_k: Int, axis: Int) -> IndexList[rank]` --- ## fused_token_sampling_cpu `fused_token_sampling_cpu[type: DType, rank: Int, out_idx_type: DType](max_k: Int, input: NDBuffer[type, rank, origin], out_idxs: NDBuffer[out_idx_type, rank, origin], k: OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]]({:i1 0, 1}), temperature: OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]({:i1 0, 1}), top_p: OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]({:i1 0, 1}), seed: OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]]({:i1 0, 1}))` Generalized implementation of the Top K algorithm with sampling. Returns the sampled index from the innermost dimension of the input tensor for each row/subvolume. **Parameters:** * ​type (`DType`): Data type of the input buffer. * ​rank (`Int`): Rank of the input. * ​out\_idx\_type (`DType`): Data type of the output indices. **Args:** * ​max\_k (`Int`): Largest number of top elements. * ​input (`NDBuffer[type, rank, origin]`): NDBuffer\[type, rank] (Any shape) - The input tensor. * ​out\_idxs (`NDBuffer[out_idx_type, rank, origin]`): NDBuffer\[out\_idx\_type, rank] (shape of \[input\_shape\[:-1]] + \[1]) - The output indices. * ​k (`OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]]`): Optional device buffer of top elements to keep for each batch element. * ​temperature (`OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]`): The temperature based scaling. * ​top\_p (`OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]`): Only use the tokens whose cumulative probability exceeds this threshold. * ​seed (`OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]]`): The seed to use for the random number generator. --- ## fused_token_sampling_gpu `fused_token_sampling_gpu[type: DType, rank: Int, out_idx_type: DType, //](ctx: DeviceContext, max_k: Int, input: NDBuffer[type, rank, origin], out_idxs: NDBuffer[out_idx_type, rank, origin], block_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), num_blocks_per_input: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), k: OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]]({:i1 0, 1}), temperature: OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]({:i1 0, 1}), top_p: OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]({:i1 0, 1}), seed: OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]]({:i1 0, 1}))` Top K algorithm with fused sampling. Returns the sampled indices from the Top-K of the innermost dimension of the input tensor for each row/subvolume.
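As a reference point for what "fused" means here, the following is a naive, single-threaded sketch of the same pipeline: select the top-k logits, apply temperature-scaled softmax over just those, then inverse-CDF sample. All names are hypothetical; the real kernels operate on device buffers and batch dimensions.

```mojo
from math import exp
from random import random_float64

fn sample_from_top_k(logits: List[Float64], k: Int, temperature: Float64) -> Int:
    # Pick the indices of the k largest logits (selection scan; fine for a sketch).
    var picked = List[Int]()
    var used = List[Bool]()
    for _ in range(len(logits)):
        used.append(False)
    for _ in range(k):
        var best = -1
        for i in range(len(logits)):
            if not used[i] and (best == -1 or logits[i] > logits[best]):
                best = i
        used[best] = True
        picked.append(best)
    # Temperature-scaled softmax over the survivors, stabilized by the
    # largest logit (picked[0] was selected first, so it holds the max).
    var weights = List[Float64]()
    var total = 0.0
    for j in range(k):
        var w = exp((logits[picked[j]] - logits[picked[0]]) / temperature)
        weights.append(w)
        total += w
    # Inverse-CDF sampling over the renormalized top-k distribution.
    var r = random_float64() * total
    var acc = 0.0
    for j in range(k):
        acc += weights[j]
        if r <= acc:
            return picked[j]
    return picked[k - 1]
```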
--- ## topk ## Structs * [​`TopK_2`](./TopK_2): ## Functions * [​`bottom_k_shape`](./bottom_k_shape): * [​`fused_token_sampling_cpu`](./fused_token_sampling_cpu): Generalized implementation of the Top K algorithm with sampling. Returns the sampled index from the innermost dimension of the input tensor for each row/subvolume. * [​`fused_token_sampling_gpu`](./fused_token_sampling_gpu): Top K algorithm with fused sampling. Returns the sampled indices from the Top-K of the innermost dimension of the input tensor for each row/subvolume. * [​`top_k`](./top_k): Implementation of the Top K algorithm. Returns the top or bottom K elements and their index along a specified axis. * [​`top_k_shape`](./top_k_shape): * [​`top_k_shape_impl`](./top_k_shape_impl): Compute the output shape of a top/bottom k operation. * [​`topk_gpu`](./topk_gpu): Generalized implementation of the Top K algorithm with/without sampling. Returns the sampled index from the innermost dimension of the input tensor for each row/subvolume or the top K values and indices across the tensor. --- ## top_k `top_k[rank: Int, type: DType, out_idx_type: DType, //, largest: Bool = True, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input: NDBuffer[type, rank, origin], max_k: Int, axis: Int, out_vals: NDBuffer[type, rank, origin], out_idxs: NDBuffer[out_idx_type, rank, origin], sorted: Bool, ctx: DeviceContextPtr, k: OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]]({:i1 0, 1}))` Implementation of the Top K algorithm. Returns the top or bottom K elements and their index along a specified axis. **Parameters:** * ​rank (`Int`): Rank of the input. * ​type (`DType`): Data type of the input buffer. * ​out\_idx\_type (`DType`): The data type of the output indices (default is DType.int64). * ​largest (`Bool`): Whether to find the maximum (top k) or minimum value (bottom k). * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * ​input (`NDBuffer[type, rank, origin]`): The input tensor. * ​max\_k (`Int`): The largest number of top elements. * ​axis (`Int`): The axis along which to operate. * ​out\_vals (`NDBuffer[type, rank, origin]`): Output values. * ​out\_idxs (`NDBuffer[out_idx_type, rank, origin]`): Output indices. * ​sorted (`Bool`): Indicates if the top/bottom K elements are in (stable) sorted order. * ​ctx (`DeviceContextPtr`): The device call context. * ​k (`OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]]`): Per batch element k value. --- ## top_k_shape `top_k_shape[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], max_k: Int, axis: Int) -> IndexList[rank]` --- ## top_k_shape_impl `top_k_shape_impl[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], max_k: Int, axis: Int) -> IndexList[rank]` Compute the output shape of a top/bottom k operation. **Parameters:** * ​type (`DType`): Data type of the input buffer. * ​rank (`Int`): Rank of the input. * ​single\_thread\_blocking\_override (`Bool`): If this function can block. **Args:** * ​input (`NDBuffer[type, rank, origin]`): The input tensor. * ​max\_k (`Int`): The maximum K value. * ​axis (`Int`): The axis value in a tensor. **Returns:** The output shape. 
--- ## topk_gpu `topk_gpu[type: DType, rank: Int, out_idx_type: DType, //, sampling: Bool = True, largest: Bool = True](ctx: DeviceContext, max_k: Int, input: NDBuffer[type, rank, origin], out_vals: NDBuffer[type, rank, origin], out_idxs: NDBuffer[out_idx_type, rank, origin], block_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), num_blocks_per_input: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), k: OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]]({:i1 0, 1}), temperature: OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]({:i1 0, 1}), top_p: OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]({:i1 0, 1}), seed: OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]]({:i1 0, 1}))` Generalized implementation of the Top K algorithm with/without sampling. Returns the sampled index from the innermost dimension of the input tensor for each row/subvolume or the top K values and indices across the tensor. **Parameters:** * ​type (`DType`): DType - The data type of the input tensor. * ​rank (`Int`): Int - The rank of the input tensor. * ​out\_idx\_type (`DType`): DType - The data type of the output indices (default is DType.index). * ​sampling (`Bool`): Bool - Whether to return token samples from topK dist (default is True). * ​largest (`Bool`): Bool - Whether to find the maximum or minimum value. **Args:** * ​ctx (`DeviceContext`): DeviceContext The context for GPU execution. * ​max\_k (`Int`): Int Largest number of top elements to keep for each batch element. * ​input (`NDBuffer[type, rank, origin]`): NDBuffer\[type, rank] Input tensor as a device NDBuffer. * ​out\_vals (`NDBuffer[type, rank, origin]`): NDBuffer\[type, rank] Output buffer on device for the K largest values. * ​out\_idxs (`NDBuffer[out_idx_type, rank, origin]`): NDBuffer\[DType.index, rank] Output buffer on device for the indices of the K largest values, or sampled token indices. Last dimension is 1 if sampling is True, otherwise K. * ​block\_size (`OptionalReg[Int]`): Int The number of threads per block (default is 256 from TRT and empirical testing). * ​num\_blocks\_per\_input (`OptionalReg[Int]`): OptionalReg\[Int] Number of blocks per input (default computed from input size and block size). This is the equivalent of "BLOCKS\_PER\_BEAM" in TRT-LLM kernel allowing for much larger batch sizes through packing several elements per thread in the first stage. * ​k (`OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]]`): Optional NDBuffer\[DType.int64, 1, MutableAnyOrigin] Device buffer of top elements to keep for each batch element. * ​temperature (`OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]`): The temperature based scaling. * ​top\_p (`OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]`): Only use the tokens whose cumulative probability exceeds this threshold. * ​seed (`OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]]`): The seed to use for the random number generator. --- ## toppminp ## Functions * [​`merge`](./merge): Merge two sorted subarrays into one sorted array. * [​`merge_sort_recursive`](./merge_sort_recursive): Recursive merge sort implementation. * [​`min_p_sampling`](./min_p_sampling): Naive CPU implementation of Min-P sampling for token selection. This function applies temperature scaling, softmax, a merge sort, and then samples tokens based on the calculated probability threshold (Min-P). 
* [​`sort_buf_descending`](./sort_buf_descending): Sort each batch separately in descending order using parallel merge sort. * [​`top_p_sampling`](./top_p_sampling): Naive CPU implementation of Top-P sampling for token selection. This function applies temperature scaling, softmax, a merge sort, and then samples tokens based on the cumulative probability mass (Top-P). --- ## merge `merge[type: DType, out_idx_type: DType](mut buf_keys: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], mut buf_ids: LayoutTensor[out_idx_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], start: Int, mid: Int, end: Int)` Merge two sorted subarrays into one sorted array. --- ## merge_sort_recursive `merge_sort_recursive[type: DType, out_idx_type: DType](mut buf_keys: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], mut buf_ids: LayoutTensor[out_idx_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], start: Int, end: Int)` Recursive merge sort implementation. --- ## min_p_sampling `min_p_sampling[type: DType, out_idx_type: DType, //, _test_sort: Bool = False](min_ps: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input_logits: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], out_token_ids: LayoutTensor[out_idx_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], temperature: SIMD[type, 1] = __init__[__mlir_type.!pop.int_literal](1))` Naive CPU implementation of Min-P sampling for token selection. This function applies temperature scaling, softmax, a merge sort, and then samples tokens based on the calculated probability threshold (Min-P). --- ## sort_buf_descending `sort_buf_descending[type: DType, out_idx_type: DType](mut buf_keys: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], mut buf_ids: LayoutTensor[out_idx_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], vocab_size: Int)` Sort each batch separately in descending order using parallel merge sort. 
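The Min-P cut that `min_p_sampling` applies after sorting can be stated in a few lines: the threshold is `min_p` times the largest probability, and only tokens at or above it stay in the candidate pool. A sketch under that assumption (`min_p_keep_count` is a hypothetical helper; the real kernel also handles temperature scaling and the final sampling step):

```mojo
fn min_p_keep_count(sorted_probs: List[Float64], min_p: Float64) -> Int:
    # sorted_probs must be in descending order, as produced by
    # sort_buf_descending; everything below min_p * p_max is cut.
    var threshold = min_p * sorted_probs[0]
    var kept = 0
    for i in range(len(sorted_probs)):
        if sorted_probs[i] >= threshold:
            kept += 1
    return kept
```

Because the input is sorted, the scan could stop at the first probability below the threshold; it is written as a full pass for clarity.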
--- ## top_p_sampling `top_p_sampling[type: DType, out_idx_type: DType, //, _test_sort: Bool = False](top_ps: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input_logits: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], out_token_ids: LayoutTensor[out_idx_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], temperature: SIMD[type, 1] = __init__[__mlir_type.!pop.int_literal](1))` Naive CPU implementation of Top-P sampling for token selection. This function applies temperature scaling, softmax, a merge sort, and then samples tokens based on the cumulative probability mass (Top-P). --- ## toppminp_gpu ## Aliases ### `DEBUG_FILE` `alias DEBUG_FILE = False` ### `SEED` `alias SEED = 42` ## Functions * [​`min_p_sampling_gpu`](./min_p_sampling_gpu): GPU implementation of Min-P sampling for token selection. This function applies temperature scaling, softmax, a radix sort, and then samples tokens based on the calculated probability threshold (Min-P). * [​`normalize`](./normalize): * [​`normalize_u32`](./normalize_u32): * [​`radix_sort_pairs_kernel`](./radix_sort_pairs_kernel): Radix pair sort kernel for (default) descending order. * [​`run_radix_sort_pairs_gpu`](./run_radix_sort_pairs_gpu): * [​`top_p_sampling_gpu`](./top_p_sampling_gpu): GPU implementation of Top-P sampling for token selection. This function applies temperature scaling, softmax, a radix sort, and then samples tokens based on the cumulative probability mass (Top-P). * [​`topk_wrapper`](./topk_wrapper): Copy of `Kernels/mojo/nn/topk.mojo:_topk_stage1` with the addition of max\_vals and p\_threshold arguments to determine if sorting is needed for top-p/min-p sampling. * [​`topp_minp_sampling_kernel`](./topp_minp_sampling_kernel): Top P-Min P sampling kernel. --- ## min_p_sampling_gpu `min_p_sampling_gpu[type: DType, rank: Int, out_idx_type: DType, //, _test_sort: Bool = False](ctx: DeviceContext, min_ps: NDBuffer[type, 1, origin], input_logits: NDBuffer[type, rank, origin], out_token_ids: NDBuffer[out_idx_type, rank, origin], temperature: SIMD[type, 1] = __init__[__mlir_type.!pop.int_literal](1))` GPU implementation of Min-P sampling for token selection. This function applies temperature scaling, softmax, a radix sort, and then samples tokens based on the calculated probability threshold (Min-P). --- ## normalize `normalize(value: SIMD[bfloat16, 1]) -> SIMD[uint16, 1]` `normalize(value: SIMD[int32, 1]) -> SIMD[uint32, 1]` `normalize(value: SIMD[uint16, 1]) -> SIMD[uint16, 1]` `normalize(value: SIMD[float32, 1]) -> SIMD[uint32, 1]` `normalize(value: SIMD[dtype, 1]) -> SIMD[_uint_type_of_width[::Int](), 1]` Normalize the value to the appropriate unsigned integer type. This is needed for radix sort to work correctly. 
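The idea behind these `normalize` overloads is the standard radix-sort key trick: reinterpret the float's bits as an unsigned integer and remap them so that unsigned comparison agrees with floating-point order. A sketch of one common mapping, assuming the raw IEEE-754 bits are already in hand as a `UInt32` (the actual overloads may differ in detail):

```mojo
fn orderable_key(bits: UInt32) -> UInt32:
    # Negative floats: flip every bit, so more-negative values sort lower.
    if (bits & 0x80000000) != 0:
        return ~bits
    # Non-negative floats: set the sign bit, placing them above all negatives.
    return bits | 0x80000000
```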
--- ## normalize_u32 `normalize_u32(value: SIMD[uint32, 1]) -> SIMD[uint32, 1]` --- ## radix_sort_pairs_kernel `radix_sort_pairs_kernel[type: DType, out_idx_type: DType, current_bit: Int, ascending: Bool = False, BLOCK_SIZE: Int = 256, NUM_BITS_PER_PASS: Int = 4](input_keys_: UnsafePointer[SIMD[type, 1]], output_keys_: UnsafePointer[SIMD[type, 1]], input_key_ids_: UnsafePointer[SIMD[out_idx_type, 1]], output_key_ids_: UnsafePointer[SIMD[out_idx_type, 1]], num_keys: Int, skip_sort: UnsafePointer[SIMD[bool, 1]])` Radix pair sort kernel for (default) descending order. Implementation based on: AMD. Introduction to GPU Radix Sort. GPUOpen, 2017. **Parameters:** * ​type (`DType`): DType - Data type. * ​out\_idx\_type (`DType`): DType - Output index type. * ​current\_bit (`Int`): Int - Current bit to start sorting NUM\_BITS\_PER\_PASS bits at. * ​ascending (`Bool`): Bool - Whether to sort in ascending order. * ​BLOCK\_SIZE (`Int`): Int - Block size. * ​NUM\_BITS\_PER\_PASS (`Int`): Int - Number of bits per pass. **Args:** * ​input\_keys\_ (`UnsafePointer[SIMD[type, 1]]`): Input tensor values to sort. * ​output\_keys\_ (`UnsafePointer[SIMD[type, 1]]`): Output tensor values sorted in (default) descending order. * ​input\_key\_ids\_ (`UnsafePointer[SIMD[out_idx_type, 1]]`): Input tensor indices. * ​output\_key\_ids\_ (`UnsafePointer[SIMD[out_idx_type, 1]]`): Output tensor indices sorted in (default) descending order. * ​num\_keys (`Int`): Number of keys to sort per batch. * ​skip\_sort (`UnsafePointer[SIMD[bool, 1]]`): Whether sorting is skipped for this batch. --- ## run_radix_sort_pairs_gpu `run_radix_sort_pairs_gpu[type: DType, out_idx_type: DType, rank: Int, ascending: Bool = False, BLOCK_SIZE: Int = 256, NUM_BITS_PER_PASS: Int = 4](ctx: DeviceContext, mut input_keys: NDBuffer[type, rank, MutableAnyOrigin], mut output_keys: NDBuffer[type, rank, MutableAnyOrigin], mut input_key_ids: NDBuffer[out_idx_type, rank, MutableAnyOrigin], mut output_key_ids: NDBuffer[out_idx_type, rank, MutableAnyOrigin], skip_sort: NDBuffer[bool, rank, origin])` --- ## top_p_sampling_gpu `top_p_sampling_gpu[type: DType, rank: Int, out_idx_type: DType, //, _test_sort: Bool = False](ctx: DeviceContext, top_ps: NDBuffer[type, 1, origin], input_logits: NDBuffer[type, rank, origin], out_token_ids: NDBuffer[out_idx_type, rank, origin], temperature: SIMD[type, 1] = __init__[__mlir_type.!pop.int_literal](1))` GPU implementation of Top-P sampling for token selection. This function applies temperature scaling, softmax, a radix sort, and then samples tokens based on the cumulative probability mass (Top-P). --- ## topk_wrapper `topk_wrapper[T: DType, out_idx_type: DType, is_top_p: Bool, largest: Bool = True, _test_sort: Bool = False](K: Int, num_elements: Int, num_blocks_per_input: Int, in_buffer: UnsafePointer[SIMD[T, 1]], local_topk_vals: UnsafePointer[SIMD[T, 1]], local_topk_idxs: UnsafePointer[SIMD[out_idx_type, 1]], p_threshold: UnsafePointer[SIMD[T, 1]], skip_sort: UnsafePointer[SIMD[bool, 1]])` Copy of `Kernels/mojo/nn/topk.mojo:_topk_stage1` with the addition of max\_vals and p\_threshold arguments to determine if sorting is needed for top-p/min-p sampling.
Arguments: * ​K (`Int`): Number of top elements to select per block. * ​num\_elements (`Int`): Size of last dimension of input buffer (vocab size). * ​num\_blocks\_per\_input (`Int`): Number of blocks used to process the input data. * ​in\_buffer (`UnsafePointer[Scalar[T]]`): Input buffer containing the elements to process. * ​local\_topk\_vals (`UnsafePointer[Scalar[T]]`): Output buffer to store the local top-K values. * ​local\_topk\_idxs (`UnsafePointer[Scalar[out_idx_type]]`): Output buffer to store the indices of local top-K elements. * ​p\_threshold (`UnsafePointer[Scalar[T]]`): Threshold for top-p sampling if is\_top\_p is True, else the min-p coefficient. * ​skip\_sort (`UnsafePointer[Scalar[DType.bool]]`): Output buffer to store whether sorting is needed. **Parameters:** * ​T (`DType`): DType - The data type of the elements. * ​out\_idx\_type (`DType`): DType - The data type of the output indices. * ​is\_top\_p (`Bool`): Bool - Whether this is for top-p sampling or min-p sampling. * ​largest (`Bool`): Bool - Whether to find the maximum or minimum value. * ​\_test\_sort (`Bool`): Bool - An internal test flag to not skip sort if testing. --- ## topp_minp_sampling_kernel `topp_minp_sampling_kernel[type: DType, out_idx_type: DType, is_top_p: Bool](p_thresholds_: UnsafePointer[SIMD[type, 1]], sorted_probs_: UnsafePointer[SIMD[type, 1]], sorted_ids_: UnsafePointer[SIMD[out_idx_type, 1]], out_token_ids: UnsafePointer[SIMD[out_idx_type, 1]], skip_sort: UnsafePointer[SIMD[bool, 1]], vocab_size: Int)` Top P-Min P sampling kernel. **Parameters:** * ​type (`DType`): DType - scalar values dtype. * ​out\_idx\_type (`DType`): DType - output index type. * ​is\_top\_p (`Bool`): Bool - Whether to use Top-P (True) or Min-P (False) sampling. --- ## nvml Implements wrappers around the NVIDIA Management Library (nvml). ## Modules * [​`nvml`](./nvml/): Implements wrappers around the NVIDIA Management Library (nvml). --- ## ClockType `@register_passable(trivial)` `struct ClockType` ## Fields * ​code (`SIMD[int32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility` ## Aliases ### `GRAPHICS` `alias GRAPHICS = ClockType(__init__[__mlir_type.!pop.int_literal](0))` Graphics clock domain ### `MEM` `alias MEM = ClockType(__init__[__mlir_type.!pop.int_literal](2))` Memory clock domain ### `SM` `alias SM = ClockType(__init__[__mlir_type.!pop.int_literal](1))` SM clock domain ### `VIDEO` `alias VIDEO = ClockType(__init__[__mlir_type.!pop.int_literal](2))` Video clock domain ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` --- ## Device `struct Device` ## Fields * ​idx (`Int`): * ​device (`_DeviceImpl`): ## Implemented traits `AnyType`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self, idx: Int = 0)` ### `__copyinit__` `__copyinit__(out self, existing: Self)` ### `get_driver_version` `get_driver_version(self) -> DriverVersion` Returns NVIDIA driver version. ### `max_mem_clock` `max_mem_clock(self) -> Int` ### `max_graphics_clock` `max_graphics_clock(self) -> Int` ### `mem_clocks` `mem_clocks(self) -> List[Int, True]` ### `graphics_clocks` `graphics_clocks(self, memory_clock_mhz: Int) -> List[Int, True]` ### `set_clock` `set_clock(self, mem_clock: Int, graphics_clock: Int)` ### `gpu_turbo_enabled` `gpu_turbo_enabled(self) -> Bool` Returns True if the gpu turbo is enabled. ### `set_gpu_turbo` `set_gpu_turbo(self, enabled: Bool = True)` Sets the GPU turbo state.
### `get_persistence_mode` `get_persistence_mode(self) -> Bool` Returns True if the gpu persistence mode is enabled. ### `set_persistence_mode` `set_persistence_mode(self, enabled: Bool = True)` Sets the persistence mode. ### `set_max_gpu_clocks` `set_max_gpu_clocks(device)` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` ### `__repr__` `__repr__(self) -> String` --- ## DriverVersion `struct DriverVersion` ## Implemented traits `AnyType`, `Copyable`, `Movable`, `StringableRaising`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, value: List[String])` ### `major` `major(self) -> Int` ### `minor` `minor(self) -> Int` ### `patch` `patch(self) -> Int` ### `__str__` `__str__(self) -> String` --- ## EnableState `@register_passable(trivial)` `struct EnableState` ## Fields * ​code (`SIMD[int32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility` ## Aliases ### `DISABLED` `alias DISABLED = EnableState(__init__[__mlir_type.!pop.int_literal](0))` Feature disabled ### `ENABLED` `alias ENABLED = EnableState(__init__[__mlir_type.!pop.int_literal](1))` Feature enabled ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` --- ## Result `@register_passable(trivial)` `struct Result` ## Fields * ​code (`SIMD[int32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `Stringable`, `UnknownDestructibility` ## Aliases ### `ALREADY_INITIALIZED` `alias ALREADY_INITIALIZED = Result(__init__[__mlir_type.!pop.int_literal](5))` Deprecated: Multiple initializations are now allowed through ref counting ### `ARGUMENT_VERSION_MISMATCH` `alias ARGUMENT_VERSION_MISMATCH = Result(__init__[__mlir_type.!pop.int_literal](25))` The provided version is invalid/unsupported ### `CORRUPTED_INFOROM` `alias CORRUPTED_INFOROM = Result(__init__[__mlir_type.!pop.int_literal](14))` infoROM is corrupted ### `DEPRECATED` `alias DEPRECATED = Result(__init__[__mlir_type.!pop.int_literal](26))` The requested functionality has been deprecated ### `DRIVER_NOT_LOADED` `alias DRIVER_NOT_LOADED = Result(__init__[__mlir_type.!pop.int_literal](9))` NVIDIA driver is not loaded ### `FREQ_NOT_SUPPORTED` `alias FREQ_NOT_SUPPORTED = Result(__init__[__mlir_type.!pop.int_literal](24))` Ran out of critical resources, other than memory ### `FUNCTION_NOT_FOUND` `alias FUNCTION_NOT_FOUND = Result(__init__[__mlir_type.!pop.int_literal](13))` Local version of NVML doesn't implement this function ### `GPU_IS_LOST` `alias GPU_IS_LOST = Result(__init__[__mlir_type.!pop.int_literal](15))` The GPU has fallen off the bus or has otherwise become inaccessible ### `GPU_NOT_FOUND` `alias GPU_NOT_FOUND = Result(__init__[__mlir_type.!pop.int_literal](28))` No GPUs were found ### `IN_USE` `alias IN_USE = Result(__init__[__mlir_type.!pop.int_literal](19))` An operation cannot be performed because the GPU is currently in use ### `INSUFFICIENT_POWER` `alias INSUFFICIENT_POWER = Result(__init__[__mlir_type.!pop.int_literal](8))` A device's external power cables are not properly attached ### `INSUFFICIENT_RESOURCES` `alias INSUFFICIENT_RESOURCES = Result(__init__[__mlir_type.!pop.int_literal](23))` Ran out of critical resources, other than memory ### `INSUFFICIENT_SIZE` `alias INSUFFICIENT_SIZE = Result(__init__[__mlir_type.!pop.int_literal](7))` An input argument is not large enough ### `INVALID_ARGUMENT` `alias INVALID_ARGUMENT = 
Result(__init__[__mlir_type.!pop.int_literal](2))` A supplied argument is invalid ### `IRQ_ISSUE` `alias IRQ_ISSUE = Result(__init__[__mlir_type.!pop.int_literal](11))` NVIDIA Kernel detected an interrupt issue with a GPU ### `LIB_RM_VERSION_MISMATCH` `alias LIB_RM_VERSION_MISMATCH = Result(__init__[__mlir_type.!pop.int_literal](18))` RM detects a driver/library version mismatch ### `LIBRARY_NOT_FOUND` `alias LIBRARY_NOT_FOUND = Result(__init__[__mlir_type.!pop.int_literal](12))` NVML Shared Library couldn't be found or loaded ### `MEMORY` `alias MEMORY = Result(__init__[__mlir_type.!pop.int_literal](20))` Insufficient memory ### `NO_DATA` `alias NO_DATA = Result(__init__[__mlir_type.!pop.int_literal](21))` No data ### `NO_PERMISSION` `alias NO_PERMISSION = Result(__init__[__mlir_type.!pop.int_literal](4))` The current user does not have permission for operation ### `NOT_FOUND` `alias NOT_FOUND = Result(__init__[__mlir_type.!pop.int_literal](6))` A query to find an object was unsuccessful ### `NOT_READY` `alias NOT_READY = Result(__init__[__mlir_type.!pop.int_literal](27))` The system is not ready for the request ### `NOT_SUPPORTED` `alias NOT_SUPPORTED = Result(__init__[__mlir_type.!pop.int_literal](3))` The requested operation is not available on target device ### `OPERATING_SYSTEM` `alias OPERATING_SYSTEM = Result(__init__[__mlir_type.!pop.int_literal](17))` The GPU control device has been blocked by the operating system/cgroups ### `RESET_REQUIRED` `alias RESET_REQUIRED = Result(__init__[__mlir_type.!pop.int_literal](16))` The GPU requires a reset before it can be used again ### `SUCCESS` `alias SUCCESS = Result(__init__[__mlir_type.!pop.int_literal](0))` The operation was successful ### `TIMEOUT` `alias TIMEOUT = Result(__init__[__mlir_type.!pop.int_literal](10))` User provided timeout passed ### `UNINITIALIZED` `alias UNINITIALIZED = Result(__init__[__mlir_type.!pop.int_literal](1))` NVML was not first initialized with nvmlInit() ### `UNKNOWN` `alias UNKNOWN = Result(__init__[__mlir_type.!pop.int_literal](999))` An internal driver error occurred ### `VGPU_ECC_NOT_SUPPORTED` `alias VGPU_ECC_NOT_SUPPORTED = Result(__init__[__mlir_type.!pop.int_literal](22))` The requested vgpu operation is not available on target device, because ECC is enabled ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` ### `__str__` `__str__(self) -> String` --- ## nvml Implements wrappers around the NVIDIA Management Library (nvml). ## Aliases ### `CUDA_NVML_LIBRARY` `alias CUDA_NVML_LIBRARY = _Global[__init__[__mlir_type.!kgen.string]("CUDA_NVML_LIBRARY"), _OwnedDLHandle, _init_dylib]` ### `CUDA_NVML_LIBRARY_BASE_NAME` `alias CUDA_NVML_LIBRARY_BASE_NAME = "libnvidia-ml"` ### `CUDA_NVML_LIBRARY_DIR` `alias CUDA_NVML_LIBRARY_DIR = __init__[__mlir_type.!kgen.string]("/usr/lib/x86_64-linux-gnu")` ### `CUDA_NVML_LIBRARY_EXT` `alias CUDA_NVML_LIBRARY_EXT = ".so"` ## Structs * [​`ClockType`](./ClockType): * [​`Device`](./Device): * [​`DriverVersion`](./DriverVersion): * [​`EnableState`](./EnableState): * [​`Result`](./Result): --- ## quantization This package contains a set of APIs for quantizing tensor data. Quantization is a technique used to reduce the precision of floating-point numbers, which are used in most neural networks. Quantization is a type of lossy compression, which means that some precision is lost, but the resulting tensors take less memory and computations are faster. 
## Modules * [​`per_channel_grouped_4bit`](./per_channel_grouped_4bit/): * [​`qmatmul`](./qmatmul/): * [​`qmatmul_gpu`](./qmatmul_gpu/): * [​`qmatmul_k`](./qmatmul_k/): --- ## Q4sym `struct Q4sym[group_size: Int, float_dtype: DType = float32]` Q4sym: compresses values of type `float_dtype` to 4-bit unsigned integers which have been dynamically and symmetrically quantized with the given scale factor. `group_size` determines the number of elements which share quantization parameters. Values are stored in a strided fashion. For example, assume `group_size = 8` and we want to pack the uint4 numbers A, B, C, D, E, F, G, H, whose bits are aaaa, bbbb, cccc, and so on. The four storage bytes are laid out as:

```plaintext
eeeeaaaa|ffffbbbb|ggggcccc|hhhhdddd
```

To uncompress to floating point, take the decoded uint4 value, subtract the implicit zero-point of 8 (half the uint4 range of 2^4 = 16), and multiply by the scale factor (see the decoding sketch below). ## Parameters * ​group\_size (`Int`): The number of encoded numbers stored in this struct. * ​float\_dtype (`DType`): The floating point dtype this struct works with. ## Fields * ​scale (`StaticTuple[SIMD[uint8, 1], 2]`): The FP16 scale of the group, stored as individual bytes. * ​bits (`StaticTuple[SIMD[uint8, 1], (div_s(#lit.struct.extract, 2) + -1) if ((group_size , 2) == 0) ^ True)) else div_s(#lit.struct.extract, 2)]`): The bits of the encoded uint4 numbers. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Construct a default initialized Q4sym. `@implicit` `__init__(out self, data: SIMD[float_dtype, group_size])` Construct an encoded Q4sym from data. **Args:** * ​data (`SIMD[float_dtype, group_size]`): The floating point data to encode and store. ### `decode_scale` `decode_scale(mut self) -> SIMD[float16, 1]` Obtain the scale factor. **Returns:** The decoded scale factor. ### `decode_unsigned` `decode_unsigned(mut self) -> SIMD[uint8, group_size]` Decode the stored uint4 numbers to uint8. **Returns:** The decoded stored numbers as uint8 numbers. These have an implicit zero-point of 8. ### `decode_signed` `decode_signed(mut self) -> SIMD[int8, group_size]` Decode the stored uint4 numbers to requantized int4 numbers. This is done by simply subtracting an implicit zero-point of 8 from the unsigned decoding. **Returns:** The decoded stored numbers as int8 numbers. These have a zero-point of 0. ### `decode_fully` `decode_fully(mut self) -> SIMD[float_dtype, group_size]` Decode the stored numbers into floating point representation. **Returns:** The decoded numbers. ### `quantize_and_write_to_tensor` `static quantize_and_write_to_tensor[rank: Int](input_tensor: NDBuffer[float_dtype, rank, origin], output_tensor: NDBuffer[uint8, rank, origin], input_shape: IndexList[rank])` Encodes the floating point numbers in `input_tensor` along the inner-most dimension and writes the result to output\_tensor. **Parameters:** * ​rank (`Int`): The rank of the input and output tensors. **Args:** * ​input\_tensor (`NDBuffer[float_dtype, rank, origin]`): The input tensor we are encoding. * ​output\_tensor (`NDBuffer[uint8, rank, origin]`): The output tensor containing the encoded input. The shape of the output should be the same as the input except along the inner dimension where if the original inner dimension was `d`, the corresponding output dimension should be: ceil(`d` / group\_size) \* sizeof(self). * ​input\_shape (`IndexList[rank]`): The shape of the input tensor. 
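To make the byte layout above concrete, here is a minimal, hypothetical decoding sketch. It is illustrative only (the struct's `decode_*` methods are the real API): it pulls the low and high nibbles out of one stored byte, subtracts the implicit zero-point of 8, and applies the scale.

```mojo
fn decode_nibbles(byte: UInt8, scale: Float32) -> Tuple[Float32, Float32]:
    # In the strided layout, the low nibble holds element i and the high
    # nibble holds element i + group_size/2 (e.g. aaaa and eeee in eeeeaaaa).
    var lo = Int(byte & 0x0F)
    var hi = Int((byte >> 4) & 0x0F)
    # Subtract the implicit zero-point (8), then rescale to floating point.
    return (Float32(lo - 8) * scale, Float32(hi - 8) * scale)
```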
### `dequantize_and_write_to_tensor` `static dequantize_and_write_to_tensor[rank: Int, //](input_tensor: NDBuffer[uint8, rank, origin], output_tensor: NDBuffer[float_dtype, rank, origin], output_shape: IndexList[rank])` Decodes the quantized numbers in `input_tensor` along the inner-most dimension and writes the dequantized floating point result to output\_tensor. **Parameters:** * ​rank (`Int`): The rank of the input and output tensors. **Args:** * ​input\_tensor (`NDBuffer[uint8, rank, origin]`): The input tensor we are decoding. * ​output\_tensor (`NDBuffer[float_dtype, rank, origin]`): The output tensor containing the decoded input. * ​output\_shape (`IndexList[rank]`): The shape of the output tensor. --- ## block_Q4_K `struct block_Q4_K` ## Fields * ​base\_scale (`SIMD[float16, 1]`): * ​base\_min (`SIMD[float16, 1]`): * ​q\_scales\_and\_mins (`InlineArray[SIMD[uint8, 1], 12]`): * ​q\_bits (`InlineArray[SIMD[uint8, 1], 128]`): ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `group_count` `alias group_count = 8` ### `group_size` `alias group_size = 32` --- ## block_Q6_K `struct block_Q6_K` ## Fields * ​q\_bits\_lo (`InlineArray[SIMD[uint8, 1], 128]`): * ​q\_bits\_hi (`InlineArray[SIMD[uint8, 1], 64]`): * ​q\_scales (`InlineArray[SIMD[int8, 1], 16]`): * ​base\_scale (`SIMD[float16, 1]`): ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `group_count` `alias group_count = 16` ### `group_size` `alias group_size = 16` --- ## block_QK_K `struct block_QK_K` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `quantized_k` `alias quantized_k = 256` --- ## calculate_symmetric_vector `calculate_symmetric_vector[input_dtype: DType, simd_width: Int, output_bits: Int](data: SIMD[input_dtype, simd_width]) -> Tuple[SIMD[uint8, simd_width], SIMD[input_dtype, 1]]` Symmetrically quantizes the given SIMD vector `data` with input type `input_dtype` and `simd_width` elements, assuming we want the results to fit in an unsigned integer of size `output_bits`. **Parameters:** * ​input\_dtype (`DType`): The dtype of the input tensor. * ​simd\_width (`Int`): The width of the SIMD input. * ​output\_bits (`Int`): The bits we want to fit the unsigned integral result in. **Args:** * ​data (`SIMD[input_dtype, simd_width]`): The input SIMD we want to quantize. **Returns:** A tuple of the vector of quantized values and the associated scale factor. --- ## per_channel_grouped_4bit ## Structs * [​`block_Q4_K`](./block_Q4_K): * [​`block_Q6_K`](./block_Q6_K): * [​`block_QK_K`](./block_QK_K): * [​`Q4sym`](./Q4sym): Q4sym: compresses values of type `float_dtype` to 4-bit unsigned integers which have been dynamically and symmetrically quantized with the given scale factor. ## Functions * [​`calculate_symmetric_vector`](./calculate_symmetric_vector): Symmetrically quantizes the given SIMD vector `data` with input type `input_dtype` and `simd_width` elements, assuming we want the results to fit in an unsigned integer of size `output_bits`. 
* [​`q4_k_dequantize_impl`](./q4_k_dequantize_impl): * [​`q6_k_dequantize_impl`](./q6_k_dequantize_impl): * [​`scale_min_k4`](./scale_min_k4): --- ## q4_k_dequantize_impl `q4_k_dequantize_impl(input_tensor: NDBuffer[uint8, 2, origin], output_tensor: NDBuffer[float32, 2, origin])` --- ## q6_k_dequantize_impl `q6_k_dequantize_impl(input_tensor: NDBuffer[uint8, 2, origin], output_tensor: NDBuffer[float32, 2, origin], output_shape: IndexList[2])` --- ## scale_min_k4 `scale_min_k4(src_ptr: UnsafePointer[block_Q4_K], g: Int) -> Tuple[SIMD[float32, 1], SIMD[float32, 1]]` --- ## qmatmul ## Aliases ### `K_BATCH_SIZE` `alias K_BATCH_SIZE = 512` Defines the batch size of K used to pack A and unpack B weights. ## Functions * [​`matmul_qint4`](./matmul_qint4): * [​`matmul_qint4_pack_b`](./matmul_qint4_pack_b): --- ## matmul_qint4 `matmul_qint4[group_size: Int, b_static_shape: DimList = create_unknown[::Int](), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](a: NDBuffer[float32, 2, origin], b: NDBuffer[uint8, 2, origin, b_static_shape], c: NDBuffer[float32, 2, origin])` --- ## matmul_qint4_pack_b `matmul_qint4_pack_b[group_size: Int](b: NDBuffer[uint8, 2, origin], b_rot: NDBuffer[uint8, 2, origin])` --- ## args_to_tuple `args_to_tuple[swap: Bool](arg_0: Int, arg_1: Int) -> Tuple[Int, Int]` --- ## gpu_qint4_repack_GPTQ `gpu_qint4_repack_GPTQ[b_shape: DimList, b_packed_shape: DimList, //, group_size: Int, target: StringSlice[StaticConstantOrigin]](b: NDBuffer[uint8, 2, origin, b_shape], b_packed: NDBuffer[uint8, 2, origin, b_packed_shape], perm_idx: OptionalReg[NDBuffer[int32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[int32, 1, MutableAnyOrigin]]({:i1 0, 1}), ctx: DeviceContextPtr = DeviceContextPtr())` --- ## gpu_qint4_repack_Q4_0 `gpu_qint4_repack_Q4_0[b_shape: DimList, //, target: StringSlice[StaticConstantOrigin]](b: NDBuffer[uint8, 2, origin, b_shape], b_packed: NDBuffer[uint8, 2, origin, b_shape], ctx: DeviceContextPtr = DeviceContextPtr())` --- ## qmatmul_gpu ## Functions * [​`args_to_tuple`](./args_to_tuple): * [​`gpu_qint4_repack_GPTQ`](./gpu_qint4_repack_GPTQ): * [​`gpu_qint4_repack_Q4_0`](./gpu_qint4_repack_Q4_0): * [​`matmul_gpu_qint4`](./matmul_gpu_qint4): * [​`matmul_gpu_qint4_impl`](./matmul_gpu_qint4_impl): * [​`multistage_gemm_q`](./multistage_gemm_q): * [​`multistage_mma_q`](./multistage_mma_q): * [​`multistage_qgemm_kernel`](./multistage_qgemm_kernel): * [​`pack_Q_tile`](./pack_Q_tile): * [​`q_smem_usage`](./q_smem_usage): * [​`repack_GPTQ_for_sm8x`](./repack_GPTQ_for_sm8x): * [​`repack_Q4_0_for_sm8x`](./repack_Q4_0_for_sm8x): * [​`unpack_4bit_int`](./unpack_4bit_int): --- ## matmul_gpu_qint4 `matmul_gpu_qint4[c_type: DType, a_type: DType, //, group_size: Int, target: StringSlice[StaticConstantOrigin], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, origin, shape], a: NDBuffer[a_type, 2, origin, shape], b: NDBuffer[uint8, 2, origin, shape], ctx: DeviceContextPtr = DeviceContextPtr())` --- ## matmul_gpu_qint4_impl `matmul_gpu_qint4_impl[c_type: DType, a_type: DType, //, group_size: Int, target: StringSlice[StaticConstantOrigin], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, 
Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, origin, shape], a: NDBuffer[a_type, 2, origin, shape], b: NDBuffer[uint8, 2, origin, shape], ctx: Optional[DeviceContext])` --- ## multistage_gemm_q `multistage_gemm_q[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, group_size: Int, pack_factor: Int, config: MatmulConfig[a_type, b_type, c_type, True], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, origin, c_shape], a: NDBuffer[a_type, 2, origin, a_shape], b: NDBuffer[b_type, 2, origin, b_shape], runtime_config: MatmulConfig[a_type, b_type, c_type, True], ctx: DeviceContext)` --- ## multistage_mma_q `multistage_mma_q[BM: Int, BN: Int, BK: Int, WM: Int, WN: Int, num_threads: Int, num_pipeline_stages: Int, transpose_b: Bool, group_size: Int, pack_factor: Int, c_type: DType, c_layout: Layout, a_type: DType, a_layout: Layout, a_smem_layout: Layout, b_type: DType, b_layout: Layout, b_smem_layout: Layout, scales_type: DType, scales_layout: Layout, scales_smem_layout: Layout, /, *, swizzle_a: Bool = True, static_num_iters: Dim = Dim(-31337), prefetch_init: Bool = True, continue_prefetch_b: Bool = False, transpose_b_next: Bool = False, b_next_gmem_layout: Layout = Layout(), b_next_smem_layout: Layout = Layout(), next_op_b_iter_alignment: Int = alignof[::DType,__mlir_type.!kgen.target]()](c: LayoutTensor[c_type, c_layout, origin, address_space=AddressSpace(5)], a_iter_arg: LayoutTensorIter[type, a_layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], b_iter_arg: LayoutTensorIter[b_type, b_layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], a_smem_iter_arg: LayoutTensorIter[a_type, a_smem_layout, origin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], mut b_smem_iter: LayoutTensorIter[b_type, b_smem_layout, origin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], scales_smem_iter_arg: LayoutTensorIter[scales_type, scales_smem_layout, origin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], scales_iter_arg: LayoutTensorIter[scales_type, scales_layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], num_iters: Int, /, *, num_b_rows: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` --- ## multistage_qgemm_kernel `multistage_qgemm_kernel[c_type: DType, c_layout: Layout, a_type: DType, a_layout: Layout, b_packed_type: DType, b_layout: Layout, group_size: Int, pack_factor: Int, transpose_b: Bool, config: MatmulConfig[a_type, b_packed_type, c_type, transpose_b], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> 
None]({:i1 0, 1})](c: LayoutTensor[c_type, c_layout, MutableAnyOrigin], a: LayoutTensor[a_type, a_layout, MutableAnyOrigin], b_packed: LayoutTensor[b_packed_type, b_layout, MutableAnyOrigin])` --- ## pack_Q_tile `pack_Q_tile(input: SIMD[uint8, 16]) -> SIMD[uint32, 4]` --- ## q_smem_usage `q_smem_usage[: DType, : DType, : DType, : Bool, : IndexList[3], //, config: MatmulConfig[$0, $1, $2, $3, $4], group_size: Int]() -> Int` --- ## repack_GPTQ_for_sm8x `repack_GPTQ_for_sm8x[in_layout: Layout, out_layout: Layout, scales_type: DType, group_size: Int, has_perm: Bool, *, perm_layout: Layout = Layout()](in_tensor: LayoutTensor[uint8, in_layout, MutableAnyOrigin], out_tensor: LayoutTensor[uint8, out_layout, MutableAnyOrigin], perm_idx: LayoutTensor[int32, perm_layout, MutableAnyOrigin])` --- ## repack_Q4_0_for_sm8x `repack_Q4_0_for_sm8x[q_layout: Layout, repack_layout: Layout, scales_type: DType](q_weight: LayoutTensor[uint8, q_layout, MutableAnyOrigin], q_packed_weight: LayoutTensor[uint8, repack_layout, MutableAnyOrigin])` --- ## unpack_4bit_int `unpack_4bit_int(val: SIMD[uint32, size], idx: Int) -> SIMD[uint8, 1]` --- ## qmatmul_k ## Functions * [​`matmul_Q4_K`](./matmul_Q4_K): * [​`matmul_Q4_K_pack_b`](./matmul_Q4_K_pack_b): * [​`matmul_Q6_K`](./matmul_Q6_K): * [​`matmul_Q6_K_pack_b`](./matmul_Q6_K_pack_b): --- ## matmul_Q4_K `matmul_Q4_K[elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](a: NDBuffer[float32, 2, origin], b: NDBuffer[uint8, 2, origin], c: NDBuffer[float32, 2, origin])` --- ## matmul_Q4_K_pack_b `matmul_Q4_K_pack_b[b_origin: MutableOrigin, b_packed_origin: MutableOrigin](b: NDBuffer[uint8, 2, b_origin], b_packed: NDBuffer[uint8, 2, b_packed_origin])` --- ## matmul_Q6_K `matmul_Q6_K[elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](a: NDBuffer[float32, 2, origin], b: NDBuffer[uint8, 2, origin], c: NDBuffer[float32, 2, origin])` --- ## matmul_Q6_K_pack_b `matmul_Q6_K_pack_b[b_origin: MutableOrigin, b_packed_origin: MutableOrigin](b: NDBuffer[uint8, 2, b_origin], b_packed: NDBuffer[uint8, 2, b_packed_origin])` --- ## elementwise `elementwise[: origin.set, //, func: fn[Int, Int](IndexList[$1]) capturing -> None, simd_width: Int, *, use_blocking_impl: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](shape: Int)` Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed. **Parameters:** * ​func (`fn[Int, Int](IndexList[$1]) capturing -> None`): The body function. * ​simd\_width (`Int`): The SIMD vector width to use. * ​use\_blocking\_impl (`Bool`): Do not invoke the function using asynchronous calls. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. * ​\_trace\_description (`StringSlice[StaticConstantOrigin]`): Description of the trace. **Args:** * ​shape (`Int`): The shape of the buffer. 
`elementwise[: origin.set, rank: Int, //, func: fn[Int, Int](IndexList[$1]) capturing -> None, simd_width: Int, *, use_blocking_impl: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](shape: IndexList[rank, element_type=element_type])` Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed. **Parameters:** * ​rank (`Int`): The rank of the buffer. * ​func (`fn[Int, Int](IndexList[$1]) capturing -> None`): The body function. * ​simd\_width (`Int`): The SIMD vector width to use. * ​use\_blocking\_impl (`Bool`): Do not invoke the function using asynchronous calls. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. * ​\_trace\_description (`StringSlice[StaticConstantOrigin]`): Description of the trace. **Args:** * ​shape (`IndexList[rank, element_type=element_type]`): The shape of the buffer. `elementwise[: origin.set, //, func: fn[Int, Int](IndexList[$1]) capturing -> None, simd_width: Int, *, use_blocking_impl: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](shape: Int, context: DeviceContext)` Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed. **Parameters:** * ​func (`fn[Int, Int](IndexList[$1]) capturing -> None`): The body function. * ​simd\_width (`Int`): The SIMD vector width to use. * ​use\_blocking\_impl (`Bool`): Do not invoke the function using asynchronous calls. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. * ​\_trace\_description (`StringSlice[StaticConstantOrigin]`): Description of the trace. **Args:** * ​shape (`Int`): The shape of the buffer. * ​context (`DeviceContext`): The device context to use. `elementwise[: origin.set, rank: Int, //, func: fn[Int, Int](IndexList[$1]) capturing -> None, simd_width: Int, *, use_blocking_impl: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](shape: IndexList[rank, element_type=element_type], context: DeviceContext)` Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed. **Parameters:** * ​rank (`Int`): The rank of the buffer. * ​func (`fn[Int, Int](IndexList[$1]) capturing -> None`): The body function. * ​simd\_width (`Int`): The SIMD vector width to use. * ​use\_blocking\_impl (`Bool`): Do not invoke the function using asynchronous calls. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. * ​\_trace\_description (`StringSlice[StaticConstantOrigin]`): Description of the trace. **Args:** * ​shape (`IndexList[rank, element_type=element_type]`): The shape of the buffer. * ​context (`DeviceContext`): The device context to use. 
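To illustrate the call shape shared by these overloads, here is a minimal sketch. It is a hedged example, not an official snippet: `printer` is a hypothetical body function, and the `from utils import IndexList` import path is an assumption about your toolchain.

```mojo
from algorithm.functional import elementwise
from utils import IndexList

fn main():
    # The body receives the SIMD width chosen for the call and the rank of
    # the domain; here we just print each visited index.
    @parameter
    fn printer[width: Int, rank: Int](idx: IndexList[rank]):
        print(idx)

    # Cover a 2x3 domain with scalar (width 1) invocations.
    elementwise[printer, 1](IndexList[2](2, 3))
```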
`elementwise[: origin.set, rank: Int, //, func: fn[Int, Int](IndexList[$1]) capturing -> None, simd_width: Int, *, use_blocking_impl: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](shape: IndexList[rank, element_type=element_type], context: DeviceContextPtr)` Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed. **Parameters:** * ​rank (`Int`): The rank of the buffer. * ​func (`fn[Int, Int](IndexList[$1]) capturing -> None`): The body function. * ​simd\_width (`Int`): The SIMD vector width to use. * ​use\_blocking\_impl (`Bool`): Do not invoke the function using asynchronous calls. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. * ​\_trace\_description (`StringSlice[StaticConstantOrigin]`): Description of the trace. **Args:** * ​shape (`IndexList[rank, element_type=element_type]`): The shape of the buffer. * ​context (`DeviceContextPtr`): The device context to use. --- ## functional Implements higher-order functions. You can import these APIs from the `algorithm` package. For example:

```mojo
from algorithm import map
```

## Aliases ### `BinaryTile1DTileUnitFunc` `alias BinaryTile1DTileUnitFunc = fn[Int](Int, Int) capturing -> None` Signature of a tiled function that performs some work with a dynamic tile size and a secondary static tile size. ### `Dynamic1DTileUnitFunc` `alias Dynamic1DTileUnitFunc = fn(Int, Int) capturing -> None` Signature of a 1d tiled function that performs some work with a dynamic tile size and an offset. i.e. func(offset: Int, tile\_size: Int) ### `Dynamic1DTileUnswitchUnitFunc` `alias Dynamic1DTileUnswitchUnitFunc = fn[Bool](Int, Int, Int) capturing -> None` ### `Static1DTileUnitFunc` `alias Static1DTileUnitFunc = fn[Int](Int) capturing -> None` Signature of a 1d tiled function that performs some work with a static tile size and an offset. i.e. `func<tile_size: Int>(offset: Int)` ### `Static1DTileUnitFuncWithFlag` `alias Static1DTileUnitFuncWithFlag = fn[Int, Bool](Int) capturing -> None` ### `Static1DTileUnitFuncWithFlags` `alias Static1DTileUnitFuncWithFlags = fn[Int, Bool, Bool](Int) capturing -> None` ### `Static1DTileUnswitchUnitFunc` `alias Static1DTileUnswitchUnitFunc = fn[Int, Bool](Int, Int) capturing -> None` Signature of a tiled function that performs some work with a static tile size and an offset. i.e. `func<tile_size: Int>(offset: Int)` ### `Static2DTileUnitFunc` `alias Static2DTileUnitFunc = fn[Int, Int](Int, Int) capturing -> None` Signature of a 2d tiled function that performs some work with a static tile size and an offset. i.e. 
`func<tile_size_x: Int, tile_size_y: Int>(offset_x: Int, offset_y: Int)` ### `stencil` `alias stencil = _stencil_impl_cpu[__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,::DType,::DType,::DType,::Int,::Int,::IndexList[$10, $6],::Int,::DType,fn[::DType]` ### `stencil_gpu` `alias stencil_gpu = _stencil_impl_gpu[__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,::DType,::DType,::DType,::Int,::Int,::IndexList[$10, $6],::Int,::DType,fn[::DType]` ### `SwitchedFunction` `alias SwitchedFunction = fn[Bool]() raises capturing -> None` ### `SwitchedFunction2` `alias SwitchedFunction2 = fn[Bool, Bool]() capturing -> None` ## Functions * [​`elementwise`](/mojo/stdlib/algorithm/functional/elementwise): Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed. * [​`map`](/mojo/stdlib/algorithm/functional/map): Maps a function over a range from 0 to size. * [​`parallelize`](/mojo/stdlib/algorithm/functional/parallelize): Executes func(0) ... func(num\_work\_items-1) as sub-tasks in parallel, and returns when all are complete. * [​`parallelize_over_rows`](/mojo/stdlib/algorithm/functional/parallelize_over_rows): Parallelize func over non-axis dims of shape. * [​`sync_parallelize`](/mojo/stdlib/algorithm/functional/sync_parallelize): Executes func(0) ... func(num\_work\_items-1) as parallel sub-tasks, and returns when all are complete. * [​`tile`](/mojo/stdlib/algorithm/functional/tile): A generator that launches work groups in the specified list of tile sizes. * [​`tile_and_unswitch`](/mojo/stdlib/algorithm/functional/tile_and_unswitch): Performs a tile and unswitch functional transformation. * [​`tile_middle_unswitch_boundaries`](/mojo/stdlib/algorithm/functional/tile_middle_unswitch_boundaries): Divides 1d iteration space into three parts and tiles them with different steps. * [​`unswitch`](/mojo/stdlib/algorithm/functional/unswitch): Performs a functional unswitch transformation. * [​`vectorize`](/mojo/stdlib/algorithm/functional/vectorize): Simplifies SIMD optimized loops by mapping a function across a range from 0 to `size`, incrementing by `simd_width` at each step. The remainder of `size % simd_width` will run in separate iterations. --- ## map `map[origins: origin.set, //, func: fn(Int) capturing -> None](size: Int)` Maps a function over a range from 0 to size. **Parameters:** * ​origins (`origin.set`): The capture origins. * ​func (`fn(Int) capturing -> None`): Function to map. **Args:** * ​size (`Int`): The number of elements. --- ## parallelize `parallelize[origins: origin.set, //, func: fn(Int) capturing -> None](num_work_items: Int)` Executes func(0) ... func(num\_work\_items-1) as sub-tasks in parallel, and returns when all are complete. **Parameters:** * ​origins (`origin.set`): The capture origins. * ​func (`fn(Int) capturing -> None`): The function to invoke. **Args:** * ​num\_work\_items (`Int`): Number of parallel tasks. `parallelize[origins: origin.set, //, func: fn(Int) capturing -> None](num_work_items: Int, num_workers: Int)` Executes func(0) ... func(num\_work\_items-1) as sub-tasks in parallel, and returns when all are complete. **Parameters:** * ​origins (`origin.set`): The capture origins. * ​func (`fn(Int) capturing -> None`): The function to invoke. 
**Args:** * ​num\_work\_items (`Int`): Number of parallel tasks. * ​num\_workers (`Int`): The number of workers to use for execution. --- ## parallelize_over_rows `parallelize_over_rows[: origin.set, //, func: fn(Int, Int) capturing -> None](shape: IndexList[size, element_type=element_type], axis: Int, grain_size: Int)` Parallelize func over non-axis dims of shape. **Parameters:** * ​func (`fn(Int, Int) capturing -> None`): Function to call on range of rows. **Args:** * ​shape (`IndexList[size, element_type=element_type]`): Shape to parallelize over. * ​axis (`Int`): Rows are slices along the axis dimension of shape. * ​grain\_size (`Int`): The minimum number of elements to warrant using an additional thread. --- ## sync_parallelize `sync_parallelize[origins: origin.set, //, func: fn(Int) capturing -> None](num_work_items: Int)` Executes func(0) ... func(num\_work\_items-1) as parallel sub-tasks, and returns when all are complete. **Parameters:** * ​origins (`origin.set`): The capture origins. * ​func (`fn(Int) capturing -> None`): The function to invoke. **Args:** * ​num\_work\_items (`Int`): Number of parallel tasks. `sync_parallelize[origins: origin.set, //, func: fn(Int) raises capturing -> None](num_work_items: Int)` Executes func(0) ... func(num\_work\_items-1) as parallel sub-tasks, and returns when all are complete. TODO: Currently exceptions raised by func will cause a trap rather than be propagated back to the caller. **Parameters:** * ​origins (`origin.set`): The capture origins. * ​func (`fn(Int) raises capturing -> None`): The function to invoke. **Args:** * ​num\_work\_items (`Int`): Number of parallel tasks. --- ## tile `tile[: origin.set, //, workgroup_function: fn[Int](Int) capturing -> None, tile_size_list: VariadicList[Int]](offset: Int, upperbound: Int)` A generator that launches work groups in the specified list of tile sizes. A workgroup function is a function that can process a configurable consecutive "tile" of workload. E.g. `work_on[3](5)` should launch computation on items 5, 6, 7, and should be semantically equivalent to `work_on[1](5)`, `work_on[1](6)`, `work_on[1](7)`. This generator will try to proceed with the given list of tile sizes in the listed order. E.g. `tile[func, (3,2,1)](offset, upperbound)` will try to call `func[3]` starting from offset until remaining work is less than 3 from upperbound and then try `func[2]`, and then `func[1]`, etc., as illustrated in the sketch below. **Parameters:** * ​workgroup\_function (`fn[Int](Int) capturing -> None`): Workgroup function that processes one tile of workload. * ​tile\_size\_list (`VariadicList[Int]`): List of tile sizes to launch work. **Args:** * ​offset (`Int`): The initial index to start the work from. * ​upperbound (`Int`): The runtime upperbound that the work function should not exceed. `tile[: origin.set, //, workgroup_function: fn(Int, Int) capturing -> None](offset: Int, upperbound: Int, tile_size_list: VariadicList[Int])` A generator that launches work groups in the specified list of tile sizes. This is the version of the tile generator for the case where the workgroup function can take the tile size as a runtime value. **Parameters:** * ​workgroup\_function (`fn(Int, Int) capturing -> None`): Workgroup function that processes one tile of workload. **Args:** * ​offset (`Int`): The initial index to start the work from. * ​upperbound (`Int`): The runtime upperbound that the work function should not exceed. * ​tile\_size\_list (`VariadicList[Int]`): List of tile sizes to launch work. 
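For instance, here is a minimal sketch of the static-tile-size form described above. `work` is a hypothetical workgroup function, and passing `VariadicList[Int](...)` as the tile-size parameter is an assumption based on the signature shown:

```mojo
from algorithm.functional import tile

fn main():
    # Each invocation handles `tile_size` consecutive items starting at `offset`.
    @parameter
    fn work[tile_size: Int](offset: Int):
        print("processing tile of", tile_size, "at offset", offset)

    # Greedily covers [0, 10): tiles of 4 at offsets 0 and 4, then a tile of 2 at 8.
    tile[work, VariadicList[Int](4, 2, 1)](0, 10)
```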
`tile[: origin.set, //, secondary_tile_size_list: VariadicList[Int], secondary_cleanup_tile: Int, workgroup_function: fn[Int](Int, Int) capturing -> None](offset: Int, upperbound: Int, primary_tile_size_list: VariadicList[Int], primary_cleanup_tile: Int)` A generator that launches work groups in the specified list of tile sizes until the sum of primary\_tile\_sizes has exceeded the upperbound. **Parameters:** * ​secondary\_tile\_size\_list (`VariadicList[Int]`): List of static tile sizes to launch work. * ​secondary\_cleanup\_tile (`Int`): Last static tile to use when primary tile sizes don't fit exactly within the upperbound. * ​workgroup\_function (`fn[Int](Int, Int) capturing -> None`): Workgroup function that processes one tile of workload. **Args:** * ​offset (`Int`): The initial index to start the work from. * ​upperbound (`Int`): The runtime upperbound that the work function should not exceed. * ​primary\_tile\_size\_list (`VariadicList[Int]`): List of dynamic tile sizes to launch work. * ​primary\_cleanup\_tile (`Int`): Last dynamic tile to use when primary tile sizes don't fit exactly within the upperbound. `tile[: origin.set, //, workgroup_function: fn[Int, Int](Int, Int) capturing -> None, tile_sizes_x: VariadicList[Int], tile_sizes_y: VariadicList[Int]](offset_x: Int, offset_y: Int, upperbound_x: Int, upperbound_y: Int)` Launches workgroup\_function using the largest tile sizes possible in each dimension, starting from the x and y offset, until the x and y upperbounds are reached. **Parameters:** * ​workgroup\_function (`fn[Int, Int](Int, Int) capturing -> None`): Function that is invoked for each tile and offset. * ​tile\_sizes\_x (`VariadicList[Int]`): List of tile sizes to use for the first parameter of workgroup\_function. * ​tile\_sizes\_y (`VariadicList[Int]`): List of tile sizes to use for the second parameter of workgroup\_function. **Args:** * ​offset\_x (`Int`): Initial x offset passed to workgroup\_function. * ​offset\_y (`Int`): Initial y offset passed to workgroup\_function. * ​upperbound\_x (`Int`): Max offset in x dimension passed to workgroup function. * ​upperbound\_y (`Int`): Max offset in y dimension passed to workgroup function. --- ## tile_and_unswitch `tile_and_unswitch[: origin.set, //, workgroup_function: fn[Int, Bool](Int, Int) capturing -> None, tile_size_list: VariadicList[Int]](offset: Int, upperbound: Int)` Performs a tile and unswitch functional transformation. A variant of static tile given a workgroup function that can be unswitched. This generator is a fused version of tile and unswitch, where the static unswitch is true throughout the "inner" portion of the workload and is false only on the residue tile. **Parameters:** * ​workgroup\_function (`fn[Int, Bool](Int, Int) capturing -> None`): Workgroup function that processes one tile of workload. * ​tile\_size\_list (`VariadicList[Int]`): List of tile sizes to launch work. **Args:** * ​offset (`Int`): The initial index to start the work from. * ​upperbound (`Int`): The runtime upperbound that the work function should not exceed. `tile_and_unswitch[: origin.set, //, workgroup_function: fn[Bool](Int, Int, Int) capturing -> None](offset: Int, upperbound: Int, tile_size_list: VariadicList[Int])` Performs a tile and unswitch functional transformation. A variant of dynamic tile given a workgroup function that can be unswitched. This generator is a fused version of tile and unswitch, where the static unswitch is true throughout the "inner" portion of the workload and is false only on the residue tile. 
**Parameters:** * ​workgroup\_function (`fn[Bool](Int, Int, Int) capturing -> None`): Workgroup function that processes one tile of workload. **Args:** * ​offset (`Int`): The initial index to start the work from. * ​upperbound (`Int`): The runtime upperbound that the work function should not exceed. * ​tile\_size\_list (`VariadicList[Int]`): List of tile sizes to launch work. --- ## tile_middle_unswitch_boundaries `tile_middle_unswitch_boundaries[: origin.set, //, work_fn: fn[Int, Bool](Int) capturing -> None, middle_tile_sizes: VariadicList[Int], left_tile_size: Int = 1, right_tile_size: Int = 1](left_boundary_start: Int, left_boundary_end: Int, right_boundary_start: Int, right_boundary_end: Int)` Divides 1d iteration space into three parts and tiles them with different steps. The 1d iteration space is divided into: 1\. \[left\_boundary\_start, left\_boundary\_end), affected by the left boundary. 2\. \[left\_boundary\_end, right\_boundary\_start), not affected by any boundary. 3\. \[right\_boundary\_start, right\_boundary\_end), affected by the right boundary. work\_fn's switch is true for the left and right boundaries, implying boundary conditions like padding in convolution. The middle part is tiled with static tile sizes with the switch as false. `middle_tile_sizes` should be in descending order for optimal performance. (A larger tile size appearing later in the list would fail the while-loop.) **Parameters:** * ​work\_fn (`fn[Int, Bool](Int) capturing -> None`): Work function that processes one tile of workload. * ​middle\_tile\_sizes (`VariadicList[Int]`): List of tile sizes for the middle part. * ​left\_tile\_size (`Int`): Tile size for the left boundary region. * ​right\_tile\_size (`Int`): Tile size for the right boundary region. **Args:** * ​left\_boundary\_start (`Int`): Start index of the left boundary. * ​left\_boundary\_end (`Int`): End index of the left boundary. * ​right\_boundary\_start (`Int`): Start index of the right boundary. * ​right\_boundary\_end (`Int`): End index of the right boundary. `tile_middle_unswitch_boundaries[: origin.set, //, work_fn: fn[Int, Bool, Bool](Int) capturing -> None, tile_size: Int, size: Int]()` Tile 1d iteration space with boundary conditions at both ends. This generator is primarily for convolution with static shapes. `work_fn`'s flags hint the function to handle padding at the boundary. The size is the static output row size, i.e., the WO dimension. **Parameters:** * ​work\_fn (`fn[Int, Bool, Bool](Int) capturing -> None`): Work function that updates one tile. It has two flags for left and right boundaries, respectively. * ​tile\_size (`Int`): 1D Tile size. * ​size (`Int`): Iteration range is \[0, size). --- ## unswitch `unswitch[: origin.set, //, switched_func: fn[Bool]() raises capturing -> None](dynamic_switch: Bool)` Performs a functional unswitch transformation. Unswitch is a simple pattern that is similar in idea to the loop unswitching pass but extended to functional patterns. The pattern facilitates the following code transformation that reduces the number of branches in the generated code (pseudocode, with a loop-invariant condition):

```
Before:
    for i in range(...):
        if dynamic_switch:
            fast_path()
        else:
            slow_path()

After:
    if dynamic_switch:
        for i in range(...):
            fast_path()
    else:
        for i in range(...):
            slow_path()
```

**Parameters:** * ​switched\_func (`fn[Bool]() raises capturing -> None`): The function containing the inner loop logic that can be unswitched. **Args:** * ​dynamic\_switch (`Bool`): The dynamic condition that enables the unswitched code path. `unswitch[: origin.set, //, switched_func: fn[Bool]() capturing -> None](dynamic_switch: Bool)` Performs a functional unswitch transformation. 
Unswitch is a simple pattern that is similar in idea to the loop unswitching pass but extended to functional patterns. The pattern facilitates the same code transformation shown for the overload above, reducing the number of branches in the generated code. **Parameters:** * ​switched\_func (`fn[Bool]() capturing -> None`): The function containing the inner loop logic that can be unswitched. **Args:** * ​dynamic\_switch (`Bool`): The dynamic condition that enables the unswitched code path. `unswitch[: origin.set, //, switched_func: fn[Bool, Bool]() capturing -> None](dynamic_switch_a: Bool, dynamic_switch_b: Bool)` Performs a functional 2-predicates unswitch transformation. **Parameters:** * ​switched\_func (`fn[Bool, Bool]() capturing -> None`): The function containing the inner loop logic that has 2 predicates which can be unswitched. **Args:** * ​dynamic\_switch\_a (`Bool`): The first dynamic condition that enables the outer unswitched code path. * ​dynamic\_switch\_b (`Bool`): The second dynamic condition that enables the inner unswitched code path. --- ## vectorize `vectorize[origins: origin.set, //, func: fn[Int](Int) capturing -> None, simd_width: Int, /, *, unroll_factor: Int = 1](size: Int)` Simplifies SIMD optimized loops by mapping a function across a range from 0 to `size`, incrementing by `simd_width` at each step. The remainder of `size % simd_width` will run in separate iterations. The below example demonstrates how you could improve the performance of a loop, by setting multiple values at the same time using SIMD registers on the machine:

```mojo
from algorithm.functional import vectorize
from memory import UnsafePointer
from sys import simdwidthof

# The amount of elements to loop through
alias size = 10
# How many DType.int32 elements fit into the SIMD register (4 on 128bit)
alias simd_width = simdwidthof[DType.int32]()  # assumed to be 4 in this example

fn main():
    var p = UnsafePointer[Int32].alloc(size)

    # @parameter allows the closure to capture the `p` pointer
    @parameter
    fn closure[width: Int](i: Int):
        print("storing", width, "els at pos", i)
        p.store[width=width](i, i)

    vectorize[closure, simd_width](size)
    print(p.load[width=simd_width]())
    print(p.load[width=simd_width](simd_width))
```

On a machine with a SIMD register size of 128, this will set 4xInt32 values on each iteration. The remainder of 10 % 4 is 2, so those last two elements will be set in two separate iterations:

```plaintext
storing 4 els at pos 0
storing 4 els at pos 4
storing 1 els at pos 8
storing 1 els at pos 9
[0, 0, 0, 0, 4, 4, 4, 4, 8, 9]
```

You can also unroll the loop to potentially improve performance at the cost of binary size:

```
vectorize[closure, simd_width, unroll_factor=2](size)
```

In the generated assembly the function calls will be repeated, resulting in fewer arithmetic, comparison, and conditional jump operations. The assembly would look like this in pseudocode:

```
closure[4](0)
closure[4](4)
# Remainder loop won't unroll unless `size` is passed as a parameter
for i in range(8, 10):
    closure[1](i)
```

You can pass `size` as a parameter if it's known at compile time to reduce the iterations for the remainder. This only occurs if the remainder is a power of 2 (2, 4, 8, 16, ...). The remainder loop will still unroll for performance improvements if not a power of 2. **Parameters:** * ​origins (`origin.set`): The capture origins. * ​func (`fn[Int](Int) capturing -> None`): The function that will be called in the loop body. * ​simd\_width (`Int`): The SIMD vector width. 
* ​unroll\_factor (`Int`): The unroll factor for the main loop (Default 1). **Args:** * ​size (`Int`): The upper limit for the loop. `vectorize[origins: origin.set, //, func: fn[Int](Int) capturing -> None, simd_width: Int, /, *, size: Int, unroll_factor: Int = size if is_nvidia_gpu() else 1]()` Simplifies SIMD optimized loops by mapping a function across a range from 0 to `size`, incrementing by `simd_width` at each step. The remainder of `size % simd_width` will run in a single iteration if it's a power of 2. The below example demonstrates how you could improve the performance of a loop, by setting multiple values at the same time using SIMD registers on the machine:

```mojo
from algorithm.functional import vectorize
from memory import UnsafePointer
from sys import simdwidthof

# The amount of elements to loop through
alias size = 10
# How many DType.int32 elements fit into the SIMD register (4 on 128bit)
alias simd_width = simdwidthof[DType.int32]()  # assumed to be 4 in this example

fn main():
    var p = UnsafePointer[Int32].alloc(size)

    # @parameter allows the closure to capture the `p` pointer
    @parameter
    fn closure[width: Int](i: Int):
        print("storing", width, "els at pos", i)
        p.store[width=width](i, i)

    vectorize[closure, simd_width](size)
    print(p.load[width=simd_width]())
    print(p.load[width=simd_width](simd_width))
```

On a machine with a SIMD register size of 128, this will set 4xInt32 values on each iteration. The remainder of 10 % 4 is 2, so those last two elements will be set in a single iteration:

```plaintext
storing 4 els at pos 0
storing 4 els at pos 4
storing 2 els at pos 8
[0, 0, 0, 0, 4, 4, 4, 4, 8, 8]
```

If the remainder is not a power of 2 (2, 4, 8, 16, ...) there will be a separate iteration for each element. However, passing `size` as a parameter also allows the loop for the remaining elements to be unrolled. You can also unroll the main loop to potentially improve performance at the cost of binary size:

```
vectorize[closure, simd_width, size=size, unroll_factor=2]()
```

In the generated assembly the function calls will be repeated, resulting in fewer arithmetic, comparison, and conditional jump operations. The assembly would look like this in pseudocode:

```
closure[4](0)
closure[4](4)
closure[2](8)
```

**Parameters:** * ​origins (`origin.set`): The capture origins. * ​func (`fn[Int](Int) capturing -> None`): The function that will be called in the loop body. * ​simd\_width (`Int`): The SIMD vector width. * ​size (`Int`): The upper limit for the loop. * ​unroll\_factor (`Int`): The unroll factor for the main loop (defaults to `size` on NVIDIA GPUs, otherwise 1). --- ## algorithm Implements the algorithm package. ## Modules * [​`functional`](/mojo/stdlib/algorithm/functional/): Implements higher-order functions. * [​`memory`](/mojo/stdlib/algorithm/memory/): Implements `parallel_memcpy`. * [​`reduction`](/mojo/stdlib/algorithm/reduction/): Implements SIMD reductions. --- ## memory Implements `parallel_memcpy`. You can import these APIs from the `algorithm` package. For example:

```mojo
from algorithm import parallel_memcpy
```

## Functions * [​`parallel_memcpy`](/mojo/stdlib/algorithm/memory/parallel_memcpy): Copies `count` elements from a memory buffer `src` to `dest` in parallel by spawning `num_tasks` tasks each copying `count_per_task` elements. 
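A minimal usage sketch for the copy helper summarized above, ahead of its detailed entry below. The buffer size is arbitrary, and letting the 3-argument overload pick its own task split is the simple path:

```mojo
from algorithm import parallel_memcpy
from memory import UnsafePointer

fn main():
    alias count = 1024
    var src = UnsafePointer[Float32].alloc(count)
    var dst = UnsafePointer[Float32].alloc(count)
    for i in range(count):
        src[i] = Float32(i)

    # Copy all elements from src to dst across parallel tasks.
    parallel_memcpy(dst, src, count)

    print(dst[0], dst[count - 1])  # 0.0 1023.0
    src.free()
    dst.free()
```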
--- ## parallel_memcpy `parallel_memcpy[type: DType](dest: UnsafePointer[SIMD[type, 1]], src: UnsafePointer[SIMD[type, 1]], count: Int, count_per_task: Int, num_tasks: Int)` Copies `count` elements from a memory buffer `src` to `dest` in parallel by spawning `num_tasks` tasks each copying `count_per_task` elements. **Parameters:** * ​type (`DType`): The element dtype. **Args:** * ​dest (`UnsafePointer[SIMD[type, 1]]`): The destination buffer. * ​src (`UnsafePointer[SIMD[type, 1]]`): The source buffer. * ​count (`Int`): Number of elements in the buffer. * ​count\_per\_task (`Int`): Task size. * ​num\_tasks (`Int`): Number of tasks to run in parallel. `parallel_memcpy[type: DType](dest: UnsafePointer[SIMD[type, 1]], src: UnsafePointer[SIMD[type, 1]], count: Int)` Copies `count` elements from a memory buffer `src` to `dest` in parallel. **Parameters:** * ​type (`DType`): The element type. **Args:** * ​dest (`UnsafePointer[SIMD[type, 1]]`): The destination pointer. * ​src (`UnsafePointer[SIMD[type, 1]]`): The source pointer. * ​count (`Int`): The number of elements to copy. --- ## all_true `all_true(src: NDBuffer[type, 1, origin]) -> Bool` Returns True if all the elements in a buffer are True and False otherwise. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** True if all of the elements of the buffer are True and False otherwise. --- ## any_true `any_true(src: NDBuffer[type, 1, origin]) -> Bool` Returns True if any of the elements in a buffer are True and False otherwise. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** True if any of the elements of the buffer are True and False otherwise. --- ## cumsum `cumsum(dst: NDBuffer[type, 1, origin], src: NDBuffer[type, 1, origin, shape, strides])` Computes the cumulative sum of all elements in a buffer. dst\[i] = src\[i] + src\[i-1] + ... + src\[0]. **Args:** * ​dst (`NDBuffer[type, 1, origin]`): The buffer that stores the result of cumulative sum operation. * ​src (`NDBuffer[type, 1, origin, shape, strides]`): The buffer of elements for which the cumulative sum is computed. --- ## reduction Implements SIMD reductions. You can import these APIs from the `algorithm` package. For example:

```mojo
from algorithm import map_reduce
```

## Functions * [​`all_true`](/mojo/stdlib/algorithm/reduction/all_true): Returns True if all the elements in a buffer are True and False otherwise. * [​`any_true`](/mojo/stdlib/algorithm/reduction/any_true): Returns True if any of the elements in a buffer are True and False otherwise. * [​`cumsum`](/mojo/stdlib/algorithm/reduction/cumsum): Computes the cumulative sum of all elements in a buffer. dst\[i] = src\[i] + src\[i-1] + ... + src\[0]. * [​`map_reduce`](/mojo/stdlib/algorithm/reduction/map_reduce): Stores the result of calling input\_gen\_fn in dst and simultaneously reduces the result using a custom reduction function. * [​`max`](/mojo/stdlib/algorithm/reduction/max): Computes the max element in a buffer. * [​`mean`](/mojo/stdlib/algorithm/reduction/mean): Computes the mean value of the elements in a buffer. * [​`min`](/mojo/stdlib/algorithm/reduction/min): Computes the min element in a buffer. * [​`none_true`](/mojo/stdlib/algorithm/reduction/none_true): Returns True if none of the elements in a buffer are True and False otherwise. * [​`product`](/mojo/stdlib/algorithm/reduction/product): Computes the product of the buffer elements. * [​`reduce`](/mojo/stdlib/algorithm/reduction/reduce): Computes a custom reduction of buffer elements. 
* [​`reduce_boolean`](/mojo/stdlib/algorithm/reduction/reduce_boolean): Computes a bool reduction of buffer elements. The reduction will early exit if the `continue_fn` returns False. * [​`sum`](/mojo/stdlib/algorithm/reduction/sum): Computes the sum of buffer elements. * [​`variance`](/mojo/stdlib/algorithm/reduction/variance): Given a mean, computes the variance of elements in a buffer. --- ## map_reduce `map_reduce[simd_width: Int, size: Dim, type: DType, acc_type: DType, origins_gen: origin.set, input_gen_fn: fn[DType, Int](Int) capturing -> SIMD[$0, $1], origins_vec: origin.set, reduce_vec_to_vec_fn: fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2], reduce_vec_to_scalar_fn: fn[DType, Int](SIMD[$0, $1]) -> SIMD[$0, 1]](dst: NDBuffer[type, 1, origin, __init__[::Intable](size)], init: SIMD[acc_type, 1]) -> SIMD[acc_type, 1]` Stores the result of calling input\_gen\_fn in dst and simultaneously reduces the result using a custom reduction function. **Parameters:** * ​simd\_width (`Int`): The vector width for the computation. * ​size (`Dim`): The buffer size. * ​type (`DType`): The buffer elements dtype. * ​acc\_type (`DType`): The dtype of the reduction accumulator. * ​origins\_gen (`origin.set`): The OriginSet of captured arguments by the input\_gen\_fn. * ​input\_gen\_fn (`fn[DType, Int](Int) capturing -> SIMD[$0, $1]`): A function that generates inputs to reduce. * ​origins\_vec (`origin.set`): The OriginSet of captured arguments by the reduce\_vec\_to\_vec\_fn. * ​reduce\_vec\_to\_vec\_fn (`fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2]`): A mapping function. This function is used to combine (accumulate) two chunks of input data: e.g. we load two `8xfloat32` vectors of elements and need to reduce them into a single `8xfloat32` vector. * ​reduce\_vec\_to\_scalar\_fn (`fn[DType, Int](SIMD[$0, $1]) -> SIMD[$0, 1]`): A reduction function. This function is used to reduce a vector to a scalar. E.g. when we have an `8xfloat32` vector and want to reduce it to a `float32` scalar. **Args:** * ​dst (`NDBuffer[type, 1, origin, __init__[::Intable](size)]`): The output buffer. * ​init (`SIMD[acc_type, 1]`): The initial value to use in the accumulator. **Returns:** The computed reduction value. --- ## max `max(src: NDBuffer[type, 1, origin]) -> SIMD[type, 1]` Computes the max element in a buffer. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** The maximum of the buffer elements. `max[reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape])` Computes the max across reduce\_axis of an NDBuffer. **Parameters:** * ​reduce\_axis (`Int`): The axis to reduce across. **Args:** * ​src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * ​dst (`NDBuffer[type, rank, origin, shape]`): The output buffer. `max[: origin.set, : origin.set, //, type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, /, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[size], reduce_dim: Int, context: DeviceContextPtr = DeviceContextPtr())` Computes the max across the input and output shape. 
This performs the max computation on the domain specified by `input_shape`, loading the inputs using the `input_fn`. The results are stored using the `output_fn`. **Parameters:** * ​type (`DType`): The type of the input and output. * ​input\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): The function to load the input. * ​output\_fn (`fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None`): The function to store the output. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * ​input\_shape (`IndexList[size]`): The input shape. * ​reduce\_dim (`Int`): The axis to perform the max on. * ​context (`DeviceContextPtr`): The pointer to DeviceContext. --- ## mean `mean(src: NDBuffer[type, 1, origin]) -> SIMD[type, 1]` Computes the mean value of the elements in a buffer. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer of elements for which the mean is computed. **Returns:** The mean value of the elements in the given buffer. `mean[reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape])` Computes the mean across reduce\_axis of an NDBuffer. **Parameters:** * ​reduce\_axis (`Int`): The axis to reduce across. **Args:** * ​src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * ​dst (`NDBuffer[type, rank, origin, shape]`): The output buffer. `mean[: origin.set, : origin.set, //, type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, /, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[size], reduce_dim: Int, output_shape: IndexList[size], context: DeviceContextPtr = DeviceContextPtr())` Computes the mean across the input and output shape. This performs the mean computation on the domain specified by `input_shape`, loading the inputs using the `input_fn`. The results' domain is `output_shape` which are stored using the `output_fn`. **Parameters:** * ​type (`DType`): The type of the input and output. * ​input\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): The function to load the input. * ​output\_fn (`fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None`): The function to store the output. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * ​input\_shape (`IndexList[size]`): The input shape. * ​reduce\_dim (`Int`): The axis to perform the mean on. * ​output\_shape (`IndexList[size]`): The output shape. * ​context (`DeviceContextPtr`): The pointer to DeviceContext. --- ## min `min(src: NDBuffer[type, 1, origin]) -> SIMD[type, 1]` Computes the min element in a buffer. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** The minimum of the buffer elements. `min[reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape])` Computes the min across reduce\_axis of an NDBuffer. 
**Parameters:** * ​reduce\_axis (`Int`): The axis to reduce across. **Args:** * ​src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * ​dst (`NDBuffer[type, rank, origin, shape]`): The output buffer. `min[: origin.set, : origin.set, //, type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, /, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[size], reduce_dim: Int, context: DeviceContextPtr = DeviceContextPtr())` Computes the min across the input and output shape. This performs the min computation on the domain specified by `input_shape`, loading the inputs using the `input_fn`. The results are stored using the `output_fn`. **Parameters:** * ​type (`DType`): The type of the input and output. * ​input\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): The function to load the input. * ​output\_fn (`fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None`): The function to store the output. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * ​input\_shape (`IndexList[size]`): The input shape. * ​reduce\_dim (`Int`): The axis to perform the min on. * ​context (`DeviceContextPtr`): The pointer to DeviceContext. --- ## none_true `none_true(src: NDBuffer[type, 1, origin]) -> Bool` Returns True if none of the elements in a buffer are True and False otherwise. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** True if none of the elements of the buffer are True and False otherwise. --- ## product `product(src: NDBuffer[type, 1, origin]) -> SIMD[type, 1]` Computes the product of the buffer elements. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** The product of the buffer elements. `product[reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape])` Computes the product across reduce\_axis of an NDBuffer. **Parameters:** * ​reduce\_axis (`Int`): The axis to reduce across. **Args:** * ​src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * ​dst (`NDBuffer[type, rank, origin, shape]`): The output buffer. `product[: origin.set, : origin.set, //, type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, /, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[size], reduce_dim: Int, context: DeviceContextPtr = DeviceContextPtr())` Computes the product across the input and output shape. This performs the product computation on the domain specified by `input_shape`, loading the inputs using the `input_fn`. The results are stored using the `output_fn`. **Parameters:** * ​type (`DType`): The type of the input and output. * ​input\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): The function to load the input. 
* output\_fn (`fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None`): The function to store the output. * single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * input\_shape (`IndexList[size]`): The input shape. * reduce\_dim (`Int`): The axis to perform the product on. * context (`DeviceContextPtr`): The pointer to DeviceContext. --- ## reduce `reduce[: origin.set, //, reduce_fn: fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2]](src: NDBuffer[type, 1, origin], init: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Computes a custom reduction of buffer elements. **Parameters:** * reduce\_fn (`fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2]`): The lambda implementing the reduction. **Args:** * src (`NDBuffer[type, 1, origin]`): The input buffer. * init (`SIMD[dtype, 1]`): The initial value to use in the accumulator. **Returns:** The computed reduction value. `reduce[: origin.set, //, map_fn: fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2], reduce_fn: fn[DType, Int](SIMD[$0, $1]) -> SIMD[$0, 1], reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], init: SIMD[type, 1])` Performs a reduction across reduce\_axis of an NDBuffer (src) and stores the result in an NDBuffer (dst). First src is reshaped into a 3D tensor. Without loss of generality, the three axes will be referred to as \[H,W,C], where the axis to reduce across is W, the axes before the reduce axis are packed into H, and the axes after the reduce axis are packed into C. I.e., a tensor with dims \[D1, D2, ..., Di, ..., Dn] reducing across axis i gets packed into a 3D tensor with dims \[H, W, C], where H=prod(D1,...,Di-1), W = Di, and C = prod(Di+1,...,Dn). **Parameters:** * map\_fn (`fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2]`): A mapping function. This function is used to combine (accumulate) two chunks of input data: e.g., we load two 8xfloat32 vectors of elements and need to reduce them to a single 8xfloat32 vector. * reduce\_fn (`fn[DType, Int](SIMD[$0, $1]) -> SIMD[$0, 1]`): A reduction function. This function is used to reduce a vector to a scalar, e.g., reducing an 8xfloat32 vector to a 1xfloat32 scalar. * reduce\_axis (`Int`): The axis to reduce across. **Args:** * src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * dst (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The output buffer. * init (`SIMD[type, 1]`): The initial value to use in the accumulator. --- ## reduce_boolean `reduce_boolean[: origin.set, : origin.set, //, reduce_fn: fn[DType, Int](SIMD[$0, $1]) capturing -> Bool, continue_fn: fn(Bool) capturing -> Bool](src: NDBuffer[type, 1, origin], init: Bool) -> Bool` Computes a bool reduction of buffer elements. The reduction will early exit if the `continue_fn` returns False. **Parameters:** * reduce\_fn (`fn[DType, Int](SIMD[$0, $1]) capturing -> Bool`): A boolean reduction function. This function is used to reduce a vector to a scalar, e.g., reducing an `8xfloat32` vector to a `bool`. * continue\_fn (`fn(Bool) capturing -> Bool`): A function to indicate whether we want to continue processing the rest of the iterations. This takes the result of the reduce\_fn and returns True to continue processing and False to early exit. **Args:** * src (`NDBuffer[type, 1, origin]`): The input buffer. * init (`Bool`): The initial value to use. **Returns:** The computed reduction value. --- ## sum `sum(src: NDBuffer[type, 1, origin]) -> SIMD[type, 1]` Computes the sum of buffer elements. **Args:** * src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** The sum of the buffer elements. `sum[reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape])` Computes the sum across reduce\_axis of an NDBuffer. **Parameters:** * reduce\_axis (`Int`): The axis to reduce across. **Args:** * src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * dst (`NDBuffer[type, rank, origin, shape]`): The output buffer. `sum[: origin.set, : origin.set, //, type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, /, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[size], reduce_dim: Int, context: DeviceContextPtr = DeviceContextPtr())` Computes the sum across the input shape. This performs the sum computation on the domain specified by `input_shape`, loading the inputs using the `input_fn`. The results are stored using the `output_fn`. **Parameters:** * type (`DType`): The type of the input and output. * input\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): The function to load the input. * output\_fn (`fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None`): The function to store the output. * single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * input\_shape (`IndexList[size]`): The input shape. * reduce\_dim (`Int`): The axis to perform the sum on. * context (`DeviceContextPtr`): The pointer to DeviceContext. --- ## variance `variance(src: NDBuffer[type, 1, origin], mean_value: SIMD[type, 1], correction: Int = 1) -> SIMD[type, 1]` Given a mean, computes the variance of elements in a buffer. The mean value is used to avoid a second pass over the data:

```
variance(x) = sum((x - E(x))^2) / (size - correction)
```

**Args:** * src (`NDBuffer[type, 1, origin]`): The buffer. * mean\_value (`SIMD[type, 1]`): The mean value of the buffer. * correction (`Int`): Normalize variance by size - correction. **Returns:** The variance value of the elements in a buffer. `variance(src: NDBuffer[type, 1, origin], correction: Int = 1) -> SIMD[type, 1]` Computes the variance value of the elements in a buffer.

```
variance(x) = sum((x - E(x))^2) / (size - correction)
```

**Args:** * src (`NDBuffer[type, 1, origin]`): The buffer. * correction (`Int`): Normalize variance by size - correction (Default=1). **Returns:** The variance value of the elements in a buffer.
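As a quick illustration of these reductions, here is a minimal sketch. The import paths (`algorithm.reduction`, `buffer`) and the exact `NDBuffer` constructor form are assumptions that may vary across Mojo versions:

```mojo
from algorithm.reduction import mean, sum, variance
from buffer import NDBuffer

fn main():
    alias n = 8
    var ptr = UnsafePointer[Float32].alloc(n)
    for i in range(n):
        ptr[i] = Float32(i)  # 0.0, 1.0, ..., 7.0

    # A rank-1 view over the allocation; NDBuffer does not take ownership.
    var buf = NDBuffer[DType.float32, 1](ptr, n)

    print(sum(buf))       # 28.0
    print(mean(buf))      # 3.5
    print(variance(buf))  # 6.0 with the default correction of 1

    ptr.free()
```

With the default `correction` of 1 this is the sample variance: the squared deviations from the mean (3.5) sum to 42.0, divided by `size - 1 = 7` gives 6.0.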
--- ## b16decode `b16decode(str: StringSlice[origin]) -> String` Performs base16 decoding on the input string. **Args:** * str (`StringSlice[origin]`): A base16 encoded string. **Returns:** The decoded string. --- ## b16encode `b16encode(str: StringSlice[origin]) -> String` Performs base16 encoding on the input string slice. **Args:** * str (`StringSlice[origin]`): The input string slice. **Returns:** Base16 encoding of the input string. --- ## b64decode `b64decode[*, validate: Bool = False](str: StringSlice[origin]) -> String` Performs base64 decoding on the input string. **Parameters:** * validate (`Bool`): If true, the function will validate the input string. **Args:** * str (`StringSlice[origin]`): A base64 encoded string. **Returns:** The decoded string. --- ## b64encode `b64encode(input_bytes: Span[SIMD[uint8, 1], origin], mut result: String)` Performs base64 encoding on the input string. Notes: This method reserves the necessary capacity. `result` can be a 0 capacity string. **Args:** * input\_bytes (`Span[SIMD[uint8, 1], origin]`): The input string buffer. * result (`String`): The string in which to store the values. `b64encode(input_string: StringSlice[origin]) -> String` Performs base64 encoding on the input string. **Args:** * input\_string (`StringSlice[origin]`): The input string buffer. **Returns:** The ASCII base64 encoded string. `b64encode(input_bytes: Span[SIMD[uint8, 1], origin]) -> String` Performs base64 encoding on the input string. **Args:** * input\_bytes (`Span[SIMD[uint8, 1], origin]`): The input string buffer. **Returns:** The ASCII base64 encoded string. --- ## base64 Provides functions for base64 encoding strings. You can import these APIs from the `base64` package. For example:

```mojo
from base64 import b64encode
```

## Functions * [`b16decode`](/mojo/stdlib/base64/base64/b16decode): Performs base16 decoding on the input string. * [`b16encode`](/mojo/stdlib/base64/base64/b16encode): Performs base16 encoding on the input string slice. * [`b64decode`](/mojo/stdlib/base64/base64/b64decode): Performs base64 decoding on the input string. * [`b64encode`](/mojo/stdlib/base64/base64/b64encode): Performs base64 encoding on the input string. --- ## base64 Implements the base64 package. ## Modules * [`base64`](/mojo/stdlib/base64/base64/): Provides functions for base64 encoding strings.
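Putting the encode and decode halves together, a minimal round-trip sketch (this assumes the returned `String` converts implicitly to `StringSlice` when passed back to `b64decode`; expected output is shown in comments):

```mojo
from base64 import b64decode, b64encode

fn main():
    var encoded = b64encode("Hello, Mojo!")
    print(encoded)  # SGVsbG8sIE1vam8h
    var decoded = b64decode(encoded)
    print(decoded)  # Hello, Mojo!
```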
--- ## Bench `struct Bench` Constructs a Benchmark object, used for running multiple benchmarks and comparing the results. Example:

```mojo
from benchmark import (
    Bench,
    BenchConfig,
    Bencher,
    BenchId,
    ThroughputMeasure,
    BenchMetric,
    Format,
)
from utils import IndexList
from gpu.host import DeviceContext
from pathlib import Path

fn example_kernel():
    print("example_kernel")

var shape = IndexList[2](1024, 1024)
var bench = Bench(BenchConfig(max_iters=100))

@parameter
@always_inline
fn example(mut b: Bencher, shape: IndexList[2]) capturing raises:
    @parameter
    @always_inline
    fn kernel_launch(ctx: DeviceContext) raises:
        ctx.enqueue_function[example_kernel](
            grid_dim=shape[0], block_dim=shape[1]
        )

    var bench_ctx = DeviceContext()
    b.iter_custom[kernel_launch](bench_ctx)

bench.bench_with_input[IndexList[2], example](
    BenchId("top_k_custom", "gpu"),
    shape,
    ThroughputMeasure(
        BenchMetric.elements, shape.flattened_length()
    ),
    ThroughputMeasure(
        BenchMetric.flops, shape.flattened_length() * 3  # number of ops
    ),
)
# Add more benchmarks like above to compare results

# Pretty print in table format
print(bench)

# Dump report to csv file
bench.config.out_file = Path("out.csv")
bench.dump_report()

# Print in tabular csv format
bench.config.format = Format.tabular
print(bench)
```

You can pass arguments when running a program that makes use of `Bench`:

```sh
mojo benchmark.mojo -o out.csv -r 10
```

This will repeat the benchmarks 10 times and write the output to `out.csv` in csv format. ## Fields * config (`BenchConfig`): Constructs a Benchmark object based on specific configuration and mode. * mode (`Mode`): Benchmark mode object representing benchmark or test mode. * info\_vec (`List[BenchmarkInfo]`): A list containing the benchmark info. ## Implemented traits `AnyType`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self, config: Optional[BenchConfig] = Optional(None), mode: Mode = Mode(0))` Constructs a Benchmark object based on specific configuration and mode. **Args:** * config (`Optional[BenchConfig]`): Benchmark configuration object to control length and frequency of benchmarks. * mode (`Mode`): Benchmark mode object representing benchmark or test mode. ### `bench_with_input` `bench_with_input[: origin.set, //, T: AnyType, bench_fn: fn(mut Bencher, T) raises capturing -> None](mut self, bench_id: BenchId, input: T, measures: List[ThroughputMeasure] = List())` Benchmarks an input function with input args of type AnyType. **Parameters:** * T (`AnyType`): Benchmark function input type. * bench\_fn (`fn(mut Bencher, T) raises capturing -> None`): The function to be benchmarked. **Args:** * bench\_id (`BenchId`): The benchmark Id object used for identification. * input (`T`): Represents the target function's input arguments. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `bench_with_input[: origin.set, //, T: AnyType, bench_fn: fn(mut Bencher, T) raises capturing -> None](mut self, bench_id: BenchId, input: T, *measures: ThroughputMeasure)` Benchmarks an input function with input args of type AnyType. **Parameters:** * T (`AnyType`): Benchmark function input type. * bench\_fn (`fn(mut Bencher, T) raises capturing -> None`): The function to be benchmarked. **Args:** * bench\_id (`BenchId`): The benchmark Id object used for identification. * input (`T`): Represents the target function's input arguments. * \*measures (`ThroughputMeasure`): Variadic arg used to represent a list of ThroughputMeasure's.
`bench_with_input[: origin.set, //, T: AnyTrivialRegType, bench_fn: fn(mut Bencher, T) raises capturing -> None](mut self, bench_id: BenchId, input: T, measures: List[ThroughputMeasure] = List())` Benchmarks an input function with input args of type AnyTrivialRegType. **Parameters:** * ​T (`AnyTrivialRegType`): Benchmark function input type. * ​bench\_fn (`fn(mut Bencher, T) raises capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​input (`T`): Represents the target function's input arguments. * ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `bench_with_input[: origin.set, //, T: AnyTrivialRegType, bench_fn: fn(mut Bencher, T) raises capturing -> None](mut self, bench_id: BenchId, input: T, *measures: ThroughputMeasure)` Benchmarks an input function with input args of type AnyTrivialRegType. **Parameters:** * ​T (`AnyTrivialRegType`): Benchmark function input type. * ​bench\_fn (`fn(mut Bencher, T) raises capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​input (`T`): Represents the target function's input arguments. * ​\*measures (`ThroughputMeasure`): Variadic arg used to represent a list of ThroughputMeasure's. ### `bench_function` `bench_function[: origin.set, //, bench_fn: fn() raises capturing -> None](mut self, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmarks or Tests an input function. **Parameters:** * ​bench\_fn (`fn() raises capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `bench_function[: origin.set, //, bench_fn: fn() capturing -> None](mut self, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmarks or Tests an input function. **Parameters:** * ​bench\_fn (`fn() capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `bench_function[: origin.set, //, bench_fn: fn(mut Bencher) capturing -> None](mut self, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmarks or Tests an input function. **Parameters:** * ​bench\_fn (`fn(mut Bencher) capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `bench_function[: origin.set, //, bench_fn: fn(mut Bencher) capturing -> None](mut self, bench_id: BenchId, *measures: ThroughputMeasure)` Benchmarks or Tests an input function. **Parameters:** * ​bench\_fn (`fn(mut Bencher) capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​\*measures (`ThroughputMeasure`): Variadic arg used to represent a list of ThroughputMeasure's. `bench_function[: origin.set, //, bench_fn: fn(mut Bencher) raises capturing -> None](mut self, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmarks or Tests an input function. 
**Parameters:** * bench\_fn (`fn(mut Bencher) raises capturing -> None`): The function to be benchmarked. **Args:** * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `bench_function[: origin.set, //, bench_fn: fn(mut Bencher) raises capturing -> None](mut self, bench_id: BenchId, *measures: ThroughputMeasure)` Benchmarks or Tests an input function. **Parameters:** * bench\_fn (`fn(mut Bencher) raises capturing -> None`): The function to be benchmarked. **Args:** * bench\_id (`BenchId`): The benchmark Id object used for identification. * \*measures (`ThroughputMeasure`): Variadic arg used to represent a list of ThroughputMeasure's. ### `dump_report` `dump_report(mut self)` Prints out the report from a Benchmark execution. If `Bench.config.out_file` is set, it will also write the output in the format set in `out_file_format` to the file defined in `out_file`. ### `pad` `pad[pad_str: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")](self, width: Int, string: String) -> String` Pads a string to a given width. **Parameters:** * pad\_str (`StringSlice[StaticConstantOrigin]`): The length-1 string to use for the padding. **Args:** * width (`Int`): The width to pad the string to. * string (`String`): The string to pad. **Returns:** A string padded to the given width. ### `__str__` `__str__(self) -> String` Returns a string representation of the benchmark results. **Returns:** A string representing the benchmark results. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the benchmark results to a writer. **Parameters:** * W (`Writer`): A type conforming to the Writer trait. **Args:** * writer (`W`): The writer to write to. --- ## BenchConfig `struct BenchConfig` Defines a benchmark configuration struct to control execution times and frequency. ## Fields * out\_file (`Optional[Path]`): Output file to write results to. * min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs. * max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs. * min\_warmuptime\_secs (`SIMD[float64, 1]`): Lower bound on warmup time in secs. * max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement. * max\_iters (`Int`): Max number of iterations to run. * num\_repetitions (`Int`): Number of times the benchmark has to be repeated. * flush\_denormals (`Bool`): Whether or not the denormal values are flushed. * show\_progress (`Bool`): If True, print progress of each benchmark. * format (`Format`): The format to print results (default: "table"). * out\_file\_format (`Format`): The format to write out the file with `dump_file` (default: "csv"). * verbose\_timing (`Bool`): Whether to print verbose timing results. * verbose\_metric\_names (`Bool`): If True, print the metric name and unit; otherwise print the unit only. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `VERBOSE_TIMING_LABELS` `alias VERBOSE_TIMING_LABELS = List(__init__[__mlir_type.!kgen.string]("min (ms)"), __init__[__mlir_type.!kgen.string]("mean (ms)"), __init__[__mlir_type.!kgen.string]("max (ms)"), __init__[__mlir_type.!kgen.string]("duration (ms)"), Tuple())` Labels to print verbose timing results.
## Methods ### `__init__` `__init__(out self, out_file: Optional[Path] = Optional(None), min_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1), max_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](2), min_warmuptime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1), max_batch_size: Int = 0, max_iters: Int = 1000000000, num_repetitions: Int = 1, flush_denormals: Bool = True)` Constructs and initializes a Benchmark config object with default and input values. **Args:** * out\_file (`Optional[Path]`): Output file to write results to. * min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs (default `1`). * max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs (default `2`). * min\_warmuptime\_secs (`SIMD[float64, 1]`): Lower bound on warmup time in secs (default `1.0`). * max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement. * max\_iters (`Int`): Max number of iterations to run (default `1_000_000_000`). * num\_repetitions (`Int`): Number of times the benchmark has to be repeated. * flush\_denormals (`Bool`): Whether or not the denormal values are flushed. `__init__(out self, *, other: Self)` Explicitly construct a deep copy of the provided value. **Args:** * other (`Self`): The value to copy. --- ## BenchId `struct BenchId` Defines a benchmark Id struct to identify and represent a particular benchmark execution. ## Fields * func\_name (`String`): The target function name. * input\_id (`Optional[String]`): The target function input id phrase. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, func_name: String, input_id: String)` Constructs a Benchmark Id object from the input function name and Id phrase. **Args:** * func\_name (`String`): The target function name. * input\_id (`String`): The target function input id phrase. `@implicit` `__init__(out self, func_name: String)` Constructs a Benchmark Id object from the input function name. **Args:** * func\_name (`String`): The target function name. `@implicit` `__init__(out self, func_name: StringLiteral[value])` Constructs a Benchmark Id object from the input function name. **Args:** * func\_name (`StringLiteral[value]`): The target function name. --- ## BenchMetric `struct BenchMetric` Defines a benchmark throughput metric. ## Fields * code (`Int`): Op-code of the Metric. * name (`String`): Metric's name. * unit (`String`): Metric's throughput rate unit (count/second). ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `bytes` `alias bytes = BenchMetric(1, __init__[__mlir_type.!kgen.string]("DataMovement"), __init__[__mlir_type.!kgen.string]("GB/s"))` ### `DEFAULTS` `alias DEFAULTS = List(BenchMetric(0, __init__[__mlir_type.!kgen.string]("throughput"), __init__[__mlir_type.!kgen.string]("GElems/s")), BenchMetric(1, __init__[__mlir_type.!kgen.string]("DataMovement"), __init__[__mlir_type.!kgen.string]("GB/s")), BenchMetric(2, __init__[__mlir_type.!kgen.string]("Arithmetic"), __init__[__mlir_type.!kgen.string]("GFLOPS/s")), Tuple())` Default set of benchmark metrics.
### `elements` `alias elements = BenchMetric(0, __init__[__mlir_type.!kgen.string]("throughput"), __init__[__mlir_type.!kgen.string]("GElems/s"))` ### `flops` `alias flops = BenchMetric(2, __init__[__mlir_type.!kgen.string]("Arithmetic"), __init__[__mlir_type.!kgen.string]("GFLOPS/s"))` ### `theoretical_flops` `alias theoretical_flops = BenchMetric(3, __init__[__mlir_type.!kgen.string]("TheoreticalArithmetic"), __init__[__mlir_type.!kgen.string]("GFLOPS/s"))` ## Methods ### `__init__` `__init__(out self, *, other: Self)` Explicitly construct a deep copy of the provided value. **Args:** * other (`Self`): The value to copy. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compares two metrics for equality. **Args:** * other (`Self`): The metric to compare. **Returns:** True if the two metrics are equal. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compares two metrics for inequality. **Args:** * other (`Self`): The metric to compare. **Returns:** True if the two metrics are NOT equal. ### `__str__` `__str__(self) -> String` Gets a string representation of this metric. **Returns:** The string representation. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this BenchMetric to the provided Writer. **Parameters:** * W (`Writer`): A type conforming to the Writer trait. **Args:** * writer (`W`): The object to write to. ### `check_name` `check_name(self, alt_name: String) -> Bool` Checks whether a string contains the metric's name. **Args:** * alt\_name (`String`): Alternative name of a metric. **Returns:** True if 'alt\_name' is a valid alternative of the metric's name. ### `get_metric_from_list` `static get_metric_from_list(name: String, metric_list: List[BenchMetric]) -> Self` Gets a metric from a given list using only the metric's name. **Args:** * name (`String`): Metric's name. * metric\_list (`List[BenchMetric]`): List of metrics to search. **Returns:** The selected metric. --- ## Bencher `@register_passable` `struct Bencher` Defines a Bencher struct which facilitates the timing of a target function. ## Fields * num\_iters (`Int`): Number of iterations to run the target function. * elapsed (`Int`): The total time elapsed when running the target function. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(num_iters: Int) -> Self` Constructs a Bencher object to run and time a function. **Args:** * num\_iters (`Int`): Number of times to run the target function. ### `iter` `iter[: origin.set, //, iter_fn: fn() capturing -> None](mut self)` Measures the total elapsed time by running a target function a particular number of times. **Parameters:** * iter\_fn (`fn() capturing -> None`): The target function to benchmark. `iter[iter_fn: fn() raises capturing -> None](mut self)` Measures the total elapsed time by running a target function a particular number of times. **Parameters:** * iter\_fn (`fn() raises capturing -> None`): The target function to benchmark. ### `iter_preproc` `iter_preproc[: origin.set, : origin.set, //, iter_fn: fn() capturing -> None, preproc_fn: fn() capturing -> None](mut self)` Measures the total elapsed time by running a target function a particular number of times. **Parameters:** * iter\_fn (`fn() capturing -> None`): The target function to benchmark. * preproc\_fn (`fn() capturing -> None`): The function to preprocess the target function.
### `iter_custom` `iter_custom[: origin.set, //, iter_fn: fn(Int) capturing -> Int](mut self)` Times a target function with a custom number of iterations. **Parameters:** * iter\_fn (`fn(Int) capturing -> Int`): The target function to benchmark. `iter_custom[: origin.set, //, kernel_launch_fn: fn(DeviceContext) raises capturing -> None](mut self, ctx: DeviceContext)` Times a target GPU function with a custom number of iterations via DeviceContext ctx. **Parameters:** * kernel\_launch\_fn (`fn(DeviceContext) raises capturing -> None`): The target GPU kernel launch function to benchmark. **Args:** * ctx (`DeviceContext`): The GPU DeviceContext for launching the kernel. `iter_custom[: origin.set, //, kernel_launch_fn: fn(DeviceContext, Int) raises capturing -> None](mut self, ctx: DeviceContext)` Times a target GPU function with a custom number of iterations via DeviceContext ctx. **Parameters:** * kernel\_launch\_fn (`fn(DeviceContext, Int) raises capturing -> None`): The target GPU kernel launch function to benchmark. **Args:** * ctx (`DeviceContext`): The GPU DeviceContext for launching the kernel. `iter_custom[iter_fn: fn(Int) raises capturing -> Int](mut self)` Times a target function with a custom number of iterations. **Parameters:** * iter\_fn (`fn(Int) raises capturing -> Int`): The target function to benchmark. ### `iter_custom_multicontext` `iter_custom_multicontext[: origin.set, //, kernel_launch_fn: fn() raises capturing -> None](mut self, ctxs: List[DeviceContext])` Times a target GPU function with a custom number of iterations via the given DeviceContexts. **Parameters:** * kernel\_launch\_fn (`fn() raises capturing -> None`): The target GPU kernel launch function to benchmark. **Args:** * ctxs (`List[DeviceContext]`): The list of GPU DeviceContexts for launching the kernel. --- ## BenchmarkInfo `struct BenchmarkInfo` Defines a Benchmark Info struct to record execution statistics. ## Fields * name (`String`): The name of the benchmark. * result (`Report`): The output report after executing a benchmark. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. * verbose\_timing (`Bool`): Whether to print verbose timing results. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, name: String, result: Report, measures: List[ThroughputMeasure] = List(), verbose_timing: Bool = False)` Constructs a `BenchmarkInfo` object to return benchmark report and statistics. **Args:** * name (`String`): The name of the benchmark. * result (`Report`): The output report after executing a benchmark. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. * verbose\_timing (`Bool`): Whether to print verbose timing results. `__init__(out self, *, other: Self)` Explicitly construct a deep copy of the provided value. **Args:** * other (`Self`): The value to copy. --- ## Format `struct Format` Defines a format for the benchmark output when printing or writing to a file. ## Fields * value (`StringSlice[StaticConstantOrigin]`): The format to print results. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `csv` `alias csv = __init__[__mlir_type.!kgen.string]("csv")` Comma separated values with no alignment. ### `table` `alias table = __init__[__mlir_type.!kgen.string]("table")` Table format with dynamically aligned columns.
### `tabular` `alias tabular = __init__[__mlir_type.!kgen.string]("tabular")` Comma separated values with dynamically aligned columns. ## Methods ### `__init__` `@implicit` `__init__(out self, value: StringSlice[origin])` Constructs a Format object from a string. **Args:** * value (`StringSlice[origin]`): The format to print results. ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two Format objects are equal. **Args:** * other (`Self`): The `Format` to compare with. **Returns:** True if the two `Format` objects are equal, false otherwise. ### `__str__` `__str__(self) -> String` Returns the string representation of the format. **Returns:** The string representation of the format. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the format to a writer. **Parameters:** * W (`Writer`): A type conforming to the Writer trait. **Args:** * writer (`W`): The writer to write the `Format` to. --- ## Mode `struct Mode` Defines a Benchmark Mode to distinguish between test runs and actual benchmarks. ## Fields * value (`Int`): Represents the mode type. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `Benchmark` `alias Benchmark = Mode(0)` ### `Test` `alias Test = Mode(1)` ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks whether two modes are equal. **Args:** * other (`Self`): The mode to be compared against. **Returns:** True if the two modes are equal. --- ## ThroughputMeasure `struct ThroughputMeasure` Records a throughput metric as a `BenchMetric` and its measured value. ## Fields * metric (`BenchMetric`): Type of throughput metric. * value (`Int`): Measured count of throughput metric. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, name: String, value: Int, reference: List[BenchMetric] = List(BenchMetric(0, __init__[__mlir_type.!kgen.string]("throughput"), __init__[__mlir_type.!kgen.string]("GElems/s")), BenchMetric(1, __init__[__mlir_type.!kgen.string]("DataMovement"), __init__[__mlir_type.!kgen.string]("GB/s")), BenchMetric(2, __init__[__mlir_type.!kgen.string]("Arithmetic"), __init__[__mlir_type.!kgen.string]("GFLOPS/s")), Tuple()))` Creates a `ThroughputMeasure` based on a metric's name. Example: for the default bench metrics `BenchMetric.DEFAULTS`, the following are equivalent: `ThroughputMeasure(BenchMetric.fmas, 1024)`, `ThroughputMeasure("fmas", 1024)`, and `ThroughputMeasure("fmas", 1024, BenchMetric.DEFAULTS)`. **Args:** * name (`String`): The name of BenchMetric in its corresponding reference. * value (`Int`): The measured value to assign to this metric. * reference (`List[BenchMetric]`): List of BenchMetrics that contains this metric. `__init__(out self, *, other: Self)` Explicitly construct a deep copy of the provided value. **Args:** * other (`Self`): The value to copy. ### `__str__` `__str__(self) -> String` Gets a string representation of this `ThroughputMeasure`. **Returns:** The string representation. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this ThroughputMeasure to the provided Writer. **Parameters:** * W (`Writer`): A type conforming to the Writer trait. **Args:** * writer (`W`): The object to write to. ### `compute` `compute(self, elapsed_sec: SIMD[float64, 1]) -> SIMD[float64, 1]` Computes the throughput rate for this metric per unit of time (second). **Args:** * elapsed\_sec (`SIMD[float64, 1]`): Elapsed time measured in seconds.
**Returns:** The throughput value as a float64. --- ## bencher ## Structs * [`Bench`](/mojo/stdlib/benchmark/bencher/Bench): Constructs a Benchmark object, used for running multiple benchmarks and comparing the results. * [`BenchConfig`](/mojo/stdlib/benchmark/bencher/BenchConfig): Defines a benchmark configuration struct to control execution times and frequency. * [`Bencher`](/mojo/stdlib/benchmark/bencher/Bencher): Defines a Bencher struct which facilitates the timing of a target function. * [`BenchId`](/mojo/stdlib/benchmark/bencher/BenchId): Defines a benchmark Id struct to identify and represent a particular benchmark execution. * [`BenchmarkInfo`](/mojo/stdlib/benchmark/bencher/BenchmarkInfo): Defines a Benchmark Info struct to record execution statistics. * [`BenchMetric`](/mojo/stdlib/benchmark/bencher/BenchMetric): Defines a benchmark throughput metric. * [`Format`](/mojo/stdlib/benchmark/bencher/Format): Defines a format for the benchmark output when printing or writing to a file. * [`Mode`](/mojo/stdlib/benchmark/bencher/Mode): Defines a Benchmark Mode to distinguish between test runs and actual benchmarks. * [`ThroughputMeasure`](/mojo/stdlib/benchmark/bencher/ThroughputMeasure): Records a throughput metric as a BenchMetric and value. --- ## Batch `@register_passable(trivial)` `struct Batch` A batch of benchmarks. The `benchmark.run()` function works out how many iterations to run in each batch based on how long the previous iterations took. ## Fields * duration (`Int`): Total duration of batch stored as nanoseconds. * iterations (`Int`): Total iterations in the batch. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(*, other: Self) -> Self` Explicitly construct a deep copy of the provided value. **Args:** * other (`Self`): The value to copy. ### `mean` `mean(self, unit: String = __init__[__mlir_type.!kgen.string]("s")) -> SIMD[float64, 1]` Returns the average duration of the batch. **Args:** * unit (`String`): The time unit to display, for example: ns, ms, s (default `s`). **Returns:** The average duration of the batch. --- ## Report `struct Report` Contains the average execution time, iterations, min and max of each batch. ## Fields * warmup\_duration (`Int`): The total duration it took to warmup. * runs (`List[Batch]`): A `List` of benchmark runs. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Default initializer for the Report. Sets all values to 0. `__init__(out self, *, other: Self)` Explicitly construct a deep copy of the provided value. **Args:** * other (`Self`): The value to copy. ### `__copyinit__` `__copyinit__(out self, existing: Self)` Creates a shallow copy (it doesn't copy the data). **Args:** * existing (`Self`): The `Report` to copy. ### `iters` `iters(self) -> Int` The total benchmark iterations. **Returns:** The total benchmark iterations. ### `duration` `duration(self, unit: String = __init__[__mlir_type.!kgen.string]("s")) -> SIMD[float64, 1]` The total duration it took to run all benchmarks. **Args:** * unit (`String`): The time unit to display, for example: ns, ms, s (default `s`). **Returns:** The total duration it took to run all benchmarks. ### `mean` `mean(self, unit: String = __init__[__mlir_type.!kgen.string]("s")) -> SIMD[float64, 1]` The average duration of all benchmark runs.
**Args:** * unit (`String`): The time unit to display, for example: ns, ms, s (default `s`). **Returns:** The average duration of all benchmark runs. ### `min` `min(self, unit: String = __init__[__mlir_type.!kgen.string]("s")) -> SIMD[float64, 1]` The batch of benchmarks that was the fastest to run. **Args:** * unit (`String`): The time unit to display, for example: ns, ms, s (default `s`). **Returns:** The fastest duration out of all batches. ### `max` `max(self, unit: String = __init__[__mlir_type.!kgen.string]("s")) -> SIMD[float64, 1]` The batch of benchmarks that was the slowest to run. **Args:** * unit (`String`): The time unit to display, for example: ns, ms, s (default `s`). **Returns:** The slowest duration out of all batches. ### `print` `print(self, unit: String = __init__[__mlir_type.!kgen.string]("s"))` Prints out the shortened version of the report. **Args:** * unit (`String`): The time unit to display, for example: ns, ms, s (default `s`). ### `print_full` `print_full(self, unit: String = __init__[__mlir_type.!kgen.string]("s"))` Prints out the full version of the report with each batch of benchmark runs. **Args:** * unit (`String`): The time unit to display, for example: ns, ms, s (default `s`). --- ## Unit `struct Unit` Time Unit used by Benchmark Report. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `ms` `alias ms = "ms"` Milliseconds ### `ns` `alias ns = "ns"` Nanoseconds ### `s` `alias s = "s"` Seconds --- ## benchmark Implements the benchmark module for runtime benchmarking. You can import these APIs from the `benchmark` package. For example:

```mojo
import benchmark
from time import sleep
```

You can pass any `fn` as a parameter into `benchmark.run[...]()`; it will return a `Report` where you can get the mean, duration, max, and more:

```mojo
fn sleeper():
    sleep(.01)

var report = benchmark.run[sleeper]()
print(report.mean())
```

```output
0.012256487394957985
```

You can print a full report:

```mojo
report.print()
```

```output
---------------------
Benchmark Report (s)
---------------------
Mean: 0.012265747899159664
Total: 1.459624
Iters: 119
Warmup Total: 0.025020000000000001
Fastest Mean: 0.0121578
Slowest Mean: 0.012321428571428572
```

Or all the batch runs:

```mojo
report.print_full()
```

```output
---------------------
Benchmark Report (s)
---------------------
Mean: 0.012368649122807017
Total: 1.410026
Iters: 114
Warmup Total: 0.023341000000000001
Fastest Mean: 0.012295586956521738
Slowest Mean: 0.012508099999999999

Batch: 1
Iterations: 20
Mean: 0.012508099999999999
Duration: 0.250162

Batch: 2
Iterations: 46
Mean: 0.012295586956521738
Duration: 0.56559700000000002

Batch: 3
Iterations: 48
Mean: 0.012380562499999999
Duration: 0.59426699999999999
```

If you want to use a different time unit you can bring in the Unit and pass it in as an argument:

```mojo
from benchmark import Unit
report.print(Unit.ms)
```

```output
---------------------
Benchmark Report (ms)
---------------------
Mean: 0.012312411764705882
Total: 1.465177
Iters: 119
Warmup Total: 0.025010999999999999
Fastest Mean: 0.012015649999999999
Slowest Mean: 0.012421204081632654
```

The units are just aliases for string constants, so you can, for example:

```mojo
print(report.mean("ms"))
```

```output
12.199145299145298
```

`benchmark.run` takes four arguments to change the behavior. For example, to cap the number of iterations at 5:

```mojo
r = benchmark.run[sleeper](5)
```

```output
0.012004808080808081
```

To set a max of 1 iteration, a min total time of 2 s, a max total
time of 3 s, and a max batch size of 4:

```mojo
r = benchmark.run[sleeper](1, 2, 3, 4)
```

Note that the min total time will take precedence over max iterations. ## Structs * [`Batch`](/mojo/stdlib/benchmark/benchmark/Batch): A batch of benchmarks. The benchmark.run() function works out how many iterations to run in each batch based on how long the previous iterations took. * [`Report`](/mojo/stdlib/benchmark/benchmark/Report): Contains the average execution time, iterations, min and max of each batch. * [`Unit`](/mojo/stdlib/benchmark/benchmark/Unit): Time Unit used by Benchmark Report. ## Functions * [`run`](/mojo/stdlib/benchmark/benchmark/run): Benchmarks the function passed in as a parameter. --- ## run `run[func: fn() raises -> None](max_iters: Int = 1000000000, min_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](2), max_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](60), max_batch_size: Int = 0) -> Report` Benchmarks the function passed in as a parameter. Benchmarking continues until `min_runtime_secs` has elapsed and either `max_runtime_secs` OR `max_iters` is achieved. **Parameters:** * func (`fn() raises -> None`): The function to benchmark. **Args:** * max\_iters (`Int`): Max number of iterations to run (default `1_000_000_000`). * min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs (default `2`). * max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs (default `60`). * max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement. **Returns:** A `Report` with the benchmark statistics, including the average execution time of func. `run[func: fn() -> None](max_iters: Int = 1000000000, min_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](2), max_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](60), max_batch_size: Int = 0) -> Report` Benchmarks the function passed in as a parameter. Benchmarking continues until `min_runtime_secs` has elapsed and either `max_runtime_secs` OR `max_iters` is achieved. **Parameters:** * func (`fn() -> None`): The function to benchmark. **Args:** * max\_iters (`Int`): Max number of iterations to run (default `1_000_000_000`). * min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs (default `2`). * max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs (default `60`). * max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement. **Returns:** A `Report` with the benchmark statistics, including the average execution time of func. `run[: origin.set, //, func: fn() raises capturing -> None](max_iters: Int = 1000000000, min_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](2), max_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](60), max_batch_size: Int = 0) -> Report` Benchmarks the function passed in as a parameter. Benchmarking continues until `min_runtime_secs` has elapsed and either `max_runtime_secs` OR `max_iters` is achieved. **Parameters:** * func (`fn() raises capturing -> None`): The function to benchmark. **Args:** * max\_iters (`Int`): Max number of iterations to run (default `1_000_000_000`). * min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs (default `2`). * max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs (default `60`). * max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement. **Returns:** A `Report` with the benchmark statistics, including the average execution time of func.
`run[: origin.set, //, func: fn() capturing -> None](max_iters: Int = 1000000000, min_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](2), max_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](60), max_batch_size: Int = 0) -> Report` Benchmarks the function passed in as a parameter. Benchmarking continues until `min_runtime_secs` has elapsed and either `max_runtime_secs` OR `max_iters` is achieved. **Parameters:** * func (`fn() capturing -> None`): The function to benchmark. **Args:** * max\_iters (`Int`): Max number of iterations to run (default `1_000_000_000`). * min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs (default `2`). * max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs (default `60`). * max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement. **Returns:** A `Report` with the benchmark statistics, including the average execution time of func. --- ## compiler ## Functions * [`keep`](/mojo/stdlib/benchmark/compiler/keep): Provides a hint to the compiler to not optimize the variable use away. --- ## keep `keep(val: Bool)` Provides a hint to the compiler to not optimize the variable use away. This is useful in benchmarking to prevent the compiler from deleting the code to be benchmarked when the variable is not otherwise used in a side-effecting manner. **Args:** * val (`Bool`): The value to not optimize away. `keep(val: Int)` Provides a hint to the compiler to not optimize the variable use away. This is useful in benchmarking to prevent the compiler from deleting the code to be benchmarked when the variable is not otherwise used in a side-effecting manner. **Args:** * val (`Int`): The value to not optimize away. `keep[type: DType, simd_width: Int](val: SIMD[type, simd_width])` Provides a hint to the compiler to not optimize the variable use away. This is useful in benchmarking to prevent the compiler from deleting the code to be benchmarked when the variable is not otherwise used in a side-effecting manner. **Parameters:** * type (`DType`): The `dtype` of the input and output SIMD vector. * simd\_width (`Int`): The width of the input and output SIMD vector. **Args:** * val (`SIMD[type, simd_width]`): The value to not optimize away. `keep[type: AnyType](val: UnsafePointer[type])` Provides a hint to the compiler to not optimize the variable use away. This is useful in benchmarking to prevent the compiler from deleting the code to be benchmarked when the variable is not otherwise used in a side-effecting manner. **Parameters:** * type (`AnyType`): The type of the input. **Args:** * val (`UnsafePointer[type]`): The value to not optimize away. `keep[type: AnyTrivialRegType](mut val: type)` Provides a hint to the compiler to not optimize the variable use away. This is useful in benchmarking to prevent the compiler from deleting the code to be benchmarked when the variable is not otherwise used in a side-effecting manner. **Parameters:** * type (`AnyTrivialRegType`): The type of the input. **Args:** * val (`type`): The value to not optimize away.
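To see why the hint matters, consider timing a loop whose result is otherwise unused. This is a minimal sketch (the function name `work` is illustrative) using only the `keep` and `benchmark.run` APIs documented here:

```mojo
import benchmark
from benchmark import keep

fn work():
    var acc = 0
    for i in range(1000):
        acc += i
    # Without this hint, the compiler could see that `acc` is unused
    # and delete the loop entirely, making the timing meaningless.
    keep(acc)

fn main():
    var report = benchmark.run[work]()
    print(report.mean("ns"))
```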
--- ## benchmark Implements the benchmark package for runtime benchmarking. You can import these APIs from the `benchmark` package. For example:

```mojo
import benchmark
from time import sleep
```

You can pass any `fn` as a parameter into `benchmark.run[...]()`; it will return a `Report` where you can get the mean, duration, max, and more:

```mojo
fn sleeper():
    sleep(.01)

var report = benchmark.run[sleeper]()
print(report.mean())
```

```output
0.012256487394957985
```

You can print a full report:

```mojo
report.print()
```

```output
---------------------
Benchmark Report (s)
---------------------
Mean: 0.012265747899159664
Total: 1.459624
Iters: 119
Warmup Mean: 0.01251
Warmup Total: 0.025020000000000001
Warmup Iters: 2
Fastest Mean: 0.0121578
Slowest Mean: 0.012321428571428572
```

Or all the batch runs:

```mojo
report.print_full()
```

```output
---------------------
Benchmark Report (s)
---------------------
Mean: 0.012368649122807017
Total: 1.410026
Iters: 114
Warmup Mean: 0.0116705
Warmup Total: 0.023341000000000001
Warmup Iters: 2
Fastest Mean: 0.012295586956521738
Slowest Mean: 0.012508099999999999

Batch: 1
Iterations: 20
Mean: 0.012508099999999999
Duration: 0.250162

Batch: 2
Iterations: 46
Mean: 0.012295586956521738
Duration: 0.56559700000000002

Batch: 3
Iterations: 48
Mean: 0.012380562499999999
Duration: 0.59426699999999999
```

If you want to use a different time unit you can bring in the Unit and pass it in as an argument:

```mojo
from benchmark import Unit
report.print(Unit.ms)
```

```output
---------------------
Benchmark Report (ms)
---------------------
Mean: 0.012312411764705882
Total: 1.465177
Iters: 119
Warmup Mean: 0.012505499999999999
Warmup Total: 0.025010999999999999
Warmup Iters: 2
Fastest Mean: 0.012015649999999999
Slowest Mean: 0.012421204081632654
```

The units are just aliases for string constants, so you can, for example:

```mojo
print(report.mean("ms"))
```

```output
12.199145299145298
```

`benchmark.run` takes four arguments to change the behavior. For example, to cap the number of iterations at 5:

```mojo
r = benchmark.run[sleeper](5)
```

```output
0.012004808080808081
```

To set a max of 1 iteration, a min total time of 2 s, a max total time of 3 s, and a max batch size of 4:

```mojo
r = benchmark.run[sleeper](1, 2, 3, 4)
```

Note that the min total time will take precedence over max iterations. ## Modules * [`bencher`](/mojo/stdlib/benchmark/bencher/): * [`benchmark`](/mojo/stdlib/benchmark/benchmark/): Implements the benchmark module for runtime benchmarking. * [`compiler`](/mojo/stdlib/benchmark/compiler/): * [`memory`](/mojo/stdlib/benchmark/memory/): * [`quick_bench`](/mojo/stdlib/benchmark/quick_bench/): --- ## clobber_memory `clobber_memory()` Forces all pending memory writes to be flushed to memory. This ensures that the compiler does not optimize away memory writes if it deems them to be not necessary. In effect, this operation acts as a barrier to memory reads and writes. --- ## memory ## Functions * [`clobber_memory`](/mojo/stdlib/benchmark/memory/clobber_memory): Forces all pending memory writes to be flushed to memory. --- ## QuickBench `struct QuickBench` Defines a struct to facilitate benchmarking and avoiding `Bencher` boilerplate.
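Example, a minimal sketch (the benchmarked function `add` and the `BenchId` label are illustrative; the argument and return types of `run` are inferred from `func`):

```mojo
from benchmark import BenchId, QuickBench

fn add(x: Int, y: Int) -> Int:
    return x + y

fn main():
    var qb = QuickBench()
    # Benchmarks `add(3, 4)` without any Bencher boilerplate.
    qb.run(add, 3, 4, bench_id=BenchId("add"))
    qb.dump_report()
```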
## Fields * m (`Bench`): Bench object to collect the results. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Just initializes the Bench object. ### `dump_report` `dump_report(mut self)` Prints out the report from a Benchmark execution collected in the Bench object. ### `run` `run[T_out: AnyTrivialRegType](mut self, func: fn() -> T_out, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with no input arguments and return type `T_out`. **Parameters:** * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn() -> T_out`): The function to be benchmarked (run in benchmark iterations). * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `run[T0: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0) -> T_out, x0: T0, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 1 input argument and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1) -> T_out, x0: T0, x1: T1, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 2 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2) -> T_out, x0: T0, x1: T1, x2: T2, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 3 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1, T2) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * x2 (`T2`): The 3rd argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 4 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func.
* T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * T3 (`AnyTrivialRegType`): Type of the 4th argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1, T2, T3) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * x2 (`T2`): The 3rd argument of func. * x3 (`T3`): The 4th argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 5 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * T3 (`AnyTrivialRegType`): Type of the 4th argument of func. * T4 (`AnyTrivialRegType`): Type of the 5th argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1, T2, T3, T4) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * x2 (`T2`): The 3rd argument of func. * x3 (`T3`): The 4th argument of func. * x4 (`T4`): The 5th argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, T5: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4, T5) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, x5: T5, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 6 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * T3 (`AnyTrivialRegType`): Type of the 4th argument of func. * T4 (`AnyTrivialRegType`): Type of the 5th argument of func. * T5 (`AnyTrivialRegType`): Type of the 6th argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1, T2, T3, T4, T5) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * x2 (`T2`): The 3rd argument of func. * x3 (`T3`): The 4th argument of func. * x4 (`T4`): The 5th argument of func. * x5 (`T5`): The 6th argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's.
`run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, T5: AnyTrivialRegType, T6: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4, T5, T6) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, x5: T5, x6: T6, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 7 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * T3 (`AnyTrivialRegType`): Type of the 4th argument of func. * T4 (`AnyTrivialRegType`): Type of the 5th argument of func. * T5 (`AnyTrivialRegType`): Type of the 6th argument of func. * T6 (`AnyTrivialRegType`): Type of the 7th argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1, T2, T3, T4, T5, T6) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * x2 (`T2`): The 3rd argument of func. * x3 (`T3`): The 4th argument of func. * x4 (`T4`): The 5th argument of func. * x5 (`T5`): The 6th argument of func. * x6 (`T6`): The 7th argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional list of `ThroughputMeasure` values. `run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, T5: AnyTrivialRegType, T6: AnyTrivialRegType, T7: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4, T5, T6, T7) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, x5: T5, x6: T6, x7: T7, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 8 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * T3 (`AnyTrivialRegType`): Type of the 4th argument of func. * T4 (`AnyTrivialRegType`): Type of the 5th argument of func. * T5 (`AnyTrivialRegType`): Type of the 6th argument of func. * T6 (`AnyTrivialRegType`): Type of the 7th argument of func. * T7 (`AnyTrivialRegType`): Type of the 8th argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1, T2, T3, T4, T5, T6, T7) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * x2 (`T2`): The 3rd argument of func. * x3 (`T3`): The 4th argument of func. * x4 (`T4`): The 5th argument of func. * x5 (`T5`): The 6th argument of func. * x6 (`T6`): The 7th argument of func. * x7 (`T7`): The 8th argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional list of `ThroughputMeasure` values.
`run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, T5: AnyTrivialRegType, T6: AnyTrivialRegType, T7: AnyTrivialRegType, T8: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4, T5, T6, T7, T8) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, x5: T5, x6: T6, x7: T7, x8: T8, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 9 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * T3 (`AnyTrivialRegType`): Type of the 4th argument of func. * T4 (`AnyTrivialRegType`): Type of the 5th argument of func. * T5 (`AnyTrivialRegType`): Type of the 6th argument of func. * T6 (`AnyTrivialRegType`): Type of the 7th argument of func. * T7 (`AnyTrivialRegType`): Type of the 8th argument of func. * T8 (`AnyTrivialRegType`): Type of the 9th argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1, T2, T3, T4, T5, T6, T7, T8) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * x2 (`T2`): The 3rd argument of func. * x3 (`T3`): The 4th argument of func. * x4 (`T4`): The 5th argument of func. * x5 (`T5`): The 6th argument of func. * x6 (`T6`): The 7th argument of func. * x7 (`T7`): The 8th argument of func. * x8 (`T8`): The 9th argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional list of `ThroughputMeasure` values. `run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, T5: AnyTrivialRegType, T6: AnyTrivialRegType, T7: AnyTrivialRegType, T8: AnyTrivialRegType, T9: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4, T5, T6, T7, T8, T9) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, x5: T5, x6: T6, x7: T7, x8: T8, x9: T9, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 10 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * T3 (`AnyTrivialRegType`): Type of the 4th argument of func. * T4 (`AnyTrivialRegType`): Type of the 5th argument of func. * T5 (`AnyTrivialRegType`): Type of the 6th argument of func. * T6 (`AnyTrivialRegType`): Type of the 7th argument of func. * T7 (`AnyTrivialRegType`): Type of the 8th argument of func. * T8 (`AnyTrivialRegType`): Type of the 9th argument of func. * T9 (`AnyTrivialRegType`): Type of the 10th argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1, T2, T3, T4, T5, T6, T7, T8, T9) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * x2 (`T2`): The 3rd argument of func. * x3 (`T3`): The 4th argument of func. * x4 (`T4`): The 5th argument of func. * x5 (`T5`): The 6th argument of func. * x6 (`T6`): The 7th argument of func. * x7 (`T7`): The 8th argument of func.
* x8 (`T8`): The 9th argument of func. * x9 (`T9`): The 10th argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional list of `ThroughputMeasure` values. --- ## quick_bench ## Structs * [`QuickBench`](/mojo/stdlib/benchmark/quick_bench/QuickBench): Defines a struct to facilitate benchmarking while avoiding `Bencher` boilerplate. --- ## bit_not `bit_not[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs a bitwise NOT operation on a SIMD vector of integer values. **Constraints:** The element type of the input vector must be integral. **Parameters:** * dtype (`DType`): `dtype` used for the computation. * width (`Int`): SIMD width used for the computation. **Args:** * val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` is computed as a bitwise NOT of the integer value at position `i` of the input value. --- ## bit_reverse `bit_reverse(val: Int) -> Int` Reverses the bitpattern of an integer value. **Args:** * val (`Int`): The input value. **Returns:** The input value with its bitpattern reversed. `bit_reverse[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Element-wise reverses the bitpattern of a SIMD vector of integer values. **Constraints:** The element type of the input vector must be integral. **Parameters:** * dtype (`DType`): `dtype` used for the computation. * width (`Int`): SIMD width used for the computation. **Args:** * val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` is the reversed bitpattern of the element at position `i` of the input value. --- ## bit_width `bit_width(val: Int) -> Int` Computes the minimum number of bits required to represent the integer. **Args:** * val (`Int`): The input value. **Returns:** The number of bits required to represent the integer. `bit_width[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the minimum number of bits required to represent each element of a SIMD vector of integer values. **Constraints:** The element type of the input vector must be integral. **Parameters:** * dtype (`DType`): `dtype` used for the computation. * width (`Int`): SIMD width used for the computation. **Args:** * val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` equals the number of bits required to represent the integer at position `i` of the input. --- ## byte_swap `byte_swap(val: Int) -> Int` Byte-swaps an integer value with an even number of bytes. Byte swap an integer value (8 bytes) with an even number of bytes (positive multiple of 16 bits). This returns an integer value (8 bytes) that has its bytes swapped. For example, if the input bytes are numbered 0, 1, 2, 3, 4, 5, 6, 7 then the returned integer will have its bytes in 7, 6, 5, 4, 3, 2, 1, 0 order. **Args:** * val (`Int`): The input value. **Returns:** The input value with its bytes swapped. `byte_swap[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Byte-swaps a SIMD vector of integer values with an even number of bytes. Byte swap an integer value or vector of integer values with an even number of bytes (positive multiple of 16 bits). For example, byte-swapping an `Int16` returns an `Int16` value that has the high and low byte of the input swapped.
Similarly, byte-swapping an `Int32` returns an `Int32` value that has the four bytes of the input swapped, so that if the input bytes are numbered 0, 1, 2, 3 then the returned `Int32` will have its bytes in 3, 2, 1, 0 order. `Int64` and other integer types extend this concept to additional even-byte lengths (6 bytes, 8 bytes, and so on). **Constraints:** The element type of the input vector must be an integral type. **Parameters:** * dtype (`DType`): `dtype` used for the computation. * width (`Int`): SIMD width used for the computation. **Args:** * val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` is the value of the element at position `i` of the input value with its bytes swapped. --- ## count_leading_zeros `count_leading_zeros(val: Int) -> Int` Counts the number of leading zeros of an integer. **Args:** * val (`Int`): The input value. **Returns:** The number of leading zeros of the input. `count_leading_zeros[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Counts the per-element number of leading zeros in a SIMD vector. **Constraints:** The element type of the input vector must be integral. **Parameters:** * dtype (`DType`): `DType` used for the computation. * width (`Int`): SIMD width used for the computation. **Args:** * val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` contains the number of leading zeros at position `i` of the input value. --- ## count_trailing_zeros `count_trailing_zeros(val: Int) -> Int` Counts the number of trailing zeros for an integer. **Args:** * val (`Int`): The input value. **Returns:** The number of trailing zeros of the input. `count_trailing_zeros[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Counts the per-element number of trailing zeros in a SIMD vector. **Constraints:** The element type of the input vector must be integral. **Parameters:** * dtype (`DType`): `dtype` used for the computation. * width (`Int`): SIMD width used for the computation. **Args:** * val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` contains the number of trailing zeros at position `i` of the input value. --- ## bit Provides functions for bit manipulation. You can import these APIs from the `bit` package. For example:

```mojo
from bit import count_leading_zeros
```

## Functions * [`bit_not`](/mojo/stdlib/bit/bit/bit_not): Performs a bitwise NOT operation on a SIMD vector of integer values. * [`bit_reverse`](/mojo/stdlib/bit/bit/bit_reverse): Reverses the bitpattern of an integer value. * [`bit_width`](/mojo/stdlib/bit/bit/bit_width): Computes the minimum number of bits required to represent the integer. * [`byte_swap`](/mojo/stdlib/bit/bit/byte_swap): Byte-swaps an integer value with an even number of bytes. * [`count_leading_zeros`](/mojo/stdlib/bit/bit/count_leading_zeros): Counts the number of leading zeros of an integer. * [`count_trailing_zeros`](/mojo/stdlib/bit/bit/count_trailing_zeros): Counts the number of trailing zeros for an integer. * [`log2_floor`](/mojo/stdlib/bit/bit/log2_floor): Returns the floor of the base-2 logarithm of an integer value. * [`next_power_of_two`](/mojo/stdlib/bit/bit/next_power_of_two): Computes the smallest power of 2 that is greater than or equal to the input value. Any integral value less than or equal to 1 will be ceiled to 1.
* [​`pop_count`](/mojo/stdlib/bit/bit/pop_count): Counts the number of bits set in an integer value. * [​`prev_power_of_two`](/mojo/stdlib/bit/bit/prev_power_of_two): Computes the largest power of 2 that is less than or equal to the input value. Any integral value less than or equal to 0 will be floored to 0. * [​`rotate_bits_left`](/mojo/stdlib/bit/bit/rotate_bits_left): Shifts the bits of an input to the left by `shift` bits (with wrap-around). * [​`rotate_bits_right`](/mojo/stdlib/bit/bit/rotate_bits_right): Shifts the bits of an input to the right by `shift` bits (with wrap-around). --- ## log2_floor `log2_floor(val: Int) -> Int` Returns the floor of the base-2 logarithm of an integer value. **Args:** * ​val (`Int`): The input value. **Returns:** The floor of the base-2 logarithm of the input value, which is equal to the position of the highest set bit. Returns -1 if val is 0. --- ## next_power_of_two `next_power_of_two(val: Int) -> Int` Computes the smallest power of 2 that is greater than or equal to the input value. Any integral value less than or equal to 1 will be ceiled to 1. Notes: This operation is called `bit_ceil()` in C++. **Args:** * ​val (`Int`): The input value. **Returns:** The smallest power of 2 that is greater than or equal to the input value. `next_power_of_two(val: UInt) -> UInt` Computes the smallest power of 2 that is greater than or equal to the input value. Any integral value less than or equal to 1 will be ceiled to 1. Notes: This operation is called `bit_ceil()` in C++. **Args:** * ​val (`UInt`): The input value. **Returns:** The smallest power of 2 that is greater than or equal to the input value. `next_power_of_two[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the smallest power of 2 that is greater than or equal to the input value for each element of a SIMD vector. Any integral value less than or equal to 1 will be ceiled to 1. This operation is called `bit_ceil()` in C++. **Constraints:** The element type of the input vector must be integral. **Parameters:** * ​dtype (`DType`): `dtype` used for the computation. * ​width (`Int`): SIMD width used for the computation. **Args:** * ​val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` is the smallest power of 2 that is greater than or equal to the integer at position `i` of the input value. --- ## pop_count `pop_count(val: Int) -> Int` Counts the number of bits set in an integer value. **Args:** * ​val (`Int`): The input value. **Returns:** The number of bits set in the input value. `pop_count[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Counts the number of bits set in a SIMD vector of integer values. **Constraints:** The element type of the input vector must be integral. **Parameters:** * ​dtype (`DType`): `dtype` used for the computation. * ​width (`Int`): SIMD width used for the computation. **Args:** * ​val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` contains the number of bits set in the element at position `i` of the input value. --- ## prev_power_of_two `prev_power_of_two(val: Int) -> Int` Computes the largest power of 2 that is less than or equal to the input value. Any integral value less than or equal to 0 will be floored to 0. This operation is called `bit_floor()` in C++. **Args:** * ​val (`Int`): The input value. **Returns:** The largest power of 2 that is less than or equal to the input value. 
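For example, the scalar overloads above behave as follows (a minimal sketch):

```mojo
from bit import next_power_of_two, prev_power_of_two

def main():
    print(prev_power_of_two(20))  # 16: the largest power of 2 <= 20
    print(prev_power_of_two(0))   # 0: values <= 0 are floored to 0
    print(next_power_of_two(20))  # 32: the smallest power of 2 >= 20
```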
`prev_power_of_two[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the largest power of 2 that is less than or equal to the input value for each element of a SIMD vector. Any integral value less than or equal to 0 will be floored to 0. This operation is called `bit_floor()` in C++. **Constraints:** The element type of the input vector must be integral. **Parameters:** * dtype (`DType`): `dtype` used for the computation. * width (`Int`): SIMD width used for the computation. **Args:** * val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` is the largest power of 2 that is less than or equal to the integer at position `i` of the input value. --- ## rotate_bits_left `rotate_bits_left[shift: Int](x: Int) -> Int` Shifts the bits of an input to the left by `shift` bits (with wrap-around). **Constraints:** `-size <= shift < size`. **Parameters:** * shift (`Int`): The number of bit positions by which to rotate the bits of the integer to the left (with wrap-around). **Args:** * x (`Int`): The input value. **Returns:** The input rotated to the left by `shift` bits (with wrap-around). `rotate_bits_left[dtype: DType, width: Int, //, shift: Int](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Shifts bits to the left by `shift` positions (with wrap-around) for each element of a SIMD vector. **Constraints:** `0 <= shift < size`. **Parameters:** * dtype (`DType`): The `dtype` of the input and output SIMD vector. Must be integral and unsigned. * width (`Int`): The width of the SIMD vector. * shift (`Int`): The number of positions to rotate left. **Args:** * x (`SIMD[dtype, width]`): SIMD vector input. **Returns:** SIMD vector with each element rotated left by `shift` bits. --- ## rotate_bits_right `rotate_bits_right[shift: Int](x: Int) -> Int` Shifts the bits of an input to the right by `shift` bits (with wrap-around). **Constraints:** `-size <= shift < size`. **Parameters:** * shift (`Int`): The number of bit positions by which to rotate the bits of the integer to the right (with wrap-around). **Args:** * x (`Int`): The input value. **Returns:** The input rotated to the right by `shift` bits (with wrap-around). `rotate_bits_right[dtype: DType, width: Int, //, shift: Int](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Shifts bits to the right by `shift` positions (with wrap-around) for each element of a SIMD vector. **Constraints:** `0 <= shift < size`. **Parameters:** * dtype (`DType`): The `dtype` of the input and output SIMD vector. Must be integral and unsigned. * width (`Int`): The width of the SIMD vector. * shift (`Int`): The number of positions to rotate right. **Args:** * x (`SIMD[dtype, width]`): SIMD vector input. **Returns:** SIMD vector with each element rotated right by `shift` bits. --- ## bit Implements the bit package. ## Modules * [`bit`](/mojo/stdlib/bit/bit/): Provides functions for bit manipulation. --- ## NDBuffer `@register_passable(trivial)` `struct NDBuffer[mut: Bool, //, type: DType, rank: Int, origin: Origin[mut], shape: DimList = create_unknown[::Int](), strides: DimList = create_unknown[::Int](), *, alignment: Int = 1, address_space: AddressSpace = AddressSpace(0), exclusive: Bool = True]` An N-dimensional buffer. NDBuffer can be parametrized on rank, static dimensions and DType. It does not own its underlying pointer. ## Parameters * mut (`Bool`): The inferred mutability. * type (`DType`): The element type of the buffer. * rank (`Int`): The rank of the buffer. * origin (`Origin[mut]`): The origin of the memory being addressed. * shape (`DimList`): The static size (if known) of the buffer.
* strides (`DimList`): The strides (if known) of the buffer. * alignment (`Int`): The preferred address alignment of the buffer. * address\_space (`AddressSpace`): The address space of the buffer. * exclusive (`Bool`): The underlying memory allocation of the tensor is known only to be accessible through this pointer. ## Fields * data (`UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin]`): The underlying data for the buffer. The pointer is not owned by the NDBuffer. * dynamic\_shape (`IndexList[rank, element_type=uint64]`): The dynamic value of the shape. * dynamic\_stride (`IndexList[rank, element_type=uint64]`): The dynamic stride of the buffer. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Default initializer for NDBuffer. By default the fields are all initialized to 0. `@implicit` `__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> Self` Constructs an NDBuffer with statically known rank, shapes and type. **Constraints:** The rank, shapes, and type are known. **Args:** * ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the data. `@implicit` `__init__(span: Span[SIMD[type, 1], origin, address_space=address_space, alignment=alignment]) -> Self` Constructs an NDBuffer with statically known rank, shapes and type. **Constraints:** The rank, shapes, and type are known. **Args:** * span (`Span[SIMD[type, 1], origin, address_space=address_space, alignment=alignment]`): Span of the data. `@implicit` `__init__(other: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> Self` Converts NDBuffers between different variants which do not affect the underlying memory representation. E.g. this allows implicit conversion from `NDBuffer[type, rank, DimList(1, 2, 3), DimList(6, 6, 1), alignment=16]` to `NDBuffer[type, rank, DimList(1, 2, 3), DimList.create_unknown[rank](), alignment=4]`. **Args:** * other (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The other NDBuffer type. `__init__(ptr: UnsafePointer[Scalar[type], address_space=address_space, mut=mut, origin=origin], dynamic_shape: IndexList[rank, element_type=element_type]) -> Self` Constructs an NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * ptr (`UnsafePointer[Scalar[type], address_space=address_space, mut=mut, origin=origin]`): Pointer to the data. * dynamic\_shape (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing shapes. `__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin], dynamic_shape: IndexList[rank, element_type=element_type]) -> Self` Constructs an NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin]`): Pointer to the data. * dynamic\_shape (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing shapes.
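For example, wrapping externally allocated memory with a runtime shape might look like the following minimal sketch. It assumes the buffer's origin parameter can be spelled `MutableAnyOrigin` and that `Index` (from the `utils` package) produces the `IndexList` these constructors expect:

```mojo
from buffer import NDBuffer
from memory import UnsafePointer
from utils import Index

def main():
    # NDBuffer does not own this allocation; we must free it ourselves.
    var ptr = UnsafePointer[Float32].alloc(6)
    var buf = NDBuffer[DType.float32, 2, MutableAnyOrigin](ptr, Index(2, 3))
    buf.fill(1.0)
    print(buf[1, 2])  # 1.0
    ptr.free()
```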
`__init__(span: Span[SIMD[type, 1], origin, address_space=address_space], dynamic_shape: IndexList[rank, element_type=element_type]) -> Self` Constructs an NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * ​span (`Span[SIMD[type, 1], origin, address_space=address_space]`): Span of the data. * ​dynamic\_shape (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing shapes. `__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin], dynamic_shape: DimList) -> Self` Constructs an NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin]`): Pointer to the data. * ​dynamic\_shape (`DimList`): A static tuple of size 'rank' representing shapes. `__init__(span: Span[SIMD[type, 1], origin, address_space=address_space], dynamic_shape: DimList) -> Self` Constructs an NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * ​span (`Span[SIMD[type, 1], origin, address_space=address_space]`): Span of the data. * ​dynamic\_shape (`DimList`): A static tuple of size 'rank' representing shapes. `__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin], dynamic_shape: IndexList[rank, element_type=element_type], dynamic_stride: IndexList[rank, element_type=element_type]) -> Self` Constructs a strided NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin]`): Pointer to the data. * ​dynamic\_shape (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing shapes. * ​dynamic\_stride (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing strides. `__init__(span: Span[SIMD[type, 1], origin, address_space=address_space], dynamic_shape: IndexList[rank, element_type=element_type], dynamic_stride: IndexList[rank, element_type=element_type]) -> Self` Constructs a strided NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * ​span (`Span[SIMD[type, 1], origin, address_space=address_space]`): Span over the data. * ​dynamic\_shape (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing shapes. * ​dynamic\_stride (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing strides. `__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin], dynamic_shape: DimList, dynamic_stride: IndexList[rank, element_type=element_type]) -> Self` Constructs a strided NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin]`): Pointer to the data. * ​dynamic\_shape (`DimList`): A DimList of size 'rank' representing shapes. * ​dynamic\_stride (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing strides. 
`__init__(span: Span[SIMD[type, 1], origin, address_space=address_space], dynamic_shape: DimList, dynamic_stride: IndexList[rank, element_type=element_type]) -> Self` Constructs a strided NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * span (`Span[SIMD[type, 1], origin, address_space=address_space]`): Span of the data. * dynamic\_shape (`DimList`): A DimList of size 'rank' representing shapes. * dynamic\_stride (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing strides. ### `__getitem__` `__getitem__(self, *idx: Int) -> SIMD[type, 1]` Gets an element from the buffer at the specified index. **Args:** * \*idx (`Int`): Index of the element to retrieve. **Returns:** The value of the element. `__getitem__(self, idx: IndexList[rank, element_type=element_type]) -> SIMD[type, 1]` Gets an element from the buffer at the specified index. **Args:** * idx (`IndexList[rank, element_type=element_type]`): Index of the element to retrieve. **Returns:** The value of the element. ### `__setitem__` `__setitem__(self: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], idx: IndexList[rank, element_type=element_type], val: SIMD[type, 1])` Stores a single value into the buffer at the specified index. **Args:** * idx (`IndexList[rank, element_type=element_type]`): The index into the buffer. * val (`SIMD[type, 1]`): The value to store. `__setitem__(self: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], *idx: Int, *, val: SIMD[type, 1])` Stores a single value into the buffer at the specified index. **Args:** * \*idx (`Int`): Index of the element to set. * val (`SIMD[type, 1]`): The value to store. ### `origin_cast` `origin_cast[mut: Bool = mut, origin: Origin[mut] = origin](self) -> NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]` Changes the origin or mutability of a pointer. **Parameters:** * mut (`Bool`): Whether the origin is mutable. * origin (`Origin[mut]`): Origin of the destination pointer. **Returns:** A new `NDBuffer` object with the same type and address as the original `NDBuffer`, and the new specified mutability and origin. ### `get_rank` `get_rank(self) -> Int` Returns the rank of the buffer. **Returns:** The rank of NDBuffer. ### `get_shape` `get_shape(self) -> IndexList[rank]` Returns the shapes of the buffer. **Returns:** A static tuple of size 'rank' representing shapes of the NDBuffer. ### `get_strides` `get_strides(self) -> IndexList[rank]` Returns the strides of the buffer. **Returns:** A static tuple of size 'rank' representing strides of the NDBuffer. ### `get_nd_index` `get_nd_index(self, idx: Int) -> IndexList[rank]` Computes the NDBuffer's ND-index based on the flat index. **Args:** * idx (`Int`): The flat index. **Returns:** The index positions. ### `__len__` `__len__(self) -> Int` Computes the NDBuffer's number of elements. **Returns:** The total number of elements in the NDBuffer. ### `num_elements` `num_elements(self) -> Int` Computes the NDBuffer's number of elements. **Returns:** The total number of elements in the NDBuffer. ### `size` `size(self) -> Int` Computes the NDBuffer's number of elements. **Returns:** The total number of elements in the NDBuffer. ### `__str__` `__str__(self) -> String` Gets the buffer as a string.
**Returns:** A compact string of the buffer. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this buffer to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__repr__` `__repr__(self) -> String` Gets the buffer as a string. **Returns:** A compact string representation of the buffer. ### `tile` `tile[*tile_sizes: Dim](self, tile_coords: IndexList[rank, element_type=element_type]) -> NDBuffer[type, rank, origin, DimList(VariadicList(tile_sizes)), address_space=address_space]` Returns an n-d tile "slice" of the buffer of size tile\_sizes at coords. **Parameters:** * ​\*tile\_sizes (`Dim`): The size of the tiles. **Args:** * ​tile\_coords (`IndexList[rank, element_type=element_type]`): The tile index. **Returns:** The tiled buffer at tile\_coords. ### `load` `load[*, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self, *idx: Int) -> SIMD[type, width]` Loads a simd value from the buffer at the specified index. **Constraints:** The buffer must be contiguous or width must be 1. **Parameters:** * ​width (`Int`): The simd\_width of the load. * ​alignment (`Int`): The alignment value. **Args:** * ​\*idx (`Int`): The index into the NDBuffer. **Returns:** The simd value starting at the `idx` position and ending at `idx+width`. `load[*, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self, idx: VariadicList[Int]) -> SIMD[type, width]` Loads a simd value from the buffer at the specified index. **Constraints:** The buffer must be contiguous or width must be 1. **Parameters:** * ​width (`Int`): The simd\_width of the load. * ​alignment (`Int`): The alignment value. **Args:** * ​idx (`VariadicList[Int]`): The index into the NDBuffer. **Returns:** The simd value starting at the `idx` position and ending at `idx+width`. `load[*, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self, idx: IndexList[size, element_type=element_type]) -> SIMD[type, width]` Loads a simd value from the buffer at the specified index. **Constraints:** The buffer must be contiguous or width must be 1. **Parameters:** * ​width (`Int`): The simd\_width of the load. * ​alignment (`Int`): The alignment value. **Args:** * ​idx (`IndexList[size, element_type=element_type]`): The index into the NDBuffer. **Returns:** The simd value starting at the `idx` position and ending at `idx+width`. `load[*, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self, idx: StaticTuple[Int, rank]) -> SIMD[type, width]` Loads a simd value from the buffer at the specified index. **Constraints:** The buffer must be contiguous or width must be 1. **Parameters:** * ​width (`Int`): The simd\_width of the load. * ​alignment (`Int`): The alignment value. **Args:** * ​idx (`StaticTuple[Int, rank]`): The index into the NDBuffer. **Returns:** The simd value starting at the `idx` position and ending at `idx+width`. ### `store` `store[_alignment: Int, //, *, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self: NDBuffer[type, rank, origin, shape, strides, alignment=_alignment, address_space=address_space, exclusive=exclusive], idx: IndexList[rank, element_type=element_type], val: SIMD[type, width])` Stores a simd value into the buffer at the specified index. **Constraints:** The buffer must be contiguous or width must be 1. **Parameters:** * ​\_alignment (`Int`): The inferred alignment of self. * ​width (`Int`): The width of the simd vector. * ​alignment (`Int`): The alignment value. 
**Args:** * idx (`IndexList[rank, element_type=element_type]`): The index into the buffer. * val (`SIMD[type, width]`): The value to store. `store[_alignment: Int, //, *, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self: NDBuffer[type, rank, origin, shape, strides, alignment=_alignment, address_space=address_space, exclusive=exclusive], idx: StaticTuple[Int, rank], val: SIMD[type, width])` Stores a simd value into the buffer at the specified index. **Constraints:** The buffer must be contiguous or width must be 1. **Parameters:** * \_alignment (`Int`): The inferred alignment of self. * width (`Int`): The width of the simd vector. * alignment (`Int`): The alignment value. **Args:** * idx (`StaticTuple[Int, rank]`): The index into the buffer. * val (`SIMD[type, width]`): The value to store. ### `dim` `dim[index: Int](self) -> Int` Gets the buffer dimension at the given index. **Parameters:** * index (`Int`): The index of the dimension to get. **Returns:** The buffer size at the given dimension. `dim(self, index: Int) -> Int` Gets the buffer dimension at the given index. **Args:** * index (`Int`): The index of the dimension to get. **Returns:** The buffer size at the given dimension. ### `stride` `stride[index: Int](self) -> Int` Gets the buffer stride at the given index. **Parameters:** * index (`Int`): The index of the dimension to get the stride for. **Returns:** The stride at the given dimension. `stride(self, index: Int) -> Int` Gets the buffer stride at the given index. **Args:** * index (`Int`): The index of the dimension to get the stride for. **Returns:** The stride at the given dimension. ### `is_contiguous` `is_contiguous(self) -> Bool` Checks if the buffer is contiguous in memory. **Returns:** True if the buffer is contiguous in memory and False otherwise. ### `flatten` `flatten(self) -> NDBuffer[type, 1, origin, __init__[::Intable](shape.product()), address_space=address_space]` Constructs a flattened buffer counterpart for this NDBuffer. **Constraints:** The buffer must be contiguous. **Returns:** Constructed buffer object. ### `make_dims_unknown` `make_dims_unknown(self) -> NDBuffer[type, rank, origin, address_space=address_space]` Rebinds the NDBuffer to one with unknown shape. **Returns:** The rebound NDBuffer with unknown shape. ### `bytecount` `bytecount(self) -> Int` Returns the size of the NDBuffer in bytes. **Returns:** The size of the NDBuffer in bytes. ### `zero` `zero(self: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive])` Sets all bytes of the NDBuffer to 0. **Constraints:** The buffer must be contiguous. ### `tofile` `tofile(self, path: Path)` Writes values to a file. **Args:** * path (`Path`): Path to the output file. ### `fill` `fill(self: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], val: SIMD[type, 1])` Assigns val to all elements in the buffer. The fill is performed in chunks of size N, where N is the native SIMD width of type on the system. **Args:** * val (`SIMD[type, 1]`): The value to store. ### `stack_allocation` `static stack_allocation[*, alignment: Int = alignof[::DType,__mlir_type.!kgen.target]()]() -> Self` Constructs an NDBuffer instance backed by stack allocated memory space. **Parameters:** * alignment (`Int`): Address alignment requirement for the allocation. **Returns:** Constructed NDBuffer with the allocated space.
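For example, a small, fully static buffer can be placed on the stack with this method. A minimal sketch, assuming `MutableAnyOrigin` is an acceptable origin argument and that `DimList` is importable from the `buffer` package:

```mojo
from buffer import NDBuffer, DimList

def main():
    # A 2x3 float32 buffer backed by stack memory; the shape is fully static.
    var buf = NDBuffer[
        DType.float32, 2, MutableAnyOrigin, DimList(2, 3)
    ].stack_allocation()
    buf.zero()
    buf[0, 1] = 42.0
    print(buf[0, 1])  # 42.0
```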
### `prefetch` `prefetch[params: PrefetchOptions](self, *idx: Int)` Prefetches the data at the given index. **Parameters:** * params (`PrefetchOptions`): The prefetch configuration. **Args:** * \*idx (`Int`): The N-D index of the prefetched location. `prefetch[params: PrefetchOptions](self, indices: IndexList[rank])` Prefetches the data at the given index. **Parameters:** * params (`PrefetchOptions`): The prefetch configuration. **Args:** * indices (`IndexList[rank]`): The N-D index of the prefetched location. --- ## buffer Implements the NDBuffer struct. You can import these APIs from the `buffer` package. For example:

```mojo
from buffer import NDBuffer
```

## Structs * [`NDBuffer`](/mojo/stdlib/buffer/buffer/NDBuffer): An N-dimensional buffer. ## Functions * [`partial_simd_load`](/mojo/stdlib/buffer/buffer/partial_simd_load): Loads a vector with dynamic bound. * [`partial_simd_store`](/mojo/stdlib/buffer/buffer/partial_simd_store): Stores a vector with dynamic bound. * [`prod_dims`](/mojo/stdlib/buffer/buffer/prod_dims): Computes the product of a slice of the given buffer's dimensions. --- ## partial_simd_load `partial_simd_load[type: DType, //, width: Int](storage: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], lbound: Int, rbound: Int, pad_value: SIMD[type, 1]) -> SIMD[type, width]` Loads a vector with dynamic bound. Out of bound data will be filled with the pad value. Data is valid if `lbound <= idx < rbound`. **Parameters:** * type (`DType`): The DType of storage. * width (`Int`): The system simd vector size. **Args:** * storage (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the address to perform load. * lbound (`Int`): Lower bound of valid index within simd (inclusive). * rbound (`Int`): Upper bound of valid index within simd (non-inclusive). * pad\_value (`SIMD[type, 1]`): Value to fill for out of bound indices. **Returns:** The loaded SIMD vector, with out-of-bound lanes filled with `pad_value`. --- ## partial_simd_store `partial_simd_store[type: DType, //, width: Int](storage: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], lbound: Int, rbound: Int, data: SIMD[type, width])` Stores a vector with dynamic bound. Out of bound data will be ignored. Data is valid if `lbound <= idx < rbound`. **Parameters:** * type (`DType`): The DType of storage. * width (`Int`): The system simd vector size. **Args:** * storage (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the address to perform the store. * lbound (`Int`): Lower bound of valid index within simd (inclusive). * rbound (`Int`): Upper bound of valid index within simd (non-inclusive). * data (`SIMD[type, width]`): The vector value to store. --- ## prod_dims `prod_dims[start_dim: Int, end_dim: Int](x: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> Int` Computes the product of a slice of the given buffer's dimensions. **Parameters:** * start\_dim (`Int`): The index at which to begin computing the product. * end\_dim (`Int`): The index at which to stop computing the product. **Args:** * x (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The NDBuffer whose dimensions will be multiplied. **Returns:** The product of the specified slice of the buffer's dimensions.
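For example, a bounds-guarded edge load might look like the following minimal sketch; the lane semantics (lane `i` reads `storage[i]` only when `lbound <= i < rbound`, and other lanes receive `pad_value`) are assumed from the descriptions above:

```mojo
from buffer import partial_simd_load
from memory import UnsafePointer

def main():
    var data = UnsafePointer[Float32].alloc(4)
    for i in range(4):
        data[i] = Float32(i)
    # Lanes 0..3 map to data[0..3]; only lanes 1 and 2 are valid here,
    # so lanes 0 and 3 are filled with the pad value -1.0.
    var v = partial_simd_load[4](data, 1, 3, -1.0)
    print(v)  # [-1.0, 1.0, 2.0, -1.0]
    data.free()
```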
--- ## Dim `@register_passable(trivial)` `struct Dim` A static or dynamic dimension modeled with an optional integer. This class is meant to represent an optional static dimension. When a value is present, the dimension has that static value. When a value is not present, the dimension is dynamic. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `EqualityComparable`, `ImplicitlyBoolable`, `Indexer`, `Intable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `@implicit` `__init__[I: Intable](value: I) -> Self` Creates a statically-known dimension. **Parameters:** * ​I (`Intable`): The Intable type. **Args:** * ​value (`I`): The static dimension value. `@implicit` `__init__[I: Indexer](value: I) -> Self` Creates a statically-known dimension. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​value (`I`): The static dimension value. `@implicit` `__init__(value: index) -> Self` Creates a statically-known dimension. **Args:** * ​value (`index`): The static dimension value. `@implicit` `__init__(value: Int) -> Self` Creates a statically-known dimension. **Args:** * ​value (`Int`): The static dimension value. `__init__() -> Self` Creates a dynamic dimension with no static value. ### `__bool__` `__bool__(self) -> Bool` Returns True if the dimension has a static value. **Returns:** Whether the dimension has a static value. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compares two dimensions for equality. **Args:** * ​rhs (`Self`): The other dimension. **Returns:** True if the dimensions are the same. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Compare two dimensions for inequality. **Args:** * ​rhs (`Self`): The dimension to compare. **Returns:** True if they are not equal. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Multiplies two dimensions. If either are unknown, the result is unknown as well. **Args:** * ​rhs (`Self`): The other dimension. **Returns:** The product of the two dimensions. ### `__floordiv__` `__floordiv__(self, rhs: Self) -> Self` Divide by the given dimension and round towards negative infinity. If either are unknown, the result is unknown as well. **Args:** * ​rhs (`Self`): The divisor dimension. **Returns:** The floor division of the two dimensions. ### `__rfloordiv__` `__rfloordiv__(self, rhs: Self) -> Self` Divide the given argument by self and round towards negative infinity. If either are unknown, the result is unknown as well. **Args:** * ​rhs (`Self`): The dimension to divide by this Dim. **Returns:** The floor of the argument divided by self. ### `__imul__` `__imul__(mut self, rhs: Self)` Inplace multiplies two dimensions. If either are unknown, the result is unknown as well. **Args:** * ​rhs (`Self`): The other dimension. ### `__as_bool__` `__as_bool__(self) -> Bool` Returns True if the dimension has a static value. **Returns:** Whether the dimension has a static value. ### `has_value` `has_value(self) -> Bool` Returns True if the dimension has a static value. **Returns:** Whether the dimension has a static value. ### `is_dynamic` `is_dynamic(self) -> Bool` Returns True if the dimension has a dynamic value. **Returns:** Whether the dimension is dynamic. ### `get` `get(self) -> Int` Gets the static dimension value. **Returns:** The static dimension value. ### `is_multiple` `is_multiple[alignment: Int](self) -> Bool` Checks if the dimension is aligned. **Parameters:** * ​alignment (`Int`): The alignment requirement. **Returns:** Whether the dimension is aligned. 
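For example, a minimal sketch of how static and dynamic dimensions behave:

```mojo
from buffer import Dim

def main():
    var static_dim = Dim(42)  # statically-known dimension
    var dynamic_dim = Dim()   # dynamic dimension: no static value
    print(static_dim.has_value())    # True
    print(dynamic_dim.is_dynamic())  # True
    # Arithmetic propagates "unknown": static * dynamic is dynamic.
    print((static_dim * dynamic_dim).has_value())  # False
```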
### `__index__` `__index__(self) -> index` Convert to index. **Returns:** The corresponding \_\_mlir\_type.index value. ### `__int__` `__int__(self) -> Int` Gets the static dimension value. **Returns:** The static dimension value. ### `__str__` `__str__(self) -> String` Converts the Dim to a String. If the value is unknown, then the string "?" is returned. **Returns:** The string representation of the type. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this Dim to the provided Writer. **Parameters:** * W (`Writer`): A type conforming to the Writable trait. **Args:** * writer (`W`): The object to write to. ### `or_else` `or_else(self, default: Int) -> Int` Return the underlying value contained in the Optional or a default value if the Optional's underlying value is not present. **Args:** * default (`Int`): The new value to use if no value was present. **Returns:** The underlying value contained in the Optional or a default value. --- ## DimList `@register_passable(trivial)` `struct DimList` This type represents a list of dimensions. Each dimension may have a static value or not have a value, which represents a dynamic dimension. ## Fields * value (`VariadicList[Dim]`): The underlying storage for the list of dimensions. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Representable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `@implicit` `__init__[Intable: Intable](value: Intable) -> Self` Creates a dimension list from the given value. **Parameters:** * Intable (`Intable`): A type able to be converted to an `Int`. **Args:** * value (`Intable`): The initial dim value. `@implicit` `__init__[I: Indexer](values: Tuple[I]) -> Self` Creates a dimension list from the given list of values. **Parameters:** * I (`Indexer`): A type that can be used as an index. **Args:** * values (`Tuple[I]`): The initial dim values list. `@implicit` `__init__[I0: Indexer, I1: Indexer](values: Tuple[I0, I1]) -> Self` Creates a dimension list from the given list of values. **Parameters:** * I0 (`Indexer`): A type that can be used as an Index. * I1 (`Indexer`): A type that can be used as an Index. **Args:** * values (`Tuple[I0, I1]`): The initial dim values list. `@implicit` `__init__[I0: Indexer, I1: Indexer, I2: Indexer](values: Tuple[I0, I1, I2]) -> Self` Creates a dimension list from the given list of values. **Parameters:** * I0 (`Indexer`): A type that can be used as an Index. * I1 (`Indexer`): A type that can be used as an Index. * I2 (`Indexer`): A type that can be used as an Index. **Args:** * values (`Tuple[I0, I1, I2]`): The initial dim values list. `__init__[I0: Indexer, I1: Indexer](val0: I0, val1: I1) -> Self` Creates a dimension list from the given list of values. **Parameters:** * I0 (`Indexer`): A type that can be used as an Index. * I1 (`Indexer`): A type that can be used as an Index. **Args:** * val0 (`I0`): The initial dim value. * val1 (`I1`): The initial dim value. `__init__[I0: Indexer, I1: Indexer, I2: Indexer](val0: I0, val1: I1, val2: I2) -> Self` Creates a dimension list from the given list of values. **Parameters:** * I0 (`Indexer`): A type that can be used as an Index. * I1 (`Indexer`): A type that can be used as an Index. * I2 (`Indexer`): A type that can be used as an Index. **Args:** * val0 (`I0`): The initial dim value. * val1 (`I1`): The initial dim value. * val2 (`I2`): The initial dim value.
`__init__[I0: Indexer, I1: Indexer, I2: Indexer, I3: Indexer](val0: I0, val1: I1, val2: I2, val3: I3) -> Self` Creates a dimension list from the given values. **Parameters:** * I0 (`Indexer`): A type that can be used as an Index. * I1 (`Indexer`): A type that can be used as an Index. * I2 (`Indexer`): A type that can be used as an Index. * I3 (`Indexer`): A type that can be used as an Index. **Args:** * val0 (`I0`): The initial dim value. * val1 (`I1`): The initial dim value. * val2 (`I2`): The initial dim value. * val3 (`I3`): The initial dim value. `@implicit` `__init__(values: VariadicList[Dim]) -> Self` Creates a dimension list from the given list of values. **Args:** * values (`VariadicList[Dim]`): The initial dim values list. `@implicit` `__init__(*values: Dim) -> Self` Creates a dimension list from the given Dim values. **Args:** * \*values (`Dim`): The initial dim values. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compares two DimLists for equality. DimLists are considered equal if all non-dynamic Dims have the same values and all dynamic Dims in self are also dynamic in rhs. **Args:** * rhs (`Self`): The other DimList. **Returns:** True if the DimLists are the same. ### `__len__` `__len__(self) -> Int` Gets the size of the DimList. **Returns:** The number of elements in the DimList. ### `get` `get[i: Int](self) -> Int` Gets the static dimension value at a specified index. **Parameters:** * i (`Int`): The dimension index. **Returns:** The static dimension value at the specified index. ### `at` `at[i: Int](self) -> Dim` Gets the dimension at a specified index. **Parameters:** * i (`Int`): The dimension index. **Returns:** The dimension at the specified index. ### `has_value` `has_value[i: Int](self) -> Bool` Returns True if the dimension at the given index has a static value. **Parameters:** * i (`Int`): The dimension index. **Returns:** Whether the specified dimension has a static value. ### `product` `product[length: Int](self) -> Dim` Computes the product of the first `length` dimensions in the list. If any are dynamic, the result is a dynamic dimension value. **Parameters:** * length (`Int`): The number of elements in the list. **Returns:** The product of the first `length` dimensions. `product[start: Int, end: Int](self) -> Dim` Computes the product of a range of the dimensions in the list. If any in the range are dynamic, the result is a dynamic dimension value. **Parameters:** * start (`Int`): The starting index. * end (`Int`): The end index. **Returns:** The product of the dimensions in the range \[start, end). `product(self) -> Dim` Computes the product of all the dimensions in the list. If any are dynamic, the result is a dynamic dimension value. **Returns:** The product of all the dimensions. ### `contains` `contains[length: Int](self, value: Dim) -> Bool` Determines whether the dimension list contains a specified dimension value. **Parameters:** * length (`Int`): The number of elements in the list. **Args:** * value (`Dim`): The value to find. **Returns:** True if the list contains a dimension of the specified value. ### `all_known` `all_known[length: Int](self) -> Bool` Determines whether all dimensions are statically known. **Parameters:** * length (`Int`): The number of elements in the list. **Returns:** True if all dimensions have a static value. `all_known[start: Int, end: Int](self) -> Bool` Determines whether all dimensions within \[start, end) are statically known. **Parameters:** * start (`Int`): The first queried dimension. * end (`Int`): The last queried dimension.
**Returns:** True if all queried dimensions have a static value. ### `into_index_list` `into_index_list[rank: Int](self) -> IndexList[rank]` Copy the DimList values into an `IndexList`, providing the rank.

```mojo
from buffer import DimList

var dim_list = DimList(2, 4)
var index_list = dim_list.into_index_list[rank=2]()
```

**Parameters:** * rank (`Int`): The rank of the output IndexList. **Returns:** An IndexList with the same dimensions as the DimList. ### `create_unknown` `static create_unknown[length: Int]() -> Self` Creates a dimension list of all dynamic dimension values. **Parameters:** * length (`Int`): The number of elements in the list. **Returns:** A list of all dynamic dimension values. ### `__str__` `__str__(self) -> String` Converts the DimList to a String. The String is a comma separated list of the string representation of Dim. **Returns:** The string representation of the type. ### `__repr__` `__repr__(self) -> String` Converts the DimList to a readable String representation. **Returns:** The string representation of the type. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this DimList to the provided Writer. **Parameters:** * W (`Writer`): A type conforming to the Writable trait. **Args:** * writer (`W`): The object to write to. --- ## dimlist Provides utilities for working with static and variadic lists. You can import these APIs from the `buffer` package. For example:

```mojo
from buffer import Dim
```

## Structs * [`Dim`](/mojo/stdlib/buffer/dimlist/Dim): A static or dynamic dimension modeled with an optional integer. * [`DimList`](/mojo/stdlib/buffer/dimlist/DimList): This type represents a list of dimensions. Each dimension may have a static value or not have a value, which represents a dynamic dimension. --- ## buffer Implements the buffer package. ## Modules * [`buffer`](/mojo/stdlib/buffer/buffer/): Implements the NDBuffer struct. * [`dimlist`](/mojo/stdlib/buffer/dimlist/): Provides utilities for working with static and variadic lists. --- ## AnyType A trait for types that require lifetime management through destructors. The `AnyType` trait is fundamental to Mojo's memory management system. It indicates that a type has a destructor that needs to be called when instances go out of scope. This is essential for types that own resources like memory, file handles, or other system resources that need proper cleanup. Key aspects: * Any type with a destructor must implement this trait * The destructor (`__del__`) is called automatically when an instance's lifetime ends * Composition of types with destructors automatically gets a destructor * All Mojo structs and traits inherit from `AnyType` by default unless they specify `@explicit_destroy` Example:

```mojo
struct ResourceOwner(AnyType):
    var ptr: UnsafePointer[Int]

    fn __init__(out self, size: Int):
        self.ptr = UnsafePointer[Int].alloc(size)

    fn __del__(owned self):
        # Clean up owned resources
        self.ptr.free()
```

Best practices: * Implement this trait when your type owns resources that need cleanup * Ensure the destructor properly frees all owned resources * Consider using `@explicit_destroy` for types that should never have destructors * Use composition to automatically handle nested resource cleanup ## Implemented traits `UnknownDestructibility` ## Methods ### `__del__` `__del__(owned self: _Self, /)` Destroys the instance and cleans up any owned resources. This method is called automatically when an instance's lifetime ends.
It receives an owned value and should perform all necessary cleanup operations like: * Freeing allocated memory * Closing file handles * Releasing system resources * Cleaning up any other owned resources The instance is considered dead after this method completes, regardless of whether any explicit cleanup was performed. --- ## UnknownDestructibility The most basic trait that all Mojo types extend by default. This trait indicates that a type has no destructor and therefore no lifetime management. It is the default for all types unless they explicitly implement `AnyType` or `ImplicitlyDestructible`. Types with this trait: * Have no `__del__` method * Do not perform any cleanup when they go out of scope * Are suitable for simple value types that don't own resources For types that need cleanup when they are destroyed, use `ImplicitlyDestructible` or `AnyType` instead. --- ## anytype Defines the core traits for object lifetime management in Mojo. This module provides the foundational traits that define how objects are created, managed and destroyed in Mojo: * `UnknownDestructibility`: The most basic trait that all types extend by default. Types with this trait have no destructor and no lifetime management. * `AnyType`: The base trait for types that require lifetime management through destructors. Any type that needs cleanup when it goes out of scope should implement this trait. * `ImplicitlyDestructible`: An alias for `AnyType` to help with the transition to linear types. Use this when you want to be explicit about a type having a destructor. These traits are built into Mojo and do not need to be imported. ## Aliases ### `ImplicitlyDestructible` `alias ImplicitlyDestructible = AnyType` ## Traits * [​`AnyType`](/mojo/stdlib/builtin/anytype/AnyType): A trait for types that require lifetime management through destructors. * [​`UnknownDestructibility`](/mojo/stdlib/builtin/anytype/UnknownDestructibility): The most basic trait that all Mojo types extend by default. --- ## Bool `@register_passable(trivial)` `struct Bool` The primitive Bool scalar value used in Mojo. ## Fields * ​value (`i1`): The underlying storage of the boolean value. ## Implemented traits `AnyType`, `Boolable`, `ConvertibleFromPython`, `Copyable`, `Defaultable`, `EqualityComparable`, `ExplicitlyCopyable`, `Floatable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `ImplicitlyBoolable`, `ImplicitlyIntable`, `Indexer`, `Intable`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `PythonConvertible`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Aliases ### `MAX` `alias MAX = __init__[::Boolable](True)` The maximum value of a Bool. ### `MIN` `alias MIN = __init__[::Boolable](False)` The minimum value of a Bool. ## Methods ### `__init__` `__init__() -> Self` Construct a default, `False` Bool. `@implicit` `__init__[T: ImplicitlyBoolable, //](value: T) -> Self` Convert an ImplicitlyBoolable value to a Bool. **Parameters:** * ​T (`ImplicitlyBoolable`): The ImplicitlyBoolable type. **Args:** * ​value (`T`): The boolable value. `__init__[T: Boolable, //](value: T) -> Self` Set the bool representation of the object. **Parameters:** * ​T (`Boolable`): The type of the object. **Args:** * ​value (`T`): The object to get the bool representation of. `__init__(value: None) -> Self` Set the bool representation of the `None` type to `False`. **Args:** * ​value (`None`): The object to get the bool representation of. 
`@implicit` `__init__(value: SIMD[bool, 1]) -> Self` Convert a scalar SIMD value to a Bool. **Args:** * ​value (`SIMD[bool, 1]`): The scalar value. ### `__bool__` `__bool__(self) -> Self` Convert to Bool. **Returns:** This value. ### `__neg__` `__neg__(self) -> Int` Defines the unary `-` operation. **Returns:** 0 for False and -1 for True. ### `__invert__` `__invert__(self) -> Self` Inverts the Bool value. **Returns:** True if the object is false and False otherwise. ### `__lt__` `__lt__(self, rhs: Self) -> Self` Compare this Bool to RHS using less-than comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** True if self is False and rhs is True. ### `__le__` `__le__(self, rhs: Self) -> Self` Compare this Bool to RHS using less-than-or-equal comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** True if self is False or rhs is True. ### `__eq__` `__eq__(self, rhs: Self) -> Self` Compare this Bool to RHS. Performs an equality comparison between the Bool value and the argument. This method gets invoked when a user uses the `==` infix operator. **Args:** * ​rhs (`Self`): The rhs value of the equality statement. **Returns:** True if the two values match and False otherwise. ### `__ne__` `__ne__(self, rhs: Self) -> Self` Compare this Bool to RHS. Performs a non-equality comparison between the Bool value and the argument. This method gets invoked when a user uses the `!=` infix operator. **Args:** * ​rhs (`Self`): The rhs value of the non-equality statement. **Returns:** False if the two values do match and True otherwise. ### `__gt__` `__gt__(self, rhs: Self) -> Self` Compare this Bool to RHS using greater-than comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** True if self is True and rhs is False. ### `__ge__` `__ge__(self, rhs: Self) -> Self` Compare this Bool to RHS using greater-than-or-equal comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** True if self is True or rhs is False. ### `__and__` `__and__(self, rhs: Self) -> Self` Returns `self & rhs`. Bitwise and's the Bool value with the argument. This method gets invoked when a user uses the `and` infix operator. **Args:** * ​rhs (`Self`): The right hand side of the `and` statement. **Returns:** `self & rhs`. ### `__or__` `__or__(self, rhs: Self) -> Self` Returns `self | rhs`. Bitwise or's the Bool value with the argument. This method gets invoked when a user uses the `or` infix operator. **Args:** * ​rhs (`Self`): The right hand side of the `or` statement. **Returns:** `self | rhs`. ### `__xor__` `__xor__(self, rhs: Self) -> Self` Returns `self ^ rhs`. Bitwise Xor's the Bool value with the argument. This method gets invoked when a user uses the `^` infix operator. **Args:** * ​rhs (`Self`): The right hand side of the `xor` statement. **Returns:** `self ^ rhs`. ### `__rand__` `__rand__(self, lhs: Self) -> Self` Returns `lhs & self`. **Args:** * ​lhs (`Self`): The left hand side of the `and` statement. **Returns:** `lhs & self`. ### `__ror__` `__ror__(self, lhs: Self) -> Self` Returns `lhs | self`. **Args:** * ​lhs (`Self`): The left hand side of the `or` statement. **Returns:** `lhs | self`. ### `__rxor__` `__rxor__(self, lhs: Self) -> Self` Returns `lhs ^ self`. **Args:** * ​lhs (`Self`): The left hand side of the `xor` statement. **Returns:** `lhs ^ self`. ### `__iand__` `__iand__(mut self, rhs: Self)` Computes `self & rhs` and stores the result in `self`.
**Args:** * ​rhs (`Self`): The right hand side of the `and` statement. ### `__ixor__` `__ixor__(mut self, rhs: Self)` Computes `self ^ rhs` and stores the result in `self`. **Args:** * ​rhs (`Self`): The right hand side of the `xor` statement. ### `__ior__` `__ior__(mut self, rhs: Self)` Computes `self | rhs` and stores the result in `self`. **Args:** * ​rhs (`Self`): The right hand side of the `or` statement. ### `copy` `copy(self) -> Self` Explicitly construct a deep copy of the provided value. **Returns:** A copy of the value. ### `__as_bool__` `__as_bool__(self) -> Self` Convert to Bool. **Returns:** This value. ### `__str__` `__str__(self) -> String` Get the bool as a string. Returns `"True"` or `"False"`. **Returns:** A string representation. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this boolean to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__repr__` `__repr__(self) -> String` Get the bool as a string. Returns `"True"` or `"False"`. **Returns:** A string representation. ### `__int__` `__int__(self) -> Int` Convert this Bool to an integer. **Returns:** 1 if the Bool is True, 0 otherwise. ### `__as_int__` `__as_int__(self) -> Int` Implicitly convert to an integral representation of the value, wherever an `Int` is expected. **Returns:** The integral representation of the value. ### `__index__` `__index__(self) -> index` Convert to index. **Returns:** 1 if the Bool is True, 0 otherwise. ### `__float__` `__float__(self) -> SIMD[float64, 1]` Convert this Bool to a float. **Returns:** 1.0 if True, 0.0 otherwise. ### `__hash__` `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with the underlying bytes. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `to_python_object` `to_python_object(owned self) -> PythonObject` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. --- ## Boolable The `Boolable` trait describes a type that can be explicitly converted to a `Bool` or evaluated as a boolean expression in `if` or `while` conditions. This trait requires the type to implement the `__bool__()` method. For example: ```mojo struct Foo(Boolable): var val: Bool fn __bool__(self) -> Bool: return self.val ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__bool__` `__bool__(self: _Self) -> Bool` Get the boolean representation of the value. **Returns:** The boolean representation of the value. --- ## ImplicitlyBoolable The `ImplicitlyBoolable` trait describes a type that can be implicitly converted to a `Bool`. Types conforming to this trait can be passed to a function that expects a `Bool` without explicitly converting to it. Accordingly, most types should conform to `Boolable` instead, since implicit conversions to `Bool` can have unintuitive consequences. This trait requires the type to implement the `__as_bool__()` method. For example: ```mojo struct Foo(ImplicitlyBoolable): var val: Bool fn __as_bool__(self) -> Bool: return self.val fn __bool__(self) -> Bool: return self.__as_bool__() ``` ## Implemented traits `AnyType`, `Boolable`, `UnknownDestructibility` ## Methods ### `__bool__` `__bool__(self: _Self) -> Bool` Get the boolean representation of the value. **Returns:** The boolean representation of the value. ### `__as_bool__` `__as_bool__(self: _Self) -> Bool` Get the boolean representation of the value.
**Returns:** The boolean representation of the value. --- ## all `all[T: Boolable & Copyable & Movable, //](list: List[T, hint_trivial_type]) -> Bool` Checks if **all** elements in the list are truthy. **Parameters:** * ​T (`Boolable & Copyable & Movable`): The type of elements to check. **Args:** * ​list (`List[T, hint_trivial_type]`): The list to check. **Returns:** `True` if **all** elements in the list are truthy, `False` otherwise. `all[T: Boolable & Copyable & Movable & Hashable & EqualityComparable, //](set: Set[T]) -> Bool` Checks if **all** elements in the set are truthy. **Parameters:** * ​T (`Boolable & Copyable & Movable & Hashable & EqualityComparable`): The type of elements to check. **Args:** * ​set (`Set[T]`): The set to check. **Returns:** `True` if **all** elements in the set are truthy, `False` otherwise. `all(value: SIMD[dtype, size]) -> Bool` Checks if **all** elements in the simd vector are truthy. **Args:** * ​value (`SIMD[dtype, size]`): The simd vector to check. **Returns:** `True` if **all** elements in the simd vector are truthy, `False` otherwise. --- ## any `any[T: Boolable & Copyable & Movable, //](list: List[T, hint_trivial_type]) -> Bool` Checks if **any** element in the list is truthy. **Parameters:** * ​T (`Boolable & Copyable & Movable`): The type of elements to check. **Args:** * ​list (`List[T, hint_trivial_type]`): The list to check. **Returns:** `True` if **any** element in the list is truthy, `False` otherwise. `any[T: Boolable & Copyable & Movable & Hashable & EqualityComparable, //](set: Set[T]) -> Bool` Checks if **any** element in the set is truthy. **Parameters:** * ​T (`Boolable & Copyable & Movable & Hashable & EqualityComparable`): The type of elements to check. **Args:** * ​set (`Set[T]`): The set to check. **Returns:** `True` if **any** element in the set is truthy, `False` otherwise. `any(value: SIMD[dtype, size]) -> Bool` Checks if **any** element in the simd vector is truthy. **Args:** * ​value (`SIMD[dtype, size]`): The simd vector to check. **Returns:** `True` if **any** element in the simd vector is truthy, `False` otherwise. --- ## bool Implements the Bool class. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`Bool`](/mojo/stdlib/builtin/bool/Bool): The primitive Bool scalar value used in Mojo. ## Traits * [​`Boolable`](/mojo/stdlib/builtin/bool/Boolable): The `Boolable` trait describes a type that can be explicitly converted to a `Bool` or evaluated as a boolean expression in `if` or `while` conditions. * [​`ImplicitlyBoolable`](/mojo/stdlib/builtin/bool/ImplicitlyBoolable): The `ImplicitlyBoolable` trait describes a type that can be implicitly converted to a `Bool`. ## Functions * [​`all`](/mojo/stdlib/builtin/bool/all): Checks if **all** elements in the list are truthy. * [​`any`](/mojo/stdlib/builtin/bool/any): Checks if **any** element in the list is truthy. --- ## breakpoint `breakpoint()` Cause an execution trap with the intention of requesting the attention of a debugger. --- ## breakpoint This module includes the builtin breakpoint function. ## Functions * [​`breakpoint`](/mojo/stdlib/builtin/breakpoint/breakpoint): Cause an execution trap with the intention of requesting the attention of a debugger. --- ## Slice `struct Slice` Represents a slice expression. Objects of this type are generated when slice syntax is used within square brackets, e.g.: ```mojo var msg: String = "Hello Mojo" # Both are equivalent and print "Mojo". 
print(msg[6:]) print(msg.__getitem__(Slice(6, len(msg)))) ``` ## Fields * ​start (`Optional[Int]`): The starting index of the slice. * ​end (`Optional[Int]`): The end index of the slice. * ​step (`Optional[Int]`): The step increment value of the slice. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Movable`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self, start: Int, end: Int)` Construct slice given the start and end values. **Args:** * ​start (`Int`): The start value. * ​end (`Int`): The end value. `__init__(out self, start: Optional[Int], end: Optional[Int], step: Optional[Int])` Construct slice given the start, end, and step values. **Args:** * ​start (`Optional[Int]`): The start value. * ​end (`Optional[Int]`): The end value. * ​step (`Optional[Int]`): The step value. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compare this slice to the other. **Args:** * ​other (`Self`): The slice to compare to. **Returns:** True if start, end, and step values of this slice match the corresponding values of the other slice and False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compare this slice to the other. **Args:** * ​other (`Self`): The slice to compare to. **Returns:** False if start, end, and step values of this slice match the corresponding values of the other slice and True otherwise. ### `copy` `copy(self) -> Self` Creates a deep copy of the Slice. **Returns:** A copy of the value. ### `__str__` `__str__(self) -> String` Gets the string representation of the slice. **Returns:** The string representation of the slice. ### `__repr__` `__repr__(self) -> String` Gets the string representation of the slice. **Returns:** The string representation of the slice. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Write Slice string representation to a `Writer`. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `indices` `indices(self, length: Int) -> Tuple[Int, Int, Int]` Returns a tuple of 3 integers representing the start, end, and step of the slice if applied to a container of the given length. Uses the target container length to normalize negative, out of bounds, or None indices. Negative indices are wrapped using the length of the container. ```mojo s = slice(0, -1, 1) i = s.indices(5) # returns (0, 4, 1) ``` None indices are defaulted to the start or the end of the container based on whether `step` is positive or negative. ```mojo s = slice(None, None, 1) i = s.indices(5) # returns (0, 5, 1) ``` Out of bounds indices are clamped using the size of the container. ```mojo s = slice(20) i = s.indices(5) # returns (0, 5, 1) ``` **Args:** * ​length (`Int`): The length of the target container. **Returns:** A tuple containing three integers for start, end, and step. --- ## builtin_slice Implements slice. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`Slice`](/mojo/stdlib/builtin/builtin_slice/Slice): Represents a slice expression. ## Functions * [​`slice`](/mojo/stdlib/builtin/builtin_slice/slice-function): Construct slice given the end value. --- ## slice `slice(end: Int) -> Slice` Construct slice given the end value. **Args:** * ​end (`Int`): The end value. **Returns:** The constructed slice. `slice(start: Int, end: Int) -> Slice` Construct slice given the start and end values. **Args:** * ​start (`Int`): The start value. * ​end (`Int`): The end value.
**Returns:** The constructed slice. `slice(start: Optional[Int], end: Optional[Int], step: Optional[Int]) -> Slice` Construct a Slice given the start, end, and step values. **Args:** * ​start (`Optional[Int]`): The start value. * ​end (`Optional[Int]`): The end value. * ​step (`Optional[Int]`): The step value. **Returns:** The constructed slice. --- ## GreaterThanComparable A type which supports greater-than comparison with other instances of itself. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__gt__` `__gt__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is greater than `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is greater than `rhs`. --- ## GreaterThanOrEqualComparable A type which supports greater-than-or-equal comparison with other instances of itself. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__ge__` `__ge__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is greater than or equal to `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is greater than or equal to `rhs`. --- ## LessThanComparable A type which supports less-than comparison with other instances of itself. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__lt__` `__lt__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is less than `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is less than `rhs`. --- ## LessThanOrEqualComparable A type which supports less-than-or-equal comparison with other instances of itself. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__le__` `__le__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is less than or equal to `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is less than or equal to `rhs`. --- ## comparable ## Aliases ### `Comparable` `alias Comparable = EqualityComparable & LessThanComparable & GreaterThanComparable & LessThanOrEqualComparable & GreaterThanOrEqualComparable` A type which can be compared with other instances of itself. ## Traits * [​`GreaterThanComparable`](/mojo/stdlib/builtin/comparable/GreaterThanComparable): A type which supports greater-than comparison with other instances of itself. * [​`GreaterThanOrEqualComparable`](/mojo/stdlib/builtin/comparable/GreaterThanOrEqualComparable): A type which supports greater-than-or-equal comparison with other instances of itself. * [​`LessThanComparable`](/mojo/stdlib/builtin/comparable/LessThanComparable): A type which supports less-than comparison with other instances of itself. * [​`LessThanOrEqualComparable`](/mojo/stdlib/builtin/comparable/LessThanOrEqualComparable): A type which supports less-than-or-equal comparison with other instances of itself. --- ## constrained `constrained[cond: Bool, msg: StringSlice[StaticConstantOrigin], *extra: StringSlice[StaticConstantOrigin]]()` Asserts that the condition must be true at compile time. The `constrained()` function introduces a compile-time constraint on the enclosing function. If the condition is true at compile time, the constraint has no effect. If the condition is false, compilation fails and the message is displayed. This is similar to `static_assert` in C++. It differs from [`debug_assert()`](/mojo/stdlib/builtin/debug_assert/debug_assert), which is a run-time assertion.
Example: ```mojo fn half[dtype: DType](a: Scalar[dtype]) -> Scalar[dtype]: constrained[ dtype.is_numeric(), "dtype must be numeric." ]() return a / 2 def main(): print(half(UInt8(5))) # prints 2 print(half(Scalar[DType.bool](True))) # constraint failed: # dtype must be numeric. ``` **Parameters:** * ​cond (`Bool`): The bool value to assert. * ​msg (`StringSlice[StaticConstantOrigin]`): The message to display on failure. * ​\*extra (`StringSlice[StaticConstantOrigin]`): Additional messages to concatenate to msg. `constrained[cond: Bool]()` Asserts that the condition must be true at compile time. The `constrained()` function introduces a compile-time constraint on the enclosing function. If the condition is true at compile time, the constraint has no effect. If the condition is false, compilation fails and a generic message is displayed. This is similar to `static_assert` in C++. It differs from [`debug_assert()`](/mojo/stdlib/builtin/debug_assert/debug_assert), which is a run-time assertion. For an example, see the [first overload](/mojo/stdlib/builtin/constrained/constrained). **Parameters:** * ​cond (`Bool`): The bool value to assert. --- ## constrained Implements compile-time constraints. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`constrained`](/mojo/stdlib/builtin/constrained/constrained): Asserts that the condition must be true at compile time. --- ## Coroutine `@register_passable` `struct Coroutine[type: AnyType, origins: origin.set]` Represents a coroutine. Coroutines can pause execution, saving the state of the program (including values of local variables and the location of the next instruction to be executed). When the coroutine is resumed, execution continues from where it left off, with the saved state restored. ## Parameters * ​type (`AnyType`): Type of value returned upon completion of the coroutine. * ​origins (`origin.set`): The origin of the coroutine's captures. ## Implemented traits `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(handle: !co.routine) -> Self` Construct a coroutine object from a handle. **Args:** * ​handle (`!co.routine`): The init handle. ### `__await__` `__await__(owned self, out result: type)` Suspends the current coroutine until the coroutine is complete. **Returns:** The coroutine promise. ### `force_destroy` `force_destroy(owned self)` Destroy the coroutine object. --- ## RaisingCoroutine `@register_passable` `struct RaisingCoroutine[type: AnyType, origins: origin.set]` Represents a coroutine that can raise. Coroutines can pause execution, saving the state of the program (including values of local variables and the location of the next instruction to be executed). When the coroutine is resumed, execution continues from where it left off, with the saved state restored. ## Parameters * ​type (`AnyType`): Type of value returned upon completion of the coroutine. * ​origins (`origin.set`): The origin set of the coroutine's captures. ## Implemented traits `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(handle: !co.routine) -> Self` Construct a coroutine object from a handle. **Args:** * ​handle (`!co.routine`): The init handle. ### `__await__` `__await__(owned self, out result: type)` Suspends the current coroutine until the coroutine is complete. **Returns:** The coroutine promise. ### `force_destroy` `force_destroy(owned self)` Destroy the coroutine object. --- ## coroutine Implements classes and methods for coroutines.
These are Mojo built-ins, so you don't need to import them. ## Aliases ### `AnyCoroutine` `alias AnyCoroutine = !co.routine` ## Structs * [​`Coroutine`](/mojo/stdlib/builtin/coroutine/Coroutine): Represents a coroutine. * [​`RaisingCoroutine`](/mojo/stdlib/builtin/coroutine/RaisingCoroutine): Represents a coroutine that can raise. --- ## debug_assert `debug_assert[: origin.set, //, cond: fn() capturing -> Bool, assert_mode: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("none"), *Ts: Writable = *?, *, cpu_only: Bool = False](*messages: *Ts)` Asserts that the condition is true at run time. If the condition is false, the assertion displays the given message and causes the program to exit. You can pass in multiple arguments to generate a formatted message. No string allocation occurs unless the assertion is triggered. ```mojo x = 0 debug_assert(x > 0, "expected x to be more than 0 but got: ", x) ``` Normal assertions are off by default—they only run when the program is compiled with all assertions enabled. You can set the `assert_mode` to `safe` to create an assertion that's on by default: ```mojo debug_assert[assert_mode="safe"]( x > 0, "expected x to be more than 0 but got: ", x ) ``` Use the `ASSERT` variable to turn assertions on or off when building or running a Mojo program: ```sh mojo -D ASSERT=all main.mojo ``` The `ASSERT` variable takes the following values: * all: Turn on all assertions. * safe: Turn on "safe" assertions only. This is the default. * none: Turn off all assertions, for performance at the cost of safety. * warn: Turn on all assertions, but print any errors instead of exiting. To ensure that you have no run-time penalty from your assertions even when they're disabled, make sure there are no side effects in your message and condition expressions. For example: ```mojo person = "name: john, age: 50" name = "john" debug_assert(String("name: ") + name == person, "unexpected name") ``` This will have a run-time penalty due to allocating a `String` in the condition expression, even when assertions are disabled. To avoid this, put the condition inside a closure so it runs only when the assertion is turned on: ```mojo fn check_name() capturing -> Bool: return String("name: ") + name == person debug_assert[check_name]("unexpected name") ``` If you need to allocate, and so don't want the assert to ever run on GPU, you can set it to CPU only: ```mojo debug_assert[check_name, cpu_only=True]("unexpected name") ``` For compile-time assertions, see [`constrained()`](/mojo/stdlib/builtin/constrained/constrained). **Parameters:** * ​cond (`fn() capturing -> Bool`): The function to invoke to check if the assertion holds. * ​assert\_mode (`StringSlice[StaticConstantOrigin]`): Determines when the assert is turned on. * default ("none"): Turned on when compiled with `-D ASSERT=all`. * "safe": Turned on by default. * ​\*Ts (`Writable`): The element types for the message arguments. * ​cpu\_only (`Bool`): If true, only run the assert on CPU. **Args:** * ​\*messages (`*Ts`): A set of [`Writable`](/mojo/stdlib/utils/write/Writable/) arguments to convert to a `String` message. `debug_assert[assert_mode: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("none"), *Ts: Writable = *?, *, cpu_only: Bool = False](cond: Bool, *messages: *Ts)` Asserts that the condition is true at run time. If the condition is false, the assertion displays the given message and causes the program to exit. You can pass in multiple arguments to generate a formatted message. 
No string allocation occurs unless the assertion is triggered. ```mojo x = 0 debug_assert(x > 0, "expected x to be more than 0 but got: ", x) ``` Normal assertions are off by default—they only run when the program is compiled with all assertions enabled. You can set the `assert_mode` to `safe` to create an assertion that's on by default: ```mojo debug_assert[assert_mode="safe"]( x > 0, "expected x to be more than 0 but got: ", x ) ``` Use the `ASSERT` variable to turn assertions on or off when building or running a Mojo program: ```sh mojo -D ASSERT=all main.mojo ``` The `ASSERT` variable takes the following values: * all: Turn on all assertions. * safe: Turn on "safe" assertions only. This is the default. * none: Turn off all assertions, for performance at the cost of safety. * warn: Turn on all assertions, but print any errors instead of exiting. To ensure that you have no run-time penalty from your assertions even when they're disabled, make sure there are no side effects in your message and condition expressions. For example: ```mojo person = "name: john, age: 50" name = "john" debug_assert(String("name: ") + name == person, "unexpected name") ``` This will have a run-time penalty due to allocating a `String` in the condition expression, even when assertions are disabled. To avoid this, put the condition inside a closure so it runs only when the assertion is turned on: ```mojo fn check_name() capturing -> Bool: return String("name: ") + name == person debug_assert[check_name]("unexpected name") ``` If you need to allocate, and so don't want the assert to ever run on GPU, you can set it to CPU only: ```mojo debug_assert[check_name, cpu_only=True]("unexpected name") ``` For compile-time assertions, see [`constrained()`](/mojo/stdlib/builtin/constrained/constrained). **Parameters:** * ​assert\_mode (`StringSlice[StaticConstantOrigin]`): Determines when the assert is turned on. * default ("none"): Turned on when compiled with `-D ASSERT=all`. * "safe": Turned on by default. * ​\*Ts (`Writable`): The element types for the message arguments. * ​cpu\_only (`Bool`): If true, only run the assert on CPU. **Args:** * ​cond (`Bool`): The bool value to assert. * ​\*messages (`*Ts`): A set of [`Writable`](/mojo/stdlib/utils/write/Writable/) arguments to convert to a `String` message. `debug_assert[assert_mode: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("none"), cpu_only: Bool = False](cond: Bool, message: StringLiteral[value])` Asserts that the condition is true at run time. If the condition is false, the assertion displays the given message and causes the program to exit. You can pass in multiple arguments to generate a formatted message. No string allocation occurs unless the assertion is triggered. ```mojo x = 0 debug_assert(x > 0, "expected x to be more than 0 but got: ", x) ``` Normal assertions are off by default—they only run when the program is compiled with all assertions enabled. You can set the `assert_mode` to `safe` to create an assertion that's on by default: ```mojo debug_assert[assert_mode="safe"]( x > 0, "expected x to be more than 0 but got: ", x ) ``` Use the `ASSERT` variable to turn assertions on or off when building or running a Mojo program: ```sh mojo -D ASSERT=all main.mojo ``` The `ASSERT` variable takes the following values: * all: Turn on all assertions. * safe: Turn on "safe" assertions only. This is the default. * none: Turn off all assertions, for performance at the cost of safety. 
* warn: Turn on all assertions, but print any errors instead of exiting. To ensure that you have no run-time penalty from your assertions even when they're disabled, make sure there are no side effects in your message and condition expressions. For example: ```mojo person = "name: john, age: 50" name = "john" debug_assert(String("name: ") + name == person, "unexpected name") ``` This will have a run-time penalty due to allocating a `String` in the condition expression, even when assertions are disabled. To avoid this, put the condition inside a closure so it runs only when the assertion is turned on: ```mojo fn check_name() capturing -> Bool: return String("name: ") + name == person debug_assert[check_name]("unexpected name") ``` If you need to allocate, and so don't want the assert to ever run on GPU, you can set it to CPU only: ```mojo debug_assert[check_name, cpu_only=True]("unexpected name") ``` For compile-time assertions, see [`constrained()`](/mojo/stdlib/builtin/constrained/constrained). **Parameters:** * ​assert\_mode (`StringSlice[StaticConstantOrigin]`): Determines when the assert is turned on. * default ("none"): Turned on when compiled with `-D ASSERT=all`. * "safe": Turned on by default. * ​cpu\_only (`Bool`): If true, only run the assert on CPU. **Args:** * ​cond (`Bool`): The bool value to assert. * ​message (`StringLiteral[value]`): A static string message. --- ## debug_assert Implements run-time assertions. These are Mojo built-ins, so you don't need to import them. ## Aliases ### `ASSERT_MODE` `alias ASSERT_MODE = env_get_string[::StringSlice[::Bool()` ## Functions * [​`debug_assert`](/mojo/stdlib/builtin/debug_assert/debug_assert): Asserts that the condition is true at run time. --- ## DevicePassable This trait marks types as passable to accelerator devices. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `device_type` `alias device_type` Indicate the type being used on accelerator devices. ## Methods ### `get_type_name` `static get_type_name() -> String` Gets the name of the host type (the one implementing this trait). For example, Int would return "Int", DeviceBuffer\[DType.float32] would return "DeviceBuffer\[DType.float32]". This is used for error messages when passing types to the device. TODO: This method will be retired soon when better kernel call error messages arrive. **Returns:** The host type's name. ### `get_device_type_name` `static get_device_type_name() -> String` Gets device\_type's name. For example, because DeviceBuffer's device\_type is UnsafePointer, DeviceBuffer\[DType.float32]'s get\_device\_type\_name() should return something like "UnsafePointer\[Scalar\[DType.float32]]". This is used for error messages when passing types to the device. TODO: This method will be retired soon when better kernel call error messages arrive. **Returns:** The device type's name. --- ## device_passable ## Traits * [​`DevicePassable`](/mojo/stdlib/builtin/device_passable/DevicePassable): This trait marks types as passable to accelerator devices. --- ## DType `@register_passable(trivial)` `struct DType` Represents DType and provides methods for working with it. ## Fields * ​value (`dtype`): The underlying storage for the DType value. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Hashable`, `Movable`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Aliases ### `bfloat16` `alias bfloat16` Represents a brain floating point value whose bitwidth is 16. 
### `bool` `alias bool` Represents a boolean data type. ### `float16` `alias float16` Represents an IEEE754-2008 `binary16` floating point value. ### `float32` `alias float32` Represents an IEEE754-2008 `binary32` floating point value. ### `float64` `alias float64` Represents an IEEE754-2008 `binary64` floating point value. ### `float8_e3m4` `alias float8_e3m4` Represents an 8-bit e3m4 floating point format, encoded as `seeemmmm`: - (s)ign: 1 bit - (e)xponent: 3 bits - (m)antissa: 4 bits - exponent bias: 3 - nan: 00111111, 11111111 - -0: 10000000 - fn: finite (no inf or -inf encodings) ### `float8_e4m3fn` `alias float8_e4m3fn` Represents the E4M3 floating point format defined in the [OFP8 standard](https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1). This type is named differently across libraries and vendors, for example: * Mojo, PyTorch, JAX, and LLVM refer to it as `e4m3fn`. * OCP, NVIDIA CUDA, and AMD ROCm refer to it as `e4m3`. In these contexts, they are all referring to the same finite type specified in the OFP8 standard above, encoded as `seeeemmm`: * (s)ign: 1 bit * (e)xponent: 4 bits * (m)antissa: 3 bits * exponent bias: 7 * nan: 01111111, 11111111 * -0: 10000000 * fn: finite (no inf or -inf encodings) ### `float8_e4m3fnuz` `alias float8_e4m3fnuz` Represents an 8-bit e4m3fnuz floating point format, encoded as `seeeemmm`: - (s)ign: 1 bit - (e)xponent: 4 bits - (m)antissa: 3 bits - exponent bias: 8 - nan: 10000000 - fn: finite (no inf or -inf encodings) - uz: unsigned zero (no -0 encoding) ### `float8_e5m2` `alias float8_e5m2` Represents the 8-bit E5M2 floating point format from the [OFP8 standard](https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1), encoded as `seeeeemm`: - (s)ign: 1 bit - (e)xponent: 5 bits - (m)antissa: 2 bits - exponent bias: 15 - nan: {0,1}11111{01,10,11} - inf: 01111100 - -inf: 11111100 - -0: 10000000 ### `float8_e5m2fnuz` `alias float8_e5m2fnuz` Represents an 8-bit e5m2fnuz floating point format, encoded as `seeeeemm`: - (s)ign: 1 bit - (e)xponent: 5 bits - (m)antissa: 2 bits - exponent bias: 16 - nan: 10000000 - fn: finite (no inf or -inf encodings) - uz: unsigned zero (no -0 encoding) ### `index` `alias index` Represents an integral type whose bitwidth is the maximum integral bitwidth on the system. ### `int128` `alias int128 = si128` Represents a signed integer type whose bitwidth is 128. ### `int16` `alias int16` Represents a signed integer type whose bitwidth is 16. ### `int256` `alias int256 = si256` Represents a signed integer type whose bitwidth is 256. ### `int32` `alias int32` Represents a signed integer type whose bitwidth is 32. ### `int64` `alias int64` Represents a signed integer type whose bitwidth is 64. ### `int8` `alias int8` Represents a signed integer type whose bitwidth is 8. ### `invalid` `alias invalid` Represents an invalid or unknown data type. ### `tensor_float32` `alias tensor_float32` Represents a special floating point format supported by NVIDIA Tensor Cores, with the same range as float32 and reduced precision (>=10 bits). Note that this dtype is only available on NVIDIA GPUs. ### `type` `alias type = dtype` ### `uint128` `alias uint128 = ui128` Represents an unsigned integer type whose bitwidth is 128. ### `uint16` `alias uint16` Represents an unsigned integer type whose bitwidth is 16. ### `uint256` `alias uint256 = ui256` Represents an unsigned integer type whose bitwidth is 256.
### `uint32` `alias uint32` Represents an unsigned integer type whose bitwidth is 32. ### `uint64` `alias uint64` Represents an unsigned integer type whose bitwidth is 64. ### `uint8` `alias uint8` Represents an unsigned integer type whose bitwidth is 8. ## Methods ### `__init__` `@implicit` `__init__(value: dtype) -> Self` Construct a DType from MLIR dtype. **Args:** * ​value (`dtype`): The MLIR dtype. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compares one DType to another for equality. **Args:** * ​rhs (`Self`): The DType to compare against. **Returns:** True if the DTypes are the same and False otherwise. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Compares one DType to another for inequality. **Args:** * ​rhs (`Self`): The DType to compare against. **Returns:** False if the DTypes are the same and True otherwise. ### `__is__` `__is__(self, rhs: Self) -> Bool` Compares one DType to another for equality. **Args:** * ​rhs (`Self`): The DType to compare against. **Returns:** True if the DTypes are the same and False otherwise. ### `__isnot__` `__isnot__(self, rhs: Self) -> Bool` Compares one DType to another for inequality. **Args:** * ​rhs (`Self`): The DType to compare against. **Returns:** True if the DTypes are different and False otherwise. ### `copy` `copy(self) -> Self` Copy this DType. **Returns:** A copy of the value. ### `__str__` `__str__(self) -> String` Gets the name of the DType. **Returns:** The name of the dtype. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this dtype to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__repr__` `__repr__(self) -> String` Gets the representation of the DType e.g. `"DType.float32"`. **Returns:** The representation of the dtype. ### `get_value` `get_value(self) -> dtype` Gets the associated internal kgen.dtype value. **Returns:** The kgen.dtype value. ### `__hash__` `__hash__(self) -> UInt` Return a 64-bit hash for this `DType` value. **Returns:** A 64-bit integer hash of this `DType` value. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with this `DType` value. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `is_unsigned` `is_unsigned(self) -> Bool` Returns True if the type parameter is unsigned and False otherwise. **Returns:** Returns True if the input type parameter is unsigned. ### `is_signed` `is_signed(self) -> Bool` Returns True if the type parameter is signed and False otherwise. **Returns:** Returns True if the input type parameter is signed. ### `is_integral` `is_integral(self) -> Bool` Returns True if the type parameter is an integer and False otherwise. **Returns:** Returns True if the input type parameter is an integer. ### `is_floating_point` `is_floating_point(self) -> Bool` Returns True if the type parameter is a floating-point and False otherwise. **Returns:** Returns True if the input type parameter is a floating-point. ### `is_float8` `is_float8(self) -> Bool` Returns True if the dtype is an 8-bit-precision floating point type, e.g. float8\_e5m2, float8\_e5m2fnuz, float8\_e4m3fn, and float8\_e4m3fnuz. **Returns:** True if the dtype is an 8-bit-precision float, false otherwise. ### `is_half_float` `is_half_float(self) -> Bool` Returns True if the dtype is a half-precision floating point type, e.g. either fp16 or bf16. **Returns:** True if the dtype is a half-precision float, false otherwise.
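To make the query methods above concrete, here is a minimal illustrative sketch (not from the upstream reference; the dtypes chosen are arbitrary examples):

```mojo
def main():
    var dt = DType.bfloat16
    print(dt.is_floating_point())         # True
    print(dt.is_half_float())             # True: bf16 is a half-precision type
    print(dt.is_integral())               # False
    print(DType.uint8.is_unsigned())      # True
    print(DType.float8_e5m2.is_float8())  # True
```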
### `is_numeric` `is_numeric(self) -> Bool` Returns True if the type parameter is numeric (i.e. you can perform arithmetic operations on it). **Returns:** Returns True if the input type parameter is either integral or floating-point. ### `sizeof` `sizeof(self) -> Int` Returns the size in bytes of the current DType. **Returns:** Returns the size in bytes of the current DType. ### `bitwidth` `bitwidth(self) -> Int` Returns the size in bits of the current DType. **Returns:** Returns the size in bits of the current DType. ### `dispatch_integral` `dispatch_integral[: origin.set, //, func: fn[DType]() capturing -> None](self)` Dispatches an integral function corresponding to the current DType. **Constraints:** DType must be integral. **Parameters:** * ​func (`fn[DType]() capturing -> None`): A function parametrized on dtype to dispatch. ### `dispatch_floating` `dispatch_floating[: origin.set, //, func: fn[DType]() capturing -> None](self)` Dispatches a floating-point function corresponding to the current DType. **Constraints:** DType must be floating-point or integral. **Parameters:** * ​func (`fn[DType]() capturing -> None`): A function parametrized on dtype to dispatch. ### `dispatch_arithmetic` `dispatch_arithmetic[: origin.set, //, func: fn[DType]() capturing -> None](self)` Dispatches a function corresponding to the current DType. **Parameters:** * ​func (`fn[DType]() capturing -> None`): A function parametrized on dtype to dispatch. ### `__mlir_type` `__mlir_type(self) -> !kgen.deferred` Returns the MLIR type of the current DType as an MLIR type. **Returns:** The MLIR type of the current DType. ### `get_dtype` `static get_dtype[T: AnyType, size: Int = 1]() -> Self` Get the `DType` if the given Type is a `SIMD[_, size]` of a `DType`. **Parameters:** * ​T (`AnyType`): AnyType. * ​size (`Int`): The SIMD size to compare against. **Returns:** The `DType` if matched, otherwise `DType.invalid`. ### `is_scalar` `static is_scalar[T: AnyType]() -> Bool` Whether the given Type is a Scalar of a DType. **Parameters:** * ​T (`AnyType`): AnyType. **Returns:** The result. --- ## dtype Implements the DType class. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`DType`](/mojo/stdlib/builtin/dtype/DType): Represents DType and provides methods for working with it. --- ## EqualityComparable A type which can be compared for equality with other instances of itself. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__eq__` `__eq__(self: _Self, other: _Self) -> Bool` Define whether two instances of the object are equal to each other. **Args:** * ​other (`_Self`): Another instance of the same type. **Returns:** True if the instances are equal according to the type's definition of equality, False otherwise. ### `__ne__` `__ne__(self: _Self, other: _Self) -> Bool` Define whether two instances of the object are not equal to each other. **Args:** * ​other (`_Self`): Another instance of the same type. **Returns:** True if the instances are not equal according to the type's definition of equality, False otherwise. --- ## equality_comparable ## Traits * [​`EqualityComparable`](/mojo/stdlib/builtin/equality_comparable/EqualityComparable): A type which can be compared for equality with other instances of itself. --- ## Error `@register_passable` `struct Error` This type represents an Error. ## Fields * ​data (`UnsafePointer[SIMD[uint8, 1]]`): A pointer to the beginning of the string data being referenced. * ​loaded\_length (`Int`): The length of the string being referenced.
Error instances conditionally own their error message. To reduce the size of the error instance, we use the sign bit of the length field to store the ownership value. When loaded\_length is negative, it indicates ownership and a free is executed in the destructor. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Default constructor. `@implicit` `__init__(value: StringLiteral[value]) -> Self` Construct an Error object with a given string literal. **Args:** * ​value (`StringLiteral[value]`): The error message. `@implicit` `__init__(src: String) -> Self` Construct an Error object with a given string. **Args:** * ​src (`String`): The error message. `@implicit` `__init__(src: StringSlice[origin]) -> Self` Construct an Error object with a given string ref. **Args:** * ​src (`StringSlice[origin]`): The error message. `__init__[*Ts: Writable](*args: *Ts, *, sep: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), end: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")) -> Self` Construct an Error by concatenating a sequence of Writable arguments. **Parameters:** * ​\*Ts (`Writable`): The types of the arguments to format. Each type must satisfy `Writable`. **Args:** * ​\*args (`*Ts`): A sequence of Writable arguments. * ​sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements. * ​end (`StringSlice[StaticConstantOrigin]`): The String to write after printing the elements. ### `__copyinit__` `__copyinit__(existing: Self) -> Self` Creates a deep copy of an existing error. **Args:** * ​existing (`Self`): The error to copy from. ### `__del__` `__del__(owned self)` Releases memory if allocated. ### `__bool__` `__bool__(self) -> Bool` Returns True if the error is set and false otherwise. **Returns:** True if the error object contains a value and False otherwise. ### `copy` `copy(self) -> Self` Copy the object. **Returns:** A copy of the value. ### `__str__` `__str__(self) -> String` Converts the Error to string representation. **Returns:** A String of the error message. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this error to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__repr__` `__repr__(self) -> String` Converts the Error to printable representation. **Returns:** A printable representation of the error message. ### `byte_length` `byte_length(self) -> Int` Get the length of the Error string in bytes. Notes: This does not include the trailing null terminator in the count. **Returns:** The length of the Error string in bytes. ### `unsafe_cstr_ptr` `unsafe_cstr_ptr(self) -> UnsafePointer[SIMD[int8, 1]]` Retrieves a C-string-compatible pointer to the underlying memory. The returned pointer is guaranteed to be NUL terminated, and not null. **Returns:** The pointer to the underlying memory. ### `as_string_slice` `as_string_slice(self) -> StringSlice[ImmutableAnyOrigin]` Returns a string slice of the data, which may be owned by the Error. Notes: Since the data is not guaranteed to be owned by the Error, the resulting StringSlice is given an ImmutableAnyOrigin. **Returns:** A string slice pointing to the data, which may be owned by the Error. --- ## error Implements the Error class. These are Mojo built-ins, so you don't need to import them.
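For example, a raised `Error` can be caught and inspected like this (a minimal illustrative sketch, not from the upstream reference):

```mojo
def main():
    try:
        raise Error("something went wrong")
    except e:
        print("caught:", e)  # prints: caught: something went wrong
```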
## Structs * [​`Error`](/mojo/stdlib/builtin/error/Error): This type represents an Error. --- ## FileHandle `struct FileHandle` File handle to an opened file. ## Fields * ​handle (`UnsafePointer[NoneType]`): The underlying pointer to the file handle. ## Implemented traits `AnyType`, `Defaultable`, `Movable`, `UnknownDestructibility`, `Writer` ## Methods ### `__init__` `__init__(out self)` Default constructor. `__init__(out self, path: StringSlice[origin], mode: StringSlice[origin])` Construct the FileHandle using the file path and mode. **Args:** * ​path (`StringSlice[origin]`): The file path. * ​mode (`StringSlice[origin]`): The mode to open the file in (the mode can be "r" or "w" or "rw"). ### `__moveinit__` `__moveinit__(out self, owned existing: Self)` Move constructor for the file handle. **Args:** * ​existing (`Self`): The existing file handle. ### `__del__` `__del__(owned self)` Closes the file handle. ### `close` `close(mut self)` Closes the file handle. ### `read` `read(self, size: Int = -1) -> String` Reads data from a file and sets the file handle seek position. If size is left as the default of -1, it will read to the end of the file. Setting size to a number larger than what's in the file will set the String length to the total number of bytes, and read all the data. Examples: Read the entire file into a String: ```mojo var file = open("/tmp/example.txt", "r") var string = file.read() print(string) ``` Read the first 8 bytes, skip 2 bytes, and then read the next 8 bytes: ```mojo import os var file = open("/tmp/example.txt", "r") var word1 = file.read(8) print(word1) _ = file.seek(2, os.SEEK_CUR) var word2 = file.read(8) print(word2) ``` Read the last 8 bytes in the file, then the first 8 bytes: ```mojo _ = file.seek(-8, os.SEEK_END) var last_word = file.read(8) print(last_word) _ = file.seek(8, os.SEEK_SET) # os.SEEK_SET is the default start of file var first_word = file.read(8) print(first_word) ``` **Args:** * ​size (`Int`): Requested number of bytes to read (Default: -1 = EOF). **Returns:** The contents of the file. **Raises:** An error if this file handle is invalid, or if the file read returned a failure. `read[dtype: DType, origin: MutableOrigin](self, buffer: Span[SIMD[dtype, 1], origin]) -> Int` Read data from the file into the Span. This will read n bytes from the file into the input Span, where `0 <= n <= len(buffer)`. **Parameters:** * ​dtype (`DType`): The type that the data will be represented as. * ​origin (`MutableOrigin`): The origin of the passed in Span. **Args:** * ​buffer (`Span[SIMD[dtype, 1], origin]`): The mutable Span to read data into. **Returns:** The total amount of data that was read in bytes. **Raises:** An error if this file handle is invalid, or if the file read returned a failure. ### `read_bytes` `read_bytes(self, size: Int = -1) -> List[SIMD[uint8, 1]]` Reads data from a file and sets the file handle seek position. If size is left as the default of -1, it will read to the end of the file. Setting size to a number larger than what's in the file will set the List length to the total number of bytes in the file.
Examples: Reading the entire file into a List\[UInt8]: ```mojo var file = open("/tmp/example.txt", "r") var data = file.read_bytes() ``` Reading the first 8 bytes, skipping 2 bytes, and then reading the next 8 bytes: ```mojo import os var file = open("/tmp/example.txt", "r") var list1 = file.read_bytes(8) _ = file.seek(2, os.SEEK_CUR) var list2 = file.read_bytes(8) ``` Reading the last 8 bytes in the file, then the first 8 bytes: ```mojo import os var file = open("/tmp/example.txt", "r") _ = file.seek(-8, os.SEEK_END) var last_data = file.read_bytes(8) _ = file.seek(8, os.SEEK_SET) # os.SEEK_SET is the default start of file var first_data = file.read_bytes(8) ``` **Args:** * ​size (`Int`): Requested number of bytes to read (Default: -1 = EOF). **Returns:** The contents of the file. **Raises:** An error if this file handle is invalid, or if the file read returned a failure. ### `seek` `seek(self, offset: SIMD[uint64, 1], whence: SIMD[uint8, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> SIMD[uint64, 1]` Seeks to the given offset in the file. Examples: Skip 32 bytes from the current read position: ```mojo import os var f = open("/tmp/example.txt", "r") _ = f.seek(32, os.SEEK_CUR) ``` Start from 32 bytes from the end of the file: ```mojo import os var f = open("/tmp/example.txt", "r") _ = f.seek(-32, os.SEEK_END) ``` **Args:** * ​offset (`SIMD[uint64, 1]`): The byte offset to seek to. * ​whence (`SIMD[uint8, 1]`): The reference point for the offset: os.SEEK\_SET = 0: start of file (Default). os.SEEK\_CUR = 1: current position. os.SEEK\_END = 2: end of file. **Returns:** The resulting byte offset from the start of the file. **Raises:** An error if this file handle is invalid, or if file seek returned a failure. ### `write_bytes` `write_bytes(mut self, bytes: Span[SIMD[uint8, 1], origin])` Write a span of bytes to the file. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The byte span to write to this file. ### `write` `write[*Ts: Writable](mut self, *args: *Ts)` Write a sequence of Writable arguments to the provided Writer. **Parameters:** * ​\*Ts (`Writable`): Types of the provided argument sequence. **Args:** * ​\*args (`*Ts`): Sequence of arguments to write to this Writer. ### `__enter__` `__enter__(owned self) -> Self` The function to call when entering the context. **Returns:** The file handle. --- ## file Provides APIs to read and write files. These are Mojo built-ins, so you don't need to import them. For example, here's how to read a file: ```mojo var f = open("my_file.txt", "r") print(f.read()) f.close() ``` Or use a `with` statement to close the file automatically: ```mojo with open("my_file.txt", "r") as f: print(f.read()) ``` ## Structs * [​`FileHandle`](/mojo/stdlib/builtin/file/FileHandle): File handle to an opened file. ## Functions * [​`open`](/mojo/stdlib/builtin/file/open): Opens the file specified by path using the mode provided, returning a FileHandle. --- ## open `open[PathLike: PathLike](path: PathLike, mode: StringSlice[origin]) -> FileHandle` Opens the file specified by path using the mode provided, returning a FileHandle. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the file to open. * ​mode (`StringSlice[origin]`): The mode to open the file in (the mode can be "r" or "w"). **Returns:** A file handle. --- ## FileDescriptor `@register_passable(trivial)` `struct FileDescriptor` File descriptor of a file. ## Fields * ​value (`Int`): The underlying value of the file descriptor.
## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`, `Writer` ## Methods ### `__init__` `__init__(value: Int = 1) -> Self` Constructs the file descriptor from an integer. **Args:** * ​value (`Int`): The file identifier (Default 1 = stdout). `@implicit` `__init__(f: FileHandle) -> Self` Constructs the file descriptor from a file handle. **Args:** * ​f (`FileHandle`): The file handle. ### `__write_bytes_cpu` `__write_bytes_cpu(mut self, bytes: Span[SIMD[uint8, 1], origin])` Write a span of bytes to the file. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The byte span to write to this file. ### `write_bytes` `write_bytes(mut self, bytes: Span[SIMD[uint8, 1], origin])` Write a span of bytes to the file. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The byte span to write to this file. ### `read_bytes` `read_bytes(mut self, buffer: Span[SIMD[uint8, 1], origin]) -> UInt` Read a number of bytes from the file into a buffer. Notes: [Reference](https://pubs.opengroup.org/onlinepubs/9799919799/functions/read.html). **Args:** * ​buffer (`Span[SIMD[uint8, 1], origin]`): A `Span[Byte]` to read bytes into. Read up to `len(buffer)` number of bytes. **Returns:** Actual number of bytes read. ### `write` `write[*Ts: Writable](mut self, *args: *Ts)` Write a sequence of Writable arguments to the provided Writer. **Parameters:** * ​\*Ts (`Writable`): Types of the provided argument sequence. **Args:** * ​\*args (`*Ts`): Sequence of arguments to write to this Writer. --- ## file_descriptor Higher-level abstraction for file streams. These are Mojo built-ins, so you don't need to import them. For example, here's how to print to a file: ```mojo var f = open("my_file.txt", "w") print("hello", file=f^) ``` ## Structs * [​`FileDescriptor`](/mojo/stdlib/builtin/file_descriptor/FileDescriptor): File descriptor of a file. --- ## FloatLiteral `@register_passable(trivial)` `struct FloatLiteral[value: !pop.float_literal]` Mojo floating point literal type. ## Parameters * ​value (`!pop.float_literal`): The underlying infinite precision floating point value. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `Floatable`, `ImplicitlyBoolable`, `Intable`, `Movable`, `Stringable`, `UnknownDestructibility` ## Aliases ### `infinity` `alias infinity = inf` ### `nan` `alias nan` ### `negative_infinity` `alias negative_infinity = -inf` ### `negative_zero` `alias negative_zero = -0.0` ## Methods ### `__init__` `__init__() -> Self` Create a FloatLiteral for any parameter value. `@implicit` `__init__(value: IntLiteral[value]) -> FloatLiteral[#pop.int_to_float_literal]` Convert an IntLiteral to a FloatLiteral value. **Args:** * ​value (`IntLiteral[value]`): The IntLiteral value. ### `__bool__` `__bool__(self) -> Bool` A FloatLiteral value is true if it is non-zero. **Returns:** True if non-zero. ### `__neg__` `__neg__(self) -> FloatLiteral[#pop.float_literal_bin>]` Return the negation of the FloatLiteral value. **Returns:** The negated FloatLiteral value. ### `__lt__` `__lt__(self, rhs: FloatLiteral[value]) -> Bool` Less than comparison. **Args:** * ​rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if this value is less than `rhs`. ### `__le__` `__le__(self, rhs: FloatLiteral[value]) -> Bool` Less than or equal to comparison. **Args:** * ​rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if this value is less than or equal to `rhs`. ### `__eq__` `__eq__(self, rhs: FloatLiteral[value]) -> Bool` Compare for equality.
**Args:** * ​rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if they are equal. ### `__ne__` `__ne__(self, rhs: FloatLiteral[value]) -> Bool` Compare for inequality. **Args:** * ​rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if they are not equal. ### `__gt__` `__gt__(self, rhs: FloatLiteral[value]) -> Bool` Greater than comparison. **Args:** * ​rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if this value is greater than `rhs`. ### `__ge__` `__ge__(self, rhs: FloatLiteral[value]) -> Bool` Greater than or equal to comparison. **Args:** * ​rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if this value is greater than or equal to `rhs`. ### `__add__` `__add__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Add two FloatLiterals. **Args:** * ​rhs (`FloatLiteral[value]`): The value to add. **Returns:** The sum of the two values. ### `__sub__` `__sub__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Subtract two FloatLiterals. **Args:** * ​rhs (`FloatLiteral[value]`): The value to subtract. **Returns:** The difference of the two values. ### `__mul__` `__mul__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Multiply two FloatLiterals. **Args:** * ​rhs (`FloatLiteral[value]`): The value to multiply. **Returns:** The product of the two values. ### `__truediv__` `__truediv__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Divide two FloatLiterals. **Args:** * ​rhs (`FloatLiteral[value]`): The value to divide. **Returns:** The quotient of the two values. ### `__floordiv__` `__floordiv__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Returns self divided by rhs, rounded down to the nearest integer. **Args:** * ​rhs (`FloatLiteral[value]`): The divisor value. **Returns:** `floor(self / rhs)` value. ### `__mod__` `__mod__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin, value>>]` Return the remainder of self divided by rhs. **Args:** * ​rhs (`FloatLiteral[value]`): The value to divide on. **Returns:** The remainder of dividing self by rhs. ### `__radd__` `__radd__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Reversed addition operator. **Args:** * ​rhs (`FloatLiteral[value]`): The value to add. **Returns:** The sum of this and the given value. ### `__rsub__` `__rsub__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Reversed subtraction operator. **Args:** * ​rhs (`FloatLiteral[value]`): The value to subtract from. **Returns:** The result of subtracting this from the given value. ### `__rmul__` `__rmul__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Reversed multiplication operator. **Args:** * ​rhs (`FloatLiteral[value]`): The value to multiply. **Returns:** The product of the given number and this. ### `__rtruediv__` `__rtruediv__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Reversed division. **Args:** * ​rhs (`FloatLiteral[value]`): The value to be divided by this. **Returns:** The result of dividing the given value by this. ### `__rfloordiv__` `__rfloordiv__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Returns rhs divided by self, rounded down to the nearest integer. **Args:** * ​rhs (`FloatLiteral[value]`): The value to be divided by self. **Returns:** `floor(rhs / self)` value. 
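As a quick illustration of the arithmetic methods above (a minimal sketch added for illustration; this is plain literal arithmetic, folded at compile time):

```mojo
def main():
    alias q = 7.5 // 2.0  # __floordiv__: floor(7.5 / 2.0) == 3.0
    alias r = 7.5 % 2.0   # __mod__: 7.5 - 3.0 * 2.0 == 1.5
    print(q, r)           # prints: 3.0 1.5
```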
### `__rmod__` `__rmod__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin, value>>]` Return the remainder of rhs divided by self. **Args:** * ​rhs (`FloatLiteral[value]`): The value to divide on. **Returns:** The remainder of dividing rhs by self. ### `is_nan` `is_nan(self) -> Bool` Return whether the FloatLiteral is nan. Since `nan == nan` is False, this provides a way to check for nan-ness. **Returns:** True, if the value is nan, False otherwise. ### `is_neg_zero` `is_neg_zero(self) -> Bool` Return whether the FloatLiteral is negative zero. Since `FloatLiteral.negative_zero == 0.0` is True, this provides a way to check if the FloatLiteral is negative zero. **Returns:** True, if the value is negative zero, False otherwise. ### `__str__` `__str__(self) -> String` Get the float as a string. **Returns:** A string representation. ### `__int_literal__` `__int_literal__(self) -> IntLiteral[#pop.float_to_int_literal]` Casts the floating point value to an IntLiteral. If there is a fractional component, then the value is truncated towards zero. E.g. `(4.5).__int_literal__()` returns `4`, and `(-3.7).__int_literal__()` returns `-3`. **Returns:** The value as an integer. ### `__int__` `__int__(self) -> Int` Converts the FloatLiteral value to an Int. If there is a fractional component, then the value is truncated towards zero. E.g. `(4.5).__int__()` returns `4`, and `(-3.7).__int__()` returns `-3`. **Returns:** The value as an integer. ### `__float__` `__float__(self) -> SIMD[float64, 1]` Converts the FloatLiteral to a concrete Float64. **Returns:** The Float value. ### `__as_bool__` `__as_bool__(self) -> Bool` A FloatLiteral value is true if it is non-zero. **Returns:** True if non-zero. ### `__ceildiv__` `__ceildiv__(self, denominator: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin>>, #pop.float_literal>]` Return the rounded-up result of dividing self by denominator. **Args:** * ​denominator (`FloatLiteral[value]`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. --- ## float_literal Implements the FloatLiteral class. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`FloatLiteral`](/mojo/stdlib/builtin/float_literal/FloatLiteral): Mojo floating point literal type. --- ## Floatable The `Floatable` trait describes a type that can be converted to a Float64. This trait requires the type to implement the `__float__()` method. For example: ```mojo struct Foo(Floatable): var i: Float64 fn __float__(self) -> Float64: return self.i ``` A `Foo` can now be converted to a `Float64`: ```mojo var f = Float64(Foo(5.5)) ``` **Note:** If the `__float__()` method can raise an error, use the [`FloatableRaising`](/mojo/stdlib/builtin/floatable/floatableraising) trait instead. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__float__` `__float__(self: _Self) -> SIMD[float64, 1]` Get the floating point representation of the value. **Returns:** The floating point representation of the value. --- ## FloatableRaising The `FloatableRaising` trait describes a type that can be converted to a Float64, but the conversion might raise an error (e.g.: a string). This trait requires the type to implement the `__float__()` method, which can raise an error.
For example: ```mojo from utils import Variant struct MaybeFloat(FloatableRaising): var value: Variant[Float64, NoneType] fn __float__(self) raises -> Float64: if self.value.isa[NoneType](): raise "Float expected" return self.value[Float64] ``` A `MaybeFloat` can now be converted to `Float64`: ```mojo try: print(Float64(MaybeFloat(4.6))) except: print("error occurred") ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__float__` `__float__(self: _Self) -> SIMD[float64, 1]` Get the floating point representation of the value. **Returns:** The floating point representation of the value. **Raises:** If the type does not have a floating point representation. --- ## floatable Implements the `Floatable` and `FloatableRaising` traits. These are Mojo built-ins, so you don't need to import them. ## Traits * [​`Floatable`](/mojo/stdlib/builtin/floatable/Floatable): The `Floatable` trait describes a type that can be converted to a Float64. * [​`FloatableRaising`](/mojo/stdlib/builtin/floatable/FloatableRaising): The `FloatableRaising` trait describes a type that can be converted to a Float64, but the conversion might raise an error (e.g.: a string). --- ## bin `bin(num: SIMD[dtype, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0b")) -> String` Return the binary string representation of an integral value. ```mojo print(bin(123)) print(bin(-123)) ``` ```plaintext '0b1111011' '-0b1111011' ``` **Args:** * ​num (`SIMD[dtype, 1]`): An integral scalar value. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** The binary string representation of num. `bin(b: SIMD[bool, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0b")) -> String` Returns the binary representation of a scalar bool. **Args:** * ​b (`SIMD[bool, 1]`): A scalar bool value. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** The binary string representation of b. `bin[T: Intable, //](num: T, /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0b")) -> String` Returns the binary representation of an indexer type. **Parameters:** * ​T (`Intable`): The Intable type. **Args:** * ​num (`T`): An indexer value. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** The binary string representation of num. --- ## hex `hex(value: SIMD[dtype, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0x")) -> String` Returns the hex string representation of the given integer. The hexadecimal representation is a base-16 encoding of the integer value. The returned string will be prefixed with "0x" to indicate that the subsequent digits are hex. **Args:** * ​value (`SIMD[dtype, 1]`): The integer value to format. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the hex representation of the given integer. `hex[T: Intable, //](value: T, /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0x")) -> String` Returns the hex string representation of the given integer. The hexadecimal representation is a base-16 encoding of the integer value. The returned string will be prefixed with "0x" to indicate that the subsequent digits are hex. **Parameters:** * ​T (`Intable`): The indexer type to represent in hexadecimal. **Args:** * ​value (`T`): The integer value to format.
* ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the hex representation of the given integer. `hex(value: SIMD[bool, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0x")) -> String` Returns the hex string representation of the given scalar bool. The hexadecimal representation is a base-16 encoding of the bool. The returned string will be prefixed with "0x" to indicate that the subsequent digits are hex. **Args:** * ​value (`SIMD[bool, 1]`): The bool value to format. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the hex representation of the given bool. --- ## format_int Provides the `hex` and `bin` functions. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`bin`](/mojo/stdlib/builtin/format_int/bin): Return the binary string representation of an integral value. * [​`hex`](/mojo/stdlib/builtin/format_int/hex): Returns the hex string representation of the given integer. * [​`oct`](/mojo/stdlib/builtin/format_int/oct): Returns the octal string representation of the given integer. --- ## oct `oct(value: SIMD[dtype, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0o")) -> String` Returns the octal string representation of the given integer. The octal representation is a base-8 encoding of the integer value. The returned string will be prefixed with "0o" to indicate that the subsequent digits are octal. **Args:** * ​value (`SIMD[dtype, 1]`): The integer value to format. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the octal representation of the given integer. `oct[T: Intable, //](value: T, /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0o")) -> String` Returns the octal string representation of the given integer. The octal representation is a base-8 encoding of the integer value. The returned string will be prefixed with "0o" to indicate that the subsequent digits are octal. **Parameters:** * ​T (`Intable`): The intable type to represent in octal. **Args:** * ​value (`T`): The integer value to format. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the octal representation of the given integer. `oct(value: SIMD[bool, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0o")) -> String` Returns the octal string representation of the given scalar bool. The octal representation is a base-8 encoding of the bool. The returned string will be prefixed with "0o" to indicate that the subsequent digits are octal. **Args:** * ​value (`SIMD[bool, 1]`): The bool value to format. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the octal representation of the given bool. --- ## Identifiable The Identifiable trait denotes a type with an identity which can be compared with other instances of itself. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__is__` `__is__(self: _Self, rhs: _Self) -> Bool` Define whether `self` has the same identity as `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is `rhs`. ### `__isnot__` `__isnot__(self: _Self, rhs: _Self) -> Bool` Define whether `self` has a different identity than `rhs`.
**Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is not `rhs`. --- ## identifiable ## Traits * [​`Identifiable`](/mojo/stdlib/builtin/identifiable/Identifiable): The Identifiable trait denotes a type with an identity which can be compared with other instances of itself. --- ## builtin Implements the builtin package. ## Modules * [​`anytype`](/mojo/stdlib/builtin/anytype/): Defines the core traits for object lifetime management in Mojo. * [​`bool`](/mojo/stdlib/builtin/bool/): Implements the Bool class. * [​`breakpoint`](/mojo/stdlib/builtin/breakpoint/): This module includes the builtin breakpoint function. * [​`builtin_slice`](/mojo/stdlib/builtin/builtin_slice/): Implements slice. * [​`comparable`](/mojo/stdlib/builtin/comparable/): * [​`constrained`](/mojo/stdlib/builtin/constrained/): Implements compile-time constraints. * [​`coroutine`](/mojo/stdlib/builtin/coroutine/): Implements classes and methods for coroutines. * [​`debug_assert`](/mojo/stdlib/builtin/debug_assert/): Implements run-time assertions. * [​`device_passable`](/mojo/stdlib/builtin/device_passable/): * [​`dtype`](/mojo/stdlib/builtin/dtype/): Implements the DType class. * [​`equality_comparable`](/mojo/stdlib/builtin/equality_comparable/): * [​`error`](/mojo/stdlib/builtin/error/): Implements the Error class. * [​`file`](/mojo/stdlib/builtin/file/): Provides APIs to read and write files. * [​`file_descriptor`](/mojo/stdlib/builtin/file_descriptor/): Higher level abstraction for file stream. * [​`float_literal`](/mojo/stdlib/builtin/float_literal/): Implements the FloatLiteral class. * [​`floatable`](/mojo/stdlib/builtin/floatable/): Implements the `Floatable` and `FloatableRaising` traits. * [​`format_int`](/mojo/stdlib/builtin/format_int/): Provides the `hex` and `bin` functions. * [​`identifiable`](/mojo/stdlib/builtin/identifiable/): * [​`int`](/mojo/stdlib/builtin/int/): Implements the Int class. * [​`int_literal`](/mojo/stdlib/builtin/int_literal/): Implements the IntLiteral class. * [​`io`](/mojo/stdlib/builtin/io/): Provides utilities for working with input/output. * [​`len`](/mojo/stdlib/builtin/len/): Provides the `len()` function and its associated traits. * [​`math`](/mojo/stdlib/builtin/math/): Defines basic math functions for use in the open source parts of the standard library since the `math` package is currently closed source and cannot be depended on in the open source parts of the standard library. * [​`none`](/mojo/stdlib/builtin/none/): Defines the builtin `NoneType`. * [​`range`](/mojo/stdlib/builtin/range/): Implements a 'range' call. * [​`rebind`](/mojo/stdlib/builtin/rebind/): Implements type rebind. * [​`repr`](/mojo/stdlib/builtin/repr/): Provide the `repr` function. * [​`reversed`](/mojo/stdlib/builtin/reversed/): Provides the `reversed` function for reverse iteration over collections. * [​`simd`](/mojo/stdlib/builtin/simd/): Implements SIMD primitives and abstractions. * [​`sort`](/mojo/stdlib/builtin/sort/): Implements the built-in `sort` function. * [​`str`](/mojo/stdlib/builtin/str/): Provides the `str` function. * [​`string_literal`](/mojo/stdlib/builtin/string_literal/): Implements the StringLiteral struct. * [​`swap`](/mojo/stdlib/builtin/swap/): Implements the built-in `swap` function. * [​`tuple`](/mojo/stdlib/builtin/tuple/): Implements the Tuple type. * [​`type_aliases`](/mojo/stdlib/builtin/type_aliases/): Defines some type aliases. * [​`uint`](/mojo/stdlib/builtin/uint/): Implements the UInt class. 
* [​`value`](/mojo/stdlib/builtin/value/): Defines core value traits. * [​`variadics`](/mojo/stdlib/builtin/variadics/): Implements the VariadicList and VariadicPack types. --- ## ImplicitlyIntable The `ImplicitlyIntable` trait describes a type that can be converted to an Int implicitly. This trait requires the type to implement the `__as_int__()` method. For example: ```mojo struct Foo(ImplicitlyIntable): var i: Int fn __int__(self) -> Int: return self.i fn __as_int__(self) -> Int: return self.__int__() ``` Now you can use `Foo` anywhere that an `Int` is expected, e.g. equality checks: ```mojo foo = Foo(42) assert_equal(Int(42), foo) ``` ## Implemented traits `AnyType`, `Copyable`, `Intable`, `Movable`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `__as_int__` `__as_int__(self: _Self) -> Int` Implicitly convert to an integral representation of the value, wherever an `Int` is expected. **Returns:** The integral representation of the value. ### `__int__` `__int__(self: _Self) -> Int` Get the integral representation of the value. **Returns:** The integral representation of the value. --- ## Indexer The `Indexer` trait is used for types that can index into a collection or pointer. The type returned is the underlying \_\_mlir\_type.index, enabling types like `UInt` to not have to be converted to an `Int` first. This type is implicitly convertible to an `Int`, so can be used anywhere an `Int` can e.g. for comparisons. ## Implemented traits `AnyType`, `Copyable`, `Intable`, `Movable`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `__index__` `__index__(self: _Self) -> index` Convert to index. **Returns:** The corresponding \_\_mlir\_type.index value. ### `__int__` `__int__(self: _Self) -> Int` Get the integral representation of the value. **Returns:** The integral representation of the value. --- ## Int `@register_passable(trivial)` `struct Int` This type represents an integer value. ## Fields * ​value (`index`): The underlying storage for the integer value. ## Implemented traits `Absable`, `AnyType`, `Boolable`, `CeilDivable`, `Ceilable`, `ConvertibleFromPython`, `Copyable`, `Defaultable`, `DevicePassable`, `EqualityComparable`, `ExplicitlyCopyable`, `Floorable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `Hashable`, `ImplicitlyBoolable`, `Indexer`, `Intable`, `IntervalElement`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `Powable`, `PythonConvertible`, `Representable`, `Roundable`, `Stringable`, `Truncable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Aliases ### `BITWIDTH` `alias BITWIDTH = __init__[::Intable](bitwidthof[::DType,__mlir_type.!kgen.target]())` The bit width of the integer type. ### `device_type` `alias device_type = Int` Int is remapped to the same type when passed to accelerator devices. 
### `MAX` `alias MAX = __init__[::Intable](SIMD(max_or_inf[::DType]()))` Returns the maximum integer value. ### `MIN` `alias MIN = __init__[::Intable](SIMD(min_or_neg_inf[::DType]()))` Returns the minimum value of type. ## Methods ### `__init__` `__init__() -> Self` Default constructor that produces zero. `@implicit` `__init__(value: IntLiteral[value]) -> Self` Construct Int from the given IntLiteral value. **Args:** * ​value (`IntLiteral[value]`): The init value. `@implicit` `__init__(value: UInt) -> Self` Construct Int from the given UInt value. **Args:** * ​value (`UInt`): The init value. `__init__[T: Intable](value: T) -> Self` Get the Int representation of the value. **Parameters:** * ​T (`Intable`): The Intable type. **Args:** * ​value (`T`): The object to get the integral representation of. `__init__[T: IntableRaising](out self, value: T)` Get the Int representation of the value. **Parameters:** * ​T (`IntableRaising`): The Intable type. **Args:** * ​value (`T`): The object to get the integral representation of. **Raises:** If the type does not have an integral representation. `@implicit` `__init__[I: ImplicitlyIntable](value: I) -> Self` Construct Int from implicitly convertible type. **Parameters:** * ​I (`ImplicitlyIntable`): The type that is implicitly convertible to an `Int`. **Args:** * ​value (`I`): The init value. `__init__(out self, value: StringSlice[origin], base: UInt = UInt(10))` Parses and returns the given string as an integer in the given base. If base is set to 0, the string is parsed as an Integer literal, with the following considerations: * '0b' or '0B' prefix indicates binary (base 2) * '0o' or '0O' prefix indicates octal (base 8) * '0x' or '0X' prefix indicates hexadecimal (base 16) * Without a prefix, it's treated as decimal (base 10) Examples: ```mojo print(Int("32")) print(Int("FF", 16)) print(Int("0xFF", 0)) print(Int("0b1010", 0)) ``` ```plaintext 32 255 255 10 ``` Notes: This follows [Python's integer literals](https://docs.python.org/3/reference/lexical_analysis.html#integers). **Args:** * ​value (`StringSlice[origin]`): A string to be parsed as an integer in the given base. * ​base (`UInt`): Base used for conversion, value must be between 2 and 36, or 0. **Raises:** If the given string cannot be parsed as an integer value or if an incorrect base is provided. ### `__bool__` `__bool__(self) -> Bool` Convert this Int to Bool. **Returns:** False Bool value if the value is equal to 0 and True otherwise. ### `__neg__` `__neg__(self) -> Self` Return -self. **Returns:** The -self value. ### `__pos__` `__pos__(self) -> Self` Return +self. **Returns:** The +self value. ### `__invert__` `__invert__(self) -> Self` Return \~self. **Returns:** The \~self value. ### `__lt__` `__lt__(self, rhs: Self) -> Bool` Compare this Int to the RHS using LT comparison. **Args:** * ​rhs (`Self`): The other Int to compare against. **Returns:** True if this Int is less-than the RHS Int and False otherwise. ### `__le__` `__le__(self, rhs: Self) -> Bool` Compare this Int to the RHS using LE comparison. **Args:** * ​rhs (`Self`): The other Int to compare against. **Returns:** True if this Int is less-or-equal than the RHS Int and False otherwise. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compare this Int to the RHS using EQ comparison. **Args:** * ​rhs (`Self`): The other Int to compare against. **Returns:** True if this Int is equal to the RHS Int and False otherwise. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Compare this Int to the RHS using NE comparison.
**Args:** * ​rhs (`Self`): The other Int to compare against. **Returns:** True if this Int is non-equal to the RHS Int and False otherwise. ### `__gt__` `__gt__(self, rhs: Self) -> Bool` Compare this Int to the RHS using GT comparison. **Args:** * ​rhs (`Self`): The other Int to compare against. **Returns:** True if this Int is greater than the RHS Int and False otherwise. ### `__ge__` `__ge__(self, rhs: Self) -> Bool` Compare this Int to the RHS using GE comparison. **Args:** * ​rhs (`Self`): The other Int to compare against. **Returns:** True if this Int is greater-or-equal than the RHS Int and False otherwise. ### `__add__` `__add__(self, rhs: Self) -> Self` Return `self + rhs`. **Args:** * ​rhs (`Self`): The value to add. **Returns:** `self + rhs` value. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Return `self - rhs`. **Args:** * ​rhs (`Self`): The value to subtract. **Returns:** `self - rhs` value. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Return `self * rhs`. **Args:** * ​rhs (`Self`): The value to multiply with. **Returns:** `self * rhs` value. ### `__truediv__` `__truediv__(self, rhs: Self) -> SIMD[float64, 1]` Return the floating point division of `self` and `rhs`. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** `Float64(self)/Float64(rhs)` value. ### `__floordiv__` `__floordiv__(self, rhs: Self) -> Self` Return the division of `self` and `rhs` rounded down to the nearest integer. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** `floor(self/rhs)` value. ### `__mod__` `__mod__(self, rhs: Self) -> Self` Return the remainder of self divided by rhs. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** The remainder of dividing self by rhs. ### `__pow__` `__pow__(self, exp: Self) -> Self` Return the value raised to the power of the given exponent. Computes the power of an integer using the Russian Peasant Method. **Args:** * ​exp (`Self`): The exponent value. **Returns:** The value of `self` raised to the power of `exp`. ### `__lshift__` `__lshift__(self, rhs: Self) -> Self` Return `self << rhs`. **Args:** * ​rhs (`Self`): The value to shift with. **Returns:** `self << rhs`. ### `__rshift__` `__rshift__(self, rhs: Self) -> Self` Return `self >> rhs`. **Args:** * ​rhs (`Self`): The value to shift with. **Returns:** `self >> rhs`. ### `__and__` `__and__(self, rhs: Self) -> Self` Return `self & rhs`. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self & rhs`. ### `__or__` `__or__(self, rhs: Self) -> Self` Return `self | rhs`. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self | rhs`. ### `__xor__` `__xor__(self, rhs: Self) -> Self` Return `self ^ rhs`. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self ^ rhs`. ### `__radd__` `__radd__(self, value: Self) -> Self` Return `value + self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value + self`. ### `__rsub__` `__rsub__(self, value: Self) -> Self` Return `value - self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value - self`. ### `__rmul__` `__rmul__(self, value: Self) -> Self` Return `value * self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value * self`. ### `__rfloordiv__` `__rfloordiv__(self, value: Self) -> Self` Return `value // self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value // self`. ### `__rmod__` `__rmod__(self, value: Self) -> Self` Return `value % self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value % self`.
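To make the mapping between these dunder methods and ordinary operator syntax concrete, here is a small sketch (the values are arbitrary):

```mojo
var x = 7
print(x // 2, x % 2)   # __floordiv__ and __mod__: 3 1
print(x << 1, x >> 1)  # __lshift__ and __rshift__: 14 3
print(x & 3, x | 8)    # __and__ and __or__: 3 15
print(x ** 2)          # __pow__ (Russian Peasant Method): 49
```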
### `__rpow__` `__rpow__(self, value: Self) -> Self` Return `pow(value,self)`. **Args:** * ​value (`Self`): The other value. **Returns:** `pow(value,self)`. ### `__rlshift__` `__rlshift__(self, value: Self) -> Self` Return `value << self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value << self`. ### `__rrshift__` `__rrshift__(self, value: Self) -> Self` Return `value >> self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value >> self`. ### `__rand__` `__rand__(self, value: Self) -> Self` Return `value & self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value & self`. ### `__ror__` `__ror__(self, value: Self) -> Self` Return `value | self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value | self`. ### `__rxor__` `__rxor__(self, value: Self) -> Self` Return `value ^ self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value ^ self`. ### `__iadd__` `__iadd__(mut self, rhs: Self)` Compute `self + rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__isub__` `__isub__(mut self, rhs: Self)` Compute `self - rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__imul__` `__imul__(mut self, rhs: Self)` Compute `self * rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__itruediv__` `__itruediv__(mut self, rhs: Self)` Compute `self / rhs`, convert to int, and save the result in self. Since `floor(self / rhs)` is equivalent to `self // rhs`, this yields the same as `__ifloordiv__`. **Args:** * ​rhs (`Self`): The RHS value. ### `__ifloordiv__` `__ifloordiv__(mut self, rhs: Self)` Compute `self // rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__imod__` `__imod__(mut self, rhs: Self)` Compute `self % rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__ipow__` `__ipow__(mut self, rhs: Self)` Compute `pow(self, rhs)` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__ilshift__` `__ilshift__(mut self, rhs: Self)` Compute `self << rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__irshift__` `__irshift__(mut self, rhs: Self)` Compute `self >> rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__iand__` `__iand__(mut self, rhs: Self)` Compute `self & rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__ixor__` `__ixor__(mut self, rhs: Self)` Compute `self ^ rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__ior__` `__ior__(mut self, rhs: Self)` Compute `self | rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `get_type_name` `static get_type_name() -> String` Gets this type's name, for use in error messages when handing arguments to kernels. TODO: This will go away soon, when we get better error messages for kernel calls. **Returns:** This type's name. ### `get_device_type_name` `static get_device_type_name() -> String` Gets device\_type's name, for use in error messages when handing arguments to kernels. TODO: This will go away soon, when we get better error messages for kernel calls. **Returns:** This type's name. ### `__divmod__` `__divmod__(self, rhs: Self) -> Tuple[Int, Int]` Computes both the quotient and remainder using integer division. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** The quotient and remainder as a tuple `(self // rhs, self % rhs)`. ### `__as_bool__` `__as_bool__(self) -> Bool` Convert this Int to Bool.
**Returns:** False Bool value if the value is equal to 0 and True otherwise. ### `__index__` `__index__(self) -> index` Convert to index. **Returns:** The corresponding \_\_mlir\_type.index value. ### `__int__` `__int__(self) -> Self` Gets the integral value (this is an identity function for Int). **Returns:** The value as an integer. ### `__abs__` `__abs__(self) -> Self` Return the absolute value of the Int value. **Returns:** The absolute value. ### `__ceil__` `__ceil__(self) -> Self` Return the ceiling of the Int value, which is itself. **Returns:** The Int value itself. ### `__floor__` `__floor__(self) -> Self` Return the floor of the Int value, which is itself. **Returns:** The Int value itself. ### `__round__` `__round__(self) -> Self` Return the rounded value of the Int value, which is itself. **Returns:** The Int value itself. `__round__(self, ndigits: Self) -> Self` Return the rounded value of the Int value, which is itself. **Args:** * ​ndigits (`Self`): The number of digits to round to. **Returns:** The Int value itself if ndigits >= 0 else the rounded value. ### `__trunc__` `__trunc__(self) -> Self` Return the truncated Int value, which is itself. **Returns:** The Int value itself. ### `__ceildiv__` `__ceildiv__(self, denominator: Self) -> Self` Return the rounded-up result of dividing self by denominator. **Args:** * ​denominator (`Self`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. ### `is_power_of_two` `is_power_of_two(self) -> Bool` Check if the integer is a (non-zero) power of two. **Returns:** True if the integer is a power of two, False otherwise. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this integer to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `write_padded` `write_padded[W: Writer](self, mut writer: W, width: Self)` Write the int right-aligned to a set padding. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. * ​width (`Self`): The amount to pad to the left. ### `__str__` `__str__(self) -> String` Get the integer as a string. **Returns:** A string representation. ### `__repr__` `__repr__(self) -> String` Get the integer as a string. Returns the same `String` as `__str__`. **Returns:** A string representation. ### `__hash__` `__hash__(self) -> UInt` Hash the int using builtin hash. **Returns:** A 64-bit hash value. This value is *not* suitable for cryptographic uses. Its intended usage is for data structures. See the `hash` builtin documentation for more details. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with this int value. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `to_python_object` `to_python_object(owned self) -> PythonObject` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. --- ## Intable The `Intable` trait describes a type that can be converted to an Int. Any type that conforms to `Intable` or [`IntableRaising`](/mojo/stdlib/builtin/int/IntableRaising) can construct an `Int`. This trait requires the type to implement the `__int__()` method. 
For example: ```mojo struct Foo(Intable): var i: Int fn __int__(self) -> Int: return self.i ``` Now you can construct an `Int`: ```mojo foo = Foo(42) assert_equal(Int(foo), 42) ``` **Note:** If the `__int__()` method can raise an error, use the [`IntableRaising`](/mojo/stdlib/builtin/int/intableraising) trait instead. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `__int__` `__int__(self: _Self) -> Int` Get the integral representation of the value. **Returns:** The integral representation of the value. --- ## IntableRaising The `IntableRaising` trait describes a type that can be converted to an Int, but the conversion might raise an error. Any type that conforms to [`Intable`](/mojo/stdlib/builtin/int/Intable) or `IntableRaising` can construct an `Int`. This trait requires the type to implement the `__int__()` method, which can raise an error. For example: ```mojo struct Foo(IntableRaising): var i: Int fn __int__(self) raises -> Int: return self.i ``` Now you can construct an `Int`: ```mojo foo = Foo(42) assert_equal(Int(foo), 42) ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__int__` `__int__(self: _Self) -> Int` Get the integral representation of the value. **Returns:** The integral representation of the value. **Raises:** If the type does not have an integral representation. --- ## index `index[T: Indexer](idx: T, /) -> index` Returns the value of `__index__` for the given value. **Parameters:** * ​T (`Indexer`): A type conforming to the `Indexer` trait. **Args:** * ​idx (`T`): The value. **Returns:** An `__mlir_type` representing the index value. --- ## int Implements the Int class. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`Int`](/mojo/stdlib/builtin/int/Int): This type represents an integer value. ## Traits * [​`ImplicitlyIntable`](/mojo/stdlib/builtin/int/ImplicitlyIntable): The `ImplicitlyIntable` trait describes a type that can be converted to an Int implicitly. * [​`Indexer`](/mojo/stdlib/builtin/int/Indexer): The `Indexer` trait is used for types that can index into a collection or pointer. The type returned is the underlying \_\_mlir\_type.index, enabling types like `UInt` to not have to be converted to an `Int` first. This type is implicitly convertible to an `Int`, so can be used anywhere an `Int` can e.g. for comparisons. * [​`Intable`](/mojo/stdlib/builtin/int/Intable): The `Intable` trait describes a type that can be converted to an Int. * [​`IntableRaising`](/mojo/stdlib/builtin/int/IntableRaising): The `IntableRaising` trait describes a type that can be converted to an Int, but the conversion might raise an error. ## Functions * [​`index`](/mojo/stdlib/builtin/int/index-function): Returns the value of `__index__` for the given value. --- ## IntLiteral `@register_passable(trivial)` `struct IntLiteral[value: !pop.int_literal]` This type represents a static integer literal value with infinite precision. This type is a compile-time construct which stores its value as a parameter. It is typically materialized into other types (like `Int`) for use at runtime.
This compile-time representation allows for arbitrary precision constants that would overflow on Int and other fixed precision integer types. ## Parameters * ​value (`!pop.int_literal`): The underlying integer value. ## Implemented traits `AnyType`, `Boolable`, `Ceilable`, `Copyable`, `Defaultable`, `Floorable`, `ImplicitlyBoolable`, `ImplicitlyIntable`, `Indexer`, `Intable`, `Movable`, `Stringable`, `Truncable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Constructor for any value. ### `__bool__` `__bool__(self) -> Bool` Convert this IntLiteral to Bool. **Returns:** False Bool value if the value is equal to 0 and True otherwise. ### `__neg__` `__neg__(self) -> IntLiteral[(0 - value)]` Return -self. **Returns:** The -self value. ### `__pos__` `__pos__(self) -> Self` Return +self. **Returns:** The +self value. ### `__invert__` `__invert__(self) -> IntLiteral[(value ^ -1)]` Return \~self. **Returns:** The \~self value. ### `__lt__` `__lt__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using LT comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is less-than the RHS IntLiteral and False otherwise. ### `__le__` `__le__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using LE comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is less-or-equal than the RHS IntLiteral and False otherwise. ### `__eq__` `__eq__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using EQ comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is equal to the RHS IntLiteral and False otherwise. ### `__ne__` `__ne__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using NE comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is non-equal to the RHS IntLiteral and False otherwise. ### `__gt__` `__gt__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using GT comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is greater-than the RHS IntLiteral and False otherwise. ### `__ge__` `__ge__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using GE comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is greater-or-equal than the RHS IntLiteral and False otherwise. ### `__add__` `__add__(self, rhs: IntLiteral[value]) -> IntLiteral[(value + value)]` Return `self + rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The value to add. **Returns:** `self + rhs` value. ### `__sub__` `__sub__(self, rhs: IntLiteral[value]) -> IntLiteral[(value - value)]` Return `self - rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The value to subtract. **Returns:** `self - rhs` value. ### `__mul__` `__mul__(self, rhs: IntLiteral[value]) -> IntLiteral[(value * value)]` Return `self * rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The value to multiply with. **Returns:** `self * rhs` value. ### `__floordiv__` `__floordiv__(self, rhs: IntLiteral[value]) -> IntLiteral[(value // value)]` Return `self // rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The value to divide with. **Returns:** `self // rhs` value. 
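Because the value is stored as a parameter, arithmetic on literals never overflows; only the final materialized result must fit in the runtime type. A minimal sketch (the alias names are illustrative, not part of the API):

```mojo
alias big = 1_000_000_000_000_000_000 * 1_000_000_000_000_000_000  # 10^36, exact at compile time
alias small = big // 1_000_000_000_000_000_000  # back to 10^18, which fits in an Int

print(small)  # materializes the result to Int: 1000000000000000000
```

Materializing `big` itself into an `Int` should instead produce a compile-time error, since it exceeds the representable range.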
### `__mod__` `__mod__(self, rhs: IntLiteral[value]) -> IntLiteral[(value % value)]` Return the remainder of self divided by rhs. **Args:** * ​rhs (`IntLiteral[value]`): The value to divide on. **Returns:** The remainder of dividing self by rhs. ### `__lshift__` `__lshift__(self, rhs: IntLiteral[value]) -> IntLiteral[(value << value)]` Return `self << rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The value to shift with. **Returns:** `self << rhs`. ### `__rshift__` `__rshift__(self, rhs: IntLiteral[value]) -> IntLiteral[(value >> value)]` Return `self >> rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The value to shift with. **Returns:** `self >> rhs`. ### `__and__` `__and__(self, rhs: IntLiteral[value]) -> IntLiteral[(value & value)]` Return `self & rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The RHS value. **Returns:** `self & rhs`. ### `__or__` `__or__(self, rhs: IntLiteral[value]) -> IntLiteral[(value | value)]` Return `self | rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The RHS value. **Returns:** `self | rhs`. ### `__xor__` `__xor__(self, rhs: IntLiteral[value]) -> IntLiteral[(value ^ value)]` Return `self ^ rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The RHS value. **Returns:** `self ^ rhs`. ### `__as_bool__` `__as_bool__(self) -> Bool` Convert this IntLiteral to Bool. **Returns:** False Bool value if the value is equal to 0 and True otherwise. ### `__int__` `__int__(self) -> Int` Convert from IntLiteral to Int. **Returns:** The value as an integer of platform-specific width. ### `__as_int__` `__as_int__(self) -> Int` Implicitly convert to an Int. **Returns:** An integral value that represents this object. ### `__uint__` `__uint__(self) -> UInt` Convert from IntLiteral to UInt. **Returns:** The value as an unsigned integer of platform-specific width. ### `__ceil__` `__ceil__(self) -> Self` Return the ceiling of the IntLiteral value, which is itself. **Returns:** The IntLiteral value itself. ### `__floor__` `__floor__(self) -> Self` Return the floor of the IntLiteral value, which is itself. **Returns:** The IntLiteral value itself. ### `__trunc__` `__trunc__(self) -> Self` Return the truncated IntLiteral value, which is itself. **Returns:** The IntLiteral value itself. ### `__str__` `__str__(self) -> String` Convert from IntLiteral to String. **Returns:** The value as a string. ### `__ceildiv__` `__ceildiv__(self, denominator: IntLiteral[value]) -> IntLiteral[(0 - (value // (0 - value)))]` Return the rounded-up result of dividing self by denominator. **Args:** * ​denominator (`IntLiteral[value]`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. ### `__index__` `__index__(self) -> index` Convert from IntLiteral to index. **Returns:** The corresponding \_\_mlir\_type.index value, interpreting as signed. --- ## int_literal Implements the IntLiteral class. ## Structs * [​`IntLiteral`](/mojo/stdlib/builtin/int_literal/IntLiteral): This type represents a static integer literal value with infinite precision. This type is a compile-time construct which stores its value as a parameter. It is typically materialized into other types (like `Int`) for use at runtime. This compile-time representation allows for arbitrary precision constants that would overflow on Int and other fixed precision integer types. --- ## io Provides utilities for working with input/output. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`input`](/mojo/stdlib/builtin/io/input): Reads a line of input from the user. * [​`print`](/mojo/stdlib/builtin/io/print): Prints elements to the text stream.
Each element is separated by `sep` and followed by `end`. --- ## input `input(prompt: String = __init__[__mlir_type.!kgen.string]("")) -> String` Reads a line of input from the user. Reads a line from standard input, converts it to a string, and returns that string. If the prompt argument is present, it is written to standard output without a trailing newline. Examples: ```mojo name = input("Enter your name: ") print("Hello", name) ``` If the user enters "Mojo" it prints "Hello Mojo". **Args:** * ​prompt (`String`): An optional string to be printed before reading input. **Returns:** A string containing the line read from the user input. --- ## print `print[*Ts: Writable](*values: *Ts, *, sep: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" "), end: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("\n"), flush: Bool = False, owned file: FileDescriptor = FileDescriptor(1))` Prints elements to the text stream. Each element is separated by `sep` and followed by `end`. **Parameters:** * ​\*Ts (`Writable`): The element types. **Args:** * ​\*values (`*Ts`): The elements to print. * ​sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements. * ​end (`StringSlice[StaticConstantOrigin]`): The String to write after printing the elements. * ​flush (`Bool`): If set to true, then the stream is forcibly flushed. * ​file (`FileDescriptor`): The output stream. --- ## Sized The `Sized` trait describes a type that has an integer length (such as a string or array). Any type that conforms to `Sized` or [`SizedRaising`](/mojo/stdlib/builtin/len/SizedRaising) works with the built-in [`len()`](/mojo/stdlib/builtin/len/len) function. The `Sized` trait requires a type to implement the `__len__()` method. For example: ```mojo struct Foo(Sized): var length: Int fn __len__(self) -> Int: return self.length ``` You can pass an instance of `Foo` to the `len()` function to get its length: ```mojo var foo = Foo(42) print(len(foo) == 42) ``` ```plaintext True ``` **Note:** If the `__len__()` method can raise an error, use the [`SizedRaising`](/mojo/stdlib/builtin/len/SizedRaising) trait instead. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__len__` `__len__(self: _Self) -> Int` Get the length of the type. **Returns:** The length of the type. --- ## SizedRaising The `SizedRaising` trait describes a type that has an integer length, which might raise an error if the length can't be determined. Any type that conforms to [`Sized`](/mojo/stdlib/builtin/len/Sized) or `SizedRaising` works with the built-in [`len()`](/mojo/stdlib/builtin/len/len) function. The `SizedRaising` trait requires a type to implement the `__len__()` method, which can raise an error. For example: ```mojo struct Foo(SizedRaising): var length: Int fn __len__(self) raises -> Int: if self.length < 0: raise Error("Length is negative") return self.length ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__len__` `__len__(self: _Self) -> Int` Get the length of the type. **Returns:** The length of the type. **Raises:** If the length cannot be computed. --- ## UIntSized The `Sized` trait describes a type that has an integer length (such as a string or array). Any type that conforms to `Sized` or [`SizedRaising`](/mojo/stdlib/builtin/len/SizedRaising) works with the built-in [`len()`](/mojo/stdlib/builtin/len/len) function. The `Sized` trait requires a type to implement the `__len__()` method.
For example: ```mojo struct Foo(Sized): var length: Int fn __len__(self) -> Int: return self.length ``` You can pass an instance of `Foo` to the `len()` function to get its length: ```mojo var foo = Foo(42) print(len(foo) == 42) ``` ```plaintext True ``` **Note:** If the `__len__()` method can raise an error, use the [`SizedRaising`](/mojo/stdlib/builtin/len/SizedRaising) trait instead. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__len__` `__len__(self: _Self) -> UInt` Get the length of the type. **Returns:** The length of the type. --- ## len Provides the `len()` function and its associated traits. These are Mojo built-ins, so you don't need to import them. ## Traits * [​`Sized`](/mojo/stdlib/builtin/len/Sized): The `Sized` trait describes a type that has an integer length (such as a string or array). * [​`SizedRaising`](/mojo/stdlib/builtin/len/SizedRaising): The `SizedRaising` trait describes a type that has an integer length, which might raise an error if the length can't be determined. * [​`UIntSized`](/mojo/stdlib/builtin/len/UIntSized): The `Sized` trait describes a type that has an integer length (such as a string or array). ## Functions * [​`len`](/mojo/stdlib/builtin/len/len): Get the length of a value. --- ## len `len[T: Sized](value: T) -> Int` Get the length of a value. **Parameters:** * ​T (`Sized`): The Sized type. **Args:** * ​value (`T`): The object to get the length of. **Returns:** The length of the object. `len[T: SizedRaising](value: T) -> Int` Get the length of a value. **Parameters:** * ​T (`SizedRaising`): The Sized type. **Args:** * ​value (`T`): The object to get the length of. **Returns:** The length of the object. **Raises:** If the length cannot be computed. --- ## Absable The `Absable` trait describes a type that defines an absolute value operation. Types that conform to `Absable` will work with the builtin `abs` function. The absolute value operation always returns the same type as the input. For example: ```mojo @fieldwise_init struct Point(Absable): var x: Float64 var y: Float64 fn __abs__(self) -> Self: return Self(abs(self.x), abs(self.y)) ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__abs__` `__abs__(self: _Self) -> _Self` Get the absolute value of this instance. **Returns:** The absolute value of the instance. --- ## Powable The `Powable` trait describes a type that defines a power operation (i.e. exponentiation) with the same base and exponent types. Types that conform to `Powable` will work with the builtin `pow` function, which will return the same type as the inputs. For example: ```mojo struct Rational(Powable): var numerator: Float64 var denominator: Float64 fn __init__(out self, numerator: Float64, denominator: Float64): self.numerator = numerator self.denominator = denominator fn __pow__(self, exp: Self) -> Self: var exp_value = exp.numerator / exp.denominator return Self(pow(self.numerator, exp_value), pow(self.denominator, exp_value)) ``` You can now use the \*\* operator to exponentiate objects inside generic functions: ```mojo fn exponentiate[T: Powable](base: T, exp: T) -> T: return base ** exp var base = Rational(Float64(3.0), 5.0) var exp = Rational(Float64(1.0), 2.0) var res = exponentiate(base, exp) ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__pow__` `__pow__(self: _Self, exp: _Self) -> _Self` Return the value raised to the power of the given exponent. **Args:** * ​exp (`_Self`): The exponent value.
**Returns:** The value of `self` raised to the power of `exp`. --- ## Roundable The `Roundable` trait describes a type that defines a rounding operation. Types that conform to `Roundable` will work with the builtin `round` function. The round operation always returns the same type as the input. For example: ```mojo @fieldwise_init struct Complex(Roundable): var re: Float64 var im: Float64 fn __round__(self) -> Self: return Self(round(self.re), round(self.im)) fn __round__(self, ndigits: Int) -> Self: return Self(round(self.re, ndigits), round(self.im, ndigits)) ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__round__` `__round__(self: _Self) -> _Self` Get a rounded value for the type. **Returns:** The rounded value. `__round__(self: _Self, ndigits: Int) -> _Self` Get a rounded value for the type. **Args:** * ​ndigits (`Int`): Number of digits after the decimal point. **Returns:** The rounded value. --- ## abs `abs[T: Absable](value: T) -> T` Get the absolute value of the given object. **Parameters:** * ​T (`Absable`): The type conforming to Absable. **Args:** * ​value (`T`): The object to get the absolute value of. **Returns:** The absolute value of the object. --- ## divmod `divmod(numerator: Int, denominator: Int) -> Tuple[Int, Int]` Performs integer division and returns the quotient and the remainder. Currently supported only for integers. Support for more standard library types like Int8, Int16... is planned. This method calls `a.__divmod__(b)`, thus, the actual implementation of divmod should go in the `__divmod__` method of the struct of `a`. **Args:** * ​numerator (`Int`): The dividend. * ​denominator (`Int`): The divisor. **Returns:** A `Tuple` containing the quotient and the remainder. `divmod(numerator: UInt, denominator: UInt) -> Tuple[UInt, UInt]` Performs integer division and returns the quotient and the remainder. Currently supported only for integers. Support for more standard library types like Int8, Int16... is planned. This method calls `a.__divmod__(b)`, thus, the actual implementation of divmod should go in the `__divmod__` method of the struct of `a`. **Args:** * ​numerator (`UInt`): The dividend. * ​denominator (`UInt`): The divisor. **Returns:** A `Tuple` containing the quotient and the remainder. --- ## math Defines basic math functions for use in the open source parts of the standard library since the `math` package is currently closed source and cannot be depended on in the open source parts of the standard library. These are Mojo built-ins, so you don't need to import them. ## Traits * [​`Absable`](/mojo/stdlib/builtin/math/Absable): The `Absable` trait describes a type that defines an absolute value operation. * [​`Powable`](/mojo/stdlib/builtin/math/Powable): The `Powable` trait describes a type that defines a power operation (i.e. exponentiation) with the same base and exponent types. * [​`Roundable`](/mojo/stdlib/builtin/math/Roundable): The `Roundable` trait describes a type that defines a rounding operation. ## Functions * [​`abs`](/mojo/stdlib/builtin/math/abs): Get the absolute value of the given object. * [​`divmod`](/mojo/stdlib/builtin/math/divmod): Performs integer division and returns the quotient and the remainder. * [​`max`](/mojo/stdlib/builtin/math/max): Gets the maximum of two integers. * [​`min`](/mojo/stdlib/builtin/math/min): Gets the minimum of two integers. * [​`pow`](/mojo/stdlib/builtin/math/pow): Computes the `base` raised to the power of the `exp`. 
* [​`round`](/mojo/stdlib/builtin/math/round): Get the rounded value of the given object. --- ## max `max(x: Int, y: Int, /) -> Int` Gets the maximum of two integers. **Args:** * ​x (`Int`): Integer input to max. * ​y (`Int`): Integer input to max. **Returns:** Maximum of x and y. `max(x: UInt, y: UInt, /) -> UInt` Gets the maximum of two integers. **Args:** * ​x (`UInt`): Integer input to max. * ​y (`UInt`): Integer input to max. **Returns:** Maximum of x and y. `max[dtype: DType, //](x: SIMD[dtype, size], y: SIMD[dtype, size], /) -> SIMD[dtype, size]` Performs elementwise maximum of x and y. An element of the result SIMD vector will be the maximum of the corresponding elements in x and y. **Constraints:** The type of the inputs must be numeric or boolean. **Parameters:** * ​dtype (`DType`): The data type of the SIMD vector. **Args:** * ​x (`SIMD[dtype, size]`): First SIMD vector. * ​y (`SIMD[dtype, size]`): Second SIMD vector. **Returns:** A SIMD vector containing the elementwise maximum of x and y. `max[T: Copyable & GreaterThanComparable](x: T, *ys: T) -> T` Gets the maximum value from a sequence of values. **Parameters:** * ​T (`Copyable & GreaterThanComparable`): A type that is both copyable and comparable with greater than. **Args:** * ​x (`T`): The first value to compare. * ​\*ys (`T`): Zero or more additional values to compare. **Returns:** The maximum value from the input sequence. --- ## min `min(x: Int, y: Int, /) -> Int` Gets the minimum of two integers. **Args:** * ​x (`Int`): Integer input to min. * ​y (`Int`): Integer input to min. **Returns:** Minimum of x and y. `min(x: UInt, y: UInt, /) -> UInt` Gets the minimum of two integers. **Args:** * ​x (`UInt`): Integer input to min. * ​y (`UInt`): Integer input to min. **Returns:** Minimum of x and y. `min[dtype: DType, //](x: SIMD[dtype, size], y: SIMD[dtype, size], /) -> SIMD[dtype, size]` Gets the elementwise minimum of x and y. An element of the result SIMD vector will be the minimum of the corresponding elements in x and y. **Constraints:** The type of the inputs must be numeric or boolean. **Parameters:** * ​dtype (`DType`): The data type of the SIMD vector. **Args:** * ​x (`SIMD[dtype, size]`): First SIMD vector. * ​y (`SIMD[dtype, size]`): Second SIMD vector. **Returns:** A SIMD vector containing the elementwise minimum of x and y. `min[T: Copyable & LessThanComparable](x: T, *ys: T) -> T` Gets the minimum value from a sequence of values. **Parameters:** * ​T (`Copyable & LessThanComparable`): A type that is both copyable and comparable with less than. **Args:** * ​x (`T`): The first value to compare. * ​\*ys (`T`): Zero or more additional values to compare. **Returns:** The minimum value from the input sequence. --- ## pow `pow[T: Powable](base: T, exp: T) -> T` Computes the `base` raised to the power of the `exp`. **Parameters:** * ​T (`Powable`): A type conforming to the `Powable` trait. **Args:** * ​base (`T`): The base of the power operation. * ​exp (`T`): The exponent of the power operation. **Returns:** The `base` raised to the power of the `exp`. `pow(base: SIMD[dtype, size], exp: Int) -> SIMD[dtype, size]` Computes elementwise value of a SIMD vector raised to the power of the given integer. **Args:** * ​base (`SIMD[dtype, size]`): The first input argument. * ​exp (`Int`): The second input argument. **Returns:** The `base` elementwise raised to the power of `exp`. --- ## round `round[T: Roundable, //](number: T) -> T` Get the rounded value of the given object.
**Parameters:** * ​T (`Roundable`): The type conforming to Roundable. **Args:** * ​number (`T`): The object to get the rounded value of. **Returns:** The rounded value of the object. `round[T: Roundable, //](number: T, ndigits: Int) -> T` Get the value of this object, rounded to a specified number of digits after the decimal point. **Parameters:** * ​T (`Roundable`): The type conforming to Roundable. **Args:** * ​number (`T`): The object to get the rounded value of. * ​ndigits (`Int`): The number of digits to round to. **Returns:** The rounded value of the object. --- ## NoneType `@register_passable(trivial)` `struct NoneType` Represents the absence of a value. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Construct an instance of the `None` type. `@implicit` `__init__(value: None) -> Self` Construct an instance of the `None` type. **Args:** * ​value (`None`): The MLIR none type to construct from. ### `copy` `copy(self) -> Self` Explicit copy constructor. **Returns:** A copy of the value. ### `__str__` `__str__(self) -> String` Returns the string representation of `None`. **Returns:** `"None"`. ### `__repr__` `__repr__(self) -> String` Returns the string representation of `None`. **Returns:** `"None"`. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Write `None` to a writer stream. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. --- ## none Defines the builtin `NoneType`. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`NoneType`](/mojo/stdlib/builtin/none/NoneType): Represents the absence of a value. --- ## range Implements a 'range' call. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`range`](/mojo/stdlib/builtin/range/range): Constructs a \[0; end) Range. --- ## range `range[T: Indexer, //](end: T) -> _ZeroStartingRange` Constructs a \[0; end) Range. **Parameters:** * ​T (`Indexer`): The type of the end value. **Args:** * ​end (`T`): The end of the range. **Returns:** The constructed range. `range[T: IntableRaising, //](end: T) -> _ZeroStartingRange` Constructs a \[0; end) Range. **Parameters:** * ​T (`IntableRaising`): The type of the end value. **Args:** * ​end (`T`): The end of the range. **Returns:** The constructed range. **Raises:** An error if the conversion to an `Int` failed. `range(end: PythonObject) -> _ZeroStartingRange` Constructs a \[0; end) Range from a Python `int`. **Args:** * ​end (`PythonObject`): The end of the range as a Python `int`. **Returns:** The constructed range. **Raises:** An error if converting `end` to an `Int` failed. `range[T0: Indexer, T1: Indexer, //](start: T0, end: T1) -> _SequentialRange` Constructs a \[start; end) Range. **Parameters:** * ​T0 (`Indexer`): The type of the start value. * ​T1 (`Indexer`): The type of the end value. **Args:** * ​start (`T0`): The start of the range. * ​end (`T1`): The end of the range. **Returns:** The constructed range. `range[T0: IntableRaising, T1: IntableRaising](start: T0, end: T1) -> _SequentialRange` Constructs a \[start; end) Range. **Parameters:** * ​T0 (`IntableRaising`): The type of the start value. * ​T1 (`IntableRaising`): The type of the end value. **Args:** * ​start (`T0`): The start of the range. * ​end (`T1`): The end of the range. **Returns:** The constructed range. 
**Raises:** An error if converting `start` or `end` to an `Int` failed. `range(start: PythonObject, end: PythonObject) -> _SequentialRange` Constructs a \[start; end) Range from Python `int` objects. **Args:** * ​start (`PythonObject`): The start of the range as a Python `int`. * ​end (`PythonObject`): The end of the range as a Python `int`. **Returns:** The constructed range. **Raises:** An error if converting `start` or `end` to an `Int` failed. `range[T0: Indexer, T1: Indexer, T2: Indexer, //](start: T0, end: T1, step: T2) -> _StridedRange` Constructs a \[start; end) Range with a given step. **Parameters:** * ​T0 (`Indexer`): The type of the start value. * ​T1 (`Indexer`): The type of the end value. * ​T2 (`Indexer`): The type of the step value. **Args:** * ​start (`T0`): The start of the range. * ​end (`T1`): The end of the range. * ​step (`T2`): The step for the range. **Returns:** The constructed range. `range[T0: IntableRaising, T1: IntableRaising, T2: IntableRaising, //](start: T0, end: T1, step: T2) -> _StridedRange` Constructs a \[start; end) Range with a given step. **Parameters:** * ​T0 (`IntableRaising`): The type of the start value. * ​T1 (`IntableRaising`): The type of the end value. * ​T2 (`IntableRaising`): The type of the step value. **Args:** * ​start (`T0`): The start of the range. * ​end (`T1`): The end of the range. * ​step (`T2`): The step for the range. **Returns:** The constructed range. **Raises:** An error if converting `start`, `end`, or `step` to an `Int` failed. `range(start: PythonObject, end: PythonObject, step: PythonObject) -> _StridedRange` Constructs a \[start; end) Range from Python `int` objects with a given step. **Args:** * ​start (`PythonObject`): The start of the range as a Python `int`. * ​end (`PythonObject`): The end of the range as a Python `int`. * ​step (`PythonObject`): The step for the range as a Python `int`. **Returns:** The constructed range. **Raises:** An error if converting `start`, `end`, or `step` to an `Int` failed. `range(end: UInt) -> _UIntZeroStartingRange` Constructs a \[0; end) Range. **Args:** * ​end (`UInt`): The end of the range. **Returns:** The constructed range. `range(start: UInt, end: UInt, step: UInt = UInt(1)) -> _UIntStridedRange` Constructs a \[start; end) Range with a given step. **Args:** * ​start (`UInt`): The start of the range. * ​end (`UInt`): The end of the range. * ​step (`UInt`): The step for the range. Defaults to 1. **Returns:** The constructed range. `range[dtype: DType, //](end: SIMD[dtype, 1]) -> _ZeroStartingScalarRange[dtype]` Constructs a \[0; end) Range. **Parameters:** * ​dtype (`DType`): The range dtype. **Args:** * ​end (`SIMD[dtype, 1]`): The end of the range. **Returns:** The constructed range. `range[dtype: DType, //](start: SIMD[dtype, 1], end: SIMD[dtype, 1]) -> _SequentialScalarRange[dtype]` Constructs a \[start; end) Range. **Parameters:** * ​dtype (`DType`): The range dtype. **Args:** * ​start (`SIMD[dtype, 1]`): The start of the range. * ​end (`SIMD[dtype, 1]`): The end of the range. **Returns:** The constructed range. `range[dtype: DType, //](start: SIMD[dtype, 1], end: SIMD[dtype, 1], step: SIMD[dtype, 1]) -> _StridedScalarRange[dtype]` Constructs a \[start; end) Range with a given step. **Parameters:** * ​dtype (`DType`): The range dtype. **Args:** * ​start (`SIMD[dtype, 1]`): The start of the range. * ​end (`SIMD[dtype, 1]`): The end of the range. * ​step (`SIMD[dtype, 1]`): The step for the range.
**Returns:** The constructed range. --- ## rebind Implements type rebind. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`rebind`](/mojo/stdlib/builtin/rebind/rebind): Statically assert that a parameter input type `src_type` resolves to the same type as a parameter result type `dest_type` after function instantiation and "rebind" the input to the result type. --- ## rebind `rebind[src_type: AnyTrivialRegType, //, dest_type: AnyTrivialRegType](src: src_type) -> dest_type` Statically assert that a parameter input type `src_type` resolves to the same type as a parameter result type `dest_type` after function instantiation and "rebind" the input to the result type. This function is meant to be used in uncommon cases where a parametric type depends on the value of a constrained parameter in order to manually refine the type with the constrained parameter value. **Parameters:** * ​src\_type (`AnyTrivialRegType`): The original type. * ​dest\_type (`AnyTrivialRegType`): The type to rebind to. **Args:** * ​src (`src_type`): The value to rebind. **Returns:** The rebound value of `dest_type`. `rebind[src_type: AnyType, //, dest_type: AnyType](ref src: src_type) -> ref [src] dest_type` Statically assert that a parameter input type `src_type` resolves to the same type as a parameter result type `dest_type` after function instantiation and "rebind" the input to the result type, returning a reference to the input value with an adjusted type. This function is meant to be used in uncommon cases where a parametric type depends on the value of a constrained parameter in order to manually refine the type with the constrained parameter value. **Parameters:** * ​src\_type (`AnyType`): The original type. * ​dest\_type (`AnyType`): The type to rebind to. **Args:** * ​src (`src_type`): The value to rebind. **Returns:** A reference to the value rebound as `dest_type`. --- ## Representable A trait that describes a type that has a String representation. Any type that conforms to the `Representable` trait can be used with the `repr` function. Any conforming type must also implement the `__repr__` method. Here is an example:

```mojo
struct Dog(Representable):
    var name: String
    var age: Int

    fn __init__(out self, name: String, age: Int):
        self.name = name
        self.age = age

    fn __repr__(self) -> String:
        return "Dog(name=" + repr(self.name) + ", age=" + repr(self.age) + ")"

var dog = Dog("Rex", 5)
print(repr(dog))  # Dog(name='Rex', age=5)
```

The method `__repr__` should compute the "official" string representation of a type. If at all possible, this should look like a valid Mojo expression that could be used to recreate a struct instance with the same value (given an appropriate environment). So a returned String of the form `module_name.SomeStruct(arg1=value1, arg2=value2)` is advised. If this is not possible, a string of the form `<...some useful description...>` should be returned. The return value must be a `String` instance. This is typically used for debugging, so it is important that the representation is information-rich and unambiguous. Note that when computing the string representation of a collection (`Dict`, `List`, `Set`, etc...), the `repr` function is called on each element, not the `String()` function. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__repr__` `__repr__(self: _Self) -> String` Get the string representation of the type instance, if possible, compatible with Mojo syntax. **Returns:** The string representation of the instance. --- ## repr Provide the `repr` function.
The functions and traits provided here are built-ins, so you don't need to import them. ## Traits * [​`Representable`](/mojo/stdlib/builtin/repr/Representable): A trait that describes a type that has a String representation. ## Functions * [​`repr`](/mojo/stdlib/builtin/repr/repr): Returns the string representation of the given value. --- ## repr `repr[T: Representable](value: T) -> String` Returns the string representation of the given value. **Parameters:** * ​T (`Representable`): The type of `value`. Must implement the `Representable` trait. **Args:** * ​value (`T`): The value to get the string representation of. **Returns:** The string representation of the given value. `repr(value: None) -> String` Returns the string representation of `None`. **Args:** * ​value (`None`): A `None` value. **Returns:** The string representation of `None`. --- ## ReversibleRange The `ReversibleRange` trait describes a range that can be reversed. Any type that conforms to `ReversibleRange` works with the built-in [`reversed()`](/mojo/stdlib/builtin/reversed) function. The `ReversibleRange` trait requires the type to define the `__reversed__()` method. **Note**: iterators are currently non-raising. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__reversed__` `__reversed__(self: _Self) -> _StridedRange` Get a reversed iterator for the type. **Note**: iterators are currently non-raising. **Returns:** The reversed iterator of the type. --- ## reversed Provides the `reversed` function for reverse iteration over collections. These are Mojo built-ins, so you don't need to import them. ## Traits * [​`ReversibleRange`](/mojo/stdlib/builtin/reversed/ReversibleRange): The `ReversibleRange` trait describes a range that can be reversed. ## Functions * [​`reversed`](/mojo/stdlib/builtin/reversed/reversed): Get a reversed iterator of the input range. --- ## reversed `reversed[T: ReversibleRange](value: T) -> _StridedRange` Get a reversed iterator of the input range. **Note**: iterators are currently non-raising. **Parameters:** * ​T (`ReversibleRange`): The type conforming to ReversibleRange. **Args:** * ​value (`T`): The range to get the reversed iterator of. **Returns:** The reversed iterator of the range. `reversed[T: Copyable & Movable](ref value: List[T, hint_trivial_type]) -> _ListIter[T, hint_trivial_type, value_is_origin, False]` Get a reversed iterator of the input list. **Note**: iterators are currently non-raising. **Parameters:** * ​T (`Copyable & Movable`): The type of the elements in the list. **Args:** * ​value (`List[T, hint_trivial_type]`): The list to get the reversed iterator of. **Returns:** The reversed iterator of the list. `reversed[T: Copyable & Movable](ref value: Deque[T]) -> _DequeIter[T, value_is_origin, False]` Get a reversed iterator of the deque. **Note**: iterators are currently non-raising. **Parameters:** * ​T (`Copyable & Movable`): The type of the elements in the deque. **Args:** * ​value (`Deque[T]`): The deque to get the reversed iterator of. **Returns:** The reversed iterator of the deque. `reversed[K: Copyable & Movable & Hashable & EqualityComparable, V: Copyable & Movable](ref value: Dict[K, V]) -> _DictKeyIter[K, V, value_is_origin, False]` Get a reversed iterator of the input dict. **Note**: iterators are currently non-raising. **Parameters:** * ​K (`Copyable & Movable & Hashable & EqualityComparable`): The type of the keys in the dict. * ​V (`Copyable & Movable`): The type of the values in the dict.
**Args:** * ​value (`Dict[K, V]`): The dict to get the reversed iterator of. **Returns:** The reversed iterator of the dict keys. `reversed[K: Copyable & Movable & Hashable & EqualityComparable, V: Copyable & Movable, dict_mutability: Bool, dict_origin: Origin[dict_mutability]](ref value: _DictValueIter[K, V, dict_origin]) -> _DictValueIter[K, V, dict_origin, False]` Get a reversed iterator of the input dict values. **Note**: iterators are currently non-raising. **Parameters:** * ​K (`Copyable & Movable & Hashable & EqualityComparable`): The type of the keys in the dict. * ​V (`Copyable & Movable`): The type of the values in the dict. * ​dict\_mutability (`Bool`): Whether the reference to the dict values is mutable. * ​dict\_origin (`Origin[dict_mutability]`): The origin of the dict values. **Args:** * ​value (`_DictValueIter[K, V, dict_origin]`): The dict values to get the reversed iterator of. **Returns:** The reversed iterator of the dict values. `reversed[K: Copyable & Movable & Hashable & EqualityComparable, V: Copyable & Movable, dict_mutability: Bool, dict_origin: Origin[dict_mutability]](ref value: _DictEntryIter[K, V, dict_origin]) -> _DictEntryIter[K, V, dict_origin, False]` Get a reversed iterator of the input dict items. **Note**: iterators are currently non-raising. **Parameters:** * ​K (`Copyable & Movable & Hashable & EqualityComparable`): The type of the keys in the dict. * ​V (`Copyable & Movable`): The type of the values in the dict. * ​dict\_mutability (`Bool`): Whether the reference to the dict items is mutable. * ​dict\_origin (`Origin[dict_mutability]`): The origin of the dict items. **Args:** * ​value (`_DictEntryIter[K, V, dict_origin]`): The dict items to get the reversed iterator of. **Returns:** The reversed iterator of the dict items. `reversed[T: Copyable & Movable](value: Span[T, origin]) -> _SpanIter[T, origin, False]` Get a reversed iterator of the input Span. **Note**: iterators are currently non-raising. **Parameters:** * ​T (`Copyable & Movable`): The type of the elements in the Span. **Args:** * ​value (`Span[T, origin]`): The Span to get the reversed iterator of. **Returns:** The reversed iterator of the Span. --- ## SIMD `@register_passable(trivial)` `struct SIMD[dtype: DType, size: Int]` Represents a small vector that is backed by a hardware vector element. SIMD allows a single instruction to be executed across the multiple data elements of the vector. **Constraints:** The size of the SIMD vector must be positive and a power of 2. ## Parameters * ​dtype (`DType`): The data type of SIMD vector elements. * ​size (`Int`): The size of the SIMD vector. ## Fields * ​value: The underlying MLIR `simd` storage for the vector. ## Implemented traits `Absable`, `AnyType`, `Boolable`, `CeilDivable`, `Ceilable`, `Copyable`, `Defaultable`, `DevicePassable`, `ExplicitlyCopyable`, `Floatable`, `Floorable`, `Hashable`, `Indexer`, `Intable`, `Movable`, `Powable`, `PythonConvertible`, `Representable`, `Roundable`, `Sized`, `Stringable`, `Truncable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Aliases ### `device_type` `alias device_type = SIMD[dtype, size]` SIMD types are remapped to the same type when passed to accelerator devices. ### `element_type` `alias element_type = dtype` ### `MAX` `alias MAX = SIMD(max_or_inf[::DType]())` Gets the maximum value for the SIMD value, potentially +inf. ### `MAX_FINITE` `alias MAX_FINITE = SIMD(max_finite[::DType]())` Returns the maximum finite value of SIMD value.
### `MIN` `alias MIN = SIMD(min_or_neg_inf[::DType]())` Gets the minimum value for the SIMD value, potentially -inf. ### `MIN_FINITE` `alias MIN_FINITE = SIMD(min_finite[::DType]())` Returns the minimum (lowest) finite value of SIMD value. ## Methods ### `__init__` `__init__() -> Self` Default initializer of the SIMD vector. By default the SIMD vectors are initialized to all zeros. `__init__[other_dtype: DType, //](value: SIMD[other_dtype, size], /) -> Self` Initialize from another SIMD of the same size. If the value passed is a scalar, you can initialize a SIMD vector with more elements. Example:

```mojo
print(UInt64(UInt8(42)))  # 42
print(SIMD[DType.uint64, 4](UInt8(42)))  # [42, 42, 42, 42]
```

Casting behavior:

```mojo
# Basic casting preserves value within range
Int8(UInt8(127)) == Int8(127)

# Numbers above signed max wrap to negative using two's complement
Int8(UInt8(128)) == Int8(-128)
Int8(UInt8(129)) == Int8(-127)
Int8(UInt8(256)) == Int8(0)

# Negative signed cast to unsigned using two's complement
UInt8(Int8(-128)) == UInt8(128)
UInt8(Int8(-127)) == UInt8(129)
UInt8(Int8(-1)) == UInt8(255)

# Truncate precision after downcast and upcast
Float64(Float32(Float64(123456789.123456789))) == Float64(123456792.0)

# Rightmost bits of significand become 0's on upcast
Float64(Float32(0.3)) == Float64(0.30000001192092896)

# Numbers equal after truncation of float literal and cast truncation
Float32(Float64(123456789.123456789)) == Float32(123456789.123456789)

# Float to int/uint floors
Int64(Float64(42.2)) == Int64(42)
```

**Parameters:** * ​other\_dtype (`DType`): The type of the value that is being cast from. **Args:** * ​value (`SIMD[other_dtype, size]`): The value to cast from. `@implicit` `__init__(value: UInt, /) -> Self` Initializes the SIMD vector with an unsigned integer. The unsigned integer value is splatted across all the elements of the SIMD vector. **Args:** * ​value (`UInt`): The input value. `@implicit` `__init__(value: Int, /) -> Self` Initializes the SIMD vector with a signed integer. The signed integer value is splatted across all the elements of the SIMD vector. **Args:** * ​value (`Int`): The input value. `__init__[T: Floatable, //](value: T, /) -> SIMD[float64, 1]` Initialize a Float64 from a type conforming to Floatable. **Parameters:** * ​T (`Floatable`): The Floatable type. **Args:** * ​value (`T`): The object to get the floating-point representation of. `__init__[T: FloatableRaising, //](out self: SIMD[float64, 1], value: T, /)` Initialize a Float64 from a type conforming to FloatableRaising. **Parameters:** * ​T (`FloatableRaising`): The FloatableRaising type. **Args:** * ​value (`T`): The object to get the floating-point representation of. **Raises:** If the type does not have a floating-point representation. `__init__[*, _: Int = 0](out self: SIMD[float64, 1], value: PythonObject, /)` Initialize a Float64 from a PythonObject. **Parameters:** * ​\_ (`Int`): A dummy parameter to ensure this overload has lower priority than the others. Its value is ignored. **Args:** * ​value (`PythonObject`): The PythonObject to convert. **Raises:** If the conversion to double fails. `@implicit` `__init__(value: IntLiteral[value], /) -> Self` Initializes the SIMD vector with an integer. The integer value is splatted across all the elements of the SIMD vector. **Args:** * ​value (`IntLiteral[value]`): The input value. `@implicit` `__init__(value: Bool, /) -> SIMD[bool, size]` Initializes the SIMD vector with a bool value. The bool value is splatted across all elements of the SIMD vector.
**Args:** * ​value (`Bool`): The bool value. `@implicit` `__init__(value, /) -> Self` Initializes the SIMD vector with the underlying MLIR `simd` value. **Args:** * ​value: The input MLIR `simd` value. `@implicit` `__init__(value: SIMD[dtype, 1], /) -> Self` Constructs a SIMD vector by splatting a scalar value. The input value is splatted across all elements of the SIMD vector. **Args:** * ​value (`SIMD[dtype, 1]`): The value to splat to the elements of the vector. `__init__(*elems: SIMD[dtype, 1], *, __list_literal__: Tuple[] = Tuple()) -> Self` Constructs a SIMD vector via a variadic list of elements. The input values are assigned to the corresponding elements of the SIMD vector. **Constraints:** The number of input values is equal to size of the SIMD vector. **Args:** * ​\*elems (`SIMD[dtype, 1]`): The variadic list of elements from which the SIMD vector is constructed. * ​\_\_list\_literal\_\_ (`Tuple[]`): Tell Mojo to use this method for list literals. `@implicit` `__init__(value: FloatLiteral[value], /) -> Self` Initializes the SIMD vector with a float. The value is splatted across all the elements of the SIMD vector. **Args:** * ​value (`FloatLiteral[value]`): The input value. ### `__bool__` `__bool__(self) -> Bool` Converts the SIMD scalar into a boolean value. **Constraints:** The size of the SIMD vector must be 1. **Returns:** True if the SIMD scalar is non-zero and False otherwise. ### `__getitem__` `__getitem__(self, idx: Int) -> SIMD[dtype, 1]` Gets an element from the vector. **Args:** * ​idx (`Int`): The element index. **Returns:** The value at position `idx`. ### `__setitem__` `__setitem__(mut self, idx: Int, val: SIMD[dtype, 1])` Sets an element in the vector. **Args:** * ​idx (`Int`): The index to set. * ​val (`SIMD[dtype, 1]`): The value to set. ### `__neg__` `__neg__(self) -> Self` Defines the unary `-` operation. **Returns:** The negation of this SIMD vector. ### `__pos__` `__pos__(self) -> Self` Defines the unary `+` operation. **Returns:** This SIMD vector. ### `__invert__` `__invert__(self) -> Self` Returns `~self`. **Constraints:** The element type of the SIMD vector must be boolean or integral. **Returns:** The `~self` value. ### `__lt__` `__lt__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using less-than comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] < rhs[i]`. ### `__le__` `__le__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using less-than-or-equal comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] <= rhs[i]`. ### `__eq__` `__eq__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using equal-to comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] == rhs[i]`. ### `__ne__` `__ne__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using not-equal comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] != rhs[i]`.
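For example, a minimal sketch of the comparison operators above (illustrative values; the output comments reflect the documented `SIMD[bool, size]` result format):

```mojo
fn main():
    var a = SIMD[DType.int32, 4](1, 5, 3, 7)
    var b = SIMD[DType.int32, 4](2, 5, 1, 7)
    # Each comparison is applied lane by lane and yields a boolean mask,
    # not a single Bool.
    print(a == b)  # [False, True, False, True]
    print(a != b)  # [True, False, True, False]
    print(a < b)   # [True, False, False, False]
```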
### `__gt__` `__gt__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using greater-than comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] > rhs[i]`. ### `__ge__` `__ge__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using greater-than-or-equal comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] >= rhs[i]`. ### `__contains__` `__contains__(self, value: SIMD[dtype, 1]) -> Bool` Whether the vector contains the value. **Args:** * ​value (`SIMD[dtype, 1]`): The value. **Returns:** Whether the vector contains the value. ### `__add__` `__add__(self, rhs: Self) -> Self` Computes `self + rhs`. **Args:** * ​rhs (`Self`): The rhs value. **Returns:** A new vector whose element at position `i` is computed as `self[i] + rhs[i]`. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Computes `self - rhs`. **Args:** * ​rhs (`Self`): The rhs value. **Returns:** A new vector whose element at position `i` is computed as `self[i] - rhs[i]`. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Computes `self * rhs`. **Args:** * ​rhs (`Self`): The rhs value. **Returns:** A new vector whose element at position `i` is computed as `self[i] * rhs[i]`. ### `__truediv__` `__truediv__(self, rhs: Self) -> Self` Computes `self / rhs`. **Args:** * ​rhs (`Self`): The rhs value. **Returns:** A new vector whose element at position `i` is computed as `self[i] / rhs[i]`. ### `__floordiv__` `__floordiv__(self, rhs: Self) -> Self` Returns the division of self and rhs rounded down to the nearest integer. **Constraints:** The element type of the SIMD vector must be numeric. **Args:** * ​rhs (`Self`): The value to divide with. **Returns:** `floor(self / rhs)` value. ### `__mod__` `__mod__(self, rhs: Self) -> Self` Returns the remainder of self divided by rhs. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** The remainder of dividing self by rhs. ### `__pow__` `__pow__(self, exp: Int) -> Self` Computes the vector raised to the power of the input integer value. **Args:** * ​exp (`Int`): The exponent value. **Returns:** A SIMD vector where each element is raised to the power of the specified exponent value. `__pow__(self, exp: Self) -> Self` Computes the vector raised elementwise to the right hand side power. **Args:** * ​exp (`Self`): The exponent value. **Returns:** A SIMD vector where each element is raised to the power of the specified exponent value. ### `__lshift__` `__lshift__(self, rhs: Self) -> Self` Returns `self << rhs`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self << rhs`. ### `__rshift__` `__rshift__(self, rhs: Self) -> Self` Returns `self >> rhs`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self >> rhs`. ### `__and__` `__and__(self, rhs: Self) -> Self` Returns `self & rhs`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self & rhs`. ### `__or__` `__or__(self, rhs: Self) -> Self` Returns `self | rhs`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self | rhs`. ### `__xor__` `__xor__(self, rhs: Self) -> Self` Returns `self ^ rhs`.
**Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self ^ rhs`. ### `__radd__` `__radd__(self, value: Self) -> Self` Returns `value + self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value + self`. ### `__rsub__` `__rsub__(self, value: Self) -> Self` Returns `value - self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value - self`. ### `__rmul__` `__rmul__(self, value: Self) -> Self` Returns `value * self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value * self`. ### `__rtruediv__` `__rtruediv__(self, value: Self) -> Self` Returns `value / self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value / self`. ### `__rfloordiv__` `__rfloordiv__(self, rhs: Self) -> Self` Returns the division of rhs and self rounded down to the nearest integer. **Constraints:** The element type of the SIMD vector must be numeric. **Args:** * ​rhs (`Self`): The value to divide by self. **Returns:** `floor(rhs / self)` value. ### `__rmod__` `__rmod__(self, value: Self) -> Self` Returns `value mod self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value mod self`. ### `__rpow__` `__rpow__(self, base: Self) -> Self` Returns `base ** self`. **Args:** * ​base (`Self`): The base value. **Returns:** `base ** self`. ### `__rlshift__` `__rlshift__(self, value: Self) -> Self` Returns `value << self`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​value (`Self`): The other value. **Returns:** `value << self`. ### `__rrshift__` `__rrshift__(self, value: Self) -> Self` Returns `value >> self`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​value (`Self`): The other value. **Returns:** `value >> self`. ### `__rand__` `__rand__(self, value: Self) -> Self` Returns `value & self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​value (`Self`): The other value. **Returns:** `value & self`. ### `__ror__` `__ror__(self, value: Self) -> Self` Returns `value | self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​value (`Self`): The other value. **Returns:** `value | self`. ### `__rxor__` `__rxor__(self, value: Self) -> Self` Returns `value ^ self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​value (`Self`): The other value. **Returns:** `value ^ self`. ### `__iadd__` `__iadd__(mut self, rhs: Self)` Performs in-place addition. The vector is mutated where each element at position `i` is computed as `self[i] + rhs[i]`. **Args:** * ​rhs (`Self`): The rhs of the addition operation. ### `__isub__` `__isub__(mut self, rhs: Self)` Performs in-place subtraction. The vector is mutated where each element at position `i` is computed as `self[i] - rhs[i]`. **Args:** * ​rhs (`Self`): The rhs of the operation. ### `__imul__` `__imul__(mut self, rhs: Self)` Performs in-place multiplication. The vector is mutated where each element at position `i` is computed as `self[i] * rhs[i]`. **Args:** * ​rhs (`Self`): The rhs of the operation. ### `__itruediv__` `__itruediv__(mut self, rhs: Self)` In-place true divide operator. The vector is mutated where each element at position `i` is computed as `self[i] / rhs[i]`. **Args:** * ​rhs (`Self`): The rhs of the operation. ### `__ifloordiv__` `__ifloordiv__(mut self, rhs: Self)` In-place floor div operator. The vector is mutated where each element at position `i` is computed as `self[i] // rhs[i]`.
**Args:** * ​rhs (`Self`): The rhs of the operation. ### `__imod__` `__imod__(mut self, rhs: Self)` In-place mod operator. The vector is mutated where each element at position `i` is computed as `self[i] % rhs[i]`. **Args:** * ​rhs (`Self`): The rhs of the operation. ### `__ipow__` `__ipow__(mut self, rhs: Int)` In-place pow operator. The vector is mutated where each element at position `i` is computed as `pow(self[i], rhs)`. **Args:** * ​rhs (`Int`): The rhs of the operation. ### `__ilshift__` `__ilshift__(mut self, rhs: Self)` Computes `self << rhs` and saves the result in `self`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​rhs (`Self`): The RHS value. ### `__irshift__` `__irshift__(mut self, rhs: Self)` Computes `self >> rhs` and saves the result in `self`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​rhs (`Self`): The RHS value. ### `__iand__` `__iand__(mut self, rhs: Self)` Computes `self & rhs` and saves the result in `self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. ### `__ixor__` `__ixor__(mut self, rhs: Self)` Computes `self ^ rhs` and saves the result in `self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. ### `__ior__` `__ior__(mut self, rhs: Self)` Computes `self | rhs` and saves the result in `self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. ### `get_type_name` `static get_type_name() -> String` Gets this type's name, for use in error messages when handing arguments to kernels. TODO: This will go away soon, when we get better error messages for kernel calls. **Returns:** This type's name. ### `get_device_type_name` `static get_device_type_name() -> String` Gets device\_type's name, for use in error messages when handing arguments to kernels. TODO: This will go away soon, when we get better error messages for kernel calls. **Returns:** This type's name. ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value. ### `from_bits` `static from_bits[int_dtype: DType, //](value: SIMD[int_dtype, size]) -> Self` Initializes the SIMD vector from the bits of an integral SIMD vector. **Parameters:** * ​int\_dtype (`DType`): The integral type of the input SIMD vector. **Args:** * ​value (`SIMD[int_dtype, size]`): The SIMD vector to copy the bits from. **Returns:** The bitcast SIMD vector. ### `to_python_object` `to_python_object(owned self) -> PythonObject` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. ### `__len__` `__len__(self) -> Int` Gets the length of the SIMD vector. **Returns:** The length of the SIMD vector. ### `__int__` `__int__(self) -> Int` Casts the value to an Int. If there is a fractional component, then the fractional part is truncated. **Constraints:** The size of the SIMD vector must be 1. **Returns:** The value as an integer. ### `__index__` `__index__(self) -> index` Convert to index. **Returns:** The corresponding \_\_mlir\_type.index value. ### `__float__` `__float__(self) -> SIMD[float64, 1]` Casts the value to a float. **Constraints:** The size of the SIMD vector must be 1. **Returns:** The value as a float. ### `__str__` `__str__(self) -> String` Get the SIMD as a string. **Returns:** A string representation. ### `__repr__` `__repr__(self) -> String` Get the representation of the SIMD value e.g. "SIMD\[DType.int8, 2]\(1, 2)". **Returns:** The representation of the SIMD value.
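A short sketch of the conversion and formatting methods above (the `repr` output follows the format documented for `__repr__`):

```mojo
fn main():
    var v = SIMD[DType.int8, 2](1, 2)
    print(v)        # [1, 2]
    print(repr(v))  # SIMD[DType.int8, 2](1, 2)
    # __int__ requires a size-1 vector and truncates any fractional part.
    var x = Float64(42.9)
    print(Int(x))   # 42
```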
### `__floor__` `__floor__(self) -> Self` Performs elementwise floor on the elements of a SIMD vector. **Returns:** The elementwise floor of this SIMD vector. ### `__ceil__` `__ceil__(self) -> Self` Performs elementwise ceiling on the elements of a SIMD vector. **Returns:** The elementwise ceiling of this SIMD vector. ### `__trunc__` `__trunc__(self) -> Self` Performs elementwise truncation on the elements of a SIMD vector. **Returns:** The elementwise truncated values of this SIMD vector. ### `__abs__` `__abs__(self) -> Self` Defines the absolute value operation. **Returns:** The absolute value of this SIMD vector. ### `__round__` `__round__(self) -> Self` Performs elementwise rounding on the elements of a SIMD vector. This rounding goes to the nearest integer with ties away from zero. **Returns:** The elementwise rounded value of this SIMD vector. `__round__(self, ndigits: Int) -> Self` Performs elementwise rounding on the elements of a SIMD vector. This rounding goes to the nearest integer with ties away from zero. **Args:** * ​ndigits (`Int`): The number of digits to round to. **Returns:** The elementwise rounded value of this SIMD vector. ### `__hash__` `__hash__(self) -> UInt` Hash the value using builtin hash. **Returns:** A 64-bit hash value. This value is *not* suitable for cryptographic uses. Its intended usage is for data structures. See the `hash` builtin documentation for more details. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with this SIMD value. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `__ceildiv__` `__ceildiv__(self, denominator: Self) -> Self` Return the rounded-up result of dividing self by denominator. **Args:** * ​denominator (`Self`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. ### `cast` `cast[target: DType](self) -> SIMD[target, size]` Casts the elements of the SIMD vector to the target element type. Casting behavior:

```mojo
# Basic casting preserves value within range
Int8(UInt8(127)) == Int8(127)

# Numbers above signed max wrap to negative using two's complement
Int8(UInt8(128)) == Int8(-128)
Int8(UInt8(129)) == Int8(-127)
Int8(UInt8(256)) == Int8(0)

# Negative signed cast to unsigned using two's complement
UInt8(Int8(-128)) == UInt8(128)
UInt8(Int8(-127)) == UInt8(129)
UInt8(Int8(-1)) == UInt8(255)

# Truncate precision after downcast and upcast
Float64(Float32(Float64(123456789.123456789))) == Float64(123456792.0)

# Rightmost bits of significand become 0's on upcast
Float64(Float32(0.3)) == Float64(0.30000001192092896)

# Numbers equal after truncation of float literal and cast truncation
Float32(Float64(123456789.123456789)) == Float32(123456789.123456789)

# Float to int/uint floors
Int64(Float64(42.2)) == Int64(42)
```

**Parameters:** * ​target (`DType`): The target DType. **Returns:** A new SIMD vector whose elements have been cast to the target element type. ### `is_power_of_two` `is_power_of_two(self) -> SIMD[bool, size]` Checks if the input value is a power of 2 for each element of a SIMD vector. **Constraints:** The element type of the input vector must be integral. **Returns:** A SIMD value where the element at position `i` is True if the integer at position `i` of the input value is a power of 2, False otherwise. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this SIMD value to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait.
**Args:** * ​writer (`W`): The object to write to. ### `to_bits` `to_bits[int_dtype: DType = _integral_type_of[::DType]()](self) -> SIMD[int_dtype, size]` Bitcasts the SIMD vector to an integer SIMD vector. **Parameters:** * ​int\_dtype (`DType`): The integer type to cast to. **Returns:** An integer representation of the floating-point value. ### `from_bytes` `static from_bytes[big_endian: Bool = is_big_endian[__mlir_type.!kgen.target]()](bytes: InlineArray[SIMD[uint8, 1], dtype.sizeof()]) -> SIMD[dtype, 1]` Converts a byte array to a scalar integer. **Parameters:** * ​big\_endian (`Bool`): Whether the byte array is big-endian. **Args:** * ​bytes (`InlineArray[SIMD[uint8, 1], dtype.sizeof()]`): The byte array to convert. **Returns:** The integer value. ### `as_bytes` `as_bytes[big_endian: Bool = is_big_endian[__mlir_type.!kgen.target]()](self) -> InlineArray[SIMD[uint8, 1], dtype.sizeof()]` Convert the scalar integer to a byte array. **Parameters:** * ​big\_endian (`Bool`): Whether the byte array should be big-endian. **Returns:** The byte array. ### `clamp` `clamp(self, lower_bound: Self, upper_bound: Self) -> Self` Clamps the values in a SIMD vector to be in a certain range. Clamp cuts values in the input SIMD vector off at the upper bound and lower bound values. For example, SIMD vector `[0, 1, 2, 3]` clamped to a lower bound of 1 and an upper bound of 2 would return `[1, 1, 2, 2]`. **Args:** * ​lower\_bound (`Self`): Minimum of the range to clamp to. * ​upper\_bound (`Self`): Maximum of the range to clamp to. **Returns:** A new SIMD vector containing x clamped to be within lower\_bound and upper\_bound. ### `fma` `fma(self, multiplier: Self, accumulator: Self) -> Self` Performs a fused multiply-add operation, i.e. `self*multiplier + accumulator`. **Args:** * ​multiplier (`Self`): The value to multiply. * ​accumulator (`Self`): The value to accumulate. **Returns:** A new vector whose element at position `i` is computed as `self[i]*multiplier[i] + accumulator[i]`. ### `shuffle` `shuffle[*mask: Int](self) -> Self` Shuffles (also called blend) the values of the current vector with the `other` value using the specified mask (permutation). The mask values must be within `2 * len(self)`. **Parameters:** * ​\*mask (`Int`): The permutation to use in the shuffle. **Returns:** A new vector with the same length as the mask where the value at position `i` is `(self)[permutation[i]]`. `shuffle[*mask: Int](self, other: Self) -> Self` Shuffles (also called blend) the values of the current vector with the `other` value using the specified mask (permutation). The mask values must be within `2 * len(self)`. **Parameters:** * ​\*mask (`Int`): The permutation to use in the shuffle. **Args:** * ​other (`Self`): The other vector to shuffle with. **Returns:** A new vector with the same length as the mask where the value at position `i` is `(self + other)[permutation[i]]`. `shuffle[: DType, //, mask: IndexList[size, element_type=$0]](self) -> Self` Shuffles (also called blend) the values of the current vector with the `other` value using the specified mask (permutation). The mask values must be within `2 * len(self)`. **Parameters:** * ​mask (`IndexList[size, element_type=$0]`): The permutation to use in the shuffle. **Returns:** A new vector with the same length as the mask where the value at position `i` is `(self)[permutation[i]]`.
`shuffle[: DType, //, mask: IndexList[size, element_type=$0]](self, other: Self) -> Self` Shuffles (also called blend) the values of the current vector with the `other` value using the specified mask (permutation). The mask values must be within `2 * len(self)`. **Parameters:** * ​mask (`IndexList[size, element_type=$0]`): The permutation to use in the shuffle. **Args:** * ​other (`Self`): The other vector to shuffle with. **Returns:** A new vector with the same length as the mask where the value at position `i` is `(self + other)[permutation[i]]`. ### `slice` `slice[output_width: Int, /, *, offset: Int = 0](self) -> SIMD[dtype, output_width]` Returns a slice of the vector of the specified width with the given offset. **Constraints:** `output_width + offset` must not exceed the size of this SIMD vector. **Parameters:** * ​output\_width (`Int`): The output SIMD vector size. * ​offset (`Int`): The given offset for the slice. **Returns:** A new vector whose elements map to `self[offset:offset+output_width]`. ### `insert` `insert[*, offset: Int = 0](self, value: SIMD[dtype, size]) -> Self` Returns a new vector where the elements between `offset` and `offset + input_width` have been replaced with the elements in `value`. **Parameters:** * ​offset (`Int`): The offset to insert at. **Args:** * ​value (`SIMD[dtype, size]`): The value to be inserted. **Returns:** A new vector whose elements at `self[offset:offset+input_width]` contain the values of `value`. ### `join` `join(self, other: Self) -> SIMD[dtype, (size * 2)]` Concatenates the two vectors together. **Args:** * ​other (`Self`): The other SIMD vector. **Returns:** A new vector `self_0, self_1, ..., self_n, other_0, ..., other_n`. ### `interleave` `interleave(self, other: Self) -> SIMD[dtype, (size * 2)]` Constructs a vector by interleaving two input vectors. **Args:** * ​other (`Self`): The other SIMD vector. **Returns:** A new vector `self_0, other_0, ..., self_n, other_n`. ### `split` `split(self) -> Tuple[SIMD[dtype, size // 2], SIMD[dtype, size // 2]]` Splits the SIMD vector into 2 subvectors. **Returns:** A new vector `self_0:N/2, self_N/2:N`. ### `deinterleave` `deinterleave(self) -> Tuple[SIMD[dtype, size // 2], SIMD[dtype, size // 2]]` Constructs two vectors by deinterleaving the even and odd lanes of the vector. **Constraints:** The vector size must be greater than 1. **Returns:** Two vectors, the first of the form `self_0, self_2, ..., self_{n-2}` and the other being `self_1, self_3, ..., self_{n-1}`. ### `reduce` `reduce[func: fn[Int](SIMD[dtype, $0], SIMD[dtype, $0]) -> SIMD[dtype, $0], size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using a provided reduce operator. **Constraints:** `size_out` must not exceed width of the vector. **Parameters:** * ​func (`fn[Int](SIMD[dtype, $0], SIMD[dtype, $0]) -> SIMD[dtype, $0]`): The reduce function to apply to elements in this SIMD. * ​size\_out (`Int`): The width of the reduction. **Returns:** A new scalar which is the reduction of all vector elements.
`reduce[func: fn[Int](SIMD[dtype, $0], SIMD[dtype, $0]) capturing -> SIMD[dtype, $0], size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using a provided reduce operator. **Constraints:** `size_out` must not exceed width of the vector. **Parameters:** * ​func (`fn[Int](SIMD[dtype, $0], SIMD[dtype, $0]) capturing -> SIMD[dtype, $0]`): The reduce function to apply to elements in this SIMD. * ​size\_out (`Int`): The width of the reduction. **Returns:** A new scalar which is the reduction of all vector elements. ### `reduce_max` `reduce_max[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the `max` operator. **Constraints:** `size_out` must not exceed width of the vector. The element type of the vector must be integer or FP. **Parameters:** * ​size\_out (`Int`): The width of the reduction. **Returns:** The maximum element of the vector. ### `reduce_min` `reduce_min[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the `min` operator. **Constraints:** `size_out` must not exceed width of the vector. The element type of the vector must be integer or FP. **Parameters:** * ​size\_out (`Int`): The width of the reduction. **Returns:** The minimum element of the vector. ### `reduce_add` `reduce_add[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the `add` operator. **Constraints:** `size_out` must not exceed width of the vector. **Parameters:** * ​size\_out (`Int`): The width of the reduction. **Returns:** The sum of all vector elements. ### `reduce_mul` `reduce_mul[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the `mul` operator. **Constraints:** `size_out` must not exceed width of the vector. The element type of the vector must be integer or FP. **Parameters:** * ​size\_out (`Int`): The width of the reduction. **Returns:** The product of all vector elements. ### `reduce_and` `reduce_and[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the bitwise `&` operator. **Constraints:** `size_out` must not exceed width of the vector. The element type of the vector must be integer or boolean. **Parameters:** * ​size\_out (`Int`): The width of the reduction. **Returns:** The reduced vector. ### `reduce_or` `reduce_or[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the bitwise `|` operator. **Constraints:** `size_out` must not exceed width of the vector. The element type of the vector must be integer or boolean. **Parameters:** * ​size\_out (`Int`): The width of the reduction. **Returns:** The reduced vector. ### `reduce_bit_count` `reduce_bit_count(self) -> Int` Returns the total number of bits set in the SIMD vector. **Constraints:** Must be either an integral or a boolean type. **Returns:** Count of set bits across all elements of the vector. ### `select` `select[dtype: DType](self, true_case: SIMD[dtype, size], false_case: SIMD[dtype, size]) -> SIMD[dtype, size]` Selects the values of the `true_case` or the `false_case` based on the current boolean values of the SIMD vector. **Constraints:** The element type of the vector must be boolean. **Parameters:** * ​dtype (`DType`): The element type of the input and output SIMD vectors. **Args:** * ​true\_case (`SIMD[dtype, size]`): The values selected if the positional value is True. * ​false\_case (`SIMD[dtype, size]`): The values selected if the positional value is False. 
**Returns:** A new vector of the form `[true_case[i] if elem else false_case[i] for i, elem in enumerate(self)]`. ### `rotate_left` `rotate_left[shift: Int](self) -> Self` Shifts the elements of a SIMD vector to the left by `shift` elements (with wrap-around). **Constraints:** `-size <= shift < size` **Parameters:** * ​shift (`Int`): The number of positions by which to rotate the elements of SIMD vector to the left (with wrap-around). **Returns:** The SIMD vector rotated to the left by `shift` elements (with wrap-around). ### `rotate_right` `rotate_right[shift: Int](self) -> Self` Shifts the elements of a SIMD vector to the right by `shift` elements (with wrap-around). **Constraints:** `-size < shift <= size` **Parameters:** * ​shift (`Int`): The number of positions by which to rotate the elements of SIMD vector to the right (with wrap-around). **Returns:** The SIMD vector rotated to the right by `shift` elements (with wrap-around). ### `shift_left` `shift_left[shift: Int](self) -> Self` Shifts the elements of a SIMD vector to the left by `shift` elements (no wrap-around, fill with zero). **Constraints:** `0 <= shift <= size` **Parameters:** * ​shift (`Int`): The number of positions by which to rotate the elements of SIMD vector to the left (no wrap-around, fill with zero). **Returns:** The SIMD vector rotated to the left by `shift` elements (no wrap-around, fill with zero). ### `shift_right` `shift_right[shift: Int](self) -> Self` Shifts the elements of a SIMD vector to the right by `shift` elements (no wrap-around, fill with zero). **Constraints:** `0 <= shift <= size` **Parameters:** * ​shift (`Int`): The number of positions by which to rotate the elements of SIMD vector to the right (no wrap-around, fill with zero). **Returns:** The SIMD vector rotated to the right by `shift` elements (no wrap-around, fill with zero). ### `reversed` `reversed(self) -> Self` Reverses the SIMD vector by indexes. Examples:

```mojo
print(SIMD[DType.uint8, 4](1, 2, 3, 4).reversed())  # [4, 3, 2, 1]
```

**Returns:** The reversed vector. --- ## simd Implements SIMD primitives and abstractions. Provides high-performance SIMD primitives and abstractions for vectorized computation in Mojo. It enables efficient data-parallel operations by leveraging hardware vector processing units across different architectures. Key Features:

1. Architecture-agnostic SIMD abstractions with automatic hardware detection
2. Optimized vector operations for common numerical computations
3. Explicit control over vectorization strategies and memory layouts
4. Zero-cost abstractions that compile to efficient machine code
5. Support for different vector widths and element types

Primary Components:

* Vector types: Strongly-typed vector containers with element-wise operations
* SIMD intrinsics: Low-level access to hardware SIMD instructions
* Vectorized algorithms: Common algorithms optimized for SIMD execution
* Memory utilities: Aligned memory allocation and vector load/store operations

Performance Considerations:

* Vector width selection should match target hardware capabilities
* Memory alignment affects load/store performance
* Data layout transformations may be necessary for optimal vectorization

Integration: This module is designed to work seamlessly with other Mojo numerical computing components, including tensor operations, linear algebra routines, and domain-specific libraries for machine learning and scientific computing. ## Aliases ### `BFloat16` `alias BFloat16 = SIMD[bfloat16, 1]` Represents a 16-bit brain floating point value. ### `Byte` `alias Byte = SIMD[uint8, 1]` Represents a byte (backed by an 8-bit unsigned integer).
### `Float16` `alias Float16 = SIMD[float16, 1]` Represents a 16-bit floating point value. ### `Float32` `alias Float32 = SIMD[float32, 1]` Represents a 32-bit floating point value. ### `Float64` `alias Float64 = SIMD[float64, 1]` Represents a 64-bit floating point value. ### `Float8_e4m3fn` `alias Float8_e4m3fn = SIMD[float8_e4m3fn, 1]` Represents the E4M3 floating point format defined in the [OFP8 standard](https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1). This type is named differently across libraries and vendors, for example: * Mojo, PyTorch, JAX, and LLVM refer to it as `e4m3fn`. * OCP, NVIDIA CUDA, and AMD ROCm refer to it as `e4m3`. In these contexts, they are all referring to the same finite type specified in the OFP8 standard above, encoded as `seeeemmm`: * (s)ign: 1 bit * (e)xponent: 4 bits * (m)antissa: 3 bits * exponent bias: 7 * nan: 01111111, 11111111 * -0: 10000000 * fn: finite (no inf or -inf encodings) ### `Float8_e4m3fnuz` `alias Float8_e4m3fnuz = SIMD[float8_e4m3fnuz, 1]` Represents an 8-bit e4m3fnuz floating point format, encoded as `seeeemmm`: - (s)ign: 1 bit - (e)xponent: 4 bits - (m)antissa: 3 bits - exponent bias: 8 - nan: 10000000 - fn: finite (no inf or -inf encodings) - uz: unsigned zero (no -0 encoding) ### `Float8_e5m2` `alias Float8_e5m2 = SIMD[float8_e5m2, 1]` Represents the 8-bit E5M2 floating point format from the [OFP8 standard](https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1), encoded as `seeeeemm`: - (s)ign: 1 bit - (e)xponent: 5 bits - (m)antissa: 2 bits - exponent bias: 15 - nan: {0,1}11111{01,10,11} - inf: 01111100 - -inf: 11111100 - -0: 10000000 ### `Float8_e5m2fnuz` `alias Float8_e5m2fnuz = SIMD[float8_e5m2fnuz, 1]` Represents an 8-bit floating point format, encoded as `seeeeemm`: - (s)ign: 1 bit - (e)xponent: 5 bits - (m)antissa: 2 bits - exponent bias: 16 - nan: 10000000 - fn: finite (no inf or -inf encodings) - uz: unsigned zero (no -0 encoding) ### `Int128` `alias Int128 = SIMD[si128, 1]` Represents a 128-bit signed scalar integer. ### `Int16` `alias Int16 = SIMD[int16, 1]` Represents a 16-bit signed scalar integer. ### `Int256` `alias Int256 = SIMD[si256, 1]` Represents a 256-bit signed scalar integer. ### `Int32` `alias Int32 = SIMD[int32, 1]` Represents a 32-bit signed scalar integer. ### `Int64` `alias Int64 = SIMD[int64, 1]` Represents a 64-bit signed scalar integer. ### `Int8` `alias Int8 = SIMD[int8, 1]` Represents an 8-bit signed scalar integer. ### `Scalar` `alias Scalar = SIMD[?, 1]` Represents a scalar dtype. ### `U8x16` `alias U8x16 = SIMD[uint8, 16]` ### `UInt128` `alias UInt128 = SIMD[ui128, 1]` Represents a 128-bit unsigned scalar integer. ### `UInt16` `alias UInt16 = SIMD[uint16, 1]` Represents a 16-bit unsigned scalar integer. ### `UInt256` `alias UInt256 = SIMD[ui256, 1]` Represents a 256-bit unsigned scalar integer. ### `UInt32` `alias UInt32 = SIMD[uint32, 1]` Represents a 32-bit unsigned scalar integer. ### `UInt64` `alias UInt64 = SIMD[uint64, 1]` Represents a 64-bit unsigned scalar integer. ### `UInt8` `alias UInt8 = SIMD[uint8, 1]` Represents an 8-bit unsigned scalar integer. ## Structs * [​`SIMD`](/mojo/stdlib/builtin/simd/SIMD): Represents a small vector that is backed by a hardware vector element. --- ## sort Implements the built-in `sort` function. These are Mojo built-ins, so you don't need to import them. 
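For example, a minimal sketch of an in-place sort (this assumes the usual implicit conversion from `List` to a mutable `Span`, which is how these overloads are typically invoked):

```mojo
fn main():
    var numbers = List[Int](9, 3, 7, 1)
    # sort mutates the underlying storage in place and returns nothing.
    sort(numbers)
    print(numbers[0], numbers[1], numbers[2], numbers[3])  # 1 3 7 9
```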
## Aliases ### `insertion_sort_threshold` `alias insertion_sort_threshold = 32` ## Functions * [​`partition`](/mojo/stdlib/builtin/sort/partition): Partition the input buffer in place such that the first k elements are the largest (or smallest if cmp\_fn is `<` operator). --- ## partition `partition[: origin.set, T: Copyable & Movable, origin: MutableOrigin, //, cmp_fn: fn(T, T) capturing -> Bool](span: Span[T, origin], k: Int)` Partition the input buffer in place such that the first k elements are the largest (or smallest if cmp\_fn is `<` operator). **Parameters:** * ​T (`Copyable & Movable`): Type of the underlying data. * ​origin (`MutableOrigin`): Origin of span. * ​cmp\_fn (`fn(T, T) capturing -> Bool`): Comparison functor of (T, T) capturing \[\_] -> Bool type. **Args:** * ​span (`Span[T, origin]`): Input buffer. * ​k (`Int`): Index of the partition element. --- ## sort `sort[: origin.set, T: Copyable & Movable, origin: MutableOrigin, //, cmp_fn: fn(T, T) capturing -> Bool, *, stable: Bool = False](span: Span[T, origin])` Sort the list in place. The function doesn't return anything; the list is updated in place. **Parameters:** * ​T (`Copyable & Movable`): Copyable & Movable type of the underlying data. * ​origin (`MutableOrigin`): Origin of span. * ​cmp\_fn (`fn(T, T) capturing -> Bool`): The comparison function. * ​stable (`Bool`): Whether the sort should be stable. **Args:** * ​span (`Span[T, origin]`): The span to be sorted. `sort[: origin.set, origin: MutableOrigin, //, cmp_fn: fn(Int, Int) capturing -> Bool, *, stable: Bool = False](span: Span[Int, origin])` Sort the list in place. The function doesn't return anything; the list is updated in place. **Parameters:** * ​origin (`MutableOrigin`): Origin of span. * ​cmp\_fn (`fn(Int, Int) capturing -> Bool`): The comparison function. * ​stable (`Bool`): Whether the sort should be stable. **Args:** * ​span (`Span[Int, origin]`): The span to be sorted. `sort[origin: MutableOrigin, //, *, stable: Bool = False](span: Span[Int, origin])` Sort the list in place. The function doesn't return anything; the list is updated in place. **Parameters:** * ​origin (`MutableOrigin`): Origin of span. * ​stable (`Bool`): Whether the sort should be stable. **Args:** * ​span (`Span[Int, origin]`): The span to be sorted. `sort[dtype: DType, origin: MutableOrigin, //, *, stable: Bool = False](span: Span[SIMD[dtype, 1], origin])` Sort the list in place. The function doesn't return anything; the list is updated in place. **Parameters:** * ​dtype (`DType`): The `DType` of the underlying data. * ​origin (`MutableOrigin`): Origin of span. * ​stable (`Bool`): Whether the sort should be stable. **Args:** * ​span (`Span[SIMD[dtype, 1], origin]`): The span to be sorted. `sort[T: Copyable & Movable & EqualityComparable & LessThanComparable & GreaterThanComparable & LessThanOrEqualComparable & GreaterThanOrEqualComparable, origin: MutableOrigin, //, *, stable: Bool = False](span: Span[T, origin])` Sort the list of order-comparable elements in place. **Parameters:** * ​T (`Copyable & Movable & EqualityComparable & LessThanComparable & GreaterThanComparable & LessThanOrEqualComparable & GreaterThanOrEqualComparable`): The order comparable collection element type. * ​origin (`MutableOrigin`): Origin of span. * ​stable (`Bool`): Whether the sort should be stable. **Args:** * ​span (`Span[T, origin]`): The span to be sorted. --- ## Stringable The `Stringable` trait describes a type that can be converted to a [`String`](/mojo/stdlib/collections/string/String).
Any type that conforms to `Stringable` or [`StringableRaising`](/mojo/stdlib/builtin/str/StringableRaising) works with the built-in [`print()`](/mojo/stdlib/builtin/io/print) and [`String()`](/mojo/stdlib/builtin/str/str) functions. The `Stringable` trait requires the type to define the `__str__()` method. For example:

```mojo
struct Foo(Stringable):
    var s: String

    fn __init__(out self, s: String):
        self.s = s

    fn __str__(self) -> String:
        return self.s
```

Now you can pass an instance of `Foo` to the `String()` function to get back a `String`:

```mojo
var foo = Foo("test")
print(String(foo) == "test")
```

```plaintext
True
```

**Note:** If the `__str__()` method might raise an error, use the [`StringableRaising`](/mojo/stdlib/builtin/str/StringableRaising) trait instead. About the difference between `__repr__()` and `__str__()`: The method `__repr__` computes the "official" string representation of an object while `__str__` computes the "informal" or nicely printable string representation of an object. This method differs from `__repr__()` in that there is no expectation that `__str__()` return a valid Mojo expression: a more convenient or concise representation can be used. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__str__` `__str__(self: _Self) -> String` Get the string representation of the type. **Returns:** The string representation of the type. --- ## StringableRaising The StringableRaising trait describes a type that can be converted to a [`String`](/mojo/stdlib/collections/string/String). Any type that conforms to [`Stringable`](/mojo/stdlib/builtin/str/Stringable) or `StringableRaising` works with the built-in [`print()`](/mojo/stdlib/builtin/io/print) and [`String()`](/mojo/stdlib/builtin/str/str) functions. The `StringableRaising` trait requires the type to define the `__str__()` method, which can raise an error. For example:

```mojo
struct Foo(StringableRaising):
    var s: String

    fn __init__(out self, s: String):
        self.s = s

    fn __str__(self) raises -> String:
        if self.s == "":
            raise Error("Empty String")
        return self.s
```

Now you can pass an instance of `Foo` to the `String()` function to get back a `String`:

```mojo
fn main() raises:
    var foo = Foo("test")
    print(String(foo) == "test")
```

```plaintext
True
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__str__` `__str__(self: _Self) -> String` Get the string representation of the type. **Returns:** The string representation of the type. **Raises:** If there is an error when computing the string representation of the type. --- ## str Provides the `str` function. These are Mojo built-ins, so you don't need to import them. ## Traits * [​`Stringable`](/mojo/stdlib/builtin/str/Stringable): The `Stringable` trait describes a type that can be converted to a [`String`](/mojo/stdlib/collections/string/String). * [​`StringableRaising`](/mojo/stdlib/builtin/str/StringableRaising): The StringableRaising trait describes a type that can be converted to a [`String`](/mojo/stdlib/collections/string/String). --- ## StringLiteral `@register_passable(trivial)` `struct StringLiteral[value: string]` This type represents a string literal. String literals are all null-terminated for compatibility with C APIs, but this is subject to change. String literals store their length as an integer, and this does not include the null terminator. ## Parameters * ​value (`string`): The underlying string value.
## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `FloatableRaising`, `IntableRaising`, `Movable`, `PathLike`, `PythonConvertible`, `Representable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Default constructor. ### `__bool__` `__bool__(self) -> Bool` Convert the string to a bool value. **Returns:** True if the string is not empty. ### `__getitem__` `__getitem__[IndexerType: Indexer](self, idx: IndexerType) -> String` Gets the character at the specified position. **Parameters:** * ​IndexerType (`Indexer`): The inferred type of an indexer argument. **Args:** * ​idx (`IndexerType`): The index value. **Returns:** A new string containing the character at the specified position. ### `__lt__` `__lt__(self, rhs: StringSlice[origin]) -> Bool` Compare this value to the RHS using lesser than (LT) comparison. **Args:** * ​rhs (`StringSlice[origin]`): The other value to compare against. **Returns:** True if this is strictly less than the RHS and False otherwise. ### `__le__` `__le__(self, rhs: StringSlice[origin]) -> Bool` Compare this value to the RHS using lesser than or equal to (LE) comparison. **Args:** * ​rhs (`StringSlice[origin]`): The other value to compare against. **Returns:** True if this is less than or equal to the RHS and False otherwise. ### `__eq__` `__eq__(self, rhs: StringSlice[origin]) -> Bool` Compare two string literals for equality. **Args:** * ​rhs (`StringSlice[origin]`): The string to compare. **Returns:** True if they are equal. ### `__ne__` `__ne__(self, rhs: StringSlice[origin]) -> Bool` Compare two string literals for inequality. **Args:** * ​rhs (`StringSlice[origin]`): The string to compare. **Returns:** True if they are not equal. ### `__gt__` `__gt__(self, rhs: StringSlice[origin]) -> Bool` Compare this value to the RHS using greater than (GT) comparison. **Args:** * ​rhs (`StringSlice[origin]`): The other value to compare against. **Returns:** True if this is strictly greater than the RHS and False otherwise. ### `__ge__` `__ge__(self, rhs: StringSlice[origin]) -> Bool` Compare this value to the RHS using greater than or equal to (GE) comparison. **Args:** * ​rhs (`StringSlice[origin]`): The other value to compare against. **Returns:** True if this is greater than or equal to the RHS and False otherwise. ### `__add__` `__add__(self, rhs: StringLiteral[value]) -> StringLiteral[#pop.string_concat]` Concatenate two string literals. **Args:** * ​rhs (`StringLiteral[value]`): The string to concatenate. **Returns:** The concatenated string. ### `__mul__` `__mul__(self, n: Int) -> String` Concatenates the string `n` times. **Args:** * ​n (`Int`): The number of times to concatenate the string. **Returns:** The string concatenated `n` times. ### `copy` `copy(self) -> Self` Copy constructor. **Returns:** A copy of the value. ### `to_python_object` `to_python_object(owned self) -> PythonObject` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. ### `__len__` `__len__(self) -> Int` Get the string length. **Returns:** The length of this value. ### `__int__` `__int__(self) -> Int` Parses the given string as a base-10 integer and returns that value. If the string cannot be parsed as an int, an error is raised. **Returns:** An integer value that represents the string, or otherwise raises. ### `__float__` `__float__(self) -> SIMD[float64, 1]` Parses the string as a floating-point number and returns that value.
If the string cannot be parsed as a float, an error is raised. **Returns:** A float value that represents the string, or otherwise raises. ### `__str__` `__str__(self) -> String` Convert the string literal to a string. **Returns:** A new string. ### `__repr__` `__repr__(self) -> String` Return a representation of this value. You don't need to call this method directly; use `repr("...")` instead. **Returns:** A new representation of the string. ### `__fspath__` `__fspath__(self) -> String` Return the file system path representation of the object. **Returns:** The file system path representation as a string. ### `__iter__` `__iter__(self) -> CodepointSliceIter[StaticConstantOrigin]` Return an iterator over the string literal. **Returns:** An iterator over the string. ### `__reversed__` `__reversed__(self) -> CodepointSliceIter[StaticConstantOrigin, False]` Iterate backwards over the string, returning immutable references. **Returns:** A reversed iterator over the string. ### `__merge_with__` `__merge_with__[: string, //, other_type: AnyStruct[StringLiteral[$0]]](self) -> StringSlice[StaticConstantOrigin]` Returns a StaticString after merging with another string literal. **Parameters:** * ​other\_type (`AnyStruct[StringLiteral[$0]]`): The type of the string literal to merge with. **Returns:** A StaticString after merging with the specified `other_type`. ### `byte_length` `byte_length(self) -> Int` Get the string length in bytes. Notes: This does not include the trailing null terminator in the count. **Returns:** The length of this string in bytes. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[SIMD[uint8, 1], mut=False, origin=StaticConstantOrigin]` Get raw pointer to the underlying data. **Returns:** The raw pointer to the data. ### `unsafe_cstr_ptr` `unsafe_cstr_ptr(self) -> UnsafePointer[SIMD[int8, 1], mut=False, origin=StaticConstantOrigin]` Retrieves a C-string-compatible pointer to the underlying memory. The returned pointer is guaranteed to be NUL terminated, and not null. **Returns:** The pointer to the underlying memory. ### `as_string_slice` `as_string_slice(self) -> StringSlice[StaticConstantOrigin]` Returns a string slice of this static string literal. **Returns:** A string slice pointing to this static string literal. ### `as_bytes` `as_bytes(self) -> Span[SIMD[uint8, 1], StaticConstantOrigin]` Returns a contiguous Span of the bytes owned by this string. **Returns:** A contiguous slice pointing to the bytes owned by this string. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this string literal to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `find` `find(self, substr: StringSlice[StaticConstantOrigin], start: Int = 0) -> Int` Finds the offset of the first occurrence of `substr` starting at `start`. If not found, returns -1. **Args:** * ​substr (`StringSlice[StaticConstantOrigin]`): The substring to find. * ​start (`Int`): The offset from which to find. **Returns:** The offset of `substr` relative to the beginning of the string. ### `rfind` `rfind(self, substr: StringSlice[StaticConstantOrigin], start: Int = 0) -> Int` Finds the offset of the last occurrence of `substr` starting at `start`. If not found, returns -1. **Args:** * ​substr (`StringSlice[StaticConstantOrigin]`): The substring to find. * ​start (`Int`): The offset from which to find. **Returns:** The offset of `substr` relative to the beginning of the string.
### `count` `count(self, substr: StringSlice[origin]) -> Int` Return the number of non-overlapping occurrences of substring `substr` in the string literal. If `substr` is empty, returns the number of empty strings between characters, which is the length of the string plus one. **Args:** * ​substr (`StringSlice[origin]`): The substring to count. **Returns:** The number of occurrences of `substr`. ### `lower` `lower(self) -> String` Returns a copy of the string literal with all cased characters converted to lowercase. **Returns:** A new string where cased letters have been converted to lowercase. ### `upper` `upper(self) -> String` Returns a copy of the string literal with all cased characters converted to uppercase. **Returns:** A new string where cased letters have been converted to uppercase. ### `rjust` `rjust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = " ") -> String` Returns the string right-justified in a string literal of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The right-justified string, or `self` if `width` is not greater than the string length. ### `ljust` `ljust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = " ") -> String` Returns the string left-justified in a string literal of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The left-justified string, or `self` if `width` is not greater than the string length. ### `center` `center(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = " ") -> String` Returns the string center-justified in a string literal of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The center-justified string, or `self` if `width` is not greater than the string length. ### `startswith` `startswith(self, prefix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool` Checks if the string literal starts with the specified prefix between start and end positions. Returns True if found and False otherwise. **Args:** * ​prefix (`StringSlice[origin]`): The prefix to check. * ​start (`Int`): The start offset from which to check. * ​end (`Int`): The end offset from which to check. **Returns:** True if the `self[start:end]` is prefixed by the input prefix. ### `endswith` `endswith(self, suffix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool` Checks if the string literal ends with the specified suffix between start and end positions. Returns True if found and False otherwise. **Args:** * ​suffix (`StringSlice[origin]`): The suffix to check. * ​start (`Int`): The start offset from which to check. * ​end (`Int`): The end offset from which to check. **Returns:** True if the `self[start:end]` is suffixed by the input suffix.
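A small usage sketch of these query methods; it assumes literal arguments coerce to `StringSlice` as usual:

```mojo
fn main():
    alias s = "hello world"
    print(s.upper())             # HELLO WORLD
    print(s.find("world"))       # 6 (byte offset; -1 if absent)
    print(s.count("l"))          # 3
    print(s.startswith("hello")) # True
```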
### `isdigit` `isdigit(self) -> Bool` Returns True if all characters in the string literal are digits. Note that this currently only works with ASCII strings. **Returns:** True if all characters are digits else False. ### `isupper` `isupper(self) -> Bool` Returns True if all cased characters in the string literal are uppercase and there is at least one cased character. Note that this currently only works with ASCII strings. **Returns:** True if all cased characters in the string literal are uppercase and there is at least one cased character, False otherwise. ### `islower` `islower(self) -> Bool` Returns True if all cased characters in the string literal are lowercase and there is at least one cased character. Note that this currently only works with ASCII strings. **Returns:** True if all cased characters in the string literal are lowercase and there is at least one cased character, False otherwise. ### `strip` `strip(self) -> String` Return a copy of the string literal with leading and trailing whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. **Returns:** A string with no leading or trailing whitespaces. `strip(self, chars: StringSlice[origin]) -> String` Return a copy of the string literal with leading and trailing characters removed. **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. **Returns:** A string with no leading or trailing characters. ### `rstrip` `rstrip(self, chars: StringSlice[origin]) -> String` Return a copy of the string literal with trailing characters removed. **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. **Returns:** A string with no trailing characters. `rstrip(self) -> String` Return a copy of the string with trailing whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. **Returns:** A copy of the string with no trailing whitespaces. ### `lstrip` `lstrip(self, chars: StringSlice[origin]) -> String` Return a copy of the string with leading characters removed. **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. **Returns:** A copy of the string with no leading characters. `lstrip(self) -> String` Return a copy of the string with leading whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. **Returns:** A copy of the string with no leading whitespaces. --- ## string_literal Implements the StringLiteral struct. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`StringLiteral`](/mojo/stdlib/builtin/string_literal/StringLiteral): This type represents a string literal. --- ## swap Implements the built-in `swap` function. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`swap`](/mojo/stdlib/builtin/swap/swap): Swaps the two given arguments. --- ## swap `swap[T: Movable](mut lhs: T, mut rhs: T)` Swaps the two given arguments. **Parameters:** * ​T (`Movable`): Constrained to Movable types. **Args:** * ​lhs (`T`): Argument value swapped with rhs. * ​rhs (`T`): Argument value swapped with lhs. --- ## Tuple `struct Tuple[*element_types: Copyable & Movable]` The type of a literal tuple expression. A tuple consists of zero or more values, separated by commas. ## Parameters * ​\*element\_types (`Copyable & Movable`): The elements type. ## Fields * ​storage (`!kgen.pack> element_types>`): The underlying storage for the tuple. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self: Tuple[])` Construct an empty tuple. `__init__(out self, owned *args: *element_types)` Construct the tuple. **Args:** * ​\*args (`*element_types`): Initial values. `__init__(out self, *, owned storage: VariadicPack[is_owned, origin, Copyable & Movable, element_types])` Construct the tuple from a low-level internal representation. **Args:** * ​storage (`VariadicPack[is_owned, origin, Copyable & Movable, element_types]`): The variadic pack storage to construct from.
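A brief construction sketch; note that the element index must be a compile-time parameter, so `t[0]` is resolved statically:

```mojo
fn main():
    # Element types are inferred: Tuple[Int, String, Float64].
    var t = Tuple(1, String("two"), 3.5)
    print(t[0])    # 1
    print(len(t))  # 3
```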
### `__copyinit__` `__copyinit__(out self, existing: Self)` Copy construct the tuple. **Args:** * ​existing (`Self`): The value to copy from. ### `__moveinit__` `__moveinit__(out self, owned existing: Self)` Move construct the tuple. **Args:** * ​existing (`Self`): The value to move from. ### `__del__` `__del__(owned self)` Destructor that destroys all of the elements. ### `__getitem__` `__getitem__[idx: Int](ref self) -> ref [self] element_types[idx.value]` Get a reference to an element in the tuple. **Parameters:** * ​idx (`Int`): The element to return. **Returns:** A reference to the specified element. ### `__contains__` `__contains__[T: EqualityComparable & Copyable & Movable](self, value: T) -> Bool` Return whether the tuple contains the specified value. For example:

```mojo
var t = Tuple(True, 1, 2.5)
if 1 in t:
    print("t contains 1")
```

**Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the value. **Args:** * ​value (`T`): The value to search for. **Returns:** True if the value is in the tuple, False otherwise. ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value. ### `__len__` `static __len__() -> Int` Return the number of elements in the tuple. **Returns:** The tuple length. `__len__(self) -> Int` Get the number of elements in the tuple. **Returns:** The tuple length. --- ## tuple Implements the Tuple type. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`Tuple`](/mojo/stdlib/builtin/tuple/Tuple): The type of a literal tuple expression. --- ## Origin `@register_passable(trivial)` `struct Origin[mut: Bool]` This represents an origin reference for a memory value. ## Parameters * ​mut (`Bool`): Whether the origin is mutable. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `cast_from` `alias cast_from = _lit_mut_cast[mut, ?]` Cast an existing Origin to be of the specified mutability. This is a low-level way to coerce Origin mutability. This should be used rarely, typically when building low-level fundamental abstractions. Strongly consider alternatives before reaching for this "escape hatch". Safety: This is an UNSAFE operation if used to cast an immutable origin to a mutable origin. Examples: Cast a mutable origin to be immutable:

```mojo
struct Container[mut: Bool, //, origin: Origin[mut]]:
    var data: Int

    fn imm_borrow(self) -> Container[ImmutableOrigin.cast_from[origin].result]:
        # ...
```

### `empty` `alias empty = {}` An empty `__origin_of()` of the given mutability. The empty origin is guaranteed not to alias any existing origins. --- ## type_aliases Defines some type aliases. These are Mojo built-ins, so you don't need to import them. ## Aliases ### `AnyTrivialRegType` `alias AnyTrivialRegType = AnyTrivialRegType` Represents any register passable Mojo data type. ### `ImmutableAnyOrigin` `alias ImmutableAnyOrigin = ImmutableAnyOrigin` The immutable origin that might access any memory value. ### `ImmutableOrigin` `alias ImmutableOrigin = ImmutableOrigin` Immutable origin reference type. ### `MutableAnyOrigin` `alias MutableAnyOrigin = MutableAnyOrigin` The mutable origin that might access any memory value.
### `MutableOrigin` `alias MutableOrigin = MutableOrigin` Mutable origin reference type. ### `OriginSet` `alias OriginSet = origin.set` A set of origin parameters. ### `StaticConstantOrigin` `alias StaticConstantOrigin = StaticConstantOrigin` An origin for strings and other always-immutable static constants. ## Structs * [​`Origin`](/mojo/stdlib/builtin/type_aliases/Origin): This represents an origin reference for a memory value. --- ## UInt `@register_passable(trivial)` `struct UInt` This type represents an unsigned integer. The size of this unsigned integer is platform-dependent. If you wish to use a fixed size unsigned integer, consider using `UInt8`, `UInt16`, `UInt32`, or `UInt64`. ## Fields * ​value (`index`): The underlying storage for the integer value. Note that it is the same type as the `Int.value` field. MLIR doesn't differentiate between signed and unsigned integers when it comes to storing them with the index dialect. The difference is in the operations that are performed on them, which have signed and unsigned variants. ## Implemented traits `Absable`, `AnyType`, `Boolable`, `CeilDivable`, `Copyable`, `Defaultable`, `EqualityComparable`, `ExplicitlyCopyable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `Hashable`, `Indexer`, `Intable`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Aliases ### `BITWIDTH` `alias BITWIDTH` The bit width of the integer type (platform-dependent). ### `MAX` `alias MAX` Returns the maximum integer value for the platform-dependent bit width (all bits set). ### `MIN` `alias MIN = UInt(0)` Returns the minimum value of the type. ## Methods ### `__init__` `__init__() -> Self` Default constructor that produces zero. `@implicit` `__init__(value: IntLiteral[value]) -> Self` Construct UInt from the given IntLiteral value. **Args:** * ​value (`IntLiteral[value]`): The init value. `@implicit` `__init__(value: Int) -> Self` Construct UInt from the given Int value. **Args:** * ​value (`Int`): The init value. `__init__[T: Indexer](value: T) -> Self` Construct UInt from the given Indexable value. **Parameters:** * ​T (`Indexer`): The type that can index into a collection or pointer. **Args:** * ​value (`T`): The init value. ### `__bool__` `__bool__(self) -> Bool` Convert this UInt to Bool. **Returns:** False Bool value if the value is equal to 0 and True otherwise. ### `__pos__` `__pos__(self) -> Self` Return +self. **Returns:** The +self value. ### `__invert__` `__invert__(self) -> Self` Return \~self. **Returns:** The \~self value. ### `__lt__` `__lt__(self, rhs: Self) -> Bool` Return whether this UInt is strictly less than another. **Args:** * ​rhs (`Self`): The other UInt to compare against. **Returns:** True if this UInt is less than the other UInt and False otherwise. ### `__le__` `__le__(self, rhs: Self) -> Bool` Compare this UInt to the RHS using LE comparison. **Args:** * ​rhs (`Self`): The other UInt to compare against. **Returns:** True if this UInt is less than or equal to the RHS UInt and False otherwise. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compare this UInt to the RHS using EQ comparison. **Args:** * ​rhs (`Self`): The other UInt to compare against. **Returns:** True if this UInt is equal to the RHS UInt and False otherwise. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Compare this UInt to the RHS using NE comparison.
**Args:** * ​rhs (`Self`): The other UInt to compare against. **Returns:** True if this UInt is non-equal to the RHS UInt and False otherwise. ### `__gt__` `__gt__(self, rhs: Self) -> Bool` Return whether this UInt is strictly greater than another. **Args:** * ​rhs (`Self`): The other UInt to compare against. **Returns:** True if this UInt is greater than the other UInt and False otherwise. ### `__ge__` `__ge__(self, rhs: Self) -> Bool` Return whether this UInt is greater than or equal to another. **Args:** * ​rhs (`Self`): The other UInt to compare against. **Returns:** True if this UInt is greater than or equal to the other UInt and False otherwise. ### `__add__` `__add__(self, rhs: Self) -> Self` Return `self + rhs`. **Args:** * ​rhs (`Self`): The value to add. **Returns:** `self + rhs` value. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Return `self - rhs`. **Args:** * ​rhs (`Self`): The value to subtract. **Returns:** `self - rhs` value. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Return `self * rhs`. **Args:** * ​rhs (`Self`): The value to multiply with. **Returns:** `self * rhs` value. ### `__truediv__` `__truediv__(self, rhs: Self) -> SIMD[float64, 1]` Return the floating point division of `self` and `rhs`. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** `Float64(self)/Float64(rhs)` value. ### `__floordiv__` `__floordiv__(self, rhs: Self) -> Self` Return the division of `self` and `rhs` rounded down to the nearest integer. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** `floor(self/rhs)` value. ### `__mod__` `__mod__(self, rhs: Self) -> Self` Return the remainder of self divided by rhs. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** The remainder of dividing self by rhs. ### `__pow__` `__pow__(self, exp: Self) -> Self` Return the value raised to the power of the given exponent. Computes the power of an integer using the Russian Peasant Method. **Args:** * ​exp (`Self`): The exponent value. **Returns:** The value of `self` raised to the power of `exp`. ### `__lshift__` `__lshift__(self, rhs: Self) -> Self` Return `self << rhs`. **Args:** * ​rhs (`Self`): The value to shift with. **Returns:** `self << rhs`. ### `__rshift__` `__rshift__(self, rhs: Self) -> Self` Return `self >> rhs`. **Args:** * ​rhs (`Self`): The value to shift with. **Returns:** `self >> rhs`. ### `__and__` `__and__(self, rhs: Self) -> Self` Return `self & rhs`. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self & rhs`. ### `__or__` `__or__(self, rhs: Self) -> Self` Return `self | rhs`. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self | rhs`. ### `__xor__` `__xor__(self, rhs: Self) -> Self` Return `self ^ rhs`. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self ^ rhs`. ### `__radd__` `__radd__(self, value: Self) -> Self` Return `value + self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value + self`. ### `__rsub__` `__rsub__(self, value: Self) -> Self` Return `value - self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value - self`. ### `__rmul__` `__rmul__(self, value: Self) -> Self` Return `value * self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value * self`. ### `__rfloordiv__` `__rfloordiv__(self, value: Self) -> Self` Return `value // self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value // self`. ### `__rmod__` `__rmod__(self, value: Self) -> Self` Return `value % self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value % self`.
### `__rpow__` `__rpow__(self, value: Self) -> Self` Return `pow(value,self)`. **Args:** * ​value (`Self`): The other value. **Returns:** `pow(value,self)`. ### `__rlshift__` `__rlshift__(self, value: Self) -> Self` Return `value << self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value << self`. ### `__rrshift__` `__rrshift__(self, value: Self) -> Self` Return `value >> self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value >> self`. ### `__rand__` `__rand__(self, value: Self) -> Self` Return `value & self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value & self`. ### `__ror__` `__ror__(self, value: Self) -> Self` Return `value | self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value | self`. ### `__rxor__` `__rxor__(self, value: Self) -> Self` Return `value ^ self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value ^ self`. ### `__iadd__` `__iadd__(mut self, rhs: Self)` Compute `self + rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__isub__` `__isub__(mut self, rhs: Self)` Compute `self - rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__imul__` `__imul__(mut self, rhs: Self)` Compute `self * rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__itruediv__` `__itruediv__(mut self, rhs: Self)` Compute `self / rhs`, convert to int, and save the result in self. Since `floor(self / rhs)` is equivalent to `self // rhs`, this yields the same as `__ifloordiv__`. **Args:** * ​rhs (`Self`): The RHS value. ### `__ifloordiv__` `__ifloordiv__(mut self, rhs: Self)` Compute `self // rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__imod__` `__imod__(mut self, rhs: Self)` Compute `self % rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__ipow__` `__ipow__(mut self, rhs: Self)` Compute `pow(self, rhs)` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__ilshift__` `__ilshift__(mut self, rhs: Self)` Compute `self << rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__irshift__` `__irshift__(mut self, rhs: Self)` Compute `self >> rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__iand__` `__iand__(mut self, rhs: Self)` Compute `self & rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__ixor__` `__ixor__(mut self, rhs: Self)` Compute `self ^ rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__ior__` `__ior__(mut self, rhs: Self)` Compute `self | rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__divmod__` `__divmod__(self, rhs: Self) -> Tuple[UInt, UInt]` Computes both the quotient and remainder using integer division. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** The quotient and remainder as a `Tuple(self // rhs, self % rhs)`. ### `__index__` `__index__(self) -> index` Convert to index. **Returns:** The corresponding \_\_mlir\_type.index value. ### `__int__` `__int__(self) -> Int` Gets the integral value, wrapping to a negative number on overflow. **Returns:** The value as an integer. ### `__abs__` `__abs__(self) -> Self` Return the absolute value of the UInt value. **Returns:** The absolute value. ### `__ceil__` `__ceil__(self) -> Self` Return the ceiling of the UInt value, which is itself. **Returns:** The UInt value itself. ### `__floor__` `__floor__(self) -> Self` Return the floor of the UInt value, which is itself.
**Returns:** The UInt value itself. ### `__round__` `__round__(self) -> Self` Return the rounded value of the UInt value, which is itself. **Returns:** The UInt value itself. `__round__(self, ndigits: Self) -> Self` Return the rounded value of the UInt value, which is itself. **Args:** * ​ndigits (`Self`): The number of digits to round to. **Returns:** The UInt value itself if ndigits >= 0 else the rounded value. ### `__trunc__` `__trunc__(self) -> Self` Return the truncated UInt value, which is itself. **Returns:** The UInt value itself. ### `__ceildiv__` `__ceildiv__(self, denominator: Self) -> Self` Return the rounded-up result of dividing self by denominator. **Args:** * ​denominator (`Self`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. ### `is_power_of_two` `is_power_of_two(self) -> Bool` Check if the integer is a (non-zero) power of two. **Returns:** True if the integer is a power of two, False otherwise. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this integer to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__str__` `__str__(self) -> String` Convert this UInt to a string. A small example.

```mojo
x = UInt(50)
assert_equal(String(x), "50")
```

**Returns:** The string representation of this UInt. ### `__repr__` `__repr__(self) -> String` Convert this UInt to a string. A small example.

```mojo
x = UInt(50)
assert_equal(repr(x), "UInt(50)")
```

**Returns:** The string representation of this UInt. ### `__hash__` `__hash__(self) -> Self` Hash the UInt using builtin hash. **Returns:** A 64-bit hash value. This value is *not* suitable for cryptographic uses. Its intended usage is for data structures. See the `hash` builtin documentation for more details. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with this uint value. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance.
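A compact sketch exercising a few of the operators above; `Int` arguments convert to `UInt` through the implicit constructor:

```mojo
fn main():
    var x = UInt(12)
    var y = UInt(5)
    print(x + y)      # 17
    print(x // y)     # 2
    print(x % y)      # 2
    print(x << 1)     # 24
    print(x & y)      # 4
    print(UInt(16).is_power_of_two())  # True
```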
--- ## uint Implements the UInt class. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`UInt`](/mojo/stdlib/builtin/uint/UInt): This type represents an unsigned integer. --- ## Copyable The Copyable trait denotes a type whose value can be copied. Example implementing the `Copyable` trait on `Foo` which requires the `__copyinit__` method:

```mojo
struct Foo(Copyable):
    var s: String

    @implicit
    fn __init__(out self, s: String):
        self.s = s

    fn __copyinit__(out self, other: Self):
        print("copying value")
        self.s = other.s
```

You can now copy objects inside a generic function:

```mojo
fn copy_return[T: Copyable](foo: T) -> T:
    var copy = foo
    return copy

var foo = Foo("test")
var res = copy_return(foo)
```

```plaintext
copying value
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. --- ## Defaultable The `Defaultable` trait describes a type with a default constructor. Implementing the `Defaultable` trait requires the type to define an `__init__` method with no arguments:

```mojo
struct Foo(Defaultable):
    var s: String

    fn __init__(out self):
        self.s = "default"
```

You can now construct a generic `Defaultable` type:

```mojo
fn default_init[T: Defaultable]() -> T:
    return T()

var foo = default_init[Foo]()
print(foo.s)
```

```plaintext
default
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self: _Self)` Create a default instance of the value. --- ## ExplicitlyCopyable The ExplicitlyCopyable trait denotes a type whose value can be copied explicitly. Unlike `Copyable`, which denotes types that are *implicitly* copyable, an explicitly copyable type can only be copied when the explicit copy initializer is called intentionally by the programmer. An explicit copy initializer is just a normal `__init__` method that takes a `read-only` argument of `Self`. Example implementing the `ExplicitlyCopyable` trait on `Foo`, which requires a `copy(self) -> Self` method:

```mojo
struct Foo(ExplicitlyCopyable):
    var s: String

    @implicit
    fn __init__(out self, s: String):
        self.s = s

    fn copy(self) -> Self:
        print("explicitly copying value")
        return Foo(self.s)
```

You can now copy objects inside a generic function:

```mojo
fn copy_return[T: ExplicitlyCopyable](foo: T) -> T:
    var copy = foo.copy()
    return copy

var foo = Foo("test")
var res = copy_return(foo)
```

```plaintext
explicitly copying value
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `copy` `copy(self: _Self) -> _Self` Explicitly construct a copy of self. **Returns:** A copy of this value. --- ## Movable The Movable trait denotes a type whose value can be moved. Implement the `Movable` trait on `Foo` which requires the `__moveinit__` method:

```mojo
struct Foo(Movable):
    fn __init__(out self):
        pass

    fn __moveinit__(out self, owned existing: Self):
        print("moving")
```

You can now use the ^ suffix to move the object instead of copying it inside generic functions:

```mojo
fn return_foo[T: Movable](owned foo: T) -> T:
    return foo^

var foo = Foo()
var res = return_foo(foo^)
```

```plaintext
moving
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. --- ## value Defines core value traits. These are Mojo built-ins, so you don't need to import them. ## Traits * [​`Copyable`](/mojo/stdlib/builtin/value/Copyable): The Copyable trait denotes a type whose value can be copied. * [​`Defaultable`](/mojo/stdlib/builtin/value/Defaultable): The `Defaultable` trait describes a type with a default constructor. * [​`ExplicitlyCopyable`](/mojo/stdlib/builtin/value/ExplicitlyCopyable): The ExplicitlyCopyable trait denotes a type whose value can be copied explicitly. * [​`Movable`](/mojo/stdlib/builtin/value/Movable): The Movable trait denotes a type whose value can be moved. --- ## VariadicList `@register_passable(trivial)` `struct VariadicList[type: AnyTrivialRegType]` A utility class to access variadic function arguments. Provides a "list" view of the function argument so that the size of the argument list and each individual argument can be accessed. ## Parameters * ​type (`AnyTrivialRegType`): The type of the elements in the list. ## Fields * ​value (`Variadic[type]`): The underlying storage for the variadic list.
## Implemented traits `AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Aliases ### `IterType` `alias IterType = _VariadicListIter[type]` ## Methods ### `__init__` `@implicit` `__init__(*value: type) -> Self` Constructs a VariadicList from a variadic list of arguments. **Args:** * ​\*value (`type`): The variadic argument list to construct the variadic list with. ### `__getitem__` `__getitem__[I: Indexer](self, idx: I) -> type` Gets a single element on the variadic list. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The index of the element to access on the list. **Returns:** The element on the list corresponding to the given index. ### `__len__` `__len__(self) -> Int` Gets the size of the list. **Returns:** The number of elements on the variadic list. ### `__iter__` `__iter__(self) -> _VariadicListIter[type]` Iterate over the list. **Returns:** An iterator to the start of the list. --- ## VariadicListMem `struct VariadicListMem[elt_is_mutable: Bool, //, element_type: AnyType, origin: Origin[elt_is_mutable], is_owned: Bool]` A utility class to access variadic function arguments of memory-only types that may have ownership. It exposes references to the elements in a way that can be enumerated. Each element may be accessed with `elt[]`. ## Parameters * ​elt\_is\_mutable (`Bool`): True if the elements of the list are mutable for a `mut` or `owned` argument. * ​element\_type (`AnyType`): The type of the elements in the list. * ​origin (`Origin[elt_is_mutable]`): The origin of the underlying elements. * ​is\_owned (`Bool`): Whether the elements are owned by the list. ## Fields * ​value (`Variadic[ref [origin._mlir_origin] element_type]`): The underlying storage, a variadic list of references to elements of the given type. ## Implemented traits `AnyType`, `Sized`, `UnknownDestructibility` ## Aliases ### `reference_type` `alias reference_type = Pointer[element_type, origin]` ## Methods ### `__moveinit__` `__moveinit__(out self, owned existing: Self)` Move constructor. **Args:** * ​existing (`Self`): The existing VariadicListMem. ### `__del__` `__del__(owned self)` Destructor that releases elements if owned. ### `__getitem__` `__getitem__(self, idx: Int) -> ref [origin, *[0,0]] element_type` Gets a single element on the variadic list. **Args:** * ​idx (`Int`): The index of the element to access on the list. **Returns:** A reference to the element on the list corresponding to the given index. ### `__len__` `__len__(self) -> Int` Gets the size of the list. **Returns:** The number of elements on the variadic list. ### `__iter__` `__iter__(self, out result: _VariadicListMemIter[element_type, origin, self, is_owned])` Iterate over the list. **Returns:** An iterator to the start of the list. --- ## VariadicPack `@register_passable` `struct VariadicPack[elt_is_mutable: Bool, //, is_owned: Bool, origin: Origin[elt_is_mutable], element_trait: AnyTrait[AnyType], *element_types: element_trait]` A utility class to access variadic pack arguments and provide an API for working with them. ## Parameters * ​elt\_is\_mutable (`Bool`): True if the elements of the list are mutable for a `mut` or `owned` argument pack. * ​is\_owned (`Bool`): Whether the elements are owned by the pack. If so, the pack will release the elements when it is destroyed. * ​origin (`Origin[elt_is_mutable]`): The origin of the underlying elements. * ​element\_trait (`AnyTrait[AnyType]`): The trait that each element of the pack conforms to.
* ​\*element\_types (`element_trait`): The list of types held by the argument pack. ## Implemented traits `AnyType`, `Sized`, `UnknownDestructibility` ## Methods ### `__del__` `__del__(owned self)` Destructor that releases elements if owned. ### `__getitem__` `__getitem__[index: Int](self) -> ref [origin] element_types[index.value]` Return a reference to an element of the pack. **Parameters:** * ​index (`Int`): The element of the pack to return. **Returns:** A reference to the element. The Pointer's mutability follows the mutability of the pack argument convention. ### `__len__` `static __len__() -> Int` Return the VariadicPack length. **Returns:** The number of elements in the variadic pack. `__len__(self) -> Int` Return the VariadicPack length. **Returns:** The number of elements in the variadic pack. --- ## variadics Implements the VariadicList and VariadicPack types. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`VariadicList`](/mojo/stdlib/builtin/variadics/VariadicList): A utility class to access variadic function arguments. Provides a "list" view of the function argument so that the size of the argument list and each individual argument can be accessed. * [​`VariadicListMem`](/mojo/stdlib/builtin/variadics/VariadicListMem): A utility class to access variadic function arguments of memory-only types that may have ownership. It exposes references to the elements in a way that can be enumerated. Each element may be accessed with `elt[]`. * [​`VariadicPack`](/mojo/stdlib/builtin/variadics/VariadicPack): A utility class to access variadic pack arguments and provide an API for working with them.
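A variadic `Int` argument is received as a `VariadicList[Int]` inside the function body; here is a minimal sketch of the standard pattern (the helper name `sum_all` is ours):

```mojo
fn sum_all(*values: Int) -> Int:
    # `values` is a VariadicList[Int]: it is Sized and iterable.
    var total = 0
    for value in values:
        total += value
    return total

fn main():
    print(sum_all(1, 2, 3))  # 6
```

Memory-only element types arrive as `VariadicListMem` (elements accessed through references), and generic packs of mixed types arrive as `VariadicPack`, as documented above.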
--- ## BitSet `struct BitSet[size: UInt]` A grow-only set storing non-negative integers efficiently using bits. Each integer element is represented by a single bit within an array of 64-bit words (`UInt64`). This structure is optimized for: * **Compactness:** Uses 64 times less memory than `List[Bool]`. * **Speed:** Offers O(1) time complexity for `set`, `clear`, `test`, and `toggle` operations (single word load/store). Ideal for applications like data-flow analysis, graph algorithms, or any task requiring dense sets of small integers where memory and lookup speed are critical. ## Parameters * ​size (`UInt`): The maximum number of bits the bitset can store. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self)` Initializes an empty BitSet with zero capacity and size. `__init__(out self: BitSet[UInt(size)], init: SIMD[bool, size])` Initializes a BitSet with the given SIMD vector of booleans. **Args:** * ​init (`SIMD[bool, size]`): A SIMD vector of booleans to initialize the bitset with. ### `__bool__` `__bool__(self) -> Bool` Checks if the bitset is non-empty (contains at least one set bit). Equivalent to `len(self) != 0` or `not self.is_empty()`. **Returns:** True if at least one bit is set, False otherwise. ### `__len__` `__len__(self) -> Int` Counts the total number of bits that are set to 1 in the bitset. Uses the efficient `pop_count` intrinsic for each underlying word. The complexity is proportional to the number of words used by the bitset's capacity (`_words_size`), not the logical size (`len`). **Returns:** The total count of set bits (population count). ### `is_empty` `is_empty(self) -> Bool` Checks if the bitset contains any set bits. Equivalent to `len(self) == 0`. Note that this checks the logical size, not the allocated capacity. **Returns:** True if no bits are set (logical size is 0), False otherwise. ### `set` `set(mut self, idx: UInt)` Sets the bit at the specified index `idx` to 1. If `idx` is greater than or equal to the current logical size, the logical size is updated. Aborts if `idx` is greater than or equal to the compile-time `size`. **Args:** * ​idx (`UInt`): The non-negative index of the bit to set (must be `< size`). ### `clear` `clear(mut self, idx: UInt)` Clears the bit at the specified index `idx` (sets it to 0). Aborts if `idx` is greater than or equal to the compile-time `size`. Does not change the logical size. **Args:** * ​idx (`UInt`): The non-negative index of the bit to clear (must be `< size`). ### `toggle` `toggle(mut self, idx: UInt)` Toggles (inverts) the bit at the specified index `idx`. If the bit becomes 1 and `idx` is greater than or equal to the current logical size, the logical size is updated. Aborts if `idx` is greater than or equal to the compile-time `size`. **Args:** * ​idx (`UInt`): The non-negative index of the bit to toggle (must be `< size`). ### `test` `test(self, idx: UInt) -> Bool` Tests if the bit at the specified index `idx` is set (is 1). Aborts if `idx` is greater than or equal to the compile-time `size`. **Args:** * ​idx (`UInt`): The non-negative index of the bit to test (must be `< size`). **Returns:** True if the bit at `idx` is set, False otherwise. ### `clear_all` `clear_all(mut self)` Clears all bits in the set, resetting the logical size to 0. The allocated storage capacity remains unchanged. Equivalent to re-initializing the set with `Self()`. ### `union` `union(self, other: Self) -> Self` Returns a new bitset that is the union of `self` and `other`. **Args:** * ​other (`Self`): The bitset to union with. **Returns:** A new bitset containing all elements from both sets. ### `intersection` `intersection(self, other: Self) -> Self` Returns a new bitset that is the intersection of `self` and `other`. **Args:** * ​other (`Self`): The bitset to intersect with. **Returns:** A new bitset containing only the elements present in both sets. ### `difference` `difference(self, other: Self) -> Self` Returns a new bitset that is the difference of `self` and `other`. **Args:** * ​other (`Self`): The bitset to subtract from `self`. **Returns:** A new bitset containing elements from `self` that are not in `other`. ### `write_to` `write_to[W: Writer, //](self, mut writer: W)` Writes a string representation of the set bits to the given writer. Outputs the indices of the set bits in ascending order, enclosed in curly braces and separated by commas (e.g., "{1, 5, 42}"). Uses efficient bitwise operations to find set bits without iterating through every possible bit. **Parameters:** * ​W (`Writer`): The type of the writer, conforming to the `Writer` trait. **Args:** * ​writer (`W`): The writer instance to output the representation to. ### `__repr__` `__repr__(self) -> String` Returns a developer-friendly string representation of the bitset. Currently equivalent to `__str__`. **Returns:** A string showing the set bits (e.g., "{1, 5, 42}"). ### `__str__` `__str__(self) -> String` Returns a user-friendly string representation of the bitset. Formats the set bits as a comma-separated list within curly braces, like "{1, 5, 42}". Uses the `write_to` method internally. **Returns:** A string showing the set bits.
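A small sketch of the set-algebra methods; the `from collections.bitset import BitSet` path mirrors this module's location and is an assumption:

```mojo
from collections.bitset import BitSet  # assumed import path

fn main():
    var a = BitSet[64]()
    a.set(1)
    a.set(2)
    var b = BitSet[64]()
    b.set(2)
    b.set(3)
    print(a.union(b))         # {1, 2, 3}
    print(a.intersection(b))  # {2}
    print(a.difference(b))    # {1}
```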
--- ## bitset Provides a compact, grow-only set of non-negative integers. Optimized for space (1 bit per element) and speed (O(1) operations). Offers set/clear/test/toggle and fast population count. The underlying storage grows automatically but does not shrink unless `shrink_to_fit` is called (not implemented yet). Example:

```mojo
var bs = BitSet[128]()  # 128-bit set, all clear
bs.set(42)              # Mark value 42 as present.
if bs.test(42):         # Check membership.
    print("hit")        # Prints "hit".
bs.clear(42)            # Remove 42.
print(len(bs))          # Prints 0 (the population count).
```

## Structs * [​`BitSet`](/mojo/stdlib/collections/bitset/BitSet): A grow-only set storing non-negative integers efficiently using bits. --- ## CountTuple `struct CountTuple[V: Copyable & Movable & Hashable & EqualityComparable]` A tuple representing a value and its count in a Counter. ## Parameters * ​V (`Copyable & Movable & Hashable & EqualityComparable`): The value in the Counter. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, value: V, count: UInt)` Create a new CountTuple. **Args:** * ​value (`V`): The value in the Counter. * ​count (`UInt`): The count of the value in the Counter. ### `__getitem__` `__getitem__(self, idx: Int) -> Variant[V, Int]` Get an element in the tuple. **Args:** * ​idx (`Int`): The element to return. **Returns:** The value if idx is 0 and the count if idx is 1. ### `__lt__` `__lt__(self, other: Self) -> Bool` Compare two CountTuples by count, then by value. **Args:** * ​other (`Self`): The other CountTuple to compare to. **Returns:** True if this CountTuple is less than the other, False otherwise. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compare two CountTuples for equality. **Args:** * ​other (`Self`): The other CountTuple to compare to. **Returns:** True if the two CountTuples are equal, False otherwise. ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value. --- ## Counter `struct Counter[V: Copyable & Movable & Hashable & EqualityComparable]` A container for counting hashable items. The value type must be specified statically, unlike a Python Counter, which can accept arbitrary value types. The value type must implement the `KeyElement` trait, as its values are stored in the dictionary as keys. Usage:

```mojo
from collections import Counter

var c = Counter[String]("a", "a", "a", "b", "b", "c", "d", "c", "c")
print(c["a"])  # prints 3
print(c["b"])  # prints 2
```

## Parameters * ​V (`Copyable & Movable & Hashable & EqualityComparable`): The value type to be counted. Currently must be KeyElement. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Create a new, empty Counter object. `__init__(out self, owned *values: V)` Create a new Counter from a list of values. Usage:

```mojo
from collections import Counter

var c = Counter[String]("a", "a", "a", "b", "b", "c", "d", "c", "c")
print(c["a"])  # prints 3
print(c["b"])  # prints 2
```

**Args:** * ​\*values (`V`): A list of values to count. `@implicit` `__init__(out self, items: List[V, hint_trivial_type])` Create a new Counter from an input iterable. Usage:

```mojo
from collections import Counter

var c = Counter[String](["a", "a", "a", "b", "b", "c", "d", "c", "c"])
print(c["a"])  # prints 3
print(c["b"])  # prints 2
```

**Args:** * ​items (`List[V, hint_trivial_type]`): A list of items to count. ### `__bool__` `__bool__(self) -> Bool` Check if the Counter is empty or not. **Returns:** `False` if the Counter is empty, `True` otherwise.
### `__getitem__` `__getitem__(self, key: V) -> Int` Get the count of a key. **Args:** * ​key (`V`): The key to get the count of. **Returns:** The count of the key. ### `__setitem__` `__setitem__(mut self, value: V, count: Int)` Set the count for a key in the Counter. **Args:** * ​value (`V`): The value to associate with the specified count. * ​count (`Int`): The count to store in the Counter. ### `__neg__` `__neg__(self) -> Self` Subtract from an empty Counter. Strips positive and zero counts, and flips the sign on negative counts. **Returns:** A new Counter with stripped counts and negative counts. ### `__pos__` `__pos__(self) -> Self` Return a shallow copy of the Counter, stripping non-positive counts. **Returns:** A shallow copy of the Counter. ### `__lt__` `__lt__(self, other: Self) -> Bool` Check if all counts are less than in the other Counter. **Args:** * ​other (`Self`): The other Counter to compare to. **Returns:** True if all counts are less than in the other Counter, False otherwise. ### `__le__` `__le__(self, other: Self) -> Bool` Check if all counts are less than or equal to the other Counter. **Args:** * ​other (`Self`): The other Counter to compare to. **Returns:** True if all counts are less than or equal to the other Counter, False otherwise. ### `__eq__` `__eq__(self, other: Self) -> Bool` Check if all counts agree. Missing counts are treated as zero. **Args:** * ​other (`Self`): The other Counter to compare to. **Returns:** True if the two Counters are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Check if all counts disagree. Missing counts are treated as zero. **Args:** * ​other (`Self`): The other Counter to compare to. **Returns:** True if the two Counters are not equal, False otherwise. ### `__gt__` `__gt__(self, other: Self) -> Bool` Check if all counts are greater than in the other Counter. **Args:** * ​other (`Self`): The other Counter to compare to. **Returns:** True if all counts are greater than in the other Counter, False otherwise. ### `__ge__` `__ge__(self, other: Self) -> Bool` Check if all counts are greater than or equal to the other Counter. **Args:** * ​other (`Self`): The other Counter to compare to. **Returns:** True if all counts are greater than or equal to the other Counter, False otherwise. ### `__contains__` `__contains__(self, key: V) -> Bool` Check if a given key is in the Counter or not. **Args:** * ​key (`V`): The key to check. **Returns:** True if the key exists in the Counter, False otherwise. ### `__add__` `__add__(self, other: Self) -> Self` Add counts from two Counters. **Args:** * ​other (`Self`): The other Counter to add to this Counter. **Returns:** A new Counter with the counts from both Counters added together. ### `__sub__` `__sub__(self, other: Self) -> Self` Subtract counts, but keep only results with positive counts. **Args:** * ​other (`Self`): The other Counter to subtract from this Counter. **Returns:** A new Counter with the counts from the other Counter subtracted from this Counter. ### `__and__` `__and__(self, other: Self) -> Self` Intersection: keep common elements with the minimum count. **Args:** * ​other (`Self`): The other Counter to intersect with. **Returns:** A new Counter with the common elements and the minimum count of the two Counters. ### `__or__` `__or__(self, other: Self) -> Self` Union: keep all elements with the maximum count. **Args:** * ​other (`Self`): The other Counter to union with.
**Returns:** A new Counter with all elements and the maximum count of the two Counters. ### `__iadd__` `__iadd__(mut self, other: Self)` Add counts from another Counter to this Counter. **Args:** * ​other (`Self`): The other Counter to add to this Counter. ### `__isub__` `__isub__(mut self, other: Self)` Subtract counts from another Counter from this Counter. **Args:** * ​other (`Self`): The other Counter to subtract from this Counter. ### `__iand__` `__iand__(mut self, other: Self)` Intersection: keep common elements with the minimum count. **Args:** * ​other (`Self`): The other Counter to intersect with. ### `__ior__` `__ior__(mut self, other: Self)` Union: keep all elements with the maximum count. **Args:** * ​other (`Self`): The other Counter to union with. ### `copy` `copy(self) -> Self` Create a new Counter by copying another Counter. **Returns:** A copy of the value. ### `fromkeys` `static fromkeys(keys: List[V, hint_trivial_type], value: Int) -> Self` Create a new Counter from a list of keys and a default value. **Args:** * ​keys (`List[V, hint_trivial_type]`): The keys to create the Counter from. * ​value (`Int`): The default value to associate with each key. **Returns:** A new Counter with the keys and default value. ### `__iter__` `__iter__(self) -> _DictKeyIter[V, Int, self._data]` Iterate over the Counter's keys as immutable references. **Returns:** An iterator of immutable references to the Counter values. ### `__len__` `__len__(self) -> Int` Returns the number of elements currently stored in the Counter. **Returns:** The number of elements in the Counter. ### `get` `get(self, value: V) -> Optional[Int]` Get a value from the counter. **Args:** * ​value (`V`): The value to search for in the Counter. **Returns:** An optional value containing a copy of the value if it was present, otherwise an empty Optional. `get(self, value: V, default: Int) -> Int` Get a value from the Counter. **Args:** * ​value (`V`): The value to search for in the counter. * ​default (`Int`): Default count to return. **Returns:** A copy of the value if it was present, otherwise default. ### `pop` `pop(mut self, value: V) -> Int` Remove a value from the Counter by value. **Args:** * ​value (`V`): The value to remove from the Counter. **Returns:** The value associated with the key, if it was in the Counter. **Raises:** "KeyError" if the key was not present in the Counter. `pop(mut self, value: V, owned default: Int) -> Int` Remove a value from the Counter by value. **Args:** * ​value (`V`): The value to remove from the Counter. * ​default (`Int`): Optionally provide a default value to return if the value was not found instead of raising. **Returns:** The value associated with the key, if it was in the Counter. If it wasn't, return the provided default value instead. **Raises:** "KeyError" if the key was not present in the Counter and no default value was provided. ### `keys` `keys(ref self) -> _DictKeyIter[V, Int, self_is_origin._data]` Iterate over the Counter's keys as immutable references. **Returns:** An iterator of immutable references to the Counter keys. ### `values` `values(ref self) -> _DictValueIter[V, Int, self_is_origin._data]` Iterate over the Counter's values as references. **Returns:** An iterator of references to the Counter values. ### `items` `items(self) -> _DictEntryIter[V, Int, self._data]` Iterate over the Counter's entries as immutable references. **Returns:** An iterator of immutable references to the Counter entries.
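Putting the lookup and removal methods together, a small hedged sketch (it assumes string literals convert to `String` implicitly, as in the usage examples above):

```mojo
from collections import Counter

fn main() raises:
    var c = Counter[String]("a", "a", "b")
    print(c["a"])         # 2
    print(c.get("z", 0))  # 0 -- default when the key is absent
    print(c.pop("b"))     # 1 -- removes "b" and returns its count; raises if absent
    print(len(c))         # 1
```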
### `clear` `clear(mut self)` Remove all elements from the Counter. ### `popitem` `popitem(mut self) -> CountTuple[V]` Remove and return an arbitrary (key, value) pair from the Counter. **Returns:** A CountTuple containing the key and value of the removed item. **Raises:** "KeyError" if the Counter is empty. ### `total` `total(self) -> UInt` Return the total of all counts in the Counter. **Returns:** The total of all counts in the Counter. ### `most_common` `most_common(self, n: UInt) -> List[CountTuple[V]]` Return a list of the `n` most common elements and their counts from the most common to the least. **Args:** * ​n (`UInt`): The number of most common elements to return. **Returns:** A list of the n most common elements and their counts. ### `elements` `elements(self) -> List[V]` Return an iterator over elements repeating each as many times as its count. **Returns:** An iterator over the elements in the Counter. ### `update` `update(mut self, other: Self)` Update the Counter, like `dict.update()` but add counts instead of replacing them. **Args:** * ​other (`Self`): The Counter to update this Counter with. ### `subtract` `subtract(mut self, other: Self)` Subtract counts. Both inputs and outputs may be zero or negative. **Args:** * ​other (`Self`): The Counter to subtract from this Counter. --- ## counter Defines the `Counter` type. You can import these APIs from the `collections` package. For example:

```mojo
from collections import Counter
```

## Structs * [​`Counter`](/mojo/stdlib/collections/counter/Counter): A container for counting hashable items. * [​`CountTuple`](/mojo/stdlib/collections/counter/CountTuple): A tuple representing a value and its count in a Counter. --- ## Deque `struct Deque[ElementType: Copyable & Movable]` Implements a double-ended queue. It supports pushing and popping from both ends in O(1) time, resizing the underlying storage as needed. ## Parameters * ​ElementType (`Copyable & Movable`): The type of the elements in the deque. Must implement the traits `Copyable` and `Movable`. ## Implemented traits `AnyType`, `Boolable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Aliases ### `default_capacity` `alias default_capacity = 64` The default capacity of the deque: must be a power of 2. ## Methods ### `__init__` `__init__(out self, *, owned elements: Optional[List[ElementType]] = Optional(None), capacity: Int = 64, min_capacity: Int = 64, maxlen: Int = -1, shrink: Bool = True)` Constructs a deque. **Args:** * ​elements (`Optional[List[ElementType]]`): The optional list of initial deque elements. * ​capacity (`Int`): The initial capacity of the deque. * ​min\_capacity (`Int`): The minimum allowed capacity of the deque when shrinking. * ​maxlen (`Int`): The maximum allowed capacity of the deque when growing. * ​shrink (`Bool`): Whether storage should be deallocated when not needed. `__init__(out self, owned *values: ElementType, *, __list_literal__: Tuple[] = Tuple())` Constructs a deque from the given values. **Args:** * ​\*values (`ElementType`): The values to populate the deque with. * ​**list\_literal** (`Tuple[]`): Tell Mojo to use this method for list literals. `__init__(out self, *, owned elements: VariadicListMem[ElementType, origin, is_owned])` Constructs a deque from the given values. **Args:** * ​elements (`VariadicListMem[ElementType, origin, is_owned]`): The values to populate the deque with.
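A minimal construction-and-access sketch using only the methods documented in this section:

```mojo
from collections import Deque

fn main():
    var dq = Deque[Int](2, 3)
    dq.append(4)      # push right: [2, 3, 4]
    dq.appendleft(1)  # push left:  [1, 2, 3, 4]
    print(len(dq))    # 4
    print(dq[0])      # 1
```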
**Args:** * ​existing (`Self`): The existing deque. ### `__del__` `__del__(owned self)` Destroys all elements in the deque and frees its memory. ### `__bool__` `__bool__(self) -> Bool` Checks whether the deque has any elements or not. **Returns:** `False` if the deque is empty, `True` if there is at least one element. ### `__getitem__` `__getitem__(ref self, idx: Int) -> ref [self] ElementType` Gets the deque element at the given index. **Args:** * ​idx (`Int`): The index of the element. **Returns:** A reference to the element at the given index. ### `__eq__` `__eq__[T: EqualityComparable & Copyable & Movable, //](self: Deque[T], other: Deque[T]) -> Bool` Checks if two deques are equal. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `EqualityComparable`. **Args:** * ​other (`Deque[T]`): The deque to compare with. **Returns:** `True` if the deques are equal, `False` otherwise. ### `__ne__` `__ne__[T: EqualityComparable & Copyable & Movable, //](self: Deque[T], other: Deque[T]) -> Bool` Checks if two deques are not equal. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `EqualityComparable`. **Args:** * ​other (`Deque[T]`): The deque to compare with. **Returns:** `True` if the deques are not equal, `False` otherwise. ### `__contains__` `__contains__[T: EqualityComparable & Copyable & Movable, //](self: Deque[T], value: T) -> Bool` Verify if a given value is present in the deque. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `EqualityComparable`. **Args:** * ​value (`T`): The value to find. **Returns:** True if the value is contained in the deque, False otherwise. ### `__add__` `__add__(self, other: Self) -> Self` Concatenates self with other and returns the result as a new deque. **Args:** * ​other (`Self`): Deque whose elements will be appended to the elements of self. **Returns:** The newly created deque with the properties of `self`. ### `__mul__` `__mul__(self, n: Int) -> Self` Concatenates `n` deques of `self` and returns a new deque. **Args:** * ​n (`Int`): The multiplier number. **Returns:** The new deque. ### `__iadd__` `__iadd__(mut self, other: Self)` Appends the elements of the other deque into self. **Args:** * ​other (`Self`): Deque whose elements will be appended to self. ### `__imul__` `__imul__(mut self, n: Int)` Concatenates self `n` times in place. **Args:** * ​n (`Int`): The multiplier number. ### `copy` `copy(self) -> Self` Creates a deep copy of the given deque. **Returns:** A copy of the value. ### `__iter__` `__iter__(ref self) -> _DequeIter[ElementType, self_is_origin]` Iterates over elements of the deque, returning the references. **Returns:** An iterator of the references to the deque elements. ### `__reversed__` `__reversed__(ref self) -> _DequeIter[ElementType, self_is_origin, False]` Iterate backwards over the deque, returning the references. **Returns:** A reversed iterator of the references to the deque elements. ### `__len__` `__len__(self) -> Int` Gets the number of elements in the deque. **Returns:** The number of elements in the deque. ### `write_to` `write_to[T: Representable & Copyable & Movable, WriterType: Writer](self: Deque[T], mut writer: WriterType)` Writes `my_deque.__str__()` to a `Writer`. **Parameters:** * ​T (`Representable & Copyable & Movable`): The type of the Deque elements.
Must implement the trait `Representable`. * ​WriterType (`Writer`): A type conforming to the `Writer` trait. **Args:** * ​writer (`WriterType`): The object to write to. ### `__str__` `__str__[T: Representable & Copyable & Movable, //](self: Deque[T]) -> String` Returns a string representation of a `Deque`. Note that since we can't condition methods on a trait yet, the way to call this method is a bit special. Here is an example below: ```mojo my_deque = Deque[Int](1, 2, 3) print(my_deque.__str__()) ``` When the compiler supports conditional methods, then a simple `String(my_deque)` will be enough. **Parameters:** * ​T (`Representable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `Representable`. **Returns:** A string representation of the deque. ### `__repr__` `__repr__[T: Representable & Copyable & Movable, //](self: Deque[T]) -> String` Returns a string representation of a `Deque`. Note that since we can't condition methods on a trait yet, the way to call this method is a bit special. Here is an example below: ```mojo my_deque = Deque[Int](1, 2, 3) print(my_deque.__repr__()) ``` When the compiler supports conditional methods, then a simple `repr(my_deque)` will be enough. **Parameters:** * ​T (`Representable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `Representable`. **Returns:** A string representation of the deque. ### `append` `append(mut self, owned value: ElementType)` Appends a value to the right side of the deque. **Args:** * ​value (`ElementType`): The value to append. ### `appendleft` `appendleft(mut self, owned value: ElementType)` Appends a value to the left side of the deque. **Args:** * ​value (`ElementType`): The value to append. ### `clear` `clear(mut self)` Removes all elements from the deque, leaving it with length 0. Resets the underlying storage capacity to `_min_capacity`. ### `count` `count[T: EqualityComparable & Copyable & Movable, //](self: Deque[T], value: T) -> Int` Counts the number of occurrences of a `value` in the deque. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `EqualityComparable`. **Args:** * ​value (`T`): The value to count. **Returns:** The number of occurrences of the value in the deque. ### `extend` `extend(mut self, owned values: List[ElementType])` Extends the right side of the deque by consuming elements of the list argument. **Args:** * ​values (`List[ElementType]`): List whose elements will be added at the right side of the deque. ### `extendleft` `extendleft(mut self, owned values: List[ElementType])` Extends the left side of the deque by consuming elements from the list argument. Acts as a series of left appends, resulting in reversed order of the elements in the list argument. **Args:** * ​values (`List[ElementType]`): List whose elements will be added at the left side of the deque. ### `index` `index[T: EqualityComparable & Copyable & Movable, //](self: Deque[T], value: T, start: Int = 0, stop: Optional[Int] = Optional(None)) -> Int` Returns the index of the first occurrence of a `value` in a deque, restricted to the range given by the `start` and `stop` bounds. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the `EqualityComparable` trait. **Args:** * ​value (`T`): The value to search for. * ​start (`Int`): The starting index of the search, treated as a slice index (defaults to 0).
* ​stop (`Optional[Int]`): The ending index of the search, treated as a slice index (defaults to None, which means the end of the deque). **Returns:** The index of the first occurrence of the value in the deque. **Raises:** ValueError: If the value is not found in the deque. ### `insert` `insert(mut self, idx: Int, owned value: ElementType)` Inserts the `value` into the deque at position `idx`. **Args:** * ​idx (`Int`): The position to insert the value into. * ​value (`ElementType`): The value to insert. **Raises:** IndexError: If the deque is already at its maximum size. ### `remove` `remove[T: EqualityComparable & Copyable & Movable, //](mut self: Deque[T], value: T)` Removes the first occurrence of the `value`. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the `EqualityComparable` trait. **Args:** * ​value (`T`): The value to remove. **Raises:** ValueError: If the value is not found in the deque. ### `peek` `peek(self) -> ElementType` Inspect the last (rightmost) element of the deque without removing it. **Returns:** The last (rightmost) element of the deque. **Raises:** IndexError: If the deque is empty. ### `peekleft` `peekleft(self) -> ElementType` Inspect the first (leftmost) element of the deque without removing it. **Returns:** The first (leftmost) element of the deque. **Raises:** IndexError: If the deque is empty. ### `pop` `pop(mut self) -> ElementType` Removes and returns the element from the right side of the deque. **Returns:** The popped value. **Raises:** IndexError: If the deque is empty. ### `popleft` `popleft(mut self) -> ElementType` Removes and returns the element from the left side of the deque. **Returns:** The popped value. **Raises:** IndexError: If the deque is empty. ### `reverse` `reverse(mut self)` Reverses the elements of the deque in-place. ### `rotate` `rotate(mut self, n: Int = 1)` Rotates the deque by `n` steps. If `n` is positive, rotates to the right. If `n` is negative, rotates to the left. **Args:** * ​n (`Int`): Number of steps to rotate the deque (defaults to 1). --- ## deque Defines the Deque type. You can import these APIs from the `collections` package. Examples: ```mojo from collections import Deque ``` ## Structs * [​`Deque`](/mojo/stdlib/collections/deque/Deque): Implements a double-ended queue. --- ## Dict `struct Dict[K: Copyable & Movable & Hashable & EqualityComparable, V: Copyable & Movable]` A container that stores key-value pairs. The key type and value type must be specified statically, unlike a Python dictionary, which can accept arbitrary key and value types. The key type must implement the `KeyElement` trait, which encompasses `Movable`, `Hashable`, and `EqualityComparable`. It also includes `Copyable` until we have references. The value type must implement the `Copyable` and `Movable` traits. Examples: ```mojo var d = Dict[String, Int]() d["a"] = 1 d["b"] = 2 print(len(d)) # prints 2 print(d["a"]) # prints 1 print(d.pop("b")) # prints 2 print(len(d)) # prints 1 ``` For more information on the Mojo `Dict` type, see the [Mojo `Dict` manual](/mojo/manual/types/#dict). To learn more about using Python dictionaries from Mojo, see [Python types in Mojo](/mojo/manual/python/types/#python-types-in-mojo). ## Parameters * ​K (`Copyable & Movable & Hashable & EqualityComparable`): The type of the dictionary key. Must be `Hashable` and `EqualityComparable` so we can find the key in the map. * ​V (`Copyable & Movable`): The value type of the dictionary.
Currently must be Copyable & Movable. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Aliases ### `EMPTY` `alias EMPTY = -1` ### `REMOVED` `alias REMOVED = -2` ## Methods ### `__init__` `__init__(out self)` Initialize an empty dictionary. `__init__(out self, *, power_of_two_initial_capacity: Int)` Initialize an empty dictionary with a pre-reserved initial capacity. Examples: ```mojo var x = Dict[Int, Int](power_of_two_initial_capacity = 1024) # Insert (2/3 of 1024) entries without reallocation. ``` **Args:** * ​power\_of\_two\_initial\_capacity (`Int`): At least 8, has to be a power of two. `__init__(out self, owned keys: List[K], owned values: List[V], __dict_literal__: Tuple[])` Constructs a dictionary from the given keys and values. **Args:** * ​keys (`List[K]`): The list of keys to build the dictionary with. * ​values (`List[V]`): The corresponding values to pair with the keys. * ​**dict\_literal** (`Tuple[]`): Tell Mojo to use this method for dict literals. ### `__copyinit__` `__copyinit__(out self, existing: Self)` Copy an existing dictionary. **Args:** * ​existing (`Self`): The existing dict. ### `__bool__` `__bool__(self) -> Bool` Check if the dictionary is empty or not. **Returns:** `False` if the dictionary is empty, `True` if there is at least one element. ### `__getitem__` `__getitem__(self, key: K) -> ref [*[0,0]._entries._value.value] V` Retrieve a value out of the dictionary. **Args:** * ​key (`K`): The key to retrieve. **Returns:** The value associated with the key, if it's present. **Raises:** "KeyError" if the key isn't present. ### `__setitem__` `__setitem__(mut self, owned key: K, owned value: V)` Set a value in the dictionary by key. **Args:** * ​key (`K`): The key to associate with the specified value. * ​value (`V`): The data to store in the dictionary. ### `__contains__` `__contains__(self, key: K) -> Bool` Check if a given key is in the dictionary or not. **Args:** * ​key (`K`): The key to check. **Returns:** True if the key exists in the dictionary, False otherwise. ### `__or__` `__or__(self, other: Self) -> Self` Merge self with other and return the result as a new dict. **Args:** * ​other (`Self`): The dictionary to merge with. **Returns:** The result of the merge. ### `__ior__` `__ior__(mut self, other: Self)` Merge self with other in place. **Args:** * ​other (`Self`): The dictionary to merge with. ### `copy` `copy(self) -> Self` Copy an existing dictionary. **Returns:** A copy of the value. ### `fromkeys` `static fromkeys(keys: List[K, hint_trivial_type], value: V) -> Self` Create a new dictionary with keys from list and values set to value. **Args:** * ​keys (`List[K, hint_trivial_type]`): The keys to set. * ​value (`V`): The value to set. **Returns:** The new dictionary. `static fromkeys(keys: List[K, hint_trivial_type], value: Optional[V] = Optional(None)) -> Dict[K, Optional[V]]` Create a new dictionary with keys from list and values set to value. **Args:** * ​keys (`List[K, hint_trivial_type]`): The keys to set. * ​value (`Optional[V]`): The value to set. **Returns:** The new dictionary. ### `__iter__` `__iter__(ref self) -> _DictKeyIter[K, V, self_is_origin]` Iterate over the dict's keys as immutable references. **Returns:** An iterator of immutable references to the dictionary keys. ### `__reversed__` `__reversed__(ref self) -> _DictKeyIter[K, V, self_is_origin, False]` Iterate backwards over the dict keys, returning immutable references.
**Returns:** A reversed iterator of immutable references to the dict keys. ### `__len__` `__len__(self) -> Int` The number of elements currently stored in the dictionary. **Returns:** The number of elements currently stored in the dictionary. ### `__str__` `__str__[T: Copyable & Movable & Hashable & EqualityComparable & Representable, U: Copyable & Movable & Representable, //](self: Dict[T, U]) -> String` Returns a string representation of a `Dict`. Notes: Since we can't condition methods on a trait yet, the way to call this method is a bit special. Here is an example below: ```mojo var my_dict = Dict[Int, Float64]() my_dict[1] = 1.1 my_dict[2] = 2.2 dict_as_string = my_dict.__str__() print(dict_as_string) # prints "{1: 1.1, 2: 2.2}" ``` When the compiler supports conditional methods, then a simple `String(my_dict)` will be enough. **Parameters:** * ​T (`Copyable & Movable & Hashable & EqualityComparable & Representable`): The type of the keys in the Dict. Must implement the traits `Representable` and `KeyElement`. * ​U (`Copyable & Movable & Representable`): The type of the values in the Dict. Must implement the traits `Representable`, `Copyable` and `Movable`. **Returns:** A string representation of the Dict. ### `find` `find(self, key: K) -> Optional[V]` Find a value in the dictionary by key. **Args:** * ​key (`K`): The key to search for in the dictionary. **Returns:** An optional value containing a copy of the value if it was present, otherwise an empty Optional. ### `get` `get(self, key: K) -> Optional[V]` Get a value from the dictionary by key. **Args:** * ​key (`K`): The key to search for in the dictionary. **Returns:** An optional value containing a copy of the value if it was present, otherwise an empty Optional. `get(self, key: K, default: V) -> V` Get a value from the dictionary by key. **Args:** * ​key (`K`): The key to search for in the dictionary. * ​default (`V`): Default value to return. **Returns:** A copy of the value if it was present, otherwise default. ### `pop` `pop(mut self, key: K, owned default: V) -> V` Remove a value from the dictionary by key. **Args:** * ​key (`K`): The key to remove from the dictionary. * ​default (`V`): A default value to return if the key was not found instead of raising. **Returns:** The value associated with the key, if it was in the dictionary. If it wasn't, return the provided default value instead. `pop(mut self, key: K) -> V` Remove a value from the dictionary by key. **Args:** * ​key (`K`): The key to remove from the dictionary. **Returns:** The value associated with the key, if it was in the dictionary. Raises otherwise. **Raises:** "KeyError" if the key was not present in the dictionary. ### `popitem` `popitem(mut self) -> DictEntry[K, V]` Remove and return a (key, value) pair from the dictionary. Notes: Pairs are returned in LIFO order. popitem() is useful to destructively iterate over a dictionary, as often used in set algorithms. If the dictionary is empty, calling popitem() raises a KeyError. **Returns:** Last dictionary item **Raises:** "KeyError" if the dictionary is empty. ### `keys` `keys(ref self) -> _DictKeyIter[K, V, self_is_origin]` Iterate over the dict's keys as immutable references. **Returns:** An iterator of immutable references to the dictionary keys. ### `values` `values(ref self) -> _DictValueIter[K, V, self_is_origin]` Iterate over the dict's values as references. **Returns:** An iterator of references to the dictionary values. 
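To make the lookup variants above concrete, here is a short sketch of `find` versus the raising `__getitem__`, plus LIFO `popitem` (assuming `Dict` is importable from the `collections` package, consistent with the examples in this section):

```mojo
from collections import Dict

def main():
    var d = Dict[String, Int]()
    d["a"] = 1
    d["b"] = 2
    # `find` returns an Optional copy instead of raising on a missing key.
    print(d.find("c").or_else(0))  # 0
    # `popitem` removes entries in LIFO order (most recently inserted first).
    var last = d.popitem()
    print(last.key, last.value)  # b 2
```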
### `items` `items(ref self) -> _DictEntryIter[K, V, self_is_origin]` Iterate over the dict's entries as immutable references. Examples: ```mojo var my_dict = Dict[String, Int]() my_dict["a"] = 1 my_dict["b"] = 2 for e in my_dict.items(): print(e.key, e.value) ``` Notes: These can't yet be unpacked like Python dict items, but you can access the key and value as attributes. **Returns:** An iterator of immutable references to the dictionary entries. ### `update` `update(mut self, other: Self, /)` Update the dictionary with the key/value pairs from other, overwriting existing keys. Notes: The argument must be positional only. **Args:** * ​other (`Self`): The dictionary to update from. ### `clear` `clear(mut self)` Remove all elements from the dictionary. ### `setdefault` `setdefault(mut self, key: K, owned default: V) -> ref [*[0,0]._entries._value.value] V` Get a value from the dictionary by key, or set it to a default if it doesn't exist. **Args:** * ​key (`K`): The key to search for in the dictionary. * ​default (`V`): The default value to set if the key is not present. **Returns:** The value associated with the key, or the default value if it wasn't present. --- ## DictEntry `struct DictEntry[K: Copyable & Movable & Hashable & EqualityComparable, V: Copyable & Movable]` Store a key-value pair entry inside a dictionary. ## Parameters * ​K (`Copyable & Movable & Hashable & EqualityComparable`): The key type of the dict. Must be Hashable+EqualityComparable. * ​V (`Copyable & Movable`): The value type of the dict. ## Fields * ​hash (`SIMD[uint64, 1]`): `key.__hash__()`, stored so hashing isn't re-computed during dict lookup. * ​key (`K`): The unique key for the entry. * ​value (`V`): The value associated with the key. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, owned key: K, owned value: V)` Create an entry from a key and value, computing the hash. **Args:** * ​key (`K`): The key of the entry. * ​value (`V`): The value of the entry. ### `copy` `copy(self) -> Self` Copy an existing entry. **Returns:** A copy of the value. ### `reap_value` `reap_value(owned self, out result: V)` Take the value from an owned entry. **Returns:** The value of the entry. --- ## OwnedKwargsDict `struct OwnedKwargsDict[V: Copyable & Movable]` Container used to pass owned variadic keyword arguments to functions. This type mimics the interface of a dictionary with `String` keys, and should be usable more-or-less like a dictionary. Notably, however, this type should not be instantiated directly by users. ## Parameters * ​V (`Copyable & Movable`): The value type of the dictionary. Currently must be Copyable & Movable. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Aliases ### `key_type` `alias key_type = String` ## Methods ### `__init__` `__init__(out self)` Initialize an empty keyword dictionary. ### `__copyinit__` `__copyinit__(out self, existing: Self)` Copy an existing keyword dictionary. **Args:** * ​existing (`Self`): The existing keyword dictionary. ### `__moveinit__` `__moveinit__(out self, owned existing: Self)` Move data of an existing keyword dictionary into a new one. **Args:** * ​existing (`Self`): The existing keyword dictionary. ### `__getitem__` `__getitem__(self, key: String) -> V` Retrieve a value out of the keyword dictionary. **Args:** * ​key (`String`): The key to retrieve. 
**Returns:** The value associated with the key, if it's present. **Raises:** "KeyError" if the key isn't present. ### `__setitem__` `__setitem__(mut self, key: String, value: V)` Set a value in the keyword dictionary by key. **Args:** * ​key (`String`): The key to associate with the specified value. * ​value (`V`): The data to store in the dictionary. ### `__contains__` `__contains__(self, key: String) -> Bool` Check if a given key is in the keyword dictionary or not. **Args:** * ​key (`String`): The key to check. **Returns:** True if the key exists in the keyword dictionary, False otherwise. ### `copy` `copy(self) -> Self` Copy an existing keyword dictionary. **Returns:** A copy of the value. ### `__len__` `__len__(self) -> Int` The number of elements currently stored in the keyword dictionary. **Returns:** The number of elements currently stored in the keyword dictionary. ### `find` `find(self, key: String) -> Optional[V]` Find a value in the keyword dictionary by key. **Args:** * ​key (`String`): The key to search for in the dictionary. **Returns:** An optional value containing a copy of the value if it was present, otherwise an empty Optional. ### `pop` `pop(mut self, key: String, owned default: V) -> V` Remove a value from the dictionary by key. **Args:** * ​key (`String`): The key to remove from the dictionary. * ​default (`V`): A default value to return if the key was not found instead of raising. **Returns:** The value associated with the key, if it was in the dictionary. If it wasn't, return the provided default value instead. `pop(mut self, key: String) -> V` Remove a value from the dictionary by key. **Args:** * ​key (`String`): The key to remove from the dictionary. **Returns:** The value associated with the key, if it was in the dictionary. Raises otherwise. **Raises:** "KeyError" if the key was not present in the dictionary. ### `__iter__` `__iter__(ref self) -> _DictKeyIter[String, V, self_is_origin._dict]` Iterate over the keyword dict's keys as immutable references. **Returns:** An iterator of immutable references to the dictionary keys. ### `keys` `keys(ref self) -> _DictKeyIter[String, V, self_is_origin._dict]` Iterate over the keyword dict's keys as immutable references. **Returns:** An iterator of immutable references to the dictionary keys. ### `values` `values(ref self) -> _DictValueIter[String, V, self_is_origin._dict]` Iterate over the keyword dict's values as references. **Returns:** An iterator of references to the dictionary values. ### `items` `items(ref self) -> _DictEntryIter[String, V, self_is_origin._dict]` Iterate over the keyword dictionary's entries as immutable references. Examples: ```mojo var my_dict = Dict[String, Int]() my_dict["a"] = 1 my_dict["b"] = 2 for e in my_dict.items(): print(e.key, e.value) ``` Notes: These can't yet be unpacked like Python dict items, but you can access the key and value as attributes. **Returns:** An iterator of immutable references to the dictionary entries. --- ## dict Defines `Dict`, a collection that stores key-value pairs. Dict provides an efficient, O(1) amortized average-time complexity for insert, lookup, and removal of dictionary elements. Its implementation closely mirrors Python's `dict` implementation: * Performance and size are heavily optimized for small dictionaries, but can scale to large dictionaries. * Insertion order is implicitly preserved. Iteration over keys, values, and items has a deterministic order based on insertion.
* For more information on the Mojo `Dict` type, see the [Mojo `Dict` manual](/mojo/manual/types/#dict). To learn more about using Python dictionaries from Mojo, see [Python types in Mojo](/mojo/manual/python/types/#python-types-in-mojo). Key elements must implement the `KeyElement` trait, which encompasses Movable, Hashable, and EqualityComparable. It also includes Copyable and Movable until we push references through the standard library types. Value elements must be CollectionElements for a similar reason. Both key and value types must always be Movable so we can resize the dictionary as it grows. See the `Dict` docs for more details. ## Aliases ### `KeyElement` `alias KeyElement = Copyable & Movable & Hashable & EqualityComparable` A trait composition for types which implement all requirements of dictionary keys. Dict keys must minimally be Copyable, Movable, Hashable, and EqualityComparable for a hash map. Until we have references they must also be copyable. ## Structs * [​`Dict`](/mojo/stdlib/collections/dict/Dict): A container that stores key-value pairs. * [​`DictEntry`](/mojo/stdlib/collections/dict/DictEntry): Store a key-value pair entry inside a dictionary. * [​`OwnedKwargsDict`](/mojo/stdlib/collections/dict/OwnedKwargsDict): Container used to pass owned variadic keyword arguments to functions. --- ## collections Implements the collections package. ## Packages * [​`string`](/mojo/stdlib/collections/string/): The string package provides comprehensive Unicode string handling functionality for Mojo. ## Modules * [​`bitset`](/mojo/stdlib/collections/bitset/): Provides a compact, grow-only set of non-negative integers. * [​`counter`](/mojo/stdlib/collections/counter/): Defines the `Counter` type. * [​`deque`](/mojo/stdlib/collections/deque/): Defines the Deque type. * [​`dict`](/mojo/stdlib/collections/dict/): Defines `Dict`, a collection that stores key-value pairs. * [​`inline_array`](/mojo/stdlib/collections/inline_array/): Provides a fixed-size array implementation with compile-time size checking. * [​`interval`](/mojo/stdlib/collections/interval/): A self-balancing interval tree is a specialized binary search tree designed to efficiently store and query intervals. * [​`linked_list`](/mojo/stdlib/collections/linked_list/): * [​`list`](/mojo/stdlib/collections/list/): Defines the List type. * [​`optional`](/mojo/stdlib/collections/optional/): Defines Optional, a type modeling a value which may or may not be present. * [​`set`](/mojo/stdlib/collections/set/): Implements the Set datatype. --- ## InlineArray `struct InlineArray[ElementType: Copyable & Movable, size: Int, *, run_destructors: Bool = False]` A fixed-size sequence of homogeneous elements where size is a constant expression. InlineArray provides a fixed-size array implementation with compile-time size checking. The array size is determined at compile time and cannot be changed. Elements must implement the `Copyable` and `Movable` traits. Examples: ```mojo # Create array of 3 integers var arr = InlineArray[Int, 3](1, 2, 3) # Create array filled with value var filled = InlineArray[Int, 5](fill=42) # Access elements print(arr[0]) # Prints 1 ``` ## Parameters * ​ElementType (`Copyable & Movable`): The type of the elements in the array. Must implement `Copyable` and `Movable`. * ​size (`Int`): The size of the array. Must be a positive integer constant. * ​run\_destructors (`Bool`): Whether to run destructors on the elements. Defaults to `False` for backwards compatibility. Will default to `True` in the future. 
## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Aliases ### `type` `alias type = array<size, :trait ElementType>` ## Methods ### `__init__` `__init__(out self)` This constructor will always cause a compile time error if used. It is used to steer users away from uninitialized memory. `__init__(out self, *, uninitialized: Bool)` Create an InlineArray with uninitialized memory. Examples: ```mojo var uninitialized_array = InlineArray[Int, 10](uninitialized=True) ``` Notes: This constructor is unsafe and should be used with caution. The array elements will be uninitialized and accessing them before initialization is undefined behavior. **Args:** * ​uninitialized (`Bool`): A boolean to indicate if the array should be initialized. Always set to `True` (it's not actually used inside the constructor). `__init__(out self, *, owned unsafe_assume_initialized: InlineArray[UnsafeMaybeUninitialized[ElementType], size])` Constructs an `InlineArray` from an `InlineArray` of `UnsafeMaybeUninitialized`. Warning: This is an unsafe constructor. Only use it if you are certain all elements are properly initialized. Notes: This constructor assumes all elements in the input array are initialized. Using uninitialized elements results in undefined behavior, even for types that are valid for any bit pattern (e.g. `Int` or `Float`). **Args:** * ​unsafe\_assume\_initialized (`InlineArray[UnsafeMaybeUninitialized[ElementType], size]`): The array of `UnsafeMaybeUninitialized` elements. All elements must be initialized. `@implicit` `__init__[batch_size: Int = 64](out self, fill: ElementType)` Constructs an array where each element is initialized to the supplied value. Examples: ```mojo var filled = InlineArray[Int, 5](fill=42) # [42, 42, 42, 42, 42] # For large arrays, consider adjusting batch_size to balance # compile time and runtime performance: var large = InlineArray[Int, 10000].__init__[batch_size=32](fill=0) ``` Notes: * Full unrolling with large arrays (>2k elements) can cause significant compiler slowdowns. * Using batch\_size=64 balances AVX512 efficiency and instruction cache usage. * For very large arrays, using smaller batch sizes (e.g., 32 or 16) can further improve compilation speed while still maintaining good runtime performance. **Parameters:** * ​batch\_size (`Int`): The number of elements to unroll for filling the array. Default is 64, which optimizes for AVX512 operations on modern CPUs. For large arrays (>2k elements), this batched approach significantly improves compile times compared to full unrolling while maintaining good runtime performance. **Args:** * ​fill (`ElementType`): The element value to fill each index with. `@implicit` `__init__(out self, owned *elems: ElementType, *, __list_literal__: Tuple[] = Tuple())` Constructs an array from a variadic list of elements. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) # [1, 2, 3] ``` **Args:** * ​\*elems (`ElementType`): The elements to initialize the array with. Must match the array size. * ​**list\_literal** (`Tuple[]`): Specifies that this constructor can be used for list literals. `__init__(out self, *, owned storage: VariadicListMem[ElementType, origin, is_owned])` Construct an array from a low-level internal representation. **Args:** * ​storage (`VariadicListMem[ElementType, origin, is_owned]`): The variadic list storage to construct from. Must match array size.
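As a worked example of the `uninitialized=True` constructor above, the sketch below writes every slot before any read. This is only safe as written for a trivial element type like `Int`; for non-trivial types, prefer the `UnsafeMaybeUninitialized` constructor:

```mojo
def main():
    # Allocate storage without initializing it: reading any element
    # before writing it is undefined behavior.
    var arr = InlineArray[Int, 4](uninitialized=True)
    for i in range(4):
        arr[i] = i * i  # initialize every slot before first use
    print(arr[3])  # 9
```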
### `__copyinit__` `__copyinit__(out self, other: Self)` Copy constructs the array from another array. Notes: Creates a deep copy by copying each element individually. **Args:** * ​other (`Self`): The array to copy from. ### `__del__` `__del__(owned self)` Deallocates the array and destroys its elements. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) # arr's destructor is called automatically when it goes out of scope ``` Notes: This destructor is called automatically when the array goes out of scope. If the array's `run_destructors` parameter is `True`, it will call the destructor on each element in the array before deallocating the array's memory. ### `__getitem__` `__getitem__[I: Indexer](ref self, idx: I) -> ref [self] ElementType` Gets a reference to the element at the given index. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) print(arr[0]) # Prints 1 - first element print(arr[1]) # Prints 2 - second element print(arr[-1]) # Prints 3 - last element print(arr[-2]) # Prints 2 - second to last element ``` Notes: This method provides array-style indexing access to elements in the InlineArray. It supports both positive indices starting from 0 and negative indices counting backwards from the end of the array. The index is bounds-checked at runtime. **Parameters:** * ​I (`Indexer`): The type parameter representing the index type, must implement Indexer trait. **Args:** * ​idx (`I`): The index to access. Can be positive (0 to len-1) or negative (-len to -1). **Returns:** A reference to the element at the specified index. `__getitem__[I: Indexer, //, idx: I](ref self) -> ref [self] ElementType` Gets a reference to the element at the given index with compile-time bounds checking. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) print(arr[0]) # Prints 1 - first element print(arr[-1]) # Prints 3 - last element ``` Notes: This overload provides array-style indexing with compile-time bounds checking. The index must be a compile-time constant value. It supports both positive indices starting from 0 and negative indices counting backwards from the end of the array. **Parameters:** * ​I (`Indexer`): The type parameter representing the index type, must implement Indexer trait. * ​idx (`I`): The compile-time constant index to access. Can be positive (0 to len-1) or negative (-len to -1). **Returns:** A reference to the element at the specified index. ### `__contains__` `__contains__[T: EqualityComparable & Copyable & Movable, //](self: InlineArray[T, size], value: T) -> Bool` Tests if a value is present in the array using the `in` operator. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) print(3 in arr) # Prints True - value exists print(4 in arr) # Prints False - value not found ``` Notes: This method enables using the `in` operator to check if a value exists in the array. It performs a linear search comparing each element for equality with the given value. The element type must implement the `EqualityComparable`, `Copyable` and `Movable` traits to support equality comparison. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The element type; must implement `EqualityComparable`, `Copyable`, and `Movable`. **Args:** * ​value (`T`): The value to search for. **Returns:** True if the value is found in any position in the array, False otherwise. ### `copy` `copy(self) -> Self` Creates a deep copy of the array.
Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) var copy = arr.copy() # Creates new array [1, 2, 3] ``` **Returns:** A new array containing copies of all elements. ### `__len__` `__len__(self) -> Int` Returns the length of the array. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) print(len(arr)) # Prints 3 ``` Notes: The length is a compile-time constant value determined by the size parameter used when creating the array. **Returns:** The size of the array as an Int. ### `unsafe_get` `unsafe_get[I: Indexer](ref self, idx: I) -> ref [self] ElementType` Gets a reference to an element without bounds checking. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) print(arr.unsafe_get(0)) # Prints 1 ``` Warning: This is an unsafe method. No bounds checking is performed. Using an invalid index will cause undefined behavior. Negative indices are not supported. Notes: This is an unsafe method that skips bounds checking for performance. Users should prefer `__getitem__` instead for safety. **Parameters:** * ​I (`Indexer`): A type parameter representing the index type, must implement Indexer trait. **Args:** * ​idx (`I`): The index of the element to get. Must be non-negative and in bounds. Using an invalid index will cause undefined behavior. **Returns:** A reference to the element at the given index. ### `unsafe_ptr` `unsafe_ptr(ref self) -> UnsafePointer[ElementType, mut=self_is_mut, origin=self_is_origin]` Gets an unsafe pointer to the underlying array storage. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) var ptr = arr.unsafe_ptr() print(ptr[0]) # Prints 1 ``` Warning: This is an unsafe method. The returned pointer: * Becomes invalid if the array is moved * Must not be used to access memory outside array bounds * Must be refreshed after any operation that could move the array Notes: Returns a raw pointer to the array's memory that can be used for direct memory access. The pointer inherits mutability from the array reference. **Returns:** An `UnsafePointer` to the underlying array storage. The pointer's mutability matches that of the array reference. --- ## inline_array Provides a fixed-size array implementation with compile-time size checking. The `InlineArray` type represents a fixed-size sequence of homogeneous elements where the size is determined at compile time. It provides efficient memory layout and bounds checking while maintaining type safety. The `InlineArray` type is part of the `prelude` module and therefore does not need to be imported in order to use it. Examples: ```mojo # Create an array of 3 integers var arr = InlineArray[Int, 3](1, 2, 3) # Access elements print(arr[0]) # Prints 1 # Fill with a value var filled = InlineArray[Int, 5](fill=42) ``` Notes: * For historical reasons, destructors are not run by default on the elements of an `InlineArray`. This can be controlled with the `run_destructors` parameter. In the future, this will default to `True` and the `run_destructors` parameter will be removed. ## Structs * [​`InlineArray`](/mojo/stdlib/collections/inline_array/InlineArray): A fixed-size sequence of homogeneous elements where size is a constant expression. --- ## Interval `struct Interval[T: IntervalElement]` A half-open interval \[start, end) that represents a range of values. The interval includes the start value but excludes the end value. ## Parameters * ​T (`IntervalElement`): The type of the interval bounds. ## Fields * ​start (`T`): The inclusive start of the interval. * ​end (`T`): The exclusive end of the interval. 
## Implemented traits `AnyType`, `Boolable`, `Copyable`, `EqualityComparable`, `Movable`, `Representable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self, start: T, end: T)` Initialize an interval with start and end values. **Args:** * ​start (`T`): The starting value of the interval. * ​end (`T`): The ending value of the interval. Must be greater than or equal to start. `__init__(out self, interval: Tuple[T, T], /)` Initialize an interval with a tuple of start and end values. **Args:** * ​interval (`Tuple[T, T]`): A tuple containing the start and end values. ### `__copyinit__` `__copyinit__(out self, existing: Self, /)` Create a new instance of the interval by copying the values from an existing one. **Args:** * ​existing (`Self`): The interval to copy values from. ### `__moveinit__` `__moveinit__(out self, owned existing: Self, /)` Create a new instance of the interval by moving the values from an existing one. **Args:** * ​existing (`Self`): The interval to move values from. ### `__bool__` `__bool__(self) -> Bool` Returns whether this interval is non-empty. **Returns:** True if the interval is not empty (start < end), False otherwise. ### `__lt__` `__lt__(self, other: Self) -> Bool` Returns whether this interval is less than another interval. **Args:** * ​other (`Self`): The interval to compare with. **Returns:** True if this interval's start is less than the other interval's start. ### `__le__` `__le__(self, other: Self) -> Bool` Returns whether this interval is less than or equal to another interval. **Args:** * ​other (`Self`): The interval to compare with. **Returns:** True if this interval's start is less than or equal to the other interval's start. ### `__eq__` `__eq__(self, other: Self) -> Bool` Returns whether this interval equals another interval. **Args:** * ​other (`Self`): The interval to compare with. **Returns:** True if both intervals have the same start and end values. ### `__ne__` `__ne__(self, other: Self) -> Bool` Returns whether this interval is not equal to another interval. **Args:** * ​other (`Self`): The interval to compare with. **Returns:** True if the intervals are not equal, False if they are equal. ### `__gt__` `__gt__(self, other: Self) -> Bool` Returns whether this interval is greater than another interval. **Args:** * ​other (`Self`): The interval to compare with. **Returns:** True if this interval's end is greater than the other interval's end. ### `__ge__` `__ge__(self, other: Self) -> Bool` Returns whether this interval is greater than or equal to another interval. **Args:** * ​other (`Self`): The interval to compare with. **Returns:** True if this interval's end is greater than or equal to the other interval's end. ### `__contains__` `__contains__(self, other: T) -> Bool` Returns whether a value is contained within this interval. **Args:** * ​other (`T`): The value to check. **Returns:** True if the value is within the interval bounds, False otherwise. `__contains__(self, other: Self) -> Bool` Returns whether another interval is fully contained within this interval. **Args:** * ​other (`Self`): The interval to check. **Returns:** True if the other interval is fully contained within this interval, False otherwise. ### `overlaps` `overlaps(self, other: Self) -> Bool` Returns whether this interval overlaps with another interval. **Args:** * ​other (`Self`): The interval to check for overlap with. **Returns:** True if the intervals overlap, False otherwise.
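A short sketch of the half-open semantics and the `__contains__`/`overlaps` methods above (it assumes `Interval` is importable from the `collections.interval` module; `Int` satisfies `IntervalElement`):

```mojo
from collections.interval import Interval

def main():
    # Half-open: the start is included, the end is excluded.
    var a = Interval[Int](0, 10)
    var b = Interval[Int](5, 15)
    print(9 in a)         # True
    print(10 in a)        # False: the end is exclusive
    print(a.overlaps(b))  # True
```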
### `union` `union(self, other: Self) -> Self` Returns the union of this interval and another interval. **Args:** * ​other (`Self`): The interval to union with. **Returns:** The union of this interval and the other interval. ### `intersection` `intersection(self, other: Self) -> Self` Returns the intersection of this interval and another interval. **Args:** * ​other (`Self`): The interval to intersect with. **Returns:** The intersection of this interval and the other interval. ### `__len__` `__len__(self) -> Int` Returns the length of this interval. **Returns:** The difference between end and start values as an integer. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes this interval to a writer in the format '(start, end)'. **Parameters:** * ​W (`Writer`): The writer type that implements the Writer trait. **Args:** * ​writer (`W`): The writer to write the interval to. ### `__str__` `__str__(self) -> String` Returns a string representation of this interval. **Returns:** A string in the format '(start, end)' representing this interval. ### `__repr__` `__repr__(self) -> String` Returns a string representation of this interval suitable for debugging. **Returns:** A string in the format '(start, end)' representing this interval. --- ## IntervalElement The trait denotes a trait composition of the `Copyable`, `Movable`, `Writable`, `Intable`, and `Comparable` traits, and additionally requires subtraction. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `Intable`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `UnknownDestructibility`, `Writable` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `__lt__` `__lt__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is less than `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is less than `rhs`. ### `__le__` `__le__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is less than or equal to `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is less than or equal to `rhs`. ### `__eq__` `__eq__(self: _Self, other: _Self) -> Bool` Define whether two instances of the object are equal to each other. **Args:** * ​other (`_Self`): Another instance of the same type. **Returns:** True if the instances are equal according to the type's definition of equality, False otherwise. ### `__ne__` `__ne__(self: _Self, other: _Self) -> Bool` Define whether two instances of the object are not equal to each other. **Args:** * ​other (`_Self`): Another instance of the same type. **Returns:** True if the instances are not equal according to the type's definition of equality, False otherwise. ### `__gt__` `__gt__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is greater than `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is greater than `rhs`. ### `__ge__` `__ge__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is greater than or equal to `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison.
**Returns:** True if `self` is greater than or equal to `rhs`. ### `__sub__` `__sub__(self: _Self, rhs: _Self) -> _Self` Subtracts rhs from self, must be implemented in concrete types. **Args:** * ​rhs (`_Self`): The value to subtract from self. **Returns:** The result of subtracting rhs from self. ### `__int__` `__int__(self: _Self) -> Int` Get the integral representation of the value. **Returns:** The integral representation of the value. ### `write_to` `write_to[W: Writer](self: _Self, mut writer: W)` Formats the string representation of this type to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The type conforming to `Writable`. --- ## IntervalTree `struct IntervalTree[T: IntervalElement, U: Copyable & Movable & Stringable & EqualityComparable & LessThanComparable & GreaterThanComparable & LessThanOrEqualComparable & GreaterThanOrEqualComparable]` An interval tree data structure for efficient range queries. ## Parameters * ​T (`IntervalElement`): The type of the interval bounds, must support subtraction, integer conversion, string conversion, comparison and collection operations. * ​U (`Copyable & Movable & Stringable & EqualityComparable & LessThanComparable & GreaterThanComparable & LessThanOrEqualComparable & GreaterThanOrEqualComparable`): The type of the associated data, must support string conversion and collection operations. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self)` Initializes an empty IntervalTree. ### `insert` `insert(mut self, interval: Tuple[T, T], data: U)` Insert a new interval into the tree using a tuple representation. **Args:** * ​interval (`Tuple[T, T]`): A tuple containing the start and end values of the interval. * ​data (`U`): The data value to associate with this interval. `insert(mut self, interval: Interval[T], data: U)` Insert a new interval into the tree. This method inserts a new interval and its associated data into the interval tree. It maintains the binary search tree property based on interval start times and updates the tree structure to preserve red-black tree properties. **Args:** * ​interval (`Interval[T]`): The interval to insert into the tree. * ​data (`U`): The data value to associate with this interval. ### `__str__` `__str__(self) -> String` Returns a string representation of the interval tree. **Returns:** A string representation of the interval tree. ### `__repr__` `__repr__(self) -> String` Returns a string representation of the interval tree suitable for debugging. **Returns:** A string representation of the interval tree. ### `write_to` `write_to[w: Writer](self, mut writer: w)` Writes the interval tree to a writer. **Parameters:** * ​w (`Writer`): The writer type that implements the Writer trait. **Args:** * ​writer (`w`): The writer to write the interval tree to. ### `depth` `depth(self) -> Int` Returns the depth of the interval tree. **Returns:** The depth of the interval tree. ### `transplant` `transplant(mut self, mut u: UnsafePointer[_IntervalNode[T, U]], mut v: UnsafePointer[_IntervalNode[T, U]])` Transplants the subtree rooted at node u with the subtree rooted at node v. **Args:** * ​u (`UnsafePointer[_IntervalNode[T, U]]`): The node to transplant. * ​v (`UnsafePointer[_IntervalNode[T, U]]`): The node to transplant to. ### `search` `search(self, interval: Tuple[T, T]) -> List[U]` Searches for intervals overlapping with the given tuple. 
**Args:** * ​interval (`Tuple[T, T]`): The interval tuple (start, end). **Returns:** A list of data associated with overlapping intervals. `search(self, interval: Interval[T]) -> List[U]` Searches for intervals overlapping with the given interval. **Args:** * ​interval (`Interval[T]`): The interval to search. **Returns:** A list of data associated with overlapping intervals. --- ## interval A self-balancing interval tree is a specialized binary search tree designed to efficiently store and query intervals. It maintains intervals sorted by their low endpoints and augments each node with a `max_high` attribute, representing the maximum high endpoint in its subtree. This `max_high` value enables efficient overlap searching by pruning the search space. Self-balancing mechanisms, such as Red-Black or AVL trees, ensure logarithmic time complexity for operations. Key Features: * Stores intervals (low, high). * Nodes ordered by `low` endpoints. * `max_high` attribute at each node for efficient overlap search. * Self-balancing (e.g., using Red-Black tree logic) for O(log n) operations. Operations: * Insertion: O(log n) - Adds a new interval, maintaining balance and updating `max_high`. * Overlap Search: O(log n) - Finds intervals overlapping a query interval using `max_high` for pruning. * Deletion: O(log n) - Removes an interval, maintaining balance and updating `max_high`. Space Complexity: O(n), where n is the number of intervals. Use Cases: * Calendar scheduling * Computational geometry * Genomics * Database indexing * Resource allocation In essence, this data structure provides a fast and efficient way to manage and query interval data, particularly for finding overlaps. ## Structs * [​`Interval`](/mojo/stdlib/collections/interval/Interval): A half-open interval \[start, end) that represents a range of values. * [​`IntervalTree`](/mojo/stdlib/collections/interval/IntervalTree): An interval tree data structure for efficient range queries. ## Traits * [​`IntervalElement`](/mojo/stdlib/collections/interval/IntervalElement): The trait denotes a trait composition of the `Copyable`, `Movable`, `Writable`, `Intable`, and `Comparable` traits. Which is also subtractable. --- ## LinkedList `struct LinkedList[ElementType: Copyable & Movable]` A doubly-linked list implementation. A doubly-linked list is a data structure where each element points to both the next and previous elements, allowing for efficient insertion and deletion at any position. ## Parameters * ​ElementType (`Copyable & Movable`): The type of elements stored in the list. Must implement the `Copyable` and `Movable` traits. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initialize an empty linked list. Notes: Time Complexity: O(1). `__init__(out self, owned *elements: ElementType, *, __list_literal__: Tuple[] = Tuple())` Initialize a linked list with the given elements. Notes: Time Complexity: O(n) in len(elements). **Args:** * ​\*elements (`ElementType`): Variable number of elements to initialize the list with. * ​**list\_literal** (`Tuple[]`): Tell Mojo to use this method for list literals. `__init__(out self, *, owned elements: VariadicListMem[ElementType, origin, is_owned])` Construct a list from a `VariadicListMem`. Notes: Time Complexity: O(n) in len(elements). **Args:** * ​elements (`VariadicListMem[ElementType, origin, is_owned]`): The elements to add to the list. 
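The constructors above combine with the O(1) end operations documented below (`append`, `prepend`); here is a minimal sketch, assuming `LinkedList` is re-exported from the `collections` package (otherwise import it from `collections.linked_list`):

```mojo
from collections import LinkedList

def main():
    var xs = LinkedList[Int](2, 3)
    xs.prepend(1)  # O(1) insertion at the head
    xs.append(4)   # O(1) insertion at the tail
    print(len(xs))       # 4
    print(xs[0], xs[3])  # 1 4
```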
### `__copyinit__` `__copyinit__(out self, other: Self)` Initialize this list as a copy of another list. Notes: Time Complexity: O(n) in len(elements). **Args:** * ​other (`Self`): The list to copy from. ### `__moveinit__` `__moveinit__(out self, owned other: Self)` Initialize this list by moving elements from another list. Notes: Time Complexity: O(1). **Args:** * ​other (`Self`): The list to move elements from. ### `__del__` `__del__(owned self)` Clean up the list by freeing all nodes. Notes: Time Complexity: O(n) in len(self). ### `__bool__` `__bool__(self) -> Bool` Check if the list is non-empty. Notes: Time Complexity: O(1). **Returns:** True if the list has elements, False otherwise. ### `__getitem__` `__getitem__[I: Indexer](ref self, index: I) -> ref [self] ElementType` Get the element at the specified index. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​I (`Indexer`): The type of index to use. **Args:** * ​index (`I`): The index of the element to get. **Returns:** The element at the specified index. ### `__setitem__` `__setitem__[I: Indexer](mut self, index: I, owned value: ElementType)` Set the element at the specified index. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​I (`Indexer`): The type of index to use. **Args:** * ​index (`I`): The index of the element to set. * ​value (`ElementType`): The new value to set. ### `__eq__` `__eq__[ElementType: EqualityComparable & Copyable & Movable, //](self: LinkedList[ElementType], other: LinkedList[ElementType]) -> Bool` Checks if the two lists are equal. Notes: Time Complexity: O(n) in min(len(self), len(other)) compares. **Parameters:** * ​ElementType (`EqualityComparable & Copyable & Movable`): The list element type, used to conditionally enable the function. **Args:** * ​other (`LinkedList[ElementType]`): The list to compare to. **Returns:** Whether the lists are equal. ### `__ne__` `__ne__[ElementType: EqualityComparable & Copyable & Movable, //](self: LinkedList[ElementType], other: LinkedList[ElementType]) -> Bool` Checks if the two lists are not equal. Notes: Time Complexity: O(n) in min(len(self), len(other)) compares. **Parameters:** * ​ElementType (`EqualityComparable & Copyable & Movable`): The list element type, used to conditionally enable the function. **Args:** * ​other (`LinkedList[ElementType]`): The list to compare to. **Returns:** Whether the lists are not equal. ### `__contains__` `__contains__[ElementType: EqualityComparable & Copyable & Movable, //](self: LinkedList[ElementType], value: ElementType) -> Bool` Checks if the list contains `value`. Notes: Time Complexity: O(n) in len(self) compares. **Parameters:** * ​ElementType (`EqualityComparable & Copyable & Movable`): The list element type, used to conditionally enable the function. **Args:** * ​value (`ElementType`): The value to search for in the list. **Returns:** Whether the list contains `value`. ### `append` `append(mut self, owned value: ElementType)` Add an element to the end of the list. Notes: Time Complexity: O(1). **Args:** * ​value (`ElementType`): The value to append. ### `prepend` `prepend(mut self, owned value: ElementType)` Add an element to the beginning of the list. Notes: Time Complexity: O(1). **Args:** * ​value (`ElementType`): The value to prepend. ### `reverse` `reverse(mut self)` Reverse the order of elements in the list. Notes: Time Complexity: O(n) in len(self). ### `pop` `pop(mut self) -> ElementType` Remove and return the last element of the list. Notes: Time Complexity: O(1). 
**Returns:** The last element in the list. `pop[I: Indexer](mut self, owned i: I) -> ElementType` Remove the ith element of the list, counting from the tail if given a negative index. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​I (`Indexer`): The type of index to use. **Args:** * ​i (`I`): The index of the element to get. **Returns:** Ownership of the indicated element. ### `maybe_pop` `maybe_pop(mut self) -> Optional[ElementType]` Removes the tail of the list and returns it, if it exists. Notes: Time Complexity: O(1). **Returns:** The tail of the list, if it was present. `maybe_pop[I: Indexer](mut self, owned i: I) -> Optional[ElementType]` Remove the ith element of the list, counting from the tail if given a negative index. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​I (`Indexer`): The type of index to use. **Args:** * ​i (`I`): The index of the element to get. **Returns:** The element, if it was found. ### `clear` `clear(mut self)` Removes all elements from the list. Notes: Time Complexity: O(n) in len(self). ### `copy` `copy(self) -> Self` Create a deep copy of the list. Notes: Time Complexity: O(n) in len(self). **Returns:** A new list containing copies of all elements. ### `insert` `insert[I: Indexer](mut self, idx: I, owned elem: ElementType)` Insert an element `elem` into the list at index `idx`. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​I (`Indexer`): The type of index to use. **Args:** * ​idx (`I`): The index to insert `elem` at: `-len(self) <= idx <= len(self)`. * ​elem (`ElementType`): The item to insert into the list. **Raises:** When given an out of bounds index. ### `extend` `extend(mut self, owned other: Self)` Extends the list with another. Notes: Time Complexity: O(1). **Args:** * ​other (`Self`): The list to append to this one. ### `count` `count[ElementType: EqualityComparable & Copyable & Movable, //](self: LinkedList[ElementType], elem: ElementType) -> UInt` Count the occurrences of `elem` in the list. Notes: Time Complexity: O(n) in len(self) compares. **Parameters:** * ​ElementType (`EqualityComparable & Copyable & Movable`): The list element type, used to conditionally enable the function. **Args:** * ​elem (`ElementType`): The element to search for. **Returns:** The number of occurrences of `elem` in the list. ### `__len__` `__len__(self) -> Int` Get the number of elements in the list. Notes: Time Complexity: O(1). **Returns:** The number of elements in the list. ### `__iter__` `__iter__(self) -> _LinkedListIter[ElementType, self]` Iterate over elements of the list, returning immutable references. Notes: Time Complexity: * O(1) for iterator construction. * O(n) in len(self) for a complete iteration of the list. **Returns:** An iterator of immutable references to the list elements. ### `__reversed__` `__reversed__(self) -> _LinkedListIter[ElementType, self, False]` Iterate backwards over the list, returning immutable references. Notes: Time Complexity: * O(1) for iterator construction. * O(n) in len(self) for a complete iteration of the list. **Returns:** A reversed iterator of immutable references to the list elements. ### `__str__` `__str__[ElementType: Copyable & Movable & Writable](self: LinkedList[ElementType]) -> String` Convert the list to its string representation. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​ElementType (`Copyable & Movable & Writable`): Used to conditionally enable this function when `ElementType` is `Writable`. **Returns:** String representation of the list.
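The difference between the raising `pop` and the `Optional`-returning `maybe_pop` above shows up in a short sketch (same import assumption as the previous example):

```mojo
from collections import LinkedList

def main():
    var xs = LinkedList[Int](1, 2, 3)
    # `maybe_pop` returns an empty Optional instead of raising when empty.
    var tail = xs.maybe_pop()
    if tail:
        print(tail.value())  # 3
    xs.clear()
    print(len(xs))  # 0
```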
### `__repr__` `__repr__[ElementType: Copyable & Movable & Writable](self: LinkedList[ElementType]) -> String` Convert the list to its string representation. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​ElementType (`Copyable & Movable & Writable`): Used to conditionally enable this function when `ElementType` is `Writable`. **Returns:** String representation of the list. ### `write_to` `write_to[W: Writer, ElementType: Copyable & Movable & Writable](self: LinkedList[ElementType], mut writer: W)` Write the list to the given writer. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​W (`Writer`): The type of writer to write the list to. * ​ElementType (`Copyable & Movable & Writable`): Used to conditionally enable this function when `ElementType` is `Writable`. **Args:** * ​writer (`W`): The writer to write the list to. --- ## Node `struct Node[ElementType: Copyable & Movable]` A node in a linked list data structure. ## Parameters * ​ElementType (`Copyable & Movable`): The type of element stored in the node. ## Fields * ​value (`ElementType`): The value stored in this node. * ​prev (`UnsafePointer[Node[ElementType]]`): The previous node in the list. * ​next (`UnsafePointer[Node[ElementType]]`): The next node in the list. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, owned value: ElementType, prev: Optional[UnsafePointer[Node[ElementType]]], next: Optional[UnsafePointer[Node[ElementType]]])` Initialize a new Node with the given value and optional prev/next pointers. **Args:** * ​value (`ElementType`): The value to store in this node. * ​prev (`Optional[UnsafePointer[Node[ElementType]]]`): Optional pointer to the previous node. * ​next (`Optional[UnsafePointer[Node[ElementType]]]`): Optional pointer to the next node. ### `__str__` `__str__[ElementType: Copyable & Movable & Writable](self: Node[ElementType]) -> String` Convert this node's value to a string representation. **Parameters:** * ​ElementType (`Copyable & Movable & Writable`): Used to conditionally enable this function if `ElementType` is `Writable`. **Returns:** String representation of the node's value. ### `write_to` `write_to[ElementType: Copyable & Movable & Writable, W: Writer](self: Node[ElementType], mut writer: W)` Write this node's value to the given writer. **Parameters:** * ​ElementType (`Copyable & Movable & Writable`): Used to conditionally enable this function if `ElementType` is `Writable`. * ​W (`Writer`): The type of writer to write the value to. **Args:** * ​writer (`W`): The writer to write the value to. --- ## linked_list ## Structs * [​`LinkedList`](/mojo/stdlib/collections/linked_list/LinkedList): A doubly-linked list implementation. * [​`Node`](/mojo/stdlib/collections/linked_list/Node): A node in a linked list data structure. --- ## List `struct List[T: Copyable & Movable, hint_trivial_type: Bool = False]` The `List` type is a dynamically-allocated list. Notes: It supports pushing and popping from the back, resizing the underlying storage as needed. When it is deallocated, it frees its memory. ## Parameters * ​T (`Copyable & Movable`): The type of the elements. * ​hint\_trivial\_type (`Bool`): A hint to the compiler that the type T is trivial. It's not mandatory, but if set, it allows some optimizations. ## Fields * ​data (`UnsafePointer[T]`): The underlying storage for the list. * ​capacity (`Int`): The amount of elements that can fit in the list without resizing it.
## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Constructs an empty list. `__init__(out self, *, capacity: Int)` Constructs a list with the given capacity. **Args:** * ​capacity (`Int`): The requested capacity of the list. `__init__(out self, *, length: UInt, fill: T)` Constructs a list of the given length, filled with the given value. **Args:** * ​length (`UInt`): The requested length of the list. * ​fill (`T`): The value used to fill each element of the list. `__init__(out self, owned *values: T, *, __list_literal__: Tuple[] = Tuple())` Constructs a list from the given values. **Args:** * ​\*values (`T`): The values to populate the list with. * ​\_\_list\_literal\_\_ (`Tuple[]`): Tell Mojo to use this method for list literals. `__init__(out self, *, owned elements: VariadicListMem[T, origin, is_owned])` Constructs a list from the given values. **Args:** * ​elements (`VariadicListMem[T, origin, is_owned]`): The values to populate the list with. `__init__(out self, span: Span[T, origin])` Constructs a list from a Span of values. **Args:** * ​span (`Span[T, origin]`): The span of values to populate the list with. `__init__(out self, *, unsafe_uninit_length: Int)` Construct a list with the specified length, with uninitialized memory. This is unsafe, as it relies on the caller initializing the elements with unsafe operations, not assigning over the uninitialized data. **Args:** * ​unsafe\_uninit\_length (`Int`): The number of elements to allocate. ### `__copyinit__` `__copyinit__(out self, existing: Self)` Creates a deepcopy of the given list. **Args:** * ​existing (`Self`): The list to copy. ### `__del__` `__del__(owned self)` Destroy all elements in the list and free its memory. ### `__bool__` `__bool__(self) -> Bool` Checks whether the list has any elements or not. **Returns:** `False` if the list is empty, `True` if there is at least one element. ### `__getitem__` `__getitem__(self, slice: Slice) -> Self` Gets the sequence of elements at the specified positions. **Args:** * ​slice (`Slice`): A slice that specifies positions of the new list. **Returns:** A new list containing the elements at the specified slice. `__getitem__[I: Indexer](ref self, idx: I) -> ref [self] T` Gets the list element at the given index. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The index of the element. **Returns:** A reference to the element at the given index. ### `__eq__` `__eq__[U: EqualityComparable & Copyable & Movable, //](self: List[U, hint_trivial_type], other: List[U, hint_trivial_type]) -> Bool` Checks if two lists are equal. Examples: ```mojo var x = [1, 2, 3] var y = [1, 2, 3] print("x and y are equal" if x == y else "x and y are not equal") ``` **Parameters:** * ​U (`EqualityComparable & Copyable & Movable`): The type of the elements in the list. Must implement the trait `EqualityComparable`. **Args:** * ​other (`List[U, hint_trivial_type]`): The list to compare with. **Returns:** True if the lists are equal, False otherwise. ### `__ne__` `__ne__[U: EqualityComparable & Copyable & Movable, //](self: List[U, hint_trivial_type], other: List[U, hint_trivial_type]) -> Bool` Checks if two lists are not equal. Examples: ```mojo var x = [1, 2, 3] var y = [1, 2, 4] print("x and y are not equal" if x != y else "x and y are equal") ``` **Parameters:** * ​U (`EqualityComparable & Copyable & Movable`): The type of the elements in the list.
Must implement the trait `EqualityComparable`. **Args:** * ​other (`List[U, hint_trivial_type]`): The list to compare with. **Returns:** True if the lists are not equal, False otherwise. ### `__contains__` `__contains__[U: EqualityComparable & Copyable & Movable, //](self: List[U, hint_trivial_type], value: U) -> Bool` Verify if a given value is present in the list. Examples: ```mojo var x = [1, 2, 3] print("x contains 3" if 3 in x else "x does not contain 3") ``` **Parameters:** * ​U (`EqualityComparable & Copyable & Movable`): The type of the elements in the list. Must implement the trait `EqualityComparable`. **Args:** * ​value (`U`): The value to find. **Returns:** True if the value is contained in the list, False otherwise. ### `__add__` `__add__(self, owned other: Self) -> Self` Concatenates self with other and returns the result as a new list. **Args:** * ​other (`Self`): List whose elements will be combined with the elements of self. **Returns:** The newly created list. ### `__mul__` `__mul__(self, x: Int) -> Self` Multiplies the list by x and returns a new list. **Args:** * ​x (`Int`): The multiplier number. **Returns:** The new list. ### `__iadd__` `__iadd__(mut self, owned other: Self)` Appends the elements of other into self. **Args:** * ​other (`Self`): List whose elements will be appended to self. ### `__imul__` `__imul__(mut self, x: Int)` Appends the original elements of this list x-1 times, or clears it if x is less than or equal to 0. **Args:** * ​x (`Int`): The multiplier number. ### `copy` `copy(self) -> Self` Creates a deep copy of the given list. **Returns:** A copy of the value. ### `__iter__` `__iter__(ref self) -> _ListIter[T, hint_trivial_type, self_is_origin]` Iterate over elements of the list, returning immutable references. **Returns:** An iterator of immutable references to the list elements. ### `__reversed__` `__reversed__(ref self) -> _ListIter[T, hint_trivial_type, self_is_origin, False]` Iterate backwards over the list, returning immutable references. **Returns:** A reversed iterator of immutable references to the list elements. ### `__len__` `__len__(self) -> Int` Gets the number of elements in the list. **Returns:** The number of elements in the list. ### `__str__` `__str__[U: Representable & Copyable & Movable, //](self: List[U, hint_trivial_type]) -> String` Returns a string representation of a `List`. Notes: Note that since we can't condition methods on a trait yet, the way to call this method is a bit special. Here is an example below: ```mojo var my_list = [1, 2, 3] print(my_list.__str__()) ``` When the compiler supports conditional methods, then a simple `String(my_list)` will be enough. **Parameters:** * ​U (`Representable & Copyable & Movable`): The type of the elements in the list. Must implement the trait `Representable`. **Returns:** A string representation of the list. ### `write_to` `write_to[W: Writer, U: Representable & Copyable & Movable, //](self: List[U, hint_trivial_type], mut writer: W)` Write `my_list.__str__()` to a `Writer`. **Parameters:** * ​W (`Writer`): A type conforming to the Writer trait. * ​U (`Representable & Copyable & Movable`): The type of the List elements. Must have the trait `Representable`. **Args:** * ​writer (`W`): The object to write to. ### `__repr__` `__repr__[U: Representable & Copyable & Movable, //](self: List[U, hint_trivial_type]) -> String` Returns a string representation of a `List`. Notes: Note that since we can't condition methods on a trait yet, the way to call this method is a bit special.
Here is an example below: ```mojo var my_list = [1, 2, 3] print(my_list.__repr__()) ``` When the compiler supports conditional methods, then a simple `repr(my_list)` will be enough. **Parameters:** * ​U (`Representable & Copyable & Movable`): The type of the elements in the list. Must implement the trait `Representable`. **Returns:** A string representation of the list. ### `byte_length` `byte_length(self) -> Int` Gets the byte length of the List (`len(self) * sizeof[T]()`). **Returns:** The byte length of the List (`len(self) * sizeof[T]()`). ### `append` `append(mut self, owned value: T)` Appends a value to this list. Notes: If there is no capacity left, resizes to twice the current capacity. Except for 0 capacity, where it sets the capacity to 1. **Args:** * ​value (`T`): The value to append. `append(mut self, elements: Span[T, origin])` Appends elements to this list. **Args:** * ​elements (`Span[T, origin]`): The elements to append. ### `insert` `insert(mut self, i: Int, owned value: T)` Inserts a value to the list at the given index. `a.insert(len(a), value)` is equivalent to `a.append(value)`. **Args:** * ​i (`Int`): The index for the value. * ​value (`T`): The value to insert. ### `extend` `extend(mut self, owned other: List[T, hint_trivial_type])` Extends this list by consuming the elements of `other`. **Args:** * ​other (`List[T, hint_trivial_type]`): List whose elements will be added in order at the end of this list. `extend[D: DType, //](mut self: List[SIMD[D, 1], hint_trivial_type], value: SIMD[D, size])` Extends this list with the elements of a vector. Notes: If there is no capacity left, resizes to `len(self) + value.size`. **Parameters:** * ​D (`DType`): The DType. **Args:** * ​value (`SIMD[D, size]`): The value to append. `extend[D: DType, //](mut self: List[SIMD[D, 1], hint_trivial_type], value: SIMD[D, size], *, count: Int)` Extends this list with `count` number of elements from a vector. Notes: If there is no capacity left, resizes to `len(self) + count`. **Parameters:** * ​D (`DType`): The DType. **Args:** * ​value (`SIMD[D, size]`): The value to append. * ​count (`Int`): The amount of items to append. Must be less than or equal to `value.size`. `extend[D: DType, //](mut self: List[SIMD[D, 1], hint_trivial_type], value: Span[SIMD[D, 1], origin])` Extends this list with the elements of a `Span`. Notes: If there is no capacity left, resizes to `len(self) + len(value)`. **Parameters:** * ​D (`DType`): The DType. **Args:** * ​value (`Span[SIMD[D, 1], origin]`): The value to append. ### `pop` `pop(mut self, i: Int = -1) -> T` Pops a value from the list at the given index. **Args:** * ​i (`Int`): The index of the value to pop. **Returns:** The popped value. ### `reserve` `reserve(mut self, new_capacity: Int)` Reserves the requested capacity. Notes: If the current capacity is greater or equal, this is a no-op. Otherwise, the storage is reallocated and the data is moved. **Args:** * ​new\_capacity (`Int`): The new capacity. ### `resize` `resize(mut self, new_size: Int, value: T)` Resizes the list to the given new size. Notes: If the new size is smaller than the current one, elements at the end are discarded. If the new size is larger than the current one, the list is extended with copies of `value` up to the requested size. **Args:** * ​new\_size (`Int`): The new size. * ​value (`T`): The value to use to populate new elements. `resize(mut self, *, unsafe_uninit_length: Int)` Resizes the list to the given new size, leaving any new elements uninitialized.
If the new size is smaller than the current one, elements at the end are discarded. If the new size is larger than the current one, the list is extended and the new elements are left uninitialized. **Args:** * ​unsafe\_uninit\_length (`Int`): The new size. ### `shrink` `shrink(mut self, new_size: Int)` Resizes to the given new size, which must be less than or equal to the current size. **Args:** * ​new\_size (`Int`): The new size. ### `reverse` `reverse(mut self)` Reverses the elements of the list. ### `index` `index[C: EqualityComparable & Copyable & Movable, //](ref self: List[C, hint_trivial_type], value: C, start: Int = 0, stop: Optional[Int] = Optional(None)) -> Int` Returns the index of the first occurrence of a value in a list restricted by the range given the start and stop bounds. Examples: ```mojo var my_list = [1, 2, 3] print(my_list.index(2)) # prints `1` ``` **Parameters:** * ​C (`EqualityComparable & Copyable & Movable`): The type of the elements in the list. Must implement the `EqualityComparable` trait. **Args:** * ​value (`C`): The value to search for. * ​start (`Int`): The starting index of the search, treated as a slice index (defaults to 0). * ​stop (`Optional[Int]`): The ending index of the search, treated as a slice index (defaults to None, which means the end of the list). **Returns:** The index of the first occurrence of the value in the list. **Raises:** ValueError: If the value is not found in the list. ### `clear` `clear(mut self)` Clears the elements in the list. ### `steal_data` `steal_data(mut self) -> UnsafePointer[T]` Take ownership of the underlying pointer from the list. **Returns:** The underlying data. ### `unsafe_get` `unsafe_get(ref self, idx: Int) -> ref [self] T` Get a reference to an element of self without checking index bounds. Notes: Users should consider using `__getitem__` instead of this method as it is unsafe. If an index is out of bounds, this method will not abort, it will be considered undefined behavior. Note that there is no wraparound for negative indices, caution is advised. Using negative indices is considered undefined behavior. Never use `my_list.unsafe_get(-1)` to get the last element of the list. Instead, do `my_list.unsafe_get(len(my_list) - 1)`. **Args:** * ​idx (`Int`): The index of the element to get. **Returns:** A reference to the element at the given index. ### `unsafe_set` `unsafe_set(mut self, idx: Int, owned value: T)` Write a value to a given location without checking index bounds. Notes: Users should consider using `my_list[idx] = value` instead of this method as it is unsafe. If an index is out of bounds, this method will not abort, it will be considered undefined behavior. Note that there is no wraparound for negative indices, caution is advised. Using negative indices is considered undefined behavior. Never use `my_list.unsafe_set(-1, value)` to set the last element of the list. Instead, do `my_list.unsafe_set(len(my_list) - 1, value)`. **Args:** * ​idx (`Int`): The index of the element to set. * ​value (`T`): The value to set. ### `count` `count[T: EqualityComparable & Copyable & Movable, //](self: List[T, hint_trivial_type], value: T) -> Int` Counts the number of occurrences of a value in the list. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the list. Must implement the trait `EqualityComparable`. **Args:** * ​value (`T`): The value to count. **Returns:** The number of occurrences of the value in the list.
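A short sketch of `count`, in the style of the examples above:

```mojo
var my_list = [1, 2, 2, 3, 2]
print(my_list.count(2))  # prints `3`
```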
### `swap_elements` `swap_elements(mut self, elt_idx_1: Int, elt_idx_2: Int)` Swaps elements at the specified indexes if they are different. Examples: ```mojo var my_list = [1, 2, 3] my_list.swap_elements(0, 2) print(my_list.__str__()) # 3, 2, 1 ``` Notes: This is useful because `swap(my_list[i], my_list[j])` cannot be supported by Mojo, because a mutable alias may be formed. **Args:** * ​elt\_idx\_1 (`Int`): The index of one element. * ​elt\_idx\_2 (`Int`): The index of the other element. ### `unsafe_ptr` `unsafe_ptr(ref self) -> UnsafePointer[T, mut=self_is_mut, origin=self_is_origin]` Retrieves a pointer to the underlying memory. **Returns:** The pointer to the underlying memory. --- ## list Defines the List type. These APIs are imported automatically, just like builtins. ## Structs * [​`List`](/mojo/stdlib/collections/list/List): The `List` type is a dynamically-allocated list. --- ## Optional `struct Optional[T: Copyable & Movable]` A type modeling a value which may or may not be present. Optional values can be thought of as a type-safe nullable pattern. Your value can take on a value or `None`, and you need to check and explicitly extract the value to get it out. Currently T is required to be `Copyable & Movable` so we can implement copy/move for `Optional` and allow it to be used in collections itself. Examples: ```mojo var a = Optional(1) var b = Optional[Int](None) if a: print(a.value()) # prints 1 if b: # Bool(b) is False, so no print print(b.value()) var c = a.or_else(2) var d = b.or_else(2) print(c) # prints 1 print(d) # prints 2 ``` ## Parameters * ​T (`Copyable & Movable`): The type of value stored in the `Optional`. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Construct an empty `Optional`. `@implicit` `__init__(out self, owned value: T)` Construct an `Optional` containing a value. **Args:** * ​value (`T`): The value to store in the `Optional`. `@implicit` `__init__(out self, value: NoneType)` Construct an empty `Optional`. **Args:** * ​value (`NoneType`): Must be exactly `None`. ### `__bool__` `__bool__(self) -> Bool` Return true if the Optional has a value. **Returns:** True if the `Optional` has a value and False otherwise. ### `__getitem__` `__getitem__(ref self) -> ref [$1._value] T` Retrieve a reference to the value inside the `Optional`. **Returns:** A reference to the value inside the `Optional`. **Raises:** On empty `Optional`. ### `__invert__` `__invert__(self) -> Bool` Return False if the `Optional` has a value. **Returns:** False if the `Optional` has a value and True otherwise. ### `__eq__` `__eq__(self, rhs: NoneType) -> Bool` Return `True` if a value is not present. **Args:** * ​rhs (`NoneType`): The `None` value to compare to. **Returns:** `True` if a value is not present, `False` otherwise. `__eq__[T: EqualityComparable & Copyable & Movable](self: Optional[T], rhs: Optional[T]) -> Bool` Return `True` if this is the same as another `Optional` value, meaning both are absent, or both are present and have the same underlying value. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the value stored in the `Optional`. Must implement the traits `Copyable`, `Movable` and `EqualityComparable`. **Args:** * ​rhs (`Optional[T]`): The value to compare to. **Returns:** True if the values are the same. ### `__ne__` `__ne__(self, rhs: NoneType) -> Bool` Return `True` if a value is present.
**Args:** * ​rhs (`NoneType`): The `None` value to compare to. **Returns:** `False` if a value is not present, `True` otherwise. `__ne__[T: EqualityComparable & Copyable & Movable, //](self: Optional[T], rhs: Optional[T]) -> Bool` Return `False` if this is the same as another `Optional` value, meaning both are absent, or both are present and have the same underlying value. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the value stored in the `Optional`. Must implement the traits `Copyable`, `Movable` and `EqualityComparable`. **Args:** * ​rhs (`Optional[T]`): The value to compare to. **Returns:** False if the values are the same. ### `__is__` `__is__(self, other: NoneType) -> Bool` Return `True` if the Optional has no value. Notes: It allows you to use the following syntax: `if my_optional is None:`. **Args:** * ​other (`NoneType`): The value to compare to (None). **Returns:** True if the Optional has no value and False otherwise. ### `__isnot__` `__isnot__(self, other: NoneType) -> Bool` Return `True` if the Optional has a value. Notes: It allows you to use the following syntax: `if my_optional is not None:`. **Args:** * ​other (`NoneType`): The value to compare to (None). **Returns:** True if the Optional has a value and False otherwise. ### `copy` `copy(self) -> Self` Copy construct an `Optional`. **Returns:** A copy of the value. ### `__str__` `__str__[U: Copyable & Movable & Representable, //](self: Optional[U]) -> String` Return the string representation of the value of the `Optional`. **Parameters:** * ​U (`Copyable & Movable & Representable`): The type of the value stored in the `Optional`. Must implement the traits `Representable`, `Copyable` and `Movable`. **Returns:** A string representation of the `Optional`. ### `__repr__` `__repr__[U: Representable & Copyable & Movable, //](self: Optional[U]) -> String` Returns the verbose string representation of the `Optional`. **Parameters:** * ​U (`Representable & Copyable & Movable`): The type of the value stored in the `Optional`. Must implement the traits `Representable`, `Copyable` and `Movable`. **Returns:** A verbose string representation of the `Optional`. ### `write_to` `write_to[W: Writer, U: Representable & Copyable & Movable, //](self: Optional[U], mut writer: W)` Write `Optional` string representation to a `Writer`. **Parameters:** * ​W (`Writer`): A type conforming to the Writer trait. * ​U (`Representable & Copyable & Movable`): The type of the value stored in the `Optional`. Must implement the traits `Representable`, `Copyable` and `Movable`. **Args:** * ​writer (`W`): The object to write to. ### `value` `value(ref self) -> ref [$1._value] T` Retrieve a reference to the value of the `Optional`. Notes: This will abort on empty `Optional`. **Returns:** A reference to the contained data of the `Optional` as a reference. ### `unsafe_value` `unsafe_value(ref self) -> ref [$1._value] T` Unsafely retrieve a reference to the value of the `Optional`. Notes: This will **not** abort on empty `Optional`. **Returns:** A reference to the contained data of the `Optional` as a reference. ### `take` `take(mut self) -> T` Move the value out of the `Optional`. Notes: This will abort on empty `Optional`. **Returns:** The contained data of the `Optional` as an owned T value. ### `unsafe_take` `unsafe_take(mut self) -> T` Unsafely move the value out of the `Optional`. Notes: This will **not** abort on empty `Optional`. **Returns:** The contained data of the `Optional` as an owned T value.
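To illustrate the difference between the borrowing and moving accessors above, a minimal sketch:

```mojo
var opt = Optional(String("mojo"))
print(opt.value())  # borrows the contained String; aborts if the Optional is empty
var s = opt.take()  # moves the String out of the Optional
print(s)            # mojo
```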
### `or_else` `or_else(self, default: T) -> T` Return the underlying value contained in the `Optional` or a default value if the `Optional`'s underlying value is not present. **Args:** * ​default (`T`): The new value to use if no value was present. **Returns:** The underlying value contained in the `Optional` or a default value. ### `copied` `copied[mut: Bool, origin: Origin[mut], //, T: Copyable & Movable](self: Optional[Pointer[T, origin]]) -> Optional[T]` Converts an `Optional` containing a Pointer to an `Optional` of an owned value by copying. Examples: Copy the value of an `Optional[Pointer[_]]` ```mojo var data = String("foo") var opt = Optional(Pointer(to=data)) var opt_owned: Optional[String] = opt.copied() ``` Notes: If `self` is an empty `Optional`, the returned `Optional` will be empty as well. **Parameters:** * ​mut (`Bool`): Mutability of the pointee origin. * ​origin (`Origin[mut]`): Origin of the contained `Pointer`. * ​T (`Copyable & Movable`): Type of the owned result value. **Returns:** An `Optional` containing an owned copy of the pointee value. --- ## OptionalReg `@register_passable(trivial)` `struct OptionalReg[T: AnyTrivialRegType]` A register-passable optional type. This struct optionally contains a value. It only works with trivial register passable types at the moment. ## Parameters * ​T (`AnyTrivialRegType`): The type of value stored in the Optional. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Create an optional with a value of None. `@implicit` `__init__(value: T) -> Self` Create an optional with a value. **Args:** * ​value (`T`): The value. `@implicit` `__init__(value: NoneType) -> Self` Create an optional without a value from a None literal. **Args:** * ​value (`NoneType`): The None value. ### `__bool__` `__bool__(self) -> Bool` Return true if the optional has a value. **Returns:** True if the optional has a value and False otherwise. ### `__is__` `__is__(self, other: NoneType) -> Bool` Return `True` if the Optional has no value. It allows you to use the following syntax: `if my_optional is None:` **Args:** * ​other (`NoneType`): The value to compare to (None). **Returns:** True if the Optional has no value and False otherwise. ### `__isnot__` `__isnot__(self, other: NoneType) -> Bool` Return `True` if the Optional has a value. It allows you to use the following syntax: `if my_optional is not None:` **Args:** * ​other (`NoneType`): The value to compare to (None). **Returns:** True if the Optional has a value and False otherwise. ### `value` `value(self) -> T` Get the optional value. **Returns:** The contained value. ### `or_else` `or_else(owned self, owned default: T) -> T` Return the underlying value contained in the Optional or a default value if the Optional's underlying value is not present. **Args:** * ​default (`T`): The new value to use if no value was present. **Returns:** The underlying value contained in the Optional or a default value. --- ## optional Defines Optional, a type modeling a value which may or may not be present. Optional values can be thought of as a type-safe nullable pattern. Your value can take on a value or `None`, and you need to check and explicitly extract the value to get it out. 
Examples: ```mojo var a = Optional(1) var b = Optional[Int](None) if a: print(a.value()) # prints 1 if b: # Bool(b) is False, so no print print(b.value()) var c = a.or_else(2) var d = b.or_else(2) print(c) # prints 1 print(d) # prints 2 ``` ## Structs * [​`Optional`](/mojo/stdlib/collections/optional/Optional): A type modeling a value which may or may not be present. * [​`OptionalReg`](/mojo/stdlib/collections/optional/OptionalReg): A register-passable optional type. --- ## Set `struct Set[T: Copyable & Movable & Hashable & EqualityComparable]` A set data type. O(1) average-case amortized add, remove, and membership check. ```mojo from collections import Set var set = { 1, 2, 3 } print(len(set)) # 3 set.add(4) for element in set: print(element) set -= Set[Int](3, 4, 5) print(set == Set[Int](1, 2)) # True print(set | Set[Int](0, 1) == Set[Int](0, 1, 2)) # True var element = set.pop() print(len(set)) # 1 ``` ## Parameters * ​T (`Copyable & Movable & Hashable & EqualityComparable`): The element type of the set. Must implement KeyElement. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `EqualityComparable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `Hashable`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, *ts: T, *, __set_literal__: Tuple[] = Tuple())` Construct a set from initial elements. **Args:** * ​\*ts (`T`): Variadic of elements to add to the set. * ​\_\_set\_literal\_\_ (`Tuple[]`): Tell Mojo to use this method for set literals. `@implicit` `__init__(out self, elements: List[T, hint_trivial_type])` Construct a set from a List of elements. **Args:** * ​elements (`List[T, hint_trivial_type]`): A vector of elements to add to the set. ### `__copyinit__` `__copyinit__(out self, other: Self)` Copy constructor. **Args:** * ​other (`Self`): The existing Set instance to copy from. ### `__bool__` `__bool__(self) -> Bool` Whether the set is non-empty or not. **Returns:** True if the set is non-empty, False if it is empty. ### `__lt__` `__lt__(self, other: Self) -> Bool` Overloads the < operator for strict subset comparison of sets. **Args:** * ​other (`Self`): The set to compare against for the strict subset relationship. **Returns:** True if the set is a strict subset of the `other` set, False otherwise. ### `__le__` `__le__(self, other: Self) -> Bool` Overloads the <= operator for sets. Works like the `issubset` method. **Args:** * ​other (`Self`): Another Set instance to check against. **Returns:** True if this set is a subset of the `other` set, False otherwise. ### `__eq__` `__eq__(self, other: Self) -> Bool` Set equality. **Args:** * ​other (`Self`): Another Set instance to check equality against. **Returns:** True if the sets contain the same elements and False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Set inequality. **Args:** * ​other (`Self`): Another Set instance to check equality against. **Returns:** True if the sets are different and False otherwise. ### `__gt__` `__gt__(self, other: Self) -> Bool` Overloads the > operator for strict superset comparison of sets. **Args:** * ​other (`Self`): The set to compare against for the strict superset relationship. **Returns:** True if the set is a strict superset of the `other` set, False otherwise. ### `__ge__` `__ge__(self, other: Self) -> Bool` Overloads the >= operator for sets. Works like the `issuperset` method. **Args:** * ​other (`Self`): Another Set instance to check against. **Returns:** True if this set is a superset of the `other` set, False otherwise.
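The comparison operators mirror the named subset/superset methods documented further below; for example:

```mojo
from collections import Set

var small = Set[Int](1, 2)
var big = Set[Int](1, 2, 3)
print(small < big)     # True: strict subset
print(small <= small)  # True, although small < small is False
print(big >= small)    # True: superset
```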
### `__contains__` `__contains__(self, t: T) -> Bool` Whether or not the set contains an element. **Args:** * ​t (`T`): The element to check membership in the set. **Returns:** Whether or not the set contains the element. ### `__sub__` `__sub__(self, other: Self) -> Self` Set subtraction. **Args:** * ​other (`Self`): Another Set instance to subtract from this one. **Returns:** A new set containing elements of this set, but not containing any elements which were in the `other` set. ### `__and__` `__and__(self, other: Self) -> Self` The set intersection operator. **Args:** * ​other (`Self`): Another Set instance to intersect with this one. **Returns:** A new set containing only the elements which appear in both this set and the `other` set. ### `__or__` `__or__(self, other: Self) -> Self` The set union operator. **Args:** * ​other (`Self`): Another Set instance to union with this one. **Returns:** A new set containing any elements which appear in either this set or the `other` set. ### `__xor__` `__xor__(self, other: Self) -> Self` Overloads the ^ operator for sets. Works like the `symmetric_difference` method. **Args:** * ​other (`Self`): The set to find the symmetric difference with. **Returns:** A new set containing the symmetric difference of the two sets. ### `__isub__` `__isub__(mut self, other: Self)` In-place set subtraction. Updates the set to remove any elements from the `other` set. **Args:** * ​other (`Self`): Another Set instance to subtract from this one. ### `__iand__` `__iand__(mut self, other: Self)` In-place set intersection. Updates the set to contain only the elements which are already in the set and are also contained in the `other` set. **Args:** * ​other (`Self`): Another Set instance to intersect with this one. ### `__ixor__` `__ixor__(mut self, other: Self)` Overloads the ^= operator. Works like the `symmetric_difference_update` method. Updates the set with the symmetric difference of itself and another set. **Args:** * ​other (`Self`): The set to find the symmetric difference with. ### `__ior__` `__ior__(mut self, other: Self)` In-place set union. Updates the set to contain all elements in the `other` set as well as keeping all elements it already contained. **Args:** * ​other (`Self`): Another Set instance to union with this one. ### `__len__` `__len__(self) -> Int` The size of the set. **Returns:** The number of elements in the set. ### `__hash__` `__hash__(self) -> UInt` A hash value of the elements in the set. The hash value is order independent, so s1 == s2 -> hash(s1) == hash(s2). **Returns:** A hash value of the set suitable for non-cryptographic purposes. ### `__str__` `__str__[U: Copyable & Movable & Hashable & EqualityComparable & Representable, //](self: Set[U]) -> String` Returns the string representation of the set. **Parameters:** * ​U (`Copyable & Movable & Hashable & EqualityComparable & Representable`): The type of the Set elements. Must implement the `Representable` and `KeyElement` traits. **Returns:** The string representation of the set. ### `__repr__` `__repr__[U: Copyable & Movable & Hashable & EqualityComparable & Representable, //](self: Set[U]) -> String` Returns the string representation of the set. **Parameters:** * ​U (`Copyable & Movable & Hashable & EqualityComparable & Representable`): The type of the Set elements. Must implement the `Representable` and `KeyElement` traits. **Returns:** The string representation of the set.
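To illustrate the binary operators documented above (note that, as in the struct-level example, comparison binds less tightly than the set operators):

```mojo
from collections import Set

var a = Set[Int](1, 2, 3)
var b = Set[Int](3, 4)
print(a | b == Set[Int](1, 2, 3, 4))  # True: union
print(a & b == Set[Int](3))           # True: intersection
print(a - b == Set[Int](1, 2))        # True: difference
print(a ^ b == Set[Int](1, 2, 4))     # True: symmetric difference
```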
### `write_to` `write_to[W: Writer, U: Copyable & Movable & Hashable & EqualityComparable & Representable, //](self: Set[U], mut writer: W)` Write Set string representation to a `Writer`. **Parameters:** * ​W (`Writer`): A type conforming to the Writer trait. * ​U (`Copyable & Movable & Hashable & EqualityComparable & Representable`): The type of the Set elements. Must implement the `Representable` and `KeyElement` traits. **Args:** * ​writer (`W`): The object to write to. ### `__iter__` `__iter__(ref self) -> _DictKeyIter[T, NoneType, self_is_origin._data]` Iterate over elements of the set, returning immutable references. **Returns:** An iterator of immutable references to the set elements. ### `add` `add(mut self, t: T)` Add an element to the set. **Args:** * ​t (`T`): The element to add to the set. ### `remove` `remove(mut self, t: T)` Remove an element from the set. **Args:** * ​t (`T`): The element to remove from the set. **Raises:** If the element isn't in the set to remove. ### `pop` `pop(mut self) -> T` Remove any one item from the set, and return it. As an implementation detail this will remove the first item according to insertion order. This is practically useful for breadth-first search implementations. **Returns:** The element which was removed from the set. **Raises:** If the set is empty. ### `union` `union(self, other: Self) -> Self` Set union. **Args:** * ​other (`Self`): Another Set instance to union with this one. **Returns:** A new set containing any elements which appear in either this set or the `other` set. ### `intersection` `intersection(self, other: Self) -> Self` Set intersection. **Args:** * ​other (`Self`): Another Set instance to intersect with this one. **Returns:** A new set containing only the elements which appear in both this set and the `other` set. ### `difference` `difference(self, other: Self) -> Self` Set difference. **Args:** * ​other (`Self`): Another Set instance to find the difference with this one. **Returns:** A new set containing elements that are in this set but not in the `other` set. ### `update` `update(mut self, other: Self)` In-place set update. Updates the set to contain all elements in the `other` set as well as keeping all elements it already contained. **Args:** * ​other (`Self`): Another Set instance to union with this one. ### `intersection_update` `intersection_update(mut self, other: Self)` In-place set intersection update. Updates the set by retaining only elements found in both this set and the `other` set, removing all other elements. The result is the intersection of this set with `other`. **Args:** * ​other (`Self`): Another Set instance to intersect with this one. ### `difference_update` `difference_update(mut self, other: Self)` In-place set subtraction. Updates the set by removing all elements found in the `other` set, effectively keeping only elements that are unique to this set. **Args:** * ​other (`Self`): Another Set instance to subtract from this one. ### `issubset` `issubset(self, other: Self) -> Bool` Check if this set is a subset of another set. **Args:** * ​other (`Self`): Another Set instance to check against. **Returns:** True if this set is a subset of the `other` set, False otherwise. ### `isdisjoint` `isdisjoint(self, other: Self) -> Bool` Check if this set is disjoint with another set. **Args:** * ​other (`Self`): Another Set instance to check against. **Returns:** True if this set is disjoint with the `other` set, False otherwise.
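A brief sketch of the named methods above:

```mojo
from collections import Set

var evens = Set[Int](2, 4)
var odds = Set[Int](1, 3)
print(evens.isdisjoint(odds))          # True: no common elements
evens.update(Set[Int](6))              # evens is now {2, 4, 6}
print(Set[Int](2, 4).issubset(evens))  # True
```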
### `issuperset` `issuperset(self, other: Self) -> Bool` Check if this set is a superset of another set. **Args:** * ​other (`Self`): Another Set instance to check against. **Returns:** True if this set is a superset of the `other` set, False otherwise. ### `symmetric_difference` `symmetric_difference(self, other: Self) -> Self` Returns the symmetric difference of two sets. **Args:** * ​other (`Self`): The set to find the symmetric difference with. **Returns:** A new set containing the symmetric difference of the two sets. ### `symmetric_difference_update` `symmetric_difference_update(mut self, other: Self)` Updates the set with the symmetric difference of itself and another set. **Args:** * ​other (`Self`): The set to find the symmetric difference with. ### `discard` `discard(mut self, value: T)` Remove a value from the set if it exists. Pass otherwise. **Args:** * ​value (`T`): The element to remove from the set. ### `clear` `clear(mut self)` Removes all elements from the set. This method modifies the set in-place, removing all of its elements. After calling this method, the set will be empty. --- ## set Implements the Set datatype. ## Structs * [​`Set`](/mojo/stdlib/collections/set/Set): A set data type. --- ## Codepoint `struct Codepoint` A Unicode codepoint, typically a single user-recognizable character; restricted to valid Unicode scalar values. This type is restricted to store a single Unicode [*scalar value*][1], typically encoding a single user-recognizable character. All valid Unicode scalar values are in the range(s) 0 to 0xD7FF and 0xE000 to 0x10FFFF, inclusive. This type guarantees that the stored integer value falls in these ranges. [1]: https://www.unicode.org/glossary/#unicode_scalar_value **Codepoints versus Scalar Values** Formally, Unicode defines a codespace of values in the range 0 to 0x10FFFF inclusive, and a [Unicode codepoint](https://www.unicode.org/glossary/#code_point) is any integer falling within that range. However, due to historical reasons, it became necessary to "carve out" a subset of the codespace, excluding codepoints in the range 0xD800–0xDFFF. The subset of codepoints excluding that range is known as [Unicode scalar values][1]. The codepoints in the range 0xD800–0xDFFF are known as "surrogate" codepoints. The surrogate codepoints will never be assigned a semantic meaning, and can only validly appear in UTF-16 encoded text. The difference between codepoints and scalar values is a technical distinction related to the backwards-compatible workaround chosen to enable UTF-16 to encode the full range of the Unicode codespace. For simplicity's sake, and to avoid a confusing clash with the Mojo `Scalar` type, this type is pragmatically named `Codepoint`, even though it is restricted to valid scalar values. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Intable`, `Movable`, `Stringable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, *, unsafe_unchecked_codepoint: SIMD[uint32, 1])` Construct a `Codepoint` from a code point value without checking that it falls in the valid range. Safety: The provided codepoint value MUST be a valid Unicode scalar value. Providing a value outside of the valid range could lead to undefined behavior in algorithms that depend on the validity guarantees of this type. **Args:** * ​unsafe\_unchecked\_codepoint (`SIMD[uint32, 1]`): A valid Unicode scalar value code point. `__init__(out self, codepoint: SIMD[uint8, 1])` Construct a `Codepoint` from a single byte value.
This constructor cannot fail because non-negative 8-bit integers are valid Unicode scalar values. **Args:** * ​codepoint (`SIMD[uint8, 1]`): The 8-bit codepoint value to convert to a `Codepoint`. ### `__eq__` `__eq__(self, other: Self) -> Bool` Return True if this character has the same codepoint value as `other`. **Args:** * ​other (`Self`): The codepoint value to compare against. **Returns:** True if this character and `other` have the same codepoint value; False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Return True if this character has a different codepoint value from `other`. **Args:** * ​other (`Self`): The codepoint value to compare against. **Returns:** True if this character and `other` have different codepoint values; False otherwise. ### `from_u32` `static from_u32(codepoint: SIMD[uint32, 1]) -> Optional[Codepoint]` Construct a `Codepoint` from a code point value. Returns None if the provided `codepoint` is not in the valid range. **Args:** * ​codepoint (`SIMD[uint32, 1]`): An integer representing a Unicode scalar value. **Returns:** A `Codepoint` if `codepoint` falls in the valid range for Unicode scalar values, otherwise None. ### `ord` `static ord(string: StringSlice[origin]) -> Self` Returns the `Codepoint` that represents the given single-character string. Given a string containing one character, return a `Codepoint` representing the codepoint of that character. For example, `Codepoint.ord("a")` returns the codepoint `97`. This is the inverse of the `chr()` function. This function is similar to the `ord()` free function, except that it returns a `Codepoint` instead of an `Int`. **Args:** * ​string (`StringSlice[origin]`): The input string, which must contain only a single character. **Returns:** A `Codepoint` representing the codepoint of the given character. ### `unsafe_decode_utf8_codepoint` `static unsafe_decode_utf8_codepoint(s: Span[SIMD[uint8, 1], origin]) -> Tuple[Codepoint, Int]` Decodes a single `Codepoint` and number of bytes read from a given UTF-8 string pointer. Safety: `s` MUST point to the first byte in a **known-valid** UTF-8 character sequence. This function MUST NOT be used on unvalidated input. **Args:** * ​s (`Span[SIMD[uint8, 1], origin]`): Span to UTF-8 encoded data containing at least one valid encoded codepoint. **Returns:** The decoded codepoint `Codepoint`, as well as the number of bytes read. ### `__int__` `__int__(self) -> Int` Returns the numeric value of this scalar value as an integer. **Returns:** The numeric value of this scalar value as an integer. ### `__str__` `__str__(self) -> String` Formats this `Codepoint` as a single-character string. **Returns:** A string containing this single character. ### `is_ascii` `is_ascii(self) -> Bool` Returns True if this `Codepoint` is an ASCII character. All ASCII characters are less than or equal to codepoint value 127, and take exactly 1 byte to encode in UTF-8. **Returns:** A boolean indicating if this `Codepoint` is an ASCII character. ### `is_ascii_digit` `is_ascii_digit(self) -> Bool` Determines whether the given character is a digit \[0-9]. **Returns:** True if the character is a digit. ### `is_ascii_upper` `is_ascii_upper(self) -> Bool` Determines whether the given character is an uppercase character. This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "ABCDEFGHIJKLMNOPQRSTUVWXYZ". **Returns:** True if the character is uppercase.
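A small sketch of construction and the ASCII predicates, using the import path shown in the `codepoint` module docs below:

```mojo
from collections.string import Codepoint

var c = Codepoint.ord("A")
print(c.is_ascii())        # True
print(c.is_ascii_upper())  # True
print(Int(c))              # 65

# from_u32 validates its input: surrogates are not scalar values.
print(Codepoint.from_u32(0xD800) is None)  # True
```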
### `is_ascii_lower` `is_ascii_lower(self) -> Bool` Determines whether the given character is a lowercase character. This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "abcdefghijklmnopqrstuvwxyz". **Returns:** True if the character is lowercase. ### `is_ascii_printable` `is_ascii_printable(self) -> Bool` Determines whether the given character is a printable character. **Returns:** True if the character is a printable character, otherwise False. ### `is_python_space` `is_python_space(self) -> Bool` Determines whether this character is a Python whitespace string. This corresponds to Python's [universal separators](https://docs.python.org/3/library/stdtypes.html#str.splitlines): `" \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`. # Examples Check if a string contains only whitespace: ```mojo from testing import assert_true, assert_false # ASCII space characters assert_true(Codepoint.ord(" ").is_python_space()) assert_true(Codepoint.ord("\t").is_python_space()) # Unicode paragraph separator: assert_true(Codepoint.from_u32(0x2029).value().is_python_space()) # Letters are not space characters assert_false(Codepoint.ord("a").is_python_space()) ``` . **Returns:** True if this character is one of the whitespace characters listed above, otherwise False. ### `is_posix_space` `is_posix_space(self) -> Bool` Returns True if this `Codepoint` is a **space** character according to the [POSIX locale][1]. The POSIX locale is also known as the C locale. [1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_03_01 This only respects the default "C" locale, i.e. returns True only if the character specified is one of " \t\n\v\f\r". For semantics similar to Python, use `String.isspace()`. **Returns:** True iff the character is one of the whitespace characters listed above. ### `to_u32` `to_u32(self) -> SIMD[uint32, 1]` Returns the numeric value of this scalar value as an unsigned 32-bit integer. **Returns:** The numeric value of this scalar value as an unsigned 32-bit integer. ### `unsafe_write_utf8` `unsafe_write_utf8[optimize_ascii: Bool = True, branchless: Bool = False](self, ptr: UnsafePointer[SIMD[uint8, 1], address_space=address_space, alignment=alignment, origin=origin]) -> UInt` Encode this Unicode codepoint into its UTF-8 representation. Safety: `ptr` MUST point to at least `self.utf8_byte_length()` allocated bytes or else an out-of-bounds write will occur, which is undefined behavior. ### Unicode (represented as UInt32 BE) to UTF-8 conversion: * 1: 00000000 00000000 00000000 0aaaaaaa -> 0aaaaaaa * a * 2: 00000000 00000000 00000aaa aabbbbbb -> 110aaaaa 10bbbbbb * (a >> 6) | 0b11000000, b | 0b10000000 * 3: 00000000 00000000 aaaabbbb bbcccccc -> 1110aaaa 10bbbbbb 10cccccc * (a >> 12) | 0b11100000, (b >> 6) | 0b10000000, c | 0b10000000 * 4: 00000000 000aaabb bbbbcccc ccdddddd -> 11110aaa 10bbbbbb 10cccccc 10dddddd * (a >> 18) | 0b11110000, (b >> 12) | 0b10000000, (c >> 6) | 0b10000000, d | 0b10000000 . **Parameters:** * ​optimize\_ascii (`Bool`): Optimize for languages with mostly ASCII characters. * ​branchless (`Bool`): Use a branchless algorithm. **Args:** * ​ptr (`UnsafePointer[SIMD[uint8, 1], address_space=address_space, alignment=alignment, origin=origin]`): Pointer value to write the encoded UTF-8 bytes. Must validly point to a sufficient number of bytes (1-4) to hold the encoded data. **Returns:** Returns the number of bytes written.
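For illustration, a minimal sketch of encoding one codepoint into a caller-managed buffer. The buffer size comes from `utf8_byte_length()` (documented next); the `UnsafePointer` allocation is an assumption made only to keep the example self-contained:

```mojo
from collections.string import Codepoint
from memory import UnsafePointer

var c = Codepoint.ord("é")                    # U+00E9: 2 bytes in UTF-8
var n = c.utf8_byte_length()
var buf = UnsafePointer[UInt8].alloc(Int(n))  # at least utf8_byte_length() bytes
var written = c.unsafe_write_utf8(buf)
print(written)  # 2
buf.free()
```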
### `utf8_byte_length` `utf8_byte_length(self) -> UInt` Returns the number of UTF-8 bytes required to encode this character. Notes: The returned value is always between 1 and 4 bytes. **Returns:** Byte count of UTF-8 bytes required to encode this character. --- ## codepoint Unicode codepoint handling. This module provides the `Codepoint` type for representing single Unicode scalar values. A codepoint represents a single Unicode character, restricted to valid Unicode scalar values in the ranges 0 to 0xD7FF and 0xE000 to 0x10FFFF inclusive. The `Codepoint` type provides functionality for: * Converting between codepoints and UTF-8 encoded bytes. * Testing character properties like ASCII, digits, whitespace etc. * Converting between codepoints and strings. * Safe construction from integers with validation. Example: ```mojo from collections.string import Codepoint from testing import assert_true # Create a codepoint from a character var c = Codepoint.ord('A') # Check properties assert_true(c.is_ascii()) assert_true(c.is_ascii_upper()) # Convert to string var s = String(c) # "A" ``` ## Structs * [​`Codepoint`](/mojo/stdlib/collections/string/codepoint/Codepoint): A Unicode codepoint, typically a single user-recognizable character; restricted to valid Unicode scalar values. --- ## format String formatting utilities for Mojo. This module provides string formatting functionality similar to Python's `str.format()` method. The `format()` method (available on the [`String`](/mojo/stdlib/collections/string/string/String#format) and [`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice#format) types) takes the current string as a template (or "format string"), which can contain literal text and/or replacement fields delimited by curly braces (`{}`). The replacement fields are replaced with the values of the arguments. Replacement fields can be mapped to the arguments in one of two ways: * Automatic indexing by argument position: ```mojo var s = String("{} is {}").format("Mojo", "🔥") ``` * Manual indexing by argument position: ```mojo var s = String("{1} is {0}").format("hot", "🔥") ``` The replacement fields can also contain the `!r` or `!s` conversion flags, to indicate whether the argument should be formatted using `repr()` or `String()`, respectively: ```mojo var s = String("{!r}").format(myComplicatedObject) ``` Note that the following features from Python's `str.format()` are **not yet supported**: * Named arguments (for example `"{name} is {adjective}"`). * Accessing the attributes of an argument value (for example, `"{0.name}"`). * Accessing an indexed value from the argument (for example, `"{1[0]}"`). * Format specifiers for controlling output format (width, precision, and so on). Examples: ```mojo # Basic formatting var s1 = String("Hello {0}!").format("World") # Hello World! # Multiple arguments var s2 = String("{0} plus {1} equals {2}").format(1, 2, 3) # 1 plus 2 equals 3 # Conversion flags var s4 = String("{!r}").format("test") # "'test'" ``` This module has no public API; its functionality is available through the [`String.format()`](/mojo/stdlib/collections/string/string/String#format) and [`StringSlice.format()`](/mojo/stdlib/collections/string/string_slice/StringSlice#format) methods. --- ## string The string package provides comprehensive Unicode string handling functionality for Mojo. This package implements Unicode-aware string types and operations, with UTF-8 support.
It includes efficient implementations for string manipulation, formatting, and Unicode operations while maintaining memory safety and performance. Key Components: * `String`: The main string type supporting UTF-8 encoded text * `StringSlice`: Memory-efficient string view type for zero-copy operations * `Codepoint`: Unicode code point handling and operations * Format: String formatting and interpolation utilities Core Features: * Unicode support with UTF-8 encoding * Efficient string slicing and views * String formatting and interpolation * Memory-safe string operations * Unicode case conversion * Unicode property lookups and validation Example: ```mojo # Basic string creation and manipulation var s = String("Hello, 世界") var slice = s[0:5] # "Hello" # Unicode-aware operations for c in s.codepoints(): print(c.to_uppercase()) # String formatting var name = "Mojo" var formatted = String("Hello, {0}!").format(name) ``` Note: String stores data using UTF-8, and all operations (unless clearly noted) are intended to be fully Unicode compliant and maintain correct UTF-8 encoded data. A handful of operations are known to not be Unicode / UTF-8 compliant yet, but will be fixed as time permits. ## Modules * [​`codepoint`](/mojo/stdlib/collections/string/codepoint/): Unicode codepoint handling. * [​`format`](/mojo/stdlib/collections/string/format/): String formatting utilities for Mojo. * [​`string`](/mojo/stdlib/collections/string/string/): The core `String` type implementation for Mojo. * [​`string_slice`](/mojo/stdlib/collections/string/string_slice/): The `StringSlice` type implementation for efficient string operations. --- ## String `struct String` Represents a mutable string. See the [`string` module](/mojo/stdlib/collections/string/string/) for more information and examples. ## Implemented traits `AnyType`, `Boolable`, `ConvertibleFromPython`, `Copyable`, `Defaultable`, `EqualityComparable`, `ExplicitlyCopyable`, `FloatableRaising`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `Hashable`, `IntableRaising`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `PathLike`, `PythonConvertible`, `Representable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable`, `Writer`, `_HashableWithHasher` ## Aliases ### `ASCII_LETTERS` `alias ASCII_LETTERS = "abcdefghijklmnopqrstuvwxyz".__add__[__mlir_type.!kgen.string]("ABCDEFGHIJKLMNOPQRSTUVWXYZ")` ### `ASCII_LOWERCASE` `alias ASCII_LOWERCASE = "abcdefghijklmnopqrstuvwxyz"` ### `ASCII_UPPERCASE` `alias ASCII_UPPERCASE = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"` ### `DIGITS` `alias DIGITS = "0123456789"` ### `HEX_DIGITS` `alias HEX_DIGITS = "0123456789".__add__[__mlir_type.!kgen.string]("abcdef").__add__[__mlir_type.!kgen.string]("ABCDEF")` ### `OCT_DIGITS` `alias OCT_DIGITS = "01234567"` ### `PRINTABLE` ``alias PRINTABLE = "0123456789".__add__[__mlir_type.!kgen.string]("abcdefghijklmnopqrstuvwxyz".__add__[__mlir_type.!kgen.string]("ABCDEFGHIJKLMNOPQRSTUVWXYZ")).__add__[__mlir_type.!kgen.string]("!\22#$%&'()*+,-./:;?@[\\]^_`{|}\~").__add__[__mlir_type.!kgen.string](" \t\n\r\v\f")`` ### `PUNCTUATION` ``alias PUNCTUATION = "!\22#$%&'()*+,-./:;?@[\\]^_`{|}\~"`` ## Methods ### `__init__` `__init__(out self)` Construct an empty string. `__init__(out self, *, capacity: Int)` Construct an empty string with a given capacity. **Args:** * ​capacity (`Int`): The capacity of the string to allocate. `@implicit` `__init__(out self, data: StringSlice[StaticConstantOrigin])` Construct a string from a static constant string without allocating.
**Args:** * ​data (`StringSlice[StaticConstantOrigin]`): The static constant string to refer to. `@implicit` `__init__(out self, data: StringLiteral[value])` Construct a string from a string literal without allocating. **Args:** * ​data (`StringLiteral[value]`): The static constant string to refer to. `__init__(out self, *, bytes: Span[SIMD[uint8, 1], origin])` Construct a string by copying the data. This constructor is explicit because it can involve memory allocation. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The bytes to copy. `__init__[T: Stringable](out self, value: T)` Initialize from a type conforming to `Stringable`. **Parameters:** * ​T (`Stringable`): The type conforming to Stringable. **Args:** * ​value (`T`): The object to get the string representation of. `__init__[T: StringableRaising](out self, value: T)` Initialize from a type conforming to `StringableRaising`. **Parameters:** * ​T (`StringableRaising`): The type conforming to StringableRaising. **Args:** * ​value (`T`): The object to get the string representation of. **Raises:** If there is an error when computing the string representation of the type. `__init__[*Ts: Writable](out self, *args: *Ts, *, sep: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), end: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""))` Construct a string by concatenating a sequence of Writable arguments. Examples: Construct a String from several `Writable` arguments: ```mojo var string = String(1, 2.0, "three", sep=", ") print(string) # "1, 2.0, three" ``` **Parameters:** * ​\*Ts (`Writable`): The types of the arguments to format. Each type must satisfy `Writable`. **Args:** * ​\*args (`*Ts`): A sequence of Writable arguments. * ​sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements. * ​end (`StringSlice[StaticConstantOrigin]`): The String to write after printing the elements. `__init__[*Ts: Writable](out self, args: VariadicPack[is_owned, origin, Writable, Ts], sep: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), end: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""))` Construct a string by passing a variadic pack. Examples: ```mojo fn variadic_pack_to_string[ *Ts: Writable, ](*args: *Ts) -> String: return String(args) string = variadic_pack_to_string(1, ", ", 2.0, ", ", "three") ``` . **Parameters:** * ​\*Ts (`Writable`): The types of the arguments to format. Each type must satisfy `Writable`. **Args:** * ​args (`VariadicPack[is_owned, origin, Writable, Ts]`): A VariadicPack of Writable arguments. * ​sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements. * ​end (`StringSlice[StaticConstantOrigin]`): The String to write after printing the elements. `__init__(out self, *, unsafe_uninit_length: UInt)` Construct a String with the specified length, with uninitialized memory. This is unsafe, as it relies on the caller initializing the elements with unsafe operations, not assigning over the uninitialized data. **Args:** * ​unsafe\_uninit\_length (`UInt`): The number of bytes to allocate. `__init__(out self, *, unsafe_from_utf8_ptr: UnsafePointer[SIMD[int8, 1], mut=mut, origin=origin])` Creates a string from a UTF-8 encoded nul-terminated pointer. Safety: * `unsafe_from_utf8_ptr` MUST be valid UTF-8 encoded data. * `unsafe_from_utf8_ptr` MUST be null terminated.
**Args:** * ​unsafe\_from\_utf8\_ptr (`UnsafePointer[SIMD[int8, 1], mut=mut, origin=origin]`): An `UnsafePointer[Byte]` of null-terminated bytes encoded in UTF-8. `__init__(out self, *, unsafe_from_utf8_ptr: UnsafePointer[SIMD[uint8, 1], mut=mut, origin=origin])` Creates a string from a UTF-8 encoded nul-terminated pointer. Safety: * `unsafe_from_utf8_ptr` MUST be valid UTF-8 encoded data. * `unsafe_from_utf8_ptr` MUST be null terminated. **Args:** * ​unsafe\_from\_utf8\_ptr (`UnsafePointer[SIMD[uint8, 1], mut=mut, origin=origin]`): An `UnsafePointer[Byte]` of null-terminated bytes encoded in UTF-8. `__init__(out self, obj: PythonObject)` Construct a `String` from a PythonObject. **Args:** * ​obj (`PythonObject`): The PythonObject to convert from. **Raises:** An error if the conversion failed. ### `__copyinit__` `__copyinit__(out self, other: Self)` Copy initialize the string from another string. **Args:** * ​other (`Self`): The string to copy. ### `__moveinit__` `__moveinit__(out self, owned other: Self)` Move initialize the string from another string. **Args:** * ​other (`Self`): The string to move. ### `__del__` `__del__(owned self)` Destroy the string data. ### `__bool__` `__bool__(self) -> Bool` Checks if the string is not empty. **Returns:** True if the string length is greater than zero, and False otherwise. ### `__getitem__` `__getitem__[I: Indexer](self, idx: I) -> Self` Gets the character at the specified position. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The index value. **Returns:** A new string containing the character at the specified position. `__getitem__(self, span: Slice) -> Self` Gets the sequence of characters at the specified positions. **Args:** * ​span (`Slice`): A slice that specifies positions of the new substring. **Returns:** A new string containing the string at the specified positions. ### `__lt__` `__lt__(self, rhs: Self) -> Bool` Compare this String to the RHS using LT comparison. **Args:** * ​rhs (`Self`): The other String to compare against. **Returns:** True if this String is strictly less than the RHS String and False otherwise. ### `__le__` `__le__(self, rhs: Self) -> Bool` Compare this String to the RHS using LE comparison. **Args:** * ​rhs (`Self`): The other String to compare against. **Returns:** True iff this String is less than or equal to the RHS String. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compares two Strings if they have the same values. **Args:** * ​other (`Self`): The rhs of the operation. **Returns:** True if the Strings are equal and False otherwise. `__eq__(self, other: StringSlice[origin]) -> Bool` Compares two Strings if they have the same values. **Args:** * ​other (`StringSlice[origin]`): The rhs of the operation. **Returns:** True if the Strings are equal and False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compares two Strings if they do not have the same values. **Args:** * ​other (`Self`): The rhs of the operation. **Returns:** True if the Strings are not equal and False otherwise. `__ne__(self, other: StringSlice[origin]) -> Bool` Compares two Strings if they do not have the same values. **Args:** * ​other (`StringSlice[origin]`): The rhs of the operation. **Returns:** True if the Strings are not equal and False otherwise. ### `__gt__` `__gt__(self, rhs: Self) -> Bool` Compare this String to the RHS using GT comparison. **Args:** * ​rhs (`Self`): The other String to compare against. **Returns:** True iff this String is strictly greater than the RHS String.
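A short illustrative sketch (not part of the original reference) exercising the comparison operators above, plus `__ge__` below; comparisons are lexicographic over the underlying UTF-8 bytes:

```mojo
from testing import assert_true, assert_false

var a = String("apple")
var b = String("banana")
assert_true(a < b)    # lexicographic byte comparison
assert_true(a <= b)
assert_true(b > a)
assert_true(b >= a)
assert_true(a != b)
assert_false(a == b)
```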
### `__ge__` `__ge__(self, rhs: Self) -> Bool` Compare this String to the RHS using GE comparison. **Args:** * ​rhs (`Self`): The other String to compare against. **Returns:** True iff this String is greater than or equal to the RHS String. ### `__contains__` `__contains__(self, substr: StringSlice[origin]) -> Bool` Returns True if the substring is contained within the current string. **Args:** * ​substr (`StringSlice[origin]`): The substring to check. **Returns:** True if the string contains the substring. ### `__add__` `__add__(self, other: StringSlice[origin]) -> Self` Creates a string by appending a string slice at the end. **Args:** * ​other (`StringSlice[origin]`): The string slice to append. **Returns:** The new constructed string. ### `__mul__` `__mul__(self, n: Int) -> Self` Concatenates the string `n` times. **Args:** * ​n (`Int`): The number of times to concatenate the string. **Returns:** The string concatenated `n` times. ### `__radd__` `__radd__(self, other: StringSlice[origin]) -> Self` Creates a string by prepending another string slice to the start. **Args:** * ​other (`StringSlice[origin]`): The string to prepend. **Returns:** The new constructed string. ### `__iadd__` `__iadd__(mut self, other: StringSlice[origin])` Appends another string slice to this string. **Args:** * ​other (`StringSlice[origin]`): The string to append. ### `copy` `copy(self) -> Self` Explicitly copy the provided value. **Returns:** A copy of the value. ### `capacity` `capacity(self) -> UInt` Get the capacity of the string. **Returns:** The capacity of the string. ### `write_bytes` `write_bytes(mut self, bytes: Span[SIMD[uint8, 1], origin])` Write a byte span to this String. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The byte span to write to this String. Must NOT be null terminated. ### `write` `write[*Ts: Writable](mut self, *args: *Ts)` Write a sequence of Writable arguments to the provided Writer. **Parameters:** * ​\*Ts (`Writable`): Types of the provided argument sequence. **Args:** * ​\*args (`*Ts`): Sequence of arguments to write to this Writer. `static write[*Ts: Writable](*args: *Ts, *, sep: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), end: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")) -> Self` Construct a string by concatenating a sequence of Writable arguments. This is used only when reusing the `write_to` method for `__str__` in order to avoid an endless loop recalling the constructor. **Parameters:** * ​\*Ts (`Writable`): The types of the arguments to format. Each type must satisfy `Writable`. **Args:** * ​\*args (`*Ts`): A sequence of Writable arguments. * ​sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements. * ​end (`StringSlice[StaticConstantOrigin]`): The String to write after printing the elements. **Returns:** A string formed by formatting the argument sequence. ### `append_byte` `append_byte(mut self, byte: SIMD[uint8, 1])` Append a byte to the string. **Args:** * ​byte (`SIMD[uint8, 1]`): The byte to append. ### `__iter__` `__iter__(self) -> CodepointSliceIter[self]` Iterate over the string, returning immutable references. **Returns:** An iterator of references to the string elements. ### `__reversed__` `__reversed__(self) -> CodepointSliceIter[self, False]` Iterate backwards over the string, returning immutable references. **Returns:** A reversed iterator of references to the string elements. ### `__len__` `__len__(self) -> Int` Get the string length in bytes.
This function returns the number of bytes in the underlying UTF-8 representation of the string. To get the number of Unicode codepoints in a string, use `len(str.codepoints())`. # Examples Query the length of a string, in bytes and Unicode codepoints: ```mojo from testing import assert_equal var s = String("ನಮಸ್ಕಾರ") assert_equal(len(s), 21) assert_equal(len(s.codepoints()), 7) ``` Strings containing only ASCII characters have the same byte and Unicode codepoint length: ```mojo from testing import assert_equal var s = String("abc") assert_equal(len(s), 3) assert_equal(len(s.codepoints()), 3) ``` . **Returns:** The string length in bytes. ### `__str__` `__str__(self) -> Self` Gets the string itself. This method ensures that you can pass a `String` to a method that takes a `Stringable` value. **Returns:** The string itself. ### `__repr__` `__repr__(self) -> Self` Return a Mojo-compatible representation of the `String` instance. **Returns:** A new representation of the string. ### `__fspath__` `__fspath__(self) -> Self` Return the file system path representation (just the string itself). **Returns:** The file system path representation as a string. ### `to_python_object` `to_python_object(owned self) -> PythonObject` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this string to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `join` `join[*Ts: Writable](self, *elems: *Ts) -> Self` Joins string elements using the current string as a delimiter. **Parameters:** * ​\*Ts (`Writable`): The types of the elements. **Args:** * ​\*elems (`*Ts`): The input values. **Returns:** The joined string. `join[T: Copyable & Movable & Writable, //, buffer_size: Int = 4096](self, elems: List[T, hint_trivial_type]) -> Self` Joins string elements using the current string as a delimiter. Defaults to writing to the stack if total bytes of `elems` is less than `buffer_size`, otherwise will allocate once to the heap and write directly into that. The `buffer_size` defaults to 4096 bytes to match the default page size on arm64 and x86-64, but you can increase this if you're joining a very large `List` of elements to write into the stack instead of the heap. **Parameters:** * ​T (`Copyable & Movable & Writable`): The type of the elements. Must implement the `Copyable`, `Movable` and `Writable` traits. * ​buffer\_size (`Int`): The max size of the stack buffer. **Args:** * ​elems (`List[T, hint_trivial_type]`): The input values. **Returns:** The joined string. ### `codepoints` `codepoints(self) -> CodepointsIter[self]` Returns an iterator over the `Codepoint`s encoded in this string slice. # Examples Print the characters in a string: ```mojo from testing import assert_equal var s = String("abc") var iter = s.codepoints() assert_equal(iter.__next__(), Codepoint.ord("a")) assert_equal(iter.__next__(), Codepoint.ord("b")) assert_equal(iter.__next__(), Codepoint.ord("c")) assert_equal(iter.__has_next__(), False) ``` `codepoints()` iterates over Unicode codepoints, and supports multibyte codepoints: ```mojo from testing import assert_equal # A visual character composed of a combining sequence of 2 codepoints. 
var s = String("á") assert_equal(s.byte_length(), 3) var iter = s.codepoints() assert_equal(iter.__next__(), Codepoint.ord("a")) # U+0301 Combining Acute Accent assert_equal(iter.__next__().to_u32(), 0x0301) assert_equal(iter.__has_next__(), False) ``` . **Returns:** An iterator type that returns successive `Codepoint` values stored in this string slice. ### `codepoint_slices` `codepoint_slices(self) -> CodepointSliceIter[self]` Returns an iterator over single-character slices of this string. Each returned slice points to a single Unicode codepoint encoded in the underlying UTF-8 representation of this string. # Examples Iterate over the character slices in a string: ```mojo from testing import assert_equal, assert_true var s = String("abc") var iter = s.codepoint_slices() assert_true(iter.__next__() == "a") assert_true(iter.__next__() == "b") assert_true(iter.__next__() == "c") assert_equal(iter.__has_next__(), False) ``` . **Returns:** An iterator of references to the string elements. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[SIMD[uint8, 1], mut=False, origin=self]` Retrieves a pointer to the underlying memory. **Returns:** The pointer to the underlying memory. ### `unsafe_ptr_mut` `unsafe_ptr_mut(mut self) -> UnsafePointer[SIMD[uint8, 1], origin=self]` Retrieves a mutable pointer to the underlying memory, copying to a new buffer if this was previously pointing to a static constant. **Returns:** The pointer to the underlying memory. ### `unsafe_cstr_ptr` `unsafe_cstr_ptr(mut self) -> UnsafePointer[SIMD[int8, 1], origin=self]` Retrieves a C-string-compatible pointer to the underlying memory. The returned pointer is guaranteed to be NUL terminated. **Returns:** The pointer to the underlying memory. ### `as_bytes` `as_bytes(self) -> Span[SIMD[uint8, 1], self]` Returns a contiguous slice of the bytes owned by this string. **Returns:** A contiguous slice pointing to the bytes owned by this string. ### `as_bytes_mut` `as_bytes_mut(mut self) -> Span[SIMD[uint8, 1], self]` Returns a mutable contiguous slice of the bytes owned by this string. This name has a \_mut suffix so the as\_bytes() method doesn't have to guarantee mutability. **Returns:** A contiguous slice pointing to the bytes owned by this string. ### `as_string_slice` `as_string_slice(self) -> StringSlice[self]` Returns a string slice of the data owned by this string. **Returns:** A string slice pointing to the data owned by this string. ### `as_string_slice_mut` `as_string_slice_mut(mut self) -> StringSlice[self]` Returns a mutable string slice of the data owned by this string. **Returns:** A string slice pointing to the data owned by this string. ### `byte_length` `byte_length(self) -> Int` Get the string length in bytes. **Returns:** The length of this string in bytes. ### `count` `count(self, substr: StringSlice[origin]) -> Int` Return the number of non-overlapping occurrences of substring `substr` in the string. If `substr` is empty, returns the number of empty strings between characters, which is the length of the string plus one. **Args:** * ​substr (`StringSlice[origin]`): The substring to count. **Returns:** The number of occurrences of `substr`. ### `find` `find(self, substr: StringSlice[origin], start: Int = 0) -> Int` Finds the offset of the first occurrence of `substr` starting at `start`. If not found, returns -1. **Args:** * ​substr (`StringSlice[origin]`): The substring to find. * ​start (`Int`): The offset from which to find. **Returns:** The offset of `substr` relative to the beginning of the string.
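A brief illustrative sketch (not part of the original reference) of `find()` and `count()`; offsets are byte offsets from the start of the string, and `-1` signals no match:

```mojo
from testing import assert_equal

var s = String("hello world")
assert_equal(s.find("world"), 6)   # first match starts at byte offset 6
assert_equal(s.find("o", 5), 7)    # searching from offset 5 skips the first "o"
assert_equal(s.find("xyz"), -1)    # not found
assert_equal(s.count("o"), 2)      # non-overlapping occurrences
```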
### `rfind` `rfind(self, substr: StringSlice[origin], start: Int = 0) -> Int` Finds the offset of the last occurrence of `substr` starting at `start`. If not found, returns -1. **Args:** * ​substr (`StringSlice[origin]`): The substring to find. * ​start (`Int`): The offset from which to find. **Returns:** The offset of `substr` relative to the beginning of the string. ### `isspace` `isspace(self) -> Bool` Determines whether every character in the given String is a Python whitespace character. This corresponds to Python's [universal separators](https://docs.python.org/3/library/stdtypes.html#str.splitlines) `" \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`. **Returns:** True if the whole String is made up of whitespace characters listed above, otherwise False. ### `split` `split(self, sep: StringSlice[origin], maxsplit: Int = -1) -> List[String]` Split the string by a separator. Examples: ```mojo # Splitting a space _ = String("hello world").split(" ") # ["hello", "world"] # Splitting adjacent separators _ = String("hello,,world").split(",") # ["hello", "", "world"] # Splitting with maxsplit _ = String("1,2,3").split(",", 1) # ['1', '2,3'] # Splitting with an empty separator _ = String("123").split("") # ["", "1", "2", "3", ""] ``` **Args:** * ​sep (`StringSlice[origin]`): The string to split on. * ​maxsplit (`Int`): The maximum amount of items to split from String. Defaults to unlimited. **Returns:** A List of Strings containing the input split by the separator. `split(self, sep: NoneType = NoneType(None), maxsplit: Int = -1) -> List[String]` Split the string by every Whitespace separator. Examples: ```mojo # Splitting an empty string or filled with whitespaces _ = String(" ").split() # [] _ = String("").split() # [] # Splitting a string with leading, trailing, and middle whitespaces _ = String(" hello world ").split() # ["hello", "world"] # Splitting adjacent universal newlines: _ = String( "hello \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029world" ).split() # ["hello", "world"] ``` . **Args:** * ​sep (`NoneType`): None. * ​maxsplit (`Int`): The maximum amount of items to split from String. Defaults to unlimited. **Returns:** A List of Strings containing the input split by the separator. ### `splitlines` `splitlines(self, keepends: Bool = False) -> List[String]` Split the string at line boundaries. This corresponds to Python's [universal newlines:](https://docs.python.org/3/library/stdtypes.html#str.splitlines) `"\r\n"` and `"\t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`. **Args:** * ​keepends (`Bool`): If True, line breaks are kept in the resulting strings. **Returns:** A List of Strings containing the input split by line boundaries. ### `replace` `replace(self, old: StringSlice[origin], new: StringSlice[origin]) -> Self` Return a copy of the string with all occurrences of substring `old` replaced by `new`. **Args:** * ​old (`StringSlice[origin]`): The substring to replace. * ​new (`StringSlice[origin]`): The substring to replace with. **Returns:** The string where all occurrences of `old` are replaced with `new`. ### `strip` `strip(self, chars: StringSlice[origin]) -> StringSlice[self]` Return a copy of the string with leading and trailing characters removed. **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A copy of the string with no leading or trailing characters. `strip(self) -> StringSlice[self]` Return a copy of the string with leading and trailing whitespaces removed.
This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. **Returns:** A copy of the string with no leading or trailing whitespaces. ### `rstrip` `rstrip(self, chars: StringSlice[origin]) -> StringSlice[self]` Return a copy of the string with trailing characters removed. **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A copy of the string with no trailing characters. `rstrip(self) -> StringSlice[self]` Return a copy of the string with trailing whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. **Returns:** A copy of the string with no trailing whitespaces. ### `lstrip` `lstrip(self, chars: StringSlice[origin]) -> StringSlice[self]` Return a copy of the string with leading characters removed. **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A copy of the string with no leading characters. `lstrip(self) -> StringSlice[self]` Return a copy of the string with leading whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. **Returns:** A copy of the string with no leading whitespaces. ### `__hash__` `__hash__(self) -> UInt` Hash the underlying buffer using builtin hash. **Returns:** A 64-bit hash value. This value is *not* suitable for cryptographic uses. Its intended usage is for data structures. See the `hash` builtin documentation for more details. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with the underlying bytes. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `lower` `lower(self) -> Self` Returns a copy of the string with all cased characters converted to lowercase. **Returns:** A new string where cased letters have been converted to lowercase. ### `upper` `upper(self) -> Self` Returns a copy of the string with all cased characters converted to uppercase. **Returns:** A new string where cased letters have been converted to uppercase. ### `startswith` `startswith(self, prefix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool` Checks if the string starts with the specified prefix between start and end positions. Returns True if found and False otherwise. **Args:** * ​prefix (`StringSlice[origin]`): The prefix to check. * ​start (`Int`): The start offset from which to check. * ​end (`Int`): The end offset from which to check. **Returns:** True if the `self[start:end]` is prefixed by the input prefix. ### `endswith` `endswith(self, suffix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool` Checks if the string ends with the specified suffix between start and end positions. Returns True if found and False otherwise. **Args:** * ​suffix (`StringSlice[origin]`): The suffix to check. * ​start (`Int`): The start offset from which to check. * ​end (`Int`): The end offset from which to check. **Returns:** True if the `self[start:end]` is suffixed by the input suffix. ### `removeprefix` `removeprefix(self, prefix: StringSlice[origin], /) -> StringSlice[self]` Returns a new string with the prefix removed if it was present. Examples: ```mojo print(String('TestHook').removeprefix('Test')) # 'Hook' print(String('BaseTestCase').removeprefix('Test')) # 'BaseTestCase' ``` **Args:** * ​prefix (`StringSlice[origin]`): The prefix to remove from the string. **Returns:** `string[len(prefix):]` if the string starts with the prefix string, or a copy of the original string otherwise.
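Tying together the prefix/suffix checks above, a small illustrative sketch (not part of the original reference) of `startswith()`/`endswith()` with the optional `start`/`end` offsets:

```mojo
from testing import assert_true

var path = String("archive.tar.gz")
assert_true(path.startswith("archive"))
assert_true(path.endswith(".gz"))
# Restrict the check to self[start:end]: "archive.tar" ends with ".tar".
assert_true(path.endswith(".tar", 0, 11))
```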
### `removesuffix` `removesuffix(self, suffix: StringSlice[origin], /) -> StringSlice[self]` Returns a new string with the suffix removed if it was present. Examples: ```mojo print(String('TestHook').removesuffix('Hook')) # 'Test' print(String('BaseTestCase').removesuffix('Test')) # 'BaseTestCase' ``` **Args:** * ​suffix (`StringSlice[origin]`): The suffix to remove from the string. **Returns:** `string[:-len(suffix)]` if the string ends with the suffix string, or a copy of the original string otherwise. ### `__int__` `__int__(self) -> Int` Parses the given string as a base-10 integer and returns that value. If the string cannot be parsed as an int, an error is raised. **Returns:** An integer value that represents the string, or otherwise raises. ### `__float__` `__float__(self) -> SIMD[float64, 1]` Parses the string as a floating point number and returns that value. If the string cannot be parsed as a float, an error is raised. **Returns:** A float value that represents the string, or otherwise raises. ### `format` `format[*Ts: Stringable & Representable](self, *args: *Ts) -> Self` Produce a formatted string using the current string as a template. The template, or "format string", can contain literal text and/or replacement fields delimited with curly braces (`{}`). Returns a copy of the format string with the replacement fields replaced with string representations of the `args` arguments. For more information, see the discussion in the [`format` module](/mojo/stdlib/collections/string/format/). Example: ```mojo # Manual indexing: print(String("{0} {1} {0}").format("Mojo", 1.125)) # Mojo 1.125 Mojo # Automatic indexing: print(String("{} {}").format(True, "hello world")) # True hello world ``` **Parameters:** * ​\*Ts (`Stringable & Representable`): The types of substitution values that implement `Representable` and `Stringable` (to be changed and made more flexible). **Args:** * ​\*args (`*Ts`): The substitution values. **Returns:** The template with the given values substituted. ### `isdigit` `isdigit(self) -> Bool` A string is a digit string if all characters in the string are digits and there is at least one character in the string. Note that this currently only works with ASCII strings. **Returns:** True if all characters are digits and it's not empty else False. ### `isupper` `isupper(self) -> Bool` Returns True if all cased characters in the string are uppercase and there is at least one cased character. **Returns:** True if all cased characters in the string are uppercase and there is at least one cased character, False otherwise. ### `islower` `islower(self) -> Bool` Returns True if all cased characters in the string are lowercase and there is at least one cased character. **Returns:** True if all cased characters in the string are lowercase and there is at least one cased character, False otherwise. ### `isprintable` `isprintable(self) -> Bool` Returns True if all characters in the string are ASCII printable. Note that this currently only works with ASCII strings. **Returns:** True if all characters are printable else False. ### `rjust` `rjust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> Self` Returns the string right justified in a string of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** Returns right justified string, or self if width is not bigger than self length.
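An illustrative padding sketch (not part of the original reference) using `rjust()`; `ljust()` and `center()`, documented next, behave analogously. It assumes a string literal coerces to the `fillchar` parameter:

```mojo
var n = String("42")
print(n.rjust(5))        # "   42"
print(n.rjust(5, "0"))   # "00042"
print(n.rjust(1))        # "42" (width not bigger than the string returns self)
```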
### `ljust` `ljust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> Self` Returns the string left justified in a string of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** Returns left justified string, or self if width is not bigger than self length. ### `center` `center(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> Self` Returns the string center justified in a string of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** Returns center justified string, or self if width is not bigger than self length. ### `resize` `resize(mut self, length: Int, fill_byte: SIMD[uint8, 1] = __init__[__mlir_type.!pop.int_literal](0))` Resize the string to a new length. Notes: If the new length is greater than the current length, the string is extended by the difference, and the new bytes are initialized to `fill_byte`. **Args:** * ​length (`Int`): The new length of the string. * ​fill\_byte (`SIMD[uint8, 1]`): The byte to fill any new space with. `resize(mut self, *, unsafe_uninit_length: Int)` Resizes the string to the given new size leaving any new data uninitialized. If the new size is smaller than the current one, elements at the end are discarded. If the new size is larger than the current one, the string is extended and the new data is left uninitialized. **Args:** * ​unsafe\_uninit\_length (`Int`): The new size. ### `reserve` `reserve(mut self, new_capacity: UInt)` Reserves the requested capacity. Notes: If the current capacity is greater or equal, this is a no-op. Otherwise, the storage is reallocated and the data is moved. **Args:** * ​new\_capacity (`UInt`): The new capacity in stored bytes. --- ## ascii `ascii(value: StringSlice[origin]) -> String` Get the ASCII representation of the object. **Args:** * ​value (`StringSlice[origin]`): The object to get the ASCII representation of. **Returns:** A string containing the ASCII representation of the object. --- ## atof `atof(str_slice: StringSlice[origin]) -> SIMD[float64, 1]` Parses the given string as a floating point number and returns that value. For example, `atof("2.25")` returns `2.25`. This function is in the prelude, so you don't need to import it. **Args:** * ​str\_slice (`StringSlice[origin]`): A string to be parsed as a floating point number. **Returns:** A floating point value that represents the string, or otherwise raises. **Raises:** If the given string cannot be parsed as a floating point value, for example in `atof("hi")`. --- ## atol `atol(str_slice: StringSlice[origin], base: Int = 10) -> Int` Parses and returns the given string as an integer in the given base. If base is set to 0, the string is parsed as an integer literal, with the following considerations: * '0b' or '0B' prefix indicates binary (base 2) * '0o' or '0O' prefix indicates octal (base 8) * '0x' or '0X' prefix indicates hexadecimal (base 16) * Without a prefix, it's treated as decimal (base 10) This follows [Python's integer literals format](https://docs.python.org/3/reference/lexical_analysis.html#integers). This function is in the prelude, so you don't need to import it.
Examples: ```text >>> atol("32") 32 >>> atol("FF", 16) 255 >>> atol("0xFF", 0) 255 >>> atol("0b1010", 0) 10 ``` **Args:** * ​str\_slice (`StringSlice[origin]`): A string to be parsed as an integer in the given base. * ​base (`Int`): Base used for conversion, value must be between 2 and 36, or 0. **Returns:** An integer value that represents the string. **Raises:** If the given string cannot be parsed as an integer value or if an incorrect base is provided. --- ## chr `chr(c: Int) -> String` Returns a String based on the given Unicode code point. This is the inverse of the `ord()` function. This function is in the prelude, so you don't need to import it. Example: ```mojo print(chr(97), chr(8364)) # "a €" ``` **Args:** * ​c (`Int`): An integer that represents a code point. **Returns:** A string containing a single character based on the given code point. --- ## string The core `String` type implementation for Mojo. This module provides the primary `String` type and its fundamental operations. The `String` type is a mutable string, and is designed to handle UTF-8 encoded text efficiently while providing a safe and ergonomic interface for string manipulation. Related types: * [`StringSlice`](/mojo/stdlib/collections/string/string_slice/). A non-owning view of string data, which can be either mutable or immutable. * [`StaticString`](/mojo/stdlib/collections/string/string_slice/#aliases). An alias for an immutable constant `StringSlice`. * [`StringLiteral`](/mojo/stdlib/builtin/string_literal/StringLiteral/). A string literal. String literals are compile-time values. For use at runtime, you usually want to wrap a `StringLiteral` in a `String` (for a mutable string) or `StaticString` (for an immutable constant string). Key Features: * Short string optimization (SSO) and lazy copying of constant string data. * O(1) copy operation. * Memory-safe string operations. * Efficient string concatenation and slicing. * String-to-number conversions ( [`atof()`](/mojo/stdlib/collections/string/string/atof), [`atol()`](/mojo/stdlib/collections/string/string/atol)). * Character code conversions ( [`chr()`](/mojo/stdlib/collections/string/string/chr), [`ord()`](/mojo/stdlib/collections/string/string/ord)). * String formatting with [`format()`](/mojo/stdlib/collections/string/string/String/#format). The `String` type has Unicode support through UTF-8 encoding. A handful of operations are known to not be Unicode / UTF-8 compliant yet, but will be fixed as time permits. This type is in the prelude, so it is automatically imported into every Mojo program. Example: ```mojo # String creation and basic operations var s1 = String("Hello") var s2 = String("World") var combined = s1 + " " + s2 # "Hello World" # String-to-number conversion var num = atof("3.14") var int_val = atol("42") # Character operations var char = chr(65) # "A" var code = ord("A") # 65 # String formatting print(String("Codepoint {} is {}").format(code, char)) # Codepoint 65 is A # ASCII utilities var ascii_str = ascii("Hello") # ASCII-only string ``` ## Structs * [​`String`](/mojo/stdlib/collections/string/string/String): Represents a mutable string. ## Functions * [​`ascii`](/mojo/stdlib/collections/string/string/ascii): Get the ASCII representation of the object. * [​`atof`](/mojo/stdlib/collections/string/string/atof): Parses the given string as a floating point number and returns that value. * [​`atol`](/mojo/stdlib/collections/string/string/atol): Parses and returns the given string as an integer in the given base.
* [​`chr`](/mojo/stdlib/collections/string/string/chr): Returns a String based on the given Unicode code point. This is the inverse of the `ord()` function. * [​`ord`](/mojo/stdlib/collections/string/string/ord): Returns an integer that represents the codepoint of a single-character string. --- ## ord `ord(s: StringSlice[origin]) -> Int` Returns an integer that represents the codepoint of a single-character string. Given a string containing a single character `Codepoint`, return an integer representing the codepoint of that character. For example, `ord("a")` returns the integer `97`. This is the inverse of the `chr()` function. This function is in the prelude, so you don't need to import it. **Args:** * ​s (`StringSlice[origin]`): The input string, which must contain only a single character. **Returns:** An integer representing the code point of the given character. --- ## CodepointSliceIter `struct CodepointSliceIter[mut: Bool, //, origin: Origin[mut], forward: Bool = True]` Iterator for `StringSlice` over substring slices containing a single Unicode codepoint. The `forward` parameter only controls the behavior of the `__next__()` method used for normal iteration. Calls to `next()` will always take an element from the front of the iterator, and calls to `next_back()` will always take an element from the end. ## Parameters * ​mut (`Bool`): Whether the slice is mutable. * ​origin (`Origin[mut]`): The origin of the underlying string data. * ​forward (`Bool`): The iteration direction. `False` is backwards. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__next__` `__next__(mut self) -> StringSlice[origin]` Get the next codepoint in the underlying string slice. This returns the next single-codepoint substring slice encoded in the underlying string, and advances the iterator state. If `forward` is set to `False`, this will return the next codepoint from the end of the string. This function will abort if this iterator has been exhausted. **Returns:** The next character in the string. ### `__has_next__` `__has_next__(self) -> Bool` Returns True if there are still elements in this iterator. **Returns:** A boolean indicating if there are still elements in this iterator. ### `__len__` `__len__(self) -> Int` Returns the remaining length of this iterator in `Codepoint`s. The value returned from this method indicates the number of subsequent calls to `next()` that will return a value. **Returns:** Number of codepoints remaining in this iterator. ### `peek_next` `peek_next(self) -> Optional[StringSlice[origin]]` Check what the next single-codepoint slice in this iterator is, without advancing the iterator state. Repeated calls to this method will return the same value. # Examples `peek_next()` does not advance the iterator, so repeated calls will return the same value: ```mojo from collections.string import Codepoint from testing import assert_equal var input = StringSlice("123") var iter = input.codepoint_slices() assert_equal(iter.peek_next().value(), "1") assert_equal(iter.peek_next().value(), "1") assert_equal(iter.peek_next().value(), "1") # A call to `next()` returns the same value as `peek_next()` had, # but also advances the iterator. assert_equal(iter.next().value(), "1") # Later `peek_next()` calls will return the _new_ next character: assert_equal(iter.peek_next().value(), "2") ``` . **Returns:** The next codepoint slice in the underlying string, or None if the string is empty.
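A minimal sketch (not part of the original reference) of manual iteration with `next()` (documented below); it assumes the returned `Optional` evaluates to False once the iterator is exhausted:

```mojo
var s = StringSlice("abc")
var iter = s.codepoint_slices()
while True:
    var item = iter.next()
    if not item:
        break  # next() returned None: iterator exhausted
    print(item.value())
```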
### `peek_back` `peek_back(mut self) -> Optional[StringSlice[origin]]` Check what the last single-codepoint slice in this iterator is, without advancing the iterator state. Repeated calls to this method will return the same value. # Examples `peek_back()` does not advance the iterator, so repeated calls will return the same value: ```mojo from collections.string import Codepoint from testing import assert_equal var input = StringSlice("123") var iter = input.codepoint_slices() # Repeated calls to `peek_back()` return the same value. assert_equal(iter.peek_back().value(), "3") assert_equal(iter.peek_back().value(), "3") assert_equal(iter.peek_back().value(), "3") # A call to `next_back()` returns the same value as `peek_back()` had, # but also advances the iterator. assert_equal(iter.next_back().value(), "3") # Later `peek_back()` calls will return the _new_ next character: assert_equal(iter.peek_back().value(), "2") ``` . **Returns:** The last codepoint slice in the underlying string, or None if the string is empty. ### `next` `next(mut self) -> Optional[StringSlice[origin]]` Get the next codepoint slice in the underlying string slice, or None if the iterator is empty. This returns the next single-codepoint substring encoded in the underlying string, and advances the iterator state. **Returns:** A character if the string is not empty, otherwise None. ### `next_back` `next_back(mut self) -> Optional[StringSlice[origin]]` Get the last single-codepoint slice in this iterator, or None if the iterator is empty. This returns the last codepoint slice in this iterator, and advances the iterator state. **Returns:** The last codepoint slice in the underlying string, or None if the string is empty. --- ## CodepointsIter `struct CodepointsIter[mut: Bool, //, origin: Origin[mut]]` Iterator over the `Codepoint`s in a string slice, constructed by `StringSlice.codepoints()`. ## Parameters * ​mut (`Bool`): Mutability of the underlying string data. * ​origin (`Origin[mut]`): Origin of the underlying string data. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__next__` `__next__(mut self) -> Codepoint` Get the next codepoint in the underlying string slice. This returns the next `Codepoint` encoded in the underlying string, and advances the iterator state. This function will abort if this iterator has been exhausted. **Returns:** The next character in the string. ### `__has_next__` `__has_next__(self) -> Bool` Returns True if there are still elements in this iterator. **Returns:** A boolean indicating if there are still elements in this iterator. ### `__len__` `__len__(self) -> Int` Returns the remaining length of this iterator in `Codepoint`s. The value returned from this method indicates the number of subsequent calls to `next()` that will return a value. **Returns:** Number of codepoints remaining in this iterator. ### `peek_next` `peek_next(self) -> Optional[Codepoint]` Check what the next codepoint in this iterator is, without advancing the iterator state. Repeated calls to this method will return the same value.
# Examples `peek_next()` does not advance the iterator, so repeated calls will return the same value: ```mojo from collections.string import Codepoint from testing import assert_equal var input = StringSlice("123") var iter = input.codepoints() assert_equal(iter.peek_next().value(), Codepoint.ord("1")) assert_equal(iter.peek_next().value(), Codepoint.ord("1")) assert_equal(iter.peek_next().value(), Codepoint.ord("1")) # A call to `next()` returns the same value as `peek_next()` had, # but also advances the iterator. assert_equal(iter.next().value(), Codepoint.ord("1")) # Later `peek_next()` calls will return the _new_ next character: assert_equal(iter.peek_next().value(), Codepoint.ord("2")) ``` . **Returns:** The next character in the underlying string, or None if the string is empty. ### `next` `next(mut self) -> Optional[Codepoint]` Get the next codepoint in the underlying string slice, or None if the iterator is empty. This returns the next `Codepoint` encoded in the underlying string, and advances the iterator state. **Returns:** A character if the string is not empty, otherwise None. --- ## StringSlice `@register_passable(trivial)` `struct StringSlice[mut: Bool, //, origin: Origin[mut]]` A non-owning view to encoded string data. This type is guaranteed to have the same ABI (size, alignment, and field layout) as the `llvm::StringRef` type. See the [`string_slice` module](/mojo/stdlib/collections/string/string_slice/) for more information and examples. Notes: The underlying string data is guaranteed to be encoded using UTF-8. ## Parameters * ​mut (`Bool`): Whether the slice is mutable. * ​origin (`Origin[mut]`): The origin of the underlying string data. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `EqualityComparable`, `ExplicitlyCopyable`, `FloatableRaising`, `Hashable`, `IntableRaising`, `Movable`, `PathLike`, `PythonConvertible`, `Representable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `Immutable` `alias Immutable = StringSlice[(muttoimm origin._mlir_origin)]` The immutable version of the `StringSlice`. ### `Mutable` `alias Mutable = StringSlice[(mutcast origin._mlir_origin)]` The mutable version of the `StringSlice`. ## Methods ### `__init__` `__init__() -> Self` Create an empty / zero-length slice. `@implicit` `__init__(lit: StringLiteral[value]) -> StringSlice[StaticConstantOrigin]` Construct a new `StringSlice` from a `StringLiteral`. **Args:** * ​lit (`StringLiteral[value]`): The literal to construct this `StringSlice` from. `__init__(*, unsafe_from_utf8: Span[SIMD[uint8, 1], origin, address_space=address_space, alignment=alignment]) -> Self` Construct a new `StringSlice` from a sequence of UTF-8 encoded bytes. Safety: `unsafe_from_utf8` MUST be valid UTF-8 encoded data. **Args:** * ​unsafe\_from\_utf8 (`Span[SIMD[uint8, 1], origin, address_space=address_space, alignment=alignment]`): A `Span[Byte]` encoded in UTF-8. `__init__(*, unsafe_from_utf8_ptr: UnsafePointer[SIMD[uint8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> Self` Construct a new StringSlice from a `UnsafePointer[Byte]` pointing to null-terminated UTF-8 encoded bytes. Safety: * `unsafe_from_utf8_ptr` MUST point to data that is valid for `origin`. * `unsafe_from_utf8_ptr` MUST be valid UTF-8 encoded data. * `unsafe_from_utf8_ptr` MUST be null terminated.
**Args:** * ​unsafe\_from\_utf8\_ptr (`UnsafePointer[SIMD[uint8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): An `UnsafePointer[Byte]` of null-terminated bytes encoded in UTF-8. `__init__(*, unsafe_from_utf8_ptr: UnsafePointer[SIMD[int8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> Self` Construct a new StringSlice from a `UnsafePointer[c_char]` pointing to null-terminated UTF-8 encoded bytes. Safety: * `unsafe_from_utf8_ptr` MUST be valid UTF-8 encoded data. * `unsafe_from_utf8_ptr` MUST be null terminated. **Args:** * ​unsafe\_from\_utf8\_ptr (`UnsafePointer[SIMD[int8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): An `UnsafePointer[c_char]` of null-terminated bytes encoded in UTF-8. `__init__(*, ptr: UnsafePointer[SIMD[uint8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], length: UInt) -> Self` Construct a `StringSlice` from a pointer to a sequence of UTF-8 encoded bytes and a length. Safety: * `ptr` MUST point to at least `length` bytes of valid UTF-8 encoded data. * `ptr` must point to data that is live for the duration of `origin`. **Args:** * ​ptr (`UnsafePointer[SIMD[uint8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): A pointer to a sequence of bytes encoded in UTF-8. * ​length (`UInt`): The number of bytes of encoded data. `@implicit` `__init__[origin: ImmutableOrigin, //](ref [origin] value: String) -> StringSlice[origin]` Construct an immutable StringSlice. **Parameters:** * ​origin (`ImmutableOrigin`): The immutable origin. **Args:** * ​value (`String`): The string value. `__init__[origin: MutableOrigin, //](ref [origin] value: String) -> StringSlice[origin]` Construct a mutable StringSlice. **Parameters:** * ​origin (`MutableOrigin`): The mutable origin. **Args:** * ​value (`String`): The string value. ### `__bool__` `__bool__(self) -> Bool` Check if a string slice is non-empty. **Returns:** True if a string slice is non-empty, False otherwise. ### `__getitem__` `__getitem__(self, span: Slice) -> Self` Gets the sequence of characters at the specified positions. Raises: This function will raise if the specified slice start or end positions are outside the bounds of the string, or if they do not both fall on codepoint boundaries. **Args:** * ​span (`Slice`): A slice that specifies positions of the new substring. **Returns:** A new StringSlice containing the substring at the specified positions. `__getitem__[I: Indexer](self, idx: I) -> String` Gets the character at the specified position. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The index value. **Returns:** A new string containing the character at the specified position. ### `__lt__` `__lt__(self, rhs: StringSlice[origin]) -> Bool` Verify if the `StringSlice` bytes are strictly less than the input in overlapping content. **Args:** * ​rhs (`StringSlice[origin]`): The other `StringSlice` to compare against. **Returns:** If the `StringSlice` bytes are strictly less than the input in overlapping content. ### `__eq__` `__eq__(self, rhs_same: Self) -> Bool` Verify if a `StringSlice` is equal to another `StringSlice` with the same origin. **Args:** * ​rhs\_same (`Self`): The `StringSlice` to compare against. **Returns:** If the `StringSlice` is equal to the input in length and contents. `__eq__(self, rhs: StringSlice[origin]) -> Bool` Verify if a `StringSlice` is equal to another `StringSlice`.
**Args:** * ​rhs (`StringSlice[origin]`): The `StringSlice` to compare against. **Returns:** If the `StringSlice` is equal to the input in length and contents. ### `__ne__` `__ne__(self, rhs_same: Self) -> Bool` Verify if a `StringSlice` is not equal to another `StringSlice` with the same origin. **Args:** * ​rhs\_same (`Self`): The `StringSlice` to compare against. **Returns:** If the `StringSlice` is not equal to the input in length and contents. `__ne__(self, rhs: StringSlice[origin]) -> Bool` Verify if a `StringSlice` is not equal to another `StringSlice`. **Args:** * ​rhs (`StringSlice[origin]`): The `StringSlice` to compare against. **Returns:** If the `StringSlice` is not equal to the input in length and contents. ### `__contains__` `__contains__(self, substr: StringSlice[origin]) -> Bool` Returns True if the substring is contained within the current string. **Args:** * ​substr (`StringSlice[origin]`): The substring to check. **Returns:** True if the string contains the substring. ### `__add__` `__add__(self, rhs: StringSlice[origin]) -> String` Returns a string with this value prefixed on another string. **Args:** * ​rhs (`StringSlice[origin]`): The right side of the result. **Returns:** The result string. ### `__mul__` `__mul__(self, n: Int) -> String` Concatenates the string `n` times. **Args:** * ​n (`Int`): The number of times to concatenate the string. **Returns:** The string concatenated `n` times. ### `__radd__` `__radd__(self, lhs: StringSlice[origin]) -> String` Returns a string with this value appended to another string. **Args:** * ​lhs (`StringSlice[origin]`): The left side of the result. **Returns:** The result string. ### `copy` `copy(self) -> Self` Explicitly construct a deep copy of the provided `StringSlice`. **Returns:** A copy of the value. ### `from_utf8` `static from_utf8(from_utf8: Span[SIMD[uint8, 1], origin]) -> Self` Construct a new `StringSlice` from a buffer containing UTF-8 encoded data. **Args:** * ​from\_utf8 (`Span[SIMD[uint8, 1], origin]`): A span of bytes containing UTF-8 encoded data. **Returns:** A new validated `StringSlice` pointing to the provided buffer. **Raises:** An exception is raised if the provided buffer byte values do not form valid UTF-8 encoded codepoints. ### `__str__` `__str__(self) -> String` Convert this StringSlice to a String. Notes: This will allocate a new string that copies the string contents from the provided string slice. **Returns:** A new String. ### `__repr__` `__repr__(self) -> String` Return a Mojo-compatible representation of this string slice. **Returns:** Representation of this string slice in Mojo string literal input form. ### `__len__` `__len__(self) -> Int` Get the string length in bytes. This function returns the number of bytes in the underlying UTF-8 representation of the string. To get the number of Unicode codepoints in a string, use `len(str.codepoints())`. # Examples Query the length of a string, in bytes and Unicode codepoints: ```mojo from testing import assert_equal var s = StringSlice("ನಮಸ್ಕಾರ") assert_equal(len(s), 21) assert_equal(len(s.codepoints()), 7) ``` Strings containing only ASCII characters have the same byte and Unicode codepoint length: ```mojo from testing import assert_equal var s = StringSlice("abc") assert_equal(len(s), 3) assert_equal(len(s.codepoints()), 3) ``` . **Returns:** The string length in bytes. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this string slice to the provided `Writer`.
**Parameters:** * ​W (`Writer`): A type conforming to the `Writable` trait. **Args:** * ​writer (`W`): The object to write to. ### `__hash__` `__hash__(self) -> UInt` Hash the underlying buffer using builtin hash. **Returns:** A 64-bit hash value. This value is *not* suitable for cryptographic uses. Its intended usage is for data structures. See the `hash` builtin documentation for more details. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with the underlying bytes. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `__fspath__` `__fspath__(self) -> String` Return the file system path representation of this string. **Returns:** The file system path representation as a string. ### `to_python_object` `to_python_object(owned self) -> PythonObject` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. ### `__iter__` `__iter__(self) -> CodepointSliceIter[origin]` Iterate over the string, returning immutable references. **Returns:** An iterator of references to the string elements. ### `__reversed__` `__reversed__(self) -> CodepointSliceIter[origin, False]` Iterate backwards over the string, returning immutable references. **Returns:** A reversed iterator of references to the string elements. ### `__int__` `__int__(self) -> Int` Parses the given string as a base-10 integer and returns that value. If the string cannot be parsed as an int, an error is raised. **Returns:** An integer value that represents the string, or otherwise raises. ### `__float__` `__float__(self) -> SIMD[float64, 1]` Parses the string as a floating point number and returns that value. If the string cannot be parsed as a float, an error is raised. **Returns:** A float value that represents the string, or otherwise raises. ### `__merge_with__` `__merge_with__[: Bool, : Origin[$0], //, other_type: AnyStruct[StringSlice[$1]]](self) -> StringSlice[origin]` Returns a string slice with merged origins. **Parameters:** * ​other\_type (`AnyStruct[StringSlice[$1]]`): The type of the origin to merge with. **Returns:** A StringSlice merged with the other origin. ### `get_immutable` `get_immutable(self) -> StringSlice[(muttoimm origin._mlir_origin)]` Return an immutable version of this `StringSlice`. **Returns:** An immutable version of the same `StringSlice`. ### `replace` `replace(self, old: StringSlice[origin], new: StringSlice[origin]) -> String` Return a copy of the string with all occurrences of substring `old` replaced by `new`. **Args:** * ​old (`StringSlice[origin]`): The substring to replace. * ​new (`StringSlice[origin]`): The substring to replace with. **Returns:** The string where all occurrences of `old` are replaced with `new`. ### `strip` `strip(self, chars: StringSlice[origin]) -> Self` Return a copy of the string with leading and trailing characters removed. Example: ```mojo print("himojohi".strip("hi")) # "mojo" ``` **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A copy of the string with no leading or trailing characters. `strip(self) -> Self` Return a copy of the string with leading and trailing whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. Example: ```mojo print(" mojo ".strip()) # "mojo" ``` **Returns:** A copy of the string with no leading or trailing whitespaces. ### `rstrip` `rstrip(self, chars: StringSlice[origin]) -> Self` Return a copy of the string with trailing characters removed.
Example: ```mojo print("mojohi".rstrip("hi")) # "mojo" ``` **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A copy of the string with no trailing characters. `rstrip(self) -> Self` Return a copy of the string with trailing whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. Example: ```mojo print("mojo ".rstrip()) # "mojo" ``` **Returns:** A copy of the string with no trailing whitespaces. ### `lstrip` `lstrip(self, chars: StringSlice[origin]) -> Self` Return a copy of the string with leading characters removed. Example: ```mojo print("himojo".lstrip("hi")) # "mojo" ``` **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A copy of the string with no leading characters. `lstrip(self) -> Self` Return a copy of the string with leading whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. Example: ```mojo print(" mojo".lstrip()) # "mojo" ``` **Returns:** A copy of the string with no leading whitespaces. ### `codepoints` `codepoints(self) -> CodepointsIter[origin]` Returns an iterator over the `Codepoint`s encoded in this string slice. # Examples Print the characters in a string: ```mojo from testing import assert_equal var s = StringSlice("abc") var iter = s.codepoints() assert_equal(iter.__next__(), Codepoint.ord("a")) assert_equal(iter.__next__(), Codepoint.ord("b")) assert_equal(iter.__next__(), Codepoint.ord("c")) assert_equal(iter.__has_next__(), False) ``` `codepoints()` iterates over Unicode codepoints, and supports multibyte codepoints: ```mojo from testing import assert_equal # A visual character composed of a combining sequence of 2 codepoints. var s = StringSlice("á") assert_equal(s.byte_length(), 3) var iter = s.codepoints() assert_equal(iter.__next__(), Codepoint.ord("a")) # U+0301 Combining Acute Accent assert_equal(iter.__next__().to_u32(), 0x0301) assert_equal(iter.__has_next__(), False) ``` . **Returns:** An iterator type that returns successive `Codepoint` values stored in this string slice. ### `codepoint_slices` `codepoint_slices(self) -> CodepointSliceIter[origin]` Iterate over the string, returning immutable references. **Returns:** An iterator of references to the string elements. ### `as_bytes` `as_bytes(self) -> Span[SIMD[uint8, 1], origin]` Get the sequence of encoded bytes of the underlying string. **Returns:** A slice containing the underlying sequence of encoded bytes. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[SIMD[uint8, 1], mut=mut, origin=origin]` Gets a pointer to the first element of this string slice. **Returns:** A pointer pointing at the first element of this string slice. ### `byte_length` `byte_length(self) -> Int` Get the length of this string slice in bytes. **Returns:** The length of this string slice in bytes. ### `char_length` `char_length(self) -> UInt` Returns the length in Unicode codepoints. This returns the number of `Codepoint` codepoint values encoded in the UTF-8 representation of this string. Note: To get the length in bytes, use `StringSlice.byte_length()`.
# Examples Query the length of a string, in bytes and Unicode codepoints: ```mojo from testing import assert_equal var s = StringSlice("ನಮಸ್ಕಾರ") assert_equal(s.char_length(), 7) assert_equal(len(s), 21) ``` Strings containing only ASCII characters have the same byte and Unicode codepoint length: ```mojo from testing import assert_equal var s = StringSlice("abc") assert_equal(s.char_length(), 3) assert_equal(len(s), 3) ``` The character length of a string with visual combining characters is the length in Unicode codepoints, not grapheme clusters: ```mojo from testing import assert_equal var s = StringSlice("á") assert_equal(s.char_length(), 2) assert_equal(s.byte_length(), 3) ``` . **Returns:** The length in Unicode codepoints. ### `is_codepoint_boundary` `is_codepoint_boundary(self, index: UInt) -> Bool` Returns True if `index` is the position of the first byte in a UTF-8 codepoint sequence, or is at the end of the string. A byte position is considered a codepoint boundary if a valid subslice of the string would end (noninclusive) at `index`. Positions `0` and `len(self)` are considered to be codepoint boundaries. Positions beyond the length of the string slice will return False. Examples: Check if particular byte positions are codepoint boundaries: ```mojo from testing import assert_equal, assert_true, assert_false var abc = StringSlice("abc") assert_equal(len(abc), 3) assert_true(abc.is_codepoint_boundary(0)) assert_true(abc.is_codepoint_boundary(1)) assert_true(abc.is_codepoint_boundary(2)) assert_true(abc.is_codepoint_boundary(3)) ``` Only the index of the first byte in a multi-byte codepoint sequence is considered a codepoint boundary: ```mojo var thumb = StringSlice("👍") assert_equal(len(thumb), 4) assert_true(thumb.is_codepoint_boundary(0)) assert_false(thumb.is_codepoint_boundary(1)) assert_false(thumb.is_codepoint_boundary(2)) assert_false(thumb.is_codepoint_boundary(3)) ``` Visualization showing which bytes are considered codepoint boundaries, within a piece of text that includes codepoints whose UTF-8 representation requires, respectively, 1, 2, 3, and 4-bytes. The codepoint boundary byte indices are indicated by a vertical arrow (↑). For example, this diagram shows that a slice of bytes formed by the half-open range starting at byte 3 and extending up to but not including byte 6 (`[3, 6)`) is a valid UTF-8 sequence. ```text ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ a©➇𝄞 ┃ String ┣━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┫ ┃97┃ 169 ┃ 10119 ┃ 119070 ┃ Unicode Codepoints ┣━━╋━━━┳━━━╋━━━┳━━━┳━━━╋━━━┳━━━┳━━━┳━━━┫ ┃97┃194┃169┃226┃158┃135┃240┃157┃132┃158┃ UTF-8 Bytes ┗━━┻━━━┻━━━┻━━━┻━━━┻━━━┻━━━┻━━━┻━━━┻━━━┛ 0 1 2 3 4 5 6 7 8 9 10 ↑ ↑ ↑ ↑ ↑ ``` The following program verifies the above diagram: ```mojo from testing import assert_true, assert_false var text = StringSlice("a©➇𝄞") assert_true(text.is_codepoint_boundary(0)) assert_true(text.is_codepoint_boundary(1)) assert_false(text.is_codepoint_boundary(2)) assert_true(text.is_codepoint_boundary(3)) assert_false(text.is_codepoint_boundary(4)) assert_false(text.is_codepoint_boundary(5)) assert_true(text.is_codepoint_boundary(6)) assert_false(text.is_codepoint_boundary(7)) assert_false(text.is_codepoint_boundary(8)) assert_false(text.is_codepoint_boundary(9)) assert_true(text.is_codepoint_boundary(10)) ``` **Args:** * ​index (`UInt`): An index into the underlying byte representation of the string. **Returns:** A boolean indicating if `index` gives the position of the first byte in a UTF-8 codepoint sequence, or is at the end of the string. 
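A small illustrative sketch (not part of the original reference) that uses `is_codepoint_boundary()` to clamp a byte offset down to a valid boundary before slicing, reusing the boundaries from the diagram above (0, 1, 3, 6, 10):

```mojo
var text = StringSlice("a©➇𝄞")
# Clamp a byte offset down to the nearest codepoint boundary at or below it.
var cut: UInt = 5  # falls inside the 3-byte codepoint "➇" (bytes 3..6)
while not text.is_codepoint_boundary(cut):
    cut -= 1
print(cut)  # 3 — the largest boundary not exceeding 5
```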
### `startswith` `startswith(self, prefix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool` Verify if the `StringSlice` starts with the specified prefix between start and end positions. The `start` and `end` positions must be offsets given in bytes, and must be codepoint boundaries. **Args:** * ​prefix (`StringSlice[origin]`): The prefix to check. * ​start (`Int`): The start offset in bytes from which to check. * ​end (`Int`): The end offset in bytes at which to stop checking. **Returns:** True if `self[start:end]` is prefixed by the input prefix. ### `endswith` `endswith(self, suffix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool` Verify if the `StringSlice` ends with the specified suffix between start and end positions. The `start` and `end` positions must be offsets given in bytes, and must be codepoint boundaries. **Args:** * ​suffix (`StringSlice[origin]`): The suffix to check. * ​start (`Int`): The start offset in bytes from which to check. * ​end (`Int`): The end offset in bytes at which to stop checking. **Returns:** True if `self[start:end]` is suffixed by the input suffix. ### `removeprefix` `removeprefix(self, prefix: StringSlice[origin], /) -> Self` Returns a new string with the prefix removed if it was present. Examples: ```mojo print(StringSlice('TestHook').removeprefix('Test')) # 'Hook' print(StringSlice('BaseTestCase').removeprefix('Test')) # 'BaseTestCase' ``` **Args:** * ​prefix (`StringSlice[origin]`): The prefix to remove from the string. **Returns:** `string[len(prefix):]` if the string starts with the prefix string, or a copy of the original string otherwise. ### `removesuffix` `removesuffix(self, suffix: StringSlice[origin], /) -> Self` Returns a new string with the suffix removed if it was present. Examples: ```mojo print(StringSlice('TestHook').removesuffix('Hook')) # 'Test' print(StringSlice('BaseTestCase').removesuffix('Test')) # 'BaseTestCase' ``` **Args:** * ​suffix (`StringSlice[origin]`): The suffix to remove from the string. **Returns:** `string[:-len(suffix)]` if the string ends with the suffix string, or a copy of the original string otherwise. ### `format` `format[*Ts: Stringable & Representable](self, *args: *Ts) -> String` Produce a formatted string using the current string as a template. The template, or "format string", can contain literal text and/or replacement fields delimited with curly braces (`{}`). Returns a copy of the format string with the replacement fields replaced with string representations of the `args` arguments. For more information, see the discussion in the [`format` module](/mojo/stdlib/collections/string/format/). Examples: ```mojo # Manual indexing: print(StringSlice("{0} {1} {0}").format("Mojo", 1.125)) # Mojo 1.125 Mojo # Automatic indexing: print(StringSlice("{} {}").format(True, "hello world")) # True hello world ``` **Parameters:** * ​\*Ts (`Stringable & Representable`): The types of substitution values that implement `Representable` and `Stringable` (to be changed and made more flexible). **Args:** * ​\*args (`*Ts`): The substitution values. **Returns:** The template with the given values substituted. ### `find` `find(self, substr: StringSlice[origin], start: Int = 0) -> Int` Finds the offset in bytes of the first occurrence of `substr` starting at `start`. If not found, returns `-1`. **Args:** * ​substr (`StringSlice[origin]`): The substring to find. * ​start (`Int`): The offset in bytes from which to start the search. Must be a codepoint boundary.
**Returns:** The offset in bytes of `substr` relative to the beginning of the string. ### `rfind` `rfind(self, substr: StringSlice[origin], start: Int = 0) -> Int` Finds the offset in bytes of the last occurrence of `substr` starting at `start`. If not found, returns `-1`. **Args:** * ​substr (`StringSlice[origin]`): The substring to find. * ​start (`Int`): The offset in bytes from which to start the search. Must be a valid codepoint boundary. **Returns:** The offset in bytes of `substr` relative to the beginning of the string. ### `isspace` `isspace[single_character: Bool = False](self) -> Bool` Determines whether every character in the given StringSlice is a Python whitespace character. This corresponds to Python's [universal separators](https://docs.python.org/3/library/stdtypes.html#str.splitlines): `" \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`. Example: Check if a string contains only whitespace: ```mojo from testing import assert_true, assert_false # An empty string is not considered to contain only whitespace chars: assert_false(StringSlice("").isspace()) # ASCII space characters assert_true(StringSlice(" ").isspace()) assert_true(StringSlice(" ").isspace()) # Contains non-space characters assert_false(StringSlice(" abc ").isspace()) ``` **Parameters:** * ​single\_character (`Bool`): Whether to evaluate the `StringSlice` as a single Unicode character (avoids overhead when already iterating). **Returns:** True if the whole StringSlice is made up of whitespace characters listed above, otherwise False. ### `split` `split(self, sep: StringSlice[origin], maxsplit: Int = -1) -> List[StringSlice[(muttoimm origin._mlir_origin)]]` Split the string by a separator. Examples: ```mojo # Splitting on a space _ = StringSlice("hello world").split(" ") # ["hello", "world"] # Splitting adjacent separators _ = StringSlice("hello,,world").split(",") # ["hello", "", "world"] # Splitting with maxsplit _ = StringSlice("1,2,3").split(",", 1) # ['1', '2,3'] # Splitting with an empty separator _ = StringSlice("123").split("") # ["", "1", "2", "3", ""] ``` **Args:** * ​sep (`StringSlice[origin]`): The string to split on. * ​maxsplit (`Int`): The maximum number of items to split from the string. Defaults to unlimited. **Returns:** A List of Strings containing the input split by the separator. `split(self, sep: NoneType = NoneType(None), maxsplit: Int = -1) -> List[StringSlice[(muttoimm origin._mlir_origin)]]` Split the string by every whitespace separator. Examples: ```mojo # Splitting an empty string or one filled with whitespace _ = StringSlice(" ").split() # [] _ = StringSlice("").split() # [] # Splitting a string with leading, trailing, and middle whitespace _ = StringSlice(" hello world ").split() # ["hello", "world"] # Splitting adjacent universal newlines: _ = StringSlice( "hello \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029world" ).split() # ["hello", "world"] ``` **Args:** * ​sep (`NoneType`): None. * ​maxsplit (`Int`): The maximum number of items to split from the string. Defaults to unlimited. **Returns:** A List of Strings containing the input split by the separator. ### `isnewline` `isnewline[single_character: Bool = False](self) -> Bool` Determines whether every character in the given StringSlice is a Python newline character. This corresponds to Python's [universal newlines:](https://docs.python.org/3/library/stdtypes.html#str.splitlines) `"\r\n"` and `"\t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`. **Parameters:** * ​single\_character (`Bool`): Whether to evaluate the `StringSlice` as a single Unicode character (avoids overhead when already iterating).
**Returns:** True if the whole StringSlice is made up of the newline characters listed above, otherwise False. ### `splitlines` `splitlines[O: ImmutableOrigin, //](self: StringSlice[O], keepends: Bool = False) -> List[StringSlice[O]]` Split the string at line boundaries. This corresponds to Python's [universal newlines:](https://docs.python.org/3/library/stdtypes.html#str.splitlines) `"\r\n"` and `"\t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`. **Parameters:** * ​O (`ImmutableOrigin`): The immutable origin. **Args:** * ​keepends (`Bool`): If True, line breaks are kept in the resulting strings. **Returns:** A List of Strings containing the input split by line boundaries. ### `count` `count(self, substr: StringSlice[origin]) -> Int` Return the number of non-overlapping occurrences of substring `substr` in the string. If `substr` is empty, returns the number of empty strings between characters, which is the length of the string plus one. **Args:** * ​substr (`StringSlice[origin]`): The substring to count. **Returns:** The number of occurrences of `substr`. ### `is_ascii_digit` `is_ascii_digit(self) -> Bool` A string is a digit string if all characters in the string are digits and there is at least one character in the string. Note that this currently only works with ASCII strings. **Returns:** True if all characters are digits and the string is not empty, otherwise False. ### `isupper` `isupper(self) -> Bool` Returns True if all cased characters in the string are uppercase and there is at least one cased character. **Returns:** True if all cased characters in the string are uppercase and there is at least one cased character, False otherwise. ### `islower` `islower(self) -> Bool` Returns True if all cased characters in the string are lowercase and there is at least one cased character. **Returns:** True if all cased characters in the string are lowercase and there is at least one cased character, False otherwise. ### `lower` `lower(self) -> String` Returns a copy of the string with all cased characters converted to lowercase. **Returns:** A new string where cased letters have been converted to lowercase. ### `upper` `upper(self) -> String` Returns a copy of the string with all cased characters converted to uppercase. **Returns:** A new string where cased letters have been converted to uppercase. ### `is_ascii_printable` `is_ascii_printable(self) -> Bool` Returns True if all characters in the string are ASCII printable. Note that this currently only works with ASCII strings. **Returns:** True if all characters are printable, otherwise False. ### `rjust` `rjust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> String` Returns the string right-justified in a string of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The right-justified string, or self if `width` is not bigger than the string's length. ### `ljust` `ljust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> String` Returns the string left-justified in a string of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The left-justified string, or self if `width` is not bigger than the string's length.
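For instance, these padding helpers can align fixed-width output. A minimal sketch (the widths and fill character are illustrative, assuming a string literal converts to the `fillchar` argument type):

```mojo
var s = StringSlice("mojo")
print(s.rjust(8))       # "    mojo"
print(s.ljust(8, "-"))  # "mojo----"
```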
### `center` `center(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> String` Returns the string center-justified in a string of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The center-justified string, or self if `width` is not bigger than the string's length. ### `join` `join[T: Copyable & Movable & Writable](self, elems: List[T, hint_trivial_type]) -> String` Joins string elements using the current string as a delimiter. **Parameters:** * ​T (`Copyable & Movable & Writable`): The type of the elements; must implement the `Copyable`, `Movable` and `Writable` traits. **Args:** * ​elems (`List[T, hint_trivial_type]`): The input values. **Returns:** The joined string. `join[*Ts: Writable](self: StringSlice[StaticConstantOrigin], *elems: *Ts) -> String` Joins string elements using the current string as a delimiter. **Parameters:** * ​\*Ts (`Writable`): The types of the elements. **Args:** * ​\*elems (`*Ts`): The input values. **Returns:** The joined string. --- ## get_static_string `get_static_string[string: StringSlice[StaticConstantOrigin], *extra: StringSlice[StaticConstantOrigin]]() -> StringSlice[StaticConstantOrigin]` Form a StaticString from compile-time StringSlice values. This guarantees that the returned string is a compile-time constant in static memory. It also guarantees that there is a 'nul' zero byte at the end, which is not included in the returned range. **Parameters:** * ​string (`StringSlice[StaticConstantOrigin]`): The first StringSlice value. * ​\*extra (`StringSlice[StaticConstantOrigin]`): Additional StringSlice values to concatenate. **Returns:** The string value as a StaticString. --- ## string_slice The `StringSlice` type implementation for efficient string operations. This module provides the `StringSlice` type, which is a lightweight view into string data that enables zero-copy string operations. `StringSlice` is designed for high-performance string manipulation while maintaining memory safety and UTF-8 awareness. The `StringSlice` type is particularly useful for: * High-performance string operations without copying. * Efficient string parsing and tokenization. `StaticString` is an alias for an immutable constant `StringSlice`. `StringSlice` and `StaticString` are in the prelude, so they are automatically imported into every Mojo program. Example: ```mojo # Create a string slice var text = StringSlice("Hello, 世界") # Zero-copy slicing var hello = text[0:5] # "Hello" # Unicode-aware operations var world = text[7:13] # "世界" # String comparison if text.startswith("Hello"): print("Found greeting") # String formatting var format_string = StaticString("{}: {}") print(format_string.format("bats", 6)) # bats: 6 ``` ## Aliases ### `StaticString` `alias StaticString = StringSlice[StaticConstantOrigin]` An immutable static string slice. ## Structs * [​`CodepointsIter`](/mojo/stdlib/collections/string/string_slice/CodepointsIter): Iterator over the `Codepoint`s in a string slice, constructed by `StringSlice.codepoints()`. * [​`CodepointSliceIter`](/mojo/stdlib/collections/string/string_slice/CodepointSliceIter): Iterator for `StringSlice` over substring slices containing a single Unicode codepoint. * [​`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice): A non-owning view to encoded string data.
## Functions * [​`get_static_string`](/mojo/stdlib/collections/string/string_slice/get_static_string): Form a StaticString from compile-time StringSlice values. --- ## Info `@register_passable(trivial)` `struct Info[func_type: AnyTrivialRegType, func: func_type, target: target]` Contains compilation information and results for a function. Stores assembly/IR code, function metadata, and error information from compiling a function. Attributes: populate: Function to populate captures. ## Parameters * ​func\_type (`AnyTrivialRegType`): Type of the function being compiled. * ​func (`func_type`): The function being compiled. * ​target (`target`): The target architecture to compile for. ## Fields * ​asm (`StringSlice[StaticConstantOrigin]`): Generated assembly/IR code from the compilation process. * ​function\_name (`StringSlice[StaticConstantOrigin]`): Mangled name of the compiled function, used for symbol resolution. * ​module\_name (`StringSlice[StaticConstantOrigin]`): Name of the module containing the compiled function. * ​num\_captures (`Int`): Number of variables captured by the function closure. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `populate` `alias populate = rebind[AnyTrivialRegType,AnyTrivialRegType](#kgen.compile_offload_closure : !kgen.param>)` Function pointer to populate captured variables in the function closure. ## Methods ### `__contains__` `__contains__(self, content: String) -> Bool` Checks if content exists in the assembly/IR. **Args:** * ​content (`String`): String to search for. **Returns:** True if content is found, False otherwise. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the assembly/IR to a writer. **Parameters:** * ​W (`Writer`): Type that implements the Writer interface for writing data. **Args:** * ​writer (`W`): Writer object to write the assembly to. ### `__str__` `__str__(self) -> String` Converts the assembly/IR to a string. **Returns:** The assembly/IR as a string. ### `write_text` `write_text[path_like: PathLike](self, path: path_like)` Writes the assembly/IR to a file. **Parameters:** * ​path\_like (`PathLike`): Type that implements the `PathLike` interface for file path representation. **Args:** * ​path (`path_like`): Path to write the file to. **Raises:** If file writing operations fail. --- ## compile_info `compile_info[func_type: AnyTrivialRegType, //, func: func_type, /, *, emission_kind: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("asm"), compile_options: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), target: target = _current_target()]() -> Info[func_type, func, target]` Compiles a function and returns detailed compilation information. This function takes a Mojo function and compiles it, providing access to the generated assembly code, linkage information, and other compilation artifacts. It can be used for inspection, debugging, and low-level optimization. Example: ```mojo from compile import compile_info fn my_func(x: Int) -> Int: return x info = compile_info[my_func]() print(info) # Print assembly ``` Note: The compilation is always performed, even if the function is not used. For performance-critical code, consider caching the compilation results.
**Parameters:** * ​func\_type (`AnyTrivialRegType`): Type of the function to compile. Must be a trivially-copyable register type. * ​func (`func_type`): The function to compile. Must match the specified func\_type. * ​emission\_kind (`StringSlice[StaticConstantOrigin]`): The desired output format. Valid options are: * "asm": Assembly code (default). * "llvm": Unoptimized LLVM IR. * "llvm-opt": Optimized LLVM IR. * "object": Object code. * ​compile\_options (`StringSlice[StaticConstantOrigin]`): Additional compiler flags and options as a string. * ​target (`target`): The target architecture to compile for. Defaults to the current architecture. **Returns:** An `Info` struct containing: * asm: The generated code in the requested format. * function\_name: The mangled name of the compiled function, used for symbol resolution. * module\_name: The name of the module containing the compiled function. * num\_captures: The number of variables captured by the function closure. --- ## compile Provides utilities for compiling and inspecting Mojo code. This module contains functionality for compiling Mojo functions and examining their assembly, LLVM IR, or object code output. It is particularly useful for kernel engineers who want to inspect the low-level implementation details of specific functions without dealing with entire files or manual invocation of compilation tools. Key features: * Compile individual functions to assembly, LLVM IR, or object code * Get linkage names and module information * Inspect number of captures and other function metadata * Write compilation output to files * Control compilation options and targets Example: ```mojo from compile import compile_info fn my_func(x: Int) -> Int: return x # Get assembly for the function info = compile_info[my_func]() print(info) ``` ## Structs * [​`Info`](/mojo/stdlib/compile/compile/Info): Contains compilation information and results for a function. ## Functions * [​`compile_info`](/mojo/stdlib/compile/compile/compile_info): Compiles a function and returns detailed compilation information. --- ## compile Provides utilities for compiling and inspecting Mojo code at runtime. This module exposes functionality for compiling individual Mojo functions and examining their low-level implementation details. It is particularly useful for: * Inspecting assembly, LLVM IR, or object code output * Getting linkage names and module information * Examining function metadata like captures * Writing compilation output to files * Controlling compilation options and targets Example: ```mojo from compile import compile_info fn my_func(): print("Hello") # Get assembly for the function info = compile_info[my_func]() print(info.asm) ``` ## Modules * [​`compile`](/mojo/stdlib/compile/compile/): Provides utilities for compiling and inspecting Mojo code. * [​`reflection`](/mojo/stdlib/compile/reflection/): --- ## get_linkage_name `get_linkage_name[func_type: AnyTrivialRegType, //, target: target, func: func_type]() -> StringSlice[StaticConstantOrigin]` Returns the symbol name of `func`. **Parameters:** * ​func\_type (`AnyTrivialRegType`): Type of func. * ​target (`target`): The compilation target. * ​func (`func_type`): A Mojo function. **Returns:** The symbol name. `get_linkage_name[func_type: AnyTrivialRegType, //, func: func_type]() -> StringSlice[StaticConstantOrigin]` Returns the symbol name of `func`. **Parameters:** * ​func\_type (`AnyTrivialRegType`): Type of func. * ​func (`func_type`): A Mojo function. **Returns:** The symbol name.
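As a usage sketch, the reflection helpers compose with ordinary functions. A minimal example (assuming `get_linkage_name` is importable from the `compile.reflection` module documented below; the function and its body are hypothetical):

```mojo
from compile.reflection import get_linkage_name

fn my_func(x: Int) -> Int:
    return x + 1

def main():
    # Prints the mangled symbol name used for linking.
    print(get_linkage_name[my_func]())
```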
--- ## get_type_name `get_type_name[type_type: AnyTrivialRegType, //, type: type_type]() -> StringSlice[StaticConstantOrigin]` Returns the struct name of the given type parameter. **Parameters:** * ​type\_type (`AnyTrivialRegType`): Type of type. * ​type (`type_type`): A Mojo type. **Returns:** Type name. --- ## reflection ## Functions * [​`get_linkage_name`](/mojo/stdlib/compile/reflection/get_linkage_name): Returns the symbol name of `func`. * [​`get_type_name`](/mojo/stdlib/compile/reflection/get_type_name): Returns the struct name of the given type parameter. --- ## ComplexSIMD `@register_passable(trivial)` `struct ComplexSIMD[type: DType, size: Int]` Represents a complex SIMD value. The struct provides basic methods for manipulating complex values. ## Parameters * ​type (`DType`): DType of the value. * ​size (`Int`): SIMD width of the value. ## Fields * ​re (`SIMD[type, size]`): The real part of the complex SIMD value. * ​im (`SIMD[type, size]`): The imaginary part of the complex SIMD value. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`, `_Expable` ## Aliases ### `element_type` `alias element_type = SIMD[type, size]` ## Methods ### `__init__` `__init__(re: SIMD[type, size], im: SIMD[type, size] = __init__[__mlir_type.!pop.int_literal](0)) -> Self` Initializes a complex SIMD value. **Args:** * ​re (`SIMD[type, size]`): The real part of the complex value. * ​im (`SIMD[type, size]`): The imaginary part of the complex value. ### `__neg__` `__neg__(self) -> Self` Negates the complex value. **Returns:** The negative of the complex value. ### `__add__` `__add__(self, rhs: Self) -> Self` Adds two complex values. **Args:** * ​rhs (`Self`): Complex value to add. **Returns:** A sum of this and RHS complex values. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Subtracts two complex values. **Args:** * ​rhs (`Self`): Complex value to subtract. **Returns:** A difference of this and RHS complex values. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Multiplies two complex values. **Args:** * ​rhs (`Self`): Complex value to multiply with. **Returns:** A product of this and RHS complex values. ### `__truediv__` `__truediv__(self, rhs: Self) -> Self` Divides two complex values. **Args:** * ​rhs (`Self`): Complex value to divide by. **Returns:** A quotient of this and RHS complex values. ### `__str__` `__str__(self) -> String` Get the complex value as a string. **Returns:** A string representation. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this complex value to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the `Writer` trait. **Args:** * ​writer (`W`): The object to write to. ### `__abs__` `__abs__(self) -> SIMD[type, size]` Returns the magnitude of the complex value. **Returns:** Value of `sqrt(re*re + im*im)`. ### `norm` `norm(self) -> SIMD[type, size]` Returns the magnitude of the complex value. **Returns:** Value of `sqrt(re*re + im*im)`. ### `squared_norm` `squared_norm(self) -> SIMD[type, size]` Returns the squared magnitude of the complex value. **Returns:** Value of `re*re + im*im`. ### `fma` `fma(self, b: Self, c: Self) -> Self` Computes the FMA operation. Computes the fused multiply-add with two other complex values: `result = self * b + c` **Args:** * ​b (`Self`): Multiplier complex value. * ​c (`Self`): Complex value to add. **Returns:** The computed `self * b + c` complex value. ### `squared_add` `squared_add(self, c: Self) -> Self` Computes the square-add operation: `self * self + c`.
**Args:** * ​c (`Self`): Complex value to add. **Returns:** The computed `self * self + c` complex value. ### `__exp__` `__exp__(self) -> Self` Computes the exponential of the complex value. **Returns:** The exponential of the complex value. --- ## abs `abs(x: ComplexSIMD[type, size]) -> SIMD[type, size]` Performs an elementwise absolute value (norm) on the complex value. **Args:** * ​x (`ComplexSIMD[type, size]`): The complex vector to take the absolute value of. **Returns:** The elementwise absolute value of x. --- ## complex Implements the Complex type. You can import these APIs from the `complex` package. For example: ```mojo from complex import ComplexSIMD ``` ## Aliases ### `ComplexFloat32` `alias ComplexFloat32 = ComplexSIMD[float32, 1]` ### `ComplexFloat64` `alias ComplexFloat64 = ComplexSIMD[float64, 1]` ## Structs * [​`ComplexSIMD`](/mojo/stdlib/complex/complex/ComplexSIMD): Represents a complex SIMD value. ## Functions * [​`abs`](/mojo/stdlib/complex/complex/abs): Performs an elementwise absolute value (norm) on the complex value. --- ## complex Provides types and functions for working with complex numbers. ## Modules * [​`complex`](/mojo/stdlib/complex/complex/): Implements the Complex type. --- ## doc_private `doc_private()` Indicate that the decorated declaration is private from the viewpoint of documentation generation. This decorator allows for hiding the documentation for a declaration during generation. This is often used to hide `__init__` and other special methods that are not intended to be part of a library's documentation. For example: ```mojo struct Foo: @doc_private fn __init__(out self): "This should not be called directly, use `Foo.create` instead." return @staticmethod fn create() -> Self: return Self() ``` --- ## documentation Provides decorators and utilities for interacting with Mojo documentation generation and validation. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`doc_private`](/mojo/stdlib/documentation/documentation/doc_private): Indicate that the decorated declaration is private from the viewpoint of documentation generation. --- ## documentation Implements the documentation package. ## Modules * [​`documentation`](/mojo/stdlib/documentation/documentation/): Provides decorators and utilities for interacting with Mojo documentation generation and validation. --- ## broadcast `broadcast[type: DType, width: Int, //, *, block_size: Int](val: SIMD[type, width], src_thread: UInt = UInt(0)) -> SIMD[type, width]` Broadcasts a value from a source thread to all threads in a block. This function takes a SIMD value from the specified source thread and copies it to all other threads in the block, effectively broadcasting the value across the entire block. **Parameters:** * ​type (`DType`): The data type of the SIMD elements. * ​width (`Int`): The number of elements in each SIMD vector. * ​block\_size (`Int`): The total number of threads in the block. **Args:** * ​val (`SIMD[type, width]`): The SIMD value to broadcast from the source thread. * ​src\_thread (`UInt`): The thread ID of the source thread (default: 0). **Returns:** A SIMD value where all threads contain a copy of the input value from the source thread. --- ## block GPU block-level operations and utilities.
This module provides block-level operations for NVIDIA and AMD GPUs, including: * Block-wide reductions: * sum: Compute sum across block * max: Find maximum value across block * min: Find minimum value across block * broadcast: Broadcast value to all threads The module builds on warp-level operations from the warp module, extending them to work across a full thread block (potentially multiple warps). It handles both NVIDIA and AMD GPU architectures and supports various data types with SIMD vectorization. ## Functions * [​`broadcast`](/mojo/stdlib/gpu/block/broadcast): Broadcasts a value from a source thread to all threads in a block. * [​`max`](/mojo/stdlib/gpu/block/max): Computes the maximum value across all threads in a block. * [​`min`](/mojo/stdlib/gpu/block/min): Computes the minimum value across all threads in a block. * [​`prefix_sum`](/mojo/stdlib/gpu/block/prefix_sum): Performs a prefix sum (scan) operation across all threads in a block. * [​`sum`](/mojo/stdlib/gpu/block/sum): Computes the sum of values across all threads in a block. --- ## max `max[type: DType, width: Int, //, *, block_size: Int, broadcast: Bool = True](val: SIMD[type, width]) -> SIMD[type, width]` Computes the maximum value across all threads in a block. Performs a parallel reduction using warp-level operations and shared memory to find the global maximum across all threads in the block. **Parameters:** * ​type (`DType`): The data type of the SIMD elements. * ​width (`Int`): The number of elements in each SIMD vector. * ​block\_size (`Int`): The total number of threads in the block. * ​broadcast (`Bool`): If True, the final reduced value is broadcast to all threads in the block. If False, only the first thread will have the complete result. **Args:** * ​val (`SIMD[type, width]`): The SIMD value to reduce. Each thread contributes its value to find the maximum. **Returns:** If broadcast is True, each thread in the block will receive the maximum value across the entire block. Otherwise, only the first thread will have the complete result. --- ## min `min[type: DType, width: Int, //, *, block_size: Int, broadcast: Bool = True](val: SIMD[type, width]) -> SIMD[type, width]` Computes the minimum value across all threads in a block. Performs a parallel reduction using warp-level operations and shared memory to find the global minimum across all threads in the block. **Parameters:** * ​type (`DType`): The data type of the SIMD elements. * ​width (`Int`): The number of elements in each SIMD vector. * ​block\_size (`Int`): The total number of threads in the block. * ​broadcast (`Bool`): If True, the final minimum is broadcast to all threads in the block. If False, only the first thread will have the complete result. **Args:** * ​val (`SIMD[type, width]`): The SIMD value to reduce. Each thread contributes its value to find the minimum. **Returns:** If broadcast is True, each thread in the block will receive the minimum value across the entire block. Otherwise, only the first thread will have the complete result. --- ## prefix_sum `prefix_sum[type: DType, //, *, block_size: Int, exclusive: Bool = False](val: SIMD[type, 1]) -> SIMD[type, 1]` Performs a prefix sum (scan) operation across all threads in a block. This function implements a block-level inclusive or exclusive scan, efficiently computing the cumulative sum for each thread based on thread indices. **Parameters:** * ​type (`DType`): The data type of the Scalar elements. * ​block\_size (`Int`): The total number of threads in the block.
* ​exclusive (`Bool`): If True, performs an exclusive scan instead of an inclusive scan. **Args:** * ​val (`SIMD[type, 1]`): The Scalar value from each thread to include in the scan. **Returns:** A Scalar value containing the result of the scan operation for each thread. --- ## sum `sum[type: DType, width: Int, //, *, block_size: Int, broadcast: Bool = True](val: SIMD[type, width]) -> SIMD[type, width]` Computes the sum of values across all threads in a block. Performs a parallel reduction using warp-level operations and shared memory to find the global sum across all threads in the block. **Parameters:** * ​type (`DType`): The data type of the SIMD elements. * ​width (`Int`): The number of elements in each SIMD vector. * ​block\_size (`Int`): The total number of threads in the block. * ​broadcast (`Bool`): If True, the final sum is broadcast to all threads in the block. If False, only the first thread will have the complete sum. **Args:** * ​val (`SIMD[type, width]`): The SIMD value to reduce. Each thread contributes its value to the sum. **Returns:** If broadcast is True, each thread in the block will receive the final sum. Otherwise, only the first thread will have the complete sum. --- ## block_rank_in_cluster `block_rank_in_cluster() -> SIMD[uint32, 1]` Returns the unique identifier (rank) for the current thread block within its cluster. Note: * Only supported on NVIDIA SM90+ GPUs. * Maps directly to the `%cluster_ctarank` special register in CUDA PTX. **Returns:** A unique identifier in the range \[0, cluster\_size-1] where `cluster_size` is the total number of thread blocks in the cluster. --- ## cluster_arrive `cluster_arrive()` Signals arrival at a cluster synchronization point with memory ordering guarantees. This function ensures all prior memory operations from this thread block are visible to other thread blocks in the cluster before proceeding. Only supported on NVIDIA SM90+ GPUs. --- ## cluster_arrive_relaxed `cluster_arrive_relaxed()` Signals arrival at a cluster synchronization point with relaxed memory ordering. This is a relaxed version of cluster\_arrive() that does not enforce memory ordering guarantees. It should be used when memory ordering is not required between thread blocks in the cluster. Only supported on NVIDIA SM90+ GPUs. --- ## cluster_sync `cluster_sync()` Performs a full cluster synchronization with memory ordering guarantees. This is a convenience function that combines cluster\_arrive() and cluster\_wait() to provide a full barrier synchronization across all thread blocks in the cluster. Ensures memory ordering between thread blocks. Only supported on NVIDIA SM90+ GPUs. --- ## cluster_sync_acquire `cluster_sync_acquire()` Acquires the cluster sync proxy. Only supported on NVIDIA SM90+ GPUs. --- ## cluster_sync_relaxed `cluster_sync_relaxed()` Performs a full cluster synchronization with relaxed memory ordering. This is a convenience function that combines cluster\_arrive\_relaxed() and cluster\_wait() to provide a barrier synchronization across all thread blocks in the cluster without memory ordering guarantees. Only supported on NVIDIA SM90+ GPUs. --- ## cluster_sync_release `cluster_sync_release()` Releases the cluster sync proxy. Only supported on NVIDIA SM90+ GPUs. --- ## cluster_wait `cluster_wait()` Waits for all thread blocks in the cluster to arrive at the synchronization point. This function blocks until all thread blocks in the cluster have called cluster\_arrive() or cluster\_arrive\_relaxed(). Only supported on NVIDIA SM90+ GPUs.
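Taken together, the arrive/wait pair brackets work whose results must be visible cluster-wide before any block proceeds. A minimal device-side sketch (assuming these functions are importable from the `gpu.cluster` module and the kernel is launched with a cluster configuration on an SM90+ GPU; the kernel name and elided bodies are illustrative):

```mojo
from gpu.cluster import cluster_arrive, cluster_wait

fn cluster_kernel():
    # ... write data that peer blocks in the cluster will read ...

    # Make this block's prior writes visible across the cluster.
    cluster_arrive()

    # Block until every thread block in the cluster has arrived.
    cluster_wait()

    # ... now safe to read data produced by peer blocks ...
```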
--- ## clusterlaunchcontrol_query_cancel_get_first_ctaid `clusterlaunchcontrol_query_cancel_get_first_ctaid[id: String](result: UnsafePointer[SIMD[uint64, 1], address_space=AddressSpace(3)]) -> SIMD[uint32, 1]` Decodes the cancellation request. Only supported on NVIDIA SM100+ GPUs. **Parameters:** * ​id (`String`): The dimension to decode. Must be one of `x`, `y`, `z`. **Args:** * ​result (`UnsafePointer[SIMD[uint64, 1], address_space=AddressSpace(3)]`): A pointer to 2 `UInt64`s that make up the cancellation request result to decode. **Returns:** The coordinate of the first CTAID in the canceled cluster. --- ## clusterlaunchcontrol_query_cancel_get_first_ctaid_v4 `clusterlaunchcontrol_query_cancel_get_first_ctaid_v4(block_dim: UnsafePointer[SIMD[uint32, 1]], result: UnsafePointer[SIMD[uint64, 1], address_space=AddressSpace(3)])` Decodes the cancellation request. Only supported on NVIDIA SM100+ GPUs. **Args:** * ​block\_dim (`UnsafePointer[SIMD[uint32, 1]]`): A pointer to 4 `UInt32`s that will store the coordinates of the first CTAID in the canceled cluster. * ​result (`UnsafePointer[SIMD[uint64, 1], address_space=AddressSpace(3)]`): A pointer to 2 `UInt64`s that make up the cancellation request result to decode. --- ## clusterlaunchcontrol_query_cancel_is_canceled `clusterlaunchcontrol_query_cancel_is_canceled(result: UnsafePointer[SIMD[uint64, 1], address_space=AddressSpace(3)]) -> SIMD[uint32, 1]` Decodes the cancellation request. Only supported on NVIDIA SM100+ GPUs. **Args:** * ​result (`UnsafePointer[SIMD[uint64, 1], address_space=AddressSpace(3)]`): A pointer to 2 `UInt64`s that make up the cancellation request result to decode. **Returns:** True if the cancellation request is canceled, False otherwise. --- ## clusterlaunchcontrol_try_cancel `clusterlaunchcontrol_try_cancel[multicast: Bool = False](result: UnsafePointer[SIMD[uint64, 1], address_space=AddressSpace(3)], mbar: UnsafePointer[SIMD[int64, 1], address_space=AddressSpace(3)])` Requests to atomically cancel the cluster launch if it has not started running yet. Only supported on NVIDIA SM100+ GPUs. **Args:** * ​result (`UnsafePointer[SIMD[uint64, 1], address_space=AddressSpace(3)]`): A pointer to 2 `UInt64`s (16B aligned) that will store the result of the cancellation request. * ​mbar (`UnsafePointer[SIMD[int64, 1], address_space=AddressSpace(3)]`): A pointer to an `Int64` (8B aligned) memory barrier state. --- ## elect_one_sync `elect_one_sync() -> Bool` Elects a single thread within a warp to perform an operation. Note: * Only supported on NVIDIA SM90+ GPUs. * Maps directly to the `elect.sync` instruction in CUDA PTX. * Useful for having a single thread perform an operation while maintaining warp synchronization. **Returns:** True for the elected thread, False for all other threads in the warp. --- ## elect_one_sync_with_mask `elect_one_sync_with_mask(mask: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](4294967295)) -> Bool` Elects a single thread within a warp to perform an operation. Note: * Only supported on NVIDIA SM90+ GPUs. * Maps directly to the `elect.sync` instruction in CUDA PTX. * Useful for having a single thread perform an operation while maintaining warp synchronization. **Args:** * ​mask (`SIMD[uint32, 1]`): The mask to use for the election. Defaults to 0xFFFFFFFF. **Returns:** True for the elected thread, False for all other threads in the warp. --- ## cluster This module provides low-level NVIDIA GPU cluster synchronization primitives for SM90+ architectures. 
The module implements thread block cluster operations that enable efficient communication and synchronization between thread blocks (CTAs) within a cluster on NVIDIA Hopper architecture and newer GPUs. All functions are constrained to NVIDIA SM90+ GPUs and will raise an error if used on unsupported hardware. Note: These are low-level primitives that correspond directly to PTX/NVVM instructions and should be used with careful consideration of the underlying hardware synchronization mechanisms. ## Functions * [​`block_rank_in_cluster`](/mojo/stdlib/gpu/cluster/block_rank_in_cluster): Returns the unique identifier (rank) for the current thread block within its cluster. * [​`cluster_arrive`](/mojo/stdlib/gpu/cluster/cluster_arrive): Signals arrival at a cluster synchronization point with memory ordering guarantees. * [​`cluster_arrive_relaxed`](/mojo/stdlib/gpu/cluster/cluster_arrive_relaxed): Signals arrival at a cluster synchronization point with relaxed memory ordering. * [​`cluster_sync`](/mojo/stdlib/gpu/cluster/cluster_sync): Performs a full cluster synchronization with memory ordering guarantees. * [​`cluster_sync_acquire`](/mojo/stdlib/gpu/cluster/cluster_sync_acquire): Acquires the cluster sync proxy. * [​`cluster_sync_relaxed`](/mojo/stdlib/gpu/cluster/cluster_sync_relaxed): Performs a full cluster synchronization with relaxed memory ordering. * [​`cluster_sync_release`](/mojo/stdlib/gpu/cluster/cluster_sync_release): Releases the cluster sync proxy. * [​`cluster_wait`](/mojo/stdlib/gpu/cluster/cluster_wait): Waits for all thread blocks in the cluster to arrive at the synchronization point. * [​`clusterlaunchcontrol_query_cancel_get_first_ctaid`](/mojo/stdlib/gpu/cluster/clusterlaunchcontrol_query_cancel_get_first_ctaid): Decodes the cancellation request. * [​`clusterlaunchcontrol_query_cancel_get_first_ctaid_v4`](/mojo/stdlib/gpu/cluster/clusterlaunchcontrol_query_cancel_get_first_ctaid_v4): Decodes the cancellation request. * [​`clusterlaunchcontrol_query_cancel_is_canceled`](/mojo/stdlib/gpu/cluster/clusterlaunchcontrol_query_cancel_is_canceled): Decodes the cancellation request. * [​`clusterlaunchcontrol_try_cancel`](/mojo/stdlib/gpu/cluster/clusterlaunchcontrol_try_cancel): Requests to atomically cancel the cluster launch if it has not started running yet. * [​`elect_one_sync`](/mojo/stdlib/gpu/cluster/elect_one_sync): Elects a single thread within a warp to perform an operation. * [​`elect_one_sync_with_mask`](/mojo/stdlib/gpu/cluster/elect_one_sync_with_mask): Elects a single thread within a warp to perform an operation. --- ## allgather `allgather[type: DType, rank: Int, ngpus: Int, //](input_buffers: InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus], output_buffers: InlineArray[NDBuffer[type, rank, MutableAnyOrigin], (ngpus * ngpus)], ctxs: List[DeviceContext])` Performs all-gather across GPUs with variadic output. Each device receives individual copies of all input buffers. **Parameters:** * ​type (`DType`): The data type of tensor elements. * ​rank (`Int`): Number of dimensions in input tensors. * ​ngpus (`Int`): Number of GPUs participating in the all-gather. **Args:** * ​input\_buffers (`InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus]`): Input buffers from each GPU. * ​output\_buffers (`InlineArray[NDBuffer[type, rank, MutableAnyOrigin], (ngpus * ngpus)]`): Flat array of ngpus \* ngpus output buffers. Layout: output\_buffers\[device\_idx \* ngpus + input\_idx] contains device\_idx's copy of input\_idx's data.
* ​ctxs (`List[DeviceContext]`): List of device contexts for participating GPUs. --- ## allgather Multi-GPU allgather implementation that gathers values from multiple GPUs into an output buffer. ## Functions * [​`allgather`](/mojo/stdlib/gpu/comm/allgather/allgather): Performs all-gather across GPUs with variadic output. --- ## Signal `@register_passable(trivial)` `struct Signal` A synchronization primitive for coordinating GPU thread blocks across multiple devices. This struct provides counter-based synchronization between thread blocks on different GPUs. It maintains two sets of counters: 1. self\_counter: Used by blocks on the current GPU to signal their progress 2. peer\_counter: Used to track progress of blocks on other GPUs Note: The counters use unsigned integers that may overflow, but this is safe since unsigned integer overflow has well-defined behavior. ## Fields * ​self\_counter (`StaticTuple[StaticTuple[SIMD[uint32, 1], 8], 512]`): A 2D array of counters with shape (MAX\_NUM\_BLOCKS\_UPPER\_BOUND, MAX\_GPUS). Each counter tracks the progress of a specific thread block on the current GPU. Thread blocks increment their corresponding counter to signal completion of a phase, allowing other GPUs to detect when synchronization points are reached. The counters use atomic operations to ensure proper synchronization across devices. * ​peer\_counter (`StaticTuple[StaticTuple[StaticTuple[SIMD[uint32, 1], 8], 512], 2]`): A 3D array of counters with shape (2, MAX\_NUM\_BLOCKS\_UPPER\_BOUND, MAX\_GPUS). Contains two sets of counters to handle two synchronization points safely. The dual counter design prevents race conditions where a peer block arrives at the second sync point before the current block passes the first sync point. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` --- ## allreduce `allreduce[type: DType, rank: Int, ngpus: Int, outputs_lambda: fn[Int, DType, Int, Int, Int](IndexList[$2], SIMD[$1, $3]) capturing -> None, pdl_level: PDLLevel = PDLLevel()](input_buffers: InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus], output_buffers: InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus], rank_sigs: InlineArray[UnsafePointer[Signal], 8], ctxs: List[DeviceContext], _max_num_blocks: Optional[Int] = Optional(None))` Performs an allreduce operation across multiple GPUs. This function serves as the main entry point for performing allreduce operations across multiple GPUs. It automatically selects between two implementations: * A peer-to-peer (P2P) based implementation when P2P access is possible between GPUs * A naive implementation as fallback when P2P access is not available The allreduce operation combines values from all GPUs using element-wise addition and distributes the result back to all GPUs. Note: * Input and output buffers must have identical shapes across all GPUs. * The number of elements must be identical across all input/output buffers. * Performance is typically better with P2P access enabled between GPUs. **Parameters:** * ​type (`DType`): The data type of the tensor elements (e.g. DType.float32). * ​rank (`Int`): The number of dimensions in the input/output tensors. * ​ngpus (`Int`): The number of GPUs participating in the allreduce. * ​outputs\_lambda (`fn[Int, DType, Int, Int, Int](IndexList[$2], SIMD[$1, $3]) capturing -> None`): An output elementwise lambda. * ​pdl\_level (`PDLLevel`): Control PDL behavior for the kernel. 
**Args:** * ​input\_buffers (`InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus]`): Array of input tensors from each GPU, one per GPU. * ​output\_buffers (`InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus]`): Array of output tensors for each GPU to store results. * ​rank\_sigs (`InlineArray[UnsafePointer[Signal], 8]`): Array of Signal pointers used for cross-GPU synchronization. * ​ctxs (`List[DeviceContext]`): List of device contexts for each participating GPU. * ​\_max\_num\_blocks (`Optional[Int]`): Optional maximum number of blocks used to compute grid configuration. If not passed, a dispatch table sets the grid configuration. --- ## can_enable_p2p `can_enable_p2p(ctxs: List[DeviceContext]) -> Bool` If peer-to-peer access is supported, enables it between all GPU pairs. **Args:** * ​ctxs (`List[DeviceContext]`): List of device contexts representing different GPUs. **Returns:** True if P2P access is possible between all GPU pairs, False otherwise. --- ## allreduce Multi-GPU allreduce implementation for efficient tensor reduction across GPUs. This module provides an optimized implementation of allreduce operations across multiple GPUs, supporting both peer-to-peer (P2P) and non-P2P communication patterns. The implementation automatically selects between two approaches based on hardware capabilities: 1. P2P-based implementation (when P2P access is available): * Uses direct GPU-to-GPU memory access for better performance * Implements both single-stage and two-stage algorithms: * Single-stage for latency-bound transfers (small tensors) * Two-stage (reduce-scatter + all-gather) for bandwidth-bound transfers (large tensors) * Optimized for NVLink bandwidth utilization * Uses vectorized memory access and higher precision accumulation 2. Non-P2P fallback implementation: * Copies data through host memory when direct GPU access isn't possible * Simple but functional approach for systems without P2P support The implementation is tuned for common GPU architectures (A100, H100) and includes parameters that can be adjusted for different hardware configurations. Limitations: * Number of elements must be a multiple of SIMD width * Maximum of 8 GPUs supported * All input/output buffers must have identical shapes ## Aliases ### `elementwise_epilogue_type` `alias elementwise_epilogue_type = fn[Int, DType, Int, Int, Int](IndexList[$2], SIMD[$1, $3]) capturing -> None` ### `MAX_GPUS` `alias MAX_GPUS = 8` Maximum number of GPUs supported in the allreduce implementation. This constant sets the upper bound for the number of GPUs supported in this algorithm. ### `MAX_NUM_BLOCKS_UPPER_BOUND` `alias MAX_NUM_BLOCKS_UPPER_BOUND = 512` Maximum number of thread blocks to use for reduction kernels. This value has been empirically optimized through grid search across different GPU architectures. While this value is optimal for A100 GPUs, H100 GPUs may benefit from more blocks to fully saturate NVLink bandwidth. ## Structs * [​`Signal`](/mojo/stdlib/gpu/comm/allreduce/Signal): A synchronization primitive for coordinating GPU thread blocks across multiple devices. ## Functions * [​`allreduce`](/mojo/stdlib/gpu/comm/allreduce/allreduce): Performs an allreduce operation across multiple GPUs. * [​`can_enable_p2p`](/mojo/stdlib/gpu/comm/allreduce/can_enable_p2p): If peer-to-peer access is supported, enables it between all GPU pairs. --- ## comm The `gpu.comm` package provides communication primitives for GPUs.
This package includes functions for sending and receiving data between GPUs, as well as for synchronizing threads across GPUs. ## Modules * [​`allgather`](/mojo/stdlib/gpu/comm/allgather/): Multi-GPU allgather implementation that gathers values from multiple GPUs into an output buffer. * [​`allreduce`](/mojo/stdlib/gpu/comm/allreduce/): Multi-GPU allreduce implementation for efficient tensor reduction across GPUs. --- ## globals This module provides GPU-specific global constants and configuration values. The module defines hardware-specific constants like warp size and thread block limits that are used throughout the GPU programming interface. It handles both NVIDIA and AMD GPU architectures, automatically detecting and configuring the appropriate values based on the available hardware. The constants are resolved at compile time based on the target GPU architecture and are used to optimize code generation and ensure hardware compatibility. ## Aliases ### `MAX_THREADS_PER_BLOCK_METADATA` `alias MAX_THREADS_PER_BLOCK_METADATA = _resolve_max_threads_per_block_metadata()` This is a metadata tag used in conjunction with \_\_llvm\_metadata to give the compiler a hint about the maximum number of threads per block in use. ### `WARP_SIZE` `alias WARP_SIZE = _resolve_warp_size()` The number of threads that execute in lockstep within a warp on the GPU. This constant represents the hardware warp size, which is the number of threads that execute instructions synchronously as a unit. The value is architecture-dependent: * 32 threads per warp on NVIDIA GPUs * 64 threads per warp on AMD GPUs * 0 if no GPU is detected The warp size is a fundamental parameter that affects: * Thread scheduling and execution * Memory access coalescing * Synchronization primitives * Overall performance optimization --- ## PDL `struct PDL` Programmatic Dependency Launch (PDL) control structure. This struct provides a way to manage programmatic stream serialization on NVIDIA GPUs. It includes functions for launching dependent grids and waiting for them to complete. Note: * Only supported on NVIDIA SM90+ (Hopper architecture and newer) GPUs. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initialize the PDL control structure. ### `__enter__` `__enter__(self)` Launch dependent grids that were previously configured to depend on the current grid. ### `__exit__` `__exit__(self)` Wait for all dependent grids launched by this grid to complete execution. --- ## PDLLevel `@register_passable(trivial)` `struct PDLLevel` Programmatic Dependency Launch (PDL) level. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Aliases ### `NO_WAIT_OVERLAP_AT_END` `alias NO_WAIT_OVERLAP_AT_END = PDLLevel(3)` ### `OFF` `alias OFF = PDLLevel(0)` ### `OVERLAP_AT_BEGINNING` `alias OVERLAP_AT_BEGINNING = PDLLevel(2)` ### `OVERLAP_AT_END` `alias OVERLAP_AT_END = PDLLevel(1)` ## Methods ### `__init__` `__init__() -> Self` Initialize the PDL level to OFF. `__init__(level: Int) -> Self` Initialize the PDL level. **Args:** * ​level (`Int`): The PDL level to initialize. ### `__eq__` `__eq__(self, other: Self) -> Bool` Check if the PDL level is equal to another PDL level. **Args:** * ​other (`Self`): The other PDL level to compare against. **Returns:** True if the PDL level is equal to the other PDL level, False otherwise. `__eq__(self, other: Int) -> Bool` Check if the PDL level is equal to another PDL level.
**Args:** * ​other (`Int`): The other PDL level to compare against. **Returns:** True if the PDL level is equal to the other PDL level, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Check if the PDL level is not equal to another PDL level. **Args:** * ​other (`Self`): The other PDL level to compare against. **Returns:** True if the PDL level is not equal to the other PDL level, False otherwise. ### `__gt__` `__gt__(self, other: Self) -> Bool` Check if the PDL level is greater than another PDL level. **Args:** * ​other (`Self`): The other PDL level to compare against. **Returns:** True if the PDL level is greater than the other PDL level, False otherwise. ### `__ge__` `__ge__(self, other: Self) -> Bool` Check if the PDL level is greater than or equal to another PDL level. **Args:** * ​other (`Self`): The other PDL level to compare against. **Returns:** True if the PDL level is greater than or equal to the other PDL level, False otherwise. --- ## grid_controls Grid Dependent Control primitives for NVIDIA Hopper (SM90+) GPUs. This module provides low-level primitives for managing grid dependencies on NVIDIA Hopper architecture and newer GPUs. It enables efficient orchestration of multi-grid workloads by allowing grids to launch dependent grids and synchronize with them. The module includes functions that map directly to CUDA grid dependency control instructions, providing fine-grained control over grid execution order: * `launch_dependent_grids()`: Triggers execution of grids that depend on the current grid * `wait_on_dependent_grids()`: Blocks until all dependent grids complete execution These primitives are essential for implementing complex GPU execution pipelines where multiple kernels need to execute in a specific order with minimal overhead. They eliminate the need for host-side synchronization when orchestrating dependent GPU work. ## Structs * [​`PDL`](/mojo/stdlib/gpu/grid_controls/PDL): Programmatic Dependency Launch (PDL) control structure. * [​`PDLLevel`](/mojo/stdlib/gpu/grid_controls/PDLLevel): Programmatic Dependency Launch (PDL) level. ## Functions * [​`launch_dependent_grids`](/mojo/stdlib/gpu/grid_controls/launch_dependent_grids): Launches dependent grids that were previously configured to depend on the current grid. * [​`wait_on_dependent_grids`](/mojo/stdlib/gpu/grid_controls/wait_on_dependent_grids): Waits for all dependent grids launched by this grid to complete execution. --- ## launch_dependent_grids `launch_dependent_grids()` Launches dependent grids that were previously configured to depend on the current grid. This function triggers the execution of dependent grids that have been configured with a dependency on the current grid. It maps directly to the CUDA grid dependency control instruction for launching dependent grids. Note: * Only supported on NVIDIA SM90+ (Hopper architecture and newer) GPUs. * Must be called by all threads in a thread block to avoid undefined behavior. * Typically used in multi-grid pipeline scenarios where one grid's completion should trigger the execution of other grids. --- ## wait_on_dependent_grids `wait_on_dependent_grids()` Waits for all dependent grids launched by this grid to complete execution. This function blocks the calling grid until all dependent grids that were launched by this grid have completed their execution. It provides a synchronization point between parent and child grids in a multi-grid dependency chain. Note: * Only supported on NVIDIA SM90+ (Hopper architecture and newer) GPUs.
* Must be called by all threads in a thread block to avoid undefined behavior. * Can be used to ensure dependent grid work is complete before proceeding with subsequent operations in the parent grid. --- ## ConstantMemoryMapping `@register_passable(trivial)` `struct ConstantMemoryMapping` Represents a mapping of constant memory between host and device. This struct encapsulates the information needed to manage constant memory that can be accessed by GPU kernels. Constant memory provides a fast, read-only cache accessible by all threads on the GPU device. Attributes: name: A string identifier for the constant memory mapping. ptr: Pointer to the memory location. byte\_count: Size of the memory mapping in bytes. ## Fields * ​name (`StringSlice[StaticConstantOrigin]`): A string identifier for the constant memory mapping. This name is used to uniquely identify the constant memory region in the GPU programming model, allowing the runtime to properly associate the memory with kernel references to constant memory symbols. * ​ptr (`UnsafePointer[NoneType]`): Pointer to the host memory location that will be mapped to device constant memory. This raw pointer represents the starting address of the memory region that will be accessible as constant memory on the GPU. The memory should remain valid for the lifetime of any kernels that access it. * ​byte\_count (`Int`): Size of the memory mapping in bytes. Specifies the total size of the constant memory region. This value is used by the runtime to determine how much data to transfer between host and device. The size must be sufficient to hold all data needed by GPU kernels. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` --- ## constant_memory_mapping This module provides functionality for mapping constant memory between host and device. The module includes the `ConstantMemoryMapping` struct which represents a mapping of constant memory that can be used for efficient data transfer between host and GPU device. ## Structs * [​`ConstantMemoryMapping`](/mojo/stdlib/gpu/host/constant_memory_mapping/ConstantMemoryMapping): Represents a mapping of constant memory between host and device. --- ## DeviceAttribute `@register_passable(trivial)` `struct DeviceAttribute` Represents CUDA device attributes that can be queried from a GPU device. This struct encapsulates the various device properties and capabilities that can be queried through the CUDA driver API. Each attribute is represented as a constant with a corresponding integer value that maps to the CUDA driver's attribute enum. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `CLOCK_RATE` `alias CLOCK_RATE = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](13))` Typical clock frequency in kilohertz ### `COMPUTE_CAPABILITY_MAJOR` `alias COMPUTE_CAPABILITY_MAJOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](75))` Major compute capability version number ### `COMPUTE_CAPABILITY_MINOR` `alias COMPUTE_CAPABILITY_MINOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](76))` Minor compute capability version number ### `MAX_ACCESS_POLICY_WINDOW_SIZE` `alias MAX_ACCESS_POLICY_WINDOW_SIZE = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](109))` CUDA-only: Maximum value of CUaccessPolicyWindow::num\_bytes. 
### `MAX_BLOCK_DIM_X` `alias MAX_BLOCK_DIM_X = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](2))` Maximum block dimension X ### `MAX_BLOCK_DIM_Y` `alias MAX_BLOCK_DIM_Y = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](3))` Maximum block dimension Y ### `MAX_BLOCK_DIM_Z` `alias MAX_BLOCK_DIM_Z = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](4))` Maximum block dimension Z ### `MAX_BLOCKS_PER_MULTIPROCESSOR` `alias MAX_BLOCKS_PER_MULTIPROCESSOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](106))` Maximum resident blocks per multiprocessor ### `MAX_GRID_DIM_X` `alias MAX_GRID_DIM_X = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](5))` Maximum grid dimension X ### `MAX_GRID_DIM_Y` `alias MAX_GRID_DIM_Y = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](6))` Maximum grid dimension Y ### `MAX_GRID_DIM_Z` `alias MAX_GRID_DIM_Z = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](7))` Maximum grid dimension Z ### `MAX_REGISTERS_PER_BLOCK` `alias MAX_REGISTERS_PER_BLOCK = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](12))` Maximum number of 32-bit registers available per block ### `MAX_REGISTERS_PER_MULTIPROCESSOR` `alias MAX_REGISTERS_PER_MULTIPROCESSOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](82))` Maximum number of 32-bit registers available per multiprocessor ### `MAX_SHARED_MEMORY_PER_BLOCK` `alias MAX_SHARED_MEMORY_PER_BLOCK = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](8))` Maximum shared memory available per block in bytes ### `MAX_SHARED_MEMORY_PER_MULTIPROCESSOR` `alias MAX_SHARED_MEMORY_PER_MULTIPROCESSOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](81))` Maximum shared memory available per multiprocessor in bytes ### `MAX_THREADS_PER_BLOCK` `alias MAX_THREADS_PER_BLOCK = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](1))` Maximum number of threads per block ### `MAX_THREADS_PER_MULTIPROCESSOR` `alias MAX_THREADS_PER_MULTIPROCESSOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](39))` Maximum resident threads per multiprocessor ### `MULTIPROCESSOR_COUNT` `alias MULTIPROCESSOR_COUNT = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](16))` Number of multiprocessors on device ### `WARP_SIZE` `alias WARP_SIZE = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](10))` Warp size in threads --- ## device_attribute This module defines GPU device attributes that can be queried from CUDA-compatible devices. The module provides the `DeviceAttribute` struct which encapsulates the various device properties and capabilities that can be queried through the CUDA driver API. Each attribute is represented as a constant with a corresponding integer value that maps to the CUDA driver's attribute enumeration. These attributes allow applications to query specific hardware capabilities and limitations of GPU devices, such as maximum thread counts, memory sizes, compute capabilities, and supported features. ## Structs * [​`DeviceAttribute`](/mojo/stdlib/gpu/host/device_attribute/DeviceAttribute): Represents CUDA device attributes that can be queried from a GPU device. --- ## DeviceBuffer `struct DeviceBuffer[type: DType]` Represents a block of device-resident storage. For GPU devices, a device buffer is allocated in the device's global memory. To allocate a `DeviceBuffer`, use one of the methods provided by `DeviceContext`, such as [`enqueue_create_buffer()`](/mojo/stdlib/gpu/host/device_context/DeviceContext#enqueue_create_buffer). 
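For example, this minimal sketch (the buffer length here is arbitrary) allocates a buffer and fills it with zeros; because `enqueue_fill()` returns the buffer, the two calls can be chained:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # Allocate a 1024-element float32 buffer in the device's global
    # memory and zero it on the context's stream.
    var buf = ctx.enqueue_create_buffer[DType.float32](1024).enqueue_fill(0)
    ctx.synchronize()
```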
## Parameters * ​type (`DType`): Data type to be stored in the buffer. ## Implemented traits `AnyType`, `Copyable`, `DevicePassable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `device_type` `alias device_type = UnsafePointer[SIMD[type, 1]]` DeviceBuffer types are remapped to UnsafePointer when passed to accelerator devices. ## Methods ### `__copyinit__` `__copyinit__(out self, existing: Self)` Creates a copy of an existing device buffer by incrementing its reference count. This copy constructor creates a new reference to the same underlying device buffer by incrementing the reference count of the native buffer object. Both the original and the copy will refer to the same memory on the device. **Args:** * ​existing (`Self`): The device buffer to copy. ### `__moveinit__` `__moveinit__(out self, owned existing: Self)` Initializes this buffer by taking ownership of an existing buffer. This move constructor transfers ownership of the device buffer from the existing instance to the new instance without incrementing the reference count. **Args:** * ​existing (`Self`): The buffer to move from, which will no longer be valid after this call. ### `__del__` `__del__(owned self)` Releases resources associated with this device buffer. This function schedules an owned buffer free using the stream in the device context. The actual deallocation may occur asynchronously after all operations using this buffer have completed. ### `get_type_name` `static get_type_name() -> String` Gets this type's name, for use in error messages when handing arguments to kernels. TODO: This will go away soon, when we get better error messages for kernel calls. **Returns:** This type's name. ### `get_device_type_name` `static get_device_type_name() -> String` Gets device\_type's name, for use in error messages when handing arguments to kernels. TODO: This will go away soon, when we get better error messages for kernel calls. **Returns:** This type's name. ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value. ### `__len__` `__len__(self) -> Int` Returns the number of elements in this buffer. This method calculates the number of elements by dividing the total byte size of the buffer by the size of each element. **Returns:** The number of elements in the buffer. ### `create_sub_buffer` `create_sub_buffer[view_type: DType](self, offset: Int, size: Int) -> DeviceBuffer[view_type]` Creates a sub-buffer view of this buffer with a different element type. This method creates a new buffer that references a subset of the memory in this buffer, potentially with a different element type. The sub-buffer shares the underlying memory with the original buffer. **Parameters:** * ​view\_type (`DType`): The data type for elements in the new sub-buffer. **Args:** * ​offset (`Int`): The starting offset in elements from the beginning of this buffer. * ​size (`Int`): The number of elements in the new sub-buffer. **Returns:** A new DeviceBuffer referencing the specified region with the specified element type. ### `enqueue_copy_to` `enqueue_copy_to(self, dst: Self)` Enqueues an asynchronous copy from this buffer to another device buffer. This method schedules a memory copy operation from this buffer to the destination buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​dst (`Self`): The destination device buffer to copy data to. 
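For example, a device-to-device copy might look like this minimal sketch (buffer sizes are arbitrary):

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # Create a source buffer filled with ones and an uninitialized
    # destination buffer of the same size.
    var src = ctx.enqueue_create_buffer[DType.float32](256).enqueue_fill(1)
    var dst = ctx.enqueue_create_buffer[DType.float32](256)
    # Schedule the copy on the context's stream, then wait for it.
    src.enqueue_copy_to(dst)
    ctx.synchronize()
```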
`enqueue_copy_to(self, dst: HostBuffer[type])` Enqueues an asynchronous copy from this buffer to a host buffer. This method schedules a memory copy operation from this buffer to the destination buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​dst (`HostBuffer[type]`): The destination host buffer to copy data to. `enqueue_copy_to(self, dst_ptr: UnsafePointer[SIMD[type, 1]])` Enqueues an asynchronous copy from this buffer to host memory. This method schedules a memory copy operation from this device buffer to the specified host memory location. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​dst\_ptr (`UnsafePointer[SIMD[type, 1]]`): Pointer to the destination host memory location. ### `enqueue_copy_from` `enqueue_copy_from(self, src: Self)` Enqueues an asynchronous copy to this buffer from another device buffer. This method schedules a memory copy operation to this buffer from the source buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​src (`Self`): The source device buffer to copy data from. `enqueue_copy_from(self, src: HostBuffer[type])` Enqueues an asynchronous copy to this buffer from a host buffer. This method schedules a memory copy operation to this buffer from the source buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​src (`HostBuffer[type]`): The source host buffer to copy data from. `enqueue_copy_from(self, src_ptr: UnsafePointer[SIMD[type, 1]])` Enqueues an asynchronous copy to this buffer from host memory. This method schedules a memory copy operation to this device buffer from the specified host memory location. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​src\_ptr (`UnsafePointer[SIMD[type, 1]]`): Pointer to the source host memory location. ### `enqueue_fill` `enqueue_fill(self, val: SIMD[type, 1]) -> Self` Enqueues an operation to fill this buffer with a specified value. This method schedules a memory set operation that fills the entire buffer with the specified value. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​val (`SIMD[type, 1]`): The value to fill the buffer with. **Returns:** Self reference for method chaining. ### `reassign_ownership_to` `reassign_ownership_to(self, ctx: DeviceContext)` Transfers ownership of this buffer to another device context. This method changes the device context that owns this buffer. This can be useful when sharing buffers between different contexts or when migrating workloads between devices. **Args:** * ​ctx (`DeviceContext`): The new device context to take ownership of this buffer. ### `take_ptr` `take_ptr(owned self) -> UnsafePointer[SIMD[type, 1]]` Takes ownership of the device pointer from this buffer. This method releases the device pointer from the buffer's control and returns it to the caller. After this call, the buffer no longer owns the pointer, and the caller is responsible for managing its lifecycle. **Returns:** The raw device pointer that was owned by this buffer. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[SIMD[type, 1]]` Returns the raw device pointer without transferring ownership. This method provides direct access to the underlying device pointer for advanced use cases. 
The buffer retains ownership of the pointer. **Returns:** The raw device pointer owned by this buffer. ### `context` `context(self) -> DeviceContext` Returns the device context associated with this buffer. This method retrieves the device context that owns this buffer and is responsible for managing its lifecycle and operations. **Returns:** The device context associated with this buffer. ### `map_to_host` `map_to_host(self, out mapped_buffer: _HostMappedBuffer[type])` Maps this device buffer to host memory for CPU access. This method creates a host-accessible view of the device buffer's contents. The mapping operation may involve copying data from device to host memory. Notes: Values modified inside the `with` statement are updated on the device when the `with` statement exits. Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
var length = 1024
var in_dev = ctx.enqueue_create_buffer[DType.float32](length)
var out_dev = ctx.enqueue_create_buffer[DType.float32](length)

# Initialize the input and output with known values.
with in_dev.map_to_host() as in_host, out_dev.map_to_host() as out_host:
    for i in range(length):
        in_host[i] = i
        out_host[i] = 255
```

**Returns:** A host-mapped buffer that provides CPU access to the device buffer's contents inside a with-statement. **Raises:** If there's an error during buffer creation or data transfer. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes a string representation of this buffer to the provided writer. This method formats the buffer's contents as a string and writes it to the specified writer. For large buffers, a compact representation is used. **Parameters:** * ​W (`Writer`): The writer type. **Args:** * ​writer (`W`): The writer to output the formatted string to. ### `__str__` `__str__(self) -> String` Returns a string representation of the `DeviceBuffer`. This method creates a human-readable string representation of the buffer's contents by mapping the device memory to host memory and formatting the elements. **Returns:** A string containing the formatted buffer contents. --- ## DeviceContext `@register_passable` `struct DeviceContext` Represents a single stream of execution on a particular accelerator (GPU). A `DeviceContext` serves as the low-level interface to the accelerator inside a MAX [custom operation](/max/custom-ops/) and provides methods for allocating buffers on the device, copying data between host and device, and for compiling and running functions (also known as kernels) on the device. The device context can be used as a [context manager](/mojo/manual/errors#use-a-context-manager).
For example:

```mojo
from gpu.host import DeviceContext
from gpu import thread_idx

fn kernel():
    print("hello from thread:", thread_idx.x, thread_idx.y, thread_idx.z)

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=(2, 2, 2))
    ctx.synchronize()
```

A custom operation receives an opaque `DeviceContextPtr`, which provides a `get_device_context()` method to retrieve the device context:

```mojo
from runtime.asyncrt import DeviceContextPtr

# Assumes the `kernel` function defined in the previous example and the
# MAX custom-op `register` decorator.
@register("custom_op")
struct CustomOp:
    @staticmethod
    fn execute(ctx_ptr: DeviceContextPtr) raises:
        var ctx = ctx_ptr.get_device_context()
        ctx.enqueue_function[kernel](grid_dim=1, block_dim=(2, 2, 2))
        ctx.synchronize()
```

## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `device_api` `alias device_api = from_name[::StringSlice[::Bool().api` Device API for the default accelerator (for example, "cuda" or "hip"). ### `device_info` `alias device_info = from_name[::StringSlice[::Bool()` `gpu.info.Info` object for the default accelerator. ## Methods ### `__init__` `__init__(out self, device_id: Int = 0, *, owned api: String = String(from_name[::StringSlice[::Bool()))` Constructs a `DeviceContext` for the specified device. This initializer creates a new device context for the specified accelerator device. The device context provides an interface for interacting with the GPU, including memory allocation, data transfer, and kernel execution. Example:

```mojo
from gpu.host import DeviceContext

# Create a context for the default GPU
var ctx = DeviceContext()

# Create a context for a specific GPU (device 1)
var ctx2 = DeviceContext(1)
```

**Args:** * ​device\_id (`Int`): ID of the accelerator device. If not specified, uses the default accelerator (device 0). * ​api (`String`): Requested device API (for example, "cuda" or "hip"). Defaults to the device API specified by the DeviceContext class. **Raises:** If device initialization fails or the specified device is not available. ### `__copyinit__` `__copyinit__(existing: Self) -> Self` Creates a copy of an existing device context by incrementing its reference count. This copy constructor creates a new reference to the same underlying device context by incrementing the reference count of the native context object. Both the original and the copy will refer to the same device context. **Args:** * ​existing (`Self`): The device context to copy. ### `__del__` `__del__(owned self)` Releases resources associated with this device context. This destructor decrements the reference count of the native device context. When the reference count reaches zero, the underlying resources are released, including any cached memory buffers and compiled device functions. ### `copy` `copy(self) -> Self` Explicitly constructs a copy of this device context. This method creates a new reference to the same underlying device context by incrementing the reference count of the native context object. **Returns:** A copy of this device context that refers to the same underlying context. ### `__enter__` `__enter__(owned self) -> Self` Enables the use of DeviceContext in a 'with' statement context manager. This method allows DeviceContext to be used with Python-style context managers, which ensures proper resource management and cleanup when the context exits.
Example:

```mojo
from gpu.host import DeviceContext

# Using DeviceContext as a context manager
with DeviceContext() as ctx:
    # Perform GPU operations here; resources are automatically
    # released when exiting the block.
    pass
```

**Returns:** The DeviceContext instance to be used within the context manager block. ### `name` `name(self) -> String` Returns the device name, an ASCII string identifying this device, defined by the native device API. This method queries the underlying GPU device for its name, which typically includes the model and other identifying information. This can be useful for logging, debugging, or making runtime decisions based on the specific GPU hardware. Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
print("Running on device:", ctx.name())
```

**Returns:** A string containing the device name. ### `api` `api(self) -> String` Returns the name of the API used to program the device. This method queries the underlying device context to determine which GPU programming API is being used for the current device. This information is useful for writing code that can adapt to different GPU architectures and programming models. Possible values are: * "cpu": Generic host device (CPU). * "cuda": NVIDIA GPUs. * "hip": AMD GPUs. Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
var api_name = ctx.api()
print("Using device API:", api_name)

# Conditionally execute code based on the API
if api_name == "cuda":
    print("Running on NVIDIA GPU")
elif api_name == "hip":
    print("Running on AMD GPU")
```

**Returns:** A string identifying the device API. ### `enqueue_create_buffer` `enqueue_create_buffer[type: DType](self, size: Int) -> DeviceBuffer[type]` Enqueues a buffer creation using the `DeviceBuffer` constructor. For GPU devices, the space is allocated in the device's global memory. **Parameters:** * ​type (`DType`): The data type to be stored in the allocated memory. **Args:** * ​size (`Int`): The number of elements of `type` to allocate memory for. **Returns:** The allocated buffer. ### `create_buffer_sync` `create_buffer_sync[type: DType](self, size: Int) -> DeviceBuffer[type]` Creates a buffer synchronously using the `DeviceBuffer` constructor. **Parameters:** * ​type (`DType`): The data type to be stored in the allocated memory. **Args:** * ​size (`Int`): The number of elements of `type` to allocate memory for. **Returns:** The allocated buffer. ### `enqueue_create_host_buffer` `enqueue_create_host_buffer[type: DType](self, size: Int) -> HostBuffer[type]` Enqueues the creation of a HostBuffer. This function allocates memory on the host that is accessible by the device. The memory is page-locked (pinned) for efficient data transfer between host and device. Pinned memory is guaranteed to remain resident in the host's RAM, not be paged/swapped out to disk. Memory allocated normally (for example, using [`UnsafePointer.alloc()`](/mojo/stdlib/memory/unsafe_ptr/UnsafePointer#alloc)) is pageable—individual pages of memory can be moved to secondary storage (disk/SSD) when main memory fills up. Using pinned memory allows devices to make fast transfers between host memory and device memory, because they can use direct memory access (DMA) to transfer data without relying on the CPU. Allocating too much pinned memory can cause performance issues, since it reduces the amount of memory available for other processes.
Example:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # Allocate host memory accessible by the device
    var host_buffer = ctx.enqueue_create_host_buffer[DType.float32](1024)
    # Use the host buffer for device operations
    # ...
```

**Parameters:** * ​type (`DType`): The data type to be stored in the allocated memory. **Args:** * ​size (`Int`): The number of elements of `type` to allocate memory for. **Returns:** A `HostBuffer` object that wraps the allocated host memory. **Raises:** If memory allocation fails or if the device context is invalid. ### `compile_function` `compile_function[func_type: AnyTrivialRegType, //, func: func_type, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(None), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])` Compiles the provided function for execution on this device. **Parameters:** * ​func\_type (`AnyTrivialRegType`): Type of the function. * ​func (`func_type`): The function to compile. * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). * ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`. **Args:** * ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size). **Returns:** The compiled function.
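As a sketch of typical usage (the same pattern shown in the `enqueue_function` examples below), compile a kernel once and reuse it across launches:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    # Compile once, then launch the compiled kernel repeatedly without
    # paying the per-enqueue compilation overhead.
    var compiled = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled, grid_dim=1, block_dim=1)
    ctx.synchronize()
```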
### `compile_function_unchecked` `compile_function_unchecked[func_type: AnyTrivialRegType, //, func: func_type, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(None), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])` Compiles the provided function for execution on this device. **Parameters:** * ​func\_type (`AnyTrivialRegType`): Type of the function. * ​func (`func_type`): The function to compile. * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). * ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`. **Args:** * ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size). **Returns:** The compiled function. ### `compile_function_checked` `compile_function_checked[func_type: AnyTrivialRegType, declared_arg_types: Variadic[AnyType], //, func: func_type, signature_func: fn(*args: *declared_arg_types) -> None, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(declared_arg_types), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])` Compiles the provided function for execution on this device. **Parameters:** * ​func\_type (`AnyTrivialRegType`): Type of the function. * ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function. * ​func (`func_type`): The function to compile. * ​signature\_func (`fn(*args: *declared_arg_types) -> None`): The function to compile, passed in again. Used for checking argument types later. Note: This will disappear in future versions.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). * ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`. **Args:** * ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size). **Returns:** The compiled function. `compile_function_checked[func_type: AnyTrivialRegType, declared_arg_types: Variadic[AnyType], //, func: func_type, signature_func: fn(*args: *declared_arg_types) capturing -> None, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(declared_arg_types), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])` Compiles the provided function for execution on this device. **Parameters:** * ​func\_type (`AnyTrivialRegType`): Type of the function. * ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function. * ​func (`func_type`): The function to compile. * ​signature\_func (`fn(*args: *declared_arg_types) capturing -> None`): The function to compile, passed in again. Used for checking argument types later. Note: This will disappear in future versions. * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). * ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`.
**Args:** * ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size). **Returns:** The compiled function. ### `compile_function_experimental` `compile_function_experimental[declared_arg_types: Variadic[AnyType], //, func: fn(*args: *declared_arg_types) -> None, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(declared_arg_types), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])` Compiles the provided function for execution on this device. **Parameters:** * ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function. * ​func (`fn(*args: *declared_arg_types) -> None`): The function to compile. * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). * ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`. **Args:** * ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size). **Returns:** The compiled function. `compile_function_experimental[declared_arg_types: Variadic[AnyType], //, func: fn(*args: *declared_arg_types) capturing -> None, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(declared_arg_types), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])` Compiles the provided function for execution on this device. **Parameters:** * ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function. * ​func (`fn(*args: *declared_arg_types) capturing -> None`): The function to compile.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). * ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`. **Args:** * ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size). **Returns:** The compiled function. ### `load_function` `load_function[func_type: AnyTrivialRegType, //, func: func_type](self, *, function_name: StringSlice[origin], asm: StringSlice[origin], func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceExternalFunction)` Loads a pre-compiled device function from assembly code. This method loads an external GPU function from provided assembly code (PTX/SASS) rather than compiling it from Mojo source. This is useful for integrating with existing CUDA/HIP code or for using specialized assembly optimizations. Example:

```mojo
from gpu.host import DeviceContext
from gpu.host.device_context import DeviceExternalFunction

fn func_signature(
    # Arguments being passed to the assembly code
    # e.g. two pointers and a length
    input: UnsafePointer[Float32],
    output: UnsafePointer[Float32],
    len: Int,
):
    # No body because that is passed as assembly code below.
    pass

var ctx = DeviceContext()
var ptx_code = "..."  # PTX assembly code
var ext_func = ctx.load_function[func_signature](
    function_name="my_kernel",
    asm=ptx_code,
)
```

**Parameters:** * ​func\_type (`AnyTrivialRegType`): The type of the function to load. * ​func (`func_type`): The function reference. **Args:** * ​function\_name (`StringSlice[origin]`): The name of the function in the assembly code. * ​asm (`StringSlice[origin]`): The assembly code (PTX/SASS) containing the function. * ​func\_attribute (`OptionalReg[FuncAttribute]`): Optional attribute to apply to the function (such as maximum shared memory size). **Returns:** The loaded function is stored in the `result` parameter. **Raises:** If loading the function fails or the assembly code is invalid.
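A loaded function is launched with `enqueue_function()` (see below). The following hypothetical continuation assumes the `ctx` and `ext_func` variables from the example above, and passes arguments matching `func_signature`:

```mojo
# Allocate device buffers for the kernel's pointer arguments.
var length = 1024
var input = ctx.enqueue_create_buffer[DType.float32](length)
var output = ctx.enqueue_create_buffer[DType.float32](length)

# Launch the loaded PTX kernel; the argument list must match the
# signature declared by `func_signature`.
ctx.enqueue_function(
    ext_func,
    input.unsafe_ptr(),
    output.unsafe_ptr(),
    length,
    grid_dim=4,
    block_dim=256,
)
ctx.synchronize()
```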
### `enqueue_function` `enqueue_function[func_type: AnyTrivialRegType, //, func: func_type, *Ts: AnyType, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))` Compiles and enqueues a kernel for execution on this device. You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile it first to remove the overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:** * ​func\_type (`AnyTrivialRegType`): The type of the function to launch. * ​func (`func_type`): The function to launch. * ​\*Ts (`AnyType`): The types of the arguments being passed to the function. * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). **Args:** * ​\*args (`*Ts`): Variadic arguments which are passed to the `func`. * ​grid\_dim (`Dim`): The grid dimensions. * ​block\_dim (`Dim`): The block dimensions. * ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions. * ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block. * ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes. * ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings. * ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum.
`enqueue_function[*Ts: AnyType](self, f: DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()))` Enqueues a compiled function for execution on this device. You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile the function first to remove the overhead:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:** * ​\*Ts (`AnyType`): Argument types. **Args:** * ​f (`DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose]`): The compiled function to execute. * ​\*args (`*Ts`): Arguments to pass to the function. * ​grid\_dim (`Dim`): Dimensions of the compute grid, made up of thread blocks. * ​block\_dim (`Dim`): Dimensions of each thread block in the grid. * ​cluster\_dim (`OptionalReg[Dim]`): Dimensions of clusters (if the thread blocks are grouped into clusters). * ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block. * ​attributes (`List[LaunchAttribute]`): Launch attributes. * ​constant\_memory (`List[ConstantMemoryMapping]`): Constant memory mapping. `enqueue_function[*Ts: AnyType](self, f: DeviceExternalFunction, *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()))` Enqueues an external device function for asynchronous execution on the GPU. This method schedules an external device function to be executed on the GPU with the specified execution configuration. The function and its arguments are passed to the underlying GPU runtime, which will execute them when resources are available. Example:

```mojo
from gpu.host import DeviceContext, Dim
from gpu.host.device_context import DeviceExternalFunction

# Create a device context and load an external function
with DeviceContext() as ctx:
    var ext_func = DeviceExternalFunction("my_kernel")

    # Enqueue the external function with execution configuration
    ctx.enqueue_function(
        ext_func,
        grid_dim=Dim(16),
        block_dim=Dim(256)
    )

    # Wait for completion
    ctx.synchronize()
```

**Parameters:** * ​\*Ts (`AnyType`): The types of the arguments to be passed to the device function. **Args:** * ​f (`DeviceExternalFunction`): The external device function to execute. * ​\*args (`*Ts`): The arguments to pass to the device function. * ​grid\_dim (`Dim`): The dimensions of the grid (number of thread blocks). * ​block\_dim (`Dim`): The dimensions of each thread block (number of threads per block).
* ​cluster\_dim (`OptionalReg[Dim]`): Optional dimensions for thread block clusters (for newer GPU architectures). * ​shared\_mem\_bytes (`OptionalReg[Int]`): Optional amount of dynamic shared memory to allocate per block. * ​attributes (`List[LaunchAttribute]`): Optional list of launch attributes for fine-grained control. * ​constant\_memory (`List[ConstantMemoryMapping]`): Optional list of constant memory mappings to use during execution. **Raises:** If there's an error enqueuing the function or if the function execution fails. ### `enqueue_function_unchecked` `enqueue_function_unchecked[func_type: AnyTrivialRegType, //, func: func_type, *Ts: AnyType, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))` Compiles and enqueues a kernel for execution on this device. You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile it first to remove the overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:** * ​func\_type (`AnyTrivialRegType`): The type of the function to launch. * ​func (`func_type`): The function to launch. * ​\*Ts (`AnyType`): The types of the arguments being passed to the function. * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). **Args:** * ​\*args (`*Ts`): Variadic arguments which are passed to the `func`. * ​grid\_dim (`Dim`): The grid dimensions. * ​block\_dim (`Dim`): The block dimensions. * ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions.
* ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block. * ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes. * ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings. * ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum. `enqueue_function_unchecked[*Ts: AnyType](self, f: DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()))` Enqueues a compiled function for execution on this device. You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile the function first to remove the overhead:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:** * ​\*Ts (`AnyType`): Argument types. **Args:** * ​f (`DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose]`): The compiled function to execute. * ​\*args (`*Ts`): Arguments to pass to the function. * ​grid\_dim (`Dim`): Dimensions of the compute grid, made up of thread blocks. * ​block\_dim (`Dim`): Dimensions of each thread block in the grid. * ​cluster\_dim (`OptionalReg[Dim]`): Dimensions of clusters (if the thread blocks are grouped into clusters). * ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block. * ​attributes (`List[LaunchAttribute]`): Launch attributes. * ​constant\_memory (`List[ConstantMemoryMapping]`): Constant memory mapping. ### `enqueue_function_checked` `enqueue_function_checked[*Ts: DevicePassable](self, f: DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()))` Enqueues a compiled function for execution on this device.
You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile the function first to remove the overhead:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:** * ​\*Ts (`DevicePassable`): Argument types. **Args:** * ​f (`DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose]`): The compiled function to execute. * ​\*args (`*Ts`): Arguments to pass to the function. * ​grid\_dim (`Dim`): Dimensions of the compute grid, made up of thread blocks. * ​block\_dim (`Dim`): Dimensions of each thread block in the grid. * ​cluster\_dim (`OptionalReg[Dim]`): Dimensions of clusters (if the thread blocks are grouped into clusters). * ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block. * ​attributes (`List[LaunchAttribute]`): Launch attributes. * ​constant\_memory (`List[ConstantMemoryMapping]`): Constant memory mapping. `enqueue_function_checked[func_type: AnyTrivialRegType, declared_arg_types: Variadic[AnyType], //, func: func_type, signature_func: fn(*args: *declared_arg_types) -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))` Compiles and enqueues a kernel for execution on this device. You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile it first to remove the overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:** * ​func\_type (`AnyTrivialRegType`): The type of the function to launch. * ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function. * ​func (`func_type`): The function to compile and launch.
* ​signature\_func (`fn(*args: *declared_arg_types) -> None`): The function to compile and launch, passed in again. Used for checking argument types later. Note: This will disappear in future versions. * ​\*actual\_arg\_types (`DevicePassable`): The types of the arguments being passed to the function. * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). **Args:** * ​\*args (`*actual_arg_types`): Variadic arguments which are passed to the `func`. * ​grid\_dim (`Dim`): The grid dimensions. * ​block\_dim (`Dim`): The block dimensions. * ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions. * ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block. * ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes. * ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings. * ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum. `enqueue_function_checked[func_type: AnyTrivialRegType, declared_arg_types: Variadic[AnyType], //, func: func_type, signature_func: fn(*args: *declared_arg_types) capturing -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))` Compiles and enqueues a kernel for execution on this device. This overload takes in a function that's `capturing`.
You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile it first to remove the overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:** * ​func\_type (`AnyTrivialRegType`): The type of the function to launch. * ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function. * ​func (`func_type`): The function to compile and launch. * ​signature\_func (`fn(*args: *declared_arg_types) capturing -> None`): The function to compile and launch, passed in again. Used for checking argument types later. Note: This will disappear in future versions. * ​\*actual\_arg\_types (`DevicePassable`): The types of the arguments being passed to the function. * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). **Args:** * ​\*args (`*actual_arg_types`): Variadic arguments which are passed to the `func`. * ​grid\_dim (`Dim`): The grid dimensions. * ​block\_dim (`Dim`): The block dimensions. * ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions. * ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block. * ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes. * ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings. * ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum.
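Given the `func` and `signature_func` parameters documented above, a minimal checked launch might look like the following sketch, which passes the kernel twice (once as the function to compile, once as the signature used for argument checking):

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    # The kernel is passed twice: as `func` and as `signature_func`.
    # Launch arguments are type-checked against the declared signature.
    ctx.enqueue_function_checked[kernel, kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```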
### `enqueue_function_experimental`

`enqueue_function_experimental[declared_arg_types: Variadic[AnyType], //, func: fn(*args: *declared_arg_types) -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))`

Compiles and enqueues a kernel for execution on this device.

You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

Each enqueue of an uncompiled function incurs 50-500 nanoseconds of compilation overhead, so if you are reusing the same function and parameters multiple times, compile it once first to remove that overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:**

* ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function.
* ​func (`fn(*args: *declared_arg_types) -> None`): The function to compile and launch.
* ​\*actual\_arg\_types (`DevicePassable`): The types of the arguments being passed to the function.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`).

**Args:**

* ​\*args (`*actual_arg_types`): Variadic arguments which are passed to the `func`.
* ​grid\_dim (`Dim`): The grid dimensions.
* ​block\_dim (`Dim`): The block dimensions.
* ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions.
* ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory to allocate per thread block, in bytes.
* ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes.
* ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings.
* ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum.
`enqueue_function_experimental[declared_arg_types: Variadic[AnyType], //, func: fn(*args: *declared_arg_types) capturing -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))`

Compiles and enqueues a kernel for execution on this device. This overload takes in a function that's `capturing`.

You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

Each enqueue of an uncompiled function incurs 50-500 nanoseconds of compilation overhead, so if you are reusing the same function and parameters multiple times, compile it once first to remove that overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:**

* ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function.
* ​func (`fn(*args: *declared_arg_types) capturing -> None`): The function to compile and launch.
* ​\*actual\_arg\_types (`DevicePassable`): The types of the arguments being passed to the function.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`).

**Args:**

* ​\*args (`*actual_arg_types`): Variadic arguments which are passed to the `func`.
* ​grid\_dim (`Dim`): The grid dimensions.
* ​block\_dim (`Dim`): The block dimensions.
* ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions.
* ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory to allocate per thread block, in bytes.
* ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes.
* ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings.
* ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum.
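The experimental overloads infer the declared argument types directly from `func`, so the kernel is passed only once. A minimal sketch under the same assumptions as the checked example above (hypothetical `fill` kernel, assumed `enqueue_create_buffer()` call, and assumed buffer-to-pointer lowering):

```mojo
from gpu.host import DeviceContext
from gpu.id import thread_idx

fn fill(data: UnsafePointer[Float32], value: Float32):
    # Each thread writes one element.
    data[thread_idx.x] = value

with DeviceContext() as ctx:
    # Assumed allocation API from this reference.
    var buf = ctx.enqueue_create_buffer[DType.float32](64)
    # The kernel appears once; argument types are checked against its signature.
    ctx.enqueue_function_experimental[fill](
        buf, Float32(1.0), grid_dim=1, block_dim=64
    )
    ctx.synchronize()
```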
### `execution_time`

`execution_time[: origin.set, //, func: fn(DeviceContext) raises capturing -> None](self, num_iters: Int) -> Int`

Measures the execution time of a function that takes a DeviceContext parameter. This method times the execution of a provided function that requires the DeviceContext as a parameter. It runs the function for the specified number of iterations and returns the total elapsed time in nanoseconds.

Example:

```mojo
from gpu.host import DeviceContext

fn gpu_operation(ctx: DeviceContext) raises:
    # Perform some GPU operation using ctx
    pass

with DeviceContext() as ctx:
    # Measure execution time of a function that uses the context
    var time_ns = ctx.execution_time[gpu_operation](10)
    print("Execution time for 10 iterations:", time_ns, "ns")
```

**Parameters:**

* ​func (`fn(DeviceContext) raises capturing -> None`): A function that takes a DeviceContext parameter to execute and time.

**Args:**

* ​num\_iters (`Int`): The number of iterations to run the function.

**Returns:** The total elapsed time in nanoseconds for all iterations.

**Raises:** If the timer operations fail or if the function raises an exception.

`execution_time[: origin.set, //, func: fn() raises capturing -> None](self, num_iters: Int) -> Int`

Measures the execution time of a function over multiple iterations. This method times the execution of a provided function that doesn't require the DeviceContext as a parameter. It runs the function for the specified number of iterations and returns the total elapsed time in nanoseconds.

Example:

```mojo
from gpu.host import DeviceContext

fn some_gpu_operation() raises:
    # Perform some GPU operation
    pass

with DeviceContext() as ctx:
    # Measure execution time of a function
    var time_ns = ctx.execution_time[some_gpu_operation](10)
    print("Execution time:", time_ns, "ns")
```

**Parameters:**

* ​func (`fn() raises capturing -> None`): A function with no parameters to execute and time.

**Args:**

* ​num\_iters (`Int`): The number of iterations to run the function.

**Returns:** The total elapsed time in nanoseconds for all iterations.

**Raises:** If the timer operations fail or if the function raises an exception.

### `execution_time_iter`

`execution_time_iter[: origin.set, //, func: fn(DeviceContext, Int) raises capturing -> None](self, num_iters: Int) -> Int`

Measures the execution time of a function that takes iteration index as input. This method times the execution of a provided function that requires both the DeviceContext and the current iteration index as parameters. It runs the function for the specified number of iterations, passing the iteration index to each call, and returns the total elapsed time in nanoseconds.

Example:

```mojo
from gpu.host import DeviceContext, DeviceFunction, Dim

var my_kernel = DeviceFunction(...)

fn benchmark_kernel(ctx: DeviceContext, i: Int) raises:
    # Run kernel with different parameters based on iteration
    ctx.enqueue_function(my_kernel, grid_dim=Dim(i), block_dim=Dim(256))

with DeviceContext() as ctx:
    # Measure execution time with iteration awareness
    var time_ns = ctx.execution_time_iter[benchmark_kernel](10)
    print("Total execution time:", time_ns, "ns")
```

**Parameters:**

* ​func (`fn(DeviceContext, Int) raises capturing -> None`): A function that takes the DeviceContext and an iteration index.

**Args:**

* ​num\_iters (`Int`): The number of iterations to run the function.

**Returns:** The total elapsed time in nanoseconds for all iterations.
**Raises:** If the timer operations fail or if the function raises an exception.

### `enqueue_copy`

`enqueue_copy[type: DType](self, dst_buf: DeviceBuffer[type], src_ptr: UnsafePointer[SIMD[type, 1]])`

Enqueues an async copy from the host to the provided device buffer. The number of bytes copied is determined by the size of the device buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_buf (`DeviceBuffer[type]`): Device buffer to copy to.
* ​src\_ptr (`UnsafePointer[SIMD[type, 1]]`): Host pointer to copy from.

`enqueue_copy[type: DType](self, dst_buf: HostBuffer[type], src_ptr: UnsafePointer[SIMD[type, 1]])`

Enqueues an async copy from host memory to the provided host buffer. The number of bytes copied is determined by the size of the host buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_buf (`HostBuffer[type]`): Host buffer to copy to.
* ​src\_ptr (`UnsafePointer[SIMD[type, 1]]`): Host pointer to copy from.

`enqueue_copy[type: DType](self, dst_ptr: UnsafePointer[SIMD[type, 1]], src_buf: DeviceBuffer[type])`

Enqueues an async copy from the device to the host. The number of bytes copied is determined by the size of the device buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_ptr (`UnsafePointer[SIMD[type, 1]]`): Host pointer to copy to.
* ​src\_buf (`DeviceBuffer[type]`): Device buffer to copy from.

`enqueue_copy[type: DType](self, dst_ptr: UnsafePointer[SIMD[type, 1]], src_buf: HostBuffer[type])`

Enqueues an async copy from a host buffer to the given host pointer. The number of bytes copied is determined by the size of the host buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_ptr (`UnsafePointer[SIMD[type, 1]]`): Host pointer to copy to.
* ​src\_buf (`HostBuffer[type]`): Host buffer to copy from.

`enqueue_copy[type: DType](self, dst_ptr: UnsafePointer[SIMD[type, 1]], src_ptr: UnsafePointer[SIMD[type, 1]], size: Int)`

Enqueues an async copy of `size` elements from a device pointer to another device pointer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_ptr (`UnsafePointer[SIMD[type, 1]]`): Device pointer to copy to.
* ​src\_ptr (`UnsafePointer[SIMD[type, 1]]`): Device pointer to copy from.
* ​size (`Int`): Number of elements (of the specified `DType`) to copy.

`enqueue_copy[type: DType](self, dst_buf: DeviceBuffer[type], src_buf: DeviceBuffer[type])`

Enqueues an async copy from one device buffer to another. The amount of data transferred is determined by the size of the destination buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_buf (`DeviceBuffer[type]`): Device buffer to copy to.
* ​src\_buf (`DeviceBuffer[type]`): Device buffer to copy from. Must be at least as large as `dst_buf`.

`enqueue_copy[type: DType](self, dst_buf: DeviceBuffer[type], src_buf: HostBuffer[type])`

Enqueues an async copy from a host buffer to a device buffer. The amount of data transferred is determined by the size of the destination buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_buf (`DeviceBuffer[type]`): Device buffer to copy to.
* ​src\_buf (`HostBuffer[type]`): Host buffer to copy from. Must be at least as large as `dst_buf`.

`enqueue_copy[type: DType](self, dst_buf: HostBuffer[type], src_buf: DeviceBuffer[type])`

Enqueues an async copy from a device buffer to a host buffer. The amount of data transferred is determined by the size of the destination buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_buf (`HostBuffer[type]`): Host buffer to copy to.
* ​src\_buf (`DeviceBuffer[type]`): Device buffer to copy from. Must be at least as large as `dst_buf`.

`enqueue_copy[type: DType](self, dst_buf: HostBuffer[type], src_buf: HostBuffer[type])`

Enqueues an async copy from one host buffer to another. The amount of data transferred is determined by the size of the destination buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_buf (`HostBuffer[type]`): Host buffer to copy to.
* ​src\_buf (`HostBuffer[type]`): Host buffer to copy from. Must be at least as large as `dst_buf`.
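Taken together, these overloads cover the common host/device round trip. The following is a minimal sketch; the two buffer-creation calls (`enqueue_create_host_buffer()` and `enqueue_create_buffer()`) are assumed from the `DeviceContext` buffer APIs referenced elsewhere in this document, not shown verbatim here.

```mojo
from gpu.host import DeviceContext

alias length = 1024

with DeviceContext() as ctx:
    # Assumed allocation APIs from this reference.
    var host_buf = ctx.enqueue_create_host_buffer[DType.float32](length)
    var dev_buf = ctx.enqueue_create_buffer[DType.float32](length)
    ctx.synchronize()

    # Populate the host buffer.
    for i in range(length):
        host_buf[i] = Float32(i)

    # Host buffer -> device buffer; the copy size comes from the destination.
    ctx.enqueue_copy(dev_buf, host_buf)
    # Device buffer -> host buffer.
    ctx.enqueue_copy(host_buf, dev_buf)
    ctx.synchronize()
```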
### `enqueue_memset`

`enqueue_memset[type: DType](self, dst: DeviceBuffer[type], val: SIMD[type, 1])`

Enqueues an async memset operation, setting all of the elements in the destination device buffer to the specified value.

**Parameters:**

* ​type (`DType`): Type of the data stored in the buffer.

**Args:**

* ​dst (`DeviceBuffer[type]`): Destination buffer.
* ​val (`SIMD[type, 1]`): Value to set all elements of `dst` to.

`enqueue_memset[type: DType](self, dst: HostBuffer[type], val: SIMD[type, 1])`

Enqueues an async memset operation, setting all of the elements in the destination host buffer to the specified value.

**Parameters:**

* ​type (`DType`): Type of the data stored in the buffer.

**Args:**

* ​dst (`HostBuffer[type]`): Destination buffer.
* ​val (`SIMD[type, 1]`): Value to set all elements of `dst` to.

### `synchronize`

`synchronize(self)`

Blocks until all asynchronous calls on the stream associated with this device context have completed. This should never be necessary when writing a custom operation.

### `enqueue_wait_for`

`enqueue_wait_for(self, other: Self)`

Enqueues a wait operation for another device context to complete its work. This method creates a dependency between two device contexts, ensuring that operations in the current context will not begin execution until all previously enqueued operations in the other context have completed. This is useful for synchronizing work across multiple devices or streams.

Example:

```mojo
from gpu.host import DeviceContext

# Create two device contexts
var ctx1 = DeviceContext(0)  # First GPU
var ctx2 = DeviceContext(1)  # Second GPU

# Enqueue operations on ctx1
# ...

# Make ctx2 wait for ctx1 to complete before proceeding
ctx2.enqueue_wait_for(ctx1)

# Enqueue operations on ctx2 that depend on ctx1's completion
# ...
```

**Args:**

* ​other (`Self`): The device context whose operations must complete before operations in this context can proceed.

**Raises:** If there's an error enqueuing the wait operation or if the operation is not supported by the underlying device API.
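For instance, `enqueue_memset()` is a convenient way to zero-fill a freshly allocated device buffer before use. A short sketch (the `enqueue_create_buffer()` allocation call is assumed, as above):

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # Assumed allocation API from this reference.
    var dev_buf = ctx.enqueue_create_buffer[DType.float32](256)
    # Set every element of the buffer to zero on the device.
    ctx.enqueue_memset(dev_buf, 0)
    ctx.synchronize()
```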
### `get_api_version`

`get_api_version(self) -> Int`

Returns the API version associated with this device. This method retrieves the version number of the GPU driver currently installed on the system for the device associated with this context. The version is returned as an integer that can be used to check compatibility with specific features or to troubleshoot driver-related issues.

Example:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # Get the API version
    var api_version = ctx.get_api_version()
    print("GPU API version:", api_version)
```

**Returns:** An integer representing the driver version.

**Raises:** If the driver version cannot be retrieved or if the device context is invalid.

### `get_attribute`

`get_attribute(self, attr: DeviceAttribute) -> Int`

Returns the specified attribute for this device. Use the aliases defined by [DeviceAttribute](/mojo/stdlib/gpu/host/device_attribute/DeviceAttribute) to specify attributes. For example:

```mojo
from gpu.host import DeviceAttribute, DeviceContext

def main():
    var ctx = DeviceContext()
    var attr = DeviceAttribute.MAX_BLOCKS_PER_MULTIPROCESSOR
    var max_blocks = ctx.get_attribute(attr)
    print(max_blocks)
```

**Args:**

* ​attr (`DeviceAttribute`): The device attribute to query.

**Returns:** The value for `attr` on this device.

### `is_compatible`

`is_compatible(self) -> Bool`

Returns True if this device is compatible with MAX. This method checks whether the current device is compatible with the Modular Accelerated Execution (MAX) runtime. It's useful for validating that the device can execute the compiled code before attempting operations.

Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
print("Device is compatible with MAX:", ctx.is_compatible())
```

**Returns:** True if the device is compatible with MAX, False otherwise.

### `id`

`id(self) -> SIMD[int64, 1]`

Returns the ID associated with this device. This method retrieves the unique identifier for the current device. Device IDs are used to distinguish between multiple devices in a system and are often needed for multi-GPU programming.

Example:

```mojo
var ctx = DeviceContext()
try:
    var device_id = ctx.id()
    print("Using device with ID:", device_id)
except:
    print("Failed to get device ID")
```

**Returns:** The unique device ID as an Int64.

**Raises:** If there's an error retrieving the device ID.

### `get_memory_info`

`get_memory_info(self) -> Tuple[UInt, UInt]`

Returns the free and total memory size for this device. This method queries the current state of device memory, providing information about how much memory is available and the total memory capacity of the device. This is useful for memory management and determining if there's enough space for planned operations.

Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
try:
    (free, total) = ctx.get_memory_info()
    print("Free memory:", free / (1024*1024), "MB")
    print("Total memory:", total / (1024*1024), "MB")
except:
    print("Failed to get memory information")
```

**Returns:** A tuple of (free memory, total memory) in bytes.

**Raises:** If there's an error retrieving the memory information.

### `can_access`

`can_access(self, peer: Self) -> Bool`

Returns True if this device can access the identified peer device. This method checks whether the current device can directly access memory on the specified peer device. Peer-to-peer access allows for direct memory transfers between devices without going through host memory, which can significantly improve performance in multi-GPU scenarios.

Example:

```mojo
from gpu.host import DeviceContext

var ctx1 = DeviceContext(0)  # First GPU
var ctx2 = DeviceContext(1)  # Second GPU
try:
    if ctx1.can_access(ctx2):
        print("Direct peer access is possible")
        ctx1.enable_peer_access(ctx2)
    else:
        print("Direct peer access is not supported")
except:
    print("Failed to check peer access capability")
```

**Args:**

* ​peer (`Self`): The peer device to check for accessibility.

**Returns:** True if the current device can access the peer device, False otherwise.

**Raises:** If there's an error checking peer access capability.
### `enable_peer_access`

`enable_peer_access(self, peer: Self)`

Enables direct memory access to the peer device. This method establishes peer-to-peer access from the current device to the specified peer device. Once enabled, the current device can directly read from and write to memory allocated on the peer device without going through host memory, which can significantly improve performance for multi-GPU operations.

Notes:

* It's recommended to call `can_access()` first to check if peer access is possible.
* Peer access is not always symmetric; you may need to enable access in both directions.

Example:

```mojo
from gpu.host import DeviceContext

var ctx1 = DeviceContext(0)  # First GPU
var ctx2 = DeviceContext(1)  # Second GPU
try:
    if ctx1.can_access(ctx2):
        ctx1.enable_peer_access(ctx2)
        print("Peer access enabled from device 0 to device 1")

        # For bidirectional access
        if ctx2.can_access(ctx1):
            ctx2.enable_peer_access(ctx1)
            print("Peer access enabled from device 1 to device 0")
    else:
        print("Peer access not supported between these devices")
except:
    print("Failed to enable peer access")
```

**Args:**

* ​peer (`Self`): The peer device to enable access to.

**Raises:** If there's an error enabling peer access or if peer access is not supported between the devices.

### `supports_multicast`

`supports_multicast(self) -> Bool`

Returns True if this device supports multicast memory mappings.

**Returns:** True if the current device supports multicast memory, False otherwise.

**Raises:** If there's an error checking multicast support.

### `number_of_devices`

`static number_of_devices(*, api: String = String(from_name[::StringSlice[::Bool())) -> Int`

Returns the number of devices available that support the specified API. This function queries the system for available devices that support the requested API (such as CUDA or HIP). It's useful for determining how many accelerators are available before allocating resources or distributing work.

Example:

```mojo
from gpu.host import DeviceContext

# Get number of CUDA devices
var num_cuda_devices = DeviceContext.number_of_devices(api="cuda")

# Get number of devices for the default API
var num_devices = DeviceContext.number_of_devices()
```

**Args:**

* ​api (`String`): Requested device API (for example, "cuda" or "hip"). Defaults to the device API specified by the DeviceContext class.

**Returns:** The number of available devices supporting the specified API.

---

## DeviceExternalFunction

`struct DeviceExternalFunction`

Represents an external device function loaded from PTX/SASS assembly. This struct provides functionality to load and execute pre-compiled GPU functions from assembly code rather than compiling them from Mojo source. This is useful for integrating with existing CUDA/HIP code or for using specialized assembly optimizations.

The `DeviceExternalFunction` handles reference counting of the underlying device function handle and provides methods for launching the function on a GPU with specified execution configuration.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__copyinit__`

`__copyinit__(out self, existing: Self)`

Creates a copy of an existing device function by incrementing its reference count.

**Args:**

* ​existing (`Self`): The device function to copy.

### `__moveinit__`

`__moveinit__(out self, owned existing: Self)`

Moves an existing device function into this one.

**Args:**

* ​existing (`Self`): The device function to move from.
### `__del__`

`__del__(owned self)`

Releases resources associated with this device function.

### `get_attribute`

`get_attribute(self, attr: Attribute) -> Int`

Retrieves a specific attribute of this device function.

**Args:**

* ​attr (`Attribute`): The attribute to query.

**Returns:** The value of the requested attribute.

**Raises:** If the attribute query fails.

---

## DeviceFunction

`struct DeviceFunction[func_type: AnyTrivialRegType, //, func: func_type, declared_arg_types: Optional[Variadic[AnyType]], *, target: target = _get_gpu_target[::StringSlice[::Bool(), _ptxas_info_verbose: Bool = False]`

Represents a compiled device function for GPU execution. This struct encapsulates a compiled GPU function that can be launched on a device. It handles the compilation, loading, and resource management of device functions.

Example:

```mojo
from gpu.host import DeviceContext, DeviceFunction

fn my_kernel(x: Int, y: Int):
    # Kernel implementation
    pass

var ctx = DeviceContext()
var kernel = ctx.compile_function[my_kernel]()
ctx.enqueue_function(kernel, 2, 3, grid_dim=(1,1,1), block_dim=(32,1,1))
```

## Parameters

* ​func\_type (`AnyTrivialRegType`): The type of the function to compile.
* ​func (`func_type`): The function to compile for GPU execution.
* ​declared\_arg\_types (`Optional[Variadic[AnyType]]`): An optional containing a variadic of the declared types of the kernel signature.
* ​target (`target`): The target architecture for compilation. Defaults to the current GPU target.
* ​\_ptxas\_info\_verbose (`Bool`): Whether to enable verbose PTX assembly output. Defaults to False.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__copyinit__`

`__copyinit__(out self, existing: Self)`

Creates a copy of an existing DeviceFunction. This increases the reference count of the underlying device function handle.

**Args:**

* ​existing (`Self`): The DeviceFunction to copy from.

### `__moveinit__`

`__moveinit__(out self, owned existing: Self)`

Moves an existing DeviceFunction into this one.

**Args:**

* ​existing (`Self`): The DeviceFunction to move from.

### `__del__`

`__del__(owned self)`

Releases resources associated with this DeviceFunction. This decrements the reference count of the underlying device function handle.

### `dump_rep`

`dump_rep[dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False)](self)`

Dumps various representations of the compiled device function. This method dumps the assembly, LLVM IR, and/or SASS code for the compiled device function based on the provided parameters. The output can be directed to stdout or written to files.

Notes: When a path contains '%', it will be replaced with the module name to help disambiguate multiple kernel dumps.

**Parameters:**

* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Controls dumping of assembly code. Can be a boolean, a file path, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Controls dumping of LLVM IR. Can be a boolean, a file path, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Controls dumping of SASS code (internal use). Can be a boolean, a file path, or a function returning a file path.

**Raises:** If any file operations fail during the dumping process.

### `get_attribute`

`get_attribute(self, attr: Attribute) -> Int`

Retrieves a specific attribute value from the compiled device function. This method queries the device function for information about its resource requirements, execution capabilities, or other properties defined by the specified attribute.

Example:

```mojo
from gpu.host import Attribute, DeviceFunction

var device_function = DeviceFunction(...)

# Get the maximum number of threads per block for this function
var max_threads = device_function.get_attribute(Attribute.MAX_THREADS_PER_BLOCK)
```

**Args:**

* ​attr (`Attribute`): The attribute to query, defined in the Attribute enum.

**Returns:** The integer value of the requested attribute.

**Raises:** If the attribute query fails or the attribute is not supported.

---

## DeviceMulticastBuffer

`struct DeviceMulticastBuffer[type: DType]`

Represents a multicast memory object that enables special memory operations to be broadcast across a group of devices.

## Parameters

* ​type (`DType`): Data type to be stored in the associated memory regions.

## Implemented traits

`AnyType`, `UnknownDestructibility`

---

## DeviceStream

`struct DeviceStream`

Represents a CUDA/HIP stream for asynchronous GPU operations. A DeviceStream provides a queue for GPU operations that can execute concurrently with operations in other streams. Operations within a single stream execute in the order they are issued, but operations in different streams may execute in any relative order or concurrently. This abstraction allows for better utilization of GPU resources by enabling overlapping of computation and data transfers.

Example:

```mojo
from gpu.host import DeviceContext, DeviceStream

var ctx = DeviceContext(0)  # Select first GPU
var stream = DeviceStream(ctx)

# Launch operations on the stream
# ...

# Wait for all operations in the stream to complete
stream.synchronize()
```

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `synchronize`

`synchronize(self)`

Blocks the calling CPU thread until all operations in this stream complete. This function waits until all previously issued commands in this stream have completed execution. It provides a synchronization point between host and device code.

Example:

```mojo
# Launch kernel or memory operations on the stream
# ...

# Wait for completion
stream.synchronize()

# Now it's safe to use results on the host
```

**Raises:** If synchronization fails.

---

## HostBuffer

`struct HostBuffer[type: DType]`

Represents a block of host-resident storage. For GPU devices, a host buffer is allocated in the host's global memory. To allocate a `HostBuffer`, use one of the methods provided by `DeviceContext`, such as [`enqueue_create_host_buffer()`](/mojo/stdlib/gpu/host/device_context/DeviceContext#enqueue_create_host_buffer).

## Parameters

* ​type (`DType`): Data type to be stored in the buffer.

## Implemented traits

`AnyType`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `__copyinit__`

`__copyinit__(out self, existing: Self)`

Creates a copy of an existing host buffer by incrementing its reference count. This copy constructor creates a new reference to the same underlying host buffer by incrementing the reference count of the native buffer object.
Both the original and the copy refer to the same underlying memory.

**Args:**

* ​existing (`Self`): The host buffer to copy.

### `__moveinit__`

`__moveinit__(out self, owned existing: Self)`

Initializes this buffer by taking ownership of an existing buffer. This move constructor transfers ownership of the underlying buffer from the existing instance to the new instance without incrementing the reference count.

**Args:**

* ​existing (`Self`): The buffer to move from, which will no longer be valid after this call.

### `__del__`

`__del__(owned self)`

Releases resources associated with this host buffer. This function schedules an owned buffer free using the stream in the device context. The actual deallocation may occur asynchronously after all operations using this buffer have completed.

### `__getitem__`

`__getitem__(self, idx: Int) -> SIMD[type, 1]`

Retrieves the element at the specified index from the host buffer. This operator allows direct access to individual elements in the host buffer using array indexing syntax.

**Args:**

* ​idx (`Int`): The index of the element to retrieve.

**Returns:** The scalar value at the specified index.

### `__setitem__`

`__setitem__(self, idx: Int, val: SIMD[type, 1])`

Sets the element at the specified index in the host buffer. This operator allows direct modification of individual elements in the host buffer using array indexing syntax.

**Args:**

* ​idx (`Int`): The index of the element to modify.
* ​val (`SIMD[type, 1]`): The new value to store at the specified index.

### `copy`

`copy(self) -> Self`

Explicitly construct a copy of self.

**Returns:** A copy of this value.

### `__len__`

`__len__(self) -> Int`

Returns the number of elements in this buffer. This method calculates the number of elements by dividing the total byte size of the buffer by the size of each element.

**Returns:** The number of elements in the buffer.

### `create_sub_buffer`

`create_sub_buffer[view_type: DType](self, offset: Int, size: Int) -> HostBuffer[view_type]`

Creates a sub-buffer view of this buffer with a different element type. This method creates a new buffer that references a subset of the memory in this buffer, potentially with a different element type. The sub-buffer shares the underlying memory with the original buffer.

**Parameters:**

* ​view\_type (`DType`): The data type for elements in the new sub-buffer.

**Args:**

* ​offset (`Int`): The starting offset in elements from the beginning of this buffer.
* ​size (`Int`): The number of elements in the new sub-buffer.

**Returns:** A new HostBuffer referencing the specified region with the specified element type.

### `enqueue_copy_to`

`enqueue_copy_to(self, dst: Self)`

Enqueues an asynchronous copy from this buffer to another host buffer. This method schedules a memory copy operation from this buffer to the destination buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context.

**Args:**

* ​dst (`Self`): The destination host buffer to copy data to.

`enqueue_copy_to(self, dst: DeviceBuffer[type])`

Enqueues an asynchronous copy from this buffer to a device buffer. This method schedules a memory copy operation from this buffer to the destination buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context.

**Args:**

* ​dst (`DeviceBuffer[type]`): The destination device buffer to copy data to.

`enqueue_copy_to(self, dst_ptr: UnsafePointer[SIMD[type, 1]])`

Enqueues an asynchronous copy from this buffer to host memory.
This method schedules a memory copy operation from this host buffer to the specified host memory location. The operation is asynchronous and will be executed in the stream associated with this buffer's context.

**Args:**

* ​dst\_ptr (`UnsafePointer[SIMD[type, 1]]`): Pointer to the destination host memory location.

### `enqueue_copy_from`

`enqueue_copy_from(self, src: Self)`

Enqueues an asynchronous copy to this buffer from another host buffer. This method schedules a memory copy operation to this buffer from the source buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context.

**Args:**

* ​src (`Self`): The source host buffer to copy data from.

`enqueue_copy_from(self, src: DeviceBuffer[type])`

Enqueues an asynchronous copy to this buffer from a device buffer. This method schedules a memory copy operation to this buffer from the source buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context.

**Args:**

* ​src (`DeviceBuffer[type]`): The source device buffer to copy data from.

`enqueue_copy_from(self, src_ptr: UnsafePointer[SIMD[type, 1]])`

Enqueues an asynchronous copy to this buffer from host memory. This method schedules a memory copy operation to this host buffer from the specified host memory location. The operation is asynchronous and will be executed in the stream associated with this buffer's context.

**Args:**

* ​src\_ptr (`UnsafePointer[SIMD[type, 1]]`): Pointer to the source host memory location.

### `enqueue_fill`

`enqueue_fill(self, val: SIMD[type, 1]) -> Self`

Enqueues an operation to fill this buffer with a specified value. This method schedules a memory set operation that fills the entire buffer with the specified value. The operation is asynchronous and will be executed in the stream associated with this buffer's context.

**Args:**

* ​val (`SIMD[type, 1]`): The value to fill the buffer with.

**Returns:** Self reference for method chaining.

### `reassign_ownership_to`

`reassign_ownership_to(self, ctx: DeviceContext)`

Transfers ownership of this buffer to another device context. This method changes the device context that owns this buffer. This can be useful when sharing buffers between different contexts or when migrating workloads between devices.

**Args:**

* ​ctx (`DeviceContext`): The new device context to take ownership of this buffer.

### `take_ptr`

`take_ptr(owned self) -> UnsafePointer[SIMD[type, 1]]`

Takes ownership of the underlying pointer from this buffer. This method releases the pointer from the buffer's control and returns it to the caller. After this call, the buffer no longer owns the pointer, and the caller is responsible for managing its lifecycle.

**Returns:** The raw pointer that was owned by this buffer.

### `unsafe_ptr`

`unsafe_ptr(self) -> UnsafePointer[SIMD[type, 1]]`

Returns the raw pointer without transferring ownership. This method provides direct access to the underlying pointer for advanced use cases. The buffer retains ownership of the pointer.

**Returns:** The raw pointer owned by this buffer.

### `context`

`context(self) -> DeviceContext`

Returns the device context associated with this buffer. This method retrieves the device context that owns this buffer and is responsible for managing its lifecycle and operations.

**Returns:** The device context associated with this buffer.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Writes a string representation of this buffer to the provided writer.
This method formats the buffer's contents as a string and writes it to the specified writer. For large buffers, a compact representation is used.

**Parameters:**

* ​W (`Writer`): The writer type.

**Args:**

* ​writer (`W`): The writer to output the formatted string to.

### `__str__`

`__str__(self) -> String`

Returns a string representation of the `HostBuffer`. This method creates a human-readable string representation of the buffer's contents by formatting its elements.

**Returns:** A string containing the formatted buffer contents.

### `as_span`

`as_span(ref self) -> Span[SIMD[type, 1], self_is_origin]`

Returns a `Span` pointing to the underlying memory of the `HostBuffer`.

**Returns:** A `Span` pointing to the underlying memory of the `HostBuffer`.

---

## device_context

This module provides functionality for interacting with accelerators. In particular the [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext) struct, which represents a single stream of execution on a given accelerator. You can use this struct to allocate accelerator memory, copy data to and from the accelerator, and compile and execute functions on the accelerator.

## Structs

* [​`DeviceBuffer`](/mojo/stdlib/gpu/host/device_context/DeviceBuffer): Represents a block of device-resident storage. For GPU devices, a device buffer is allocated in the device's global memory.
* [​`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext): Represents a single stream of execution on a particular accelerator (GPU).
* [​`DeviceExternalFunction`](/mojo/stdlib/gpu/host/device_context/DeviceExternalFunction): Represents an external device function loaded from PTX/SASS assembly.
* [​`DeviceFunction`](/mojo/stdlib/gpu/host/device_context/DeviceFunction): Represents a compiled device function for GPU execution.
* [​`DeviceMulticastBuffer`](/mojo/stdlib/gpu/host/device_context/DeviceMulticastBuffer): Represents a multicast memory object that enables special memory operations to be broadcast across a group of devices.
* [​`DeviceStream`](/mojo/stdlib/gpu/host/device_context/DeviceStream): Represents a CUDA/HIP stream for asynchronous GPU operations.
* [​`HostBuffer`](/mojo/stdlib/gpu/host/device_context/HostBuffer): Represents a block of host-resident storage. For GPU devices, a host buffer is allocated in the host's global memory.

---

## Dim

`@register_passable(trivial)`

`struct Dim`

Represents a dimension with up to three components (x, y, z). This struct is commonly used to represent grid and block dimensions for kernel launches.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `__init__`

`@implicit`

`__init__[I: Indexer](x: I) -> Self`

Initializes Dim with a single indexable value for x. y and z dimensions are set to 1.

**Parameters:**

* ​I (`Indexer`): The type of the indexable value.

**Args:**

* ​x (`I`): The value for the x dimension.

`__init__[I0: Indexer, I1: Indexer](x: I0, y: I1) -> Self`

Initializes Dim with indexable values for x and y. z dimension is set to 1.

**Parameters:**

* ​I0 (`Indexer`): The type of the first indexable value.
* ​I1 (`Indexer`): The type of the second indexable value.

**Args:**

* ​x (`I0`): The value for the x dimension.
* ​y (`I1`): The value for the y dimension.

`__init__[I0: Indexer, I1: Indexer, I2: Indexer](x: I0, y: I1, z: I2) -> Self`

Initializes Dim with indexable values for x, y, and z.

**Parameters:**

* ​I0 (`Indexer`): The type of the first indexable value.
* ​I1 (`Indexer`): The type of the second indexable value. * ​I2 (`Indexer`): The type of the third indexable value. **Args:** * ​x (`I0`): The value for the x dimension. * ​y (`I1`): The value for the y dimension. * ​z (`I2`): The value for the z dimension. `@implicit` `__init__[I: Indexer](dims: Tuple[I]) -> Self` Initializes Dim with a tuple containing a single indexable value. y and z dimensions are set to 1. **Parameters:** * ​I (`Indexer`): The type of the indexable value in the tuple. **Args:** * ​dims (`Tuple[I]`): A tuple with one element for x dimension. `@implicit` `__init__[I0: Indexer, I1: Indexer](dims: Tuple[I0, I1]) -> Self` Initializes Dim with a tuple of two indexable values. The z dimension is set to 1. **Parameters:** * ​I0 (`Indexer`): The type of the first indexable value in the tuple. * ​I1 (`Indexer`): The type of the second indexable value in the tuple. **Args:** * ​dims (`Tuple[I0, I1]`): A tuple with two elements: x and y dimensions. `@implicit` `__init__[I0: Indexer, I1: Indexer, I2: Indexer](dims: Tuple[I0, I1, I2]) -> Self` Initializes Dim with a tuple of three indexable values. **Parameters:** * ​I0 (`Indexer`): The type of the first indexable value in the tuple. * ​I1 (`Indexer`): The type of the second indexable value in the tuple. * ​I2 (`Indexer`): The type of the third indexable value in the tuple. **Args:** * ​dims (`Tuple[I0, I1, I2]`): Tuple with three elements: x, y, and z dimensions. ### `__getitem__` `__getitem__(self, idx: Int) -> Int` Gets the dimension value at the specified index. **Args:** * ​idx (`Int`): The index (0 for x, 1 for y, 2 for z). **Returns:** The value of the dimension at the given index. ### `__str__` `__str__(self) -> String` Returns a string representation of the Dim. **Returns:** String representation of this Dim object. ### `__repr__` `__repr__(self) -> String` Returns a string representation of the Dim. **Returns:** String representation of this Dim object. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes a formatted string representation of the Dim. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer trait. **Args:** * ​writer (`W`): The Writer to write to. ### `z` `z(self) -> Int` Returns the z dimension. **Returns:** The value of the z dimension. ### `y` `y(self) -> Int` Returns the y dimension. **Returns:** The value of the y dimension. ### `x` `x(self) -> Int` Returns the x dimension. **Returns:** The value of the x dimension. --- ## dim This module implements the dim type. ## Structs * [​`Dim`](/mojo/stdlib/gpu/host/dim/Dim): Represents a dimension with up to three components (x, y, z). --- ## Attribute `@register_passable(trivial)` `struct Attribute` Represents GPU kernel function attributes. This struct defines constants for various function attributes that can be queried or set for GPU kernels. These attributes provide information about resource requirements and execution constraints of kernel functions. ## Fields * ​code (`SIMD[int32, 1]`): The numeric code representing the attribute type. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`, `Writable` ## Aliases ### `BINARY_VERSION` `alias BINARY_VERSION = Attribute(__init__[__mlir_type.!pop.int_literal](6))` The binary architecture version for which the function was compiled. This value is the major binary version \* 10 + the minor binary version, so a binary version 1.3 function would return the value 13. 
Note that this will return a value of 10 for legacy cubins that do not have a properly-encoded binary architecture version.

### `CACHE_MODE_CA`

`alias CACHE_MODE_CA = Attribute(__init__[__mlir_type.!pop.int_literal](7))`

The attribute to indicate whether the function has been compiled with the user-specified option `-Xptxas --dlcm=ca` set.

### `CLUSTER_SCHEDULING_POLICY_PREFERENCE`

`alias CLUSTER_SCHEDULING_POLICY_PREFERENCE = Attribute(__init__[__mlir_type.!pop.int_literal](15))`

The block scheduling policy of a function. The value type is CUclusterSchedulingPolicy / cudaClusterSchedulingPolicy.

### `CLUSTER_SIZE_MUST_BE_SET`

`alias CLUSTER_SIZE_MUST_BE_SET = Attribute(__init__[__mlir_type.!pop.int_literal](10))`

If this attribute is set, the kernel must launch with a valid cluster size specified.

### `CONST_SIZE_BYTES`

`alias CONST_SIZE_BYTES = Attribute(__init__[__mlir_type.!pop.int_literal](2))`

The size in bytes of user-allocated constant memory required by this function.

### `LOCAL_SIZE_BYTES`

`alias LOCAL_SIZE_BYTES = Attribute(__init__[__mlir_type.!pop.int_literal](3))`

The size in bytes of local memory used by each thread of this function.

### `MAX_DYNAMIC_SHARED_SIZE_BYTES`

`alias MAX_DYNAMIC_SHARED_SIZE_BYTES = Attribute(__init__[__mlir_type.!pop.int_literal](8))`

The maximum size in bytes of dynamically-allocated shared memory that can be used by this function. If the user-specified dynamic shared memory size is larger than this value, the launch will fail.

### `MAX_THREADS_PER_BLOCK`

`alias MAX_THREADS_PER_BLOCK = Attribute(__init__[__mlir_type.!pop.int_literal](0))`

The maximum number of threads per block, beyond which a launch of the function would fail. This number depends on both the function and the device on which the function is currently loaded.

### `NON_PORTABLE_CLUSTER_SIZE_ALLOWED`

`alias NON_PORTABLE_CLUSTER_SIZE_ALLOWED = Attribute(__init__[__mlir_type.!pop.int_literal](14))`

Whether the function can be launched with non-portable cluster size. 1 is allowed, 0 is disallowed. A non-portable cluster size may only function on the specific SKUs the program is tested on. The launch might fail if the program is run on a different hardware platform. The CUDA API provides `cudaOccupancyMaxActiveClusters` to assist with checking whether the desired size can be launched on the current device. A portable cluster size is guaranteed to be functional on all compute capabilities higher than the target compute capability. The portable cluster size for sm\_90 is 8 blocks per cluster.

### `NUM_REGS`

`alias NUM_REGS = Attribute(__init__[__mlir_type.!pop.int_literal](4))`

The number of registers used by each thread of this function.

### `PREFERRED_SHARED_MEMORY_CARVEOUT`

`alias PREFERRED_SHARED_MEMORY_CARVEOUT = Attribute(__init__[__mlir_type.!pop.int_literal](9))`

On devices where the L1 cache and shared memory use the same hardware resources, this sets the shared memory carveout preference, in percent of the total shared memory.

### `PTX_VERSION`

`alias PTX_VERSION = Attribute(__init__[__mlir_type.!pop.int_literal](5))`

The PTX virtual architecture version for which the function was compiled. This value is the major PTX version \* 10 + the minor PTX version, so a PTX version 1.3 function would return the value 13. Note that this may return the undefined value of 0 for cubins compiled prior to CUDA 3.0.

### `REQUIRED_CLUSTER_DEPTH`

`alias REQUIRED_CLUSTER_DEPTH = Attribute(__init__[__mlir_type.!pop.int_literal](13))`

The required cluster depth in blocks.
The values must either all be 0 or all be positive. The validity of the cluster dimensions is otherwise checked at launch time.

### `REQUIRED_CLUSTER_HEIGHT`

`alias REQUIRED_CLUSTER_HEIGHT = Attribute(__init__[__mlir_type.!pop.int_literal](12))`

The required cluster height in blocks. The values must either all be 0 or all be positive. The validity of the cluster dimensions is otherwise checked at launch time.

### `REQUIRED_CLUSTER_WIDTH`

`alias REQUIRED_CLUSTER_WIDTH = Attribute(__init__[__mlir_type.!pop.int_literal](11))`

The required cluster width in blocks. The values must either all be 0 or all be positive. The validity of the cluster dimensions is otherwise checked at launch time.

### `SHARED_SIZE_BYTES`

`alias SHARED_SIZE_BYTES = Attribute(__init__[__mlir_type.!pop.int_literal](1))`

The size in bytes of statically-allocated shared memory required by this function. This does not include dynamically-allocated shared memory requested by the user at runtime.

## Methods

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Checks if two Attribute instances are equal.

**Args:**

* ​other (`Self`): The Attribute to compare with.

**Returns:** True if both attributes have the same code, False otherwise.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

Checks if two Attribute instances are not equal.

**Args:**

* ​other (`Self`): The Attribute to compare with.

**Returns:** True if the attributes have different codes, False otherwise.

### `__is__`

`__is__(self, other: Self) -> Bool`

Identity comparison operator for Attribute instances.

**Args:**

* ​other (`Self`): The Attribute to compare with.

**Returns:** True if both attributes are identical (have the same code), False otherwise.

### `__isnot__`

`__isnot__(self, other: Self) -> Bool`

Negative identity comparison operator for Attribute instances.

**Args:**

* ​other (`Self`): The Attribute to compare with.

**Returns:** True if the attributes are not identical, False otherwise.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Writes a string representation of the `Attribute` to the provided writer. This method converts the `Attribute` enum value to its corresponding string name and writes it to the provided writer object.

**Parameters:**

* ​W (`Writer`): The type of writer to use for output. Must implement the Writer trait.

**Args:**

* ​writer (`W`): A Writer object that will receive the string representation.

---

## FuncAttribute

`@register_passable(trivial)`

`struct FuncAttribute`

Implements CUDA's CUfunction\_attribute enum for GPU kernel function attributes. This struct represents function attributes that can be set or queried for GPU kernels, following NVIDIA's CUDA driver API conventions. Each attribute consists of a type (represented by the Attribute enum) and an associated value. The struct provides factory methods for creating common attribute configurations, such as cache mode settings and shared memory allocations.

## Fields

* ​attribute (`Attribute`): The type of function attribute.
* ​value (`SIMD[int32, 1]`): The value associated with this attribute.

## Implemented traits

`AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility`

## Aliases

### `NULL`

`alias NULL = FuncAttribute(Attribute(__init__[__mlir_type.!pop.int_literal](-1)), __init__[__mlir_type.!pop.int_literal](-1))`

A null/invalid function attribute constant.

## Methods

### `__init__`

`__init__(*, other: Self) -> Self`

Explicitly construct a deep copy of the provided value.

**Args:**

* ​other (`Self`): The value to copy.
### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two `FuncAttribute` instances are equal. **Args:** * ​other (`Self`): The FuncAttribute to compare with. **Returns:** True if both the attribute type and value are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if two `FuncAttribute` instances are not equal. **Args:** * ​other (`Self`): The `FuncAttribute` to compare with. **Returns:** True if either the attribute type or value differs, False otherwise. ### `CACHE_MODE_CA` `static CACHE_MODE_CA(val: Bool) -> Self` Creates a CACHE\_MODE\_CA function attribute. Indicates whether the function has been compiled with user specified option `CacheMode.L1_CACHE_DISABLED` set. **Args:** * ​val (`Bool`): Boolean value indicating if L1 cache is disabled. **Returns:** A `FuncAttribute` instance with CACHE\_MODE\_CA attribute type. ### `MAX_DYNAMIC_SHARED_SIZE_BYTES` `static MAX_DYNAMIC_SHARED_SIZE_BYTES(val: SIMD[uint32, 1]) -> Self` Creates a MAX\_DYNAMIC\_SHARED\_SIZE\_BYTES function attribute. The maximum size in bytes of dynamically-allocated shared memory that can be used by this function. If the user-specified dynamic shared memory size is larger than this value, the launch will fail. **Args:** * ​val (`SIMD[uint32, 1]`): Maximum dynamic shared memory size in bytes. **Returns:** A `FuncAttribute` instance with `MAX_DYNAMIC_SHARED_SIZE_BYTES` attribute type. ### `PREFERRED_SHARED_MEMORY_CARVEOUT` `static PREFERRED_SHARED_MEMORY_CARVEOUT(val: SIMD[int32, 1]) -> Self` Creates a PREFERRED\_SHARED\_MEMORY\_CARVEOUT function attribute. On devices where the L1 cache and shared memory use the same hardware resources, this sets the shared memory carveout preference, in percent of the total shared memory. **Args:** * ​val (`SIMD[int32, 1]`): Shared memory carveout preference as a percentage (0-100). **Returns:** A FuncAttribute instance with `PREFERRED_SHARED_MEMORY_CARVEOUT` attribute type. --- ## func_attribute GPU Kernel Function Attributes Module This module provides structures for defining and managing GPU kernel function attributes. It implements functionality similar to CUDA's CUfunction\_attribute enum, allowing for querying and setting various attributes that control kernel execution behavior and resource allocation. The module includes: * `Attribute`: A value type representing different GPU kernel function attribute types * `FuncAttribute`: A structure that pairs an attribute type with its value These structures enable fine-grained control over GPU kernel execution parameters such as shared memory allocation, cache behavior, and cluster configuration. ## Structs * [​`Attribute`](/mojo/stdlib/gpu/host/func_attribute/Attribute): Represents GPU kernel function attributes. * [​`FuncAttribute`](/mojo/stdlib/gpu/host/func_attribute/FuncAttribute): Implements CUDA's CUfunction\_attribute enum for GPU kernel function attributes. --- ## host Implements the gpu host package. ## Modules * [​`constant_memory_mapping`](/mojo/stdlib/gpu/host/constant_memory_mapping/): This module provides functionality for mapping constant memory between host and device. * [​`device_attribute`](/mojo/stdlib/gpu/host/device_attribute/): This module defines GPU device attributes that can be queried from CUDA-compatible devices. * [​`device_context`](/mojo/stdlib/gpu/host/device_context/): This module provides functionality for interacting with accelerators. 
In particular the [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext) struct, which represents a single stream of execution on a given accelerator. You can use this struct to allocate accelerator memory, copy data to and from the accelerator, and compile and execute functions on the accelerator. * [​`dim`](/mojo/stdlib/gpu/host/dim/): This module implements the dim type. * [​`func_attribute`](/mojo/stdlib/gpu/host/func_attribute/): GPU Kernel Function Attributes Module * [​`info`](/mojo/stdlib/gpu/host/info/): Contains information about GPU architectures and their capabilities. * [​`launch_attribute`](/mojo/stdlib/gpu/host/launch_attribute/): GPU Launch Attributes for Kernel Execution Control --- ## Info `@register_passable` `struct Info` Comprehensive information about a GPU architecture. This struct contains detailed specifications about GPU capabilities, including compute units, memory, thread organization, and performance characteristics. ## Fields * ​name (`StringSlice[StaticConstantOrigin]`): The model name of the GPU. * ​vendor (`Vendor`): The vendor/manufacturer of the GPU (e.g., NVIDIA, AMD). * ​api (`StringSlice[StaticConstantOrigin]`): The graphics/compute API supported by the GPU (e.g., CUDA, ROCm). * ​arch\_name (`StringSlice[StaticConstantOrigin]`): The architecture name of the GPU (e.g., sm\_80, gfx942). * ​compile\_options (`StringSlice[StaticConstantOrigin]`): Compiler options specific to this GPU architecture. * ​compute (`SIMD[float32, 1]`): Compute capability version number for NVIDIA GPUs. * ​version (`StringSlice[StaticConstantOrigin]`): Version string of the GPU architecture. * ​sm\_count (`Int`): Number of streaming multiprocessors (SMs) on the GPU. * ​warp\_size (`Int`): Number of threads in a warp/wavefront. * ​threads\_per\_sm (`Int`): Maximum number of threads per streaming multiprocessor. * ​threads\_per\_warp (`Int`): Number of threads that execute together in a warp/wavefront. * ​warps\_per\_multiprocessor (`Int`): Maximum number of warps that can be active on a multiprocessor. * ​threads\_per\_multiprocessor (`Int`): Maximum number of threads that can be active on a multiprocessor. * ​thread\_blocks\_per\_multiprocessor (`Int`): Maximum number of thread blocks that can be active on a multiprocessor. * ​shared\_memory\_per\_multiprocessor (`Int`): Size of shared memory available per multiprocessor in bytes. * ​register\_file\_size (`Int`): Total size of the register file per multiprocessor in bytes. * ​register\_allocation\_unit\_size (`Int`): Minimum allocation size for registers in bytes. * ​allocation\_granularity (`StringSlice[StaticConstantOrigin]`): Description of how resources are allocated on the GPU. * ​max\_registers\_per\_thread (`Int`): Maximum number of registers that can be allocated to a single thread. * ​max\_registers\_per\_block (`Int`): Maximum number of registers that can be allocated to a thread block. * ​max\_blocks\_per\_multiprocessor (`Int`): Maximum number of blocks that can be scheduled on a multiprocessor. * ​shared\_memory\_allocation\_unit\_size (`Int`): Minimum allocation size for shared memory in bytes. * ​warp\_allocation\_granularity (`Int`): Granularity at which warps are allocated resources. * ​max\_thread\_block\_size (`Int`): Maximum number of threads allowed in a thread block. ## Implemented traits `AnyType`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__lt__` `__lt__(self, other: Self) -> Bool` Compares if this GPU has lower compute capability than another. 
**Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if this GPU has lower compute capability, False otherwise. ### `__le__` `__le__(self, other: Self) -> Bool` Compares if this GPU has lower or equal compute capability. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if this GPU has lower or equal compute capability. ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two GPU Info instances represent the same GPU model. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if both instances represent the same GPU model. ### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if two GPU Info instances represent different GPU models. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if instances represent different GPU models. ### `__gt__` `__gt__(self, other: Self) -> Bool` Compares if this GPU has higher compute capability than another. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if this GPU has higher compute capability, False otherwise. ### `__ge__` `__ge__(self, other: Self) -> Bool` Compares if this GPU has higher or equal compute capability. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if this GPU has higher or equal compute capability. ### `__is__` `__is__(self, other: Self) -> Bool` Identity comparison operator for GPU Info instances. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if both instances represent the same GPU model. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Negative identity comparison operator for GPU Info instances. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if instances represent different GPU models. ### `target` `target(self) -> target` Gets the MLIR target configuration for this GPU. **Returns:** MLIR target configuration for the GPU. ### `from_target` `static from_target[target: target]() -> Self` Creates an Info instance from an MLIR target. **Parameters:** * ​target (`target`): MLIR target configuration. **Returns:** GPU info corresponding to the target. ### `from_name` `static from_name[name: StringSlice[StaticConstantOrigin]]() -> Self` Creates an Info instance from a GPU architecture name. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): GPU architecture name (e.g., "sm\_80", "gfx942"). **Returns:** GPU info corresponding to the architecture name. ### `occupancy` `occupancy(self, *, threads_per_block: Int, registers_per_thread: Int) -> SIMD[float64, 1]` Calculates the theoretical occupancy for a given thread and register configuration. Occupancy represents the ratio of active warps to the maximum possible warps on a streaming multiprocessor. Note: TODO (KERN-795): Add occupancy calculation based on shared memory usage and thread block size, and use the minimum value. **Args:** * ​threads\_per\_block (`Int`): Number of threads in each block. * ​registers\_per\_thread (`Int`): Number of registers used by each thread. **Returns:** Occupancy as a ratio between 0.0 and 1.0. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes GPU information to a writer. Outputs all GPU specifications and capabilities to the provided writer in a human-readable format. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer trait.
**Args:** * ​writer (`W`): A Writer instance to output the GPU information. ### `__str__` `__str__(self) -> String` Returns a string representation of the GPU information. Converts all GPU specifications and capabilities to a human-readable string format. **Returns:** String containing all GPU information. --- ## Vendor `@register_passable` `struct Vendor` Represents GPU vendors. This struct provides identifiers for different GPU vendors and utility methods for comparison and string representation. The Vendor struct defines constants for common GPU vendors (NVIDIA, AMD) and includes a NO\_GPU option for systems without GPU support. It provides comparison operators and string conversion methods for vendor identification. ## Implemented traits `AnyType`, `UnknownDestructibility`, `Writable` ## Aliases ### `AMD_GPU` `alias AMD_GPU = Vendor(__init__[__mlir_type.!pop.int_literal](1))` Represents AMD GPU vendor. ### `NO_GPU` `alias NO_GPU = Vendor(__init__[__mlir_type.!pop.int_literal](0))` Represents no GPU or CPU-only execution. ### `NVIDIA_GPU` `alias NVIDIA_GPU = Vendor(__init__[__mlir_type.!pop.int_literal](2))` Represents NVIDIA GPU vendor. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two `Vendor` instances are equal. **Args:** * ​other (`Self`): The `Vendor` to compare with. **Returns:** True if vendors are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if two `Vendor` instances are not equal. **Args:** * ​other (`Self`): The `Vendor` to compare with. **Returns:** True if vendors are not equal, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Identity comparison for vendors. **Args:** * ​other (`Self`): The `Vendor` to compare with. **Returns:** True if vendors are identical, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Negative identity comparison for vendors. **Args:** * ​other (`Self`): The Vendor to compare with. **Returns:** True if vendors are not identical, False otherwise. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes vendor information to a writer. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer trait. **Args:** * ​writer (`W`): The writer to output vendor information to. ### `__str__` `__str__(self) -> String` Returns a string representation of the vendor. **Returns:** String representation of the vendor. --- ## info Contains information about GPU architectures and their capabilities. This module provides detailed specifications for various GPU models including NVIDIA and AMD GPUs. It includes information about compute capabilities, memory specifications, thread organization, and performance characteristics. 
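For orientation, here is a minimal, hedged sketch (using the `A100` and `H100` aliases listed below) that compares two architectures by compute capability and estimates theoretical occupancy for one hypothetical launch configuration:

```mojo
from gpu.host.info import A100, H100

def main():
    # Info instances compare by compute capability (8.0 vs. 9.0 here).
    print(A100 < H100)  # True
    # Theoretical occupancy of 256-thread blocks at 32 registers per thread.
    print(A100.occupancy(threads_per_block=256, registers_per_thread=32))
```

The occupancy result is a ratio between 0.0 and 1.0, as described under `Info.occupancy` above.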
## Aliases ### `A10` `alias A10 = Info(__init__[__mlir_type.!kgen.string]("A10"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ampere"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8.5999999999999996), __init__[__mlir_type.!kgen.string]("sm_86"), 72, 32, 1536, 32, 64, 2048, 32, 102400, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 16, 128, 4, 1024)` ### `A100` `alias A100 = Info(__init__[__mlir_type.!kgen.string]("A100"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ampere"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8), __init__[__mlir_type.!kgen.string]("sm_80"), 108, 32, 2048, 32, 64, 2048, 32, 167936, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)` ### `B100` `alias B100 = Info(__init__[__mlir_type.!kgen.string]("B100"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("blackwell"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](10), __init__[__mlir_type.!kgen.string]("sm_100a"), 132, 32, -1, 32, 64, 1536, 32, 262144, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)` ### `B200` `alias B200 = Info(__init__[__mlir_type.!kgen.string]("B200"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("blackwell"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](10), __init__[__mlir_type.!kgen.string]("sm_100a"), 148, 32, -1, 32, 64, 2048, 32, 233472, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)` ### `DEFAULT_GPU` `alias DEFAULT_GPU = from_name[::StringSlice[::Bool()` ### `DEFAULT_GPU_ARCH` `alias DEFAULT_GPU_ARCH = _accelerator_arch()` ### `DEFAULT_GPU_TARGET` `alias DEFAULT_GPU_TARGET = from_name[::StringSlice[::Bool().target()` ### `H100` `alias H100 = Info(__init__[__mlir_type.!kgen.string]("H100"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("hopper"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](9), __init__[__mlir_type.!kgen.string]("sm_90a"), 132, 32, 2048, 32, 64, 2048, 32, 233472, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)` ### `L4` `alias L4 = Info(__init__[__mlir_type.!kgen.string]("L4"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ada"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8.9000000000000004), __init__[__mlir_type.!kgen.string]("sm_89"), 58, 32, 1536, 32, 64, 2048, 32, 102400, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 24, 128, 4, 1024)` ### `MI300X` `alias MI300X = Info(__init__[__mlir_type.!kgen.string]("MI300X"), Vendor(__init__[__mlir_type.!pop.int_literal](1)), __init__[__mlir_type.!kgen.string]("hip"), __init__[__mlir_type.!kgen.string]("gfx942"), __init__[__mlir_type.!kgen.string](""), 
__init__[__mlir_type.!pop.float_literal](9.4000000000000003), __init__[__mlir_type.!kgen.string]("CDNA3"), 304, 64, 2048, 64, 32, 2048, 2, 65536, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 2, 128, 4, 1024)` ### `NoGPU` `alias NoGPU = Info(__init__[__mlir_type.!kgen.string]("NoGPU"), Vendor(__init__[__mlir_type.!pop.int_literal](0)), __init__[__mlir_type.!kgen.string]("none"), __init__[__mlir_type.!kgen.string]("no_gpu"), __init__[__mlir_type.!kgen.string](""), __init__[__mlir_type.!pop.int_literal](0), __init__[__mlir_type.!kgen.string](""), 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, __init__[__mlir_type.!kgen.string]("none"), 0, 0, 0, 0, 0, 0)` ### `OrinNano` `alias OrinNano = Info(__init__[__mlir_type.!kgen.string]("Orin Nano"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ampere"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8.6999999999999993), __init__[__mlir_type.!kgen.string]("sm_87"), 8, 32, 1536, 32, 64, 2048, 32, 167936, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 16, 128, 4, 1024)` ### `Radeon7600` `alias Radeon7600 = Info(__init__[__mlir_type.!kgen.string]("Radeon 7600"), Vendor(__init__[__mlir_type.!pop.int_literal](1)), __init__[__mlir_type.!kgen.string]("hip"), __init__[__mlir_type.!kgen.string]("gfx1102"), __init__[__mlir_type.!kgen.string](""), __init__[__mlir_type.!pop.float_literal](11), __init__[__mlir_type.!kgen.string]("RDNA3"), 32, 32, 1024, 32, 32, 1024, 2, 32768, 32768, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 32768, 2, 128, 4, 1024)` ### `Radeon7800` `alias Radeon7800 = Info(__init__[__mlir_type.!kgen.string]("Radeon 7800/7700"), Vendor(__init__[__mlir_type.!pop.int_literal](1)), __init__[__mlir_type.!kgen.string]("hip"), __init__[__mlir_type.!kgen.string]("gfx1101"), __init__[__mlir_type.!kgen.string](""), __init__[__mlir_type.!pop.float_literal](11), __init__[__mlir_type.!kgen.string]("RDNA3"), 60, 32, 1024, 32, 32, 1024, 2, 32768, 32768, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 32768, 2, 128, 4, 1024)` ### `Radeon780m` `alias Radeon780m = Info(__init__[__mlir_type.!kgen.string]("Radeon 780M"), Vendor(__init__[__mlir_type.!pop.int_literal](1)), __init__[__mlir_type.!kgen.string]("hip"), __init__[__mlir_type.!kgen.string]("gfx1103"), __init__[__mlir_type.!kgen.string](""), __init__[__mlir_type.!pop.float_literal](11), __init__[__mlir_type.!kgen.string]("RDNA3"), 12, 32, 1024, 32, 32, 1024, 2, 32768, 32768, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 32768, 2, 128, 4, 1024)` ### `Radeon7900` `alias Radeon7900 = Info(__init__[__mlir_type.!kgen.string]("Radeon 7900"), Vendor(__init__[__mlir_type.!pop.int_literal](1)), __init__[__mlir_type.!kgen.string]("hip"), __init__[__mlir_type.!kgen.string]("gfx1100"), __init__[__mlir_type.!kgen.string](""), __init__[__mlir_type.!pop.float_literal](11), __init__[__mlir_type.!kgen.string]("RDNA3"), 96, 32, 1024, 32, 32, 1024, 2, 32768, 32768, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 32768, 2, 128, 4, 1024)` ### `Radeon9060` `alias Radeon9060 = Info(__init__[__mlir_type.!kgen.string]("Radeon 9060"), Vendor(__init__[__mlir_type.!pop.int_literal](1)), __init__[__mlir_type.!kgen.string]("hip"), __init__[__mlir_type.!kgen.string]("gfx1200"), __init__[__mlir_type.!kgen.string](""), __init__[__mlir_type.!pop.float_literal](12), __init__[__mlir_type.!kgen.string]("RDNA4"), 32, 32, 1024, 32, 32, 1024, 2, 32768, 
32768, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 32768, 2, 128, 4, 1024)` ### `Radeon9070` `alias Radeon9070 = Info(__init__[__mlir_type.!kgen.string]("Radeon 9070"), Vendor(__init__[__mlir_type.!pop.int_literal](1)), __init__[__mlir_type.!kgen.string]("hip"), __init__[__mlir_type.!kgen.string]("gfx1201"), __init__[__mlir_type.!kgen.string](""), __init__[__mlir_type.!pop.float_literal](12), __init__[__mlir_type.!kgen.string]("RDNA4"), 64, 32, 1024, 32, 32, 1024, 2, 32768, 32768, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 32768, 2, 128, 4, 1024)` ### `RTX2060` `alias RTX2060 = Info(__init__[__mlir_type.!kgen.string]("RTX2060"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("turing"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](7.5), __init__[__mlir_type.!kgen.string]("sm_75"), 30, 32, 2048, 32, 64, 2048, 16, 65536, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 32768, 16, 32, 4, 1024)` ### `RTX4090` `alias RTX4090 = Info(__init__[__mlir_type.!kgen.string]("RTX4090"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ada lovelace"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8.9000000000000004), __init__[__mlir_type.!kgen.string]("sm_89"), 128, 32, -1, 32, 64, 1536, 24, 102400, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 24, 128, 4, 1024)` ### `RTX4090m` `alias RTX4090m = Info(__init__[__mlir_type.!kgen.string]("RTX4090m"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ada lovelace"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8.9000000000000004), __init__[__mlir_type.!kgen.string]("sm_89"), 76, 32, -1, 32, 64, 1536, 24, 102400, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 24, 128, 4, 1024)` ### `RTX5090` `alias RTX5090 = Info(__init__[__mlir_type.!kgen.string]("RTX5090"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("blackwell"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](12), __init__[__mlir_type.!kgen.string]("sm_120a"), 170, 32, -1, 32, 64, 1536, 32, 59392, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)` ## Structs * [​`Info`](/mojo/stdlib/gpu/host/info/Info): Comprehensive information about a GPU architecture. * [​`Vendor`](/mojo/stdlib/gpu/host/info/Vendor): Represents GPU vendors. ## Functions * [​`is_cpu`](/mojo/stdlib/gpu/host/info/is_cpu): Checks if the target is a CPU (compile-time version). * [​`is_gpu`](/mojo/stdlib/gpu/host/info/is_gpu): Checks if the target is a GPU (compile-time version). * [​`is_valid_target`](/mojo/stdlib/gpu/host/info/is_valid_target): Checks if the target is valid (compile-time version). --- ## is_cpu `is_cpu[: Bool, : Origin[$0], //, target: StringSlice[$1]]() -> Bool` Checks if the target is a CPU (compile-time version). **Parameters:** * ​target (`StringSlice[$1]`): Target string to check. **Returns:** True if the target is a CPU, False otherwise. `is_cpu(target: StringSlice[origin]) -> Bool` Checks if the target is a CPU (runtime version). 
**Args:** * ​target (`StringSlice[origin]`): Target string to check. **Returns:** True if the target is a CPU, False otherwise. --- ## is_gpu `is_gpu[: Bool, : Origin[$0], //, target: StringSlice[$1]]() -> Bool` Checks if the target is a GPU (compile-time version). **Parameters:** * ​target (`StringSlice[$1]`): Target string to check. **Returns:** True if the target is a GPU, False otherwise. `is_gpu(target: StringSlice[origin]) -> Bool` Checks if the target is a GPU (runtime version). **Args:** * ​target (`StringSlice[origin]`): Target string to check. **Returns:** True if the target is a GPU, False otherwise. --- ## is_valid_target `is_valid_target[: Bool, : Origin[$0], //, target: StringSlice[$1]]() -> Bool` Checks if the target is valid (compile-time version). **Parameters:** * ​target (`StringSlice[$1]`): Target string to check. **Returns:** True if the target is valid (CPU or GPU), False otherwise. `is_valid_target(target: StringSlice[origin]) -> Bool` Checks if the target is valid (runtime version). **Args:** * ​target (`StringSlice[origin]`): Target string to check. **Returns:** True if the target is valid (CPU or GPU), False otherwise. --- ## AccessPolicyWindow `@register_passable(trivial)` `struct AccessPolicyWindow` Specifies an access policy for a window of memory. This struct defines a contiguous extent of memory beginning at base\_ptr and ending at base\_ptr + num\_bytes, with associated access policies. It allows fine-grained control over how memory is accessed and cached, which can significantly impact performance for memory-bound workloads. The window is partitioned into segments with different access properties based on the hit\_ratio. Accesses to "hit segments" use the hit\_prop policy, while accesses to "miss segments" use the miss\_prop policy. Note: The `num_bytes` value is limited by `CU_DEVICE_ATTRIBUTE_MAX_ACCESS_POLICY_WINDOW_SIZE`. The CUDA driver may align the `base_ptr` and restrict the maximum size. ## Fields * ​base\_ptr (`UnsafePointer[NoneType]`): Starting address of the access policy window. Driver may align it. * ​num\_bytes (`Int`): Size in bytes of the window policy. CUDA driver may restrict the maximum size and alignment. * ​hit\_ratio (`SIMD[float32, 1]`): Specifies percentage of lines assigned hit\_prop, rest are assigned miss\_prop. Value should be between 0.0 and 1.0. * ​hit\_prop (`AccessProperty`): AccessProperty applied to hit segments within the window. * ​miss\_prop (`AccessProperty`): AccessProperty applied to miss segments within the window. Must be either NORMAL or STREAMING. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Initializes a new AccessPolicyWindow with default values. `__init__[T: AnyType](*, base_ptr: UnsafePointer[T, address_space=address_space, alignment=alignment, mut=mut, origin=origin], count: Int, hit_ratio: SIMD[float32, 1], hit_prop: AccessProperty = AccessProperty(__init__[__mlir_type.!pop.int_literal](0)), miss_prop: AccessProperty = AccessProperty(__init__[__mlir_type.!pop.int_literal](0))) -> Self` Initializes an `AccessPolicyWindow` for a typed memory region. **Parameters:** * ​T (`AnyType`): The type of data in the memory region. **Args:** * ​base\_ptr (`UnsafePointer[T, address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the start of the memory region. * ​count (`Int`): Number of elements of type T in the memory region. 
* ​hit\_ratio (`SIMD[float32, 1]`): Fraction of the window that should use hit\_prop (0.0 to 1.0). * ​hit\_prop (`AccessProperty`): Access property for hit segments (default: NORMAL). * ​miss\_prop (`AccessProperty`): Access property for miss segments (default: NORMAL). ### `__str__` `__str__(self) -> String` Returns a string representation of the `AccessPolicyWindow`. **Returns:** A string representation of the `AccessPolicyWindow`. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes a string representation of the `AccessPolicyWindow` to a writer. This method formats all the fields of the AccessPolicyWindow into a human-readable string representation and writes it to the provided writer. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer trait. **Args:** * ​writer (`W`): The writer instance to write the formatted string to. --- ## AccessProperty `@register_passable(trivial)` `struct AccessProperty` Specifies a performance hint with `AccessPolicyWindow` for hit\_prop and miss\_prop fields. This struct defines cache persistence properties that can be used with `AccessPolicyWindow` to control how data is cached during GPU memory accesses. It provides hints to the memory subsystem about the expected access patterns, which can improve performance for specific workloads. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`, `Writable` ## Aliases ### `NORMAL` `alias NORMAL = AccessProperty(__init__[__mlir_type.!pop.int_literal](0))` Normal cache persistence with default caching behavior. ### `PERSISTING` `alias PERSISTING = AccessProperty(__init__[__mlir_type.!pop.int_literal](2))` Persisting access is more likely to persist in cache, optimized for reused data. ### `STREAMING` `alias STREAMING = AccessProperty(__init__[__mlir_type.!pop.int_literal](1))` Streaming access is less likely to persist in cache, optimized for single-use data. ## Methods ### `__init__` `__init__(*, other: Self) -> Self` Explicitly construct a deep copy of the provided value. **Args:** * ​other (`Self`): The value to copy. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compares two `AccessProperty` instances for equality. **Args:** * ​other (`Self`): The `AccessProperty` to compare with. **Returns:** True if the instances have the same value, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compares two `AccessProperty` instances for inequality. **Args:** * ​other (`Self`): The `AccessProperty` to compare with. **Returns:** True if the instances have different values, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Checks if two `AccessProperty` instances have the same value. **Args:** * ​other (`Self`): The `AccessProperty` to compare with. **Returns:** True if the instances have the same value, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Checks if two `AccessProperty` instances have different values. **Args:** * ​other (`Self`): The `AccessProperty` to compare with. **Returns:** True if the instances have different values, False otherwise. ### `__str__` `__str__(self) -> String` Returns a string representation of the `AccessProperty`. **Returns:** A string representation of the `AccessProperty`. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes a string representation of the `AccessProperty` to a writer. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer trait. **Args:** * ​writer (`W`): The writer instance to write the formatted string to.
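To make the relationship between these two types concrete, here is a minimal, hedged sketch that builds an `AccessPolicyWindow` over a freshly allocated buffer; the buffer size and hit ratio are arbitrary:

```mojo
from gpu.host.launch_attribute import AccessPolicyWindow, AccessProperty

def main():
    var data = UnsafePointer[Scalar[DType.float32]].alloc(1024)
    # Accesses in hit segments are more likely to persist in cache;
    # miss segments are treated as streaming (single-use) data.
    var window = AccessPolicyWindow(
        base_ptr=data,
        count=1024,
        hit_ratio=0.6,
        hit_prop=AccessProperty.PERSISTING,
        miss_prop=AccessProperty.STREAMING,
    )
    print(window)  # AccessPolicyWindow is Writable.
    data.free()
```

Because `LaunchAttribute` (below) has an `@implicit` initializer from `AccessPolicyWindow`, such a window can be passed wherever a launch attribute is expected.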
--- ## LaunchAttribute `@register_passable(trivial)` `struct LaunchAttribute` Represents a complete launch attribute with ID and value. This struct combines a `LaunchAttributeID` and `LaunchAttributeValue` to form a complete attribute that can be passed to GPU kernel launches. It provides a way to specify various execution parameters that control kernel behavior. ## Fields * ​id (`LaunchAttributeID`): The identifier specifying the type of this launch attribute. * ​\_\_pad (`StaticTuple[SIMD[uint8, 1], ((sizeof[::AnyType,__mlir_type.!kgen.target]() * -1) + 8)]`): Padding to ensure proper alignment of the structure. * ​value (`LaunchAttributeValue`): The value associated with this launch attribute. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Initializes a new LaunchAttribute with IGNORE ID and zeroed value. `__init__(id: LaunchAttributeID, value: LaunchAttributeValue) -> Self` Initializes a `LaunchAttribute` with a specific ID and value. **Args:** * ​id (`LaunchAttributeID`): The `LaunchAttributeID` to set. * ​value (`LaunchAttributeValue`): The `LaunchAttributeValue` to set. `@implicit` `__init__(policy: AccessPolicyWindow) -> Self` Initializes a `LaunchAttribute` from an `AccessPolicyWindow`. Creates a launch attribute with `ACCESS_POLICY_WINDOW` ID and the provided policy. **Args:** * ​policy (`AccessPolicyWindow`): The `AccessPolicyWindow` to use for this attribute. ### `from_cluster_dim` `static from_cluster_dim(dim: Dim) -> Self` Creates a `LaunchAttribute` for cluster dimensions. Creates a launch attribute with `CLUSTER_DIMENSION` ID and the provided dimensions. **Args:** * ​dim (`Dim`): The dimensions to use for this attribute. **Returns:** A new `LaunchAttribute` configured with the specified cluster dimensions. --- ## LaunchAttributeID `@register_passable(trivial)` `struct LaunchAttributeID` Identifies the type of launch attribute for GPU kernel execution. This struct represents the various types of launch attributes that can be specified when launching GPU kernels or configuring streams and graph nodes. Each attribute controls different aspects of kernel execution behavior such as memory access policies, synchronization, scheduling, and resource allocation. The attributes are compatible with CUDA's launch attribute system and provide fine-grained control over kernel execution characteristics. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`, `Writable` ## Aliases ### `ACCESS_POLICY_WINDOW` `alias ACCESS_POLICY_WINDOW = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](1))` Valid for streams, graph nodes, launches. ### `CLUSTER_DIMENSION` `alias CLUSTER_DIMENSION = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](4))` Valid for graph nodes, launches. ### `CLUSTER_SCHEDULING_POLICY_PREFERENCE` `alias CLUSTER_SCHEDULING_POLICY_PREFERENCE = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](5))` Valid for graph nodes, launches. ### `COOPERATIVE` `alias COOPERATIVE = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](2))` Valid for graph nodes, launches. ### `DEVICE_UPDATABLE_KERNEL_NODE` `alias DEVICE_UPDATABLE_KERNEL_NODE = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](13))` Valid for graph nodes, launches. This attribute is graphs-only, and passing it to a launch in a non-capturing stream will result in an error.
CUlaunchAttributeValue::deviceUpdatableKernelNode::deviceUpdatable can only be set to 0 or 1. Setting the field to 1 indicates that the corresponding kernel node should be device-updatable. On success, a handle will be returned via CUlaunchAttributeValue::deviceUpdatableKernelNode::devNode which can be passed to the various device-side update functions to update the node's kernel parameters from within another kernel. For more information on the types of device updates that can be made, as well as the relevant limitations thereof, see cudaGraphKernelNodeUpdatesApply. Nodes which are device-updatable have additional restrictions compared to regular kernel nodes. Firstly, device-updatable nodes cannot be removed from their graph via cuGraphDestroyNode. Additionally, once opted-in to this functionality, a node cannot opt out, and any attempt to set the deviceUpdatable attribute to 0 will result in an error. Device-updatable kernel nodes also cannot have their attributes copied to/from another kernel node via cuGraphKernelNodeCopyAttributes. Graphs containing one or more device-updatable nodes also do not allow multiple instantiation, and neither the graph nor its instantiated version can be passed to cuGraphExecUpdate. If a graph contains device-updatable nodes and updates those nodes from the device from within the graph, the graph must be uploaded with cuGraphUpload before it is launched. For such a graph, if host-side executable graph updates are made to the device-updatable nodes, the graph must be uploaded before it is launched again. ### `IGNORE` `alias IGNORE = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](0))` Ignored entry, for convenient composition. ### `LAUNCH_COMPLETION_EVENT` `alias LAUNCH_COMPLETION_EVENT = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](12))` Valid for launches. Set CUlaunchAttributeValue::launchCompletionEvent to record the event. Nominally, the event is triggered once all blocks of the kernel have begun execution. Currently this is a best effort. If a kernel B has a launch completion dependency on a kernel A, B may wait until A is complete. Alternatively, blocks of B may begin before all blocks of A have begun, for example if B can claim execution resources unavailable to A (e.g. they run on different GPUs) or if B is a higher priority than A. Exercise caution if such an ordering inversion could lead to deadlock. A launch completion event is nominally similar to a programmatic event with triggerAtBlockStart set except that it is not visible to cudaGridDependencySynchronize() and can be used with compute capability less than 9.0. The event supplied must not be an interprocess or interop event. The event must disable timing (i.e. must be created with the CU\_EVENT\_DISABLE\_TIMING flag set). ### `MEM_SYNC_DOMAIN` `alias MEM_SYNC_DOMAIN = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](10))` Valid for streams, graph nodes, launches. ### `MEM_SYNC_DOMAIN_MAP` `alias MEM_SYNC_DOMAIN_MAP = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](9))` Valid for streams, graph nodes, launches. ### `PREFERRED_SHARED_MEMORY_CARVEOUT` `alias PREFERRED_SHARED_MEMORY_CARVEOUT = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](14))` Valid for launches. 
On devices where the L1 cache and shared memory use the same hardware resources, setting CUlaunchAttributeValue::sharedMemCarveout to a percentage between 0 and 100 signals the CUDA driver to set the shared memory carveout preference, in percent of the total shared memory for that kernel launch. This attribute takes precedence over CU\_FUNC\_ATTRIBUTE\_PREFERRED\_SHARED\_MEMORY\_CARVEOUT. This is only a hint, and the CUDA driver can choose a different configuration if required for the launch. ### `PRIORITY` `alias PRIORITY = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](8))` Valid for streams, graph nodes, launches. ### `PROGRAMMATIC_EVENT` `alias PROGRAMMATIC_EVENT = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](7))` Valid for launches. Set CUlaunchAttributeValue::programmaticEvent to record the event. An event recorded through this launch attribute is guaranteed to trigger only after all blocks in the associated kernel trigger the event. A block can trigger the event through PTX launchdep.release or the CUDA builtin function cudaTriggerProgrammaticLaunchCompletion(). A trigger can also be inserted at the beginning of each block's execution if triggerAtBlockStart is set to non-0. The dependent launches can choose to wait on the dependency using the programmatic sync (cudaGridDependencySynchronize() or equivalent PTX instructions). Note that dependents (including the CPU thread calling cuEventSynchronize()) are not guaranteed to observe the release precisely when it is released. For example, cuEventSynchronize() may only observe the event trigger long after the associated kernel has completed. This recording type is primarily meant for establishing programmatic dependency between device tasks. Note also this type of dependency allows, but does not guarantee, concurrent execution of tasks. The event supplied must not be an interprocess or interop event. The event must disable timing (i.e. must be created with the CU\_EVENT\_DISABLE\_TIMING flag set). ### `PROGRAMMATIC_STREAM_SERIALIZATION` `alias PROGRAMMATIC_STREAM_SERIALIZATION = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](6))` Valid for launches. Setting CUlaunchAttributeValue::programmaticStreamSerializationAllowed to non-0 signals that the kernel will use programmatic means to resolve its stream dependency, so that the CUDA runtime should opportunistically allow the grid's execution to overlap with the previous kernel in the stream, if that kernel requests the overlap. The dependent launches can choose to wait on the dependency using the programmatic sync. ### `SYNCHRONIZATION_POLICY` `alias SYNCHRONIZATION_POLICY = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](3))` Valid for streams. ## Methods ### `__init__` `__init__(*, other: Self) -> Self` Explicitly construct a deep copy of the provided value. **Args:** * ​other (`Self`): The value to copy. ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two `LaunchAttributeID` instances are equal. Compares the underlying integer values of the attributes. **Args:** * ​other (`Self`): The other `LaunchAttributeID` instance to compare with. **Returns:** True if the attributes are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if two `LaunchAttributeID` instances are not equal. **Args:** * ​other (`Self`): The other `LaunchAttributeID` instance to compare with. **Returns:** True if the attributes are not equal, False otherwise.
### `__is__` `__is__(self, other: Self) -> Bool` Checks if two `LaunchAttributeID` instances have the same value. This is an identity comparison that delegates to equality comparison. **Args:** * ​other (`Self`): The other `LaunchAttributeID` instance to compare with. **Returns:** True if the attributes have the same value, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Checks if two `LaunchAttributeID` instances have different values. **Args:** * ​other (`Self`): The other `LaunchAttributeID` instance to compare with. **Returns:** True if the attributes have different values, False otherwise. ### `__str__` `__str__(self) -> String` Returns a string representation of the `LaunchAttributeID`. **Returns:** A string representation of the attribute. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the string representation of the attribute to a writer. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer interface. **Args:** * ​writer (`W`): The writer to write to. --- ## LaunchAttributeValue `@register_passable(trivial)` `struct LaunchAttributeValue` Represents a value for a CUDA launch attribute. This struct emulates a C union to store different types of launch attribute values. It provides fixed-size storage that can be initialized with different attribute types such as AccessPolicyWindow or dimension specifications. Note: This implementation uses a fixed-size byte array to emulate the union behavior defined in the CUDA Driver API's CUlaunchAttributeValue. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Initializes a new `LaunchAttributeValue` with zeroed storage. `@implicit` `__init__(policy: AccessPolicyWindow) -> Self` Initializes a `LaunchAttributeValue` from an `AccessPolicyWindow`. **Args:** * ​policy (`AccessPolicyWindow`): The `AccessPolicyWindow` to store in this attribute value. `@implicit` `__init__(dim: Dim) -> Self` Initializes a `LaunchAttributeValue` from a `Dim` (dimension) object. **Args:** * ​dim (`Dim`): The dimension specification to store in this attribute value. `@implicit` `__init__(value: Bool) -> Self` Initializes a `LaunchAttributeValue` from a boolean value. **Args:** * ​value (`Bool`): The boolean value to store in this attribute value. --- ## launch_attribute GPU Launch Attributes for Kernel Execution Control This module provides structures for configuring GPU kernel execution through launch attributes. It implements a Mojo interface to CUDA's launch attribute system, allowing fine-grained control over kernel execution characteristics such as memory access policies, synchronization behavior, cluster dimensions, and resource allocation. The main components include: * `LaunchAttributeID`: Identifies different types of launch attributes * `LaunchAttributeValue`: Stores the value for a specific attribute type * `LaunchAttribute`: Combines an ID and value to form a complete attribute * `AccessPolicyWindow`: Configures memory access patterns and caching behavior * `AccessProperty`: Defines cache persistence properties for memory access These structures enable optimizing GPU kernel performance by controlling execution parameters at a granular level, similar to CUDA's native launch attribute system. ## Structs * [​`AccessPolicyWindow`](/mojo/stdlib/gpu/host/launch_attribute/AccessPolicyWindow): Specifies an access policy for a window of memory.
* [​`AccessProperty`](/mojo/stdlib/gpu/host/launch_attribute/AccessProperty): Specifies a performance hint with `AccessPolicyWindow` for hit\_prop and miss\_prop fields. * [​`LaunchAttribute`](/mojo/stdlib/gpu/host/launch_attribute/LaunchAttribute): Represents a complete launch attribute with ID and value. * [​`LaunchAttributeID`](/mojo/stdlib/gpu/host/launch_attribute/LaunchAttributeID): Identifies the type of launch attribute for GPU kernel execution. * [​`LaunchAttributeValue`](/mojo/stdlib/gpu/host/launch_attribute/LaunchAttributeValue): Represents a value for a CUDA launch attribute. --- ## id This module provides GPU thread and block indexing functionality. It defines aliases and functions for accessing GPU grid, block, thread and cluster dimensions and indices. These are essential primitives for GPU programming that allow code to determine its position and dimensions within the GPU execution hierarchy. Most functionality is architecture-agnostic, with some NVIDIA-specific features clearly marked. The module is designed to work seamlessly across different GPU architectures while providing optimal performance through hardware-specific optimizations where applicable. ## Aliases ### `block_dim` `alias block_dim = _BlockDim()` Contains the dimensions of the block as `x`, `y`, and `z` values (for example, `block_dim.y`). ### `block_id_in_cluster` `alias block_id_in_cluster = _Cluster_BlockIdx()` Contains the block id of the threadblock within a cluster, as `x`, `y`, and `z` values. ### `block_idx` `alias block_idx = _BlockIdx()` Contains the block index in the grid, as `x`, `y`, and `z` values. ### `cluster_dim` `alias cluster_dim = _ClusterDim()` Contains the dimensions of the cluster, as `x`, `y`, and `z` values. ### `cluster_idx` `alias cluster_idx = _ClusterIdx()` Contains the cluster index in the grid, as `x`, `y`, and `z` values. ### `global_idx` `alias global_idx = _GridIdx()` Contains the global offset of the kernel launch, as `x`, `y`, and `z` values. ### `grid_dim` `alias grid_dim = _GridDim()` Provides accessors for getting the `x`, `y`, and `z` dimensions of a grid. ### `thread_idx` `alias thread_idx = _ThreadIdx()` Contains the thread index in the block, as `x`, `y`, and `z` values. ## Functions * [​`lane_id`](/mojo/stdlib/gpu/id/lane_id): Returns the lane ID of the current thread within its warp. * [​`sm_id`](/mojo/stdlib/gpu/id/sm_id): Returns the Streaming Multiprocessor (SM) ID of the current thread. * [​`warp_id`](/mojo/stdlib/gpu/id/warp_id): Returns the warp ID of the current thread within its block. --- ## lane_id `lane_id() -> UInt` Returns the lane ID of the current thread within its warp. The lane ID is a unique identifier for each thread within a warp, ranging from 0 to WARP\_SIZE-1. This ID is commonly used for warp-level programming and thread synchronization within a warp. **Returns:** The lane ID (0 to WARP\_SIZE-1) of the current thread. --- ## sm_id `sm_id() -> UInt` Returns the Streaming Multiprocessor (SM) ID of the current thread. The SM ID uniquely identifies which physical streaming multiprocessor the thread is executing on. This is useful for SM-level optimizations and understanding hardware utilization. If called on non-NVIDIA GPUs, this function aborts as this functionality is only supported on NVIDIA hardware. **Returns:** The SM ID of the current thread.
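To show how the indexing aliases and warp-level ID functions in this module fit together, here is a minimal, hedged kernel sketch (the function name and its arguments are hypothetical):

```mojo
from gpu.id import block_dim, block_idx, thread_idx, lane_id

fn fill_lanes(out: UnsafePointer[Scalar[DType.float32]], size: UInt):
    # Flat global index computed from the block/thread hierarchy.
    var i = block_idx.x * block_dim.x + thread_idx.x
    if i < size:
        # Record this thread's position within its warp (0 to WARP_SIZE-1).
        out[i] = Scalar[DType.float32](lane_id())
```

A kernel like this is compiled and launched on an accelerator through [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext).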
--- ## warp_id `warp_id() -> UInt` Returns the warp ID of the current thread within its block. The warp ID is a unique identifier for each warp within a block, ranging from 0 to BLOCK\_SIZE/WARP\_SIZE-1. This ID is commonly used for warp-level programming and synchronization within a block. **Returns:** The warp ID (0 to BLOCK\_SIZE/WARP\_SIZE-1) of the current thread. --- ## gpu Provides low-level programming constructs for working with GPUs. These low-level constructs allow you to write code that runs on the GPU in a traditional programming style, partitioning work across threads that are mapped onto 1-, 2-, or 3-dimensional blocks. The thread blocks can subsequently be grouped into a grid of thread blocks. A *kernel* is a function that runs on the GPU in parallel across many threads. Currently, the [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext) struct provides the interface for compiling and launching GPU kernels inside MAX [custom operations](/max/custom-ops/). The [`gpu.host`](/mojo/stdlib/gpu/host/) package includes APIs to manage interaction between the *host* (that is, the CPU) and *device* (that is, the GPU or accelerator). See the [`gpu.id`](/mojo/stdlib/gpu/id#aliases) module for a list of aliases you can use to access information about the grid and the current thread, including block dimensions, block index in the grid, and thread index. The [`sync`](/mojo/stdlib/gpu/sync/) module provides functions for synchronizing threads. For an example of launching a GPU kernel from a MAX custom operation, see the [vector addition example](https://github.com/modular/modular/blob/main/examples/custom_ops/kernels/vector_addition.mojo) in the MAX repo. ## Packages * [​`comm`](/mojo/stdlib/gpu/comm/): The `gpu.comm` package provides communication primitives for GPUs. * [​`host`](/mojo/stdlib/gpu/host/): Implements the gpu host package. ## Modules * [​`block`](/mojo/stdlib/gpu/block/): GPU block-level operations and utilities. * [​`cluster`](/mojo/stdlib/gpu/cluster/): This module provides low-level NVIDIA GPU cluster synchronization primitives for SM90+ architectures. * [​`globals`](/mojo/stdlib/gpu/globals/): This module provides GPU-specific global constants and configuration values. * [​`grid_controls`](/mojo/stdlib/gpu/grid_controls/): Grid Dependent Control primitives for NVIDIA Hopper (SM90+) GPUs. * [​`id`](/mojo/stdlib/gpu/id/): This module provides GPU thread and block indexing functionality. * [​`intrinsics`](/mojo/stdlib/gpu/intrinsics/): Provides low-level GPU intrinsic operations and memory access primitives. * [​`memory`](/mojo/stdlib/gpu/memory/): This module provides GPU memory operations and utilities. * [​`mma`](/mojo/stdlib/gpu/mma/): This module includes utilities for working with the warp-matrix-matrix-multiplication (wmma) instructions. * [​`mma_operand_descriptor`](/mojo/stdlib/gpu/mma_operand_descriptor/): * [​`mma_sm100`](/mojo/stdlib/gpu/mma_sm100/): This module includes utilities for working with the SM100 MMA instructions. * [​`mma_util`](/mojo/stdlib/gpu/mma_util/): Matrix multiply accumulate (MMA) utilities for GPU tensor cores. * [​`profiler`](/mojo/stdlib/gpu/profiler/): This module provides GPU profiling functionality. * [​`random`](/mojo/stdlib/gpu/random/): Random number generation for GPU kernels. * [​`semaphore`](/mojo/stdlib/gpu/semaphore/): This module provides a device-wide semaphore implementation for NVIDIA GPUs. * [​`sync`](/mojo/stdlib/gpu/sync/): This module provides GPU synchronization primitives and barriers.
* [​`tcgen05`](/mojo/stdlib/gpu/tcgen05/): This module includes utilities for working with the tensorcore 5th generation (tcgen05) instructions. * [​`tensor_ops`](/mojo/stdlib/gpu/tensor_ops/): This module provides tensor core operations and utilities for GPU computation. * [​`warp`](/mojo/stdlib/gpu/warp/): GPU warp-level operations and utilities. --- ## Scope `struct Scope` Represents memory synchronization scope levels for GPU memory operations. Defines different scopes of memory visibility and synchronization, from thread-local to system-wide. Each scope level determines how memory operations are ordered and visible across different execution units. The scope levels form a hierarchy, with each higher level providing stronger ordering guarantees but potentially higher synchronization costs. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility`, `Writable` ## Aliases ### `BLOCK` `alias BLOCK = Scope(3)` Block-level scope. Memory operations ordered within a thread block/CTA. ### `CLUSTER` `alias CLUSTER = Scope(4)` Cluster-level scope. Memory operations ordered within a thread block cluster. ### `GPU` `alias GPU = Scope(5)` GPU-level scope. Memory operations are ordered across all threads on the GPU. ### `NONE` `alias NONE = Scope(0)` No memory ordering guarantees. Operations may be reordered freely. ### `SYSTEM` `alias SYSTEM = Scope(6)` System-wide scope. Memory operations ordered across the entire system. ### `THREAD` `alias THREAD = Scope(1)` Thread-level scope. Memory operations are ordered within a single thread. ### `WARP` `alias WARP = Scope(2)` Warp-level scope. Memory operations are ordered within a warp of threads. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two `Scope` instances are equal. Uses pointer comparison for efficiency. **Args:** * ​other (`Self`): The other `Scope` instance to compare with. **Returns:** True if the instances are the same, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if two `Scope` instances are not equal. **Args:** * ​other (`Self`): The other `Scope` instance to compare with. **Returns:** True if the instances are different, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Checks if two `Scope` instances have the same value. Compares the underlying integer values. **Args:** * ​other (`Self`): The other `Scope` instance to compare with. **Returns:** True if the values are the same, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Checks if two `Scope` instances have different values. **Args:** * ​other (`Self`): The other `Scope` instance to compare with. **Returns:** True if the values are different, False otherwise. ### `write_to` `write_to[W: Writer](self, mut w: W)` Writes the string representation of the scope to a writer. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer interface. **Args:** * ​w (`W`): The writer to write to. ### `__str__` `__str__(self) -> String` Returns the string representation of the memory scope. **Returns:** A string representation of the memory scope. ### `__repr__` `__repr__(self) -> String` Returns the string representation of the memory scope. **Returns:** A string representation of the memory scope. ### `mnemonic` `mnemonic(self) -> StringSlice[StaticConstantOrigin]` Returns the mnemonic string representation of the memory scope. Converts the memory scope level into a string mnemonic used by LLVM/NVVM intrinsics for memory operations. **Returns:** A string literal containing the mnemonic.
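As a hedged sketch of how a `Scope` is threaded through the acquire/release intrinsics documented later in this module (`load_acquire` and `store_release`), assuming a flag pointer that was allocated in global memory elsewhere:

```mojo
from gpu.intrinsics import Scope, load_acquire, store_release

fn publish(flag: UnsafePointer[Scalar[DType.int32]]):
    # Make this thread's prior writes visible GPU-wide, then raise the flag.
    store_release[scope=Scope.GPU](flag, 1)

fn consume(flag: UnsafePointer[Scalar[DType.int32]]) -> Scalar[DType.int32]:
    # No subsequent memory operation executes until this load completes.
    return load_acquire[scope=Scope.GPU](flag)
```

Choosing a narrower scope (for example, `Scope.BLOCK`) can reduce synchronization cost when visibility is only needed within a thread block, per the hierarchy described above.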
--- ## buffer_load `buffer_load[type: DType, width: Int](src_resource: SIMD[uint32, 4], gds_offset: SIMD[int32, 1]) -> SIMD[type, width]` Loads data from global memory into a SIMD register. This function provides a hardware-accelerated global memory load operation that maps directly to the AMDGPU buffer\_load instruction. It efficiently transfers data from global memory to registers. Note: * Only supported on AMD GPUs. * Uses non-glc loads by default (can hit L1 cache and persist across wavefronts). * Supports widths that map to 1, 2, 4, 8, or 16 byte loads. * Maps directly to llvm.amdgcn.raw.buffer.load intrinsics. **Parameters:** * ​type (`DType`): The data type to load. * ​width (`Int`): The SIMD vector width for vectorized loads. **Args:** * ​src\_resource (`SIMD[uint32, 4]`): Buffer resource descriptor created by make\_buffer\_resource(). * ​gds\_offset (`SIMD[int32, 1]`): Offset in elements (not bytes) from the base address in the resource. **Returns:** SIMD vector containing the loaded data. --- ## buffer_load_store_lds `buffer_load_store_lds[type: DType](src_resource: SIMD[uint32, 4], gds_offset: SIMD[int32, 1], lds_ptr_base: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(3)], lds_offset: SIMD[int32, 1])` Loads four bytes from global memory and writes them to shared memory. Copies from global memory to shared memory (also known as LDS), bypassing registers. **Parameters:** * ​type (`DType`): The type of the data to be loaded. **Args:** * ​src\_resource (`SIMD[uint32, 4]`): Buffer resource descriptor from make\_buffer\_resource. * ​gds\_offset (`SIMD[int32, 1]`): Global memory offset. * ​lds\_ptr\_base (`UnsafePointer[SIMD[type, 1], address_space=AddressSpace(3)]`): LDS base address. * ​lds\_offset (`SIMD[int32, 1]`): LDS offset. --- ## buffer_store `buffer_store[type: DType, width: Int](src_resource: SIMD[uint32, 4], gds_offset: SIMD[int32, 1], val: SIMD[type, width])` Stores a register variable to global memory. Writes to global memory from a register. **Parameters:** * ​type (`DType`): The data type. * ​width (`Int`): The SIMD vector width. **Args:** * ​src\_resource (`SIMD[uint32, 4]`): Buffer resource descriptor. * ​gds\_offset (`SIMD[int32, 1]`): Global memory offset. * ​val (`SIMD[type, width]`): Value to write. --- ## byte_permute `byte_permute(a: SIMD[uint32, 1], b: SIMD[uint32, 1], c: SIMD[uint32, 1]) -> SIMD[uint32, 1]` Permutes bytes from two 32-bit integers based on a control mask. Selects and rearranges bytes from two source integers based on a control mask to create a new 32-bit value. Note: Byte selection behavior depends on the GPU architecture: * On NVIDIA: Maps to PRMT instruction * On AMD: Maps to PERM instruction. **Args:** * ​a (`SIMD[uint32, 1]`): First source integer containing bytes to select from. * ​b (`SIMD[uint32, 1]`): Second source integer containing bytes to select from. * ​c (`SIMD[uint32, 1]`): Control mask that specifies which bytes to select and their positions. Each byte in the mask controls selection/placement of one output byte. **Returns:** A new 32-bit integer containing the selected and rearranged bytes. --- ## intrinsics Provides low-level GPU intrinsic operations and memory access primitives. Implements hardware-specific intrinsics that map directly to GPU assembly instructions, focusing on NVIDIA GPU architectures.
Includes: * Global memory load/store operations with cache control * Warp-level primitives and synchronization * Memory fence and barrier operations * Atomic operations and memory ordering primitives These low-level primitives should be used carefully as they correspond directly to hardware instructions and require understanding of the underlying GPU architecture. ## Structs * [​`Scope`](/mojo/stdlib/gpu/intrinsics/Scope): Represents memory synchronization scope levels for GPU memory operations. ## Functions * [​`buffer_load`](/mojo/stdlib/gpu/intrinsics/buffer_load): Loads data from global memory into a SIMD register. * [​`buffer_load_store_lds`](/mojo/stdlib/gpu/intrinsics/buffer_load_store_lds): Loads four bytes from global memory and writes them to shared memory. * [​`buffer_store`](/mojo/stdlib/gpu/intrinsics/buffer_store): Stores a register variable to global memory. * [​`byte_permute`](/mojo/stdlib/gpu/intrinsics/byte_permute): Permutes bytes from two 32-bit integers based on a control mask. * [​`ldg`](/mojo/stdlib/gpu/intrinsics/ldg): Load data from global memory through the non-coherent cache. * [​`load_acquire`](/mojo/stdlib/gpu/intrinsics/load_acquire): Performs an atomic load operation with acquire memory ordering semantics. * [​`load_volatile`](/mojo/stdlib/gpu/intrinsics/load_volatile): Performs a volatile load operation that cannot be optimized away. * [​`lop`](/mojo/stdlib/gpu/intrinsics/lop): Performs an arbitrary logical operation on 3 inputs using a lookup table. * [​`make_buffer_resource`](/mojo/stdlib/gpu/intrinsics/make_buffer_resource): Creates a 128-bit buffer resource descriptor for AMD GPU buffer operations. * [​`mulhi`](/mojo/stdlib/gpu/intrinsics/mulhi): Calculates the most significant 32 bits of the product of two 16-bit unsigned integers. * [​`mulwide`](/mojo/stdlib/gpu/intrinsics/mulwide): Performs a wide multiplication of two 32-bit unsigned integers. * [​`store_release`](/mojo/stdlib/gpu/intrinsics/store_release): Performs an atomic store with release memory ordering semantics. * [​`store_volatile`](/mojo/stdlib/gpu/intrinsics/store_volatile): Performs a volatile store operation that cannot be optimized away. * [​`threadfence`](/mojo/stdlib/gpu/intrinsics/threadfence): Enforces ordering of memory operations across threads. * [​`warpgroup_reg_alloc`](/mojo/stdlib/gpu/intrinsics/warpgroup_reg_alloc): Allocates additional registers for the executing warp group. * [​`warpgroup_reg_dealloc`](/mojo/stdlib/gpu/intrinsics/warpgroup_reg_dealloc): Deallocates additional registers for the executing warp group. --- ## ldg `ldg[type: DType, //, width: Int = 1, *, alignment: Int = alignof[::AnyType,__mlir_type.!kgen.target]()](x: UnsafePointer[SIMD[type, 1]]) -> SIMD[type, width]` Load data from global memory through the non-coherent cache. This function provides a hardware-accelerated global memory load operation that uses the GPU's non-coherent cache (equivalent to CUDA's `__ldg` instruction). It optimizes for read-only data access patterns. Note: * Uses invariant loads which indicate the memory won't change during kernel execution. * Particularly beneficial for read-only texture-like access patterns. * May improve performance on memory-bound kernels. **Parameters:** * ​type (`DType`): The data type to load (must be numeric). * ​width (`Int`): The SIMD vector width for vectorized loads. * ​alignment (`Int`): Memory alignment in bytes. Defaults to natural alignment of the SIMD vector type. 
**Args:** * ​x (`UnsafePointer[SIMD[type, 1]]`): Pointer to global memory location to load from. **Returns:** SIMD vector containing the loaded data. --- ## load_acquire `load_acquire[type: DType, //, *, scope: Scope = Scope(6), memory: Bool = True](ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> SIMD[type, 1]` Performs an atomic load operation with acquire memory ordering semantics. This function provides a memory barrier that ensures no subsequent memory operations from the calling thread are executed until after this load completes. Note: * Only supported on GPUs. * Maps directly to PTX ld.acquire instruction on NVIDIA, LLVM atomic load on AMDGPU. * Ensures subsequent memory operations don't execute until after load. * Critical for implementing synchronization primitives. **Parameters:** * ​type (`DType`): The data type to load. * ​scope (`Scope`): Memory scope for the operation (default: Scope.SYSTEM). * ​memory (`Bool`): Whether to include memory side effects in constraints (default: True). **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to load from. **Returns:** The loaded value. --- ## load_volatile `load_volatile[type: DType, //, memory: Bool = True](ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> SIMD[type, 1]` Performs a volatile load operation that cannot be optimized away. This function guarantees that the load operation will be performed exactly as specified, without being reordered or optimized away by the compiler. Note: * Only supported on NVIDIA GPUs. * Maps directly to PTX ld.volatile instruction. * Prevents compiler optimization of the load operation. * Useful for memory-mapped I/O or synchronization primitives. * May have performance implications compared to regular loads. **Parameters:** * ​type (`DType`): The data type to load. * ​memory (`Bool`): Whether to include memory side effects in constraints (default: True). **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to load from. **Returns:** The loaded value. --- ## lop `lop[lut: SIMD[int32, 1]](a: SIMD[int32, 1], b: SIMD[int32, 1], c: SIMD[int32, 1]) -> SIMD[int32, 1]` Performs an arbitrary logical operation on 3 inputs using a lookup table. Implements a 3-input lookup table (LUT) operation. The result is determined by bits in the lookup table value for each input combination. Note: * Only supported on NVIDIA GPUs. * Maps to the LOP3.B32 PTX instruction. * Lookup table value determines output for each possible input combo. **Parameters:** * ​lut (`SIMD[int32, 1]`): 32-bit lookup table value that defines the logical operation. **Args:** * ​a (`SIMD[int32, 1]`): First input value. * ​b (`SIMD[int32, 1]`): Second input value. * ​c (`SIMD[int32, 1]`): Third input value. **Returns:** Result of applying the lookup table operation to the inputs. --- ## make_buffer_resource `make_buffer_resource[type: DType](gds_ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], num_records: Int = __init__[::Intable](SIMD(max_or_inf[::DType]()))) -> SIMD[uint32, 4]` Creates a 128-bit buffer resource descriptor for AMD GPU buffer operations. 
This function constructs a 128-bit buffer resource descriptor used by AMD GPUs for buffer load/store operations. The descriptor contains information about the memory location, size, and access properties needed by the hardware to perform memory operations. Notes: * Only supported on AMD GPUs. * The descriptor follows AMD's hardware-specific format: * Bits 0-63: Base address * Bits 64-95: Number of records (size) * Bits 96-127: Flags controlling access properties * Used with buffer\_load and buffer\_store operations. * Performance-critical for optimized memory access patterns on AMD GPUs. Example: ```mojo from gpu.intrinsics import make_buffer_resource var ptr = UnsafePointer[Scalar[DType.float32]].alloc(1024) var resource = make_buffer_resource[DType.float32](ptr, 1024) # Use resource with buffer_load/buffer_store operations ``` . **Parameters:** * ​type (`DType`): The data type of elements in the buffer. **Args:** * ​gds\_ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Global memory base address pointer to the start of the buffer. * ​num\_records (`Int`): Maximum number of records that can be accessed through this resource descriptor. Reads with offsets beyond this value return 0. Defaults to UInt32.MAX for maximum possible range. **Returns:** A 128-bit buffer resource descriptor as a SIMD\[DType.uint32, 4]. --- ## mulhi `mulhi(a: SIMD[uint16, 1], b: SIMD[uint16, 1]) -> SIMD[uint32, 1]` Calculates the most significant 32 bits of the product of two 16-bit unsigned integers. Multiplies two 16-bit unsigned integers and returns the high 32 bits of their product. Useful for fixed-point arithmetic and overflow detection. Note: On NVIDIA GPUs, this maps directly to the MULHI.U16 PTX instruction. On others, it performs multiplication using 32-bit arithmetic. **Args:** * ​a (`SIMD[uint16, 1]`): First 16-bit unsigned integer operand. * ​b (`SIMD[uint16, 1]`): Second 16-bit unsigned integer operand. **Returns:** The high 32 bits of the product a \* b. `mulhi(a: SIMD[int16, 1], b: SIMD[int16, 1]) -> SIMD[int32, 1]` Calculates the most significant 32 bits of the product of two 16-bit signed integers. Multiplies two 16-bit signed integers and returns the high 32 bits of their product. Useful for fixed-point arithmetic and overflow detection. Note: On NVIDIA GPUs, this maps directly to the MULHI.S16 PTX instruction. On others, it performs multiplication using 32-bit arithmetic. **Args:** * ​a (`SIMD[int16, 1]`): First 16-bit signed integer operand. * ​b (`SIMD[int16, 1]`): Second 16-bit signed integer operand. **Returns:** The high 32 bits of the product a \* b. `mulhi(a: SIMD[uint32, 1], b: SIMD[uint32, 1]) -> SIMD[uint32, 1]` Calculates the most significant 32 bits of the product of two 32-bit unsigned integers. Multiplies two 32-bit unsigned integers and returns the high 32 bits of their product. Useful for fixed-point arithmetic and overflow detection. Note: On NVIDIA GPUs, this maps directly to the MULHI.U32 PTX instruction. On others, it performs multiplication using 64-bit arithmetic. **Args:** * ​a (`SIMD[uint32, 1]`): First 32-bit unsigned integer operand. * ​b (`SIMD[uint32, 1]`): Second 32-bit unsigned integer operand. **Returns:** The high 32 bits of the product a \* b. `mulhi(a: SIMD[int32, 1], b: SIMD[int32, 1]) -> SIMD[int32, 1]` Calculates the most significant 32 bits of the product of two 32-bit signed integers. Multiplies two 32-bit signed integers and returns the high 32 bits of their product.
Useful for fixed-point arithmetic and overflow detection. Note: On NVIDIA GPUs, this maps directly to the MULHI.S32 PTX instruction. On others, it performs multiplication using 64-bit arithmetic. **Args:** * ​a (`SIMD[int32, 1]`): First 32-bit signed integer operand. * ​b (`SIMD[int32, 1]`): Second 32-bit signed integer operand. **Returns:** The high 32 bits of the product a \* b. --- ## mulwide `mulwide(a: SIMD[uint32, 1], b: SIMD[uint32, 1]) -> SIMD[uint64, 1]` Performs a wide multiplication of two 32-bit unsigned integers. Multiplies two 32-bit unsigned integers and returns the full 64-bit result. Useful when the product may exceed 32 bits. Note: On NVIDIA GPUs, this maps directly to the MUL.WIDE.U32 PTX instruction. On others, it performs multiplication using 64-bit casts. **Args:** * ​a (`SIMD[uint32, 1]`): First 32-bit unsigned integer operand. * ​b (`SIMD[uint32, 1]`): Second 32-bit unsigned integer operand. **Returns:** The full 64-bit product of a \* b. `mulwide(a: SIMD[int32, 1], b: SIMD[int32, 1]) -> SIMD[int64, 1]` Performs a wide multiplication of two 32-bit signed integers. Multiplies two 32-bit signed integers and returns the full 64-bit result. Useful when the product may exceed 32 bits or be negative. Note: On NVIDIA GPUs, this maps directly to the MUL.WIDE.S32 PTX instruction. On others, it performs multiplication using 64-bit casts. **Args:** * ​a (`SIMD[int32, 1]`): First 32-bit signed integer operand. * ​b (`SIMD[int32, 1]`): Second 32-bit signed integer operand. **Returns:** The full 64-bit signed product of a \* b. --- ## store_release `store_release[type: DType, //, scope: Scope = Scope(6), memory: Bool = True](ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], value: SIMD[type, 1])` Performs an atomic store with release memory ordering semantics. This function provides a memory barrier that ensures all previous memory operations from the calling thread are visible to other threads before this store is performed. Note: * Only supported on GPUs. * Maps directly to PTX st.release instruction on NVIDIA, LLVM atomic store on AMDGPU. * Ensures all previous memory operations complete before this store. * Critical for implementing synchronization primitives. **Parameters:** * ​type (`DType`): The data type to store. * ​scope (`Scope`): Memory scope for the operation (default: Scope.SYSTEM). * ​memory (`Bool`): Whether to include memory side effects in constraints (default: True). **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to store to. * ​value (`SIMD[type, 1]`): Value to store. --- ## store_volatile `store_volatile[type: DType, //, memory: Bool = True](ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], value: SIMD[type, 1])` Performs a volatile store operation that cannot be optimized away. This function guarantees that the store operation will be performed exactly as specified, without being reordered or optimized away by the compiler. Note: * Only supported on NVIDIA GPUs. * Maps directly to PTX st.volatile instruction. * Prevents compiler optimization of the store operation. * Useful for memory-mapped I/O or synchronization primitives. * May have performance implications compared to regular stores. **Parameters:** * ​type (`DType`): The data type to store. * ​memory (`Bool`): Whether to include memory side effects in constraints (default: True).
**Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to store to. * ​value (`SIMD[type, 1]`): Value to store. --- ## threadfence `threadfence[scope: Scope = Scope(5)]()` Enforces ordering of memory operations across threads. Acts as a memory fence/barrier that ensures all memory operations (both loads and stores) issued before the fence are visible to other threads within the specified scope before any memory operations after the fence. Note: * Maps directly to CUDA `__threadfence()` family of functions. * Critical for synchronizing memory access in parallel algorithms. * Performance impact increases with broader scopes. **Parameters:** * ​scope (`Scope`): Memory scope level for the fence. Defaults to GPU-wide scope. Valid values are: * Scope.BLOCK: Orders memory within a thread block/CTA. * Scope.GPU: Orders memory across all threads on the GPU (default). * Scope.SYSTEM: Orders memory across the entire system. --- ## warpgroup_reg_alloc `warpgroup_reg_alloc[count: Int]()` Allocates additional registers for the executing warp group. Hints to the system to increase per-thread registers owned by the executing warp. Requests additional registers to increase the absolute per-thread maximum register count from its current value to the specified count. Note: * Only supported on NVIDIA SM90+ GPUs. * Performance optimization hint that may be ignored by the hardware. * Pair with `warpgroup_reg_dealloc()` when extra registers are no longer needed. **Parameters:** * ​count (`Int`): The desired number of registers per thread. Must be: * A multiple of 8. * Between 24 and 256 (inclusive). --- ## warpgroup_reg_dealloc `warpgroup_reg_dealloc[count: Int]()` Deallocates additional registers for the executing warp group. Hints to the system to decrease per-thread registers owned by the executing warp. Releases extra registers to reduce the absolute per-thread maximum register count from its current value to the specified count. Note: * Only supported on NVIDIA SM90+ GPUs. * Performance optimization hint that may be ignored by the hardware. * Pair with `warpgroup_reg_alloc()` when extra registers are needed. **Parameters:** * ​count (`Int`): The desired number of registers per thread. Must be: * A multiple of 8. * Between 24 and 256 (inclusive). --- ## CacheEviction `@register_passable(trivial)` `struct CacheEviction` Represents cache eviction policies for GPU memory operations. This struct defines different cache eviction priorities that control how data is evicted from cache when space is needed. The policies affect cache utilization and performance by controlling which data gets evicted first. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `EVICT_FIRST` `alias EVICT_FIRST = CacheEviction(1)` Highest eviction priority - data will be evicted first. Data cached with this priority is marked as the first candidate for eviction when cache space is needed. This is optimal for: * Streaming data that will not be reused * Single-pass algorithms * Data with low temporal locality ### `EVICT_LAST` `alias EVICT_LAST = CacheEviction(2)` Lowest eviction priority - data will be evicted last. Data cached with this priority remains in cache until all higher priority data is evicted.
Best used for: * Frequently accessed data * Data needed across multiple kernel launches * Critical data structures that benefit from cache persistence ### `EVICT_NORMAL` `alias EVICT_NORMAL = CacheEviction(0)` Default cache eviction priority. Data cached with normal priority follows standard cache replacement policies. This is the default behavior and suitable for most general-purpose data access patterns where no special caching requirements exist. ### `EVICT_UNCHANGED` `alias EVICT_UNCHANGED = CacheEviction(3)` Preserves existing cache eviction priority. When this policy is used: * Existing cache entries maintain their current eviction priority * No changes are made to the cache replacement order * Useful for operations that should not affect caching behavior ### `NO_ALLOCATE` `alias NO_ALLOCATE = CacheEviction(4)` Prevents cache allocation for accessed data. Data is not cached when using this policy. Optimal for: * Large sequential reads/writes * Data that will only be accessed once * Preserving cache space for more critical data * Streaming operations with no data reuse ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Tests if two CacheEviction instances are equal. **Args:** * ​other (`Self`): The CacheEviction to compare against. **Returns:** True if the eviction policies are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Tests if two CacheEviction instances are not equal. **Args:** * ​other (`Self`): The CacheEviction to compare against. **Returns:** True if the eviction policies are not equal, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Tests if two CacheEviction instances are identical. **Args:** * ​other (`Self`): The CacheEviction to compare against. **Returns:** True if the eviction policies are identical, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Tests if two CacheEviction instances are not identical. **Args:** * ​other (`Self`): The CacheEviction to compare against. **Returns:** True if the eviction policies are not identical, False otherwise. ### `mnemonic` `mnemonic(self) -> StringSlice[StaticConstantOrigin]` Returns the string mnemonic for this cache eviction policy. Converts the cache eviction policy into its corresponding string representation used in GPU instructions and debugging. **Returns:** A string literal containing the mnemonic for this eviction policy. --- ## CacheOperation `@register_passable(trivial)` `struct CacheOperation` Represents different GPU cache operation policies. This struct defines various caching behaviors for GPU memory operations, controlling how data is cached and evicted at different cache levels. The policies affect performance and memory coherency. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ALWAYS` `alias ALWAYS = CacheOperation(0)` Cache at all levels. This will be accessed again. Best for data that will be frequently reused across multiple threads. Provides fastest subsequent access but uses the most cache space. ### `GLOBAL` `alias GLOBAL = CacheOperation(1)` Cache at global level. Caches data only in the L2 cache, bypassing L1. Good for data shared between different thread blocks. ### `LAST_USE` `alias LAST_USE = CacheOperation(3)` Indicates the cache line will not be used again. Hints to the cache that this data can be evicted after this access. Helps optimize cache utilization. ### `STREAMING` `alias STREAMING = CacheOperation(2)` Streaming, this is likely to be accessed once. 
Optimizes for streaming access patterns where data is only read once. May bypass certain cache levels for better throughput. ### `VOLATILE` `alias VOLATILE = CacheOperation(4)` Don't cache, and fetch again. Forces reads/writes to bypass cache and go directly to memory. Useful for memory-mapped I/O or when cache coherency is required. ### `WRITE_BACK` `alias WRITE_BACK = CacheOperation(5)` Write back at all coherent levels. Updates all cache levels and eventually writes to memory. Most efficient for multiple writes to same location. ### `WRITE_THROUGH` `alias WRITE_THROUGH = CacheOperation(6)` Write through to system memory. Immediately writes updates to memory while updating cache. Provides stronger consistency but lower performance than write-back. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Tests if two CacheOperation instances are equal. **Args:** * ​other (`Self`): The CacheOperation to compare against. **Returns:** True if the operations are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Tests if two CacheOperation instances are not equal. **Args:** * ​other (`Self`): The CacheOperation to compare against. **Returns:** True if the operations are not equal, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Tests if two CacheOperation instances are identical. **Args:** * ​other (`Self`): The CacheOperation to compare against. **Returns:** True if the operations are identical, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Tests if two CacheOperation instances are not identical. **Args:** * ​other (`Self`): The CacheOperation to compare against. **Returns:** True if the operations are not identical, False otherwise. ### `mnemonic` `mnemonic(self) -> StringSlice[StaticConstantOrigin]` Returns the PTX mnemonic string for this cache operation. Converts the cache operation into its corresponding PTX assembly mnemonic string used in GPU instructions. **Returns:** A string literal containing the PTX mnemonic for this operation. --- ## Consistency `@register_passable(trivial)` `struct Consistency` Represents memory consistency models for GPU memory operations. This struct defines different memory consistency levels that control how memory operations are ordered and synchronized between threads. The consistency model affects both performance and correctness of parallel algorithms. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ACQUIRE` `alias ACQUIRE = Consistency(2)` Acquire consistency for synchronization operations. Ensures all subsequent memory operations are ordered after this operation. Used in producer-consumer patterns. ### `RELAXED` `alias RELAXED = Consistency(1)` Relaxed consistency with basic ordering guarantees. Provides some ordering guarantees while still allowing optimizations. Suitable for operations that don't require strict ordering. ### `RELEASE` `alias RELEASE = Consistency(3)` Release consistency for synchronization operations. Ensures all previous memory operations are ordered before this operation. Paired with acquire operations for synchronization. ### `WEAK` `alias WEAK = Consistency(0)` Weakest consistency model with minimal ordering guarantees. Provides maximum flexibility for hardware/compiler optimizations but requires careful synchronization by the programmer. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Tests if two Consistency instances are equal. 
**Args:** * ​other (`Self`): The Consistency instance to compare against. **Returns:** True if the consistency levels are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Tests if two Consistency instances are not equal. **Args:** * ​other (`Self`): The Consistency instance to compare against. **Returns:** True if the consistency levels are different, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Tests if two Consistency instances are identical. **Args:** * ​other (`Self`): The Consistency instance to compare against. **Returns:** True if the consistency levels are identical, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Tests if two Consistency instances are not identical. **Args:** * ​other (`Self`): The Consistency instance to compare against. **Returns:** True if the consistency levels are not identical, False otherwise. ### `__str__` `__str__(self) -> String` Returns a string representation of the consistency level. **Returns:** A string describing the consistency level. ### `mnemonic` `mnemonic(self) -> StringSlice[StaticConstantOrigin]` Returns the mnemonic string for the consistency level. **Returns:** A string literal containing the consistency level mnemonic. --- ## Fill `@register_passable(trivial)` `struct Fill` Represents memory fill patterns for GPU memory operations. This struct defines different fill patterns that can be used when allocating or initializing GPU memory. The patterns control how memory is initialized, which can be important for debugging and performance optimization. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `NAN` `alias NAN = Fill(2)` Fill memory with NaN values. Useful for debugging floating point computations. ### `NONE` `alias NONE = Fill(0)` No fill pattern - memory is left uninitialized. ### `ZERO` `alias ZERO = Fill(1)` Fill memory with zeros. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Tests if two Fill instances have the same fill pattern. **Args:** * ​other (`Self`): The Fill instance to compare against. **Returns:** True if the fill patterns are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Tests if two Fill instances have different fill patterns. **Args:** * ​other (`Self`): The Fill instance to compare against. **Returns:** True if the fill patterns are different, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Tests if two Fill instances are identical. **Args:** * ​other (`Self`): The Fill instance to compare against. **Returns:** True if the fill patterns are identical, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Tests if two Fill instances are not identical. **Args:** * ​other (`Self`): The Fill instance to compare against. **Returns:** True if the fill patterns are not identical, False otherwise. ### `__str__` `__str__(self) -> String` Returns a string representation of the fill pattern. Converts the fill pattern into a human-readable string for debugging and display purposes. **Returns:** A string describing the fill pattern. --- ## ReduceOp `@register_passable(trivial)` `struct ReduceOp` Represents reduction operations for parallel reduction algorithms. This struct defines different reduction operations that can be performed across multiple threads in parallel. These operations are commonly used in parallel reduction algorithms on GPUs. 
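For illustration, here is a minimal, hedged sketch of working with `ReduceOp` values (the `main` wrapper is just for demonstration; `ReduceOp` is typically supplied as a compile-time parameter to intrinsics such as `multimem_ld_reduce` and `cp_async_bulk_tensor_reduce`, documented later in this module):

```mojo
from gpu.memory import ReduceOp

fn main():
    # ReduceOp values behave like enum members and compare with `is`.
    var op = ReduceOp.MAX
    if op is ReduceOp.MAX:
        # mnemonic() yields the operation's instruction mnemonic (e.g. "max").
        print(op.mnemonic())
```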
## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ADD` `alias ADD = ReduceOp(0)` Addition reduction operation. Combines values by adding them together. ### `AND` `alias AND = ReduceOp(3)` Bitwise AND reduction operation. Performs bitwise AND across all inputs. ### `MAX` `alias MAX = ReduceOp(2)` Maximum reduction operation. Finds the maximum value across all inputs. ### `MIN` `alias MIN = ReduceOp(1)` Minimum reduction operation. Finds the minimum value across all inputs. ### `OR` `alias OR = ReduceOp(4)` Bitwise OR reduction operation. Performs bitwise OR across all inputs. ### `XOR` `alias XOR = ReduceOp(5)` Bitwise XOR reduction operation. Performs bitwise XOR across all inputs. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Tests if two ReduceOp instances are equal. **Args:** * ​other (`Self`): The ReduceOp instance to compare against. **Returns:** True if the reduction operations are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Tests if two ReduceOp instances are not equal. **Args:** * ​other (`Self`): The ReduceOp instance to compare against. **Returns:** True if the reduction operations are different, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Tests if two ReduceOp instances are identical. **Args:** * ​other (`Self`): The ReduceOp instance to compare against. **Returns:** True if the reduction operations are identical, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Tests if two ReduceOp instances are not identical. **Args:** * ​other (`Self`): The ReduceOp instance to compare against. **Returns:** True if the reduction operations are not identical, False otherwise. ### `__str__` `__str__(self) -> String` Returns a string representation of the reduction operation. **Returns:** A string describing the reduction operation. ### `mnemonic` `mnemonic(self) -> StringSlice[StaticConstantOrigin]` Returns the mnemonic string for the reduction operation. **Returns:** A string literal containing the reduction operation mnemonic. --- ## async_copy `async_copy[type: DType, //, size: Int, *, fill: OptionalReg[SIMD[type, 1]] = OptionalReg[SIMD[type, 1]]({:i1 0, 1}), bypass_L1_16B: Bool = True, l2_prefetch: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), eviction_policy: CacheEviction = CacheEviction(0)](src: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)], dst: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(3)], src_size: SIMD[int32, 1] = __init__[__mlir_type.!pop.int_literal](0), predicate: Bool = False)` Asynchronously copies data from global memory to shared memory. This function provides a high-performance asynchronous memory copy operation with configurable caching behavior, prefetching, and fill values. It maps directly to the PTX cp.async instruction on NVIDIA GPUs. **Constraints:** * Fill value is only supported for certain element types and sizes. **Parameters:** * ​type (`DType`): The data type to copy (e.g. float32, int32). * ​size (`Int`): Number of bytes to copy (must be 4, 8, or 16). * ​fill (`OptionalReg[SIMD[type, 1]]`): Optional fill value for uncopied bytes when `src_size` is less than `size`. * ​bypass\_L1\_16B (`Bool`): If True, bypasses L1 cache for 16-byte copies. * ​l2\_prefetch (`OptionalReg[Int]`): Optional L2 prefetch size (64, 128, or 256 bytes). * ​eviction\_policy (`CacheEviction`): Cache eviction policy for the copy operation. **Args:** * ​src (`UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)]`): Source pointer in global memory.
* ​dst (`UnsafePointer[SIMD[type, 1], address_space=AddressSpace(3)]`): Destination pointer in shared memory. * ​src\_size (`SIMD[int32, 1]`): Actual bytes to copy from src (remaining bytes use fill value). * ​predicate (`Bool`): Optional predicate to conditionally execute the copy. --- ## async_copy_commit_group `async_copy_commit_group()` Commits all prior initiated but uncommitted cp.async instructions into a cp.async-group. This function creates a new cp.async-group containing all previously initiated but uncommitted asynchronous copy operations. The group can then be waited on using `async_copy_wait_group()`. Notes: * Only supported on NVIDIA GPUs. * Maps to the cp.async.commit.group PTX instruction. * Used for managing asynchronous memory transfers. * Should be paired with `async_copy_wait_group()` or `async_copy_wait_all()`. --- ## async_copy_wait_all `async_copy_wait_all()` Waits for completion of all committed cp.async-groups. This function blocks execution until all previously committed cp.async-groups have completed their memory transfers. It provides a barrier to ensure all asynchronous copies are finished. Notes: * Only supported on NVIDIA GPUs. * Maps to the cp.async.wait.all PTX instruction. * Ensures all outstanding asynchronous transfers are complete. * More coarse-grained than `async_copy_wait_group()`. --- ## async_copy_wait_group `async_copy_wait_group(n: SIMD[int32, 1])` Waits for the completion of `n` most recently committed cp.async-groups. This function blocks execution until the specified number of previously committed cp.async-groups have completed their memory transfers. Notes: * Only supported on NVIDIA GPUs. * Maps to the cp.async.wait.group PTX instruction. * Provides fine-grained control over asynchronous transfer synchronization. * Can be used to implement a pipeline of asynchronous transfers. **Args:** * ​n (`SIMD[int32, 1]`): The number of pending cp.async-groups to wait for. Must be > 0. --- ## cp_async_bulk_tensor_global_shared_cta `cp_async_bulk_tensor_global_shared_cta[src_type: AnyType, rank: Int, /, eviction_policy: CacheEviction = CacheEviction(0)](src_mem: UnsafePointer[src_type, address_space=AddressSpace(3)], tma_descriptor: UnsafePointer[NoneType], coords: IndexList[rank])` Initiates an asynchronous copy operation to transfer tensor data from shared CTA memory to global memory using NVIDIA's Tensor Memory Access (TMA) mechanism. This function provides an efficient way to write data back from shared memory to global memory using TMA. It supports both rank-1 and rank-2 tensors and allows control over cache eviction policy. Notes: * This operation is asynchronous - use appropriate memory barriers to ensure completion. * Only supports rank-1 and rank-2 tensors. * Requires NVIDIA GPU with TMA support. * The source memory must be properly aligned for TMA operations. * The TMA descriptor must be properly initialized before use. **Parameters:** * ​src\_type (`AnyType`): The data type of the source tensor elements. * ​rank (`Int`): The dimensionality of the tensor (must be 1 or 2). * ​eviction\_policy (`CacheEviction`): Optional cache eviction policy that controls how the data is handled in the cache hierarchy. Defaults to EVICT\_NORMAL. **Args:** * ​src\_mem (`UnsafePointer[src_type, address_space=AddressSpace(3)]`): Pointer to the source data in shared memory that will be copied to global memory. Must be properly aligned according to TMA requirements.
* ​tma\_descriptor (`UnsafePointer[NoneType]`): Pointer to the TMA descriptor containing metadata about tensor layout and memory access patterns. * ​coords (`IndexList[rank]`): Coordinates specifying which tile of the tensor to copy. For rank-1 tensors, this is a single coordinate. For rank-2 tensors, this contains both row and column coordinates. --- ## cp_async_bulk_tensor_reduce `cp_async_bulk_tensor_reduce[src_type: AnyType, rank: Int, /, *, reduction_kind: ReduceOp, eviction_policy: CacheEviction = CacheEviction(0)](src_mem: UnsafePointer[src_type, address_space=AddressSpace(3)], tma_descriptor: UnsafePointer[NoneType], coords: IndexList[rank])` Initiates an asynchronous reduction operation between shared CTA memory and global memory using NVIDIA's Tensor Memory Access (TMA) mechanism. This function performs an in-place reduction operation, combining data from shared memory with data in global memory using the specified reduction operation. The operation is performed asynchronously and uses TMA's tile mode for efficient memory access. Notes: * This operation is asynchronous - use appropriate memory barriers to ensure completion. * Only supports rank-1 and rank-2 tensors. * Requires NVIDIA GPU with TMA support. * The source memory must be properly aligned for TMA operations. * The TMA descriptor must be properly initialized before use. * The reduction operation is performed atomically to ensure correctness. **Parameters:** * ​src\_type (`AnyType`): The data type of the source tensor elements. * ​rank (`Int`): The dimensionality of the tensor (must be 1 or 2). * ​reduction\_kind (`ReduceOp`): The type of reduction operation to perform. Supported operations are: "add", "min", "max", "inc", "dec", "and", "or", "xor". * ​eviction\_policy (`CacheEviction`): Optional cache eviction policy that controls how the data is handled in the cache hierarchy. Defaults to `EVICT_NORMAL`. **Args:** * ​src\_mem (`UnsafePointer[src_type, address_space=AddressSpace(3)]`): Pointer to the source data in shared memory that will be reduced with the global memory data. Must be properly aligned according to TMA requirements. * ​tma\_descriptor (`UnsafePointer[NoneType]`): Pointer to the TMA descriptor containing metadata about tensor layout and memory access patterns. * ​coords (`IndexList[rank]`): Coordinates specifying which tile of the tensor to operate on. For rank-1 tensors, this is a single coordinate. For rank-2 tensors, this contains both row and column coordinates. --- ## cp_async_bulk_tensor_shared_cluster_global `cp_async_bulk_tensor_shared_cluster_global[dst_type: AnyType, mbr_type: AnyType, rank: Int, /, *, cta_group: Int = 1](dst_mem: UnsafePointer[dst_type, address_space=AddressSpace(3)], tma_descriptor: UnsafePointer[NoneType], mem_bar: UnsafePointer[mbr_type, address_space=AddressSpace(3)], coords: IndexList[rank])` Initiates an asynchronous bulk copy operation of tensor data from global memory to shared memory. This function performs an asynchronous copy of tensor data using NVIDIA's Tensor Memory Access (TMA) mechanism. It supports both rank-1 and rank-2 tensors and uses cluster-level synchronization for efficient data movement. Notes: * This operation is asynchronous - use appropriate memory barriers to ensure copy completion. * Only supports rank-1 and rank-2 tensors. * Requires NVIDIA GPU with TMA support. * The memory barrier should be properly initialized before use. **Parameters:** * ​dst\_type (`AnyType`): The data type of the destination memory. 
* ​mbr\_type (`AnyType`): The data type of the memory barrier. * ​rank (`Int`): The dimensionality of the tensor (1, 2, or 3). * ​cta\_group (`Int`): The CTA group to use for the copy operation. Must be 1 or 2. **Args:** * ​dst\_mem (`UnsafePointer[dst_type, address_space=AddressSpace(3)]`): Pointer to the destination in shared memory where the tensor data will be copied. Must be properly aligned according to TMA requirements. * ​tma\_descriptor (`UnsafePointer[NoneType]`): Pointer to the TMA descriptor that contains metadata about the tensor layout and memory access patterns. * ​mem\_bar (`UnsafePointer[mbr_type, address_space=AddressSpace(3)]`): Pointer to a shared memory barrier used for synchronizing the asynchronous copy operation across threads in the cluster. * ​coords (`IndexList[rank]`): Coordinates specifying which tile of the tensor to copy. For rank-1 tensors, this is a single coordinate. For rank-2 tensors, this contains both row and column coordinates. --- ## cp_async_bulk_tensor_shared_cluster_global_multicast `cp_async_bulk_tensor_shared_cluster_global_multicast[dst_type: AnyType, mbr_type: AnyType, rank: Int, /, *, cta_group: Int = 1](dst_mem: UnsafePointer[dst_type, address_space=AddressSpace(3)], tma_descriptor: UnsafePointer[NoneType], mem_bar: UnsafePointer[mbr_type, address_space=AddressSpace(3)], coords: IndexList[rank], multicast_mask: SIMD[uint16, 1])` Initiates an asynchronous multicast load operation using NVIDIA's Tensor Memory Access (TMA) to copy tensor data from global memory to shared memories of multiple CTAs in a cluster. This function performs an optimized multicast copy operation where a single global memory read can be distributed to multiple CTAs' shared memories simultaneously, reducing memory bandwidth usage. It supports both rank-1 and rank-2 tensors and uses cluster-level synchronization. Notes: * This operation is asynchronous - use appropriate memory barriers to ensure copy completion. * Only supports rank-1 and rank-2 tensors. * Requires NVIDIA GPU with TMA support. * The memory barrier should be properly initialized before use. * The multicast\_mask must be properly configured based on cluster size and desired distribution. **Parameters:** * ​dst\_type (`AnyType`): The data type of the destination tensor elements. * ​mbr\_type (`AnyType`): The data type of the memory barrier. * ​rank (`Int`): The dimensionality of the tensor (must be 1 or 2). * ​cta\_group (`Int`): The CTA group to use for the copy operation. Must be 1 or 2. **Args:** * ​dst\_mem (`UnsafePointer[dst_type, address_space=AddressSpace(3)]`): Pointer to the destination in shared memory where the tensor data will be copied. Must be properly aligned according to TMA requirements. * ​tma\_descriptor (`UnsafePointer[NoneType]`): Pointer to the TMA descriptor containing metadata about tensor layout and memory access patterns. * ​mem\_bar (`UnsafePointer[mbr_type, address_space=AddressSpace(3)]`): Pointer to a shared memory barrier used for synchronizing the asynchronous copy operation across threads in the cluster. * ​coords (`IndexList[rank]`): Coordinates specifying which tile of the tensor to copy. For rank-1 tensors, this is a single coordinate. For rank-2 tensors, this contains both row and column coordinates. * ​multicast\_mask (`SIMD[uint16, 1]`): A 16-bit bitmask where each bit corresponds to a CTA in the cluster. Set bits indicate which CTAs will receive a copy of the loaded data. This enables efficient data sharing across multiple CTAs. 
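Putting these pieces together, the following is a minimal sketch of issuing a single rank-2 TMA tile load. It is illustrative only: the helper name `load_tile_async` and the `Int64` barrier element type are assumptions, and the TMA descriptor and shared-memory barrier must already have been created and initialized by the surrounding kernel machinery.

```mojo
from gpu.memory import AddressSpace, cp_async_bulk_tensor_shared_cluster_global
from memory import UnsafePointer
from utils import IndexList

fn load_tile_async[
    dtype: DType
](
    dst: UnsafePointer[Scalar[dtype], address_space = AddressSpace.SHARED],
    tma_descriptor: UnsafePointer[NoneType],
    mem_bar: UnsafePointer[Int64, address_space = AddressSpace.SHARED],
    row: Int,
    col: Int,
):
    # Issue the asynchronous TMA copy of the (row, col) tile into shared
    # memory. Completion is signaled through `mem_bar`; the caller must
    # wait on the barrier before reading `dst`.
    cp_async_bulk_tensor_shared_cluster_global(
        dst, tma_descriptor, mem_bar, IndexList[2](row, col)
    )
```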
--- ## external_memory `external_memory[type: AnyTrivialRegType, *, address_space: AddressSpace, alignment: Int, name: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("extern_ptr_syml")]() -> UnsafePointer[type, address_space=address_space, alignment=alignment]` Gets a pointer to dynamically allocated external memory. This function returns a pointer to external memory that can be used for dynamic shared memory allocations in GPU kernels. The memory is allocated in the specified address space with the given alignment requirements. Note: * The memory is not initialized and must be explicitly written before reading. * The allocation size is determined at kernel launch time. * The pointer is only valid within the GPU kernel execution context. * Care must be taken to respect alignment requirements when accessing the memory. **Parameters:** * ​type (`AnyTrivialRegType`): The type of elements stored in the memory. Must be a trivial register type. * ​address\_space (`AddressSpace`): The memory address space to allocate in (e.g. shared, global). * ​alignment (`Int`): The minimum alignment requirement in bytes for the allocated memory. * ​name (`StringSlice[StaticConstantOrigin]`): Optional symbolic name for the external memory allocation. Defaults to "extern\_ptr\_syml". **Returns:** A properly aligned pointer to the allocated external memory in the specified address space. --- ## fence_mbarrier_init `fence_mbarrier_init()` Creates a memory fence after mbarrier initialization. This function establishes a memory barrier that ensures the proper initialization of memory barriers (mbarrier) before they are used. It guarantees that the mbarrier initialization is complete and visible to all threads before subsequent operations. Note: Should be called immediately after mbarrier initialization to ensure proper synchronization semantics. --- ## fence_proxy_tensormap_generic_sys_acquire `fence_proxy_tensormap_generic_sys_acquire[type: AnyType](ptr: UnsafePointer[type, alignment=alignment, mut=mut, origin=origin], size: SIMD[int32, 1])` Acquires a system-wide memory fence for tensor map operations. This function establishes a memory fence that ensures proper synchronization between tensor map operations and system memory. It guarantees that all previous memory operations are completed before subsequent tensor map accesses. Note: This is a low-level synchronization primitive typically used in conjunction with TMA (Tensor Memory Access) operations on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The data type of the tensor map object being synchronized. **Args:** * ​ptr (`UnsafePointer[type, alignment=alignment, mut=mut, origin=origin]`): Pointer to the tensor map object in system memory that needs to be synchronized. * ​size (`SIMD[int32, 1]`): The size in bytes of the tensor map object being synchronized. --- ## fence_proxy_tensormap_generic_sys_release `fence_proxy_tensormap_generic_sys_release()` Releases the system-wide memory fence for tensor map operations. This function releases the memory fence previously established by the acquire operation. It ensures that all tensor map operations are completed and visible to the system before proceeding. Note: Should be called after tensor map operations are complete to maintain proper memory ordering semantics. --- ## memory This module provides GPU memory operations and utilities. 
The module implements low-level memory operations for GPU programming, with a focus on: * Memory address space abstractions (global, shared, constant) * Cache control operations and policies * Memory access patterns and optimizations * Memory alignment and pointer manipulation It provides a unified interface for memory operations across different GPU architectures, with specialized implementations for NVIDIA and AMD GPUs where needed. The module is designed for performance-critical code and requires careful usage to achieve optimal memory access patterns and cache utilization. ## Aliases ### `AddressSpace` `alias AddressSpace = _GPUAddressSpace` ## Structs * [​`CacheEviction`](/mojo/stdlib/gpu/memory/CacheEviction): Represents cache eviction policies for GPU memory operations. * [​`CacheOperation`](/mojo/stdlib/gpu/memory/CacheOperation): Represents different GPU cache operation policies. * [​`Consistency`](/mojo/stdlib/gpu/memory/Consistency): Represents memory consistency models for GPU memory operations. * [​`Fill`](/mojo/stdlib/gpu/memory/Fill): Represents memory fill patterns for GPU memory operations. * [​`ReduceOp`](/mojo/stdlib/gpu/memory/ReduceOp): Represents reduction operations for parallel reduction algorithms. ## Functions * [​`async_copy`](/mojo/stdlib/gpu/memory/async_copy): Asynchronously copies data from global memory to shared memory. * [​`async_copy_commit_group`](/mojo/stdlib/gpu/memory/async_copy_commit_group): Commits all prior initiated but uncommitted cp.async instructions into a cp.async-group. * [​`async_copy_wait_all`](/mojo/stdlib/gpu/memory/async_copy_wait_all): Waits for completion of all committed cp.async-groups. * [​`async_copy_wait_group`](/mojo/stdlib/gpu/memory/async_copy_wait_group): Waits for the completion of `n` most recently committed cp.async-groups. * [​`cp_async_bulk_tensor_global_shared_cta`](/mojo/stdlib/gpu/memory/cp_async_bulk_tensor_global_shared_cta): Initiates an asynchronous copy operation to transfer tensor data from shared CTA memory to global memory using NVIDIA's Tensor Memory Access (TMA) mechanism. * [​`cp_async_bulk_tensor_reduce`](/mojo/stdlib/gpu/memory/cp_async_bulk_tensor_reduce): Initiates an asynchronous reduction operation between shared CTA memory and global memory using NVIDIA's Tensor Memory Access (TMA) mechanism. * [​`cp_async_bulk_tensor_shared_cluster_global`](/mojo/stdlib/gpu/memory/cp_async_bulk_tensor_shared_cluster_global): Initiates an asynchronous bulk copy operation of tensor data from global memory to shared memory. * [​`cp_async_bulk_tensor_shared_cluster_global_multicast`](/mojo/stdlib/gpu/memory/cp_async_bulk_tensor_shared_cluster_global_multicast): Initiates an asynchronous multicast load operation using NVIDIA's Tensor Memory Access (TMA) to copy tensor data from global memory to shared memories of multiple CTAs in a cluster. * [​`external_memory`](/mojo/stdlib/gpu/memory/external_memory): Gets a pointer to dynamically allocated external memory. * [​`fence_mbarrier_init`](/mojo/stdlib/gpu/memory/fence_mbarrier_init): Creates a memory fence after mbarrier initialization. * [​`fence_proxy_tensormap_generic_sys_acquire`](/mojo/stdlib/gpu/memory/fence_proxy_tensormap_generic_sys_acquire): Acquires a system-wide memory fence for tensor map operations. * [​`fence_proxy_tensormap_generic_sys_release`](/mojo/stdlib/gpu/memory/fence_proxy_tensormap_generic_sys_release): Releases the system-wide memory fence for tensor map operations. 
* [​`load`](/mojo/stdlib/gpu/memory/load): Loads data from global memory into a SIMD vector. * [​`multimem_ld_reduce`](/mojo/stdlib/gpu/memory/multimem_ld_reduce): Performs a vectorized load-reduce operation using NVIDIA's multimem feature. * [​`multimem_st`](/mojo/stdlib/gpu/memory/multimem_st): Stages an inline multimem.st instruction. * [​`tma_store_fence`](/mojo/stdlib/gpu/memory/tma_store_fence): Establishes a memory fence for shared memory stores in TMA operations. --- ## load `load[type: DType, //, width: Int = 1, *, read_only: Bool = False, prefetch_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), cache_policy: CacheOperation = CacheOperation(0), eviction_policy: CacheEviction = CacheEviction(0), alignment: Int = alignof[::AnyType,__mlir_type.!kgen.target]() if is_nvidia_gpu() else 1](ptr: UnsafePointer[SIMD[type, 1]]) -> SIMD[type, width]` Loads data from global memory into a SIMD vector. Provides a high-level interface for vectorized memory loads with configurable cache behavior and memory access patterns. **Parameters:** * ​type (`DType`): The data type to load. * ​width (`Int`): Vector width (number of elements to load). * ​read\_only (`Bool`): If True, marks the load as read-only for cache optimization. * ​prefetch\_size (`OptionalReg[Int]`): Optional L2 cache prefetch size (64, 128, or 256 bytes). * ​cache\_policy (`CacheOperation`): Cache operation policy for the load. * ​eviction\_policy (`CacheEviction`): Cache eviction policy. * ​alignment (`Int`): Memory alignment in bytes. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1]]`): Pointer to global memory to load from. **Returns:** SIMD vector containing the loaded data. `load[OffsetType: Indexer, type: DType, //, width: Int = 1, *, read_only: Bool = False, prefetch_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), cache_policy: CacheOperation = CacheOperation(0), eviction_policy: CacheEviction = CacheEviction(0), alignment: Int = alignof[::AnyType,__mlir_type.!kgen.target]() if is_nvidia_gpu() else 1](ptr: UnsafePointer[SIMD[type, 1]], offset: OffsetType) -> SIMD[type, width]` Loads data from global memory with an offset into a SIMD vector. Provides a high-level interface for vectorized memory loads with configurable cache behavior and memory access patterns, supporting offset-based addressing. **Parameters:** * ​OffsetType (`Indexer`): Type of the offset value. * ​type (`DType`): The data type to load. * ​width (`Int`): Vector width (number of elements to load). * ​read\_only (`Bool`): If True, marks the load as read-only for cache optimization. * ​prefetch\_size (`OptionalReg[Int]`): Optional L2 cache prefetch size (64, 128, or 256 bytes). * ​cache\_policy (`CacheOperation`): Cache operation policy for the load. * ​eviction\_policy (`CacheEviction`): Cache eviction policy. * ​alignment (`Int`): Memory alignment in bytes. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1]]`): Base pointer to global memory. * ​offset (`OffsetType`): Offset from base pointer in elements. **Returns:** SIMD vector containing the loaded data. --- ## multimem_ld_reduce `multimem_ld_reduce[type: DType, *, count: Int, reduction: ReduceOp, scope: Scope, consistency: Consistency, accum_type: DType = get_accum_type[::DType,::DType](), output_width: Int = 1](addr: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)]) -> StaticTuple[SIMD[accum_type, output_width], count]` Performs a vectorized load-reduce operation using NVIDIA's multimem feature. 
This function loads multiple values from global memory and performs a reduction operation across them in a single instruction. It utilizes NVIDIA's multimem feature available on SM90+ GPUs for improved performance. **Constraints:** * Only supported on SM90+ GPUs. * Count must be 2 or 4. * Type must be float32, float16, or bfloat16. **Parameters:** * ​type (`DType`): Data type for the operation (float32, float16, or bfloat16). * ​count (`Int`): Number of elements to load and reduce (2 or 4). * ​reduction (`ReduceOp`): Type of reduction operation to perform. * ​scope (`Scope`): Memory scope for the operation. * ​consistency (`Consistency`): Memory consistency model to use. * ​accum\_type (`DType`): Data type used for accumulation. Defaults to a wider type than input (e.g. float32 for float16 inputs) to maintain precision during reduction. * ​output\_width (`Int`): Width of each output SIMD vector (default 1). **Args:** * ​addr (`UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)]`): Pointer to global memory where data will be loaded from. **Returns:** A StaticTuple containing 'count' SIMD vectors of width 'output\_width' holding the results of the load-reduce operation. --- ## multimem_st `multimem_st[type: DType, *, count: Int, scope: Scope, consistency: Consistency, width: Int = 1](addr: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)], values: StaticTuple[SIMD[type, width], count])` Stages an inline multimem.st instruction. This operation performs a store to all memory locations pointed to by the multimem address using the specified memory consistency model and scope. Notes: * Requires SM90+ GPU architecture (PTX ISA 8.1+). * The address must be a valid multimem address. * Supported type-width combinations must total 32/64/128 bits. * Default memory semantics: weak consistency (when not specified). * Vector stores (.v2/.v4) require matching total size constraints. Example:

```mojo
from gpu.intrinsics import Scope
from gpu.memory import Consistency, multimem_st
from utils import StaticTuple

# `addr` is assumed to be a valid multimem address in global memory;
# `val1`, `val2` and `vec1`..`vec4` are placeholder SIMD values.

# Store 2 float32 values to the multimem address (block/CTA scope).
multimem_st[DType.float32, count=2, scope=Scope.BLOCK, consistency=Consistency.RELAXED](
    addr, StaticTuple[SIMD[DType.float32, 1], 2](val1, val2)
)

# Vector store of 4 float16x2 values.
multimem_st[DType.float16, count=4, scope=Scope.CLUSTER, consistency=Consistency.RELEASE, width=2](
    addr, StaticTuple[SIMD[DType.float16, 2], 4](vec1, vec2, vec3, vec4)
)
```

See Also: [PTX ISA Documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-multimem-ld-reduce-multimem-st-multimem-red). **Parameters:** * ​type (`DType`): The data type of elements to store (must be float16, bfloat16, or float32). * ​count (`Int`): Number of vector elements per store operation (2 or 4). * ​scope (`Scope`): Memory scope for visibility of the store operation (CTA/Cluster/GPU/System). * ​consistency (`Consistency`): Memory consistency semantics (weak/relaxed/release). * ​width (`Int`): Vector width modifier for packed data types (default 1). **Args:** * ​addr (`UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)]`): Multimem address in global address space pointing to multiple locations. * ​values (`StaticTuple[SIMD[type, width], count]`): Packed SIMD values to store, with count matching the template parameter. --- ## tma_store_fence `tma_store_fence()` Establishes a memory fence for shared memory stores in TMA operations. This function creates a memory barrier that ensures all previous shared memory stores are completed before subsequent TMA (Tensor Memory Access) store operations begin.
This is crucial for maintaining memory consistency in tensor operations. Note: This fence specifically targets the CTA (Cooperative Thread Array) scope and is used to synchronize async shared memory operations. --- ## WGMMADescriptor `@register_passable(trivial)` `struct WGMMADescriptor[dtype: DType]` Descriptor for shared memory operands used in warp group matrix multiply operations. This struct represents a descriptor that encodes information about shared memory layout and access patterns for warp group matrix multiply operations. The descriptor contains the following bit fields: * Start address (14 bits): Base address in shared memory. * Leading byte offset (14 bits): Leading dimension stride in bytes. * Stride byte offset (14 bits): Stride dimension offset in bytes. * Base offset (3 bits): Additional offset. * Swizzle mode (2 bits): Memory access pattern. The bit layout is:

```
+----------+--------+------------+--------+------------+--------+--------+---------+--------+
|   0-13   | 14-15  |   16-29    | 30-31  |   32-45    | 46-48  | 49-51  |  52-61  | 62-63  |
+----------+--------+------------+--------+------------+--------+--------+---------+--------+
| 14 bits  | 2 bits |  14 bits   | 2 bits |  14 bits   | 3 bits | 3 bits | 10 bits | 2 bits |
+----------+--------+------------+--------+------------+--------+--------+---------+--------+
| BaseAddr |   0    | LeadingDim |   0    |   Stride   |   0    | Offst  |    0    | Swzle  |
+----------+--------+------------+--------+------------+--------+--------+---------+--------+
```

## Parameters * ​dtype (`DType`): The data type of the shared memory operand. This affects memory alignment and access patterns for the descriptor. ## Fields * ​desc (`SIMD[int64, 1]`): The 64-bit descriptor value that encodes shared memory layout information. This field stores the complete descriptor with all bit fields packed into a single 64-bit integer: * Bits 0-13: Base address in shared memory (14 bits) * Bits 16-29: Leading dimension stride in bytes (14 bits) * Bits 32-45: Stride dimension offset in bytes (14 bits) * Bits 49-51: Base offset (3 bits) * Bits 62-63: Swizzle mode for memory access pattern (2 bits) The descriptor is used by NVIDIA Hopper architecture's warp group matrix multiply instructions to efficiently access shared memory with the appropriate layout and access patterns. ## Implemented traits `AnyType`, `Copyable`, `MMAOperandDescriptor`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(val: SIMD[int64, 1]) -> Self` Initialize descriptor with raw 64-bit value. This constructor allows creating a descriptor directly from a 64-bit integer that already contains the properly formatted bit fields for the descriptor. The implicit attribute enables automatic conversion from `Int64` to `WGMMADescriptor`. **Args:** * ​val (`SIMD[int64, 1]`): A 64-bit integer containing the complete descriptor bit layout. ### `__add__` `__add__(self, offset: Int) -> Self` Add offset to descriptor's base address. **Args:** * ​offset (`Int`): Byte offset to add to base address. **Returns:** New descriptor with updated base address. ### `__iadd__` `__iadd__(mut self, offset: Int)` Add offset to descriptor's base address in-place. **Args:** * ​offset (`Int`): Byte offset to add to base address. ### `create` `static create[stride_byte_offset: Int, leading_byte_offset: Int, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0))](smem_ptr: UnsafePointer[SIMD[dtype, 1], address_space=AddressSpace(3)]) -> Self` Create a descriptor for shared memory operand. **Parameters:** * ​stride\_byte\_offset (`Int`): Stride dimension offset in bytes.
* ​leading\_byte\_offset (`Int`): Leading dimension stride in bytes. * ​swizzle\_mode (`TensorMapSwizzle`): Memory access pattern mode. **Args:** * ​smem\_ptr (`UnsafePointer[SIMD[dtype, 1], address_space=AddressSpace(3)]`): Pointer to shared memory operand. **Returns:** Initialized descriptor for the shared memory operand. --- ## mma This module includes utilities for working with the warp-matrix-matrix-multiplication (wmma) instructions. ## Structs * [​`WGMMADescriptor`](/mojo/stdlib/gpu/mma/WGMMADescriptor): Descriptor for shared memory operands used in warp group matrix multiply operations. ## Functions * [​`ld_matrix`](/mojo/stdlib/gpu/mma/ld_matrix): Loads a matrix from shared memory into registers in a format suitable for tensor core operations. * [​`mma`](/mojo/stdlib/gpu/mma/mma): Performs warp sync Tensor Core based Matrix-multiply and accumulate (MMA) operation. * [​`st_matrix`](/mojo/stdlib/gpu/mma/st_matrix): Performs warp-synchronized copy from registers to shared memory. * [​`wgmma_async`](/mojo/stdlib/gpu/mma/wgmma_async): Performs warp group async Matrix-multiply and accumulate (WGMMA) operation. * [​`wgmma_commit_group_sync`](/mojo/stdlib/gpu/mma/wgmma_commit_group_sync): Commits pending warp group matrix multiply operations. * [​`wgmma_fence_aligned`](/mojo/stdlib/gpu/mma/wgmma_fence_aligned): Inserts a memory fence for warp group matrix multiply operations. * [​`wgmma_wait_group_sync`](/mojo/stdlib/gpu/mma/wgmma_wait_group_sync): Waits for all pending warp group matrix multiply operations to complete. --- ## ld_matrix `ld_matrix[type: DType, //, simd_width: Int, *, transpose: Bool = False](ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> SIMD[type, simd_width]` Loads a matrix from shared memory into registers in a format suitable for tensor core operations. This function performs a warp-synchronized load from shared memory to registers, formatting the data to be directly usable by tensor core Matrix Multiply-Accumulate (MMA) instructions. Note: * All threads in a warp must execute this operation together. * For transposed loads, only half precision (float16) is supported. * The register width is fixed at 4 bytes (32 bits). * Supported configurations: * x1: One 32-bit register per thread. * x2: Two 32-bit registers per thread. * x4: Four 32-bit registers per thread. Example: ```mojo from gpu.mma import ld_matrix # Load 8x8 matrix of float16 values var data = ld_matrix[DType.float16, 8](ptr) # Load transposed matrix var transposed = ld_matrix[DType.float16, 8, transpose=True](ptr) ``` . **Parameters:** * ​type (`DType`): The data type of the matrix elements (e.g. float16, float32). * ​simd\_width (`Int`): The width of the SIMD vector to load. * ​transpose (`Bool`): Whether to transpose the matrix during load (only supported for half precision). **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to shared memory containing the source matrix data. **Returns:** SIMD vector containing the loaded matrix data, properly formatted for MMA operations. --- ## mma `mma[block_size: Int = 1](mut d: SIMD[dtype, size], a: SIMD[dtype, size], b: SIMD[dtype, size], c: SIMD[dtype, size])` Performs warp sync Tensor Core based Matrix-multiply and accumulate (MMA) operation. This function executes a matrix multiply-accumulate operation using GPU Tensor Cores, synchronizing across the warp. 
It dispatches to architecture-specific implementations for NVIDIA and AMD GPUs. The operation performed is: d = (a \* b) + c Supported configurations depend on the GPU architecture: * NVIDIA: Various combinations of FP32, FP16, BF16, and FP8 formats. * AMD: Limited subset of FP32 and FP16 operations. Note: * All threads in a warp must execute this operation together. * Input matrices must be properly loaded and formatted for Tensor Core operations. * Matrix dimensions and data types must match hardware requirements. **Parameters:** * ​block\_size (`Int`): The size of the block of the MMA operation (e.g., 4x4x4\_16B). Applies to AMD GPUs only. **Args:** * ​d (`SIMD[dtype, size]`): Output SIMD vector to store the result. * ​a (`SIMD[dtype, size]`): First input matrix as SIMD vector. * ​b (`SIMD[dtype, size]`): Second input matrix as SIMD vector. * ​c (`SIMD[dtype, size]`): Accumulator matrix as SIMD vector. --- ## st_matrix `st_matrix[dtype: DType, //, simd_width: Int, *, transpose: Bool = False](ptr: UnsafePointer[SIMD[dtype, 1], address_space=AddressSpace(3)], d: SIMD[float32, simd_width])` Performs warp-synchronized copy from registers to shared memory. This function stores data from registers to shared memory in a format that can be directly used by tensor core Matrix Multiply-Accumulate (MMA) instructions. It uses the NVIDIA stmatrix instruction to perform an efficient warp-synchronized store. Note: The function performs a warp-synchronized operation - all threads in the warp must execute this instruction to avoid deadlock. **Constraints:** * Must be used with shared memory pointers. * Number of registers must be 1, 2, or 4. * Data must be properly aligned for matrix operations. * All threads in warp must participate. * Only supported on NVIDIA GPUs with tensor core capabilities. **Parameters:** * ​dtype (`DType`): Data type of elements to store. * ​simd\_width (`Int`): Width of the SIMD vector. * ​transpose (`Bool`): If True, transposes the matrix during store. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=AddressSpace(3)]`): Pointer to shared memory where data will be stored. * ​d (`SIMD[float32, simd_width]`): SIMD vector containing the data to store. --- ## wgmma_async `wgmma_async[m: Int, n: Int, k: Int, c_dtype: DType, width: Int, /, *, a_type: DType, b_type: DType, accum_type: DType = c_dtype, layout_a: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("row"), layout_b: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("col"), scale_d: Int = 1, scale_a: Int = 1, scale_b: Int = 1](mat_a_desc: WGMMADescriptor[dtype], mat_b_desc: WGMMADescriptor[dtype], c_reg: StaticTuple[SIMD[c_dtype, 1], width]) -> StaticTuple[SIMD[c_dtype, 1], width]` Performs warp group async Matrix-multiply and accumulate (WGMMA) operation. This function executes an asynchronous matrix multiplication using warp group MMA instructions. It supports various data types including tensor float32, bfloat16, float16, float8, int8, and uint8. **Constraints:** * The number of output registers must match the instruction shape: `(m * n // 128) * sizeof(accum_type) == width * sizeof(c_dtype)`. * Data type combinations must be compatible with hardware WGMMA instructions. **Parameters:** * ​m (`Int`): Number of rows in matrix A and output matrix. * ​n (`Int`): Number of columns in matrix B and output matrix. * ​k (`Int`): Number of columns in matrix A / rows in matrix B. * ​c\_dtype (`DType`): Data type of the output matrix C.
* ​width (`Int`): Width of the InlineArray register for matrix C. * ​a\_type (`DType`): Data type of matrix A. * ​b\_type (`DType`): Data type of matrix B. * ​accum\_type (`DType`): Accumulation data type (defaults to c\_dtype). * ​layout\_a (`StringSlice[StaticConstantOrigin]`): Memory layout for matrix A ("row" or "col"). * ​layout\_b (`StringSlice[StaticConstantOrigin]`): Memory layout for matrix B ("row" or "col"). * ​scale\_d (`Int`): Scale factor for matrix C. * ​scale\_a (`Int`): Scale factor for matrix A. * ​scale\_b (`Int`): Scale factor for matrix B. **Args:** * ​mat\_a\_desc (`WGMMADescriptor[dtype]`): WGMMA descriptor for matrix A. * ​mat\_b\_desc (`WGMMADescriptor[dtype]`): WGMMA descriptor for matrix B. * ​c\_reg (`StaticTuple[SIMD[c_dtype, 1], width]`): StaticTuple containing matrix C values. **Returns:** `StaticTuple` containing the result of the matrix multiplication. `wgmma_async[m: Int, n: Int, k: Int, c_dtype: DType, width: Int, /, *, a_type: DType, b_type: DType, accum_type: DType = c_dtype, layout_a: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("row"), layout_b: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("col"), scale_d: Int = 1, scale_a: Int = 1, scale_b: Int = 1](mat_a_desc: WGMMADescriptor[dtype], mat_b_desc: WGMMADescriptor[dtype], c_reg: SIMD[c_dtype, width]) -> SIMD[c_dtype, width]` Performs warp group async Matrix-multiply and accumulate (WGMMA) operation. This function executes an asynchronous matrix multiplication using warp group MMA instructions. It supports various data types including tensor float32, bfloat16, float16, float8, int8, and uint8. **Constraints:** * The number of output registers must match the instruction shape: `(m * n // 128) * sizeof(accum_type) == width * sizeof(c_dtype)`. * Data type combinations must be compatible with hardware WGMMA instructions. **Parameters:** * ​m (`Int`): Number of rows in matrix A and output matrix. * ​n (`Int`): Number of columns in matrix B and output matrix. * ​k (`Int`): Number of columns in matrix A / rows in matrix B. * ​c\_dtype (`DType`): Data type of the output matrix C. * ​width (`Int`): Width of the SIMD register for matrix C. * ​a\_type (`DType`): Data type of matrix A. * ​b\_type (`DType`): Data type of matrix B. * ​accum\_type (`DType`): Accumulation data type (defaults to c\_dtype). * ​layout\_a (`StringSlice[StaticConstantOrigin]`): Memory layout for matrix A ("row" or "col"). * ​layout\_b (`StringSlice[StaticConstantOrigin]`): Memory layout for matrix B ("row" or "col"). * ​scale\_d (`Int`): Scale factor for matrix C. * ​scale\_a (`Int`): Scale factor for matrix A. * ​scale\_b (`Int`): Scale factor for matrix B. **Args:** * ​mat\_a\_desc (`WGMMADescriptor[dtype]`): WGMMA descriptor for matrix A. * ​mat\_b\_desc (`WGMMADescriptor[dtype]`): WGMMA descriptor for matrix B. * ​c\_reg (`SIMD[c_dtype, width]`): SIMD register containing matrix C values. **Returns:** SIMD register containing the result of the matrix multiplication. 
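To show how the WGMMA primitives fit together, here is a minimal sketch of a typical issue/commit/wait sequence. The tile shape (m=64, n=64, k=16), the bfloat16 inputs, and the float32 accumulator are illustrative assumptions, not prescribed values, and the shared memory descriptors are taken as inputs since building them is covered under `WGMMADescriptor`.

```mojo
from gpu.mma import WGMMADescriptor, wgmma_async, wgmma_commit_group_sync, wgmma_fence_aligned, wgmma_wait_group_sync


fn wgmma_64x64x16_bf16(
    a_desc: WGMMADescriptor[DType.bfloat16],
    b_desc: WGMMADescriptor[DType.bfloat16],
) -> SIMD[DType.float32, 32]:
    # Accumulator width satisfies the constraint above:
    # (64 * 64 // 128) * sizeof(float32) == 32 * sizeof(float32).
    var c = SIMD[DType.float32, 32](0)
    # Make prior shared memory writes visible to the WGMMA unit.
    wgmma_fence_aligned()
    c = wgmma_async[
        64, 64, 16, DType.float32, 32,
        a_type=DType.bfloat16, b_type=DType.bfloat16,
    ](a_desc, b_desc, c)
    # Commit the issued operation into a group, then wait for all
    # pending groups before reading the accumulator.
    wgmma_commit_group_sync()
    wgmma_wait_group_sync[0]()
    return c
```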
`wgmma_async[m: Int, n: Int, k: Int, a_dtype: DType, c_dtype: DType, frag_a_width: Int, frag_c_width: Int, /, *, a_type: DType, b_type: DType, accum_type: DType = c_dtype, layout_a: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("row"), layout_b: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("col"), scale_d: Int = 1, scale_a: Int = 1, scale_b: Int = 1](mat_a_frag: SIMD[a_dtype, frag_a_width], mat_b_desc: WGMMADescriptor[dtype], c: SIMD[c_dtype, frag_c_width]) -> SIMD[c_dtype, frag_c_width]` Performs warp group async Matrix-multiply and accumulate (WGMMA) operation. Currently only supports: * m=64, k=16. * BF16 input types. * FP32 accumulation. * Row major matrix A. * Column major matrix B (or row major for BF16). **Parameters:** * ​m (`Int`): Number of rows in output matrix. * ​n (`Int`): Number of columns in output matrix. * ​k (`Int`): Inner dimension for matrix multiplication. * ​a\_dtype (`DType`): Data type of matrix A fragment. * ​c\_dtype (`DType`): Data type of output matrix C. * ​frag\_a\_width (`Int`): Width of matrix A fragment. * ​frag\_c\_width (`Int`): Width of output matrix C fragment. * ​a\_type (`DType`): Data type of matrix A. * ​b\_type (`DType`): Data type of matrix B. * ​accum\_type (`DType`): Data type used for accumulation (defaults to c\_dtype). * ​layout\_a (`StringSlice[StaticConstantOrigin]`): Layout of matrix A ("row" or "col", defaults to "row"). * ​layout\_b (`StringSlice[StaticConstantOrigin]`): Layout of matrix B ("row" or "col", defaults to "col"). * ​scale\_d (`Int`): Scale factor for output matrix C (defaults to 1). * ​scale\_a (`Int`): Scale factor for matrix A (defaults to 1). * ​scale\_b (`Int`): Scale factor for matrix B (defaults to 1). **Args:** * ​mat\_a\_frag (`SIMD[a_dtype, frag_a_width]`): Fragment containing matrix A data. * ​mat\_b\_desc (`WGMMADescriptor[dtype]`): Descriptor for matrix B data. * ​c (`SIMD[c_dtype, frag_c_width]`): Fragment containing matrix C data. **Returns:** Updated matrix C fragment after WGMMA operation. --- ## wgmma_commit_group_sync `wgmma_commit_group_sync()` Commits pending warp group matrix multiply operations. This synchronizes the warp group and ensures all WGMMA operations have been committed. Must be called after a sequence of WGMMA operations before accessing results. --- ## wgmma_fence_aligned `wgmma_fence_aligned()` Inserts a memory fence for warp group matrix multiply operations. This ensures all prior shared memory accesses are visible before subsequent WGMMA operations. Must be called before starting a new sequence of WGMMA operations. --- ## wgmma_wait_group_sync `wgmma_wait_group_sync[group: Int = 0]()` Waits for all pending warp group matrix multiply operations to complete. This synchronizes the warp group and ensures all WGMMA operations have finished executing. Must be called after commit and before accessing results. **Parameters:** * ​group (`Int`): The number of most recent WGMMA groups allowed to remain pending. When `group` is 0, waits for all prior WGMMA operations to complete. --- ## MMAOperandDescriptor ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move.
### `__add__` `__add__(self: _Self, offset: Int) -> _Self` --- ## mma_operand_descriptor ## Traits * [​`MMAOperandDescriptor`](/mojo/stdlib/gpu/mma_operand_descriptor/MMAOperandDescriptor): --- ## MMASmemDescriptor `@register_passable(trivial)` `struct MMASmemDescriptor` Descriptor for shared memory operands of tcgen05 mma instructions. This struct represents a descriptor that encodes information about shared memory layout and access patterns for warp group matrix multiply operations. The descriptor contains the following bit fields:

| Bit-field | Size | Description |
| --- | --- | --- |
| 0-13 | 14 | Base address in shared memory |
| 14-15 | 2 | Unused, 0 |
| 16-29 | 14 | LBO: leading dim byte offset |
| 30-31 | 2 | Unused, 0 |
| 32-45 | 14 | SBO: stride dim byte offset |
| 46-48 | 3 | Unused, 0 |
| 49-51 | 3 | Matrix base offset, 0 for canonical layouts |
| 52 | 1 | LBO mode, only matters for 48B K tile |
| 53-60 | 8 | Fixed, 0 |
| 61-63 | 3 | Swizzle mode |

* Start address, LBO, and SBO ignore the 4 LSBs. ## Fields * ​desc (`SIMD[uint64, 1]`): The 64-bit descriptor encodes shared memory operand information. ## Implemented traits `AnyType`, `Copyable`, `MMAOperandDescriptor`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(val: SIMD[uint64, 1]) -> Self` Initialize descriptor with raw 64-bit value. This constructor allows creating a descriptor directly from a 64-bit integer that already contains the properly formatted bit fields for the descriptor. The implicit attribute enables automatic conversion from `UInt64` to `MMASmemDescriptor`. **Args:** * ​val (`SIMD[uint64, 1]`): A 64-bit integer containing the complete descriptor bit layout. ### `__add__` `__add__(self, offset: Int) -> Self` Add offset to descriptor's base address. **Args:** * ​offset (`Int`): Byte offset to add to base address. **Returns:** New descriptor with updated base address. ### `__iadd__` `__iadd__(mut self, offset: Int)` Add offset to descriptor's base address in-place. **Args:** * ​offset (`Int`): Byte offset to add to base address. ### `create` `static create[stride_byte_offset: Int, leading_byte_offset: Int, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0))](smem_ptr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]) -> Self` Create a descriptor for shared memory operand. **Parameters:** * ​stride\_byte\_offset (`Int`): Stride dimension offset in bytes. * ​leading\_byte\_offset (`Int`): Leading dimension stride in bytes. * ​swizzle\_mode (`TensorMapSwizzle`): Memory access pattern mode. **Args:** * ​smem\_ptr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to shared memory operand. **Returns:** Initialized descriptor for the shared memory operand. --- ## UMMAInsDescriptor `@register_passable(trivial)` `struct UMMAInsDescriptor[mma_kind: UMMAKind]` Descriptor for UMMA instructions. This struct represents a descriptor that encodes information about UMMA instructions. The descriptor contains the following bit fields: * Sparsity (2 bits): The sparsity of the input matrices. Currently defaults to dense matrices. * Saturate for integer types (1 bit): Whether to saturate the result for integer types. Currently not supported. * Matrix D type (2 bits): Data type of matrix D. * Matrix A type (3 bits): Data type of matrix A. * Matrix B type (3 bits): Data type of matrix B. * Negate A matrix (1 bit): Whether to negate matrix A. Currently defaults to False.
* Negate B matrix (1 bit): Whether to negate matrix B. Currently defaults to False. * Transpose A (1 bit): Whether to transpose matrix A. * Transpose B (1 bit): Whether to transpose matrix B. * N, Dimension of Matrix B (6 bits): Number of columns in matrix B. 3 LSBs are unused. * M, Dimension of Matrix A (6 bits): Number of rows in matrix A. 3 LSBs are unused. ## Parameters * ​mma\_kind (`UMMAKind`): The kind of UMMA instruction. ## Fields * ​desc (`SIMD[uint32, 1]`): The 32-bit descriptor value that encodes UMMA instruction information. This field stores the complete descriptor with all bit fields packed into a single 32-bit integer: * Bits 0-1: Sparsity selector (2 bits) * Bit 2: Sparsity enable (1 bit) * Bit 3: Saturate for integer types (1 bit) * Bits 4-5: Matrix D type (2 bits) * Bit 6: Reserved (1 bit) * Bits 7-9: Matrix A type (3 bits) * Bits 10-12: Matrix B type (3 bits) * Bit 13: Negate A matrix (1 bit) * Bit 14: Negate B matrix (1 bit) * Bit 15: Transpose A (1 bit) * Bit 16: Transpose B (1 bit) * Bits 17-22: N, Dimension of Matrix B (6 bits) * Bit 23: Reserved (1 bit) * Bits 24-28: M, Dimension of Matrix A (5 bits) * Bit 29: Reserved (1 bit) * Bits 30-31: Maximum shift while attempting B matrix (2 bits) ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(value: SIMD[uint32, 1]) -> Self` Initialize descriptor with raw 32-bit value. This constructor allows creating a descriptor directly from a 32-bit integer that already contains the properly formatted bit fields for the descriptor. **Args:** * ​value (`SIMD[uint32, 1]`): A 32-bit integer containing the complete descriptor bit layout. ### `create` `static create[d_type: DType, a_type: DType, b_type: DType, output_shape: IndexList[2, element_type=uint32], /, *, transpose_a: Bool = False, transpose_b: Bool = True]() -> Self` Create a descriptor for UMMA instructions. This function creates a descriptor for UMMA instructions based on the provided parameters. **Parameters:** * ​d\_type (`DType`): The data type of matrix D. * ​a\_type (`DType`): The data type of matrix A. * ​b\_type (`DType`): The data type of matrix B. * ​output\_shape (`IndexList[2, element_type=uint32]`): The shape of the output matrix. * ​transpose\_a (`Bool`): Whether to transpose matrix A. * ​transpose\_b (`Bool`): Whether to transpose matrix B. **Returns:** A 32-bit integer containing the complete descriptor bit layout. --- ## UMMAKind `@register_passable(trivial)` `struct UMMAKind` Struct for UMMA instruction types. This struct defines the different types of UMMA instructions that are supported by Blackwell. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `KIND_F16` `alias KIND_F16 = UMMAKind(__init__[__mlir_type.!pop.int_literal](2))` f16 type ### `KIND_F8F6F4` `alias KIND_F8F6F4 = UMMAKind(__init__[__mlir_type.!pop.int_literal](3))` f8f6f4 type ### `KIND_I8` `alias KIND_I8 = UMMAKind(__init__[__mlir_type.!pop.int_literal](4))` i8 type ### `KIND_TF32` `alias KIND_TF32 = UMMAKind(__init__[__mlir_type.!pop.int_literal](0))` tf32 type ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Check if two UMMA kinds are equal. **Args:** * ​other (`Self`): The other UMMA kind to compare with. **Returns:** True if the UMMA kinds are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Check if two UMMA kinds are not equal. **Args:** * ​other (`Self`): The other UMMA kind to compare with.
**Returns:** True if the UMMA kinds are not equal, False otherwise. ### `__int__` `__int__(self) -> Int` Convert UMMA kind to an integer value. **Returns:** The integer value representing the UMMA instruction type. ### `__str__` `__str__(self) -> String` Convert UMMA kind to a string, which can be used as the instruction qualifier. **Returns:** The PTX qualifier representation of the UMMA kind. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Write the UMMA kind to a writer. **Parameters:** * ​W (`Writer`): The writer type that will receive the formatted output. **Args:** * ​writer (`W`): The writer to write the UMMA kind to. --- ## mma_sm100 This module includes utilities for working with the SM100 MMA instructions. ## Structs * [​`MMASmemDescriptor`](/mojo/stdlib/gpu/mma_sm100/MMASmemDescriptor): Descriptor for shared memory operands of tcgen05 mma instructions. * [​`UMMAInsDescriptor`](/mojo/stdlib/gpu/mma_sm100/UMMAInsDescriptor): Descriptor for UMMA instructions. * [​`UMMAKind`](/mojo/stdlib/gpu/mma_sm100/UMMAKind): Struct for UMMA instruction types. ## Functions * [​`mma`](/mojo/stdlib/gpu/mma_sm100/mma): Perform a matrix multiply-accumulate operation using the tcgen05.mma instruction. * [​`mma_arrive`](/mojo/stdlib/gpu/mma_sm100/mma_arrive): Arrive at the mbar pointer for the MMA instruction. * [​`mma_arrive_multicast`](/mojo/stdlib/gpu/mma_sm100/mma_arrive_multicast): Arrive at the mbar pointer for the MMA instruction for multiple ctas. --- ## mma `mma[kind: UMMAKind, //, cta_group: Int = 1, /, *, c_scale: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](1)](a_desc: MMASmemDescriptor, b_desc: MMASmemDescriptor, c_tmem: SIMD[uint32, 1], inst_desc: UMMAInsDescriptor[kind])` Perform a matrix multiply-accumulate operation using the tcgen05.mma instruction. **Parameters:** * ​kind (`UMMAKind`): Data type of the matrices. * ​cta\_group (`Int`): Number of ctas used by MMA. * ​c\_scale (`SIMD[uint32, 1]`): Scale factor for the C matrix, 0 or 1. **Args:** * ​a\_desc (`MMASmemDescriptor`): The descriptor for the A matrix. * ​b\_desc (`MMASmemDescriptor`): The descriptor for the B matrix. * ​c\_tmem (`SIMD[uint32, 1]`): The address of the C matrix in the tensor memory. * ​inst\_desc (`UMMAInsDescriptor[kind]`): The descriptor for the MMA instruction. `mma[kind: UMMAKind, //, cta_group: Int = 1, /](a_desc: MMASmemDescriptor, b_desc: MMASmemDescriptor, c_tmem: SIMD[uint32, 1], inst_desc: UMMAInsDescriptor[kind], c_scale: SIMD[uint32, 1])` Perform a matrix multiply-accumulate operation using the tcgen05.mma instruction. **Parameters:** * ​kind (`UMMAKind`): Data type of the matrices. * ​cta\_group (`Int`): Number of ctas used by MMA. **Args:** * ​a\_desc (`MMASmemDescriptor`): The descriptor for the A matrix. * ​b\_desc (`MMASmemDescriptor`): The descriptor for the B matrix. * ​c\_tmem (`SIMD[uint32, 1]`): The address of the C matrix in the tensor memory. * ​inst\_desc (`UMMAInsDescriptor[kind]`): The descriptor for the MMA instruction. * ​c\_scale (`SIMD[uint32, 1]`): Scale factor for the C matrix. Any non-zero value is translated to `1`. `mma[kind: UMMAKind, //, cta_group: Int = 1, /](a_desc: SIMD[uint32, 1], b_desc: MMASmemDescriptor, c_tmem: SIMD[uint32, 1], inst_desc: UMMAInsDescriptor[kind], c_scale: SIMD[uint32, 1])` Perform a matrix multiply-accumulate operation using the tcgen05.mma instruction. **Parameters:** * ​kind (`UMMAKind`): Data type of the matrices. * ​cta\_group (`Int`): Number of ctas used by MMA.
**Args:** * ​a\_desc (`SIMD[uint32, 1]`): The descriptor for the A matrix. * ​b\_desc (`MMASmemDescriptor`): The descriptor for the B matrix. * ​c\_tmem (`SIMD[uint32, 1]`): The address of the C matrix in the tensor memory. * ​inst\_desc (`UMMAInsDescriptor[kind]`): The descriptor for the MMA instruction. * ​c\_scale (`SIMD[uint32, 1]`): Scale factor for the C matrix. Any non-zero value is interpreted as `1`. `mma[kind: UMMAKind, //, cta_group: Int = 1, /, *, c_scale: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](1)](a_desc: SIMD[uint32, 1], b_desc: MMASmemDescriptor, c_tmem: SIMD[uint32, 1], inst_desc: UMMAInsDescriptor[kind])` Perform a matrix multiply-accumulate operation using the tcgen05.mma instruction. **Parameters:** * ​kind (`UMMAKind`): Data type of the matrices. * ​cta\_group (`Int`): Number of ctas used by MMA. * ​c\_scale (`SIMD[uint32, 1]`): Scale factor for the C matrix, 0 or 1. **Args:** * ​a\_desc (`SIMD[uint32, 1]`): The descriptor for the A matrix. * ​b\_desc (`MMASmemDescriptor`): The descriptor for the B matrix. * ​c\_tmem (`SIMD[uint32, 1]`): The address of the C matrix in the tensor memory. * ​inst\_desc (`UMMAInsDescriptor[kind]`): The descriptor for the MMA instruction. --- ## mma_arrive `mma_arrive[cta_group: Int = 1](mbar_ptr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin])` Arrive at the mbar pointer for the MMA instruction. **Parameters:** * ​cta\_group (`Int`): Number of ctas used by MMA. **Args:** * ​mbar\_ptr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the mbar. --- ## mma_arrive_multicast `mma_arrive_multicast[cta_group: Int = 1](mbar_ptr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], cta_mask: SIMD[uint16, 1])` Arrive at the mbar pointer for the MMA instruction for multiple ctas. **Parameters:** * ​cta\_group (`Int`): Number of ctas used by MMA. **Args:** * ​mbar\_ptr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the mbar. * ​cta\_mask (`SIMD[uint16, 1]`): Mask of ctas to signal. --- ## mma_util Matrix multiply accumulate (MMA) utilities for GPU tensor cores. This module provides functions for loading matrix tiles from memory into registers and storing results back to memory when using tensor cores for matrix multiplication. It supports both NVIDIA and AMD GPUs with functions specialized for different data types (FP32, FP16, BF16). The key functions are: * load\_matrix\_a: Loads tiles from the first input matrix A * load\_matrix\_b: Loads tiles from the second input matrix B * store\_matrix\_d: Stores result tiles to the output matrix D Each function handles the specific memory access patterns required by the tensor core instructions on each GPU architecture. The tile sizes and data layouts match the hardware requirements documented in the NVIDIA PTX and AMD Matrix Cores documentation. ## Functions * [​`load_matrix_a`](/mojo/stdlib/gpu/mma_util/load_matrix_a): Loads a tile of matrix A from memory to registers for TF32 tensor core operations. * [​`load_matrix_a_amd`](/mojo/stdlib/gpu/mma_util/load_matrix_a_amd): Loads a tile of matrix A from memory to registers for AMD FP32 tensor core operations. * [​`load_matrix_b`](/mojo/stdlib/gpu/mma_util/load_matrix_b): Loads a tile of matrix B from memory to registers for TF32 tensor core operations.
* [​`load_matrix_b_amd`](/mojo/stdlib/gpu/mma_util/load_matrix_b_amd): Loads a tile of matrix B from memory to registers for AMD FP32 tensor core operations. * [​`store_matrix_d`](/mojo/stdlib/gpu/mma_util/store_matrix_d): Stores matrix D tile from registers to memory after tensor core operation. --- ## load_matrix_a `load_matrix_a[m: Int, n: Int, k: Int](a_ptr: UnsafePointer[SIMD[float32, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float32, 4]` Loads a tile of matrix A from memory to registers for TF32 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​a\_ptr (`UnsafePointer[SIMD[float32, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing 4 TF32 values loaded from matrix A in the required order. `load_matrix_a[m: Int, n: Int, k: Int](a_ptr: UnsafePointer[SIMD[float16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float16, 4]` Loads a tile of matrix A from memory to registers for FP16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​a\_ptr (`UnsafePointer[SIMD[float16, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing 4 FP16 values loaded from matrix A in the required order. `load_matrix_a[m: Int, n: Int, k: Int](a_ptr: UnsafePointer[SIMD[bfloat16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[bfloat16, k // 2]` Loads a tile of matrix A from memory to registers for BF16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8 or m=16, n=8, k=16. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​a\_ptr (`UnsafePointer[SIMD[bfloat16, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing k//2 BF16 values loaded from matrix A in the required order. --- ## load_matrix_a_amd `load_matrix_a_amd[m: Int, n: Int, k: Int](a_ptr: UnsafePointer[SIMD[float32, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float32, 1]` Loads a tile of matrix A from memory to registers for AMD FP32 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=16, k=4. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication.
**Args:** * ​a\_ptr (`UnsafePointer[SIMD[float32, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing 1 FP32 value loaded from matrix A. `load_matrix_a_amd[m: Int, n: Int, k: Int, n_blocks: Int = 1](a_ptr: UnsafePointer[SIMD[float16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float16, 4]` Loads a tile of matrix A from memory to registers for AMD FP16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=16, k=16 and n\_blocks=1 or m=4, n=4, k=4 and n\_blocks=16. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. * ​n\_blocks (`Int`): Number of blocks. **Args:** * ​a\_ptr (`UnsafePointer[SIMD[float16, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing 4 FP16 values loaded from matrix A. `load_matrix_a_amd[m: Int, n: Int, k: Int, n_blocks: Int = 1](a_ptr: UnsafePointer[SIMD[bfloat16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[bfloat16, 4]` Loads a tile of matrix A from memory to registers for AMD BF16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=16, k=16 and n\_blocks=1 or m=4, n=4, k=4 and n\_blocks=16. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. * ​n\_blocks (`Int`): Number of blocks. **Args:** * ​a\_ptr (`UnsafePointer[SIMD[bfloat16, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing 4 BF16 values loaded from matrix A. --- ## load_matrix_b `load_matrix_b[m: Int, n: Int, k: Int](b_ptr: UnsafePointer[SIMD[float32, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float32, 2]` Loads a tile of matrix B from memory to registers for TF32 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​b\_ptr (`UnsafePointer[SIMD[float32, 1]]`): Pointer to matrix B data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). **Returns:** SIMD vector containing 2 TF32 values loaded from matrix B in the required order. `load_matrix_b[m: Int, n: Int, k: Int](b_ptr: UnsafePointer[SIMD[float16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float16, 2]` Loads a tile of matrix B from memory to registers for FP16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. 
**Args:** * ​b\_ptr (`UnsafePointer[SIMD[float16, 1]]`): Pointer to matrix B data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). **Returns:** SIMD vector containing 2 FP16 values loaded from matrix B in the required order. `load_matrix_b[m: Int, n: Int, k: Int](b_ptr: UnsafePointer[SIMD[bfloat16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[bfloat16, k // 4]` Loads a tile of matrix B from memory to registers for BF16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8 or m=16, n=8, k=16. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​b\_ptr (`UnsafePointer[SIMD[bfloat16, 1]]`): Pointer to matrix B data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). **Returns:** SIMD vector containing k//4 BF16 values loaded from matrix B in the required order. --- ## load_matrix_b_amd `load_matrix_b_amd[m: Int, n: Int, k: Int](b_ptr: UnsafePointer[SIMD[float32, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float32, 1]` Loads a tile of matrix B from memory to registers for AMD FP32 tensor core operations. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​b\_ptr (`UnsafePointer[SIMD[float32, 1]]`): Pointer to matrix B data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). **Returns:** SIMD vector containing 1 FP32 value loaded from matrix B. `load_matrix_b_amd[m: Int, n: Int, k: Int, n_blocks: Int = 1](b_ptr: UnsafePointer[SIMD[float16, 1]], tile_row: Int, tile_col: Int, ldm: Int, tile_loops: Int = 1) -> SIMD[float16, 4]` Loads a tile of matrix B from memory to registers for AMD FP16 tensor core operations. This function loads 4 consecutive FP16 values per thread from matrix B in a pattern optimized for AMD GPU tensor core operations. Each thread loads values based on its position within the warp. Performance: * Optimized for AMD GPU memory access patterns. * Uses thread ID to determine which elements to load. * Loads 4 consecutive elements per thread for efficient vectorization. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. * ​n\_blocks (`Int`): Number of blocks. **Args:** * ​b\_ptr (`UnsafePointer[SIMD[float16, 1]]`): Pointer to matrix B data in memory (FP16 format). * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). * ​tile\_loops (`Int`): Number of tile loops across matrix B's row dimension. **Returns:** SIMD vector containing 4 FP16 values loaded from matrix B.
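As a module-level illustration of how these loaders pair with `store_matrix_d`, here is a minimal sketch for the NVIDIA TF32 path (m=16, n=8, k=8). The single tile at the origin is an assumption for brevity, and the tensor core MMA step on the fragments is elided since its fragment shapes are covered under `gpu.mma.mma`; the placeholder `d` fragment simply has the shape `store_matrix_d` expects.

```mojo
from gpu.mma_util import load_matrix_a, load_matrix_b, store_matrix_d
from memory import UnsafePointer


fn tile_fragments(
    a_ptr: UnsafePointer[Float32],
    b_ptr: UnsafePointer[Float32],
    d_ptr: UnsafePointer[Float32],
    lda: Int, ldb: Int, ldd: Int,
):
    # Each thread in the warp receives its share of the 16x8x8 tile:
    # 4 TF32 values from A and 2 from B.
    var a_frag = load_matrix_a[16, 8, 8](a_ptr, 0, 0, lda)
    var b_frag = load_matrix_b[16, 8, 8](b_ptr, 0, 0, ldb)
    # The accumulate step that consumes a_frag/b_frag is elided here;
    # see `gpu.mma.mma` for the tensor core operation itself.
    _ = a_frag
    _ = b_frag
    var d = SIMD[DType.float32, 4](0)  # Placeholder result fragment.
    store_matrix_d[16, 8, 8](d_ptr, d, 0, 0, ldd)
```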
`load_matrix_b_amd[m: Int, n: Int, k: Int, n_blocks: Int = 1](b_ptr: UnsafePointer[SIMD[bfloat16, 1]], tile_row: Int, tile_col: Int, ldm: Int, tile_loops: Int = 1) -> SIMD[bfloat16, 4]` Loads a tile of matrix B from memory to registers for AMD BF16 tensor core operations. This function loads 4 consecutive BF16 values per thread from matrix B in a pattern optimized for AMD GPU tensor core operations. Each thread loads values based on its position within the warp. Performance: * Optimized for AMD GPU memory access patterns. * Uses thread ID to determine which elements to load. * Loads 4 consecutive elements per thread for efficient vectorization. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. * ​n\_blocks (`Int`): Number of blocks. **Args:** * ​b\_ptr (`UnsafePointer[SIMD[bfloat16, 1]]`): Pointer to matrix B data in memory (BF16 format). * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). * ​tile\_loops (`Int`): Number of tile loops across matrix B's row dimension. **Returns:** SIMD vector containing 4 BF16 values loaded from matrix B. --- ## store_matrix_d `store_matrix_d[dtype: DType, //, m: Int, n: Int, k: Int, n_blocks: Int = 1](d_ptr: UnsafePointer[SIMD[dtype, 1]], d: SIMD[dtype, 4], tile_row: Int, tile_col: Int, ldm: Int)` Stores matrix D tile from registers to memory after tensor core operation. This function dispatches to architecture-specific implementations for storing the results of a tensor core matrix multiply-accumulate operation. It handles the different memory layouts required by NVIDIA and AMD tensor cores. Note: * Automatically selects appropriate implementation based on GPU architecture. * Each thread stores 4 elements in architecture-specific positions. * Must be called by all threads in a warp. **Parameters:** * ​dtype (`DType`): Data type of the matrix elements. * ​m (`Int`): Number of rows in matrix D. * ​n (`Int`): Number of columns in matrix D. * ​k (`Int`): Inner dimension for matrix multiply. * ​n\_blocks (`Int`): Number of blocks. **Args:** * ​d\_ptr (`UnsafePointer[SIMD[dtype, 1]]`): Pointer to destination memory for matrix D. * ​d (`SIMD[dtype, 4]`): SIMD vector containing 4 elements to store. * ​tile\_row (`Int`): Starting row index of the tile in matrix D. * ​tile\_col (`Int`): Starting column index of the tile in matrix D. * ​ldm (`Int`): Leading dimension (stride) of matrix D. --- ## ProfileBlock `struct ProfileBlock[enabled: Bool = False]` A struct for profiling code blocks. This struct provides context manager functionality to profile code blocks. When enabled, it records the start and end time of the block and prints the timing information. ## Parameters * ​enabled (`Bool`): Whether profiling is enabled for this block. ## Fields * ​name (`StringSlice[StaticConstantOrigin]`): Name of the profiling block used for identification in timing output. * ​loc (`_SourceLocation`): Source code location information for the profiling block, including file, line, and column. * ​start\_time (`UInt`): Start time of the profiling block in nanoseconds, captured using perf\_counter\_ns(). ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, name: StringSlice[StaticConstantOrigin])` Initialize a new ProfileBlock. 
**Args:** * ​name (`StringSlice[StaticConstantOrigin]`): Name to identify this profiling block. ### `__enter__` `__enter__(mut self)` Enter the profiling block and record start time if enabled. ### `__exit__` `__exit__(mut self)` Exit the profiling block, record end time and print timing if enabled. --- ## profiler This module provides GPU profiling functionality. The profiler module enables performance profiling of GPU code blocks through a simple context manager interface. It includes: * ProfileBlock: A context manager for timing code blocks * Configurable profiling that can be enabled/disabled at compile time * Nanosecond precision timing using perf\_counter\_ns() * Source location tracking for profiled blocks * Formatted timing output Example: ```mojo from gpu import profiler with profiler.ProfileBlock("my_kernel"): # Code to profile run_gpu_kernel() ``` ## Structs * [​`ProfileBlock`](/mojo/stdlib/gpu/profiler/ProfileBlock): A struct for profiling code blocks. --- ## Random `struct Random[rounds: Int = 6]` A high-performance random number generator using the Philox algorithm. The Philox algorithm is a counter-based random number generator designed for parallel and GPU computing. It provides high-quality random numbers with excellent statistical properties. ## Parameters * ​rounds (`Int`): Number of mixing rounds to perform. Higher values provide better statistical quality at the cost of performance. Default is 6. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, *, seed: SIMD[uint64, 1] = __init__[__mlir_type.!pop.int_literal](0), subsequence: SIMD[uint64, 1] = __init__[__mlir_type.!pop.int_literal](0), offset: SIMD[uint64, 1] = __init__[__mlir_type.!pop.int_literal](0))` Initialize the random number generator. **Args:** * ​seed (`SIMD[uint64, 1]`): Initial seed value for reproducible sequences. Default is 0. * ​subsequence (`SIMD[uint64, 1]`): Subsequence number for generating independent streams. Default is 0. * ​offset (`SIMD[uint64, 1]`): Starting offset in the sequence. Default is 0. ### `step` `step(mut self) -> SIMD[uint32, 4]` Generate 4 random 32-bit unsigned integers. **Returns:** SIMD vector containing 4 random 32-bit unsigned integers. ### `step_uniform` `step_uniform(mut self) -> SIMD[float32, 4]` Generate 4 random floating point numbers uniformly distributed in \[0,1). **Returns:** SIMD vector containing 4 random float32 values in range \[0,1). --- ## random Random number generation for GPU kernels. This module implements a high-performance random number generator using the Philox algorithm, which is designed for parallel and GPU computing. The Philox algorithm is a counter-based random number generator that provides high-quality random numbers with excellent statistical properties. The main class is Random which generates both uniform random numbers and raw 32-bit integers. It supports: * Seeding for reproducible sequences * Multiple independent subsequences * Configurable number of rounds for quality vs performance tradeoff * Vectorized operations for efficiency Example: ```mojo from gpu.random import Random rng = Random(seed=42) uniform_values = rng.step_uniform() # Returns 4 random floats in [0,1) raw_values = rng.step() # Returns 4 raw 32-bit integers ``` ## Structs * [​`Random`](/mojo/stdlib/gpu/random/Random): A high-performance random number generator using the Philox algorithm. --- ## Semaphore `@register_passable` `struct Semaphore` A device-wide semaphore implementation for GPUs. 
This struct provides atomic operations and memory barriers for inter-CTA synchronization. It uses a single thread per CTA to perform atomic operations on a shared lock variable. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(lock: UnsafePointer[SIMD[int32, 1]], thread_id: Int) -> Self` Initialize a new Semaphore instance. **Args:** * ​lock (`UnsafePointer[SIMD[int32, 1]]`): Pointer to shared lock variable in global memory. * ​thread\_id (`Int`): Thread ID within the CTA, used to determine if this thread should perform atomic operations. ### `fetch` `fetch(mut self)` Fetch the current state of the semaphore from global memory. Only the designated wait thread (thread 0) performs the actual load, using an acquire memory ordering to ensure proper synchronization. ### `state` `state(self) -> SIMD[int32, 1]` Get the current state of the semaphore. **Returns:** The current state value of the semaphore. ### `wait` `wait(mut self, status: Int = 0)` Wait until the semaphore reaches the specified state. Uses a barrier-based spin loop where all threads participate in checking the state. Only proceeds when the state matches the expected status. **Args:** * ​status (`Int`): The state value to wait for (defaults to 0). ### `release` `release(mut self, status: SIMD[int32, 1] = __init__[__mlir_type.!pop.int_literal](0))` Release the semaphore by setting it to the specified state. Ensures all threads have reached this point via a barrier before the designated thread updates the semaphore state. **Args:** * ​status (`SIMD[int32, 1]`): The new state value to set (defaults to 0). --- ## semaphore This module provides a device-wide semaphore implementation for NVIDIA GPUs. The Semaphore struct enables inter-CTA (Cooperative Thread Array) synchronization by providing atomic operations and memory barriers. It uses NVIDIA-specific intrinsics to implement efficient thread synchronization. Example:
```mojo
from gpu import Semaphore

var lock = UnsafePointer[Int32](...)
var sem = Semaphore(lock, thread_id)

# Wait for a specific state
sem.wait(0)

# Release the semaphore
sem.release(1)
```
## Structs * [​`Semaphore`](/mojo/stdlib/gpu/semaphore/Semaphore): A device-wide semaphore implementation for GPUs. --- ## AMDScheduleBarrierMask `@register_passable(trivial)` `struct AMDScheduleBarrierMask` Represents different instruction scheduling masks for AMDGPU scheduling instructions. These masks control which types of instructions can be reordered across a barrier for performance optimization. When used with schedule\_barrier(), the mask determines which instructions the compiler is allowed to move across the barrier point. ## Implemented traits `AnyType`, `Copyable`, `Intable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ALL_ALU` `alias ALL_ALU = AMDScheduleBarrierMask(1)` Allows reordering of all arithmetic and logic instructions that don't involve memory operations. ### `ALL_DS` `alias ALL_DS = AMDScheduleBarrierMask(128)` Permits reordering of all Local Data Share (LDS) operations. ### `ALL_VMEM` `alias ALL_VMEM = AMDScheduleBarrierMask(16)` Enables reordering of all vector memory operations (reads and writes). ### `DS_READ` `alias DS_READ = AMDScheduleBarrierMask(256)` Enables reordering of LDS read operations only. ### `DS_WRITE` `alias DS_WRITE = AMDScheduleBarrierMask(512)` Enables reordering of LDS write operations only. ### `MFMA` `alias MFMA = AMDScheduleBarrierMask(8)` Allows reordering of matrix multiplication and WMMA instructions.
### `NONE` `alias NONE = AMDScheduleBarrierMask(0)` No instructions can cross the barrier. Most restrictive option. ### `SALU` `alias SALU = AMDScheduleBarrierMask(4)` Permits reordering of scalar arithmetic/logic unit instructions only. ### `TRANS` `alias TRANS = AMDScheduleBarrierMask(1024)` Allows reordering of transcendental instructions (sin, cos, exp, etc). ### `VALU` `alias VALU = AMDScheduleBarrierMask(2)` Permits reordering of vector arithmetic/logic unit instructions only. ### `VMEM_READ` `alias VMEM_READ = AMDScheduleBarrierMask(32)` Allows reordering of vector memory read operations only. ### `VMEM_WRITE` `alias VMEM_WRITE = AMDScheduleBarrierMask(64)` Allows reordering of vector memory write operations only. ## Methods ### `__init__` `@implicit` `__init__(value: Int) -> Self` Initializes an `AMDScheduleBarrierMask` from an integer value. This implicit constructor allows creating a barrier mask directly from an integer, which is useful for combining multiple mask flags using bitwise operations. **Args:** * ​value (`Int`): The integer value to use for the barrier mask. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compares two `AMDScheduleBarrierMask` instances for equality. **Args:** * ​other (`Self`): The other `AMDScheduleBarrierMask` to compare with. **Returns:** True if the masks have the same value, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compares two `AMDScheduleBarrierMask` instances for inequality. **Args:** * ​other (`Self`): The other `AMDScheduleBarrierMask` to compare with. **Returns:** True if the masks have different values, False otherwise. ### `__str__` `__str__(self) -> String` Returns a string representation of the `AMDScheduleBarrierMask`. Converts the mask to a human-readable string based on its value. **Returns:** A string representation of the mask, or aborts if the value is invalid. ### `__int__` `__int__(self) -> Int` Converts the `AMDScheduleBarrierMask` to an integer. **Returns:** The integer value of the mask, which can be used with low-level APIs. --- ## async_copy_arrive `async_copy_arrive[type: AnyType, address_space: AddressSpace](address: UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin])` Makes a memory barrier track all prior async copy operations from this thread. This function ensures that all previously initiated asynchronous copy operations from the executing thread are tracked by the memory barrier at the specified location. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The data type stored at the barrier location. * ​address\_space (`AddressSpace`): The memory address space where the barrier is located. **Args:** * ​address (`UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory barrier object location. --- ## barrier `barrier()` Performs a synchronization barrier at the block level. This is equivalent to \_\_syncthreads() in CUDA. All threads in a thread block must execute this function before any thread can proceed past the barrier. This ensures memory operations before the barrier are visible to all threads after the barrier. --- ## cp_async_bulk_commit_group `cp_async_bulk_commit_group()` Commits all previously initiated but uncommitted cp.async.bulk instructions into a cp.async.bulk-group.
The cp.async.bulk instructions are used for asynchronous bulk memory transfers on NVIDIA GPUs. The function creates a synchronization point for bulk memory transfers, allowing better control over memory movement and synchronization between different stages of computation. Note: This functionality is only available on NVIDIA GPUs. Attempting to use this function on non-NVIDIA GPUs will result in a compile time error. --- ## cp_async_bulk_wait_group `cp_async_bulk_wait_group[n: SIMD[int32, 1], read: Bool = True]()` Waits for completion of asynchronous bulk memory transfer groups. This function causes the executing thread to wait until a specified number of the most recent bulk async-groups are pending. It provides synchronization control for bulk memory transfers on NVIDIA GPUs. Note: This functionality is only available on NVIDIA GPUs. Attempting to use this function on non-NVIDIA GPUs will result in a compile time error. Example: ```mojo from gpu.sync import cp_async_bulk_wait_group # Wait until at most 2 async groups are pending cp_async_bulk_wait_group[2]() # Wait for all async groups to complete cp_async_bulk_wait_group[0]() ``` **Parameters:** * ​n (`SIMD[int32, 1]`): The number of most recent bulk async-groups allowed to remain pending. When n=0, waits for all prior bulk async-groups to complete. * ​read (`Bool`): If True, indicates that subsequent reads to the transferred memory are expected, enabling optimizations for read access patterns. Defaults to True. --- ## sync This module provides GPU synchronization primitives and barriers. The module includes: * Block-level synchronization barriers (barrier()) * Warp-level synchronization (syncwarp()) * Memory barriers (mbarrier) for NVIDIA GPUs * Instruction scheduling controls for AMD GPUs * Asynchronous copy and bulk transfer synchronization The synchronization primitives help coordinate execution between threads within thread blocks and warps, and manage memory consistency across different memory spaces. ## Structs * [​`AMDScheduleBarrierMask`](/mojo/stdlib/gpu/sync/AMDScheduleBarrierMask): Represents different instruction scheduling masks for AMDGPU scheduling instructions. ## Functions * [​`async_copy_arrive`](/mojo/stdlib/gpu/sync/async_copy_arrive): Makes a memory barrier track all prior async copy operations from this thread. * [​`barrier`](/mojo/stdlib/gpu/sync/barrier): Performs a synchronization barrier at the block level. * [​`cp_async_bulk_commit_group`](/mojo/stdlib/gpu/sync/cp_async_bulk_commit_group): Commits all previously initiated but uncommitted cp.async.bulk instructions into a cp.async.bulk-group. * [​`cp_async_bulk_wait_group`](/mojo/stdlib/gpu/sync/cp_async_bulk_wait_group): Waits for completion of asynchronous bulk memory transfer groups. * [​`mbarrier_arrive`](/mojo/stdlib/gpu/sync/mbarrier_arrive): Signal thread arrival at a shared memory barrier. * [​`mbarrier_arrive_expect_tx_relaxed`](/mojo/stdlib/gpu/sync/mbarrier_arrive_expect_tx_relaxed): Configure a shared memory barrier to expect additional async transactions. * [​`mbarrier_arrive_expect_tx_shared`](/mojo/stdlib/gpu/sync/mbarrier_arrive_expect_tx_shared): Configure a shared memory barrier to expect additional async transactions. * [​`mbarrier_init`](/mojo/stdlib/gpu/sync/mbarrier_init): Initialize a shared memory barrier for synchronizing multiple threads. * [​`mbarrier_test_wait`](/mojo/stdlib/gpu/sync/mbarrier_test_wait): Test if all threads have arrived at the memory barrier.
* [​`mbarrier_try_wait_parity_shared`](/mojo/stdlib/gpu/sync/mbarrier_try_wait_parity_shared): Wait for completion of a barrier phase with timeout. * [​`named_barrier`](/mojo/stdlib/gpu/sync/named_barrier): Performs a named synchronization barrier at the block level. * [​`schedule_barrier`](/mojo/stdlib/gpu/sync/schedule_barrier): Controls instruction scheduling across a barrier point in AMD GPU code. * [​`schedule_group_barrier`](/mojo/stdlib/gpu/sync/schedule_group_barrier): Controls instruction scheduling across a barrier point in AMD GPU code by creating schedule groups. * [​`syncwarp`](/mojo/stdlib/gpu/sync/syncwarp): Synchronizes threads within a warp using a barrier. --- ## mbarrier_arrive `mbarrier_arrive[type: AnyType](shared_mem: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]) -> Int` Signal thread arrival at a shared memory barrier. Records that the calling thread has reached the barrier synchronization point. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The data type stored at the barrier location. **Args:** * ​shared\_mem (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the shared memory barrier. **Returns:** An integer representing the current state of the memory barrier. --- ## mbarrier_arrive_expect_tx_relaxed `mbarrier_arrive_expect_tx_relaxed[type: AnyType, scope: Scope = Scope(3), space: Scope = Scope(3)](addr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], tx_count: SIMD[int32, 1]) -> SIMD[uint64, 1]` Configure a shared memory barrier to expect additional async transactions. Updates the current phase of the memory barrier to track completion of additional asynchronous transactions. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The type of the memory barrier. * ​scope (`Scope`): The scope of the memory barrier. * ​space (`Scope`): The space of the memory barrier. **Args:** * ​addr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the shared memory barrier. * ​tx\_count (`SIMD[int32, 1]`): Number of expected transactions to track. **Returns:** The state of the memory barrier. --- ## mbarrier_arrive_expect_tx_shared `mbarrier_arrive_expect_tx_shared[type: AnyType](addr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], tx_count: SIMD[int32, 1])` Configure a shared memory barrier to expect additional async transactions. Updates the current phase of the memory barrier to track completion of additional asynchronous transactions. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The type of the memory barrier. **Args:** * ​addr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the shared memory barrier. * ​tx\_count (`SIMD[int32, 1]`): Number of expected transactions to track. --- ## mbarrier_init `mbarrier_init[type: AnyType](shared_mem: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], num_threads: SIMD[int32, 1])` Initialize a shared memory barrier for synchronizing multiple threads. Sets up a memory barrier in shared memory that will be used to synchronize the specified number of threads. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The data type stored at the barrier location.
**Args:** * ​shared\_mem (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to shared memory location for the barrier. * ​num\_threads (`SIMD[int32, 1]`): Number of threads that will synchronize on this barrier. --- ## mbarrier_test_wait `mbarrier_test_wait[type: AnyType](shared_mem: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], state: Int) -> Bool` Test if all threads have arrived at the memory barrier. Non-blocking check to see if all participating threads have reached the barrier. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The data type stored at the barrier location. **Args:** * ​shared\_mem (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the shared memory barrier. * ​state (`Int`): Expected state of the memory barrier. **Returns:** True if all threads have arrived, False otherwise. --- ## mbarrier_try_wait_parity_shared `mbarrier_try_wait_parity_shared[type: AnyType](addr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], phase: SIMD[int32, 1], ticks: SIMD[int32, 1])` Wait for completion of a barrier phase with timeout. Waits for the shared memory barrier to complete the specified phase, or until the timeout period expires. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The type of the memory barrier. **Args:** * ​addr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the shared memory barrier. * ​phase (`SIMD[int32, 1]`): Phase number to wait for. * ​ticks (`SIMD[int32, 1]`): Timeout period in nanoseconds. --- ## named_barrier `named_barrier[num_threads: SIMD[int32, 1], id: SIMD[int32, 1] = __init__[__mlir_type.!pop.int_literal](0)]()` Performs a named synchronization barrier at the block level. This function creates a synchronization point using a specific barrier ID, allowing for multiple independent barriers within a thread block. All threads in the block must execute this function with the same barrier ID and thread count before any thread can proceed past the barrier. Notes: * Only supported on NVIDIA GPUs. * Maps directly to the `nvvm.barrier` instruction. * Useful for fine-grained synchronization when different subsets of threads need to synchronize independently. * The barrier ID must not exceed 16. * All threads participating in the barrier must specify the same num\_threads value. **Parameters:** * ​num\_threads (`SIMD[int32, 1]`): The number of threads that must reach the barrier before any can proceed. * ​id (`SIMD[int32, 1]`): The barrier identifier (0-16). Default is 0. --- ## schedule_barrier `schedule_barrier(mask: AMDScheduleBarrierMask = AMDScheduleBarrierMask(0))` Controls instruction scheduling across a barrier point in AMD GPU code. This function creates a scheduling barrier that controls which types of instructions can be reordered across it by the compiler. The mask parameter specifies which instruction categories (ALU, memory, etc) are allowed to cross the barrier during scheduling optimization. Note: This function only has an effect on AMD GPUs. On other platforms it will raise a compile time error. **Args:** * ​mask (`AMDScheduleBarrierMask`): A bit mask of AMDScheduleBarrierMask flags indicating which instruction types can be scheduled across this barrier. Default is NONE, meaning no instructions can cross. 
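As a brief illustration of the mask argument described above, here is a minimal sketch, assuming the calls run inside an AMD GPU kernel (the choice of masks is illustrative only):

```mojo
from gpu.sync import AMDScheduleBarrierMask, schedule_barrier

# Let only non-memory arithmetic/logic instructions be reordered
# across this point during compiler scheduling.
schedule_barrier(AMDScheduleBarrierMask.ALL_ALU)

# Most restrictive form: no instructions may cross the barrier.
schedule_barrier(AMDScheduleBarrierMask.NONE)
```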
--- ## schedule_group_barrier `schedule_group_barrier(mask: AMDScheduleBarrierMask, size: SIMD[int32, 1], sync_id: SIMD[int32, 1])` Controls instruction scheduling across a barrier point in AMD GPU code by creating schedule groups. This function creates a scheduling barrier that groups instructions into sequences with custom ordering. It affects the code that precedes the barrier. The barrier ensures instructions are scheduled according to the specified group parameters. Note: This function only has an effect on AMD GPUs. On other platforms it will raise a compile time error. The sync\_id parameter allows creating multiple schedule groups that can be ordered relative to each other. **Args:** * ​mask (`AMDScheduleBarrierMask`): A bit mask of AMDScheduleBarrierMask flags indicating which instruction types can be scheduled across this barrier. Similar to schedule\_barrier masks. * ​size (`SIMD[int32, 1]`): The number of times to repeat the instruction sequence in the schedule group. * ​sync\_id (`SIMD[int32, 1]`): A unique identifier for the group that determines the ordering of instructions within the same schedule group. --- ## syncwarp `syncwarp(mask: Int = -1)` Synchronizes threads within a warp using a barrier. This function creates a synchronization point where threads in a warp must wait until all threads specified by the mask reach this point. On NVIDIA GPUs, it uses warp-level synchronization primitives. On AMD GPUs, this is a no-op since threads execute in lock-step. Note: * On NVIDIA GPUs, this maps to the nvvm.bar.warp.sync intrinsic. * On AMD GPUs, this is a no-op since threads execute in lock-step. * Threads not participating in the sync must still execute the instruction. **Args:** * ​mask (`Int`): An integer bitmask specifying which lanes (threads) in the warp should be synchronized. Each bit corresponds to a lane, with bit i controlling lane i. A value of 1 means the lane participates in the sync, 0 means it does not. Default value of -1 (all bits set) synchronizes all lanes. --- ## TensorMemory `@register_passable(trivial)` `struct TensorMemory` A wrapper around tensor memory allocated for tcgen05 instructions. ## Fields * ​ptr (`UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3), alignment=16]`): Pointer to the tensor memory address. * ​num\_cols (`SIMD[uint32, 1]`): The number of columns in the tensor memory. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(num_cols: SIMD[uint32, 1]) -> Self` Initialize the TensorMemory struct. **Args:** * ​num\_cols (`SIMD[uint32, 1]`): The number of columns to allocate. --- ## tcgen05 This module includes utilities for working with the tensorcore 5th generation (tcgen05) instructions. ## Aliases ### `check_blackwell_constraint` `alias check_blackwell_constraint = constrained[_has_blackwell_tcgen05(), "The tcgen05 instructions are only applicable on NVIDIA Blackwell (sm_100a, sm_101a) hardware."]` ## Structs * [​`TensorMemory`](/mojo/stdlib/gpu/tcgen05/TensorMemory): A wrapper around tensor memory allocated for tcgen05 instructions. ## Functions * [​`tcgen05_alloc`](/mojo/stdlib/gpu/tcgen05/tcgen05_alloc): Allocates tensor memory for use with tcgen05 instructions. * [​`tcgen05_cp`](/mojo/stdlib/gpu/tcgen05/tcgen05_cp): Copies data from shared memory described by the matrix descriptor `s_desc` to tensor memory `tmem_addr`.
* [​`tcgen05_dealloc`](/mojo/stdlib/gpu/tcgen05/tcgen05_dealloc): Deallocates tensor memory allocated by tcgen05\_alloc(). * [​`tcgen05_fence_after`](/mojo/stdlib/gpu/tcgen05/tcgen05_fence_after): Orders all the subsequent asynchronous `tcgen05` operations. * [​`tcgen05_fence_before`](/mojo/stdlib/gpu/tcgen05/tcgen05_fence_before): Orders all the prior asynchronous `tcgen05` operations. * [​`tcgen05_ld`](/mojo/stdlib/gpu/tcgen05/tcgen05_ld): Loads data from tensor memory into registers. * [​`tcgen05_load_wait`](/mojo/stdlib/gpu/tcgen05/tcgen05_load_wait): Waits for tensor memory loads to complete. * [​`tcgen05_release_allocation_lock`](/mojo/stdlib/gpu/tcgen05/tcgen05_release_allocation_lock): Releases the allocation lock for the current CTA group. * [​`tcgen05_st`](/mojo/stdlib/gpu/tcgen05/tcgen05_st): Stores data from registers into tensor memory. * [​`tcgen05_store_wait`](/mojo/stdlib/gpu/tcgen05/tcgen05_store_wait): Waits for tensor memory stores to complete. --- ## tcgen05_alloc `tcgen05_alloc[cta_group: SIMD[int32, 1]](ptr_tmem_addr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3), alignment=16], num_cols: SIMD[uint32, 1])` Allocates tensor memory for use with tcgen05 instructions. Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+). **Parameters:** * ​cta\_group (`SIMD[int32, 1]`): The cooperative thread array (CTA) group ID. **Args:** * ​ptr\_tmem\_addr (`UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3), alignment=16]`): Shared memory pointer to hold tensor memory address. * ​num\_cols (`SIMD[uint32, 1]`): The number of columns to allocate. --- ## tcgen05_cp `tcgen05_cp[*, cta_group: SIMD[int32, 1], datapaths: Int, bits: Int, src_fmt: String = "", dst_fmt: String = "", multicast: String = ""](tmem_addr: SIMD[uint32, 1], s_desc: MMASmemDescriptor)` Copies data from shared memory described by the matrix descriptor `s_desc` to tensor memory `tmem_addr`. Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+). **Parameters:** * ​cta\_group (`SIMD[int32, 1]`): The cooperative thread array (CTA) group ID. * ​datapaths (`Int`): The first dimension of the shape. * ​bits (`Int`): The second dimension of the shape. * ​src\_fmt (`String`): Source format string. * ​dst\_fmt (`String`): Destination format string. * ​multicast (`String`): Multicast string. **Args:** * ​tmem\_addr (`SIMD[uint32, 1]`): Address of the tensor memory. * ​s\_desc (`MMASmemDescriptor`): Matrix descriptor for the copy operation. --- ## tcgen05_dealloc `tcgen05_dealloc[cta_group: SIMD[int32, 1]](tmem_addr: SIMD[uint32, 1], num_cols: SIMD[uint32, 1])` Deallocates tensor memory allocated by tcgen05\_alloc(). This function deallocates tensor memory that was previously allocated using tcgen05\_alloc(). The deallocation must be performed by the same CTA group that performed the allocation. **Parameters:** * ​cta\_group (`SIMD[int32, 1]`): The cooperative thread array (CTA) group ID. **Args:** * ​tmem\_addr (`SIMD[uint32, 1]`): Address of the tensor memory to deallocate. * ​num\_cols (`SIMD[uint32, 1]`): Number of columns in the tensor memory. --- ## tcgen05_fence_after `tcgen05_fence_after()` Orders all the subsequent asynchronous `tcgen05` operations. Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+). --- ## tcgen05_fence_before `tcgen05_fence_before()` Orders all the prior asynchronous `tcgen05` operations.
Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+). --- ## tcgen05_ld `tcgen05_ld[*, datapaths: Int, bits: Int, repeat: Int, type: DType, pack: Bool, width: Int = (datapaths * bits * repeat) // 1024](tmem_addr: SIMD[uint32, 1]) -> SIMD[type, width]` Loads data from tensor memory into registers. **Parameters:** * ​datapaths (`Int`): The first dimension of the shape. * ​bits (`Int`): The second dimension of the shape. * ​repeat (`Int`): The repeat factor. * ​type (`DType`): The data type to load. * ​pack (`Bool`): Whether to pack two 16-bit chunks of adjacent columns into a single 32-bit register. * ​width (`Int`): The number of elements in the result vector. **Args:** * ​tmem\_addr (`SIMD[uint32, 1]`): The address of the tensor memory to load from. **Returns:** The SIMD register containing the loaded data. --- ## tcgen05_load_wait `tcgen05_load_wait()` Waits for tensor memory loads to complete. Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+). --- ## tcgen05_release_allocation_lock `tcgen05_release_allocation_lock[cta_group: SIMD[int32, 1]]()` Releases the allocation lock for the current CTA group. Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+). **Parameters:** * ​cta\_group (`SIMD[int32, 1]`): The cooperative thread array (CTA) group ID. --- ## tcgen05_st `tcgen05_st[type: DType, width: Int, //, *, datapaths: Int, bits: Int, repeat: Int, pack: Bool](tmem_addr: SIMD[uint32, 1], data: SIMD[type, width])` Stores data from registers into tensor memory. **Parameters:** * ​type (`DType`): The data type to store. * ​width (`Int`): The number of elements in the data vector. * ​datapaths (`Int`): The first dimension of the shape. * ​bits (`Int`): The second dimension of the shape. * ​repeat (`Int`): The repeat factor. * ​pack (`Bool`): Whether to pack two 16-bit chunks of adjacent columns into a single 32-bit register. **Args:** * ​tmem\_addr (`SIMD[uint32, 1]`): The address of the tensor memory to store to. * ​data (`SIMD[type, width]`): The data to store into the tensor memory. --- ## tcgen05_store_wait `tcgen05_store_wait()` Waits for tensor memory stores to complete. Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+). --- ## tensor_ops This module provides tensor core operations and utilities for GPU computation. The module includes functions for: * Tensor core based reductions (tc\_reduce) supporting various data types and SIMD widths * GEVM (General Matrix-Vector Multiplication) reductions using tensor cores * Efficient warp-level reductions leveraging tensor core operations The tensor core operations are optimized for NVIDIA GPUs and support different data types including float32, float16, and bfloat16. The module provides both scalar and vector variants of reduction operations with different SIMD widths for maximum performance. Key functions: * tc\_reduce: Main tensor core reduction function supporting various types and widths * tc\_reduce\_gevm\_8x: 8x GEVM reduction using tensor cores * tc\_reduce\_gevm\_4x: 4x GEVM reduction using tensor cores Note: Most operations require NVIDIA GPUs with tensor core support. Operations are optimized for warp-level execution.
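As a sketch of how the main entry point is typically invoked (the kernel fragment and values are illustrative assumptions; the import path follows the URLs in this reference, and a tensor core capable NVIDIA GPU is assumed):

```mojo
from gpu.tensor_ops import tc_reduce

fn reduce_fragment():
    # Four bfloat16 values per lane, reduced to a single float32 scalar
    # via tensor core operations; in_type and simd_width are inferred
    # from the argument, only out_type is passed explicitly.
    var vals = SIMD[DType.bfloat16, 4](1.0, 2.0, 3.0, 4.0)
    var total = tc_reduce[DType.float32](vals)
    _ = total
```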
## Functions * [​`tc_reduce`](/mojo/stdlib/gpu/tensor_ops/tc_reduce): Performs tensor core based reduction on a SIMD vector. * [​`tc_reduce_gevm_4x`](/mojo/stdlib/gpu/tensor_ops/tc_reduce_gevm_4x): Performs a 4x GEVM reduction using tensor cores. * [​`tc_reduce_gevm_8x`](/mojo/stdlib/gpu/tensor_ops/tc_reduce_gevm_8x): Performs an 8x GEVM reduction using tensor cores. --- ## tc_reduce `tc_reduce[in_type: DType, simd_width: Int, //, out_type: DType](val: SIMD[in_type, simd_width]) -> SIMD[out_type, 1]` Performs tensor core based reduction on a SIMD vector. Note: Dispatches to either scalar or vector reduction implementation based on SIMD width. Supports various input/output type combinations using tensor core operations. **Parameters:** * ​in\_type (`DType`): The input data type of the SIMD vector elements. * ​simd\_width (`Int`): The width of the SIMD vector. * ​out\_type (`DType`): The output data type for the reduced result. **Args:** * ​val (`SIMD[in_type, simd_width]`): Input SIMD vector to reduce. **Returns:** Scalar containing the reduced result. --- ## tc_reduce_gevm_4x `tc_reduce_gevm_4x[out_type: DType, in_type: DType, simd_width: Int](val1: SIMD[in_type, simd_width]) -> SIMD[out_type, simd_width]` Performs a 4x GEVM reduction using tensor cores. Note: Currently only supports bfloat16 input to float32 output conversion. Uses tensor core matrix multiply-accumulate (MMA) operations for reduction. **Parameters:** * ​out\_type (`DType`): The output data type for the reduction result (must be float32). * ​in\_type (`DType`): The input data type of the vector to reduce (must be bfloat16). * ​simd\_width (`Int`): The width of the SIMD vector. **Args:** * ​val1 (`SIMD[in_type, simd_width]`): Input SIMD vector to reduce. **Returns:** SIMD vector containing the reduced result. --- ## tc_reduce_gevm_8x `tc_reduce_gevm_8x[out_type: DType, in_type: DType, simd_width: Int](val1: SIMD[in_type, simd_width], val2: SIMD[in_type, simd_width]) -> SIMD[out_type, simd_width]` Performs an 8x GEVM reduction using tensor cores. Note: Currently only supports bfloat16 input to float32 output conversion. Uses tensor core matrix multiply-accumulate (MMA) operations for reduction. **Parameters:** * ​out\_type (`DType`): The output data type for the reduction result (must be float32). * ​in\_type (`DType`): The input data type of the vectors to reduce (must be bfloat16). * ​simd\_width (`Int`): The width of the SIMD vectors. **Args:** * ​val1 (`SIMD[in_type, simd_width]`): First input SIMD vector to reduce. * ​val2 (`SIMD[in_type, simd_width]`): Second input SIMD vector to reduce. **Returns:** SIMD vector containing the reduced result. --- ## ReductionMethod `@register_passable(trivial)` `struct ReductionMethod` Enumerates the supported reduction methods. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `TENSOR_CORE` `alias TENSOR_CORE = ReductionMethod(0)` Use tensor core for reduction. ### `WARP` `alias WARP = ReductionMethod(1)` Use warp shuffle for reduction. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two `ReductionMethod` values are equal. **Args:** * ​other (`Self`): The other ReductionMethod to compare. **Returns:** True if the `ReductionMethod` values are equal, false otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if two `ReductionMethod` values are not equal. **Args:** * ​other (`Self`): The other ReductionMethod to compare. **Returns:** True if the `ReductionMethod` values are not equal, false otherwise.
### `__is__` `__is__(self, other: Self) -> Bool` Checks if two `ReductionMethod` values are identical. **Args:** * ​other (`Self`): The other ReductionMethod to compare. **Returns:** True if the `ReductionMethod` values are identical, false otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Checks if two `ReductionMethod` values are not identical. **Args:** * ​other (`Self`): The other ReductionMethod to compare. **Returns:** True if the `ReductionMethod` values are not identical, false otherwise. --- ## broadcast `broadcast[val_type: DType, simd_width: Int, //](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Broadcasts a SIMD value from lane 0 to all lanes in the warp. This function takes a SIMD value from lane 0 and copies it to all other lanes in the warp, effectively broadcasting the value across the entire warp. This is useful for sharing data between threads in a warp without using shared memory. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to broadcast from lane 0. **Returns:** A SIMD value where all lanes contain a copy of the input value from lane 0. `broadcast(val: Int) -> Int` Broadcasts an integer value from lane 0 to all lanes in the warp. This function takes an integer value from lane 0 and copies it to all other lanes in the warp. It provides a convenient way to share scalar integer data between threads without using shared memory. **Args:** * ​val (`Int`): The integer value to broadcast from lane 0. **Returns:** The broadcast integer value, where all lanes receive a copy of the input from lane 0. `broadcast(val: UInt) -> UInt` Broadcasts an unsigned integer value from lane 0 to all lanes in the warp. This function takes an unsigned integer value from lane 0 and copies it to all other lanes in the warp. It provides a convenient way to share scalar unsigned integer data between threads without using shared memory. **Args:** * ​val (`UInt`): The unsigned integer value to broadcast from lane 0. **Returns:** The broadcast unsigned integer value, where all lanes receive a copy of the input from lane 0. --- ## warp GPU warp-level operations and utilities. This module provides warp-level operations for NVIDIA and AMD GPUs, including: * Shuffle operations to exchange values between threads in a warp: * shuffle\_idx: Copy value from source lane to other lanes * shuffle\_up: Copy from lower lane IDs * shuffle\_down: Copy from higher lane IDs * shuffle\_xor: Exchange values in butterfly pattern * Warp-wide reductions: * sum: Compute sum across warp * max: Find maximum value across warp * min: Find minimum value across warp * broadcast: Broadcast value to all lanes The module handles both NVIDIA and AMD GPU architectures through architecture-specific implementations of the core operations. It supports various data types including integers, floats, and half-precision floats, with SIMD vectorization. ## Structs * [​`ReductionMethod`](/mojo/stdlib/gpu/warp/ReductionMethod): Enumerates the supported reduction methods. ## Functions * [​`broadcast`](/mojo/stdlib/gpu/warp/broadcast): Broadcasts a SIMD value from lane 0 to all lanes in the warp. * [​`lane_group_max`](/mojo/stdlib/gpu/warp/lane_group_max): Reduces a SIMD value to its maximum within a lane group using warp-level operations.
* [​`lane_group_max_and_broadcast`](/mojo/stdlib/gpu/warp/lane_group_max_and_broadcast): Reduces and broadcasts the maximum value within a lane group using warp-level operations. * [​`lane_group_min`](/mojo/stdlib/gpu/warp/lane_group_min): Reduces a SIMD value to its minimum within a lane group using warp-level operations. * [​`lane_group_reduce`](/mojo/stdlib/gpu/warp/lane_group_reduce): Performs a generic warp-level reduction operation using shuffle operations. * [​`lane_group_sum`](/mojo/stdlib/gpu/warp/lane_group_sum): Computes the sum of values across a group of lanes using warp-level operations. * [​`lane_group_sum_and_broadcast`](/mojo/stdlib/gpu/warp/lane_group_sum_and_broadcast): Computes the sum across a lane group and broadcasts the result to all lanes. * [​`max`](/mojo/stdlib/gpu/warp/max): Computes the maximum value across all lanes in a warp. * [​`min`](/mojo/stdlib/gpu/warp/min): Computes the minimum value across all lanes in a warp. * [​`prefix_sum`](/mojo/stdlib/gpu/warp/prefix_sum): Computes a warp-level prefix sum (scan) operation. * [​`reduce`](/mojo/stdlib/gpu/warp/reduce): Performs a generic warp-wide reduction operation using shuffle operations. * [​`shuffle_down`](/mojo/stdlib/gpu/warp/shuffle_down): Copies values from threads with higher lane IDs in the warp. * [​`shuffle_idx`](/mojo/stdlib/gpu/warp/shuffle_idx): Copies a value from a source lane to other lanes in a warp. * [​`shuffle_up`](/mojo/stdlib/gpu/warp/shuffle_up): Copies values from threads with lower lane IDs in the warp. * [​`shuffle_xor`](/mojo/stdlib/gpu/warp/shuffle_xor): Exchanges values between threads in a warp using a butterfly pattern. * [​`sum`](/mojo/stdlib/gpu/warp/sum): Computes the sum of values across all lanes in a warp. --- ## lane_group_max `lane_group_max[val_type: DType, simd_width: Int, //, num_lanes: Int, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Reduces a SIMD value to its maximum within a lane group using warp-level operations. This function performs a parallel reduction across a group of lanes to find the maximum value. The reduction is done using warp shuffle operations for efficient communication between lanes. The result is stored in all participating lanes. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​num\_lanes (`Int`): The number of threads participating in the reduction. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to find the maximum. **Returns:** A SIMD value where all participating lanes contain the maximum value found across the lane group. Non-participating lanes (lane\_id >= num\_lanes) retain their original values. --- ## lane_group_max_and_broadcast `lane_group_max_and_broadcast[val_type: DType, simd_width: Int, //, num_lanes: Int, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Reduces and broadcasts the maximum value within a lane group using warp-level operations. This function performs a parallel reduction to find the maximum value and broadcasts it to all lanes. The reduction and broadcast are done using warp shuffle operations in a butterfly pattern for efficient all-to-all communication between lanes. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). 
* ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​num\_lanes (`Int`): The number of threads participating in the reduction. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce and broadcast. Each lane contributes its value. **Returns:** A SIMD value where all participating lanes contain the maximum value found across the lane group. Non-participating lanes (lane\_id >= num\_lanes) retain their original values. --- ## lane_group_min `lane_group_min[val_type: DType, simd_width: Int, //, num_lanes: Int, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Reduces a SIMD value to its minimum within a lane group using warp-level operations. This function performs a parallel reduction across a group of lanes to find the minimum value. The reduction is done using warp shuffle operations for efficient communication between lanes. The result is stored in all participating lanes. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​num\_lanes (`Int`): The number of threads participating in the reduction. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to find the minimum. **Returns:** A SIMD value where all participating lanes contain the minimum value found across the lane group. Non-participating lanes (lane\_id >= num\_lanes) retain their original values. --- ## lane_group_reduce `lane_group_reduce[val_type: DType, simd_width: Int, //, shuffle: fn[DType, Int](val: SIMD[$0, $1], offset: SIMD[uint32, 1]) -> SIMD[$0, $1], func: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1], num_lanes: Int, *, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Performs a generic warp-level reduction operation using shuffle operations. This function implements a parallel reduction across threads in a warp using a butterfly pattern. It allows customizing both the shuffle operation and reduction function. Example:

```mojo
from gpu.warp import lane_group_reduce, shuffle_down

# Compute a sum across 16 threads using shuffle down.
@parameter
fn add[type: DType, width: Int](x: SIMD[type, width], y: SIMD[type, width]) -> SIMD[type, width]:
    return x + y

var val = SIMD[DType.float32, 16](42.0)
var result = lane_group_reduce[shuffle_down, add, num_lanes=16](val)
```

**Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​shuffle (`fn[DType, Int](val: SIMD[$0, $1], offset: SIMD[uint32, 1]) -> SIMD[$0, $1]`): A function that performs the warp shuffle operation. Takes a SIMD value and offset and returns the shuffled result. * ​func (`fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]`): A binary function that combines two SIMD values during reduction. This defines the reduction operation (e.g. add, max, min). * ​num\_lanes (`Int`): The number of lanes in a group. The reduction is done within each group. Must be a power of 2. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value. **Returns:** A SIMD value containing the reduction result.
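For the fixed lane-group reductions documented above (`lane_group_max`, `lane_group_min`), usage is simpler than the generic form; a brief sketch, where the 16-lane group size, the scalar width, and the fragment name are illustrative assumptions:

```mojo
from gpu.warp import lane_group_max, lane_group_min

fn minmax_fragment():
    # Each lane contributes one float32 value; the reduction runs within
    # groups of 16 consecutive lanes (stride defaults to 1), and every
    # participating lane receives the group result.
    var x = SIMD[DType.float32, 1](2.0)
    var group_max = lane_group_max[num_lanes=16](x)
    var group_min = lane_group_min[num_lanes=16](x)
    _ = group_max
    _ = group_min
```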
--- ## lane_group_sum `lane_group_sum[val_type: DType, simd_width: Int, //, num_lanes: Int, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Computes the sum of values across a group of lanes using warp-level operations. This function performs a parallel reduction across a group of lanes to compute their sum. The reduction is done using warp shuffle operations for efficient communication between lanes. The result is stored in all participating lanes. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​num\_lanes (`Int`): The number of threads participating in the reduction. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to the sum. **Returns:** A SIMD value where all participating lanes contain the sum found across the lane group. Non-participating lanes (lane\_id >= num\_lanes) retain their original values. --- ## lane_group_sum_and_broadcast `lane_group_sum_and_broadcast[val_type: DType, simd_width: Int, //, num_lanes: Int, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Computes the sum across a lane group and broadcasts the result to all lanes. This function performs a parallel reduction using a butterfly pattern to compute the sum, then broadcasts the result to all participating lanes. The butterfly pattern ensures efficient communication between lanes through warp shuffle operations. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​num\_lanes (`Int`): The number of threads participating in the reduction. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to the sum. **Returns:** A SIMD value where all participating lanes contain the sum found across the lane group. Non-participating lanes (lane\_id >= num\_lanes) retain their original values. --- ## max `max[val_type: DType, simd_width: Int, //](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Computes the maximum value across all lanes in a warp. This is a convenience wrapper around lane\_group\_max that operates on the entire warp. It performs a parallel reduction using warp shuffle operations to find the global maximum value across all lanes in the warp. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to find the maximum. **Returns:** A SIMD value where all lanes contain the maximum value found across the entire warp. --- ## min `min[val_type: DType, simd_width: Int, //](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Computes the minimum value across all lanes in a warp. This is a convenience wrapper around lane\_group\_min that operates on the entire warp. It performs a parallel reduction using warp shuffle operations to find the global minimum value across all lanes in the warp. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). 
* ​simd\_width (`Int`): The number of elements in the SIMD vector. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to find the minimum. **Returns:** A SIMD value where all lanes contain the minimum value found across the entire warp. The minimum value is broadcast to all lanes. --- ## prefix_sum `prefix_sum[type: DType, //, intermediate_type: DType = type, *, output_type: DType = type, exclusive: Bool = False](x: SIMD[type, 1]) -> SIMD[output_type, 1]` Computes a warp-level prefix sum (scan) operation. Performs an inclusive or exclusive prefix sum across threads in a warp using a parallel scan algorithm with warp shuffle operations. This implements an efficient parallel scan with logarithmic complexity. For example, if we have a warp with the following elements: $$ [x_0, x_1, x_2, x_3, x_4] $$ The prefix sum is: $$ [x_0, x_0 + x_1, x_0 + x_1 + x_2, x_0 + x_1 + x_2 + x_3, x_0 + x_1 + x_2 + x_3 + x_4] $$ **Parameters:** * ​type (`DType`): The data type of the input SIMD elements. * ​intermediate\_type (`DType`): Type used for intermediate calculations (defaults to input type). * ​output\_type (`DType`): The desired output data type (defaults to input type). * ​exclusive (`Bool`): If True, performs exclusive scan where each thread receives the sum of all previous threads. If False (default), performs inclusive scan where each thread receives the sum including its own value. **Args:** * ​x (`SIMD[type, 1]`): The SIMD value to include in the prefix sum. **Returns:** A scalar containing the prefix sum at the current thread's position in the warp, cast to the specified output type. --- ## reduce `reduce[val_type: DType, simd_width: Int, //, shuffle: fn[DType, Int](val: SIMD[$0, $1], offset: SIMD[uint32, 1]) -> SIMD[$0, $1], func: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Performs a generic warp-wide reduction operation using shuffle operations. This is a convenience wrapper around lane\_group\_reduce that operates on the entire warp. It allows customizing both the shuffle operation and reduction function. Example:

```mojo
from gpu.warp import reduce, shuffle_down

# Compute a warp-wide sum using shuffle down.
@parameter
fn add[type: DType, width: Int](x: SIMD[type, width], y: SIMD[type, width]) capturing -> SIMD[type, width]:
    return x + y

var val = SIMD[DType.float32, 4](2.0, 4.0, 6.0, 8.0)
var result = reduce[shuffle_down, add](val)
```

**Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​shuffle (`fn[DType, Int](val: SIMD[$0, $1], offset: SIMD[uint32, 1]) -> SIMD[$0, $1]`): A function that performs the warp shuffle operation. Takes a SIMD value and offset and returns the shuffled result. * ​func (`fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]`): A binary function that combines two SIMD values during reduction. This defines the reduction operation (e.g. add, max, min). **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value. **Returns:** A SIMD value containing the reduction result broadcast to all lanes in the warp. --- ## shuffle_down `shuffle_down[type: DType, simd_width: Int, //](val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies values from threads with higher lane IDs in the warp.
Performs a shuffle operation where each thread receives a value from a thread with a higher lane ID, offset by the specified amount. Uses the full warp mask by default. For example, with offset=1: * Thread 0 gets value from thread 1 * Thread 1 gets value from thread 2 * Thread N gets value from thread N+1 * Last N threads get undefined values **Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​val (`SIMD[type, simd_width]`): The SIMD value to be shuffled down the warp. * ​offset (`SIMD[uint32, 1]`): The number of lanes to shift values down by. Must be positive. **Returns:** The SIMD value from the thread offset lanes higher in the warp. Returns undefined values for threads where lane\_id + offset >= WARP\_SIZE. `shuffle_down[type: DType, simd_width: Int, //](mask: UInt, val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies values from threads with higher lane IDs in the warp using a custom mask. Performs a shuffle operation where each thread receives a value from a thread with a higher lane ID, offset by the specified amount. The mask parameter controls which threads participate in the shuffle. For example, with offset=1: * Thread 0 gets value from thread 1 * Thread 1 gets value from thread 2 * Thread N gets value from thread N+1 * Last N threads get undefined values **Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​mask (`UInt`): A bitmask controlling which threads participate in the shuffle. Only threads with their corresponding bit set will exchange values. * ​val (`SIMD[type, simd_width]`): The SIMD value to be shuffled down the warp. * ​offset (`SIMD[uint32, 1]`): The number of lanes to shift values down by. Must be positive. **Returns:** The SIMD value from the thread offset lanes higher in the warp. Returns undefined values for threads where lane\_id + offset >= WARP\_SIZE or where the corresponding mask bit is not set. --- ## shuffle_idx `shuffle_idx[type: DType, simd_width: Int, //](val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies a value from a source lane to other lanes in a warp. Broadcasts a value from a source thread in a warp to all participating threads without using shared memory. This is a convenience wrapper that uses the full warp mask by default. Example:

```mojo
from gpu.warp import shuffle_idx

var val = SIMD[DType.float32, 16](1.0)

# Broadcast value from lane 0 to all lanes
var result = shuffle_idx(val, 0)

# Get value from lane 5
result = shuffle_idx(val, 5)
```

**Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32, half). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​val (`SIMD[type, simd_width]`): The SIMD value to be broadcast from the source lane. * ​offset (`SIMD[uint32, 1]`): The source lane ID to copy the value from. **Returns:** A SIMD vector where all lanes contain the value from the source lane specified by offset. `shuffle_idx[type: DType, simd_width: Int, //](mask: UInt, val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies a value from a source lane to other lanes in a warp with explicit mask control. Broadcasts a value from a source thread in a warp to participating threads specified by the mask.
This provides fine-grained control over which threads participate in the shuffle operation. Example:

```mojo
from gpu.warp import shuffle_idx

# Only broadcast to the first 16 lanes.
var mask = 0xFFFF  # 16 ones
var val = SIMD[DType.float32, 32](1.0)
var result = shuffle_idx(mask, val, 5)
```

**Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32, half). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​mask (`UInt`): A bit mask specifying which lanes participate in the shuffle (1 bit per lane). * ​val (`SIMD[type, simd_width]`): The SIMD value to be broadcast from the source lane. * ​offset (`SIMD[uint32, 1]`): The source lane ID to copy the value from. **Returns:** A SIMD vector where participating lanes (set in mask) contain the value from the source lane specified by offset. Non-participating lanes retain their original values. --- ## shuffle_up `shuffle_up[type: DType, simd_width: Int, //](val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies values from threads with lower lane IDs in the warp. Performs a shuffle operation where each thread receives a value from a thread with a lower lane ID, offset by the specified amount. Uses the full warp mask by default. For example, with offset=1: * Thread N gets value from thread N-1 * Thread 1 gets value from thread 0 * Thread 0 gets undefined value **Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​val (`SIMD[type, simd_width]`): The SIMD value to be shuffled up the warp. * ​offset (`SIMD[uint32, 1]`): The number of lanes to shift values up by. **Returns:** The SIMD value from the thread offset lanes lower in the warp. Returns undefined values for threads where `lane_id - offset < 0`. `shuffle_up[type: DType, simd_width: Int, //](mask: UInt, val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies values from threads with lower lane IDs in the warp. Performs a shuffle operation where each thread receives a value from a thread with a lower lane ID, offset by the specified amount. The operation is performed only for threads specified in the mask. For example, with offset=1: * Thread N gets value from thread N-1 if both threads are in the mask * Thread 1 gets value from thread 0 if both threads are in the mask * Thread 0 gets undefined value * Threads not in the mask get undefined values **Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​mask (`UInt`): The warp mask specifying which threads participate in the shuffle. * ​val (`SIMD[type, simd_width]`): The SIMD value to be shuffled up the warp. * ​offset (`SIMD[uint32, 1]`): The number of lanes to shift values up by. **Returns:** The SIMD value from the thread offset lanes lower in the warp. Returns undefined values for threads where `lane_id - offset < 0`. --- ## shuffle_xor `shuffle_xor[type: DType, simd_width: Int, //](val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Exchanges values between threads in a warp using a butterfly pattern. Performs a butterfly exchange pattern where each thread swaps values with another thread whose lane ID differs by a bitwise XOR with the given offset. This creates a butterfly communication pattern useful for parallel reductions and scans.
**Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​val (`SIMD[type, simd_width]`): The SIMD value to be exchanged with another thread. * ​offset (`SIMD[uint32, 1]`): The lane offset to XOR with the current thread's lane ID to determine the exchange partner. Common values are powers of 2 for butterfly patterns. **Returns:** The SIMD value from the thread at lane (current\_lane XOR offset). `shuffle_xor[type: DType, simd_width: Int, //](mask: UInt, val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Exchanges values between threads in a warp using a butterfly pattern with masking. Performs a butterfly exchange pattern where each thread swaps values with another thread whose lane ID differs by a bitwise XOR with the given offset. The mask parameter allows controlling which threads participate in the exchange. Example:

```mojo
from gpu.warp import shuffle_xor

# Exchange values between odd-numbered lanes 4 lanes apart
# (bit i of the mask controls lane i, so 0xAAAAAAAA selects odd lanes).
var mask = 0xAAAAAAAA
var val = SIMD[DType.float32, 16](42.0)  # Example value
var result = shuffle_xor(mask, val, 4)
```

**Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​mask (`UInt`): A bit mask specifying which threads participate in the exchange. Only threads with their corresponding bit set in the mask will exchange values. * ​val (`SIMD[type, simd_width]`): The SIMD value to be exchanged with another thread. * ​offset (`SIMD[uint32, 1]`): The lane offset to XOR with the current thread's lane ID to determine the exchange partner. Common values are powers of 2 for butterfly patterns. **Returns:** The SIMD value from the thread at lane (current\_lane XOR offset) if both threads are enabled by the mask, otherwise the original value is preserved. --- ## sum `sum[val_type: DType, simd_width: Int, //](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Computes the sum of values across all lanes in a warp. This is a convenience wrapper around lane\_group\_sum\_and\_broadcast that operates on the entire warp. It performs a parallel reduction using warp shuffle operations to find the global sum across all lanes in the warp. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to the sum. **Returns:** A SIMD value where all lanes contain the sum found across the entire warp. The sum is broadcast to all lanes. `sum[dtype: DType, size: Int, //, intermediate_type: DType, *, reduction_method: ReductionMethod, output_type: DType](x: SIMD[dtype, size]) -> SIMD[output_type, 1]` Performs a warp-level reduction to compute the sum of values across threads. This function provides two reduction methods: 1. Warp shuffle: Uses warp shuffle operations to efficiently sum values across threads 2. Tensor core: Leverages tensor cores for high-performance reductions, with type casting The tensor core method will cast the input to the specified intermediate type before reduction to ensure compatibility with tensor core operations. The warp shuffle method requires the output type to match the input type. **Constraints:** * For warp shuffle reduction, output\_type must match the input value type.
* For tensor core reduction, input will be cast to intermediate\_type. **Parameters:** * ​intermediate\_type (`DType`): The data type to cast to when using tensor core reduction. * ​reduction\_method (`ReductionMethod`): `WARP` for warp shuffle or `TENSOR_CORE` for tensor core reduction. * ​output\_type (`DType`): The desired output data type for the reduced value. **Args:** * ​x (`SIMD[dtype, size]`): The SIMD value to reduce across the warp. **Returns:** A scalar containing the sum of the input values across all threads in the warp, cast to the specified output type. --- ## Hashable A trait for types which specify a function to hash their data. This hash function will be used for applications like hash maps, and doesn't need to be cryptographically secure. A good hash function will hash similar or common values to different results, and in particular the *low order bits* of the hash, which are used in smaller dictionaries, should be sensitive to any changes in the data structure. If your type's hash function doesn't meet this criterion it will get poor performance in common hash map implementations.

```mojo
@fieldwise_init
struct Foo(Hashable):
    fn __hash__(self) -> UInt:
        return 4  # chosen by fair random dice roll

var foo = Foo()
print(hash(foo))
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__hash__` `__hash__(self: _Self) -> UInt` Return a 64-bit hash of the type's data. **Returns:** A 64-bit integer hash of this instance's data. --- ## hash `hash[T: Hashable](hashable: T) -> UInt` Hash a Hashable type using its underlying hash implementation. **Parameters:** * ​T (`Hashable`): Any Hashable type. **Args:** * ​hashable (`T`): The input data to hash. **Returns:** A 64-bit integer hash based on the underlying implementation. `hash(bytes: UnsafePointer[SIMD[uint8, 1], alignment=alignment, mut=False, origin=origin], n: Int) -> UInt` Hash a byte array using a SIMD-modified DJBX33A hash algorithm. *This hash function is not suitable for cryptographic purposes.* The algorithm is easy to reverse and can be used to produce deliberate hash collisions. The hash function is designed to have relatively good mixing and statistical properties for use in hash-based data structures. We *do* however initialize a random hash secret which is mixed into the final hash output. This can help prevent denial-of-service attacks on applications which make use of this function for dictionary hashing. As a consequence, hash values are deterministic within an individual runtime instance, i.e. a value will always hash to the same thing, but between runs this value will change based on the hash secret. We take advantage of Mojo's first-class SIMD support to create a SIMD-vectorized hash function, using a simple hash algorithm as a base: * Interpret the input bytes as a SIMD vector, padded with zeros to align to the system SIMD width. * Apply the simple hash function parallelized across SIMD vectors. * Hash the final SIMD vector state to reduce to a single value. Python uses DJBX33A with a hash secret for smaller strings, and then the SipHash algorithm for longer strings. The arguments and tradeoffs are well documented in PEP 456. We should consider this and deeper performance/security tradeoffs as Mojo evolves.
References: * [Wikipedia: Non-cryptographic hash function](https://en.wikipedia.org/wiki/Non-cryptographic_hash_function) * [Python PEP 456](https://peps.python.org/pep-0456/) * [PHP Hash algorithm and collisions](https://www.phpinternalsbook.com/php5/hashtables/hash_algorithm.html)

```mojo
from memory import UnsafePointer
from random import rand

var n = 64
var rand_bytes = UnsafePointer[UInt8].alloc(n)
rand(rand_bytes, n)
print(hash(rand_bytes, n))
rand_bytes.free()
```

**Args:** * ​bytes (`UnsafePointer[SIMD[uint8, 1], alignment=alignment, mut=False, origin=origin]`): The byte array to hash. * ​n (`Int`): The length of the byte array. **Returns:** A 64-bit integer hash. This hash is *not* suitable for cryptographic purposes, but will have good low-bit hash collision statistical properties for common data structures. --- ## hash Implements the `Hashable` trait and `hash()` built-in function. There are a few main tools in this module: * `Hashable` trait for types implementing `__hash__(self) -> UInt` * `hash[T: Hashable](hashable: T) -> UInt` built-in function. * A `hash()` implementation for arbitrary byte strings, `hash(data: UnsafePointer[UInt8], n: Int) -> UInt`, is the workhorse function, which implements efficient hashing via SIMD vectors. See the documentation of this function for more details on the hash implementation. * `hash(SIMD)` and `hash(UInt8)` implementations These are useful helpers to specialize for the general bytes implementation. ## Traits * [​`Hashable`](/mojo/stdlib/hashlib/hash/Hashable): A trait for types which specify a function to hash their data. ## Functions * [​`hash`](/mojo/stdlib/hashlib/hash/hash): Hash a Hashable type using its underlying hash implementation. --- ## hashlib Implements the hashlib package that provides various hash algorithms. ## Modules * [​`hash`](/mojo/stdlib/hashlib/hash/): Implements the `Hashable` trait and `hash()` built-in function. --- ## stdlib ## Packages * [​`algorithm`](/mojo/stdlib/algorithm/): Implements the algorithm package. * [​`base64`](/mojo/stdlib/base64/): Implements the base64 package. * [​`benchmark`](/mojo/stdlib/benchmark/): Implements the benchmark package for runtime benchmarking. * [​`bit`](/mojo/stdlib/bit/): Implements the bit package. * [​`buffer`](/mojo/stdlib/buffer/): Implements the buffer package. * [​`builtin`](/mojo/stdlib/builtin/): Implements the builtin package. * [​`collections`](/mojo/stdlib/collections/): Implements the collections package. * [​`compile`](/mojo/stdlib/compile/): Provides utilities for compiling and inspecting Mojo code at runtime. * [​`complex`](/mojo/stdlib/complex/): Provides types and functions for working with complex numbers. * [​`documentation`](/mojo/stdlib/documentation/): Implements the documentation package. * [​`gpu`](/mojo/stdlib/gpu/): Provides low-level programming constructs for working with GPUs. * [​`hashlib`](/mojo/stdlib/hashlib/): Implements the hashlib package that provides various hash algorithms. * [​`logger`](/mojo/stdlib/logger/): Provides logging functionality with different severity levels. * [​`math`](/mojo/stdlib/math/): Implements the math package. * [​`memory`](/mojo/stdlib/memory/): The memory package provides several pointer types, as well as utility functions for dealing with memory. * [​`os`](/mojo/stdlib/os/): Provides access to operating-system dependent functionality. * [​`pathlib`](/mojo/stdlib/pathlib/): Implements the pathlib package. * [​`prelude`](/mojo/stdlib/prelude/): Implements the prelude package.
This package provides the public entities that are automatically imported into every Mojo program. * [​`pwd`](/mojo/stdlib/pwd/): Provides access to user and group information from the password database. * [​`python`](/mojo/stdlib/python/): Implements the python package. * [​`random`](/mojo/stdlib/random/): Implements the random package. * [​`runtime`](/mojo/stdlib/runtime/): Implements the runtime package. * [​`stat`](/mojo/stdlib/stat/): Implements the stat package. * [​`subprocess`](/mojo/stdlib/subprocess/): Implements the subprocess package. * [​`sys`](/mojo/stdlib/sys/): Implements the sys package. * [​`tempfile`](/mojo/stdlib/tempfile/): Implements the tempfile package. * [​`testing`](/mojo/stdlib/testing/): Implements the testing package. * [​`time`](/mojo/stdlib/time/): Implements the time package. * [​`utils`](/mojo/stdlib/utils/): Implements the utils package. --- ## logger Provides logging functionality with different severity levels. ## Modules * [​`logger`](/mojo/stdlib/logger/logger/): Provides logging functionality with different severity levels. --- ## Level `struct Level` Represents logging severity levels. Defines the available logging levels in ascending order of severity. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `CRITICAL` `alias CRITICAL = Level(50)` A serious error indicating that the program itself may be unable to continue running. ### `DEBUG` `alias DEBUG = Level(10)` Detailed information, typically of interest only when diagnosing problems. ### `ERROR` `alias ERROR = Level(40)` Due to a more serious problem, the software has not been able to perform some function. ### `INFO` `alias INFO = Level(20)` Confirmation that things are working as expected. ### `NOTSET` `alias NOTSET = Level(0)` Lowest level, used when no level is set. ### `WARNING` `alias WARNING = Level(30)` Indication that something unexpected happened, or may happen in the near future. ## Methods ### `__lt__` `__lt__(self, other: Self) -> Bool` Returns True if this level is less than the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is less than the other level, False otherwise. ### `__le__` `__le__(self, other: Self) -> Bool` Returns True if this level is less than or equal to the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is less than or equal to the other level, False otherwise. ### `__eq__` `__eq__(self, other: Self) -> Bool` Returns True if this level equals the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if the levels are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Returns True if this level does not equal the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if the levels are not equal, False otherwise. ### `__gt__` `__gt__(self, other: Self) -> Bool` Returns True if this level is greater than the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is greater than the other level, False otherwise. ### `__ge__` `__ge__(self, other: Self) -> Bool` Returns True if this level is greater than or equal to the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is greater than or equal to the other level, False otherwise.
### `__is__` `__is__(self, other: Self) -> Bool` Returns True if this level is identical to the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is identical to the other level, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Returns True if this level is not identical to the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is not identical to the other level, False otherwise. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the string representation of this level to a writer. **Parameters:** * ​W (`Writer`): The writer type that implements the Writer trait. **Args:** * ​writer (`W`): The writer to write to. ### `__str__` `__str__(self) -> String` Returns the string representation of this level. **Returns:** String: A human-readable string representation of the level (e.g., "DEBUG", "INFO"). ### `__repr__` `__repr__(self) -> String` Returns the detailed string representation of this level. **Returns:** String: A string representation including the type name and level value (e.g., "Level.DEBUG"). --- ## Logger `struct Logger[level: Level = DEFAULT_LEVEL]` A logger that outputs messages at or above a specified severity level. ## Parameters * ​level (`Level`): The minimum severity level for messages to be logged. Defaults to `DEFAULT_LEVEL`, which is read from the LOGGING\_LEVEL environment variable. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, fd: FileDescriptor = FileDescriptor(1))` Initializes a new Logger. **Args:** * ​fd (`FileDescriptor`): The file descriptor to write log messages to (defaults to stdout). ### `debug` `debug[*Ts: Writable](self, *values: *Ts)` Logs a debug message. **Parameters:** * ​\*Ts (`Writable`): The types of values to log. **Args:** * ​\*values (`*Ts`): The values to log. ### `info` `info[*Ts: Writable](self, *values: *Ts)` Logs an info message. **Parameters:** * ​\*Ts (`Writable`): The types of values to log. **Args:** * ​\*values (`*Ts`): The values to log. ### `warning` `warning[*Ts: Writable](self, *values: *Ts)` Logs a warning message. **Parameters:** * ​\*Ts (`Writable`): The types of values to log. **Args:** * ​\*values (`*Ts`): The values to log. ### `error` `error[*Ts: Writable](self, *values: *Ts)` Logs an error message. **Parameters:** * ​\*Ts (`Writable`): The types of values to log. **Args:** * ​\*values (`*Ts`): The values to log. ### `critical` `critical[*Ts: Writable](self, *values: *Ts)` Logs a critical message and aborts execution. **Parameters:** * ​\*Ts (`Writable`): The types of values to log. **Args:** * ​\*values (`*Ts`): The values to log. --- ## logger Provides logging functionality with different severity levels. This module implements a simple logging system with configurable severity levels: `NOTSET`, `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. The logging level can be set via the LOGGING\_LEVEL environment variable. The main components are: * `Level`: An enum-like struct defining the available logging levels * `Logger`: A struct that handles logging messages with different severity levels Example:

```mojo
from logger import Logger

var logger = Logger()  # Uses default level from LOGGING_LEVEL env var
logger.info("Starting process")
logger.debug("Debug information")
logger.error("An error occurred")
```

The logger can be configured to write to different file descriptors (default stdout).
Messages below the configured level will be silently ignored. ## Aliases ### `DEFAULT_LEVEL` `alias DEFAULT_LEVEL = _from_str(env_get_string["LOGGING_LEVEL"]())` The default logging level, read from the LOGGING\_LEVEL environment variable. ## Structs * [​`Level`](/mojo/stdlib/logger/logger/Level): Represents logging severity levels. * [​`Logger`](/mojo/stdlib/logger/logger/Logger): A logger that outputs messages at or above a specified severity level. --- ## constants Defines math utilities. You can import these APIs from the `math` package. For example:

```mojo
from math import pi
```

## Aliases ### `e` `alias e = 2.7182818284590451` Euler's constant e = 2.718281... ### `log2e` `alias log2e = 1.4426950408889634` log2e = log2(e), where e is Euler's constant. ### `pi` `alias pi = 3.1415926535897931` The mathematical constant π = 3.141592... ### `tau` `alias tau = 6.2831853071795862` The mathematical constant τ = 6.283185.... Tau is the ratio of a circle's circumference to its radius (2π). --- ## math Implements the math package. ## Modules * [​`constants`](/mojo/stdlib/math/constants/): Defines math utilities. * [​`math`](/mojo/stdlib/math/math/): Defines math utilities. * [​`polynomial`](/mojo/stdlib/math/polynomial/): Provides two implementations for evaluating polynomials. --- ## CeilDivable The `CeilDivable` trait describes a type that defines a ceil division operation. Types that conform to `CeilDivable` will work with the `math.ceildiv` function. For example:

```mojo
from math import CeilDivable

@fieldwise_init
struct Foo(CeilDivable, Copyable):
    var x: Float64

    fn __ceildiv__(self, denominator: Self) -> Self:
        return Self(self.x // denominator.x)
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__ceildiv__` `__ceildiv__(self: _Self, denominator: _Self) -> _Self` Return the rounded-up result of dividing self by denominator. **Args:** * ​denominator (`_Self`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. --- ## CeilDivableRaising The `CeilDivableRaising` trait describes a type that defines a ceil division operation that can raise. Types that conform to `CeilDivableRaising` will work with the `//` operator as well as the `math.ceildiv` function. For example:

```mojo
from math import CeilDivableRaising

@fieldwise_init
struct Foo(CeilDivableRaising, Copyable):
    var x: Float64

    fn __ceildiv__(self, denominator: Self) raises -> Self:
        return Self(self.x // denominator.x)
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__ceildiv__` `__ceildiv__(self: _Self, denominator: _Self) raises -> _Self` Return the rounded-up result of dividing self by denominator. **Args:** * ​denominator (`_Self`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. --- ## Ceilable The `Ceilable` trait describes a type that defines a ceiling operation. Types that conform to `Ceilable` will work with the builtin `ceil` function. The ceiling operation always returns the same type as the input. For example:

```mojo
from math import Ceilable, ceil

@fieldwise_init
struct Complex(Ceilable, Copyable):
    var re: Float64
    var im: Float64

    fn __ceil__(self) -> Self:
        return Self(ceil(self.re), ceil(self.im))
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__ceil__` `__ceil__(self: _Self) -> _Self` Return the ceiling of this value. **Returns:** The ceiling of this value. --- ## Floorable The `Floorable` trait describes a type that defines a floor operation. Types that conform to `Floorable` will work with the builtin `floor` function.
The floor operation always returns the same type as the input. For example:

```mojo
from math import Floorable, floor

@fieldwise_init
struct Complex(Floorable, Copyable):
    var re: Float64
    var im: Float64

    fn __floor__(self) -> Self:
        return Self(floor(self.re), floor(self.im))
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__floor__` `__floor__(self: _Self) -> _Self` Return the floor of this value. **Returns:** The floor of this value. --- ## Truncable The `Truncable` trait describes a type that defines a truncation operation. Types that conform to `Truncable` will work with the builtin `trunc` function. The truncation operation always returns the same type as the input. For example:

```mojo
from math import Truncable, trunc

@fieldwise_init
struct Complex(Truncable, Copyable):
    var re: Float64
    var im: Float64

    fn __trunc__(self) -> Self:
        return Self(trunc(self.re), trunc(self.im))
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__trunc__` `__trunc__(self: _Self) -> _Self` Return the truncated value. **Returns:** The truncated value. --- ## acos `acos[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `acos` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `acos` of the input. --- ## acosh `acosh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `acosh` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `acosh` of the input. --- ## align_down `align_down(value: Int, alignment: Int) -> Int` Returns the closest multiple of alignment that is less than or equal to value. **Args:** * ​value (`Int`): The value to align. * ​alignment (`Int`): Value to align to. **Returns:** Closest multiple of the alignment that is less than or equal to the input value. In other words, floor(value / alignment) \* alignment. `align_down(value: UInt, alignment: UInt) -> UInt` Returns the closest multiple of alignment that is less than or equal to value. **Args:** * ​value (`UInt`): The value to align. * ​alignment (`UInt`): Value to align to. **Returns:** Closest multiple of the alignment that is less than or equal to the input value. In other words, floor(value / alignment) \* alignment. --- ## align_up `align_up(value: Int, alignment: Int) -> Int` Returns the closest multiple of alignment that is greater than or equal to value. **Args:** * ​value (`Int`): The value to align. * ​alignment (`Int`): Value to align to. **Returns:** Closest multiple of the alignment that is greater than or equal to the input value. In other words, ceiling(value / alignment) \* alignment. `align_up(value: UInt, alignment: UInt) -> UInt` Returns the closest multiple of alignment that is greater than or equal to value. **Args:** * ​value (`UInt`): The value to align. * ​alignment (`UInt`): Value to align to. **Returns:** Closest multiple of the alignment that is greater than or equal to the input value. In other words, ceiling(value / alignment) \* alignment.
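A short worked example of the two alignment helpers (the specific values are illustrative):

```mojo
from math import align_down, align_up

fn main():
    # floor(37 / 8) * 8 = 32, and ceiling(37 / 8) * 8 = 40.
    print(align_down(37, 8))  # 32
    print(align_up(37, 8))    # 40
```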
--- ## asin `asin[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `asin` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `asin` of the input. --- ## asinh `asinh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `asinh` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `asinh` of the input. --- ## atan `atan[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `atan` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `atan` of the input. --- ## atan2 `atan2[dtype: DType, width: Int, //](y: SIMD[dtype, width], x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `atan2` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​y (`SIMD[dtype, width]`): The first input argument. * ​x (`SIMD[dtype, width]`): The second input argument. **Returns:** The `atan2` of the inputs. --- ## atanh `atanh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `atanh` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `atanh` of the input. --- ## cbrt `cbrt[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `cbrt` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `cbrt` of the input. --- ## ceil `ceil[T: Ceilable, //](value: T) -> T` Get the ceiling value of the given object. **Parameters:** * ​T (`Ceilable`): The type conforming to `Ceilable`. **Args:** * ​value (`T`): The object to get the ceiling value of. **Returns:** The ceiling value of the object. --- ## ceildiv `ceildiv[T: CeilDivable, //](numerator: T, denominator: T) -> T` Return the rounded-up result of dividing numerator by denominator. **Parameters:** * ​T (`CeilDivable`): A type that supports ceil division. **Args:** * ​numerator (`T`): The numerator. * ​denominator (`T`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. `ceildiv[T: CeilDivableRaising, //](numerator: T, denominator: T) -> T` Return the rounded-up result of dividing numerator by denominator, potentially raising.
**Parameters:** * ​T (`CeilDivableRaising`): A type that supports ceil division. **Args:** * ​numerator (`T`): The numerator. * ​denominator (`T`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. `ceildiv(numerator: IntLiteral[value], denominator: IntLiteral[value]) -> IntLiteral[(0 - (value // (0 - value)))]` Return the rounded-up result of dividing numerator by denominator. **Args:** * ​numerator (`IntLiteral[value]`): The numerator. * ​denominator (`IntLiteral[value]`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. --- ## clamp `clamp(val: Int, lower_bound: Int, upper_bound: Int) -> Int` Clamps the integer value to be in a certain range. **Args:** * ​val (`Int`): The value to clamp. * ​lower\_bound (`Int`): Minimum of the range to clamp to. * ​upper\_bound (`Int`): Maximum of the range to clamp to. **Returns:** An integer clamped to be within lower\_bound and upper\_bound. `clamp(val: UInt, lower_bound: UInt, upper_bound: UInt) -> UInt` Clamps the integer value to be in a certain range. **Args:** * ​val (`UInt`): The value to clamp. * ​lower\_bound (`UInt`): Minimum of the range to clamp to. * ​upper\_bound (`UInt`): Maximum of the range to clamp to. **Returns:** An integer clamped to be within lower\_bound and upper\_bound. `clamp[dtype: DType, width: Int, //](val: SIMD[dtype, width], lower_bound: SIMD[dtype, width], upper_bound: SIMD[dtype, width]) -> SIMD[dtype, width]` Clamps the values in a SIMD vector to be in a certain range. Clamp cuts values in the input SIMD vector off at the upper bound and lower bound values. For example, SIMD vector `[0, 1, 2, 3]` clamped to a lower bound of 1 and an upper bound of 2 would return `[1, 1, 2, 2]`. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​val (`SIMD[dtype, width]`): The value to clamp. * ​lower\_bound (`SIMD[dtype, width]`): Minimum of the range to clamp to. * ​upper\_bound (`SIMD[dtype, width]`): Maximum of the range to clamp to. **Returns:** A SIMD vector containing val clamped to be within lower\_bound and upper\_bound. --- ## copysign `copysign[dtype: DType, width: Int, //](magnitude: SIMD[dtype, width], sign: SIMD[dtype, width]) -> SIMD[dtype, width]` Returns a value with the magnitude of the first operand and the sign of the second operand. **Constraints:** The type of the input must be numeric. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​magnitude (`SIMD[dtype, width]`): The magnitude to use. * ​sign (`SIMD[dtype, width]`): The sign to copy. **Returns:** Copies the sign from sign to magnitude. --- ## cos `cos[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `cos` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `cos` of the input. --- ## cosh `cosh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `cosh` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector.
* ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `cosh` of the input. --- ## erf `erf[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs the elementwise Erf on a SIMD vector. **Constraints:** The type must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector to perform elementwise Erf on. **Returns:** The result of the elementwise Erf operation. --- ## erfc `erfc[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `erfc` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `erfc` of the input. --- ## exp `exp[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Calculates elementwise exponential of the input vector. Given an input vector $X$ and an output vector $Y$, sets $Y_i = e^{X_i}$ for each position $i$ in the input vector (where $e$ is the mathematical constant $e$). **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input SIMD vector. **Returns:** A SIMD vector containing $e$ raised to the power $X_i$ where $X_i$ is an element in the input SIMD vector. `exp[T: _Expable](x: T) -> T` Computes the exponential of the input value. **Parameters:** * ​T (`_Expable`): The type of the input value. **Args:** * ​x (`T`): The input value. **Returns:** The exponential of the input value. --- ## exp2 `exp2[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes elementwise 2 raised to the power of n, where n is an element of the input SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector to perform exp2 on. **Returns:** Vector containing $2^n$ computed elementwise, where n is an element in the input SIMD vector. --- ## expm1 `expm1[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `expm1` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `expm1` of the input. --- ## factorial `factorial(n: Int) -> Int` Computes the factorial of the integer. **Args:** * ​n (`Int`): The input value. Must be non-negative. **Returns:** The factorial of the input. Results are undefined for negative inputs. --- ## floor `floor[T: Floorable, //](value: T) -> T` Get the floor value of the given object. **Parameters:** * ​T (`Floorable`): The type conforming to `Floorable`. **Args:** * ​value (`T`): The object to get the floor value of. **Returns:** The floor value of the object. 
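As a quick illustration of the rounding helpers documented above, the following sketch applies `floor`, `ceil`, and `ceildiv` to ordinary values (assuming, as the trait examples earlier suggest, that `Float64` conforms to `Ceilable` and `Floorable`, and that `Int` conforms to `CeilDivable`):

```mojo
from math import ceil, ceildiv, floor

def main():
    var x: Float64 = 2.5
    print(floor(x))       # 2.0
    print(ceil(x))        # 3.0
    # Integer ceiling division: ceil(7 / 2) == 4.
    print(ceildiv(7, 2))  # 4
```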
--- ## fma `fma(a: Int, b: Int, c: Int) -> Int` Performs `fma` (fused multiply-add) on the inputs. The result is `(a * b) + c`. **Args:** * ​a (`Int`): The first input. * ​b (`Int`): The second input. * ​c (`Int`): The third input. **Returns:** `(a * b) + c`. `fma(a: UInt, b: UInt, c: UInt) -> UInt` Performs `fma` (fused multiply-add) on the inputs. The result is `(a * b) + c`. **Args:** * ​a (`UInt`): The first input. * ​b (`UInt`): The second input. * ​c (`UInt`): The third input. **Returns:** `(a * b) + c`. `fma[dtype: DType, width: Int, //](a: SIMD[dtype, width], b: SIMD[dtype, width], c: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise `fma` (fused multiply-add) on the inputs. Each element in the result SIMD vector is $(A_i * B_i) + C_i$, where $A_i$, $B_i$ and $C_i$ are elements at index $i$ in a, b, and c respectively. **Parameters:** * ​dtype (`DType`): The `dtype` of the input SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​a (`SIMD[dtype, width]`): The first vector of inputs. * ​b (`SIMD[dtype, width]`): The second vector of inputs. * ​c (`SIMD[dtype, width]`): The third vector of inputs. **Returns:** Elementwise `fma` of a, b and c. --- ## frexp `frexp[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> StaticTuple[SIMD[dtype, width], 2]` Breaks floating point values into a fractional part and an exponent part. This follows C and Python in increasing the exponent by 1 and normalizing the fraction from 0.5 to 1.0 instead of 1.0 to 2.0. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input values. **Returns:** A tuple of two SIMD vectors containing the fractional and exponent parts of the input floating point values. --- ## gamma `gamma[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the Gamma of the input. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The Gamma function evaluated at the input. --- ## gcd `gcd(m: Int, n: Int, /) -> Int` Compute the greatest common divisor of two integers. **Args:** * ​m (`Int`): The first integer. * ​n (`Int`): The second integer. **Returns:** The greatest common divisor of the two integers. `gcd(s: Span[Int, origin], /) -> Int` Computes the greatest common divisor of a span of integers. **Args:** * ​s (`Span[Int, origin]`): A span containing a collection of integers. **Returns:** The greatest common divisor of all the integers in the span. `gcd(l: List[Int, hint_trivial_type], /) -> Int` Computes the greatest common divisor of a list of integers. **Args:** * ​l (`List[Int, hint_trivial_type]`): A list containing a collection of integers. **Returns:** The greatest common divisor of all the integers in the list. `gcd(*values: Int) -> Int` Computes the greatest common divisor of a variadic number of integers. **Args:** * ​\*values (`Int`): A variadic list of integers. **Returns:** The greatest common divisor of the given integers. --- ## hypot `hypot[dtype: DType, width: Int, //](arg0: SIMD[dtype, width], arg1: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `hypot` of the inputs.
**Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​arg0 (`SIMD[dtype, width]`): The first input argument. * ​arg1 (`SIMD[dtype, width]`): The second input argument. **Returns:** The `hypot` of the inputs. --- ## math Defines math utilities. You can import these APIs from the `math` package. For example: ```mojo from math import floor ``` ## Traits * [​`Ceilable`](/mojo/stdlib/math/math/Ceilable): The `Ceilable` trait describes a type that defines a ceiling operation. * [​`CeilDivable`](/mojo/stdlib/math/math/CeilDivable): The `CeilDivable` trait describes a type that defines a ceil division operation. * [​`CeilDivableRaising`](/mojo/stdlib/math/math/CeilDivableRaising): The `CeilDivableRaising` trait describes a type that defines a ceil division operation that can raise. * [​`Floorable`](/mojo/stdlib/math/math/Floorable): The `Floorable` trait describes a type that defines a floor operation. * [​`Truncable`](/mojo/stdlib/math/math/Truncable): The `Truncable` trait describes a type that defines a truncation operation. ## Functions * [​`acos`](/mojo/stdlib/math/math/acos): Computes the `acos` of the inputs. * [​`acosh`](/mojo/stdlib/math/math/acosh): Computes the `acosh` of the inputs. * [​`align_down`](/mojo/stdlib/math/math/align_down): Returns the closest multiple of alignment that is less than or equal to value. * [​`align_up`](/mojo/stdlib/math/math/align_up): Returns the closest multiple of alignment that is greater than or equal to value. * [​`asin`](/mojo/stdlib/math/math/asin): Computes the `asin` of the inputs. * [​`asinh`](/mojo/stdlib/math/math/asinh): Computes the `asinh` of the inputs. * [​`atan`](/mojo/stdlib/math/math/atan): Computes the `atan` of the inputs. * [​`atan2`](/mojo/stdlib/math/math/atan2): Computes the `atan2` of the inputs. * [​`atanh`](/mojo/stdlib/math/math/atanh): Computes the `atanh` of the inputs. * [​`cbrt`](/mojo/stdlib/math/math/cbrt): Computes the `cbrt` of the inputs. * [​`ceil`](/mojo/stdlib/math/math/ceil): Get the ceiling value of the given object. * [​`ceildiv`](/mojo/stdlib/math/math/ceildiv): Return the rounded-up result of dividing numerator by denominator. * [​`clamp`](/mojo/stdlib/math/math/clamp): Clamps the integer value to be in a certain range. * [​`copysign`](/mojo/stdlib/math/math/copysign): Returns a value with the magnitude of the first operand and the sign of the second operand. * [​`cos`](/mojo/stdlib/math/math/cos): Computes the `cos` of the inputs. * [​`cosh`](/mojo/stdlib/math/math/cosh): Computes the `cosh` of the inputs. * [​`erf`](/mojo/stdlib/math/math/erf): Performs the elementwise Erf on a SIMD vector. * [​`erfc`](/mojo/stdlib/math/math/erfc): Computes the `erfc` of the inputs. * [​`exp`](/mojo/stdlib/math/math/exp): Calculates elementwise exponential of the input vector. * [​`exp2`](/mojo/stdlib/math/math/exp2): Computes elementwise 2 raised to the power of n, where n is an element of the input SIMD vector. * [​`expm1`](/mojo/stdlib/math/math/expm1): Computes the `expm1` of the inputs. * [​`factorial`](/mojo/stdlib/math/math/factorial): Computes the factorial of the integer. * [​`floor`](/mojo/stdlib/math/math/floor): Get the floor value of the given object. * [​`fma`](/mojo/stdlib/math/math/fma): Performs `fma` (fused multiply-add) on the inputs.
* [​`frexp`](/mojo/stdlib/math/math/frexp): Breaks floating point values into a fractional part and an exponent part. This follows C and Python in increasing the exponent by 1 and normalizing the fraction from 0.5 to 1.0 instead of 1.0 to 2.0. * [​`gamma`](/mojo/stdlib/math/math/gamma): Computes the Gamma of the input. * [​`gcd`](/mojo/stdlib/math/math/gcd): Compute the greatest common divisor of two integers. * [​`hypot`](/mojo/stdlib/math/math/hypot): Computes the `hypot` of the inputs. * [​`iota`](/mojo/stdlib/math/math/iota): Creates a SIMD vector containing an increasing sequence, starting from offset. * [​`isclose`](/mojo/stdlib/math/math/isclose): Returns a boolean SIMD vector indicating which element pairs of `a` and `b` are equal within a given tolerance. * [​`isqrt`](/mojo/stdlib/math/math/isqrt): Performs elementwise reciprocal square root on a SIMD vector. * [​`j0`](/mojo/stdlib/math/math/j0): Computes the Bessel function of the first kind of order 0 for each input value. * [​`j1`](/mojo/stdlib/math/math/j1): Computes the Bessel function of the first kind of order 1 for each input value. * [​`lcm`](/mojo/stdlib/math/math/lcm): Computes the least common multiple of two integers. * [​`ldexp`](/mojo/stdlib/math/math/ldexp): Computes elementwise ldexp function. * [​`lgamma`](/mojo/stdlib/math/math/lgamma): Computes the `lgamma` of the inputs. * [​`log`](/mojo/stdlib/math/math/log): Performs elementwise natural log (base E) of a SIMD vector. * [​`log10`](/mojo/stdlib/math/math/log10): Computes the `log10` of the inputs. * [​`log1p`](/mojo/stdlib/math/math/log1p): Computes the `log1p` of the inputs. * [​`log2`](/mojo/stdlib/math/math/log2): Performs elementwise log (base 2) of a SIMD vector. * [​`logb`](/mojo/stdlib/math/math/logb): Computes the `logb` of the inputs. * [​`modf`](/mojo/stdlib/math/math/modf): Computes the integral and fractional part of the value. * [​`recip`](/mojo/stdlib/math/math/recip): Performs elementwise reciprocal on a SIMD vector. * [​`remainder`](/mojo/stdlib/math/math/remainder): Computes the `remainder` of the inputs. * [​`scalb`](/mojo/stdlib/math/math/scalb): Computes the `scalb` of the inputs. * [​`sin`](/mojo/stdlib/math/math/sin): Computes the `sin` of the inputs. * [​`sinh`](/mojo/stdlib/math/math/sinh): Computes the `sinh` of the inputs. * [​`sqrt`](/mojo/stdlib/math/math/sqrt): Performs square root on an integer. * [​`tan`](/mojo/stdlib/math/math/tan): Computes the `tan` of the inputs. * [​`tanh`](/mojo/stdlib/math/math/tanh): Performs elementwise evaluation of the tanh function. * [​`trunc`](/mojo/stdlib/math/math/trunc): Get the truncated value of the given object. * [​`ulp`](/mojo/stdlib/math/math/ulp): Computes the ULP (units of last place, also known as units of least precision) of the number. * [​`y0`](/mojo/stdlib/math/math/y0): Computes the Bessel function of the second kind of order 0 for each input value. * [​`y1`](/mojo/stdlib/math/math/y1): Computes the Bessel function of the second kind of order 1 for each input value. --- ## iota `iota[dtype: DType, width: Int](offset: SIMD[dtype, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> SIMD[dtype, width]` Creates a SIMD vector containing an increasing sequence, starting from offset. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​offset (`SIMD[dtype, 1]`): The value to start the sequence at. Default is zero. **Returns:** An increasing sequence of values, starting from offset.
`iota[dtype: DType, //](buff: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], len: Int, offset: Int = 0)` Fill the buffer with numbers ranging from offset to offset + len - 1, spaced by 1. The function doesn't return anything; the buffer is updated in place. **Parameters:** * ​dtype (`DType`): DType of the underlying data. **Args:** * ​buff (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The buffer to fill. * ​len (`Int`): The length of the buffer to fill. * ​offset (`Int`): The value to fill at index 0. `iota[dtype: DType, //](mut v: List[SIMD[dtype, 1], hint_trivial_type], offset: Int = 0)` Fill a list with consecutive numbers starting from the specified offset. **Parameters:** * ​dtype (`DType`): DType of the underlying data. **Args:** * ​v (`List[SIMD[dtype, 1], hint_trivial_type]`): The list to fill with numbers. * ​offset (`Int`): The starting value to fill at index 0. `iota(mut v: List[Int, hint_trivial_type], offset: Int = 0)` Fill a list with consecutive numbers starting from the specified offset. **Args:** * ​v (`List[Int, hint_trivial_type]`): The list to fill with numbers. * ​offset (`Int`): The starting value to fill at index 0. --- ## isclose `isclose[dtype: DType, width: Int, *, symmetrical: Bool = True](a: SIMD[dtype, width], b: SIMD[dtype, width], *, atol: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1.0E-8), rtol: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1.0000000000000001E-5), equal_nan: Bool = False) -> SIMD[bool, width]` Returns a boolean SIMD vector indicating which element pairs of `a` and `b` are equal within a given tolerance. For floating-point dtypes, the following criteria apply: * Symmetric (Python `math.isclose` style), when `symmetrical` is true: ``` |a - b| ≤ max(atol, rtol * max(|a|, |b|)) ``` * Asymmetric (NumPy style), when `symmetrical` is false: ``` |a - b| ≤ atol + rtol * |b| ``` NaN values are considered equal only if `equal_nan` is true. **Parameters:** * ​dtype (`DType`): Element type of the input and output vectors. * ​width (`Int`): Number of lanes in each SIMD vector. * ​symmetrical (`Bool`): If true, use the symmetric comparison formula (default: true). **Args:** * ​a (`SIMD[dtype, width]`): First input vector. * ​b (`SIMD[dtype, width]`): Second input vector. * ​atol (`SIMD[float64, 1]`): Absolute tolerance. * ​rtol (`SIMD[float64, 1]`): Relative tolerance. * ​equal\_nan (`Bool`): If true, treat NaNs as equal (default: false). **Returns:** A boolean vector that is true where `a` and `b` are equal within the given tolerance. --- ## isqrt `isqrt[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise reciprocal square root on a SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector to perform reciprocal square root on. **Returns:** The elementwise reciprocal square root of x. --- ## j0 `j0[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the Bessel function of the first kind of order 0 for each input value. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input vector.
**Returns:** A vector containing the computed value for each value in the input. --- ## j1 `j1[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the Bessel function of the first kind of order 1 for each input value. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input vector. **Returns:** A vector containing the computed value for each value in the input. --- ## lcm `lcm(m: Int, n: Int, /) -> Int` Computes the least common multiple of two integers. **Args:** * ​m (`Int`): The first integer. * ​n (`Int`): The second integer. **Returns:** The least common multiple of the two integers. `lcm(s: Span[Int, origin], /) -> Int` Computes the least common multiple of a span of integers. **Args:** * ​s (`Span[Int, origin]`): A span of integers. **Returns:** The least common multiple of all the integers in the span. `lcm(l: List[Int, hint_trivial_type], /) -> Int` Computes the least common multiple of a list of integers. **Args:** * ​l (`List[Int, hint_trivial_type]`): A list of integers. **Returns:** The least common multiple of all the integers in the list. `lcm(*values: Int) -> Int` Computes the least common multiple of a variadic list of integers. **Args:** * ​\*values (`Int`): A variadic list of integers. **Returns:** The least common multiple of the given integers. --- ## ldexp `ldexp[dtype: DType, width: Int, //](x: SIMD[dtype, width], exp: SIMD[int32, width]) -> SIMD[dtype, width]` Computes the elementwise ldexp function. The ldexp function multiplies a floating point value x by the number 2 raised to the exp power. That is, $ldexp(x, exp)$ calculates the value of $x * 2^{exp}$, and is used within the $erf$ function. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector of floating point values. * ​exp (`SIMD[int32, width]`): SIMD vector containing the exponents. **Returns:** Vector containing elementwise result of ldexp on x and exp. --- ## lgamma `lgamma[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `lgamma` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `lgamma` of the input. --- ## log `log[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise natural log (base E) of a SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): Vector to perform logarithm operation on. **Returns:** Vector containing result of performing natural log base E on x. --- ## log10 `log10[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `log10` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `log10` of the input.
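The logarithm and exponential functions above all share the same elementwise SIMD shape; a small sketch of how they compose (the values and vector width are arbitrary, chosen only for illustration):

```mojo
from math import exp, log, log2, log10

def main():
    var v = SIMD[DType.float32, 4](1.0, 2.0, 4.0, 10.0)
    # log is the inverse of exp, up to floating-point error.
    print(log(exp(v)))  # approximately [1.0, 2.0, 4.0, 10.0]
    print(log2(v))      # [0.0, 1.0, 2.0, ~3.3219]
    print(log10(v))     # [0.0, ~0.3010, ~0.6021, 1.0]
```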
--- ## log1p `log1p[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `log1p` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `log1p` of the input. --- ## log2 `log2[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise log (base 2) of a SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): Vector to perform logarithm operation on. **Returns:** Vector containing result of performing log base 2 on x. --- ## logb `logb[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `logb` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `logb` of the input. --- ## modf `modf[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> Tuple[SIMD[dtype, width], SIMD[dtype, width]]` Computes the integral and fractional part of the value. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input value. **Returns:** A tuple containing the integral and fractional part of the value. --- ## recip `recip[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise reciprocal on a SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector to perform reciprocal on. **Returns:** The elementwise reciprocal of x. --- ## remainder `remainder[dtype: DType, width: Int, //](x: SIMD[dtype, width], y: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `remainder` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The first input argument. * ​y (`SIMD[dtype, width]`): The second input argument. **Returns:** The `remainder` of the inputs. --- ## scalb `scalb[dtype: DType, width: Int, //](arg0: SIMD[dtype, width], arg1: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `scalb` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​arg0 (`SIMD[dtype, width]`): The first input argument. * ​arg1 (`SIMD[dtype, width]`): The second input argument. **Returns:** The `scalb` of the inputs. --- ## sin `sin[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `sin` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. 
* ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `sin` of the input. --- ## sinh `sinh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `sinh` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `sinh` of the input. --- ## sqrt `sqrt(x: Int) -> Int` Performs square root on an integer. **Args:** * ​x (`Int`): The integer value to perform square root on. **Returns:** The square root of x. `sqrt[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise square root on the elements of a SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector to perform square root on. **Returns:** The elementwise square root of x. --- ## tan `tan[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `tan` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `tan` of the input. --- ## tanh `tanh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise evaluation of the tanh function. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The vector to perform the elementwise tanh on. **Returns:** The result of the elementwise tanh operation. --- ## trunc `trunc[T: Truncable, //](value: T) -> T` Get the truncated value of the given object. **Parameters:** * ​T (`Truncable`): The type conforming to `Truncable`. **Args:** * ​value (`T`): The object to get the truncated value of. **Returns:** The truncated value of the object. --- ## ulp `ulp[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the ULP (units of last place, also known as units of least precision) of the number. **Constraints:** The element type of the input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector input. **Returns:** The ULP of x. --- ## y0 `y0[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the Bessel function of the second kind of order 0 for each input value. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input vector. **Returns:** A vector containing the computed value for each value in the input. --- ## y1 `y1[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the Bessel function of the second kind of order 1 for each input value.
**Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input vector. **Returns:** A vector containing the computed value for each value in the input. --- ## polynomial Provides two implementations for evaluating polynomials. You can import these APIs from the `math` package. For example: ```mojo from math.polynomial import polynomial_evaluate ``` ## Functions * [​`polynomial_evaluate`](/mojo/stdlib/math/polynomial/polynomial_evaluate): Evaluates the polynomial. --- ## polynomial_evaluate `polynomial_evaluate[: Bool, dtype: DType, width: Int, //, coefficients: List[SIMD[dtype, 1], $0]](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Evaluates the polynomial. **Parameters:** * ​dtype (`DType`): The dtype of the value. * ​width (`Int`): The width of the computed value. * ​coefficients (`List[SIMD[dtype, 1], $0]`): The coefficients. **Args:** * ​x (`SIMD[dtype, width]`): The value to compute the polynomial with. **Returns:** The polynomial evaluation results using the specified value and the constant coefficients. --- ## ArcPointer `@register_passable` `struct ArcPointer[T: Movable]` Atomic reference-counted pointer. This smart pointer owns an instance of `T` indirectly managed on the heap. This pointer is copyable, including across threads, maintaining a reference count to the underlying data. When you initialize an `ArcPointer` with a value, it allocates memory and moves the value into the allocated memory. Copying an instance of an `ArcPointer` increments the reference count. Destroying an instance decrements the reference count. When the reference count reaches zero, `ArcPointer` destroys the value and frees its memory. This pointer itself is thread-safe using atomic accesses to reference count the underlying data, but references returned to the underlying data are not thread-safe. Subscripting an `ArcPointer` (`ptr[]`) returns a mutable reference to the stored value. This is the only safe way to access the stored value. Other methods, such as using the `unsafe_ptr()` method to retrieve an unsafe pointer to the stored value, or accessing the private fields of an `ArcPointer`, are unsafe and may result in memory errors. For a comparison with other pointer types, see [Intro to pointers](/mojo/manual/pointers/) in the Mojo Manual. Examples: ```mojo from memory import ArcPointer var p = ArcPointer(4) var p2 = p p2[] = 3 print(3 == p[]) ``` ## Parameters * ​T (`Movable`): The type of the stored value. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Identifiable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(owned value: T) -> Self` Construct a new thread-safe, reference-counted smart pointer, and move the value into heap memory managed by the new pointer. **Args:** * ​value (`T`): The value to manage. ### `__copyinit__` `__copyinit__(existing: Self) -> Self` Copy an existing reference. Increment the refcount to the object. **Args:** * ​existing (`Self`): The existing reference. ### `__del__` `__del__(owned self)` Delete the smart pointer. Decrement the reference count for the stored value. If there are no more references, delete the object and free its memory. ### `__getitem__` `__getitem__[self_life: ImmutableOrigin](ref [self_life] self) -> ref [self_life] T` Returns a mutable reference to the managed value.
**Parameters:** * ​self\_life (`ImmutableOrigin`): The origin of self. **Returns:** A reference to the managed value. ### `__is__` `__is__(self, rhs: Self) -> Bool` Returns True if the two `ArcPointer` instances point at the same object. **Args:** * ​rhs (`Self`): The other `ArcPointer`. **Returns:** True if the two `ArcPointers` instances point at the same object and False otherwise. ### `__isnot__` `__isnot__(self, rhs: Self) -> Bool` Returns True if the two `ArcPointer` instances point at different objects. **Args:** * ​rhs (`Self`): The other `ArcPointer`. **Returns:** True if the two `ArcPointer` instances point at different objects and False otherwise. ### `copy` `copy(self) -> Self` Copy the object. **Returns:** A copy of the value. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[T]` Retrieves a pointer to the underlying memory. **Returns:** The `UnsafePointer` to the pointee. ### `count` `count(self) -> SIMD[uint64, 1]` Count the number of current references. **Returns:** The current number of references to the pointee. --- ## arc Reference-counted smart pointers. You can import these APIs from the `memory` package. For example: ```mojo from memory import ArcPointer ``` ## Structs * [​`ArcPointer`](/mojo/stdlib/memory/arc/ArcPointer): Atomic reference-counted pointer. --- ## memory The memory package provides several pointer types, as well as utility functions for dealing with memory. ## Modules * [​`arc`](/mojo/stdlib/memory/arc/): Reference-counted smart pointers. * [​`maybe_uninitialized`](/mojo/stdlib/memory/maybe_uninitialized/): * [​`memory`](/mojo/stdlib/memory/memory/): Defines functions for memory manipulations. * [​`owned_pointer`](/mojo/stdlib/memory/owned_pointer/): Implements `OwnedPointer`, a safe, single-ownership smart pointer. * [​`pointer`](/mojo/stdlib/memory/pointer/): Implements the Pointer type. * [​`span`](/mojo/stdlib/memory/span/): Implements the `Span` type. * [​`unsafe`](/mojo/stdlib/memory/unsafe/): Provides utility functions for unsafe manipulation of SIMD values. * [​`unsafe_pointer`](/mojo/stdlib/memory/unsafe_pointer/): Implement a generic unsafe pointer type. --- ## UnsafeMaybeUninitialized `struct UnsafeMaybeUninitialized[ElementType: AnyType]` A memory location that may or may not be initialized. Note that the destructor is a no-op. If the memory was initialized, the caller is responsible for calling `assume_initialized_destroy` before the memory is deallocated. Every method in this struct is unsafe and the caller must know at all times if the memory is initialized or not. Calling a method that assumes the memory is initialized when it is not will result in undefined behavior. ## Parameters * ​ElementType (`AnyType`): The type of the element to store. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `type` `alias type = array<1, ElementType>` ## Methods ### `__init__` `__init__(out self)` The memory is now considered uninitialized. `__init__[MovableType: Movable](out self: UnsafeMaybeUninitialized[MovableType], owned value: MovableType)` The memory is now considered initialized. **Parameters:** * ​MovableType (`Movable`): The type of the element to store. **Args:** * ​value (`MovableType`): The value to initialize the memory with. ### `__copyinit__` `__copyinit__(out self, other: Self)` Copy another object. This method should never be called as implicit copy should not be done on memory that may be uninitialized. Trying to call this method will abort.
If you wish to perform a copy, you should manually call the method `copy_from` instead. **Args:** * ​other (`Self`): The object to copy. ### `__moveinit__` `__moveinit__(out self, owned other: Self)` Move another object. This method should never be called as implicit moves should not be done on memory that may be uninitialized. Trying to call this method will abort. If you wish to perform a move, you should manually call the method `move_from` instead. **Args:** * ​other (`Self`): The object to move. ### `__del__` `__del__(owned self)` This is a no-op. Calling this method assumes that the memory is uninitialized. If the memory was initialized, the caller should use `assume_initialized_destroy` before. ### `copy_from` `copy_from[CopyableType: ExplicitlyCopyable](mut self: UnsafeMaybeUninitialized[CopyableType], other: UnsafeMaybeUninitialized[CopyableType])` Copy another object. This function assumes that the current memory is uninitialized and the other object is initialized memory. **Parameters:** * ​CopyableType (`ExplicitlyCopyable`): The type object to copy. **Args:** * ​other (`UnsafeMaybeUninitialized[CopyableType]`): The object to copy. `copy_from[CopyableType: ExplicitlyCopyable](mut self: UnsafeMaybeUninitialized[CopyableType], other: CopyableType)` Copy another object. This function assumes that the current memory is uninitialized. **Parameters:** * ​CopyableType (`ExplicitlyCopyable`): The type object to copy. **Args:** * ​other (`CopyableType`): The object to copy. ### `move_from` `move_from[MovableType: Movable](mut self: UnsafeMaybeUninitialized[MovableType], mut other: UnsafeMaybeUninitialized[MovableType])` Move another object. This function assumes that the current memory is uninitialized and the other object is initialized memory. After the function is called, the other object is considered uninitialized. **Parameters:** * ​MovableType (`Movable`): The type object to move. **Args:** * ​other (`UnsafeMaybeUninitialized[MovableType]`): The object to move. `move_from[MovableType: Movable](mut self: UnsafeMaybeUninitialized[MovableType], other: UnsafePointer[MovableType])` Move another object. This function assumes that the current memory is uninitialized and the other object is initialized memory. After the function is called, the `other` object is considered uninitialized. **Parameters:** * ​MovableType (`Movable`): The type object to move. **Args:** * ​other (`UnsafePointer[MovableType]`): The pointer to the object to move. ### `write` `write[MovableType: Movable](mut self: UnsafeMaybeUninitialized[MovableType], owned value: MovableType)` Write a value into an uninitialized memory location. Calling this method assumes that the memory is uninitialized. **Parameters:** * ​MovableType (`Movable`): The type of the element to store. **Args:** * ​value (`MovableType`): The value to write. ### `assume_initialized` `assume_initialized(ref self) -> ref [self] ElementType` Returns a reference to the internal value. Calling this method assumes that the memory is initialized. **Returns:** A reference to the internal value. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[ElementType]` Get a pointer to the underlying element. Note that this method makes no assumptions about whether the memory is initialized. It can always be called. **Returns:** A pointer to the underlying element. ### `assume_initialized_destroy` `assume_initialized_destroy(mut self)` Runs the destructor of the internal value. Calling this method assumes that the memory is initialized.
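Because every method on `UnsafeMaybeUninitialized` assumes the caller tracks initialization state, the typical lifecycle is: write a value, read it, then destroy it explicitly. A minimal sketch (importing from the `memory.maybe_uninitialized` module listed below; the `main` entry point is added for illustration):

```mojo
from memory.maybe_uninitialized import UnsafeMaybeUninitialized

def main():
    # The storage starts out uninitialized.
    var slot = UnsafeMaybeUninitialized[Int]()
    # write() makes the memory initialized from the caller's perspective.
    slot.write(42)
    # Reading is only valid while the memory is initialized.
    print(slot.assume_initialized())  # 42
    # The destructor is a no-op, so the value must be destroyed explicitly.
    slot.assume_initialized_destroy()
```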
--- ## maybe_uninitialized ## Structs * [​`UnsafeMaybeUninitialized`](/mojo/stdlib/memory/maybe_uninitialized/UnsafeMaybeUninitialized): A memory location that may or may not be initialized. --- ## memory Defines functions for memory manipulations. You can import these APIs from the `memory` package. For example: ```mojo from memory import memcmp ``` ## Functions * [​`memcmp`](/mojo/stdlib/memory/memory/memcmp): Compares two buffers. Both buffers are assumed to be of the same length. * [​`memcpy`](/mojo/stdlib/memory/memory/memcpy): Copies a memory area. * [​`memset`](/mojo/stdlib/memory/memory/memset): Fills memory with the given value. * [​`memset_zero`](/mojo/stdlib/memory/memory/memset_zero): Fills memory with zeros. * [​`stack_allocation`](/mojo/stdlib/memory/memory/stack_allocation): Allocates data buffer space on the stack given a data type and number of elements. --- ## memcmp `memcmp[type: AnyType, address_space: AddressSpace](s1: UnsafePointer[type, address_space=address_space], s2: UnsafePointer[type, address_space=address_space], count: Int) -> Int` Compares two buffers. Both buffers are assumed to be of the same length. **Parameters:** * ​type (`AnyType`): The element type. * ​address\_space (`AddressSpace`): The address space of the pointer. **Args:** * ​s1 (`UnsafePointer[type, address_space=address_space]`): The first buffer address. * ​s2 (`UnsafePointer[type, address_space=address_space]`): The second buffer address. * ​count (`Int`): The number of elements in the buffers. **Returns:** Returns 0 if the byte strings are identical, 1 if s1 > s2, and -1 if s1 < s2. --- ## memcpy `memcpy[T: AnyType](dest: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], src: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], count: Int)` Copies a memory area. **Parameters:** * ​T (`AnyType`): The element type. **Args:** * ​dest (`UnsafePointer[T, alignment=alignment, mut=mut, origin=origin]`): The destination pointer. * ​src (`UnsafePointer[T, alignment=alignment, mut=mut, origin=origin]`): The source pointer. * ​count (`Int`): The number of elements to copy. --- ## memset `memset[type: AnyType, address_space: AddressSpace](ptr: UnsafePointer[type, address_space=address_space], value: SIMD[uint8, 1], count: Int)` Fills memory with the given value. **Parameters:** * ​type (`AnyType`): The element type. * ​address\_space (`AddressSpace`): The address space of the pointer. **Args:** * ​ptr (`UnsafePointer[type, address_space=address_space]`): UnsafePointer to the beginning of the memory block to fill. * ​value (`SIMD[uint8, 1]`): The value to fill with. * ​count (`Int`): Number of elements to fill (in elements, not bytes). --- ## memset_zero `memset_zero[type: AnyType, address_space: AddressSpace, //](ptr: UnsafePointer[type, address_space=address_space], count: Int)` Fills memory with zeros. **Parameters:** * ​type (`AnyType`): The element type. * ​address\_space (`AddressSpace`): The address space of the pointer. **Args:** * ​ptr (`UnsafePointer[type, address_space=address_space]`): UnsafePointer to the beginning of the memory block to fill. * ​count (`Int`): Number of elements to fill (in elements, not bytes). `memset_zero[dtype: DType, address_space: AddressSpace, //, *, count: Int](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space])` Fills memory with zeros. **Parameters:** * ​dtype (`DType`): The element type. * ​address\_space (`AddressSpace`): The address space of the pointer. * ​count (`Int`): Number of elements to fill (in elements, not bytes).
**Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space]`): UnsafePointer to the beginning of the memory block to fill. --- ## stack_allocation `stack_allocation[count: Int, dtype: DType, /, alignment: Int = alignof[::DType,__mlir_type.!kgen.target]() if is_gpu() else 1, address_space: AddressSpace = AddressSpace(0)]() -> UnsafePointer[SIMD[dtype, 1], address_space=address_space]` Allocates data buffer space on the stack given a data type and number of elements. **Parameters:** * ​count (`Int`): Number of elements to allocate memory for. * ​dtype (`DType`): The data type of each element. * ​alignment (`Int`): Address alignment of the allocated data. * ​address\_space (`AddressSpace`): The address space of the pointer. **Returns:** A data pointer of the given type pointing to the allocated space. `stack_allocation[count: Int, type: AnyType, /, name: Optional[StringSlice[StaticConstantOrigin]] = Optional(None), alignment: Int = alignof[::AnyType,__mlir_type.!kgen.target]() if is_gpu() else 1, address_space: AddressSpace = AddressSpace(0)]() -> UnsafePointer[type, address_space=address_space]` Allocates data buffer space on the stack given a data type and number of elements. **Parameters:** * ​count (`Int`): Number of elements to allocate memory for. * ​type (`AnyType`): The data type of each element. * ​name (`Optional[StringSlice[StaticConstantOrigin]]`): The name of the global variable (only honored in certain cases). * ​alignment (`Int`): Address alignment of the allocated data. * ​address\_space (`AddressSpace`): The address space of the pointer. **Returns:** A data pointer of the given type pointing to the allocated space. --- ## OwnedPointer `@register_passable` `struct OwnedPointer[T: AnyType]` A safe, owning, smart pointer. This smart pointer is designed for cases where there is clear ownership of the underlying data, and restricts access to it through the origin system such that no more than one mutable alias for the underlying data may exist. For a comparison with other pointer types, see [Intro to pointers](/mojo/manual/pointers/) in the Mojo Manual. ## Parameters * ​T (`AnyType`): The type to be stored in the `OwnedPointer`. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__[T: Movable](owned value: T) -> OwnedPointer[T]` Construct a new `OwnedPointer` by moving the passed value into a new backing allocation. **Parameters:** * ​T (`Movable`): The type of the data to store. It is restricted to `Movable` here to allow efficient move construction. **Args:** * ​value (`T`): The value to move into the `OwnedPointer`. `__init__[T: ExplicitlyCopyable](*, copy_value: T) -> OwnedPointer[T]` Construct a new `OwnedPointer` by explicitly copying the passed value into a new backing allocation. **Parameters:** * ​T (`ExplicitlyCopyable`): The type of the data to store, which must be `ExplicitlyCopyable`. **Args:** * ​copy\_value (`T`): The value to explicitly copy into the `OwnedPointer`. `__init__[T: Copyable, U: NoneType = NoneType(None)](value: T) -> OwnedPointer[T]` Construct a new `OwnedPointer` by copying the passed value into a new backing allocation. **Parameters:** * ​T (`Copyable`): The type of the data to store. * ​U (`NoneType`): A dummy type parameter, to lower the selection priority of this ctor. **Args:** * ​value (`T`): The value to copy into the `OwnedPointer`. 
`__init__[T: ExplicitlyCopyable](*, other: OwnedPointer[T]) -> OwnedPointer[T]` Construct a new `OwnedPointer` by explicitly copying the value from another `OwnedPointer`. **Parameters:** * ​T (`ExplicitlyCopyable`): The type of the data to store. **Args:** * ​other (`OwnedPointer[T]`): The `OwnedPointer` to copy. ### `__del__` `__del__(owned self)` Destroy the `OwnedPointer`. ### `__getitem__` `__getitem__(ref self) -> ref [self] T` Returns a reference to the pointer's underlying data with parametric mutability. **Returns:** A reference to the data underlying the `OwnedPointer`. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[T]` UNSAFE: returns the backing pointer for this `OwnedPointer`. **Returns:** An UnsafePointer to the backing allocation for this `OwnedPointer`. ### `take` `take[T: Movable](owned self: OwnedPointer[T]) -> T` Move the value within the `OwnedPointer` out of it, consuming the `OwnedPointer` in the process. **Parameters:** * ​T (`Movable`): The type of the data backing this `OwnedPointer`. `take()` only exists for `T: Movable` since this consuming operation only makes sense for types that you want to avoid copying. For types that are `Copyable` or `ExplicitlyCopyable` but are not `Movable`, you can copy them through `__getitem__` as in `var v = some_ptr_var[]`. **Returns:** The data that is (was) backing the `OwnedPointer`. ### `steal_data` `steal_data(owned self) -> UnsafePointer[T]` Take ownership over the heap allocated pointer backing this `OwnedPointer`. **Safety:** This function is not unsafe to call, as a memory leak is not considered unsafe. However, to avoid a memory leak, callers should ensure that the returned pointer is eventually deinitialized and deallocated. Failure to do so will leak memory. **Returns:** The pointer owned by this instance. --- ## owned_pointer Implements `OwnedPointer`, a safe, single-ownership smart pointer. You can import these APIs from the `memory` package. For example: ```mojo from memory import OwnedPointer ``` ## Structs * [​`OwnedPointer`](/mojo/stdlib/memory/owned_pointer/OwnedPointer): A safe, owning, smart pointer. --- ## AddressSpace `@register_passable(trivial)` `struct AddressSpace` Address space of the pointer. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Intable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `GENERIC` `alias GENERIC = AddressSpace(0)` Generic address space. ## Methods ### `__init__` `__init__(value: Int) -> Self` Initializes the address space from the underlying integral value. **Args:** * ​value (`Int`): The address space value. `__init__(value: _GPUAddressSpace) -> Self` Initializes the address space from the underlying integral value. **Args:** * ​value (`_GPUAddressSpace`): The address space value. ### `__eq__` `__eq__(self, other: Self) -> Bool` True if the two address spaces are equal and False otherwise. **Args:** * ​other (`Self`): The other address space value. **Returns:** True if the two address spaces are equal and False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` True if the two address spaces are not equal and False otherwise. **Args:** * ​other (`Self`): The other address space value. **Returns:** True if the two address spaces are not equal and False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` True if the two address spaces are equal and False otherwise. **Args:** * ​other (`Self`): The other address space value. **Returns:** True if the two address spaces are equal and False otherwise.
### `__isnot__` `__isnot__(self, other: Self) -> Bool` True if the two address spaces are not equal and False otherwise. **Args:** * ​other (`Self`): The other address space value. **Returns:** True if the two address spaces are not equal and False otherwise. ### `value` `value(self) -> Int` The integral value of the address space. **Returns:** The integral value of the address space. ### `__int__` `__int__(self) -> Int` The integral value of the address space. **Returns:** The integral value of the address space. ### `__index__` `__index__(self) -> index` Convert to index. **Returns:** The corresponding `__mlir_type.index` value. ### `__str__` `__str__(self) -> String` Gets a string representation of the AddressSpace. **Returns:** The string representation of the AddressSpace. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats the address space to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. --- ## Pointer `@register_passable(trivial)` `struct Pointer[mut: Bool, //, type: AnyType, origin: Origin[mut], address_space: AddressSpace = AddressSpace(0)]` Defines a non-nullable safe pointer. For a comparison with other pointer types, see [Intro to pointers](/mojo/manual/pointers/) in the Mojo Manual. ## Parameters * ​mut (`Bool`): Whether the pointee data may be mutated through this pointer. * ​type (`AnyType`): Type of the underlying data. * ​origin (`Origin[mut]`): The origin of the pointer. * ​address\_space (`AddressSpace`): The address space of the pointee data. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Stringable`, `UnknownDestructibility` ## Aliases ### `Immutable` `alias Immutable = Pointer[type, (muttoimm origin._mlir_origin), address_space]` The immutable version of the `Pointer`. ### `Mutable` `alias Mutable = Pointer[type, (mutcast origin._mlir_origin), address_space]` The mutable version of the `Pointer`. ## Methods ### `__init__` `__init__(*, ref [origin, address_space] to: type) -> Self` Constructs a Pointer from a reference to a value. **Args:** * ​to (`type`): The value to construct a pointer to. ### `__getitem__` `__getitem__(self) -> ref [origin, address_space] type` Enable subscript syntax `ptr[]` to access the element. **Returns:** A reference to the underlying value in memory. ### `__eq__` `__eq__(self, rhs: Pointer[type, origin, address_space]) -> Bool` Returns True if the two pointers are equal. **Args:** * ​rhs (`Pointer[type, origin, address_space]`): The value of the other pointer. **Returns:** True if the two pointers are equal and False otherwise. ### `__ne__` `__ne__(self, rhs: Pointer[type, origin, address_space]) -> Bool` Returns True if the two pointers are not equal. **Args:** * ​rhs (`Pointer[type, origin, address_space]`): The value of the other pointer. **Returns:** True if the two pointers are not equal and False otherwise. ### `address_of` `static address_of(ref [origin, address_space] value: type) -> Self` Constructs a Pointer from a reference to a value. **Args:** * ​value (`type`): The value to get the address of. **Returns:** The result Pointer. ### `copy` `copy(self) -> Self` Constructs a copy from another Pointer. Note that this does **not** copy the underlying data. **Returns:** A copy of the value. ### `get_immutable` `get_immutable(self) -> Pointer[type, (muttoimm origin._mlir_origin), address_space]` Constructs a new Pointer with the same underlying target and an ImmutableOrigin.
Notes: This does **not** copy the underlying data. **Returns:** A new Pointer with the same target as self and an ImmutableOrigin. ### `__str__` `__str__(self) -> String` Gets a string representation of the Pointer. **Returns:** The string representation of the Pointer. ### `__merge_with__` `__merge_with__[: Bool, : Origin[$0], //, other_type: AnyStruct[Pointer[type, $1, address_space]]](self) -> Pointer[type, origin, address_space]` Returns a pointer merged with the specified `other_type`. **Parameters:** * ​other\_type (`AnyStruct[Pointer[type, $1, address_space]]`): The type of the pointer to merge with. **Returns:** A pointer merged with the specified `other_type`. --- ## pointer Implements the Pointer type. You can import these APIs from the `memory` package. For example: ```mojo from memory import Pointer ``` ## Structs * [​`AddressSpace`](/mojo/stdlib/memory/pointer/AddressSpace): Address space of the pointer. * [​`Pointer`](/mojo/stdlib/memory/pointer/Pointer): Defines a non-nullable safe pointer. --- ## Span `@register_passable(trivial)` `struct Span[mut: Bool, //, T: Copyable & Movable, origin: Origin[mut], *, address_space: AddressSpace = AddressSpace(0), alignment: Int = _default_alignment[::AnyType]()]` A non-owning view of contiguous data. ## Parameters * ​mut (`Bool`): Whether the span is mutable. * ​T (`Copyable & Movable`): The type of the elements in the span. * ​origin (`Origin[mut]`): The origin of the Span. * ​address\_space (`AddressSpace`): The address space associated with the allocated memory. * ​alignment (`Int`): The minimum alignment of the underlying pointer known statically. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Aliases ### `Immutable` `alias Immutable = Span[T, (muttoimm origin._mlir_origin)]` The immutable version of the `Span`. ### `Mutable` `alias Mutable = Span[T, (mutcast origin._mlir_origin)]` The mutable version of the `Span`. ## Methods ### `__init__` `__init__() -> Self` Create an empty / zero-length span. `__init__(*, ptr: UnsafePointer[T, address_space=address_space, alignment=alignment, mut=mut, origin=origin], length: UInt) -> Self` Unsafe construction from a pointer and length. **Args:** * ​ptr (`UnsafePointer[T, address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The underlying pointer of the span. * ​length (`UInt`): The length of the view. `@implicit` `__init__(ref [origin, address_space] list: List[T, hint_trivial_type]) -> Self` Construct a `Span` from a `List`. **Args:** * ​list (`List[T, hint_trivial_type]`): The list to which the span refers. `@implicit` `__init__[size: Int, //](ref [origin] array: InlineArray[T, size]) -> Self` Construct a `Span` from an `InlineArray`. **Parameters:** * ​size (`Int`): The size of the `InlineArray`. **Args:** * ​array (`InlineArray[T, size]`): The array to which the span refers. ### `__bool__` `__bool__(self) -> Bool` Check if a span is non-empty. **Returns:** True if a span is non-empty, False otherwise. ### `__getitem__` `__getitem__[I: Indexer](self, idx: I) -> ref [origin, address_space] T` Get a reference to an element in the span. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The index of the value to return. **Returns:** An element reference. `__getitem__(self, slc: Slice) -> Self` Get a new span from a slice of the current span. 
Allocation: This function allocates when the step is negative; to avoid a memory leak, take ownership of the returned value. **Args:** * ​slc (`Slice`): The slice specifying the range of the new subslice. **Returns:** A new span that points to the same data as the current span. ### `__eq__` `__eq__[T: EqualityComparable & Copyable & Movable, rhs_alignment: Int, //](self: Span[T, origin, alignment=alignment], rhs: Span[T, origin, alignment=rhs_alignment]) -> Bool` Verify if the span is equal to another span. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the span. Must implement the traits `EqualityComparable`, `Copyable` and `Movable`. * ​rhs\_alignment (`Int`): The inferred alignment of the rhs span. **Args:** * ​rhs (`Span[T, origin, alignment=rhs_alignment]`): The span to compare against. **Returns:** True if the spans are equal in length and contain the same elements, False otherwise. ### `__ne__` `__ne__[T: EqualityComparable & Copyable & Movable, //](self: Span[T, origin, alignment=alignment], rhs: Span[T, origin]) -> Bool` Verify if the span is not equal to another span. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the span. Must implement the traits `EqualityComparable`, `Copyable` and `Movable`. **Args:** * ​rhs (`Span[T, origin]`): The span to compare against. **Returns:** True if the spans are not equal in length or contents, False otherwise. ### `__contains__` `__contains__[dtype: DType, //](self: Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment], value: SIMD[dtype, 1]) -> Bool` Verify if a given value is present in the Span. **Parameters:** * ​dtype (`DType`): The DType of the scalars stored in the Span. **Args:** * ​value (`SIMD[dtype, 1]`): The value to find. **Returns:** True if the value is contained in the span, False otherwise. ### `copy` `copy(self) -> Self` Explicitly construct a copy of the provided `Span`. **Returns:** A copy of the `Span`. ### `__iter__` `__iter__(self) -> _SpanIter[T, origin, address_space=address_space, alignment=alignment]` Get an iterator over the elements of the `Span`. **Returns:** An iterator over the elements of the `Span`. ### `__reversed__` `__reversed__(self) -> _SpanIter[T, origin, False, address_space, alignment]` Iterate backwards over the `Span`. **Returns:** A reversed iterator of the `Span` elements. ### `__len__` `__len__(self) -> Int` Returns the length of the span. This is a known constant value. **Returns:** The size of the span. ### `get_immutable` `get_immutable(self) -> Span[T, (muttoimm origin._mlir_origin)]` Return an immutable version of this `Span`. **Returns:** An immutable version of the same `Span`. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[T, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Retrieves a pointer to the underlying memory. **Returns:** The pointer to the underlying memory. ### `as_ref` `as_ref(self) -> Pointer[T, origin, address_space]` Gets a `Pointer` to the first element of this span. **Returns:** A `Pointer` pointing at the first element of this span. ### `copy_from` `copy_from[origin: MutableOrigin, other_alignment: Int, //](self: Span[T, origin, alignment=alignment], other: Span[T, origin, alignment=other_alignment])` Performs an element-wise copy from all elements of `other` into all elements of `self`. **Parameters:** * ​origin (`MutableOrigin`): The inferred mutable origin of the data within the Span. 
* ​other\_alignment (`Int`): The inferred alignment of the data within the Span. **Args:** * ​other (`Span[T, origin, alignment=other_alignment]`): The `Span` to copy all elements from. ### `fill` `fill[origin: MutableOrigin, //](self: Span[T, origin, alignment=alignment], value: T)` Fill the memory that a span references with a given value. **Parameters:** * ​origin (`MutableOrigin`): The inferred mutable origin of the data within the Span. **Args:** * ​value (`T`): The value to assign to each element. ### `swap_elements` `swap_elements(self: Span[T, origin, alignment=alignment], a: UInt, b: UInt)` Swap the values at indices `a` and `b`. **Args:** * ​a (`UInt`): The first argument index. * ​b (`UInt`): The second argument index. **Raises:** If `a` or `b` is larger than the length of the span. ### `__merge_with__` `__merge_with__[: Bool, : Origin[$0], //, other_type: AnyStruct[Span[T, $1, address_space=address_space, alignment=alignment]]](self) -> Span[T, origin, address_space=address_space, alignment=alignment]` Returns a span merged with the specified `other_type`. **Parameters:** * ​other\_type (`AnyStruct[Span[T, $1, address_space=address_space, alignment=alignment]]`): The type of the span to merge with. **Returns:** A span merged with the specified `other_type`. --- ## span Implements the `Span` type. You can import these APIs from the `memory` package. For example: ```mojo from memory import Span ``` ## Structs * [​`Span`](/mojo/stdlib/memory/span/Span): A non-owning view of contiguous data. --- ## bitcast `bitcast[src_dtype: DType, src_width: Int, //, dtype: DType, width: Int = src_width](val: SIMD[src_dtype, src_width]) -> SIMD[dtype, width]` Bitcasts a SIMD value to another SIMD value. For a discussion of byte order, see [Converting data: bitcasting and byte order](/mojo/manual/pointers/unsafe-pointers#converting-data-bitcasting-and-byte-order) in the Mojo Manual. Examples: The following example uses `bitcast` to break a 32-bit integer into a vector of four 8-bit integers: ```mojo from memory import bitcast u32 = SIMD[DType.uint32, 1](4631) u8x4 = bitcast[DType.uint8, 4](u32) print(u32, u8x4) # 4631 [23, 18, 0, 0] ``` **Constraints:** The bitwidth of the two types must be the same. **Parameters:** * ​src\_dtype (`DType`): The source type. * ​src\_width (`Int`): The source width. * ​dtype (`DType`): The target type. * ​width (`Int`): The target width. **Args:** * ​val (`SIMD[src_dtype, src_width]`): The source value. **Returns:** A new SIMD value with the specified type and width with a bitcopy of the source SIMD value. --- ## unsafe Provides utility functions for unsafe manipulation of SIMD values. You can import these APIs from the `memory` package. For example: ```mojo from memory import bitcast ``` ## Functions * [​`bitcast`](/mojo/stdlib/memory/unsafe/bitcast): Bitcasts a SIMD value to another SIMD value. * [​`pack_bits`](/mojo/stdlib/memory/unsafe/pack_bits): Packs a SIMD vector of `bool` values into an integer. --- ## pack_bits `pack_bits[src_width: Int, //, dtype: DType = ui1 if (src_width == 1) else ui2 if (src_width == 2) else ui4 if (src_width == 4) else uint8 if (src_width == 8) else uint16 if (src_width == 16) else uint32 if (src_width == 32) else uint64 if (src_width == 64) else ui128 if (src_width == 128) else ui256 if (src_width == 256) else invalid, width: Int = 1](val: SIMD[bool, src_width]) -> SIMD[dtype, width]` Packs a SIMD vector of `bool` values into an integer. Examples: This example packs a vector of 8 `bool` values into a single 8-bit integer. 
```mojo from memory import pack_bits bits = SIMD[DType.bool, 8](1, 1, 0, 1, 0, 0, 0, 0) u8 = pack_bits[DType.uint8](bits) print(bits, u8) # [True, True, False, True, False, False, False, False] 11 ``` **Constraints:** The logical bitwidth of the bool vector must be the same as the bitwidth of the target type. The target type must be an unsigned type. **Parameters:** * ​src\_width (`Int`): The source width. * ​dtype (`DType`): The target type. * ​width (`Int`): The target width. **Args:** * ​val (`SIMD[bool, src_width]`): The source value. **Returns:** A new integer scalar which has the same bitwidth as the bool vector. --- ## UnsafePointer `@register_passable(trivial)` `struct UnsafePointer[type: AnyType, *, address_space: AddressSpace = AddressSpace(0), alignment: Int = _default_alignment[::AnyType](), mut: Bool = True, origin: Origin[mut] = SomeAnyOrigin]` UnsafePointer\[T] represents an indirect reference to one or more values of type T consecutively in memory, and can refer to uninitialized memory. Because it supports referring to uninitialized memory, it provides unsafe methods for initializing and destroying instances of T, as well as methods for accessing the values once they are initialized. For more information see [Unsafe pointers](/mojo/manual/pointers/unsafe-pointers) in the Mojo Manual. For a comparison with other pointer types, see [Intro to pointers](/mojo/manual/pointers/). ## Parameters * ​type (`AnyType`): The type the pointer points to. * ​address\_space (`AddressSpace`): The address space associated with the UnsafePointer allocated memory. * ​alignment (`Int`): The minimum alignment of this pointer known statically. * ​mut (`Bool`): Whether the origin is mutable. * ​origin (`Origin[mut]`): The origin of the memory being addressed. ## Fields * ​address (`pointer *"type", #lit.struct.extract, "value">>`): The underlying pointer. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `EqualityComparable`, `ExplicitlyCopyable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `ImplicitlyBoolable`, `Intable`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Create a null pointer. `__init__(*, ref [origin, address_space] to: type) -> Self` Constructs a Pointer from a reference to a value. **Args:** * ​to (`type`): The value to construct a pointer to. `@implicit` `__init__(other: UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> Self` Exclusivity parameter cast of a pointer. **Args:** * ​other (`UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to cast. `__init__(*, ref [origin] unchecked_downcast_value: PythonObject) -> UnsafePointer[type, mut=mut, origin=origin]` Downcast a `PythonObject` known to contain a Mojo object to a pointer. This operation is only valid if the provided Python object contains an initialized Mojo object of matching type. **Args:** * ​unchecked\_downcast\_value (`PythonObject`): The Python object to downcast from. ### `__bool__` `__bool__(self) -> Bool` Return true if the pointer is non-null. **Returns:** Whether the pointer is non-null. ### `__getitem__` `__getitem__(self) -> ref [origin, address_space] type` Return a reference to the underlying data. **Returns:** A reference to the value. 
`__getitem__[I: Indexer, //](self, offset: I) -> ref [origin, address_space] type` Return a reference to the underlying data, offset by the given index. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​offset (`I`): The offset index. **Returns:** An offset reference. ### `__lt__` `__lt__(self, rhs: Self) -> Bool` Returns True if this pointer represents a lower address than rhs. **Args:** * ​rhs (`Self`): The value of the other pointer. **Returns:** True if this pointer represents a lower address and False otherwise. ### `__le__` `__le__(self, rhs: Self) -> Bool` Returns True if this pointer represents a lower than or equal address to rhs. **Args:** * ​rhs (`Self`): The value of the other pointer. **Returns:** True if this pointer represents a lower than or equal address and False otherwise. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Returns True if the two pointers are equal. **Args:** * ​rhs (`Self`): The value of the other pointer. **Returns:** True if the two pointers are equal and False otherwise. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Returns True if the two pointers are not equal. **Args:** * ​rhs (`Self`): The value of the other pointer. **Returns:** True if the two pointers are not equal and False otherwise. ### `__gt__` `__gt__(self, rhs: Self) -> Bool` Returns True if this pointer represents a higher address than rhs. **Args:** * ​rhs (`Self`): The value of the other pointer. **Returns:** True if this pointer represents a higher address and False otherwise. ### `__ge__` `__ge__(self, rhs: Self) -> Bool` Returns True if this pointer represents a higher than or equal address to rhs. **Args:** * ​rhs (`Self`): The value of the other pointer. **Returns:** True if this pointer represents a higher than or equal address and False otherwise. ### `__add__` `__add__[I: Indexer, //](self, offset: I) -> Self` Return a pointer at an offset from the current one. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​offset (`I`): The offset index. **Returns:** An offset pointer. ### `__sub__` `__sub__[I: Indexer, //](self, offset: I) -> Self` Return a pointer at an offset from the current one. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​offset (`I`): The offset index. **Returns:** An offset pointer. ### `__iadd__` `__iadd__[I: Indexer, //](mut self, offset: I)` Add an offset to this pointer. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​offset (`I`): The offset index. ### `__isub__` `__isub__[I: Indexer, //](mut self, offset: I)` Subtract an offset from this pointer. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​offset (`I`): The offset index. ### `copy` `copy(self) -> Self` Copy an existing pointer. **Returns:** A copy of the value. ### `address_of` `static address_of(ref [address_space] arg: type) -> UnsafePointer[type, address_space=address_space, alignment=1, mut=arg_is_mut, origin=arg_is_origin]` Gets the address of the argument. **Args:** * ​arg (`type`): The value to get the address of. **Returns:** An UnsafePointer which contains the address of the argument. ### `alloc` `static alloc(count: Int) -> UnsafePointer[type, alignment=alignment, origin={}]` Allocate an array with specified or default alignment. **Args:** * ​count (`Int`): The number of elements in the array. **Returns:** The pointer to the newly allocated array. 
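As an illustrative sketch of the allocation lifecycle (the count of 4 and the stored values are arbitrary; `init_pointee_copy` and `destroy_pointee` are documented later in this section):

```mojo
from memory import UnsafePointer

def main():
    # Allocate uninitialized storage for 4 Int values.
    var ptr = UnsafePointer[Int].alloc(4)
    # Initialize each slot before reading it.
    for i in range(4):
        (ptr + i).init_pointee_copy(i * 10)
    print(ptr[2])  # 20
    # Destroy the pointees, then release the allocation.
    for i in range(4):
        (ptr + i).destroy_pointee()
    ptr.free()
```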
### `offset` `offset[I: Indexer, //](self, idx: I) -> Self` Returns a new pointer shifted by the specified offset. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The offset of the new pointer. **Returns:** The new constructed UnsafePointer. ### `__merge_with__` `__merge_with__[: Int, : Bool, : Origin[$1], //, other_type: AnyStruct[UnsafePointer[type, address_space=address_space, alignment=$0, mut=$1, origin=$2]]](self) -> UnsafePointer[type, address_space=address_space, alignment=min(alignment, alignment), mut=mut, origin=origin]` Returns a pointer merged with the specified `other_type`. **Parameters:** * ​other\_type (`AnyStruct[UnsafePointer[type, address_space=address_space, alignment=$0, mut=$1, origin=$2]]`): The type of the pointer to merge with. **Returns:** A pointer merged with the specified `other_type`. ### `__as_bool__` `__as_bool__(self) -> Bool` Return true if the pointer is non-null. **Returns:** Whether the pointer is non-null. ### `__int__` `__int__(self) -> Int` Returns the pointer address as an integer. **Returns:** The address of the pointer as an Int. ### `__str__` `__str__(self) -> String` Gets a string representation of the pointer. **Returns:** The string representation of the pointer. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this pointer address to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `as_noalias_ptr` `as_noalias_ptr(self) -> Self` Cast the pointer to a new pointer that is known not to locally alias any other pointer. In other words, the pointer transitively does not alias any other memory value declared in the local function context. This information is relayed to the optimizer. If the pointer does locally alias another memory value, the behaviour is undefined. **Returns:** A noalias pointer. ### `load` `load[dtype: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False, invariant: Bool = _default_invariant[::Bool]()](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> SIMD[dtype, width]` Loads the value the pointer points to. **Constraints:** The width and alignment must be positive integer values. **Parameters:** * ​dtype (`DType`): The data type of SIMD vector. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. * ​invariant (`Bool`): Whether the memory is load invariant. **Returns:** The loaded value. `load[dtype: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False, invariant: Bool = _default_invariant[::Bool]()](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: SIMD[dtype, 1]) -> SIMD[dtype, width]` Loads the value the pointer points to with the given offset. **Constraints:** The width and alignment must be positive integer values. The offset must be an integer. **Parameters:** * ​dtype (`DType`): The data type of SIMD vector elements. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. * ​invariant (`Bool`): Whether the memory is load invariant. **Args:** * ​offset (`SIMD[dtype, 1]`): The offset to load from. 
**Returns:** The loaded value. `load[I: Indexer, dtype: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False, invariant: Bool = _default_invariant[::Bool]()](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: I) -> SIMD[dtype, width]` Loads the value the pointer points to with the given offset. **Constraints:** The width and alignment must be positive integer values. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. * ​dtype (`DType`): The data type of SIMD vector elements. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. * ​invariant (`Bool`): Whether the memory is load invariant. **Args:** * ​offset (`I`): The offset to load from. **Returns:** The loaded value. ### `store` `store[I: Indexer, dtype: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: I, val: SIMD[dtype, width])` Stores a single element value at the given offset. **Constraints:** The width and alignment must be positive integer values. The offset must be an integer. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. * ​dtype (`DType`): The data type of SIMD vector elements. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. **Args:** * ​offset (`I`): The offset to store to. * ​val (`SIMD[dtype, width]`): The value to store. `store[dtype: DType, offset_type: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: SIMD[offset_type, 1], val: SIMD[dtype, width])` Stores a single element value at the given offset. **Constraints:** The width and alignment must be positive integer values. **Parameters:** * ​dtype (`DType`): The data type of SIMD vector elements. * ​offset\_type (`DType`): The data type of the offset value. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. **Args:** * ​offset (`SIMD[offset_type, 1]`): The offset to store to. * ​val (`SIMD[dtype, width]`): The value to store. `store[dtype: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], val: SIMD[dtype, width])` Stores a single element value. **Constraints:** The width and alignment must be positive integer values. **Parameters:** * ​dtype (`DType`): The data type of SIMD vector elements. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. **Args:** * ​val (`SIMD[dtype, width]`): The value to store. 
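To make the offset-and-width interplay of `load`/`store` concrete, a small sketch (dtype, width, and values chosen arbitrarily; `Float32` is a trivial type, so no pointee destruction is needed before `free()`):

```mojo
from memory import UnsafePointer

def main():
    var ptr = UnsafePointer[Float32].alloc(8)
    # Write two 4-wide vectors back to back.
    ptr.store(SIMD[DType.float32, 4](1, 2, 3, 4))
    ptr.store(4, SIMD[DType.float32, 4](5, 6, 7, 8))
    # Read 4 elements starting at offset 2: [3.0, 4.0, 5.0, 6.0]
    print(ptr.load[width=4](2))
    ptr.free()
```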
### `strided_load` `strided_load[dtype: DType, T: Intable, //, width: Int](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], stride: T) -> SIMD[dtype, width]` Performs a strided load of the SIMD vector. **Parameters:** * ​dtype (`DType`): DType of returned SIMD value. * ​T (`Intable`): The Intable type of the stride. * ​width (`Int`): The SIMD width. **Args:** * ​stride (`T`): The stride between loads. **Returns:** A vector which is stride loaded. ### `strided_store` `strided_store[dtype: DType, T: Intable, //, width: Int = 1](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], val: SIMD[dtype, width], stride: T)` Performs a strided store of the SIMD vector. **Parameters:** * ​dtype (`DType`): DType of `val`, the SIMD value to store. * ​T (`Intable`): The Intable type of the stride. * ​width (`Int`): The SIMD width. **Args:** * ​val (`SIMD[dtype, width]`): The SIMD value to store. * ​stride (`T`): The stride between stores. ### `gather` `gather[dtype: DType, //, *, width: Int = 1, alignment: Int = _default_alignment[::DType,::Int]()](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: SIMD[dtype, width], mask: SIMD[bool, width] = SIMD(True), default: SIMD[dtype, width] = __init__[__mlir_type.!pop.int_literal](0)) -> SIMD[dtype, width]` Gathers a SIMD vector from offsets of the current pointer. This method loads from memory addresses calculated by appropriately shifting the current pointer according to the `offset` SIMD vector, or takes from the `default` SIMD vector, depending on the values of the `mask` SIMD vector. If a mask element is `True`, the respective result element is given by the current pointer and the `offset` SIMD vector; otherwise, the result element is taken from the `default` SIMD vector. **Constraints:** The offset type must be an integral type. The alignment must be a power of two integer value. **Parameters:** * ​dtype (`DType`): DType of the return SIMD. * ​width (`Int`): The SIMD width. * ​alignment (`Int`): The minimal alignment of the address. **Args:** * ​offset (`SIMD[dtype, width]`): The SIMD vector of offsets to gather from. * ​mask (`SIMD[bool, width]`): The SIMD vector of boolean values, indicating for each element whether to load from memory or to take from the `default` SIMD vector. * ​default (`SIMD[dtype, width]`): The SIMD vector providing default values to be taken where the `mask` SIMD vector is `False`. **Returns:** The SIMD vector containing the gathered values. ### `scatter` `scatter[dtype: DType, //, *, width: Int = 1, alignment: Int = _default_alignment[::DType,::Int]()](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: SIMD[dtype, width], val: SIMD[dtype, width], mask: SIMD[bool, width] = SIMD(True))` Scatters a SIMD vector into offsets of the current pointer. This method stores at memory addresses calculated by appropriately shifting the current pointer according to the `offset` SIMD vector, depending on the values of the `mask` SIMD vector. If a mask element is `True`, the respective element in the `val` SIMD vector is stored at the memory address defined by the current pointer and the `offset` SIMD vector; otherwise, no action is taken for that element in `val`. 
If the same offset is targeted multiple times, the values are stored in the order they appear in the `val` SIMD vector, from the first to the last element. **Constraints:** The offset type must be an integral type. The alignment must be a power of two integer value. **Parameters:** * ​dtype (`DType`): DType of `val`, the SIMD vector to scatter. * ​width (`Int`): The SIMD width. * ​alignment (`Int`): The minimal alignment of the address. **Args:** * ​offset (`SIMD[dtype, width]`): The SIMD vector of offsets to scatter into. * ​val (`SIMD[dtype, width]`): The SIMD vector containing the values to be scattered. * ​mask (`SIMD[bool, width]`): The SIMD vector of boolean values, indicating for each element whether to store at memory or not. ### `free` `free(self: UnsafePointer[type, alignment=alignment, mut=mut, origin=origin])` Free the memory referenced by the pointer. ### `bitcast` `bitcast[T: AnyType = type](self) -> UnsafePointer[T, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Bitcasts an UnsafePointer to a different type. **Parameters:** * ​T (`AnyType`): The target type. **Returns:** A new UnsafePointer object with the specified type and the same address, as the original UnsafePointer. ### `static_alignment_cast` `static_alignment_cast[alignment: Int = alignment](self) -> UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Changes the `alignment` of an `UnsafePointer`. The static alignment of an UnsafePointer must be greater than or equal to the actual alignment of the runtime pointer value. Casting an UnsafePointer to a static alignment greater than its runtime alignment may cause undefined behavior. This only changes the compile-time alignment encoded in the type of this pointer. This does not change the alignment of the pointer address at runtime. **Parameters:** * ​alignment (`Int`): Alignment of the destination pointer. **Returns:** A new UnsafePointer object with the same type, address\_space, and address, as the original UnsafePointer, and the new specified alignment. ### `origin_cast` `origin_cast[mut: Bool = mut, origin: Origin[mut] = origin](self) -> UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Changes the origin or mutability of a pointer. **Parameters:** * ​mut (`Bool`): Whether the origin is mutable. * ​origin (`Origin[mut]`): Origin of the destination pointer. **Returns:** A new UnsafePointer object with the same type and the same address, as the original UnsafePointer and the new specified mutability and origin. ### `address_space_cast` `address_space_cast[address_space: AddressSpace = address_space](self) -> UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Casts an UnsafePointer to a different address space. **Parameters:** * ​address\_space (`AddressSpace`): The address space of the result. **Returns:** A new UnsafePointer object with the same type and the same address, as the original UnsafePointer and the new address space. ### `destroy_pointee` `destroy_pointee(self: UnsafePointer[type, alignment=alignment, mut=mut, origin=origin])` Destroy the pointed-to value. The pointer must not be null, and the pointer memory location is assumed to contain a valid initialized instance of `type`. This is equivalent to `_ = self.take_pointee()` but doesn't require `Movable` and is more efficient because it doesn't invoke `__moveinit__`. 
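A short sketch contrasting `destroy_pointee` with the move-based methods documented next (`take_pointee`, `init_pointee_move`), using `String` as an arbitrary non-trivial type:

```mojo
from memory import UnsafePointer

def main():
    var ptr = UnsafePointer[String].alloc(1)
    # Emplace into uninitialized memory by moving from the argument.
    ptr.init_pointee_move(String("hello"))
    # Move the value back out; the slot is uninitialized again.
    var s = ptr.take_pointee()
    print(s)  # hello
    # Re-initialize by copy, then destroy in place without moving.
    ptr.init_pointee_copy(String("world"))
    ptr.destroy_pointee()
    ptr.free()
```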
### `take_pointee` `take_pointee[T: Movable, //](self: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin]) -> T` Move the value at the pointer out, leaving it uninitialized. The pointer must not be null, and the pointer memory location is assumed to contain a valid initialized instance of `T`. This performs a *consuming* move, ending the origin of the value stored in this pointer memory location. Subsequent reads of this pointer are not valid. If a new valid value is stored using `init_pointee_move()`, then reading from this pointer becomes valid again. **Parameters:** * ​T (`Movable`): The type the pointer points to, which must be `Movable`. **Returns:** The value at the pointer. ### `init_pointee_move` `init_pointee_move[T: Movable, //](self: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], owned value: T)` Emplace a new value into the pointer location, moving from `value`. The pointer memory location is assumed to contain uninitialized data, and consequently the current contents of this pointer are not destructed before writing `value`. Similarly, ownership of `value` is logically transferred into the pointer location. When compared to `init_pointee_copy`, this avoids an extra copy on the caller side when the value is an `owned` rvalue. **Parameters:** * ​T (`Movable`): The type the pointer points to, which must be `Movable`. **Args:** * ​value (`T`): The value to emplace. ### `init_pointee_copy` `init_pointee_copy[T: Copyable, //](self: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], value: T)` Emplace a copy of `value` into the pointer location. The pointer memory location is assumed to contain uninitialized data, and consequently the current contents of this pointer are not destructed before writing `value`. Similarly, ownership of `value` is logically transferred into the pointer location. When compared to `init_pointee_move`, this avoids an extra move on the callee side when the value must be copied. **Parameters:** * ​T (`Copyable`): The type the pointer points to, which must be `Copyable`. **Args:** * ​value (`T`): The value to emplace. ### `init_pointee_explicit_copy` `init_pointee_explicit_copy[T: ExplicitlyCopyable, //](self: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], value: T)` Emplace a copy of `value` into this pointer location. The pointer memory location is assumed to contain uninitialized data, and consequently the current contents of this pointer are not destructed before writing `value`. Similarly, ownership of `value` is logically transferred into the pointer location. When compared to `init_pointee_move`, this avoids an extra move on the callee side when the value must be copied. **Parameters:** * ​T (`ExplicitlyCopyable`): The type the pointer points to, which must be `ExplicitlyCopyable`. **Args:** * ​value (`T`): The value to emplace. ### `move_pointee_into` `move_pointee_into[T: Movable, //](self: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], dst: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin])` Moves the value `self` points to into the memory location pointed to by `dst`. This performs a consuming move (using `__moveinit__()`) out of the memory location pointed to by `self`. Subsequent reads of this pointer are not valid unless and until a new, valid value has been moved into this pointer's memory location using `init_pointee_move()`. This transfers the value out of `self` and into `dst` using at most one `__moveinit__()` call. 
**Safety:** * `self` must be non-null * `self` must contain a valid, initialized instance of `T` * `dst` must not be null * The contents of `dst` should be uninitialized. If `dst` was previously written with a valid value, that value will be overwritten and its destructor will NOT be run. **Parameters:** * ​T (`Movable`): The type the pointer points to, which must be `Movable`. **Args:** * ​dst (`UnsafePointer[T, alignment=alignment, mut=mut, origin=origin]`): Destination pointer that the value will be moved into. --- ## unsafe_pointer Implement a generic unsafe pointer type. You can import these APIs from the `memory` package. For example: ```mojo from memory import UnsafePointer ``` ## Structs * [​`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer): UnsafePointer\[T] represents an indirect reference to one or more values of type T consecutively in memory, and can refer to uninitialized memory. --- ## Atomic `struct Atomic[dtype: DType, *, scope: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")]` Represents a value with atomic operations. The class provides atomic `add` and `sub` methods for mutating the value. ## Parameters * ​dtype (`DType`): DType of the value. * ​scope (`StringSlice[StaticConstantOrigin]`): The memory synchronization scope. ## Fields * ​value (`SIMD[dtype, 1]`): The atomic value. This is the underlying value of the atomic. Access to the value can only occur through atomic primitive operations. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, value: SIMD[dtype, 1])` Constructs a new atomic value. **Args:** * ​value (`SIMD[dtype, 1]`): Initial value represented as `Scalar[dtype]` type. ### `__iadd__` `__iadd__(mut self, rhs: SIMD[dtype, 1])` Performs atomic in-place add. Atomically replaces the current value with the result of arithmetic addition of the value and arg. That is, it performs atomic post-increment. The operation is a read-modify-write operation. Memory is affected according to the value of order which is sequentially consistent. **Args:** * ​rhs (`SIMD[dtype, 1]`): Value to add. ### `__isub__` `__isub__(mut self, rhs: SIMD[dtype, 1])` Performs atomic in-place sub. Atomically replaces the current value with the result of arithmetic subtraction of the value and arg. That is, it performs atomic post-decrement. The operation is a read-modify-write operation. Memory is affected according to the value of order which is sequentially consistent. **Args:** * ​rhs (`SIMD[dtype, 1]`): Value to subtract. ### `load` `load(mut self) -> SIMD[dtype, 1]` Loads the current value from the atomic. **Returns:** The current value of the atomic. ### `fetch_add` `static fetch_add[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], rhs: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Performs atomic in-place add. Atomically replaces the current value with the result of arithmetic addition of the value and arg. That is, it performs atomic post-increment. The operation is a read-modify-write operation. Memory is affected according to the value of order which is sequentially consistent. **Parameters:** * ​ordering (`Consistency`): The memory ordering. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The source pointer. * ​rhs (`SIMD[dtype, 1]`): Value to add. 
**Returns:** The original value before addition. `fetch_add[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](mut self, rhs: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Performs atomic in-place add. Atomically replaces the current value with the result of arithmetic addition of the value and arg. That is, it performs atomic post-increment. The operation is a read-modify-write operation. Memory is affected according to the value of order which is sequentially consistent. **Parameters:** * ​ordering (`Consistency`): The memory ordering. **Args:** * ​rhs (`SIMD[dtype, 1]`): Value to add. **Returns:** The original value before addition. ### `store` `static store[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], value: SIMD[dtype, 1])` Performs atomic store. The operation is a read-modify-write operation. Memory is affected according to the value of order which is sequentially consistent. **Parameters:** * ​ordering (`Consistency`): The memory ordering. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The source pointer. * ​value (`SIMD[dtype, 1]`): The value to store. ### `fetch_sub` `fetch_sub[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](mut self, rhs: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Performs atomic in-place sub. Atomically replaces the current value with the result of arithmetic subtraction of the value and arg. That is, it performs atomic post-decrement. The operation is a read-modify-write operation. Memory is affected according to the value of order which is sequentially consistent. **Parameters:** * ​ordering (`Consistency`): The memory ordering. **Args:** * ​rhs (`SIMD[dtype, 1]`): Value to subtract. **Returns:** The original value before subtraction. ### `compare_exchange_weak` `compare_exchange_weak[*, failure_ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6)), success_ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](self, mut expected: SIMD[dtype, 1], desired: SIMD[dtype, 1]) -> Bool` Atomically compares the self value with that of the expected value. If the values are equal, then the self value is replaced with the desired value and True is returned. Otherwise, False is returned and the expected value is rewritten with the self value. **Parameters:** * ​failure\_ordering (`Consistency`): The memory ordering for the failure case. * ​success\_ordering (`Consistency`): The memory ordering for the success case. **Args:** * ​expected (`SIMD[dtype, 1]`): The expected value. * ​desired (`SIMD[dtype, 1]`): The desired value. **Returns:** True if self == expected and False otherwise. ### `max` `static max[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], rhs: SIMD[dtype, 1])` Performs atomic in-place max on the pointer. Atomically replaces the current value pointed to by `ptr` with the result of max of the value and arg. The operation is a read-modify-write operation performed according to sequential consistency semantics. **Constraints:** The input type must be either integral or floating-point type. **Parameters:** * ​ordering (`Consistency`): The memory ordering. 
**Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The source pointer. * ​rhs (`SIMD[dtype, 1]`): Value to max. `max[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](self, rhs: SIMD[dtype, 1])` Performs atomic in-place max. Atomically replaces the current value with the result of max of the value and arg. The operation is a read-modify-write operation performed according to sequential consistency semantics. **Constraints:** The input type must be either integral or floating-point type. **Parameters:** * ​ordering (`Consistency`): The memory ordering. **Args:** * ​rhs (`SIMD[dtype, 1]`): Value to max. ### `min` `static min[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], rhs: SIMD[dtype, 1])` Performs atomic in-place min on the pointer. Atomically replaces the current value pointed to by `ptr` with the result of min of the value and arg. The operation is a read-modify-write operation performed according to sequential consistency semantics. **Constraints:** The input type must be either integral or floating-point type. **Parameters:** * ​ordering (`Consistency`): The memory ordering. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The source pointer. * ​rhs (`SIMD[dtype, 1]`): Value to min. `min[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](self, rhs: SIMD[dtype, 1])` Performs atomic in-place min. Atomically replaces the current value with the result of min of the value and arg. The operation is a read-modify-write operation performed according to sequential consistency semantics. **Constraints:** The input type must be either integral or floating-point type. **Parameters:** * ​ordering (`Consistency`): The memory ordering. **Args:** * ​rhs (`SIMD[dtype, 1]`): Value to min. --- ## Consistency `@register_passable(trivial)` `struct Consistency` Represents the consistency model for atomic operations. The class provides a set of constants that represent different consistency models for atomic operations. Attributes: NOT\_ATOMIC: Not atomic. UNORDERED: Unordered. MONOTONIC: Monotonic. ACQUIRE: Acquire. RELEASE: Release. ACQUIRE\_RELEASE: Acquire-release. SEQUENTIAL: Sequentially consistent. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ACQUIRE` `alias ACQUIRE = Consistency(__init__[__mlir_type.!pop.int_literal](3))` Acquire. ### `ACQUIRE_RELEASE` `alias ACQUIRE_RELEASE = Consistency(__init__[__mlir_type.!pop.int_literal](5))` Acquire-release. ### `MONOTONIC` `alias MONOTONIC = Consistency(__init__[__mlir_type.!pop.int_literal](2))` Monotonic. ### `NOT_ATOMIC` `alias NOT_ATOMIC = Consistency(__init__[__mlir_type.!pop.int_literal](0))` Not atomic. ### `RELEASE` `alias RELEASE = Consistency(__init__[__mlir_type.!pop.int_literal](4))` Release. ### `SEQUENTIAL` `alias SEQUENTIAL = Consistency(__init__[__mlir_type.!pop.int_literal](6))` Sequentially consistent. ### `UNORDERED` `alias UNORDERED = Consistency(__init__[__mlir_type.!pop.int_literal](1))` Unordered. ## Methods ### `__init__` `__init__(value: SIMD[uint8, 1]) -> Self` Constructs a new Consistency object. 
**Args:** * ​value (`SIMD[uint8, 1]`): The value of the consistency model. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compares two Consistency objects for equality. **Args:** * ​other (`Self`): The other Consistency object to compare with. **Returns:** True if the objects are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compares two Consistency objects for inequality. **Args:** * ​other (`Self`): The other Consistency object to compare with. **Returns:** True if the objects are not equal, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Checks if the Consistency object is the same as another. **Args:** * ​other (`Self`): The other Consistency object to compare with. **Returns:** True if the objects are the same, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Checks if the Consistency object is not the same as another. **Args:** * ​other (`Self`): The other Consistency object to compare with. **Returns:** True if the objects are not the same, False otherwise. ### `__mlir_attr` `__mlir_attr(self) -> !kgen.deferred` Returns the MLIR attribute representation of the Consistency object. **Returns:** The MLIR attribute representation of the Consistency object. --- ## atomic Implements the `Atomic` struct. You can import these APIs from the `os` package. For example: ```mojo from os import Atomic ``` ## Structs * [​`Atomic`](/mojo/stdlib/os/atomic/Atomic): Represents a value with atomic operations. * [​`Consistency`](/mojo/stdlib/os/atomic/Consistency): Represents the consistency model for atomic operations. --- ## getenv `getenv(owned name: String, default: String = __init__[__mlir_type.!kgen.string]("")) -> String` Returns the value of the given environment variable. **Constraints:** The function only works on macOS or Linux and returns an empty string otherwise. **Args:** * ​name (`String`): The name of the environment variable. * ​default (`String`): The default value to return if the environment variable doesn't exist. **Returns:** The value of the environment variable. --- ## env Provides functions for working with environment variables. You can import these APIs from the `os` package. For example: ```mojo from os import setenv ``` ## Functions * [​`getenv`](/mojo/stdlib/os/env/getenv): Returns the value of the given environment variable. * [​`setenv`](/mojo/stdlib/os/env/setenv): Changes or adds an environment variable. * [​`unsetenv`](/mojo/stdlib/os/env/unsetenv): Unsets an environment variable. --- ## setenv `setenv(owned name: String, owned value: String, overwrite: Bool = True) -> Bool` Changes or adds an environment variable. **Constraints:** The function only works on macOS or Linux and returns False otherwise. **Args:** * ​name (`String`): The name of the environment variable. * ​value (`String`): The value of the environment variable. * ​overwrite (`Bool`): If an environment variable with the given name already exists, its value is not changed unless `overwrite` is True. **Returns:** False if the name is empty or contains an `=` character. In any other case, True is returned. --- ## unsetenv `unsetenv(owned name: String) -> Bool` Unsets an environment variable. **Args:** * ​name (`String`): The name of the environment variable. **Returns:** True if unsetting the variable succeeded. Otherwise, False is returned. --- ## fstat Implements file system status operations. You can import these APIs from the `os` package. 
For example: ```mojo from os import stat ``` ## Structs * [​`stat_result`](/mojo/stdlib/os/fstat/stat_result): Object whose fields correspond to the members of the stat structure. ## Functions * [​`lstat`](/mojo/stdlib/os/fstat/lstat): Get the status of a file or a file descriptor (similar to stat, but does not follow symlinks). * [​`stat`](/mojo/stdlib/os/fstat/stat): Get the status of a file or a file descriptor. --- ## lstat `lstat[PathLike: PathLike](path: PathLike) -> stat_result` Get the status of a file or a file descriptor (similar to stat, but does not follow symlinks). **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. **Returns:** Returns the stat\_result on the path. --- ## stat `stat[PathLike: PathLike](path: PathLike) -> stat_result` Get the status of a file or a file descriptor. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. **Returns:** Returns the stat\_result on the path. --- ## stat_result `struct stat_result` Object whose fields correspond to the members of the stat structure. ## Fields * ​st\_mode (`Int`): File mode: file type and file mode bits (permissions). * ​st\_ino (`Int`): Platform dependent, but if non-zero, uniquely identifies the file for a given value of st\_dev. * ​st\_dev (`Int`): Identifier of the device on which this file resides. * ​st\_nlink (`Int`): Number of hard links. * ​st\_uid (`Int`): User identifier of the file owner. * ​st\_gid (`Int`): Group identifier of the file owner. * ​st\_size (`Int`): Size of the file in bytes, if it is a regular file or a symbolic link. * ​st\_atimespec (`_CTimeSpec`): Time of file most recent access. * ​st\_mtimespec (`_CTimeSpec`): Time of file most recent modification. * ​st\_ctimespec (`_CTimeSpec`): Time of file most recent change. * ​st\_birthtimespec (`_CTimeSpec`): Time of file creation. * ​st\_blocks (`Int`): Number of 512-byte blocks allocated for file. * ​st\_blksize (`Int`): Preferred blocksize for efficient file system I/O. * ​st\_rdev (`Int`): Type of device if an inode device. * ​st\_flags (`Int`): User defined flags for file. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self, *, st_mode: Int, st_ino: Int, st_dev: Int, st_nlink: Int, st_uid: Int, st_gid: Int, st_size: Int, st_atimespec: _CTimeSpec, st_mtimespec: _CTimeSpec, st_ctimespec: _CTimeSpec, st_birthtimespec: _CTimeSpec, st_blocks: Int, st_blksize: Int, st_rdev: Int, st_flags: Int)` Initialize the stat\_result structure. **Args:** * ​st\_mode (`Int`): File mode: file type and file mode bits (permissions). * ​st\_ino (`Int`): Unique identifier for a file. * ​st\_dev (`Int`): Identifier of the device on which this file resides. * ​st\_nlink (`Int`): Number of hard links. * ​st\_uid (`Int`): User identifier of the file owner. * ​st\_gid (`Int`): Group identifier of the file owner. * ​st\_size (`Int`): Size of the file (bytes), if it is a file or a symlink. * ​st\_atimespec (`_CTimeSpec`): Time of file most recent access. * ​st\_mtimespec (`_CTimeSpec`): Time of file most recent modification. * ​st\_ctimespec (`_CTimeSpec`): Time of file most recent change. * ​st\_birthtimespec (`_CTimeSpec`): Time of file creation. * ​st\_blocks (`Int`): Number of 512-byte blocks allocated for file. 
* ​st\_blksize (`Int`): Preferred blocksize for efficient file system I/O. * ​st\_rdev (`Int`): Type of device if an inode device. * ​st\_flags (`Int`): User defined flags for file. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this stat\_result to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__str__` `__str__(self) -> String` Constructs a string representation of stat\_result. **Returns:** A string representation of stat\_result. ### `__repr__` `__repr__(self) -> String` Constructs a representation of stat\_result. **Returns:** A representation of stat\_result. --- ## os Provides access to operating-system dependent functionality. The types and functions in this package primarily provide operating-system independent access to operating-system dependent features, such as file systems and environment variables. For accessing files, see the built-in [`open()`](/mojo/stdlib/builtin/file/open) function and the [`file`](/mojo/stdlib/builtin/file/) module. For manipulating file system paths, see the [`os.path`](/mojo/stdlib/os/path/) package for OS-independent path manipulation functions and the `pathlib` package for the [`Path`](/mojo/stdlib/pathlib/path/Path) struct, an abstraction for handling paths. ## Packages * [​`path`](/mojo/stdlib/os/path/): Provides a set of operating-system independent functions for manipulating file system paths. ## Modules * [​`atomic`](/mojo/stdlib/os/atomic/): Implements the `Atomic` struct. * [​`env`](/mojo/stdlib/os/env/): Provides functions for working with environment variables. * [​`fstat`](/mojo/stdlib/os/fstat/): Implements file system status operations. * [​`os`](/mojo/stdlib/os/os/): Provides functions to access operating-system dependent functionality, including file system operations. * [​`pathlike`](/mojo/stdlib/os/pathlike/): Implements the `PathLike` trait. --- ## abort `abort[result: AnyType = None]() -> result` Calls a target dependent trap instruction if available. **Parameters:** * ​result (`AnyType`): The result type. **Returns:** A null result type. `abort[result: AnyType = None](message: String) -> result` Calls a target dependent trap instruction if available. **Parameters:** * ​result (`AnyType`): The result type. **Args:** * ​message (`String`): The message to include when aborting. **Returns:** A null result type. --- ## getuid `getuid() -> Int` Retrieve the user ID of the calling process. **Constraints:** This function is constrained to run on Linux or macOS operating systems only. **Returns:** The user ID of the calling process. --- ## os Provides functions to access operating-system dependent functionality, including file system operations. You can import a method from the `os` package. For example: ```mojo from os import listdir ``` ## Aliases ### `SEEK_CUR` `alias SEEK_CUR = __init__[__mlir_type.!pop.int_literal](1)` Seek from the current position. ### `SEEK_END` `alias SEEK_END = __init__[__mlir_type.!pop.int_literal](2)` Seek from the end of the file. ### `SEEK_SET` `alias SEEK_SET = __init__[__mlir_type.!pop.int_literal](0)` Seek from the beginning of the file. ### `sep` `alias sep = "\\".__merge_with__[__mlir_type.!kgen.string,AnyStruct[::StringLiteral[$1]]]() if os_is_windows() else "/".__merge_with__[__mlir_type.!kgen.string,AnyStruct[::StringLiteral[$1]]]()` ## Functions * [​`abort`](/mojo/stdlib/os/os/abort): Calls a target dependent trap instruction if available. 
* [​`getuid`](/mojo/stdlib/os/os/getuid): Retrieve the user ID of the calling process. * [​`listdir`](/mojo/stdlib/os/os/listdir): Gets the list of entries contained in the path provided. * [​`makedirs`](/mojo/stdlib/os/os/makedirs): Creates a specified leaf directory along with any necessary intermediate directories that don't already exist. * [​`mkdir`](/mojo/stdlib/os/os/mkdir): Creates a directory at the specified path. * [​`remove`](/mojo/stdlib/os/os/remove): Removes the specified file. * [​`removedirs`](/mojo/stdlib/os/os/removedirs): Removes a leaf directory and all empty intermediate ones. * [​`rmdir`](/mojo/stdlib/os/os/rmdir): Removes the specified directory. * [​`unlink`](/mojo/stdlib/os/os/unlink): Removes the specified file. --- ## listdir `listdir[PathLike: PathLike](path: PathLike) -> List[String]` Gets the list of entries contained in the path provided. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. **Returns:** Returns the list of entries in the path provided. --- ## makedirs `makedirs[PathLike: PathLike](path: PathLike, mode: Int = 511, exist_ok: Bool = False)` Creates a specified leaf directory along with any necessary intermediate directories that don't already exist. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. * ​mode (`Int`): The mode to create the directory with. * ​exist\_ok (`Bool`): Ignore error if `True` and path exists (default `False`). --- ## mkdir `mkdir[PathLike: PathLike](path: PathLike, mode: Int = 511)` Creates a directory at the specified path. If the directory cannot be created, an error is raised. Absolute and relative paths are allowed; relative paths are resolved from cwd. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. * ​mode (`Int`): The mode to create the directory with. --- ## remove `remove[PathLike: PathLike](path: PathLike)` Removes the specified file. If the path is a directory or it cannot be deleted, an error is raised. Absolute and relative paths are allowed; relative paths are resolved from cwd. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the file. --- ## removedirs `removedirs[PathLike: PathLike](path: PathLike)` Removes a leaf directory and all empty intermediate ones. Directories corresponding to rightmost path segments will be pruned away until either the whole path is consumed or an error occurs. Errors during this latter phase, which occur when a directory is not empty, are ignored. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. --- ## rmdir `rmdir[PathLike: PathLike](path: PathLike)` Removes the specified directory. If the path is not a directory or it cannot be deleted, an error is raised. Absolute and relative paths are allowed; relative paths are resolved from cwd. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. --- ## unlink `unlink[PathLike: PathLike](path: PathLike)` Removes the specified file. If the path is a directory or it cannot be deleted, an error is raised. 
Absolute and relative paths are allowed; relative paths are resolved from cwd. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the file. --- ## path Provides a set of operating-system independent functions for manipulating file system paths. ## Modules * [​`path`](/mojo/stdlib/os/path/path/): Provides a set of operating-system independent functions for manipulating file system paths. --- ## basename `basename[PathLike: PathLike, //](path: PathLike) -> String` Returns the tail section of a path. ```mojo from os.path import basename basename("a/path/foo.txt") # returns "foo.txt" ``` **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to retrieve the basename from. **Returns:** The basename from the path. --- ## dirname `dirname[PathLike: PathLike, //](path: PathLike) -> String` Returns the directory component of a pathname. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to a file. **Returns:** The directory component of a pathname. --- ## exists `exists[PathLike: PathLike, //](path: PathLike) -> Bool` Return True if path exists. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to check. **Returns:** True if the path exists and is not a broken symbolic link. --- ## expanduser `expanduser[PathLike: PathLike, //](path: PathLike) -> String` Expands a tilde "\~" prefix in `path` to the user's home directory. For example, `~/folder` becomes `/home/current_user/folder`. On macOS and Linux a path starting with `~user/` will expand to the specified user's home directory, so `~user/folder` becomes `/home/user/folder`. If the home directory cannot be determined, or the `path` is not prefixed with "\~", the original path is returned unchanged. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path that is being expanded. **Returns:** The expanded path. --- ## expandvars `expandvars[PathLike: PathLike, //](path: PathLike) -> String` Replaces `${var}` or `$var` in the path with values from the current environment variables. Malformed variable names and references to non-existing variables are left unchanged. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path that is being expanded. **Returns:** The expanded path. --- ## getsize `getsize[PathLike: PathLike, //](path: PathLike) -> Int` Return the size, in bytes, of the specified path. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the file. **Returns:** The size of the path in bytes. --- ## path Provides a set of operating-system independent functions for manipulating file system paths. You can import these APIs from the `os.path` package. For example: ```mojo from os.path import isdir ``` ## Functions * [​`basename`](/mojo/stdlib/os/path/path/basename): Returns the tail section of a path. * [​`dirname`](/mojo/stdlib/os/path/path/dirname): Returns the directory component of a pathname. * [​`exists`](/mojo/stdlib/os/path/path/exists): Return True if path exists. * [​`expanduser`](/mojo/stdlib/os/path/path/expanduser): Expands a tilde "\~" prefix in `path` to the user's home directory.
* [​`expandvars`](/mojo/stdlib/os/path/path/expandvars): Replaces `${var}` or `$var` in the path with values from the current environment variables. Malformed variable names and references to non-existing variables are left unchanged. * [​`getsize`](/mojo/stdlib/os/path/path/getsize): Return the size, in bytes, of the specified path. * [​`is_absolute`](/mojo/stdlib/os/path/path/is_absolute): Return True if `path` is an absolute path name. On Unix, that means it begins with a slash. * [​`isdir`](/mojo/stdlib/os/path/path/isdir): Return True if path is an existing directory. This follows symbolic links, so both islink() and isdir() can be true for the same path. * [​`isfile`](/mojo/stdlib/os/path/path/isfile): Test whether a path is a regular file. * [​`islink`](/mojo/stdlib/os/path/path/islink): Return True if path refers to an existing directory entry that is a symbolic link. * [​`join`](/mojo/stdlib/os/path/path/join): Join two or more pathname components, inserting '/' as needed. If any component is an absolute path, all previous path components will be discarded. An empty last part will result in a path that ends with a separator. * [​`lexists`](/mojo/stdlib/os/path/path/lexists): Return True if path exists or is a broken symlink. * [​`split`](/mojo/stdlib/os/path/path/split): Split a given pathname into two components: head and tail. This is useful for separating the directory path from the filename. If the input path ends with a separator, the tail component will be empty. If there is no separator in the path, the head component will be empty, and the entire path will be considered the tail. Trailing separators in the head are stripped unless the head is the root directory. * [​`split_extension`](/mojo/stdlib/os/path/path/split_extension): Splits `path` into the root and extension. * [​`splitroot`](/mojo/stdlib/os/path/path/splitroot): Splits `path` into drive, root and tail. The tail contains anything after the root. --- ## is_absolute `is_absolute[PathLike: PathLike, //](path: PathLike) -> Bool` Return True if `path` is an absolute path name. On Unix, that means it begins with a slash. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to check. **Returns:** Return `True` if path is an absolute path name. --- ## isdir `isdir[PathLike: PathLike, //](path: PathLike) -> Bool` Return True if path is an existing directory. This follows symbolic links, so both islink() and isdir() can be true for the same path. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. **Returns:** True if the path is a directory or a link to a directory and False otherwise. --- ## isfile `isfile[PathLike: PathLike, //](path: PathLike) -> Bool` Test whether a path is a regular file. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the file. **Returns:** True if the path is a regular file. --- ## islink `islink[PathLike: PathLike, //](path: PathLike) -> Bool` Return True if path refers to an existing directory entry that is a symbolic link. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to check. **Returns:** True if the path refers to an existing symbolic link and False otherwise.
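As a quick illustration of these predicates, here is a minimal sketch that classifies a path (the `/tmp` path is purely illustrative):

```mojo
from os.path import exists, isdir, isfile, islink


def main():
    var p = String("/tmp")  # illustrative path; substitute your own
    if not exists(p):
        print(p, "does not exist (or is a broken symlink)")
    elif islink(p):
        print(p, "is a symbolic link")
    elif isdir(p):
        print(p, "is a directory")
    elif isfile(p):
        print(p, "is a regular file")
```

Because `isdir()` and `isfile()` follow symbolic links, the `islink()` check comes first to distinguish a link from its target.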
--- ## join `join(owned path: String, *paths: String) -> String` Join two or more pathname components, inserting '/' as needed. If any component is an absolute path, all previous path components will be discarded. An empty last part will result in a path that ends with a separator. **Args:** * ​path (`String`): The path to join. * ​\*paths (`String`): The paths to join. **Returns:** The joined path. --- ## lexists `lexists[PathLike: PathLike, //](path: PathLike) -> Bool` Return True if path exists or is a broken symlink. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to check. **Returns:** True if the path exists or is a broken symbolic link. --- ## split `split[PathLike: PathLike, //](path: PathLike) -> Tuple[String, String]` Split a given pathname into two components: head and tail. This is useful for separating the directory path from the filename. If the input path ends with a separator, the tail component will be empty. If there is no separator in the path, the head component will be empty, and the entire path will be considered the tail. Trailing separators in the head are stripped unless the head is the root directory. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to be split. **Returns:** A tuple containing two strings: (head, tail). --- ## split_extension `split_extension[PathLike: PathLike, //](path: PathLike) -> Tuple[String, String]` Splits `path` into the root and extension. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to be split. **Returns:** A tuple containing two strings: (root, extension). --- ## splitroot `splitroot[PathLike: PathLike, //](path: PathLike) -> Tuple[String, String, String]` Splits `path` into drive, root and tail. The tail contains anything after the root. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to be split. **Returns:** A tuple containing three strings: (drive, root, tail). --- ## PathLike A trait representing file system paths. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__fspath__` `__fspath__(self: _Self) -> String` Return the file system path representation of the object. **Returns:** The file system path representation as a string. --- ## pathlike Implements the `PathLike` trait. You can import the trait from the `os` package. For example: ```mojo from os import PathLike ``` ## Traits * [​`PathLike`](/mojo/stdlib/os/pathlike/PathLike): A trait representing file system paths. --- ## pathlib Implements the pathlib package. ## Modules * [​`path`](/mojo/stdlib/pathlib/path/): Implements `Path` and related functions. --- ## Path `struct Path` The Path object. ## Fields * ​path (`String`): The underlying path string representation. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Hashable`, `Movable`, `PathLike`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Methods ### `__init__` `__init__(out self)` Initializes a path with the current directory. `__init__(out self, path: StringSlice[origin])` Initializes a path with the provided path. **Args:** * ​path (`StringSlice[origin]`): The file system path. `@implicit` `__init__(out self, owned path: String)` Initializes a path with the provided path.
**Args:** * ​path (`String`): The file system path. `@implicit` `__init__(out self, path: StringLiteral[value])` Initializes a path with the provided path. **Args:** * ​path (`StringLiteral[value]`): The file system path. ### `__bool__` `__bool__(self) -> Bool` Checks if the path is not empty. **Returns:** True if the path length is greater than zero, and False otherwise. ### `__eq__` `__eq__(self, other: Self) -> Bool` Returns True if the two paths are equal. **Args:** * ​other (`Self`): The other path to compare against. **Returns:** True if the paths are equal and False otherwise. `__eq__(self, other: StringSlice[origin]) -> Bool` Returns True if the two paths are equal. **Args:** * ​other (`StringSlice[origin]`): The other path to compare against. **Returns:** True if the String and Path are equal, and False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Returns True if the two paths are not equal. **Args:** * ​other (`Self`): The other path to compare against. **Returns:** True if the paths are not equal and False otherwise. ### `__truediv__` `__truediv__(self, suffix: Self) -> Self` Joins two paths using the system-defined path separator. **Args:** * ​suffix (`Self`): The suffix to append to the path. **Returns:** A new path with the suffix appended to the current path. `__truediv__(self, suffix: StringSlice[origin]) -> Self` Joins two paths using the system-defined path separator. **Args:** * ​suffix (`StringSlice[origin]`): The suffix to append to the path. **Returns:** A new path with the suffix appended to the current path. ### `__itruediv__` `__itruediv__(mut self, suffix: StringSlice[origin])` Joins two paths using the system-defined path separator. **Args:** * ​suffix (`StringSlice[origin]`): The suffix to append to the path. ### `copy` `copy(self) -> Self` Copy the object. **Returns:** A copy of the value. ### `__str__` `__str__(self) -> String` Returns a string representation of the path. **Returns:** A string representation of the path. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this path to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__fspath__` `__fspath__(self) -> String` Returns a string representation of the path. **Returns:** A string representation of the path. ### `__repr__` `__repr__(self) -> String` Returns a printable representation of the path. **Returns:** A printable representation of the path. ### `__hash__` `__hash__(self) -> UInt` Hash the underlying path string using builtin hash. **Returns:** An integer value containing the hash of the path string. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with the path string value. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `stat` `stat(self) -> stat_result` Returns the stat information on the path. **Returns:** A stat\_result object containing information about the path. ### `lstat` `lstat(self) -> stat_result` Returns the lstat information on the path. This is similar to stat, but if the file is a symlink then it gives you information about the symlink rather than the target. **Returns:** A stat\_result object containing information about the path. ### `exists` `exists(self) -> Bool` Returns True if the path exists and False otherwise. **Returns:** True if the path exists on disk and False otherwise. 
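For example, a minimal sketch of constructing and checking a path (the `example.txt` file name is hypothetical):

```mojo
from pathlib import Path


def main():
    # Path() starts at the current directory; `/` joins path components.
    var p = Path() / "example.txt"
    if p.exists():
        print(p, "exists")
    else:
        print(p, "does not exist")
```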
### `expanduser` `expanduser(self) -> Self` Expands a prefixed `~` with `$HOME` on POSIX or `$USERPROFILE` on Windows. If environment variables are not set or the `path` is not prefixed with `~`, returns the `path` unmodified. **Returns:** The expanded path. ### `home` `static home() -> Self` Returns `$HOME` on POSIX or `$USERPROFILE` on Windows. If environment variables are not set, it returns `~`. **Returns:** Path to user home directory. ### `is_dir` `is_dir(self) -> Bool` Returns True if the path is a directory and False otherwise. **Returns:** True if the path points to a directory (or a link pointing to a directory). ### `is_file` `is_file(self) -> Bool` Returns True if the path is a file and False otherwise. **Returns:** True if the path points to a file (or a link pointing to a file). ### `read_text` `read_text(self) -> String` Returns content of the file. **Returns:** Contents of file as string. ### `read_bytes` `read_bytes(self) -> List[SIMD[uint8, 1]]` Returns content of the file as bytes. **Returns:** Contents of file as list of bytes. ### `write_text` `write_text[T: Writable](self, value: T)` Writes the value to the file as text. **Parameters:** * ​T (`Writable`): The type of an object conforming to the `Writable` trait. **Args:** * ​value (`T`): The value to write. ### `write_bytes` `write_bytes(self, bytes: Span[SIMD[uint8, 1], origin])` Writes bytes to the file. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The bytes to write to this file. ### `suffix` `suffix(self) -> String` The path's extension, if any. This includes the leading period. For example: '.txt'. If no extension is found, returns the empty string. **Returns:** The path's extension. ### `joinpath` `joinpath(self, *pathsegments: String) -> Self` Joins the Path using the pathsegments. **Args:** * ​\*pathsegments (`String`): The path segments. **Returns:** The path concatenation with the pathsegments using the directory separator. ### `listdir` `listdir(self) -> List[Path]` Gets the list of entries contained in the path provided. **Returns:** The list of entries in the path provided. --- ## cwd `cwd() -> Path` Gets the current directory. **Returns:** The current directory. --- ## path Implements `Path` and related functions. ## Aliases ### `DIR_SEPARATOR` `alias DIR_SEPARATOR = "\\" if os_is_windows() else "/"` The OS-dependent directory separator: `"\\"` on Windows and `"/"` otherwise. ## Structs * [​`Path`](/mojo/stdlib/pathlib/path/Path): The Path object. ## Functions * [​`cwd`](/mojo/stdlib/pathlib/path/cwd): Gets the current directory. --- ## prelude Implements the prelude package. This package provides the public entities that are automatically imported into every Mojo program. --- ## pwd Provides access to user and group information from the password database. Use the [`Passwd`](/mojo/stdlib/pwd/pwd/Passwd) type to access user account information such as user name, ID, group, and home directory. ## Modules * [​`pwd`](/mojo/stdlib/pwd/pwd/): --- ## Passwd `struct Passwd` Represents user account information retrieved from the user password database related to a user ID. ## Fields * ​pw\_name (`String`): User name. * ​pw\_passwd (`String`): User password. * ​pw\_uid (`Int`): User ID. * ​pw\_gid (`Int`): Group ID. * ​pw\_gecos (`String`): Real name or comment field. * ​pw\_dir (`String`): Home directory. * ​pw\_shell (`String`): Shell program.
## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this Passwd struct to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__str__` `__str__(self) -> String` Gets the Passwd struct as a string. **Returns:** A compact string of the Passwd struct. ### `__repr__` `__repr__(self) -> String` Gets the Passwd struct as a string. **Returns:** A compact string representation of Passwd struct. --- ## getpwnam `getpwnam(owned name: String) -> Passwd` Retrieves the password database entry for a given user name. **Constraints:** This function is constrained to run on Linux or macOS operating systems only. **Args:** * ​name (`String`): The name of the user to retrieve the password entry for. **Returns:** An object containing the user's account information, including login name, encrypted password, user ID, group ID, real name, home directory, and shell program. **Raises:** If the user name does not exist or there is an error retrieving the information. --- ## getpwuid `getpwuid(uid: Int) -> Passwd` Retrieve the password database entry for a given user ID. **Constraints:** This function is constrained to run on Linux or macOS operating systems only. **Args:** * ​uid (`Int`): The user ID for which to retrieve the password database entry. **Returns:** An object containing the user's account information, including login name, encrypted password, user ID, group ID, real name, home directory, and shell program. **Raises:** If the user ID does not exist or there is an error retrieving the information. --- ## pwd ## Structs * [​`Passwd`](/mojo/stdlib/pwd/pwd/Passwd): Represents user account information retrieved from the user password database related to a user ID. ## Functions * [​`getpwnam`](/mojo/stdlib/pwd/pwd/getpwnam): Retrieves the password database entry for a given user name. * [​`getpwuid`](/mojo/stdlib/pwd/pwd/getpwuid): Retrieve the password database entry for a given user ID. --- ## PyMojoObject `struct PyMojoObject[T: AnyType]` Storage backing a PyObject\* wrapping a Mojo value. This struct represents the C-level layout of a Python object that contains a wrapped Mojo value. It must be ABI-compatible with CPython's PyObject structure to enable seamless interoperability between Mojo and Python. The struct follows Python's object model where all Python objects begin with a PyObject header (ob\_base), followed by type-specific data. In this case, the type-specific data is a Mojo value of type T. ## Parameters * ​T (`AnyType`): The Mojo type being wrapped. Can be any type that satisfies `AnyType`. ## Fields * ​ob\_base (`PyObject`): The standard Python object header containing reference count and type information. This must be the first field to maintain ABI compatibility with Python's object layout. All Python objects begin with this header structure. * ​mojo\_value (`T`): The actual Mojo value being wrapped and exposed to Python. This field stores the Mojo data that Python code can interact with through the registered type methods and bindings. ## Implemented traits `AnyType`, `UnknownDestructibility` --- ## PythonModuleBuilder `struct PythonModuleBuilder` A builder for creating Python modules with Mojo function and type bindings.
This builder provides a high-level API for declaring Python bindings for Mojo functions and types within a Python module. It manages the registration of functions, types, and their associated metadata, then finalizes everything into a complete Python module object. The builder follows a declarative pattern where you: 1. Create a builder instance with a module name 2. Add function bindings using `def_function()`, `def_py_function()`, `def_py_c_function()` 3. Add type bindings using `add_type[T]()` and configure them 4. Call `finalize()` to finish building the Python module. Example: ```mojo from python.bindings import PythonModuleBuilder var builder = PythonModuleBuilder("my_module") builder.def_function[my_func]("my_func", "Documentation for my_func") _ = builder.add_type[MyType]("MyType").def_method[my_method]("my_method") var module = builder.finalize() ``` Note: After calling `finalize()`, the builder's internal state is cleared and it should not be reused for creating additional modules. TODO: This should be enforced programmatically in the future. ## Fields * ​module (`PythonObject`): The Python module being built. * ​functions (`List[PyMethodDef]`): List of function definitions that will be exposed in the module. * ​type\_builders (`List[PythonTypeBuilder]`): List of type builders for types that will be exposed in the module. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, name: StringSlice[StaticConstantOrigin])` Construct a Python module builder with the given module name. **Args:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the module. **Raises:** If the module creation fails. `__init__(out self, module: PythonObject)` Construct a Python module builder with the given module. **Args:** * ​module (`PythonObject`): The module to build. ### `add_type` `add_type[T: Movable & Defaultable & Representable](mut self, type_name: StringSlice[StaticConstantOrigin]) -> ref [*[0,0].type_builders] PythonTypeBuilder` Add a type to the module and return a builder for it. **Parameters:** * ​T (`Movable & Defaultable & Representable`): The mojo type to bind in the module. **Args:** * ​type\_name (`StringSlice[StaticConstantOrigin]`): The name of the type to expose in the module. **Returns:** A reference to a type builder registered in the module builder. ### `def_py_c_function` `def_py_c_function(mut self, func: fn(PyObjectPtr, PyObjectPtr) -> PyObjectPtr, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PyCFunction signature in the module. **Args:** * ​func (`fn(PyObjectPtr, PyObjectPtr) -> PyObjectPtr`): The function to declare a binding for. * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. ### `def_py_function` `def_py_function[func: fn(mut PythonObject, mut PythonObject) -> PythonObject](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PyFunction signature in the module. **Parameters:** * ​func (`fn(mut PythonObject, mut PythonObject) -> PythonObject`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. 
* ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_py_function[func: fn(mut PythonObject, mut PythonObject) raises -> PythonObject](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PyFunctionRaising signature in the module. **Parameters:** * ​func (`fn(mut PythonObject, mut PythonObject) raises -> PythonObject`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. ### `def_function` `def_function[func_type: AnyTrivialRegType, //, func: PyObjectFunction[func_type, False]](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. These signatures can have any number of positional PythonObject arguments up to 3, can optionally return a PythonObject, and can raise. Example signature types: ```mojo alias F1 = fn (mut PythonObject) raises -> PythonObject alias F2 = fn (mut PythonObject, PythonObject) -> PythonObject alias F3 = fn (mut PythonObject, PythonObject, mut PythonObject) ``` **Parameters:** * ​func\_type (`AnyTrivialRegType`): The type of the function to declare a binding for. * ​func (`PyObjectFunction[func_type, False]`): The function to declare a binding for. Users can pass their function directly, and it will be implicitly converted to a PyObjectFunction if and only if its signature is supported. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. ### `finalize` `finalize(mut self) -> PythonObject` Finalize the module builder, creating the module object. All types and functions added to the builder will be built and exposed in the module. After calling this method, the builder's internal state is cleared and it should not be reused for creating additional modules. **Returns:** The finalized Python module containing all registered functions and types. **Raises:** If the module creation fails or if we fail to add any of the declared functions or types to the module. --- ## PythonTypeBuilder `struct PythonTypeBuilder` A builder for a Python 'type' binding. This is typically used to build a type description of a `PyMojoObject[T]`. This builder is used to declare method bindings for a Python type, and then create the type binding. Finalizing a builder created with `PythonTypeBuilder.bind[T]()` will globally register the resulting Python 'type' object as the single canonical type object for the Mojo type `T`. Subsequent attempts to register a Python type for `T` will raise an exception. Registering a Python type object for `T` is necessary to be able to construct a `PythonObject` from an instance of `T`, or to downcast an existing `PythonObject` to a pointer to the inner `T` value. ## Fields * ​type\_name (`StringSlice[StaticConstantOrigin]`): The name the type will be exposed as in the Python module. * ​basicsize (`Int`): The required allocation size to hold an instance of this type as a Python object. * ​methods (`List[PyMethodDef]`): List of method definitions that will be exposed on the Python type.
## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, type_name: StringSlice[StaticConstantOrigin], *, basicsize: Int)` Construct a new builder for a Python type binding. **Args:** * ​type\_name (`StringSlice[StaticConstantOrigin]`): The name the type will be exposed as in the Python module. * ​basicsize (`Int`): The required allocation size to hold an instance of this type as a Python object. ### `bind` `static bind[T: Movable & Defaultable & Representable](type_name: StringSlice[StaticConstantOrigin]) -> Self` Construct a new builder for a Python type that binds a Mojo type. **Parameters:** * ​T (`Movable & Defaultable & Representable`): The mojo type to bind. **Args:** * ​type\_name (`StringSlice[StaticConstantOrigin]`): The name the type will be exposed as in the Python module. **Returns:** A new type builder instance. ### `finalize` `finalize(mut self, module: PythonObject)` Finalize the builder and add the created type to a Python module. This method completes the type building process by calling the parameterless `finalize()` method to create the Python type object, then automatically adds the resulting type to the specified Python module using the builder's configured type name. After successful completion, the builder's method list is cleared to prevent accidental reuse. This is a convenience method that combines type finalization and module registration in a single operation, which is the most common use case when creating Python-accessible Mojo types. Note: After calling this method, the builder's internal state is modified (methods list is cleared), so the builder should not be reused for creating additional type objects. If you need the type object for further operations, use the parameterless `finalize()` method instead and manually add it to the module. **Args:** * ​module (`PythonObject`): The Python module to which the finalized type will be added. The type will be accessible from Python code that imports this module using the name specified during builder construction. **Raises:** If the type object creation fails (see `finalize()` for details) or if adding the type to the module fails, typically due to name conflicts or module state issues. ### `def_py_c_method` `def_py_c_method(mut self, method: fn(PyObjectPtr, PyObjectPtr) -> PyObjectPtr, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PyObjectPtr signature for the type. **Args:** * ​method (`fn(PyObjectPtr, PyObjectPtr) -> PyObjectPtr`): The method to declare a binding for. * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. ### `def_py_method` `def_py_method[method: fn(mut PythonObject, mut PythonObject) -> PythonObject](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PyObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject, mut PythonObject) -> PythonObject`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. 
* ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. `def_py_method[method: fn(mut PythonObject, mut PythonObject) raises -> PythonObject](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PyObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject, mut PythonObject) raises -> PythonObject`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. ### `def_method` `def_method[method_type: AnyTrivialRegType, //, method: PyObjectFunction[method_type, True]](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PythonObject signature for the type. These signatures can have any number of positional PythonObject arguments up to 3 (including self), can optionally return a PythonObject, and can raise. Example signature types: ```mojo alias F1 = fn (mut PythonObject) raises -> PythonObject alias F2 = fn (mut PythonObject, PythonObject) -> PythonObject alias F3 = fn (mut PythonObject, PythonObject, mut PythonObject) ``` **Parameters:** * ​method\_type (`AnyTrivialRegType`): The type of the method to declare a binding for. * ​method (`PyObjectFunction[method_type, True]`): The method to declare a binding for. Users can pass their function directly, and it will be implicitly converted to a PyObjectFunction if and only if its signature is supported. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. --- ## check_arguments_arity `check_arguments_arity(arity: Int, args: PythonObject)` Validate that the provided arguments match the expected function arity. This function checks if the number of arguments in the provided tuple matches the expected arity for a function call. If the counts don't match, it raises a descriptive error message similar to Python's built-in TypeError messages. **Args:** * ​arity (`Int`): The expected number of arguments for the function. * ​args (`PythonObject`): A tuple containing the actual arguments passed to the function. **Raises:** Error: If the argument count doesn't match the expected arity. The error message follows Python's convention for TypeError messages, indicating whether too few or too many arguments were provided. `check_arguments_arity(arity: Int, args: PythonObject, func_name: StringSlice[origin])` Validate that the provided arguments match the expected function arity. This function checks if the number of arguments in the provided tuple matches the expected arity for a function call. If the counts don't match, it raises a descriptive error message similar to Python's built-in TypeError messages. **Args:** * ​arity (`Int`): The expected number of arguments for the function. * ​args (`PythonObject`): A tuple containing the actual arguments passed to the function. 
* ​func\_name (`StringSlice[origin]`): The name of the function being called, used in error messages to provide better debugging information. **Raises:** Error: If the argument count doesn't match the expected arity. The error message follows Python's convention for TypeError messages, indicating whether too few or too many arguments were provided, along with the specific function name. --- ## bindings ## Aliases ### `MOJO_PYTHON_TYPE_OBJECTS` `alias MOJO_PYTHON_TYPE_OBJECTS = _Global["MOJO_PYTHON_TYPE_OBJECTS", Dict[StringSlice[StaticConstantOrigin], PythonObject], _init_python_type_objects]` Mapping of Mojo type identifiers to the unique `PyTypeObject*` that binds that Mojo type to this CPython interpreter instance. ### `Typed_initproc` `alias Typed_initproc = fn(PyObjectPtr, PythonObject, PyObjectPtr) -> SIMD[int32, 1]` ### `Typed_newfunc` `alias Typed_newfunc = fn(UnsafePointer[PyTypeObject], PythonObject, PyObjectPtr) -> PyObjectPtr` ## Structs * [​`PyMojoObject`](/mojo/stdlib/python/bindings/PyMojoObject): Storage backing a PyObject\* wrapping a Mojo value. * [​`PythonModuleBuilder`](/mojo/stdlib/python/bindings/PythonModuleBuilder): A builder for creating Python modules with Mojo function and type bindings. * [​`PythonTypeBuilder`](/mojo/stdlib/python/bindings/PythonTypeBuilder): A builder for a Python 'type' binding. ## Functions * [​`check_arguments_arity`](/mojo/stdlib/python/bindings/check_arguments_arity): Validate that the provided arguments match the expected function arity. * [​`lookup_py_type_object`](/mojo/stdlib/python/bindings/lookup_py_type_object): Retrieve a reference to the unique Python type describing Python objects containing Mojo values of type `T`. --- ## lookup_py_type_object `lookup_py_type_object[T: AnyType]() -> PythonObject` Retrieve a reference to the unique Python type describing Python objects containing Mojo values of type `T`. This function looks up the Python type object that was previously registered for the Mojo type `T` using a `PythonTypeBuilder`. The returned type object can be used to create Python objects that wrap Mojo values of type `T`. **Parameters:** * ​T (`AnyType`): The Mojo type to look up. **Returns:** A `PythonObject` representing the Python type object that binds the Mojo type `T` to the current CPython interpreter instance. **Raises:** If no `PythonTypeBuilder` was ever finalized for type `T`, or if no Python type object has been registered for the provided type identifier. --- ## python Implements the python package. ## Modules * [​`bindings`](/mojo/stdlib/python/bindings/): * [​`python`](/mojo/stdlib/python/python/): Implements Python interoperability. * [​`python_object`](/mojo/stdlib/python/python_object/): Implements PythonObject. --- ## Python `struct Python` Provides methods that help you use Python code in Mojo. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Default constructor. ### `__copyinit__` `__copyinit__(out self, existing: Self)` Copy constructor. **Args:** * ​existing (`Self`): The existing instance to copy from. ### `cpython` `cpython(self) -> ref [StaticConstantOrigin] CPython` Handle to the low-level C API of the CPython interpreter present in the current process. **Returns:** Handle to the CPython interpreter instance in the current process. ### `eval` `eval(self, owned code: String) -> Bool` Executes the given Python code. **Args:** * ​code (`String`): The Python code to execute.
**Returns:** `True` if the code executed successfully or `False` if the code raised an exception. ### `evaluate` `static evaluate(owned expr: String, file: Bool = False, name: StringSlice[StaticConstantOrigin] = "__main__") -> PythonObject` Executes the given Python code. **Args:** * ​expr (`String`): The Python expression to evaluate. * ​file (`Bool`): Evaluate as a file and return the module. * ​name (`StringSlice[StaticConstantOrigin]`): The name of the module (most relevant if `file` is True). **Returns:** `PythonObject` containing the result of the evaluation. ### `add_to_path` `static add_to_path(dir_path: StringSlice[origin])` Adds a directory to the Python path. This might be necessary to import a Python module via `import_module()`. For example: ```mojo from python import Python # Specify path to `mypython.py` module Python.add_to_path("path/to/module") var mypython = Python.import_module("mypython") var c = mypython.my_algorithm(2, 3) ``` **Args:** * ​dir\_path (`StringSlice[origin]`): The path to a Python module you want to import. ### `import_module` `static import_module(owned module: String) -> PythonObject` Imports a Python module. This provides you with a module object you can use just like you would in Python. For example: ```mojo from python import Python # This is equivalent to Python's `import numpy as np` np = Python.import_module("numpy") a = np.array([1, 2, 3]) ``` **Args:** * ​module (`String`): The Python module name. This module must be visible from the list of available Python paths (you might need to add the module's path with `add_to_path()`). **Returns:** The Python module. ### `create_module` `static create_module(name: StringSlice[StaticConstantOrigin]) -> PythonObject` Creates a Python module using the provided name. TODO: Allow specifying a docstring to attach to the module upon creation, or add one lazily. **Args:** * ​name (`StringSlice[StaticConstantOrigin]`): The Python module name. **Returns:** The Python module. ### `add_functions` `static add_functions(module: PythonObject, owned functions: List[PyMethodDef])` Adds functions to a Python module object. **Args:** * ​module (`PythonObject`): The Python module object. * ​functions (`List[PyMethodDef]`): List of function data. **Raises:** If we fail to add the functions to the module. ### `add_object` `static add_object(module: PythonObject, owned name: String, value: PythonObject)` Add a new object to `module` with the given name and value. The provided object can be any type of Python object: an instance, a type object, a function, etc. The added value will be inserted into the `__dict__` of the provided module. **Args:** * ​module (`PythonObject`): The Python module to modify. * ​name (`String`): The name of the new object. * ​value (`PythonObject`): The Python object value. ### `dict` `static dict[V: PythonConvertible & Copyable & Movable = PythonObject](*, owned **kwargs: V) -> PythonObject` Construct a Python dictionary from keyword arguments. **Parameters:** * ​V (`PythonConvertible & Copyable & Movable`): The type of the values in the dictionary. Must implement the `PythonConvertible`, `Copyable`, and `Movable` traits. **Args:** * ​\*\*kwargs (`V`): The keyword arguments to construct the dictionary with. **Returns:** The constructed Python dictionary. **Raises:** On failure to construct the dictionary or convert the values to Python objects.
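A minimal sketch of the keyword-argument form (the keys and `Int` values here are purely illustrative):

```mojo
from python import Python


def main():
    # Equivalent to Python's dict(a=1, b=2).
    var d = Python.dict(a=1, b=2)
    print(d["a"])  # prints 1
```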
`static dict[K: PythonConvertible & Copyable & Movable = PythonObject, V: PythonConvertible & Copyable & Movable = PythonObject](tuples: Span[Tuple[K, V], origin]) -> PythonObject` Construct a Python dictionary from a list of key-value tuples. **Parameters:** * ​K (`PythonConvertible & Copyable & Movable`): The type of the keys in the dictionary. Must implement the `PythonConvertible`, `Copyable`, and `Movable` traits. * ​V (`PythonConvertible & Copyable & Movable`): The type of the values in the dictionary. Must implement the `PythonConvertible`, `Copyable`, and `Movable` traits. **Args:** * ​tuples (`Span[Tuple[K, V], origin]`): The list of key-value tuples to construct the dictionary with. **Returns:** The constructed Python dictionary. **Raises:** On failure to construct the dictionary or convert the keys or values to Python objects. ### `list` `static list[T: PythonConvertible & Copyable & Movable](values: Span[T, origin]) -> PythonObject` Construct a Python list from a span of values. **Parameters:** * ​T (`PythonConvertible & Copyable & Movable`): The span element type. **Args:** * ​values (`Span[T, origin]`): The values to initialize the list with. **Returns:** A PythonObject representing the list. `static list[*Ts: PythonConvertible & Copyable](owned *values: *Ts) -> PythonObject` Construct a Python list of objects. **Parameters:** * ​\*Ts (`PythonConvertible & Copyable`): The list element types. **Args:** * ​\*values (`*Ts`): The values to initialize the list with. **Returns:** The constructed Python list. ### `tuple` `static tuple[*Ts: PythonConvertible & Copyable](owned *values: *Ts) -> PythonObject` Construct a Python tuple of objects. **Parameters:** * ​\*Ts (`PythonConvertible & Copyable`): The tuple element types. **Args:** * ​\*values (`*Ts`): The values to initialize the tuple with. **Returns:** The constructed Python tuple. ### `as_string_slice` `as_string_slice(self, str_obj: PythonObject) -> StringSlice[MutableAnyOrigin]` Return a string representing the given Python object. **Args:** * ​str\_obj (`PythonObject`): The Python object. **Returns:** Mojo string representing the given Python object. ### `type` `static type(obj: PythonObject) -> PythonObject` Return the type of this PythonObject. **Args:** * ​obj (`PythonObject`): The PythonObject we want the type of. **Returns:** A PythonObject that holds the type object. ### `none` `static none() -> PythonObject` Get a `PythonObject` representing `None`. **Returns:** `PythonObject` representing `None`. ### `str` `static str(obj: PythonObject) -> PythonObject` Convert a PythonObject to a Python `str`. **Args:** * ​obj (`PythonObject`): The PythonObject to convert. **Returns:** A Python `str` object. **Raises:** An error if the conversion failed. ### `int` `static int(obj: PythonObject) -> PythonObject` Convert a PythonObject to a Python `int` (i.e. arbitrary precision integer). **Args:** * ​obj (`PythonObject`): The PythonObject to convert. **Returns:** A PythonObject representing the result of the conversion to `int`. **Raises:** If the conversion to `int` fails. ### `float` `static float(obj: PythonObject) -> PythonObject` Convert a PythonObject to a Python `float` object. **Args:** * ​obj (`PythonObject`): The PythonObject to convert. **Returns:** A Python `float` object. **Raises:** If the conversion fails. ### `py_long_as_ssize_t` `static py_long_as_ssize_t(obj: PythonObject) -> Int` Get the value of a Python `long` object. **Args:** * ​obj (`PythonObject`): The Python `long` object.
**Returns:** The value of the `long` object as a `Py_ssize_t`. **Raises:** If `obj` is not a Python `long` object, or if the `long` object value overflows `Py_ssize_t`. ### `is_true` `static is_true(obj: PythonObject) -> Bool` Check if the PythonObject is truthy. **Args:** * ​obj (`PythonObject`): The PythonObject to check. **Returns:** True if the PythonObject is truthy and False otherwise. **Raises:** If the boolean value of the PythonObject cannot be determined. --- ## python Implements Python interoperability. You can import these APIs from the `python` package. For example: ```mojo from python import Python ``` ## Structs * [​`Python`](/mojo/stdlib/python/python/Python): Provides methods that help you use Python code in Mojo. --- ## ConvertibleFromPython Denotes a type that can attempt construction from a read-only Python object. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self: _Self, obj: PythonObject)` Attempt to construct an instance of this object from a read-only Python value. **Args:** * ​obj (`PythonObject`): The Python object to convert from. **Raises:** If conversion was not successful. ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. --- ## PythonConvertible A trait that indicates a type can be converted to a PythonObject, and that specifies the behavior with a `to_python_object` method. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `to_python_object` `to_python_object(owned self: _Self) -> PythonObject` Convert a value to a PythonObject. **Returns:** A PythonObject representing the value. **Raises:** If the conversion to a PythonObject failed. --- ## PythonObject `@register_passable` `struct PythonObject` A Python object. ## Fields * ​py\_object (`PyObjectPtr`): A pointer to the underlying Python object. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `Movable`, `PythonConvertible`, `SizedRaising`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Initialize the object with a `None` value. `__init__(*, from_owned_ptr: PyObjectPtr) -> Self` Initialize this object from an owned reference-counted Python object pointer. Ownership of the reference will be assumed by `PythonObject`. **Args:** * ​from\_owned\_ptr (`PyObjectPtr`): The `PyObjectPtr` to take ownership of. `__init__(*, from_borrowed_ptr: PyObjectPtr) -> Self` Initialize this object from a read-only reference-counted Python object pointer. The reference count of the pointee object will be incremented, and ownership of the additional reference count will be assumed by the initialized `PythonObject`. The CPython API documentation indicates the ownership semantics of the returned object on any function that returns a `PyObject*` value. The two possible annotations are: * "Return value: New reference." * "Return value: Borrowed reference." This function should be used to construct a `PythonObject` from the pointer returned by 'Borrowed reference'-type functions. **Args:** * ​from\_borrowed\_ptr (`PyObjectPtr`): A read-only reference counted pointer to a Python object. **Returns:** An owned PythonObject pointer.
`__init__[T: Movable](out self, *, owned alloc: T)` Allocate a new `PythonObject` and store a Mojo value in it. The newly allocated Python object will contain the provided Mojo `T` instance directly, without attempting conversion to an equivalent Python builtin type. Only Mojo types that have a registered Python 'type' object can be stored as a Python object. Mojo types are registered using a `PythonTypeBuilder`. **Parameters:** * ​T (`Movable`): The Mojo type of the value that the resulting Python object holds. **Args:** * ​alloc (`T`): The Mojo value to store in the new Python object. **Raises:** If no Python type object has been registered for `T` by a `PythonTypeBuilder`. `@implicit` `__init__(none: NoneType) -> Self` Initialize a none value object from a `None` literal. **Args:** * ​none (`NoneType`): None. `@implicit` `__init__(value: Bool) -> Self` Initialize the object from a bool. **Args:** * ​value (`Bool`): The boolean value. `@implicit` `__init__(integer: Int) -> Self` Initialize the object with an integer value. **Args:** * ​integer (`Int`): The integer value. `@implicit` `__init__[dtype: DType](value: SIMD[dtype, 1]) -> Self` Initialize the object with a generic scalar value. If the scalar value type is bool, it is converted to a boolean. Otherwise, it is converted to the appropriate integer or floating point type. **Parameters:** * ​dtype (`DType`): The scalar value type. **Args:** * ​value (`SIMD[dtype, 1]`): The scalar value. `@implicit` `__init__(out self, value: StringLiteral[value])` Initialize the object from a string literal. **Args:** * ​value (`StringLiteral[value]`): The string value. `@implicit` `__init__(out self, value: String)` Initialize the object from a string. **Args:** * ​value (`String`): The string value. `@implicit` `__init__(out self, string: StringSlice[origin])` Initialize the object from a string. **Args:** * ​string (`StringSlice[origin]`): The string value. **Raises:** If the string is not valid UTF-8. `@implicit` `__init__(slice: Slice) -> Self` Initialize the object from a Mojo Slice. **Args:** * ​slice (`Slice`): The slice value. `__init__[*Ts: PythonConvertible & Copyable](out self, owned *values: *Ts, *, __list_literal__: Tuple[])` Construct a Python list of objects. **Parameters:** * ​\*Ts (`PythonConvertible & Copyable`): The types of the input values. **Args:** * ​\*values (`*Ts`): The values to initialize the list with. * ​**list\_literal** (`Tuple[]`): Tell Mojo to use this method for list literals. **Returns:** The constructed Python list. `__init__[*Ts: PythonConvertible & Copyable](out self, owned *values: *Ts, *, __set_literal__: Tuple[])` Construct a Python set of objects. **Parameters:** * ​\*Ts (`PythonConvertible & Copyable`): The types of the input values. **Args:** * ​\*values (`*Ts`): The values to initialize the set with. * ​**set\_literal** (`Tuple[]`): Tell Mojo to use this method for set literals. **Returns:** The constructed Python set. `__init__(out self, owned keys: List[PythonObject], owned values: List[PythonObject], __dict_literal__: Tuple[])` Construct a Python dictionary from a list of keys and a list of values. **Args:** * ​keys (`List[PythonObject]`): The keys of the dictionary. * ​values (`List[PythonObject]`): The values of the dictionary. * ​**dict\_literal** (`Tuple[]`): Tell Mojo to use this method for dict literals. ### `__copyinit__` `__copyinit__(existing: Self) -> Self` Copy the object. This increments the underlying refcount of the existing object.
**Args:** * ​existing (`Self`): The value to copy. ### `__del__` `__del__(owned self)` Destroy the object. This decrements the underlying refcount of the pointed-to object. ### `__bool__` `__bool__(self) -> Bool` Evaluate the boolean value of the object. **Returns:** Whether the object evaluates as true. ### `__getitem__` `__getitem__(self, *args: Self) -> Self` Return the value for the given key or keys. **Args:** * ​\*args (`Self`): The key or keys to access on this object. **Returns:** The value corresponding to the given key for this object. `__getitem__(self, *args: Slice) -> Self` Return the sliced value for the given Slice or Slices. **Args:** * ​\*args (`Slice`): The Slice or Slices to apply to this object. **Returns:** The sliced value corresponding to the given Slice(s) for this object. ### `__setitem__` `__setitem__(self, *args: Self, *, value: Self)` Set the value with the given key or keys. **Args:** * ​\*args (`Self`): The key or keys to set on this object. * ​value (`Self`): The value to set. ### `__neg__` `__neg__(self) -> Self` Negative. Calls the underlying object's `__neg__` method. **Returns:** The result of prefixing this object with a `-` operator. For most numerical objects, this returns the negative. ### `__pos__` `__pos__(self) -> Self` Positive. Calls the underlying object's `__pos__` method. **Returns:** The result of prefixing this object with a `+` operator. For most numerical objects, this does nothing. ### `__invert__` `__invert__(self) -> Self` Inversion. Calls the underlying object's `__invert__` method. **Returns:** The logical inverse of this object: a bitwise representation where all bits are flipped, from zero to one, and from one to zero. ### `__lt__` `__lt__(self, rhs: Self) -> Self` Less than (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. **Raises:** If the object doesn't implement the `__lt__` method, or if it fails. ### `__le__` `__le__(self, rhs: Self) -> Self` Less than or equal (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. **Raises:** If the object doesn't implement the `__le__` method, or if it fails. ### `__eq__` `__eq__(self, rhs: Self) -> Self` Equality (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. **Raises:** If the object doesn't implement the `__eq__` method, or if it fails. ### `__ne__` `__ne__(self, rhs: Self) -> Self` Inequality (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. **Raises:** If the object doesn't implement the `__ne__` method, or if it fails. ### `__gt__` `__gt__(self, rhs: Self) -> Self` Greater than (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. **Raises:** If the object doesn't implement the `__gt__` method, or if it fails. ### `__ge__` `__ge__(self, rhs: Self) -> Self` Greater than or equal (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. 
**Raises:** If the object doesn't implement the `__ge__` method, or if it fails. ### `__is__` `__is__(self, other: Self) -> Bool` Test if the PythonObject is the `other` PythonObject, the same as `x is y` in Python. **Args:** * ​other (`Self`): The right-hand-side value in the comparison. **Returns:** True if they are the same object and False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Test if the PythonObject is not the `other` PythonObject, the same as `x is not y` in Python. **Args:** * ​other (`Self`): The right-hand-side value in the comparison. **Returns:** True if they are not the same object and False otherwise. ### `__contains__` `__contains__(self, rhs: Self) -> Bool` Contains dunder. Calls the underlying object's `__contains__` method. **Args:** * ​rhs (`Self`): Right hand value. **Returns:** True if rhs is in self. ### `__add__` `__add__(self, rhs: Self) -> Self` Addition and concatenation. Calls the underlying object's `__add__` method. **Args:** * ​rhs (`Self`): Right hand value. **Returns:** The sum or concatenated values. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Subtraction. Calls the underlying object's `__sub__` method. **Args:** * ​rhs (`Self`): Right hand value. **Returns:** The difference. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Multiplication. Calls the underlying object's `__mul__` method. **Args:** * ​rhs (`Self`): Right hand value. **Returns:** The product. ### `__truediv__` `__truediv__(self, rhs: Self) -> Self` Division. Calls the underlying object's `__truediv__` method. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is divided. **Returns:** The result of dividing this object by the right-hand-side value. ### `__floordiv__` `__floordiv__(self, rhs: Self) -> Self` Return the division of self and rhs rounded down to the nearest integer. Calls the underlying object's `__floordiv__` method. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is divided. **Returns:** The result of dividing this by the right-hand-side value, rounded down to the nearest integer. ### `__mod__` `__mod__(self, rhs: Self) -> Self` Return the remainder of self divided by rhs. Calls the underlying object's `__mod__` method. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** The remainder of dividing self by rhs. ### `__pow__` `__pow__(self, exp: Self) -> Self` Raises this object to the power of the given value. **Args:** * ​exp (`Self`): The exponent. **Returns:** The result of raising this to the given exponent. ### `__lshift__` `__lshift__(self, rhs: Self) -> Self` Bitwise left shift. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is bitwise shifted to the left. **Returns:** This value, shifted left by the given value. ### `__rshift__` `__rshift__(self, rhs: Self) -> Self` Bitwise right shift. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is bitwise shifted to the right. **Returns:** This value, shifted right by the given value. ### `__and__` `__and__(self, rhs: Self) -> Self` Bitwise AND. **Args:** * ​rhs (`Self`): The right-hand-side value with which this object is bitwise AND'ed. **Returns:** The bitwise AND result of this and the given value. ### `__or__` `__or__(self, rhs: Self) -> Self` Bitwise OR. **Args:** * ​rhs (`Self`): The right-hand-side value with which this object is bitwise OR'ed. **Returns:** The bitwise OR result of this and the given value. ### `__xor__` `__xor__(self, rhs: Self) -> Self` Exclusive OR.
**Args:** * ​rhs (`Self`): The right-hand-side value with which this object is exclusive OR'ed. **Returns:** The exclusive OR result of this and the given value. ### `__radd__` `__radd__(self, lhs: Self) -> Self` Reverse addition and concatenation. Calls the underlying object's `__radd__` method. **Args:** * ​lhs (`Self`): The left-hand-side value to which this object is added or concatenated. **Returns:** The sum. ### `__rsub__` `__rsub__(self, lhs: Self) -> Self` Reverse subtraction. Calls the underlying object's `__rsub__` method. **Args:** * ​lhs (`Self`): The left-hand-side value from which this object is subtracted. **Returns:** The result of subtracting this from the given value. ### `__rmul__` `__rmul__(self, lhs: Self) -> Self` Reverse multiplication. Calls the underlying object's `__rmul__` method. **Args:** * ​lhs (`Self`): The left-hand-side value that is multiplied by this object. **Returns:** The product of the multiplication. ### `__rtruediv__` `__rtruediv__(self, lhs: Self) -> Self` Reverse division. Calls the underlying object's `__rtruediv__` method. **Args:** * ​lhs (`Self`): The left-hand-side value that is divided by this object. **Returns:** The result of dividing the given value by this. ### `__rfloordiv__` `__rfloordiv__(self, lhs: Self) -> Self` Reverse floor division. Calls the underlying object's `__rfloordiv__` method. **Args:** * ​lhs (`Self`): The left-hand-side value that is divided by this object. **Returns:** The result of dividing the given value by this, rounded down to the nearest integer. ### `__rmod__` `__rmod__(self, lhs: Self) -> Self` Reverse modulo. Calls the underlying object's `__rmod__` method. **Args:** * ​lhs (`Self`): The left-hand-side value that is divided by this object. **Returns:** The remainder from dividing the given value by this. ### `__rpow__` `__rpow__(self, lhs: Self) -> Self` Reverse power of. **Args:** * ​lhs (`Self`): The number that is raised to the power of this object. **Returns:** The result of raising the given value by this exponent. ### `__rlshift__` `__rlshift__(self, lhs: Self) -> Self` Reverse bitwise left shift. **Args:** * ​lhs (`Self`): The left-hand-side value that is bitwise shifted to the left by this object. **Returns:** The given value, shifted left by this. ### `__rrshift__` `__rrshift__(self, lhs: Self) -> Self` Reverse bitwise right shift. **Args:** * ​lhs (`Self`): The left-hand-side value that is bitwise shifted to the right by this object. **Returns:** The given value, shifted right by this. ### `__rand__` `__rand__(self, lhs: Self) -> Self` Reverse bitwise AND. **Args:** * ​lhs (`Self`): The left-hand-side value that is bitwise AND'ed with this object. **Returns:** The bitwise AND result of the given value and this. ### `__ror__` `__ror__(self, lhs: Self) -> Self` Reverse bitwise OR. **Args:** * ​lhs (`Self`): The left-hand-side value that is bitwise OR'ed with this object. **Returns:** The bitwise OR result of the given value and this. ### `__rxor__` `__rxor__(self, lhs: Self) -> Self` Reverse exclusive OR. **Args:** * ​lhs (`Self`): The left-hand-side value that is exclusive OR'ed with this object. **Returns:** The exclusive OR result of the given value and this. ### `__iadd__` `__iadd__(mut self, rhs: Self)` In-place addition and concatenation. **Args:** * ​rhs (`Self`): The right-hand-side value that is added to this object. ### `__isub__` `__isub__(mut self, rhs: Self)` In-place subtraction. **Args:** * ​rhs (`Self`): The right-hand-side value that is subtracted from this object.
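Taken together, the binary, reflected, and in-place operators above let a `PythonObject` participate in ordinary Mojo expressions, with each operator forwarded to the matching dunder method on the wrapped Python object. A minimal sketch; the values and the results shown in the comments are illustrative only:

```mojo
from python import PythonObject

def main():
    var a = PythonObject(10)
    var b = PythonObject(3)

    # Binary operators forward to the Python dunder methods.
    print(a + b)   # __add__      -> 13
    print(a // b)  # __floordiv__ -> 3
    print(a % b)   # __mod__      -> 1

    # In-place operators forward to __iadd__, __isub__, and friends.
    a += b
    print(a)       # 13

    # Rich comparisons return Python objects, not necessarily booleans.
    print(a < b)   # __lt__ -> False
```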
### `__imul__` `__imul__(mut self, rhs: Self)` In-place multiplication. Calls the underlying object's `__imul__` method. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is multiplied. ### `__itruediv__` `__itruediv__(mut self, rhs: Self)` In-place division. **Args:** * ​rhs (`Self`): The value by which this object is divided. ### `__ifloordiv__` `__ifloordiv__(mut self, rhs: Self)` In-place floor division. **Args:** * ​rhs (`Self`): The value by which this object is divided. ### `__imod__` `__imod__(mut self, rhs: Self)` In-place modulo. **Args:** * ​rhs (`Self`): The right-hand-side value that is used to divide this object. ### `__ipow__` `__ipow__(mut self, rhs: Self)` In-place exponentiation. **Args:** * ​rhs (`Self`): The exponent. ### `__ilshift__` `__ilshift__(mut self, rhs: Self)` In-place bitwise left shift. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is bitwise shifted to the left. ### `__irshift__` `__irshift__(mut self, rhs: Self)` In-place bitwise right shift. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is bitwise shifted to the right. ### `__iand__` `__iand__(mut self, rhs: Self)` In-place bitwise AND. **Args:** * ​rhs (`Self`): The right-hand-side value with which this object is bitwise AND'ed. ### `__ixor__` `__ixor__(mut self, rhs: Self)` In-place exclusive OR. **Args:** * ​rhs (`Self`): The right-hand-side value with which this object is exclusive OR'ed. ### `__ior__` `__ior__(mut self, rhs: Self)` In-place bitwise OR. **Args:** * ​rhs (`Self`): The right-hand-side value with which this object is bitwise OR'ed. ### `copy` `copy(self) -> Self` Copy the object. **Returns:** A copy of the value. ### `__iter__` `__iter__(self) -> _PyIter` Iterate over the object. **Returns:** An iterator object. **Raises:** If the object is not iterable. ### `__getattr__` `__getattr__(self, owned name: String) -> Self` Return the value of the object attribute with the given name. **Args:** * ​name (`String`): The name of the object attribute to return. **Returns:** The value of the object attribute with the given name. ### `__setattr__` `__setattr__(self, owned name: String, new_value: Self)` Set the given value for the object attribute with the given name. **Args:** * ​name (`String`): The name of the object attribute to set. * ​new\_value (`Self`): The new value to be set for that attribute. ### `__call__` `__call__(self, *args: Self, *, owned **kwargs: Self) -> Self` Call the underlying object as if it were a function. **Args:** * ​\*args (`Self`): Positional arguments to the function. * ​\*\*kwargs (`Self`): Keyword arguments to the function. **Returns:** The return value from the called object. **Raises:** If the function cannot be called for any reason. ### `__len__` `__len__(self) -> Int` Returns the length of the object. **Returns:** The length of the object. ### `__hash__` `__hash__(self) -> Int` Returns the hash value of the object. **Returns:** The hash value of the object. ### `__int__` `__int__(self) -> Self` Convert the PythonObject to a Python `int` (i.e. arbitrary precision integer). **Returns:** A Python `int` object. **Raises:** An error if the conversion failed. ### `__float__` `__float__(self) -> Self` Convert the PythonObject to a Python `float` object. **Returns:** A Python `float` object. **Raises:** If the conversion fails. ### `__str__` `__str__(self) -> Self` Convert the PythonObject to a Python `str`. **Returns:** A Python `str` object.
**Raises:** An error if the conversion failed. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this Python object to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `to_python_object` `to_python_object(owned self) -> Self` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. ### `unsafe_as_py_object_ptr` `unsafe_as_py_object_ptr(self) -> PyObjectPtr` Get the underlying PyObject pointer. Safety: Use-after-free: The caller must take care that `self` outlives the usage of the pointer returned by this function. **Returns:** The underlying PyObject pointer. ### `steal_data` `steal_data(owned self) -> PyObjectPtr` Take ownership of the underlying pointer from the Python object. **Returns:** The underlying data. ### `unsafe_get_as_pointer` `unsafe_get_as_pointer[dtype: DType](self) -> UnsafePointer[SIMD[dtype, 1]]` Reinterpret a Python integer as a Mojo pointer. Warning: converting from an integer to a pointer is unsafe! The compiler assumes the resulting pointer DOES NOT alias any Mojo-derived pointer. This is OK if the pointer originates from and is owned by Python, e.g. the data underpinning a torch tensor. **Parameters:** * ​dtype (`DType`): The desired DType of the pointer. **Returns:** An `UnsafePointer` for the underlying Python data. ### `downcast_value_ptr` `downcast_value_ptr[T: AnyType](self, *, func: Optional[StringSlice[StaticConstantOrigin]] = Optional(None)) -> UnsafePointer[T]` Get a pointer to the expected contained Mojo value of type `T`. This method validates that this object actually contains an instance of `T`, and will raise an error if it does not. Mojo values are stored as Python objects backed by the `PyMojoObject[T]` struct. **Parameters:** * ​T (`AnyType`): The type of the Mojo value that this Python object is expected to contain. **Args:** * ​func (`Optional[StringSlice[StaticConstantOrigin]]`): Optional name of bound Mojo function that the raised TypeError should reference if downcasting fails. **Returns:** A pointer to the inner Mojo value. **Raises:** If the Python object does not contain an instance of the Mojo `T` type. ### `unchecked_downcast_value_ptr` `unchecked_downcast_value_ptr[T: AnyType](self) -> UnsafePointer[T]` Get a pointer to the expected Mojo value of type `T`. This function assumes that this Python object was allocated as an instance of `PyMojoObject[T]`. # Safety The user must be certain that this Python object type matches the bound Python type object for `T`. **Parameters:** * ​T (`AnyType`): The type of the Mojo value stored in this object. **Returns:** A pointer to the inner Mojo value. --- ## python_object Implements PythonObject. You can import these APIs from the `python` package. For example: ```mojo from python import PythonObject ``` ## Aliases ### `PyFunction` `alias PyFunction = fn(mut PythonObject, mut PythonObject) -> PythonObject` ### `PyFunctionRaising` `alias PyFunctionRaising = fn(mut PythonObject, mut PythonObject) raises -> PythonObject` ## Structs * [​`PythonObject`](/mojo/stdlib/python/python_object/PythonObject): A Python object. ## Traits * [​`ConvertibleFromPython`](/mojo/stdlib/python/python_object/ConvertibleFromPython): Denotes a type that can attempt construction from a read-only Python object. 
* [​`PythonConvertible`](/mojo/stdlib/python/python_object/PythonConvertible): A trait that indicates a type can be converted to a PythonObject, and that specifies the behavior with a `to_python_object` method. --- ## random Implements the random package. ## Modules * [​`random`](/mojo/stdlib/random/random/): Provides functions for random numbers. --- ## random Provides functions for random numbers. You can import these APIs from the `random` package. For example:

```mojo
from random import seed
```

## Functions * [​`rand`](/mojo/stdlib/random/random/rand): Fills memory with random values from a uniform distribution. * [​`randint`](/mojo/stdlib/random/random/randint): Fills memory with uniform random values in the range \[low, high]. * [​`randn`](/mojo/stdlib/random/random/randn): Fills memory with random values from a Normal(mean, standard\_deviation) distribution. * [​`randn_float64`](/mojo/stdlib/random/random/randn_float64): Returns a random `Float64` sampled from a Normal(mean, standard\_deviation) distribution. * [​`random_float64`](/mojo/stdlib/random/random/random_float64): Returns a random `Float64` number from the given range. * [​`random_si64`](/mojo/stdlib/random/random/random_si64): Returns a random `Int64` number from the given range. * [​`random_ui64`](/mojo/stdlib/random/random/random_ui64): Returns a random `UInt64` number from the given range. * [​`seed`](/mojo/stdlib/random/random/seed): Seeds the random number generator using the current time. * [​`shuffle`](/mojo/stdlib/random/random/shuffle): Shuffles the elements of the list randomly. --- ## rand `rand[dtype: DType](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], size: Int, /, *, min: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](0), max: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1), int_scale: Optional[Int] = Optional(None))` Fills memory with random values from a uniform distribution. **Parameters:** * ​dtype (`DType`): The dtype of the pointer. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The pointer to the memory area to fill. * ​size (`Int`): The number of elements to fill. * ​min (`SIMD[float64, 1]`): The minimum value of the random range. * ​max (`SIMD[float64, 1]`): The maximum value of the random range. * ​int\_scale (`Optional[Int]`): The scale for error checking (float type only). --- ## randint `randint[dtype: DType](ptr: UnsafePointer[SIMD[dtype, 1]], size: Int, low: Int, high: Int)` Fills memory with uniform random values in the range \[low, high]. **Constraints:** The type should be integral. **Parameters:** * ​dtype (`DType`): The dtype of the pointer. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1]]`): The pointer to the memory area to fill. * ​size (`Int`): The number of elements to fill. * ​low (`Int`): The minimum value of the range. * ​high (`Int`): The maximum value of the range.
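Before moving on to the normal-distribution APIs below, here is a small sketch of the uniform APIs above (`seed`, `random_float64`, and `randint`); the seed value and buffer size are arbitrary, and the exact values printed depend on the generator:

```mojo
from memory import UnsafePointer
from random import randint, random_float64, seed

def main():
    # Seed with a fixed value so runs are reproducible.
    seed(42)

    # A single uniform Float64 from the default range.
    print(random_float64())

    # Fill a small heap buffer with uniform integers in [0, 10].
    alias n = 4
    var ptr = UnsafePointer[Int64].alloc(n)
    randint(ptr, n, 0, 10)
    for i in range(n):
        print(ptr[i])
    ptr.free()
```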
--- ## randn `randn[dtype: DType](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], size: Int, mean: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](0), standard_deviation: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1))` Fills memory with random values from a Normal(mean, standard\_deviation) distribution. **Constraints:** The type should be floating point. **Parameters:** * ​dtype (`DType`): The dtype of the pointer. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The pointer to the memory area to fill. * ​size (`Int`): The number of elements to fill. * ​mean (`SIMD[float64, 1]`): Normal distribution mean. * ​standard\_deviation (`SIMD[float64, 1]`): Normal distribution standard deviation. --- ## randn_float64 `randn_float64(mean: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](0), standard_deviation: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1)) -> SIMD[float64, 1]` Returns a random `Float64` sampled from a Normal(mean, standard\_deviation) distribution. **Args:** * ​mean (`SIMD[float64, 1]`): Normal distribution mean. * ​standard\_deviation (`SIMD[float64, 1]`): Normal distribution standard deviation. **Returns:** A random `Float64` sampled from Normal(mean, standard\_deviation). --- ## random_float64 `random_float64(min: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](0), max: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](1)) -> SIMD[float64, 1]` Returns a random `Float64` number from the given range. **Args:** * ​min (`SIMD[float64, 1]`): The minimum number in the range (default is 0.0). * ​max (`SIMD[float64, 1]`): The maximum number in the range (default is 1.0). **Returns:** A random number from the specified range. --- ## random_si64 `random_si64(min: SIMD[int64, 1], max: SIMD[int64, 1]) -> SIMD[int64, 1]` Returns a random `Int64` number from the given range. **Args:** * ​min (`SIMD[int64, 1]`): The minimum number in the range. * ​max (`SIMD[int64, 1]`): The maximum number in the range. **Returns:** A random number from the specified range. --- ## random_ui64 `random_ui64(min: SIMD[uint64, 1], max: SIMD[uint64, 1]) -> SIMD[uint64, 1]` Returns a random `UInt64` number from the given range. **Args:** * ​min (`SIMD[uint64, 1]`): The minimum number in the range. * ​max (`SIMD[uint64, 1]`): The maximum number in the range. **Returns:** A random number from the specified range. --- ## seed `seed()` Seeds the random number generator using the current time. `seed(a: Int)` Seeds the random number generator using the value provided. **Args:** * ​a (`Int`): The seed value. --- ## shuffle `shuffle[T: Copyable & Movable, //](mut list: List[T])` Shuffles the elements of the list randomly. Performs an in-place Fisher-Yates shuffle on the provided list. **Parameters:** * ​T (`Copyable & Movable`): The type of element in the List. **Args:** * ​list (`List[T]`): The list to modify. --- ## DeviceContextPtr `@register_passable(trivial)` `struct DeviceContextPtr` Exposes a pointer to a C++ DeviceContext to Mojo. Note: When initializing a `DeviceContext` from a pointer, the refcount is not incremented. This is considered safe because `get_device_context()` is only used within kernels and the `DeviceContext` lifetime is managed by the graph compiler. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Initialize an empty `DeviceContextPtr` with a null pointer. This creates a `DeviceContextPtr` that doesn't point to any device context. `@implicit` `__init__(handle: UnsafePointer[NoneType]) -> Self` Initialize a `DeviceContextPtr` from a raw pointer. **Args:** * ​handle (`UnsafePointer[NoneType]`): A raw pointer to a C++ `DeviceContext`. `@implicit` `__init__(device: DeviceContext) -> Self` Initialize a DeviceContextPtr from a `DeviceContext`.
This constructor allows implicit conversion from `DeviceContext` to `DeviceContextPtr`. **Args:** * ​device (`DeviceContext`): The `DeviceContext` to wrap in this pointer. ### `__getitem__` `__getitem__(self) -> DeviceContext` Dereference the pointer to get the `DeviceContext`. **Returns:** The `DeviceContext` that this pointer points to. ### `get_device_context` `get_device_context(self) -> DeviceContext` Get the `DeviceContext` that this pointer points to. This is an alias for the dereference operator. **Returns:** The `DeviceContext` that this pointer points to. --- ## DeviceContextPtrList `@register_passable(trivial)` `struct DeviceContextPtrList[size: Int]` A fixed-size collection of `DeviceContextPtr` objects. This struct provides a lightweight, register-passable container for a fixed number of `DeviceContextPtr` objects, with array-like access semantics. ## Parameters * ​size (`Int`): The fixed number of `DeviceContextPtr` objects in the collection. ## Fields * ​ptrs (`StaticTuple[DeviceContextPtr, size]`): The underlying storage for the device context pointers. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(ptrs: StaticTuple[DeviceContextPtr, size]) -> Self` Initialize with a StaticTuple of `DeviceContextPtr` objects. **Args:** * ​ptrs (`StaticTuple[DeviceContextPtr, size]`): A StaticTuple containing the `DeviceContextPtr` objects to store. ### `__getitem__` `__getitem__[index: Int](self) -> DeviceContext` Access a `DeviceContext` at a compile-time known index. **Parameters:** * ​index (`Int`): A compile-time integer index. **Returns:** The `DeviceContext` at the specified index. `__getitem__[I: Indexer, //](self, idx: I) -> DeviceContext` Access a `DeviceContext` using a runtime index value. **Parameters:** * ​I (`Indexer`): A type that conforms to the `Indexer` trait. **Args:** * ​idx (`I`): A runtime index value that conforms to the Indexer trait. **Returns:** The `DeviceContext` at the specified index. ### `__len__` `__len__(self) -> Int` Get the number of `DeviceContextPtr` objects in the collection. **Returns:** The size of the collection as specified by the size parameter. --- ## Task `struct Task[type: AnyType, origins: origin.set]` Represents an asynchronous task that will produce a value of the specified type. A Task encapsulates a coroutine that is executing asynchronously and will eventually produce a result. Tasks can be awaited in async functions or waited on in synchronous code. ## Parameters * ​type (`AnyType`): The type of value that this task will produce when completed. * ​origins (`origin.set`): The set of origins for the coroutine wrapped by this task. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, owned handle: Coroutine[type, origins])` Initialize a task with a coroutine. Takes ownership of the provided coroutine and sets up the task to receive its result when completed. **Args:** * ​handle (`Coroutine[type, origins]`): The coroutine to execute as a task. Ownership is transferred. ### `__del__` `__del__(owned self)` Destroy the memory associated with a task. This must be manually called when a task goes out of scope. ### `__await__` `__await__(self) -> ref [*[0,0]._result] type` Suspend the current async function until the task completes and its result becomes available. This function must be force inlined into the calling async function. 
This method enables the use of the 'await' keyword with Task objects in async functions. **Returns:** A reference to the result value produced by the task. ### `get` `get(self) -> ref [*[0,0]._result] type` Get the task's result value. Calling this on an incomplete task is undefined behavior. **Returns:** A reference to the result value produced by the task. ### `wait` `wait(self) -> ref [*[0,0]._result] type` Block the current thread until the future value becomes available. This method is used in synchronous code to wait for an asynchronous task to complete. Unlike `__await__`, this method does not suspend the current coroutine but instead blocks the entire thread. **Returns:** A reference to the result value produced by the task. --- ## TaskGroup `struct TaskGroup` A group of tasks that can be executed concurrently. TaskGroup manages a collection of coroutines that can be executed in parallel. It provides mechanisms to create, track, and wait for the completion of tasks. ## Fields * ​counter (`Atomic[index]`): Atomic counter tracking the number of active tasks in the group. * ​chain (`_Chain`): Chain used for asynchronous completion notification. * ​tasks (`List[_TaskGroupBox]`): Collection of tasks managed by this TaskGroup. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initialize a new TaskGroup with an empty task list and initialized chain. ### `__del__` `__del__(owned self)` Clean up resources associated with the TaskGroup. ### `__await__` `__await__(mut self)` Make TaskGroup awaitable in async contexts. This allows using 'await task\_group' syntax in async functions. ### `create_task` `create_task(mut self, owned task: Coroutine[None, origins])` Add a new task to the TaskGroup for execution. **Args:** * ​task (`Coroutine[None, origins]`): The coroutine to be executed as a task. ### `await_body_impl` `static await_body_impl(hdl: !co.routine, mut task_group: Self)` Implementation of the await functionality for TaskGroup. **Args:** * ​hdl (`!co.routine`): The coroutine handle to be awaited. * ​task\_group (`Self`): The TaskGroup to be awaited. ### `wait` `wait[origins: origin.set = {}](mut self)` Wait for all tasks in the `TaskGroup` to complete. This is a blocking call that returns only when all tasks have finished. **Parameters:** * ​origins (`origin.set`): The origin set for the wait operation. --- ## TaskGroupContext `@register_passable(trivial)` `struct TaskGroupContext` Context structure for task group operations. This structure holds a callback function and a pointer to a TaskGroup, allowing asynchronous operations to interact with their parent TaskGroup when they complete. ## Fields * ​callback (`fn(mut TaskGroup) -> None`): Callback function to be invoked on the TaskGroup when an operation completes. * ​task\_group (`UnsafePointer[TaskGroup]`): Pointer to the TaskGroup that owns or is associated with this context. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `tg_callback_fn_type` `alias tg_callback_fn_type = fn(mut TaskGroup) -> None` Type definition for callback functions that operate on TaskGroups. --- ## create_task `create_task(owned handle: Coroutine[type, origins], out task: Task[type, origins])` Run the coroutine as a task on the AsyncRT Runtime. This function creates a task from a coroutine and schedules it for execution on the async runtime. The task will execute asynchronously without blocking the current execution context. 
**Args:** * ​handle (`Coroutine[type, origins]`): The coroutine to execute as a task. Ownership is transferred. **Returns:** The `task` output parameter is initialized with the created task. --- ## asyncrt This module implements the low level concurrency library. ## Structs * [​`DeviceContextPtr`](/mojo/stdlib/runtime/asyncrt/DeviceContextPtr): Exposes a pointer to a C++ DeviceContext to Mojo. * [​`DeviceContextPtrList`](/mojo/stdlib/runtime/asyncrt/DeviceContextPtrList): A fixed-size collection of `DeviceContextPtr` objects. * [​`Task`](/mojo/stdlib/runtime/asyncrt/Task): Represents an asynchronous task that will produce a value of the specified type. * [​`TaskGroup`](/mojo/stdlib/runtime/asyncrt/TaskGroup): A group of tasks that can be executed concurrently. * [​`TaskGroupContext`](/mojo/stdlib/runtime/asyncrt/TaskGroupContext): Context structure for task group operations. ## Functions * [​`create_task`](/mojo/stdlib/runtime/asyncrt/create_task): Run the coroutine as a task on the AsyncRT Runtime. * [​`parallelism_level`](/mojo/stdlib/runtime/asyncrt/parallelism_level): Gets the parallelism level of the Runtime. --- ## parallelism_level `parallelism_level() -> Int` Gets the parallelism level of the Runtime. **Returns:** The number of worker threads available in the async runtime. --- ## runtime Implements the runtime package. ## Modules * [​`asyncrt`](/mojo/stdlib/runtime/asyncrt/): This module implements the low level concurrency library. * [​`tracing`](/mojo/stdlib/runtime/tracing/): Provides tracing utilities. --- ## Trace `struct Trace[level: TraceLevel, *, category: TraceCategory = TraceCategory(4), target: Optional[StringSlice[StaticConstantOrigin]] = Optional(None)]` An object representing a specific trace. This struct provides functionality for creating and managing trace events for profiling and debugging purposes. ## Parameters * ​level (`TraceLevel`): The trace level to use. * ​category (`TraceCategory`): The trace category to use (defaults to TraceCategory.MAX). * ​target (`Optional[StringSlice[StaticConstantOrigin]]`): Optional target information to include in the trace. ## Fields * ​int\_payload (`OptionalReg[Int]`): Optional integer payload, typically used for task IDs that are appended to trace names. * ​detail (`String`): Additional details about the trace event, included when detailed tracing is enabled. * ​event\_id (`Int`): Unique identifier for the trace event, assigned when the trace begins. * ​parent\_id (`Int`): Identifier of the parent trace event, used for creating hierarchical trace relationships. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, *, owned _name_value: Variant[String, StringSlice[StaticConstantOrigin]], detail: String = __init__[__mlir_type.!kgen.string](""), parent_id: Int = 0, task_id: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` Creates a Mojo trace with the given name. **Args:** * ​\_name\_value (`Variant[String, StringSlice[StaticConstantOrigin]]`): The name that is used to identify this Mojo trace. * ​detail (`String`): Details of the trace entry. * ​parent\_id (`Int`): Parent to associate the trace with. Trace name will be appended to parent name. 0 (default) indicates no parent. * ​task\_id (`OptionalReg[Int]`): Int that is appended to name. 
`__init__(out self, owned name: String, detail: String = __init__[__mlir_type.!kgen.string](""), parent_id: Int = 0, *, task_id: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` Creates a Mojo trace with the given string name. **Args:** * ​name (`String`): The name that is used to identify this Mojo trace. * ​detail (`String`): Details of the trace entry. * ​parent\_id (`Int`): Parent to associate the trace with. Trace name will be appended to parent name. 0 (default) indicates no parent. * ​task\_id (`OptionalReg[Int]`): Int that is appended to name. `__init__(out self, name: StringSlice[StaticConstantOrigin], detail: String = __init__[__mlir_type.!kgen.string](""), parent_id: Int = 0, *, task_id: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` Creates a Mojo trace with the given static string name. **Args:** * ​name (`StringSlice[StaticConstantOrigin]`): The name that is used to identify this Mojo trace. * ​detail (`String`): Details of the trace entry. * ​parent\_id (`Int`): Parent to associate the trace with. Trace name will be appended to parent name. 0 (default) indicates no parent. * ​task\_id (`OptionalReg[Int]`): Int that is appended to name. `__init__(out self, name: StringLiteral[value], detail: String = __init__[__mlir_type.!kgen.string](""), parent_id: Int = 0, *, task_id: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` Creates a Mojo trace with the given string literal name. **Args:** * ​name (`StringLiteral[value]`): The name that is used to identify this Mojo trace. * ​detail (`String`): Details of the trace entry. * ​parent\_id (`Int`): Parent to associate the trace with. Trace name will be appended to parent name. 0 (default) indicates no parent. * ​task\_id (`OptionalReg[Int]`): Int that is appended to name. ### `__enter__` `__enter__(mut self)` Enters the trace context. This begins recording of the trace event. ### `__exit__` `__exit__(self)` Exits the trace context. This finishes recording of the trace event. ### `mark` `mark(self)` Marks the tracer with the info at a specific point in time. This creates a point event in the trace timeline rather than a range. ### `name` `name(self) -> String` Returns the name of the trace. **Returns:** The name of the trace as a String. ### `start` `start(mut self)` Start recording trace event. This begins recording of the trace event, similar to `__enter__`. ### `end` `end(mut self)` End recording trace event. This finishes recording of the trace event, similar to `__exit__`. --- ## TraceCategory `@register_passable(trivial)` `struct TraceCategory` An enum-like struct specifying the type of tracing to perform. ## Fields * ​value (`Int`): The integer value representing the trace category. Used for bitwise operations when determining if profiling is enabled for a specific category. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Intable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ASYNCRT` `alias ASYNCRT = TraceCategory(1)` ### `Kernel` `alias Kernel = TraceCategory(3)` ### `MAX` `alias MAX = TraceCategory(4)` ### `MEM` `alias MEM = TraceCategory(2)` ### `OTHER` `alias OTHER = TraceCategory(0)` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compares for equality. **Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if they are equal. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Compares for inequality. **Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if they are not equal. ### `__is__` `__is__(self, rhs: Self) -> Bool` Compares for equality.
**Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if they are equal. ### `__isnot__` `__isnot__(self, rhs: Self) -> Bool` Compares for inequality. **Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if they are not equal. ### `__int__` `__int__(self) -> Int` Converts the trace category to an integer. **Returns:** The integer value of the trace category. --- ## TraceLevel `@register_passable(trivial)` `struct TraceLevel` An enum-like struct specifying the level of tracing to perform. ## Fields * ​value (`Int`): The integer value representing the trace level. Lower values indicate higher priority trace levels: * 0 (ALWAYS): Always traced * 1 (OP): Operation-level tracing * 2 (THREAD): Thread-level tracing ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ALWAYS` `alias ALWAYS = TraceLevel(0)` ### `OP` `alias OP = TraceLevel(1)` ### `THREAD` `alias THREAD = TraceLevel(2)` ## Methods ### `__init__` `@implicit` `__init__(value: Int) -> Self` Initializes a TraceLevel with the given integer value. **Args:** * ​value (`Int`): The integer value for the trace level. ### `__le__` `__le__(self, rhs: Self) -> Bool` Performs less than or equal to comparison. **Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if this value is less than or equal to `rhs`. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compares for equality. **Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if they are equal. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Compares for inequality. **Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if they are not equal. ### `__is__` `__is__(self, rhs: Self) -> Bool` Compares for equality. **Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if they are equal. ### `__isnot__` `__isnot__(self, rhs: Self) -> Bool` Compares for inequality. **Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if they are not equal. ### `__int__` `__int__(self) -> Int` Converts the trace level to an integer. **Returns:** The integer value of the trace level. --- ## get_current_trace_id `get_current_trace_id[level: TraceLevel]() -> Int` Returns the id of last created trace entry on the current thread. **Parameters:** * ​level (`TraceLevel`): The trace level to check. **Returns:** The ID of the current trace if profiling is enabled, otherwise 0. --- ## tracing Provides tracing utilities. ## Structs * [​`Trace`](/mojo/stdlib/runtime/tracing/Trace): An object representing a specific trace. * [​`TraceCategory`](/mojo/stdlib/runtime/tracing/TraceCategory): An enum-like struct specifying the type of tracing to perform. * [​`TraceLevel`](/mojo/stdlib/runtime/tracing/TraceLevel): An enum-like struct specifying the level of tracing to perform. ## Functions * [​`get_current_trace_id`](/mojo/stdlib/runtime/tracing/get_current_trace_id): Returns the id of last created trace entry on the current thread. * [​`is_profiling_disabled`](/mojo/stdlib/runtime/tracing/is_profiling_disabled): Returns False if the profiling is enabled for that specific type and level and True otherwise. * [​`is_profiling_enabled`](/mojo/stdlib/runtime/tracing/is_profiling_enabled): Returns True if the profiling is enabled for that specific type and level and False otherwise. * [​`trace_arg`](/mojo/stdlib/runtime/tracing/trace_arg): Helper to stringify the type and shape of a kernel argument for tracing. 
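For example, `Trace` can be used as a context manager so that an event is recorded between `__enter__` and `__exit__`. A minimal sketch; the span name and the traced workload are placeholders:

```mojo
from runtime.tracing import Trace, TraceLevel

fn expensive_step():
    # Stand-in for the work being profiled.
    pass

fn main():
    # Record an operation-level span covering the `with` block.
    # If profiling is disabled for this level, the overhead is minimal.
    with Trace[TraceLevel.OP]("expensive_step"):
        expensive_step()
```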
--- ## is_profiling_disabled `is_profiling_disabled[type: TraceCategory, level: TraceLevel]() -> Bool` Returns False if the profiling is enabled for that specific type and level and True otherwise. **Parameters:** * ​type (`TraceCategory`): The trace category to check. * ​level (`TraceLevel`): The trace level to check. **Returns:** True if profiling is disabled for the specified type and level. --- ## is_profiling_enabled `is_profiling_enabled[type: TraceCategory, level: TraceLevel]() -> Bool` Returns True if the profiling is enabled for that specific type and level and False otherwise. **Parameters:** * ​type (`TraceCategory`): The trace category to check. * ​level (`TraceLevel`): The trace level to check. **Returns:** True if profiling is enabled for the specified type and level. --- ## trace_arg `trace_arg(name: String, shape: IndexList[size, element_type=element_type]) -> String` Helper to stringify the type and shape of a kernel argument for tracing. **Args:** * ​name (`String`): The name of the argument. * ​shape (`IndexList[size, element_type=element_type]`): The shape of the argument. **Returns:** A string representation of the argument with its shape. `trace_arg(name: String, shape: IndexList[size, element_type=element_type], dtype: DType) -> String` Helper to stringify the type and shape of a kernel argument for tracing. **Args:** * ​name (`String`): The name of the argument. * ​shape (`IndexList[size, element_type=element_type]`): The shape of the argument. * ​dtype (`DType`): The data type of the argument. **Returns:** A string representation of the argument with its shape and data type. `trace_arg(name: String, buf: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> String` Helper to stringify the type and shape of a kernel argument for tracing. **Args:** * ​name (`String`): The name of the argument. * ​buf (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The NDBuffer to trace. **Returns:** A string representation of the buffer with its shape and data type. --- ## stat Implements the stat package. ## Modules * [​`stat`](/mojo/stdlib/stat/stat/): Implements the stat module. --- ## S_ISBLK `S_ISBLK[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a block device. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a block device and False otherwise. --- ## S_ISCHR `S_ISCHR[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a character device. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a character device and False otherwise. --- ## S_ISDIR `S_ISDIR[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a directory. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a directory and False otherwise. --- ## S_ISFIFO `S_ISFIFO[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a fifo. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a fifo and False otherwise. --- ## S_ISLNK `S_ISLNK[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a symlink. 
**Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a symlink and False otherwise. --- ## S_ISREG `S_ISREG[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a regular file. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a regular file and False otherwise. --- ## S_ISSOCK `S_ISSOCK[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a socket. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a socket and False otherwise. --- ## stat Implements the stat module. ## Aliases ### `S_IFBLK` `alias S_IFBLK = 24576` Bits that determine the block device. ### `S_IFCHR` `alias S_IFCHR = 8192` Bits that determine the char device. ### `S_IFDIR` `alias S_IFDIR = 16384` Bits that determine the directory. ### `S_IFIFO` `alias S_IFIFO = 4096` Bits that determine the fifo. ### `S_IFLNK` `alias S_IFLNK = 40960` Bits that determine the symlink. ### `S_IFMT` `alias S_IFMT = 61440` Bits that determine the file type. ### `S_IFREG` `alias S_IFREG = 32768` Bits that determine the regular file. ### `S_IFSOCK` `alias S_IFSOCK = 49152` Bits that determine the socket. ## Functions * [​`S_ISBLK`](/mojo/stdlib/stat/stat/S_ISBLK): Returns True if the mode is a block device. * [​`S_ISCHR`](/mojo/stdlib/stat/stat/S_ISCHR): Returns True if the mode is a character device. * [​`S_ISDIR`](/mojo/stdlib/stat/stat/S_ISDIR): Returns True if the mode is a directory. * [​`S_ISFIFO`](/mojo/stdlib/stat/stat/S_ISFIFO): Returns True if the mode is a fifo. * [​`S_ISLNK`](/mojo/stdlib/stat/stat/S_ISLNK): Returns True if the mode is a symlink. * [​`S_ISREG`](/mojo/stdlib/stat/stat/S_ISREG): Returns True if the mode is a regular file. * [​`S_ISSOCK`](/mojo/stdlib/stat/stat/S_ISSOCK): Returns True if the mode is a socket. --- ## subprocess Implements the subprocess package. ## Modules * [​`subprocess`](/mojo/stdlib/subprocess/subprocess/): Implements the subprocess package. --- ## subprocess Implements the subprocess package. ## Functions * [​`run`](/mojo/stdlib/subprocess/subprocess/run): Runs the specified command and returns the output as a string. --- ## run `run(cmd: String) -> String` Runs the specified command and returns the output as a string. This function executes the given command in a subprocess, captures its standard output, and returns it as a string. It automatically handles opening and closing the subprocess. **Args:** * ​cmd (`String`): The command to execute as a string. **Returns:** The standard output of the command as a string, with trailing whitespace removed. **Raises:** This function raises if: * The command cannot be executed. * There is an IO error reading from the subprocess. * The data written by the subprocess is not valid UTF-8. --- ## argv `argv() -> VariadicList[StringSlice[StaticConstantOrigin]]` Gets the list of command line arguments given to the `mojo` CLI. For example:

```mojo title="app.mojo"
from sys import argv

def main():
    args = argv()
    for arg in args:
        print(arg)
```

```sh
mojo app.mojo "Hello world"
```

```output
app.mojo
Hello world
```

**Returns:** The list of command line arguments provided when mojo was invoked. --- ## arg Implements functions and variables for interacting with execution and system environment.
## Functions * [​`argv`](/mojo/stdlib/sys/arg/argv): Gets the list of command line arguments given to the `mojo` CLI. --- ## compile Implements functions that return compile-time information. ## Aliases ### `DebugLevel` `alias DebugLevel = _DebugLevel()` Represents the debug level used during compilation. ### `OptimizationLevel` `alias OptimizationLevel = _OptimizationLevel()` Represents the optimization level used during compilation. ## Functions * [​`is_compile_time`](/mojo/stdlib/sys/compile/is_compile_time): Returns true if the current code is executed at compile time, false otherwise. --- ## is_compile_time `is_compile_time() -> Bool` Returns true if the current code is executed at compile time, false otherwise. **Returns:** A boolean value indicating whether the code is being compiled. --- ## breakpointhook `breakpointhook()` Cause an execution trap with the intention of requesting the attention of a debugger. --- ## debug This module includes the debug hook functions. ## Functions * [​`breakpointhook`](/mojo/stdlib/sys/debug/breakpointhook): Cause an execution trap with the intention of requesting the attention of a debugger. --- ## DLHandle `@register_passable(trivial)` `struct DLHandle` Represents a dynamically linked library that can be loaded and unloaded. The library is loaded on initialization and unloaded by `close`. ## Fields * ​handle (`UnsafePointer[NoneType]`): The handle to the dynamic library. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, flags: Int = (256 if os_is_linux() else 8 | 2))` Initialize a dynamic library handle to all global symbols in the current process. Notes: On POSIX-compatible operating systems, this performs `dlopen(nullptr, flags)`. **Args:** * ​flags (`Int`): The flags to load the dynamic library. `__init__[PathLike: PathLike, //](out self, path: PathLike, flags: Int = (256 if os_is_linux() else 8 | 2))` Initialize a DLHandle object by loading the dynamic library at the given path. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the `os.PathLike` trait. **Args:** * ​path (`PathLike`): The path to the dynamic library file. * ​flags (`Int`): The flags to load the dynamic library. ### `__bool__` `__bool__(self) -> Bool` Checks if the handle is valid. **Returns:** True if the DLHandle is not null and False otherwise. ### `copy` `copy(self) -> Self` Copy the object. **Returns:** A copy of the value. ### `check_symbol` `check_symbol(self, owned name: String) -> Bool` Check that the symbol exists in the dynamic library. **Args:** * ​name (`String`): The symbol to check. **Returns:** `True` if the symbol exists. ### `close` `close(mut self)` Delete the DLHandle object unloading the associated dynamic library. ### `get_function` `get_function[result_type: AnyTrivialRegType](self, owned name: String) -> result_type` Returns a handle to the function with the given name in the dynamic library. **Parameters:** * ​result\_type (`AnyTrivialRegType`): The type of the function pointer to return. **Args:** * ​name (`String`): The name of the function to get the handle for. **Returns:** A handle to the function. ### `get_symbol` `get_symbol[result_type: AnyType](self, name: StringSlice[origin]) -> UnsafePointer[result_type]` Returns a pointer to the symbol with the given name in the dynamic library. **Parameters:** * ​result\_type (`AnyType`): The type of the symbol to return. 
**Args:** * ​name (`StringSlice[origin]`): The name of the symbol to get the handle for. **Returns:** A pointer to the symbol. `get_symbol[result_type: AnyType](self, *, cstr_name: UnsafePointer[SIMD[int8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> UnsafePointer[result_type]` Returns a pointer to the symbol with the given name in the dynamic library. **Parameters:** * ​result\_type (`AnyType`): The type of the symbol to return. **Args:** * ​cstr\_name (`UnsafePointer[SIMD[int8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The name of the symbol to get the handle for. **Returns:** A pointer to the symbol. ### `call` `call[name: StringSlice[StaticConstantOrigin], return_type: AnyTrivialRegType = NoneType, *T: AnyType = *?](self, *args: *T) -> return_type` Call a function with any number of arguments. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the function. * ​return\_type (`AnyTrivialRegType`): The return type of the function. * ​\*T (`AnyType`): The types of `args`. **Args:** * ​\*args (`*T`): The arguments. **Returns:** The result. `call[name: StringSlice[StaticConstantOrigin], return_type: AnyTrivialRegType = NoneType](self, args: VariadicPack[is_owned, origin, AnyType, element_types]) -> return_type` Call a function with any number of arguments. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the function. * ​return\_type (`AnyTrivialRegType`): The return type of the function. **Args:** * ​args (`VariadicPack[is_owned, origin, AnyType, element_types]`): The arguments. **Returns:** The result. --- ## RTLD `struct RTLD` Enumeration of the RTLD flags used during dynamic library loading. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `GLOBAL` `alias GLOBAL = 256 if os_is_linux() else 8` Make symbols available for symbol resolution of subsequently loaded libraries. ### `LAZY` `alias LAZY = 1` Load library lazily (defer function resolution until needed). ### `LOCAL` `alias LOCAL = 4` Make symbols not available for symbol resolution of subsequently loaded libraries. ### `NOW` `alias NOW = 2` Load library immediately (resolve all symbols on load). --- ## external_call `external_call[callee: StringSlice[StaticConstantOrigin], return_type: AnyTrivialRegType, *types: AnyType](*args: *types) -> return_type` Calls an external function. **Parameters:** * ​callee (`StringSlice[StaticConstantOrigin]`): The name of the external function. * ​return\_type (`AnyTrivialRegType`): The return type. * ​\*types (`AnyType`): The argument types. **Args:** * ​\*args (`*types`): The arguments to pass to the external function. **Returns:** The external call result. `external_call[callee: StringSlice[StaticConstantOrigin], return_type: AnyTrivialRegType](args: VariadicPack[is_owned, origin, AnyType, element_types]) -> return_type` Calls an external function. **Parameters:** * ​callee (`StringSlice[StaticConstantOrigin]`): The name of the external function. * ​return\_type (`AnyTrivialRegType`): The return type. **Args:** * ​args (`VariadicPack[is_owned, origin, AnyType, element_types]`): The arguments to pass to the external function. **Returns:** The external call result.
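As a small illustration of this call path, the following sketch calls libc's `getpid` through `external_call`; it assumes a POSIX host where `getpid` is available:

```mojo
from sys.ffi import c_int, external_call

def main():
    # getpid() takes no arguments and returns a C int.
    var pid = external_call["getpid", c_int]()
    print("pid:", pid)
```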
--- ## ffi Implements a foreign functions interface (FFI). ## Aliases ### `c_char` `alias c_char = SIMD[int8, 1]` C `char` type. ### `c_double` `alias c_double = SIMD[float64, 1]` C `double` type. ### `c_float` `alias c_float = SIMD[float32, 1]` C `float` type. ### `c_int` `alias c_int = SIMD[int32, 1]` C `int` type. The C `int` type is typically a signed 32-bit integer on commonly used targets today. ### `c_long` `alias c_long = SIMD[_c_long_dtype(), 1]` C `long` type. The C `long` type is typically a signed 64-bit integer on macOS and Linux, and a 32-bit integer on Windows. ### `c_long_long` `alias c_long_long = SIMD[_c_long_long_dtype(), 1]` C `long long` type. The C `long long` type is typically a signed 64-bit integer on commonly used targets today. ### `c_short` `alias c_short = SIMD[int16, 1]` C `short` type. ### `c_size_t` `alias c_size_t = UInt` C `size_t` type. ### `c_ssize_t` `alias c_ssize_t = Int` C `ssize_t` type. ### `c_uchar` `alias c_uchar = SIMD[uint8, 1]` C `unsigned char` type. ### `c_uint` `alias c_uint = SIMD[uint32, 1]` C `unsigned int` type. ### `c_ushort` `alias c_ushort = SIMD[uint16, 1]` C `unsigned short` type. ### `DEFAULT_RTLD` `alias DEFAULT_RTLD = (256 if os_is_linux() else 8 | 2)` ### `OpaquePointer` `alias OpaquePointer = UnsafePointer[NoneType]` An opaque pointer, equivalent to the C `void*` type. ## Structs * [​`DLHandle`](/mojo/stdlib/sys/ffi/DLHandle): Represents a dynamically linked library that can be loaded and unloaded. * [​`RTLD`](/mojo/stdlib/sys/ffi/RTLD): Enumeration of the RTLD flags used during dynamic library loading. ## Functions * [​`external_call`](/mojo/stdlib/sys/ffi/external_call): Calls an external function. --- ## sys Implements the sys package. ## Modules * [​`arg`](/mojo/stdlib/sys/arg/): Implements functions and variables for interacting with execution and system environment. * [​`compile`](/mojo/stdlib/sys/compile/): Implements functions that return compile-time information. * [​`debug`](/mojo/stdlib/sys/debug/): This module includes the debug hook functions. * [​`ffi`](/mojo/stdlib/sys/ffi/): Implements a foreign functions interface (FFI). * [​`info`](/mojo/stdlib/sys/info/): Implements methods for querying the host target info. * [​`intrinsics`](/mojo/stdlib/sys/intrinsics/): Defines intrinsics. * [​`param_env`](/mojo/stdlib/sys/param_env/): Implements functions for retrieving compile-time defines. * [​`terminate`](/mojo/stdlib/sys/terminate/): This module includes the exit functions. --- ## CompilationTarget `@register_passable(trivial)` `struct CompilationTarget[value: target = _current_target()]` A struct that provides information about a target architecture. This struct encapsulates various methods to query target-specific information such as architecture features, OS details, endianness, and memory characteristics. ## Parameters * ​value (`target`): The target architecture to query. Defaults to the current target. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `has_sse4` `static has_sse4() -> Bool` Checks if the target supports SSE4 instructions. **Returns:** True if the target supports SSE4, False otherwise. ### `is_x86` `static is_x86() -> Bool` Checks if the target is an x86 architecture. **Returns:** True if the target is x86, False otherwise. --- ## alignof `alignof[type: AnyType, target: target = _current_target()]() -> Int` Returns the alignment (in bytes) of the type. **Parameters:** * ​type (`AnyType`): The type in question. * ​target (`target`): The target architecture. **Returns:** The alignment of the type in bytes. `alignof[dtype: DType, target: target = _current_target()]() -> Int` Returns the alignment (in bytes) of the dtype. **Parameters:** * ​dtype (`DType`): The DType in question.
* ​target (`target`): The target architecture. **Returns:** The alignment of the dtype in bytes. --- ## bitwidthof `bitwidthof[type: AnyTrivialRegType, target: target = _current_target()]() -> Int` Returns the size (in bits) of the type. **Parameters:** * ​type (`AnyTrivialRegType`): The type in question. * ​target (`target`): The target architecture. **Returns:** The size of the type in bits. `bitwidthof[dtype: DType, target: target = _current_target()]() -> Int` Returns the size (in bits) of the dtype. **Parameters:** * ​dtype (`DType`): The type in question. * ​target (`target`): The target architecture. **Returns:** The size of the dtype in bits. --- ## has_accelerator `has_accelerator() -> Bool` Returns True if the host system has an accelerator and False otherwise. **Returns:** True if the host system has an accelerator. --- ## has_amd_gpu_accelerator `has_amd_gpu_accelerator() -> Bool` Returns True if the host system has an AMD GPU and False otherwise. **Returns:** True if the host system has an AMD GPU. --- ## has_avx `has_avx() -> Bool` Returns True if the host system has AVX, otherwise returns False. **Returns:** True if the host system has AVX, otherwise returns False. --- ## has_avx2 `has_avx2() -> Bool` Returns True if the host system has AVX2, otherwise returns False. **Returns:** True if the host system has AVX2, otherwise returns False. --- ## has_avx512f `has_avx512f() -> Bool` Returns True if the host system has AVX512, otherwise returns False. **Returns:** True if the host system has AVX512, otherwise returns False. --- ## has_fma `has_fma() -> Bool` Returns True if the host system has FMA (Fused Multiply-Add) support, otherwise returns False. **Returns:** True if the host system has FMA support, otherwise returns False. --- ## has_intel_amx `has_intel_amx() -> Bool` Returns True if the host system has Intel AMX support, otherwise returns False. **Returns:** True if the host system has Intel AMX and False otherwise. --- ## has_neon `has_neon() -> Bool` Returns True if the host system has Neon support, otherwise returns False. **Returns:** True if the host system supports the Neon instruction set. --- ## has_neon_int8_dotprod `has_neon_int8_dotprod() -> Bool` Returns True if the host system has the Neon int8 dot product extension, otherwise returns False. **Returns:** True if the host system supports the Neon int8 dot product extension and False otherwise. --- ## has_neon_int8_matmul `has_neon_int8_matmul() -> Bool` Returns True if the host system has the Neon int8 matrix multiplication extension (I8MM), otherwise returns False. **Returns:** True if the host system supports the Neon int8 matrix multiplication extension (I8MM) and False otherwise. --- ## has_nvidia_gpu_accelerator `has_nvidia_gpu_accelerator() -> Bool` Returns True if the host system has an NVIDIA GPU and False otherwise. **Returns:** True if the host system has an NVIDIA GPU. --- ## has_sse4 `has_sse4() -> Bool` Returns True if the host system has sse4, otherwise returns False. **Deprecated:** Use `CompilationTarget.has_sse4()` instead. **Returns:** True if the host system has sse4, otherwise returns False. --- ## has_vnni `has_vnni() -> Bool` Returns True if the host system has avx512\_vnni, otherwise returns False. **Returns:** True if the host system has avx512\_vnni, otherwise returns False. --- ## info Implements methods for querying the host target info. You can import these APIs from the `sys` package.
For example:

```mojo
from sys import CompilationTarget

print(CompilationTarget.is_x86())
```

## Structs * [​`CompilationTarget`](/mojo/stdlib/sys/info/CompilationTarget): A struct that provides information about a target architecture. ## Functions * [​`alignof`](/mojo/stdlib/sys/info/alignof): Returns the alignment (in bytes) of the type. * [​`bitwidthof`](/mojo/stdlib/sys/info/bitwidthof): Returns the size (in bits) of the type. * [​`has_accelerator`](/mojo/stdlib/sys/info/has_accelerator): Returns True if the host system has an accelerator and False otherwise. * [​`has_amd_gpu_accelerator`](/mojo/stdlib/sys/info/has_amd_gpu_accelerator): Returns True if the host system has an AMD GPU and False otherwise. * [​`has_avx`](/mojo/stdlib/sys/info/has_avx): Returns True if the host system has AVX, otherwise returns False. * [​`has_avx2`](/mojo/stdlib/sys/info/has_avx2): Returns True if the host system has AVX2, otherwise returns False. * [​`has_avx512f`](/mojo/stdlib/sys/info/has_avx512f): Returns True if the host system has AVX512, otherwise returns False. * [​`has_fma`](/mojo/stdlib/sys/info/has_fma): Returns True if the host system has FMA (Fused Multiply-Add) support, otherwise returns False. * [​`has_intel_amx`](/mojo/stdlib/sys/info/has_intel_amx): Returns True if the host system has Intel AMX support, otherwise returns False. * [​`has_neon`](/mojo/stdlib/sys/info/has_neon): Returns True if the host system has Neon support, otherwise returns False. * [​`has_neon_int8_dotprod`](/mojo/stdlib/sys/info/has_neon_int8_dotprod): Returns True if the host system has the Neon int8 dot product extension, otherwise returns False. * [​`has_neon_int8_matmul`](/mojo/stdlib/sys/info/has_neon_int8_matmul): Returns True if the host system has the Neon int8 matrix multiplication extension (I8MM), otherwise returns False. * [​`has_nvidia_gpu_accelerator`](/mojo/stdlib/sys/info/has_nvidia_gpu_accelerator): Returns True if the host system has an NVIDIA GPU and False otherwise. * [​`has_sse4`](/mojo/stdlib/sys/info/has_sse4): Returns True if the host system has sse4, otherwise returns False. * [​`has_vnni`](/mojo/stdlib/sys/info/has_vnni): Returns True if the host system has avx512\_vnni, otherwise returns False. * [​`is_32bit`](/mojo/stdlib/sys/info/is_32bit): Returns True if the maximum integral value is 32 bit. * [​`is_64bit`](/mojo/stdlib/sys/info/is_64bit): Returns True if the maximum integral value is 64 bit. * [​`is_amd_gpu`](/mojo/stdlib/sys/info/is_amd_gpu): Returns True if the target triple of the compiler is `amdgcn-amd-amdhsa` and False otherwise. * [​`is_apple_m1`](/mojo/stdlib/sys/info/is_apple_m1): Returns True if the host system is an Apple M1 with AMX support, otherwise returns False. * [​`is_apple_m2`](/mojo/stdlib/sys/info/is_apple_m2): Returns True if the host system is an Apple M2 with AMX support, otherwise returns False. * [​`is_apple_m3`](/mojo/stdlib/sys/info/is_apple_m3): Returns True if the host system is an Apple M3 with AMX support, otherwise returns False. * [​`is_apple_m4`](/mojo/stdlib/sys/info/is_apple_m4): Returns True if the host system is an Apple M4 with AMX support, otherwise returns False. * [​`is_apple_silicon`](/mojo/stdlib/sys/info/is_apple_silicon): Returns True if the host system is an Apple Silicon with AMX support, otherwise returns False. * [​`is_big_endian`](/mojo/stdlib/sys/info/is_big_endian): Returns True if the host endianness is big and False otherwise. * [​`is_gpu`](/mojo/stdlib/sys/info/is_gpu): Returns True if the target triple is GPU and False otherwise.
* [​`is_little_endian`](/mojo/stdlib/sys/info/is_little_endian): Returns True if the host endianness is little and False otherwise. * [​`is_neoverse_n1`](/mojo/stdlib/sys/info/is_neoverse_n1): Returns True if the host system is a Neoverse N1 system, otherwise returns False. * [​`is_nvidia_gpu`](/mojo/stdlib/sys/info/is_nvidia_gpu): Returns True if the target triple of the compiler is `nvptx64-nvidia-cuda` and False otherwise. * [​`is_triple`](/mojo/stdlib/sys/info/is_triple): Returns True if the target triple of the compiler matches the input and False otherwise. * [​`is_x86`](/mojo/stdlib/sys/info/is_x86): Returns True if the host system architecture is X86 and False otherwise. * [​`num_logical_cores`](/mojo/stdlib/sys/info/num_logical_cores): Returns the number of hardware threads, including hyperthreads across all CPU sockets. * [​`num_performance_cores`](/mojo/stdlib/sys/info/num_performance_cores): Returns the number of physical performance cores across all CPU sockets. If not known, returns the total number of physical cores. * [​`num_physical_cores`](/mojo/stdlib/sys/info/num_physical_cores): Returns the number of physical cores across all CPU sockets. * [​`os_is_linux`](/mojo/stdlib/sys/info/os_is_linux): Returns True if the host operating system is Linux. * [​`os_is_macos`](/mojo/stdlib/sys/info/os_is_macos): Returns True if the host operating system is macOS. * [​`os_is_windows`](/mojo/stdlib/sys/info/os_is_windows): Returns True if the host operating system is Windows. * [​`simdbitwidth`](/mojo/stdlib/sys/info/simdbitwidth): Returns the vector size (in bits) of the specified target. * [​`simdbytewidth`](/mojo/stdlib/sys/info/simdbytewidth): Returns the vector size (in bytes) of the specified target. * [​`simdwidthof`](/mojo/stdlib/sys/info/simdwidthof): Returns the vector size of the type on the host system. * [​`sizeof`](/mojo/stdlib/sys/info/sizeof): Returns the size (in bytes) of the type. --- ## is_32bit `is_32bit[target: target = _current_target()]() -> Bool` Returns True if the maximum integral value is 32 bit. **Parameters:** * ​target (`target`): The target architecture. **Returns:** True if the maximum integral value is 32 bit, False otherwise. --- ## is_64bit `is_64bit[target: target = _current_target()]() -> Bool` Returns True if the maximum integral value is 64 bit. **Parameters:** * ​target (`target`): The target architecture. **Returns:** True if the maximum integral value is 64 bit, False otherwise. --- ## is_amd_gpu `is_amd_gpu() -> Bool` Returns True if the target triple of the compiler is `amdgcn-amd-amdhsa` and False otherwise. **Returns:** True if the target triple is amdgpu and False otherwise. --- ## is_apple_m1 `is_apple_m1() -> Bool` Returns True if the host system is an Apple M1 with AMX support, otherwise returns False. **Returns:** True if the host system is an Apple M1 with AMX support and False otherwise. --- ## is_apple_m2 `is_apple_m2() -> Bool` Returns True if the host system is an Apple M2 with AMX support, otherwise returns False. **Returns:** True if the host system is an Apple M2 with AMX support and False otherwise. --- ## is_apple_m3 `is_apple_m3() -> Bool` Returns True if the host system is an Apple M3 with AMX support, otherwise returns False. **Returns:** True if the host system is an Apple M3 with AMX support and False otherwise. --- ## is_apple_m4 `is_apple_m4() -> Bool` Returns True if the host system is an Apple M4 with AMX support, otherwise returns False.
**Returns:** True if the host system is an Apple M4 with AMX support and False otherwise. --- ## is_apple_silicon `is_apple_silicon() -> Bool` Returns True if the host system is an Apple Silicon with AMX support, otherwise returns False. **Returns:** True if the host system is an Apple Silicon with AMX support and False otherwise. --- ## is_big_endian `is_big_endian[target: target = _current_target()]() -> Bool` Returns True if the host endianness is big and False otherwise. **Parameters:** * ​target (`target`): The target architecture. **Returns:** True if the host target is big endian and False otherwise. --- ## is_gpu `is_gpu() -> Bool` Returns True if the target triple is GPU and False otherwise. **Returns:** True if the target triple is GPU and False otherwise. --- ## is_little_endian `is_little_endian[target: target = _current_target()]() -> Bool` Returns True if the host endianness is little and False otherwise. **Parameters:** * ​target (`target`): The target architecture. **Returns:** True if the host target is little endian and False otherwise. --- ## is_neoverse_n1 `is_neoverse_n1() -> Bool` Returns True if the host system is a Neoverse N1 system, otherwise returns False. **Returns:** True if the host system is a Neoverse N1 system and False otherwise. --- ## is_nvidia_gpu `is_nvidia_gpu() -> Bool` Returns True if the target triple of the compiler is `nvptx64-nvidia-cuda` and False otherwise. **Returns:** True if the target triple is cuda and False otherwise. `is_nvidia_gpu[subarch: StringSlice[StaticConstantOrigin]]() -> Bool` Returns True if the target triple of the compiler is `nvptx64-nvidia-cuda` and we are compiling for the specified sub-architecture, and False otherwise. **Parameters:** * ​subarch (`StringSlice[StaticConstantOrigin]`): The subarchitecture (e.g. sm\_80). **Returns:** True if the target triple is cuda and False otherwise. --- ## is_triple `is_triple[: string, //, name: StringLiteral[$0], target: target = _current_target()]() -> Bool` Returns True if the target triple of the compiler matches the input and False otherwise. **Parameters:** * ​name (`StringLiteral[$0]`): The name of the triple value. * ​target (`target`): The triple value to be checked against. **Returns:** True if the triple matches and False otherwise. --- ## is_x86 `is_x86() -> Bool` Returns True if the host system architecture is X86 and False otherwise. **Deprecated:** Use `CompilationTarget.is_x86()` instead. **Returns:** True if the host system architecture is X86 and False otherwise. --- ## num_logical_cores `num_logical_cores() -> Int` Returns the number of hardware threads, including hyperthreads across all CPU sockets. **Returns:** Int: The number of threads on the system. --- ## num_performance_cores `num_performance_cores() -> Int` Returns the number of physical performance cores across all CPU sockets. If not known, returns the total number of physical cores. **Returns:** Int: The number of physical performance cores on the system. --- ## num_physical_cores `num_physical_cores() -> Int` Returns the number of physical cores across all CPU sockets. **Returns:** Int: The number of physical cores on the system. --- ## os_is_linux `os_is_linux() -> Bool` Returns True if the host operating system is Linux. **Returns:** True if the host operating system is Linux and False otherwise. --- ## os_is_macos `os_is_macos() -> Bool` Returns True if the host operating system is macOS. **Returns:** True if the host operating system is macOS and False otherwise.
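For example, a minimal sketch combining several of the host-query functions above (the printed labels are illustrative, not part of the API):

```mojo
from sys import (
    has_accelerator,
    num_logical_cores,
    num_physical_cores,
    os_is_linux,
    os_is_macos,
)

def main():
    # Core counts are gathered across all CPU sockets.
    print("logical cores:", num_logical_cores())
    print("physical cores:", num_physical_cores())

    # Host OS checks; at most one of these is True.
    if os_is_linux():
        print("host OS: Linux")
    elif os_is_macos():
        print("host OS: macOS")

    # True if a GPU or other accelerator is available to the host.
    print("accelerator present:", has_accelerator())
```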
--- ## os_is_windows `os_is_windows() -> Bool` Returns True if the host operating system is Windows. **Returns:** True if the host operating system is Windows and False otherwise. --- ## simdbitwidth `simdbitwidth[target: target = _current_target()]() -> Int` Returns the vector size (in bits) of the specified target. **Parameters:** * ​target (`target`): The target architecture. **Returns:** The vector size (in bits) of the specified target. --- ## simdbytewidth `simdbytewidth[target: target = _current_target()]() -> Int` Returns the vector size (in bytes) of the specified target. **Parameters:** * ​target (`target`): The target architecture. **Returns:** The vector size (in bytes) of the specified target. --- ## simdwidthof `simdwidthof[type: AnyTrivialRegType, target: target = _current_target()]() -> Int` Returns the vector size of the type on the host system. **Parameters:** * ​type (`AnyTrivialRegType`): The type in question. * ​target (`target`): The target architecture. **Returns:** The vector size of the type on the host system. `simdwidthof[dtype: DType, target: target = _current_target()]() -> Int` Returns the vector size of the type on the host system. **Parameters:** * ​dtype (`DType`): The DType in question. * ​target (`target`): The target architecture. **Returns:** The vector size of the dtype on the host system. --- ## sizeof `sizeof[type: AnyType, target: target = _current_target()]() -> Int` Returns the size (in bytes) of the type. Example:

```mojo
from sys.info import sizeof

def main():
    print(
        sizeof[UInt8]() == 1,
        sizeof[UInt16]() == 2,
        sizeof[Int32]() == 4,
        sizeof[Float64]() == 8,
        sizeof[SIMD[DType.uint8, 4]]() == 4,
    )
```

Note: `alignof` is in the same module. **Parameters:** * ​type (`AnyType`): The type in question. * ​target (`target`): The target architecture. **Returns:** The size of the type in bytes. `sizeof[dtype: DType, target: target = _current_target()]() -> Int` Returns the size (in bytes) of the dtype. **Parameters:** * ​dtype (`DType`): The DType in question. * ​target (`target`): The target architecture. **Returns:** The size of the dtype in bytes. --- ## PrefetchCache `@register_passable(trivial)` `struct PrefetchCache` Prefetch cache type. ## Fields * ​value (`SIMD[int32, 1]`): The cache prefetch. It should be in \[0, 1]. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `DATA` `alias DATA = PrefetchCache(1)` The data prefetching option. ### `INSTRUCTION` `alias INSTRUCTION = PrefetchCache(0)` The instruction prefetching option. ## Methods ### `__init__` `__init__(value: Int) -> Self` Constructs a prefetch option. **Args:** * ​value (`Int`): An integer value representing the prefetch cache option to be used. Should be a value in the range `[0, 1]`. --- ## PrefetchLocality `@register_passable(trivial)` `struct PrefetchLocality` The prefetch locality. The locality, rw, and cache type correspond to LLVM prefetch intrinsic's inputs (see [LLVM prefetch locality](https://llvm.org/docs/LangRef.html#llvm-prefetch-intrinsic)) ## Fields * ​value (`SIMD[int32, 1]`): The prefetch locality to use. It should be a value in \[0, 3]. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `HIGH` `alias HIGH = PrefetchLocality(3)` Extremely local locality (keep in cache). ### `LOW` `alias LOW = PrefetchLocality(1)` Low locality. ### `MEDIUM` `alias MEDIUM = PrefetchLocality(2)` Medium locality. ### `NONE` `alias NONE = PrefetchLocality(0)` No locality.
## Methods ### `__init__` `__init__(value: Int) -> Self` Constructs a prefetch locality option. **Args:** * ​value (`Int`): An integer value representing the locality. Should be a value in the range `[0, 3]`. --- ## PrefetchOptions `@register_passable(trivial)` `struct PrefetchOptions` Collection of configuration parameters for a prefetch intrinsic call. The op configuration follows a similar interface to the LLVM prefetch intrinsic, with a "locality" attribute that specifies the level of temporal locality in the application, that is, how soon the same data will be visited again. Possible locality values are: `NONE`, `LOW`, `MEDIUM`, and `HIGH`. The op also takes a "cache tag" attribute giving hints on how the prefetched data will be used. Possible tags are: `ReadICache`, `ReadDCache` and `WriteDCache`. Note: the actual behavior of the prefetch op and the concrete interpretation of these attributes are target-dependent. ## Fields * ​rw (`PrefetchRW`): Indicates prefetching for read or write. * ​locality (`PrefetchLocality`): Indicates locality level. * ​cache (`PrefetchCache`): Indicates i-cache or d-cache prefetching. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Constructs an instance of PrefetchOptions with default params. ### `for_read` `for_read(self) -> Self` Sets the prefetch purpose to read. **Returns:** The updated prefetch parameter. ### `for_write` `for_write(self) -> Self` Sets the prefetch purpose to write. **Returns:** The updated prefetch parameter. ### `no_locality` `no_locality(self) -> Self` Sets the prefetch locality to none. **Returns:** The updated prefetch parameter. ### `low_locality` `low_locality(self) -> Self` Sets the prefetch locality to low. **Returns:** The updated prefetch parameter. ### `medium_locality` `medium_locality(self) -> Self` Sets the prefetch locality to medium. **Returns:** The updated prefetch parameter. ### `high_locality` `high_locality(self) -> Self` Sets the prefetch locality to high. **Returns:** The updated prefetch parameter. ### `to_data_cache` `to_data_cache(self) -> Self` Sets the prefetch target to data cache. **Returns:** The updated prefetch parameter. ### `to_instruction_cache` `to_instruction_cache(self) -> Self` Sets the prefetch target to instruction cache. **Returns:** The updated prefetch parameter. --- ## PrefetchRW `@register_passable(trivial)` `struct PrefetchRW` Prefetch read or write. ## Fields * ​value (`SIMD[int32, 1]`): The read-write prefetch. It should be in \[0, 1]. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `READ` `alias READ = PrefetchRW(0)` Read prefetch. ### `WRITE` `alias WRITE = PrefetchRW(1)` Write prefetch. ## Methods ### `__init__` `__init__(value: Int) -> Self` Constructs a prefetch read-write option. **Args:** * ​value (`Int`): An integer value representing the prefetch read-write option to be used. Should be a value in the range `[0, 1]`. --- ## assume `assume(val: Bool)` Signals to the optimizer that the condition is always true. This allows the optimizer to optimize the code. **Args:** * ​val (`Bool`): The input value which is assumed to be `True`. --- ## ballot `ballot[dtype: DType](value: Bool) -> SIMD[dtype, 1]` Returns a bitfield (Int32 or Int64) containing the result of its Bool argument in all active lanes, and zero in all inactive lanes. For example, ballot(True) returns the EXEC mask. **Parameters:** * ​dtype (`DType`): The DType of the return type.
**Args:** * ​value (`Bool`): The value to place across the mask. **Returns:** A bitfield (Int32 or Int64) containing the result of its Bool argument in all active lanes. --- ## compressed_store `compressed_store[dtype: DType, size: Int](value: SIMD[dtype, size], addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], mask: SIMD[bool, size])` Compresses the lanes of `value`, skipping `mask` lanes, and stores at `addr`. **Parameters:** * ​dtype (`DType`): DType of `value`, the value to store. * ​size (`Int`): Size of `value`, the value to store. **Args:** * ​value (`SIMD[dtype, size]`): The vector containing data to store. * ​addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The memory location to store the compressed data. * ​mask (`SIMD[bool, size]`): A binary vector which prevents memory access to certain lanes of `value`. --- ## expect `expect[T: AnyTrivialRegType, //, expected_val: T](val: T) -> T` Provides information about the expected (most probable) value of `val`, which can be used by optimizers. Notes: Only works with integer/boolean types. **Parameters:** * ​T (`AnyTrivialRegType`): The type of the input value. * ​expected\_val (`T`): The expected value of `val`. **Args:** * ​val (`T`): The input value. **Returns:** The input value. --- ## gather `gather[dtype: DType, size: Int, //, *, invariant: Bool = False](owned base: SIMD[index, size], mask: SIMD[bool, size], passthrough: SIMD[dtype, size], alignment: Int = 0) -> SIMD[dtype, size]` Reads scalar values from a SIMD vector, and gathers them into one vector. The gather function reads scalar values from a SIMD vector of memory locations and gathers them into one vector. The memory locations are provided in the vector of pointers `base` as addresses. The memory is accessed according to the provided mask. The mask holds a bit for each vector lane, and is used to prevent memory accesses to the masked-off lanes. The masked-off lanes in the result vector are taken from the corresponding lanes of the `passthrough` operand. In general, for some vector of pointers `base`, mask `mask`, and passthrough `passthrough` a call of the form:

```mojo
result = gather(base, mask, passthrough)
```

is equivalent to the following sequence of scalar loads in C++:

```cpp
for (int i = 0; i < size; i++)
    result[i] = mask[i] ? *base[i] : passthrough[i];
```

**Parameters:** * ​dtype (`DType`): DType of the return SIMD buffer. * ​size (`Int`): Size of the return SIMD buffer. * ​invariant (`Bool`): Whether the memory is load invariant. **Args:** * ​base (`SIMD[index, size]`): The vector containing memory addresses that gather will access. * ​mask (`SIMD[bool, size]`): A binary vector which prevents memory access to certain lanes of the base vector. * ​passthrough (`SIMD[dtype, size]`): In the result vector, the masked-off lanes are replaced with the passthrough vector. * ​alignment (`Int`): The alignment of the source addresses. Must be 0 or a power of two constant integer value. **Returns:** A SIMD\[dtype, size] containing the result of the gather operation. --- ## implicitarg_ptr `implicitarg_ptr() -> UnsafePointer[SIMD[uint8, 1], address_space=AddressSpace(4)]` Get a pointer to AMD's implicit arguments table. **Returns:** A pointer to AMD's implicit arguments table. --- ## intrinsics Defines intrinsics. You can import these APIs from the `sys` package.
For example:

```mojo
from sys import PrefetchLocality
```

## Aliases ### `block_dim` `alias block_dim = _BlockDim()` ### `block_id_in_cluster` `alias block_id_in_cluster = _Cluster_BlockIdx()` ### `block_idx` `alias block_idx = _BlockIdx()` ### `cluster_dim` `alias cluster_dim = _ClusterDim()` ### `cluster_idx` `alias cluster_idx = _ClusterIdx()` ### `global_idx` `alias global_idx = _GridIdx()` ### `grid_dim` `alias grid_dim = _GridDim()` ### `thread_idx` `alias thread_idx = _ThreadIdx()` ## Structs * [​`PrefetchCache`](/mojo/stdlib/sys/intrinsics/PrefetchCache): Prefetch cache type. * [​`PrefetchLocality`](/mojo/stdlib/sys/intrinsics/PrefetchLocality): The prefetch locality. * [​`PrefetchOptions`](/mojo/stdlib/sys/intrinsics/PrefetchOptions): Collection of configuration parameters for a prefetch intrinsic call. * [​`PrefetchRW`](/mojo/stdlib/sys/intrinsics/PrefetchRW): Prefetch read or write. ## Functions * [​`assume`](/mojo/stdlib/sys/intrinsics/assume): Signals to the optimizer that the condition is always true. This allows the optimizer to optimize the code. * [​`ballot`](/mojo/stdlib/sys/intrinsics/ballot): Returns a bitfield (Int32 or Int64) containing the result of its Bool argument in all active lanes, and zero in all inactive lanes. For example, ballot(True) returns the EXEC mask. * [​`compressed_store`](/mojo/stdlib/sys/intrinsics/compressed_store): Compresses the lanes of `value`, skipping `mask` lanes, and stores at `addr`. * [​`expect`](/mojo/stdlib/sys/intrinsics/expect): Provides information about the expected (most probable) value of `val`, which can be used by optimizers. * [​`gather`](/mojo/stdlib/sys/intrinsics/gather): Reads scalar values from a SIMD vector, and gathers them into one vector. * [​`implicitarg_ptr`](/mojo/stdlib/sys/intrinsics/implicitarg_ptr): Get a pointer to AMD's implicit arguments table. * [​`lane_id`](/mojo/stdlib/sys/intrinsics/lane_id): Returns the lane ID of the current thread. * [​`likely`](/mojo/stdlib/sys/intrinsics/likely): Provides information that the most probable value of `val` is going to be `True`. This information can be used by optimizers. * [​`llvm_intrinsic`](/mojo/stdlib/sys/intrinsics/llvm_intrinsic): Calls an LLVM intrinsic with the name `intrin` and return type `type`. * [​`masked_load`](/mojo/stdlib/sys/intrinsics/masked_load): Loads data from memory and returns it, replacing masked lanes with values from the passthrough vector. * [​`masked_store`](/mojo/stdlib/sys/intrinsics/masked_store): Stores a value at a memory location, skipping masked lanes. * [​`prefetch`](/mojo/stdlib/sys/intrinsics/prefetch): Prefetches an instruction or data into cache before it is used. * [​`readfirstlane`](/mojo/stdlib/sys/intrinsics/readfirstlane): Get the value in the lowest active lane of the input operand. * [​`scatter`](/mojo/stdlib/sys/intrinsics/scatter): Takes scalar values from a SIMD vector and `scatters` them into a vector of pointers. * [​`sendmsg`](/mojo/stdlib/sys/intrinsics/sendmsg): Send a message to fixed function hardware. Refer to the specific ISA manual for the ops and messages. * [​`strided_load`](/mojo/stdlib/sys/intrinsics/strided_load): Loads values from addr according to a specific stride. * [​`strided_store`](/mojo/stdlib/sys/intrinsics/strided_store): Stores values to addr according to a specific stride. * [​`unlikely`](/mojo/stdlib/sys/intrinsics/unlikely): Provides information that the most probable value of `val` is going to be `False`. This information can be used by optimizers.
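For instance, the branch-probability hints `likely` and `unlikely` (documented below) wrap a `Bool` and return it unchanged while informing the optimizer which branch is expected. A minimal sketch (the function and its fast path are illustrative):

```mojo
from sys import likely

fn clamp_non_negative(x: Int) -> Int:
    # Hint that the fast path (x >= 0) is taken most of the time.
    if likely(x >= 0):
        return x
    return 0

def main():
    print(clamp_non_negative(5))   # 5
    print(clamp_non_negative(-3))  # 0
```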
--- ## lane_id `lane_id() -> UInt` Returns the lane ID of the current thread. **Returns:** The lane ID of the current thread. --- ## likely `likely(val: Bool) -> Bool` Provides information that the most probable value of `val` is going to be `True`. This information can be used by optimizers. **Args:** * ​val (`Bool`): The input value which is likely to be `True` most of the time. **Returns:** The input value. --- ## llvm_intrinsic `llvm_intrinsic[intrin: StringSlice[StaticConstantOrigin], type: AnyTrivialRegType, *types: AnyType, *, has_side_effect: Bool = True](*args: *types) -> type` Calls an LLVM intrinsic with the name `intrin` and return type `type`. **Parameters:** * ​intrin (`StringSlice[StaticConstantOrigin]`): The name of the llvm intrinsic. * ​type (`AnyTrivialRegType`): The return type of the intrinsic. * ​\*types (`AnyType`): The argument types for the function. * ​has\_side\_effect (`Bool`): If `True` the intrinsic will have side effects, otherwise it's pure. **Args:** * ​\*args (`*types`): The arguments to the function. **Returns:** The result of calling the llvm intrinsic with the given arguments. --- ## masked_load `masked_load[dtype: DType, //, size: Int](addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=False, origin=origin], mask: SIMD[bool, size], passthrough: SIMD[dtype, size], alignment: Int = 1) -> SIMD[dtype, size]` Loads data from memory and returns it, replacing masked lanes with values from the passthrough vector. **Parameters:** * ​dtype (`DType`): DType of the return SIMD buffer. * ​size (`Int`): Size of the return SIMD buffer. **Args:** * ​addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=False, origin=origin]`): The base pointer for the load. * ​mask (`SIMD[bool, size]`): A binary vector which prevents memory access to certain lanes of the memory stored at addr. * ​passthrough (`SIMD[dtype, size]`): In the result vector, the masked-off lanes are replaced with the passthrough vector. * ​alignment (`Int`): The alignment of the source addresses. Must be 0 or a power of two constant integer value. Default is 1. **Returns:** The loaded memory stored in a vector of type SIMD\[dtype, size]. --- ## masked_store `masked_store[size: Int](value: SIMD[dtype, size], addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], mask: SIMD[bool, size], alignment: Int = 1)` Stores a value at a memory location, skipping masked lanes. **Parameters:** * ​size (`Int`): Size of `value`, the data to store. **Args:** * ​value (`SIMD[dtype, size]`): The vector containing data to store. * ​addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The memory location to store data at. * ​mask (`SIMD[bool, size]`): A binary vector which prevents memory access to certain lanes of `value`. * ​alignment (`Int`): The alignment of the destination locations. Must be 0 or a power of two constant integer value. --- ## prefetch `prefetch[dtype: DType, //, params: PrefetchOptions = PrefetchOptions()](addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin])` Prefetches an instruction or data into cache before it is used. The prefetch function provides prefetching hints for the target to prefetch instruction or data into cache before they are used. **Parameters:** * ​dtype (`DType`): The DType of value stored in addr.
* ​params (`PrefetchOptions`): Configuration options for the prefetch intrinsic. **Args:** * ​addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The data pointer to prefetch. --- ## readfirstlane `readfirstlane(value: SIMD[int32, 1]) -> SIMD[int32, 1]` Get the value in the lowest active lane of the input operand. **Args:** * ​value (`SIMD[int32, 1]`): The input value. **Returns:** The value in the lowest active lane of the input operand. `readfirstlane(value: UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Get the value in the lowest active lane of the input operand. **Args:** * ​value (`UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The input pointer. **Returns:** The value in the lowest active lane of the input operand. `readfirstlane(value: Int) -> Int` Get the value in the lowest active lane of the input operand. **Args:** * ​value (`Int`): The input value. **Returns:** The value in the lowest active lane of the input operand. --- ## scatter `scatter[dtype: DType, size: Int, //](value: SIMD[dtype, size], owned base: SIMD[index, size], mask: SIMD[bool, size], alignment: Int = 0)` Takes scalar values from a SIMD vector and `scatters` them into a vector of pointers. The scatter operation stores scalar values from a SIMD vector into a vector of memory locations. The memory locations are provided in the vector of pointers `base` as addresses. The memory is stored according to the provided mask. The mask holds a bit for each vector lane, and is used to prevent memory accesses to the masked-off lanes. The `value` operand is a vector value to be written to memory. The `base` operand is a vector of pointers, pointing to where the value elements should be stored. It has the same underlying type as the value operand. The `mask` operand is a vector of boolean values. The types of the `mask` and the `value` operand must have the same number of vector elements. Scatter with overlapping addresses is guaranteed to be ordered from least-significant to most-significant element. In general, for some vector `value`, vector of pointers `base`, and mask `mask` a call of the form:

```mojo
scatter(value, base, mask)
```

is equivalent to the following sequence of scalar stores in C++:

```cpp
for (int i = 0; i < size; i++)
    if (mask[i])
        *base[i] = value[i];
```

**Parameters:** * ​dtype (`DType`): DType of `value`, the vector to scatter. * ​size (`Int`): Size of `value`, the vector to scatter. **Args:** * ​value (`SIMD[dtype, size]`): The vector containing the values to scatter. * ​base (`SIMD[index, size]`): The vector containing memory addresses that scatter will access. * ​mask (`SIMD[bool, size]`): A binary vector which prevents memory access to certain lanes of the base vector. * ​alignment (`Int`): The alignment of the destination addresses. Must be 0 or a power of two constant integer value. --- ## sendmsg `sendmsg(opcode: SIMD[int32, 1], msg: SIMD[int32, 1])` Send a message to fixed function hardware. Refer to the specific ISA manual for the ops and messages. **Args:** * ​opcode (`SIMD[int32, 1]`): The operation to perform. * ​msg (`SIMD[int32, 1]`): The message to send.
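To make the masking semantics concrete, here is a minimal sketch using `masked_load` from above (the buffer contents and mask pattern are illustrative):

```mojo
from sys.intrinsics import masked_load
from memory import UnsafePointer

def main():
    # A small buffer with known contents: [0, 1, 2, 3].
    var ptr = UnsafePointer[Float32].alloc(4)
    for i in range(4):
        ptr[i] = Float32(i)

    # Load only the even lanes; masked-off lanes take the passthrough value.
    var mask = SIMD[DType.bool, 4](True, False, True, False)
    var passthrough = SIMD[DType.float32, 4](-1)
    print(masked_load[4](ptr, mask, passthrough))  # [0.0, -1.0, 2.0, -1.0]

    ptr.free()
```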
--- ## strided_load `strided_load[dtype: DType, //, simd_width: Int, *, invariant: Bool = False](addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=False, origin=origin], stride: Int, mask: SIMD[bool, simd_width] = SIMD(True)) -> SIMD[dtype, simd_width]` Loads values from addr according to a specific stride. **Parameters:** * ​dtype (`DType`): DType of the values to load. * ​simd\_width (`Int`): The width of the SIMD vectors. * ​invariant (`Bool`): Whether the memory is load invariant. **Args:** * ​addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=False, origin=origin]`): The memory location to load data from. * ​stride (`Int`): How many lanes to skip before loading again. * ​mask (`SIMD[bool, simd_width]`): A binary vector which prevents memory access to certain lanes. **Returns:** A vector containing the loaded data. --- ## strided_store `strided_store[dtype: DType, //, simd_width: Int](value: SIMD[dtype, simd_width], addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], stride: Int, mask: SIMD[bool, simd_width] = SIMD(True))` Stores values to addr according to a specific stride. **Parameters:** * ​dtype (`DType`): DType of `value`, the value to store. * ​simd\_width (`Int`): The width of the SIMD vectors. **Args:** * ​value (`SIMD[dtype, simd_width]`): The values to store. * ​addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The location to store values at. * ​stride (`Int`): How many lanes to skip before storing again. * ​mask (`SIMD[bool, simd_width]`): A binary vector which prevents memory access to certain lanes of `value`. --- ## unlikely `unlikely(val: Bool) -> Bool` Provides information that the most probable value of `val` is going to be `False`. This information can be used by optimizers. **Args:** * ​val (`Bool`): The input value which is likely to be `False` most of the time. **Returns:** The input value. --- ## env_get_bool `env_get_bool[name: StringSlice[StaticConstantOrigin]]() -> Bool` Try to get a boolean-valued define. Compilation fails if the name is not defined or the value is neither `True` nor `False`. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the define. **Returns:** A boolean parameter value. `env_get_bool[name: StringSlice[StaticConstantOrigin], default: Bool]() -> Bool` Try to get a bool-valued define. If the name is not defined, return a default value instead. The boolean must be either `True` or `False`. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the define. * ​default (`Bool`): The default value to use. **Returns:** A bool parameter value. --- ## env_get_dtype `env_get_dtype[name: StringSlice[StaticConstantOrigin], default: DType]() -> DType` Try to get a DType-valued define. If the name is not defined, return a default value instead. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the define. * ​default (`DType`): The default value to use. **Returns:** A DType parameter value. --- ## env_get_int `env_get_int[name: StringSlice[StaticConstantOrigin]]() -> Int` Try to get an integer-valued define. Compilation fails if the name is not defined. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the define. **Returns:** An integer parameter value.
`env_get_int[name: StringSlice[StaticConstantOrigin], default: Int]() -> Int` Try to get an integer-valued define. If the name is not defined, return a default value instead. Example:

```mojo
from sys.param_env import env_get_int

def main():
    alias number = env_get_int[
        "favorite_number",
        1  # Default value
    ]()
    parametrized[number]()

fn parametrized[num: Int]():
    print(num)
```

If the program is `app.mojo`: * `mojo run -D favorite_number=2 app.mojo` * `mojo run app.mojo` (uses the default value) Note: useful for parameterizing SIMD vector sizes. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the define. * ​default (`Int`): The default value to use. **Returns:** An integer parameter value. --- ## env_get_string `env_get_string[name: StringSlice[StaticConstantOrigin]]() -> StringSlice[StaticConstantOrigin]` Try to get a string-valued define. Compilation fails if the name is not defined. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the define. **Returns:** A string parameter value. `env_get_string[name: StringSlice[StaticConstantOrigin], default: StringSlice[StaticConstantOrigin]]() -> StringSlice[StaticConstantOrigin]` Try to get a string-valued define. If the name is not defined, return a default value instead. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the define. * ​default (`StringSlice[StaticConstantOrigin]`): The default value to use. **Returns:** A string parameter value. --- ## param_env Implements functions for retrieving compile-time defines. You can use these functions to set parameter values or runtime constants based on name-value pairs defined on the command line. For example:

```mojo
from sys import is_defined

alias float_type = DType.float32 if is_defined["FLOAT32"]() else DType.float64

# Use `float_type` as a constant.
```

And on the command line:

```
mojo -D FLOAT32 main.mojo
```

For more information, see the [Mojo build docs](/mojo/cli/build.html#d-keyvalue). The `mojo run` command also supports the `-D` option. You can import these APIs from the `sys` package. For example:

```mojo
from sys import is_defined
```

## Functions * [​`env_get_bool`](/mojo/stdlib/sys/param_env/env_get_bool): Try to get a boolean-valued define. Compilation fails if the name is not defined or the value is neither `True` nor `False`. * [​`env_get_dtype`](/mojo/stdlib/sys/param_env/env_get_dtype): Try to get a DType-valued define. If the name is not defined, return a default value instead. * [​`env_get_int`](/mojo/stdlib/sys/param_env/env_get_int): Try to get an integer-valued define. Compilation fails if the name is not defined. * [​`env_get_string`](/mojo/stdlib/sys/param_env/env_get_string): Try to get a string-valued define. Compilation fails if the name is not defined. * [​`is_defined`](/mojo/stdlib/sys/param_env/is_defined): Returns True if the named value is defined. --- ## is_defined `is_defined[name: StringSlice[StaticConstantOrigin]]() -> Bool` Returns True if the named value is defined. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name to test. **Returns:** True if the name is defined. --- ## exit `exit()` Exits from Mojo. Unlike the Python implementation this does not raise an exception to exit. `exit[intable: Intable](code: intable)` Exits from Mojo. Unlike the Python implementation this does not raise an exception to exit. **Parameters:** * ​intable (`Intable`): The type of the exit code. **Args:** * ​code (`intable`): The exit code. --- ## terminate This module includes the exit functions.
## Functions * [​`exit`](/mojo/stdlib/sys/terminate/exit): Exits from Mojo. Unlike the Python implementation this does not raise an exception to exit. --- ## tempfile Implements the tempfile package. ## Modules * [​`tempfile`](/mojo/stdlib/tempfile/tempfile/): Implements tempfile methods. --- ## NamedTemporaryFile `struct NamedTemporaryFile` A handle to a temporary file. Example:

```mojo
from tempfile import NamedTemporaryFile
from pathlib import Path

def main():
    var p: Path
    with NamedTemporaryFile(mode="rw") as f:
        p = f.name
        f.write("Hello world!")
        f.seek(0)
        print(
            f.read() == "Hello world!"
        )
    print(String(p), p.exists())  # Removed by default
```

Note: `NamedTemporaryFile.__init__` documents the arguments. ## Fields * ​name (`String`): Name of the file. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, mode: String = __init__[__mlir_type.!kgen.string]("w"), name: Optional[String] = Optional(None), suffix: String = __init__[__mlir_type.!kgen.string](""), prefix: String = __init__[__mlir_type.!kgen.string]("tmp"), dir: Optional[String] = Optional(None), delete: Bool = True)` Create a named temporary file. This is a wrapper around a `FileHandle`; `os.remove()` is called in the `close()` method if `delete` is True. Can be used as a context manager. When used as a context manager, `close()` is called when the context manager exits. **Args:** * ​mode (`String`): The mode to open the file in (the mode can be "r" or "w"). * ​name (`Optional[String]`): The name of the temp file. If it is unspecified, then a random name will be provided. * ​suffix (`String`): Suffix to use for the file name if name is not provided. * ​prefix (`String`): Prefix to use for the file name if name is not provided. * ​dir (`Optional[String]`): Directory in which the file will be created. * ​delete (`Bool`): Whether the file is deleted on close. ### `__moveinit__` `__moveinit__(out self, owned existing: Self)` Move constructor for the file handle. **Args:** * ​existing (`Self`): The existing file handle. ### `__del__` `__del__(owned self)` Closes the file handle. ### `close` `close(mut self)` Closes the file handle. ### `read` `read(self, size: Int = -1) -> String` Reads the data from the file. **Args:** * ​size (`Int`): Requested number of bytes to read. **Returns:** The contents of the file. ### `read_bytes` `read_bytes(self, size: Int = -1) -> List[SIMD[uint8, 1]]` Read from file buffer until we have `size` characters or we hit EOF. If `size` is negative or omitted, read until EOF. **Args:** * ​size (`Int`): Requested number of bytes to read. **Returns:** The contents of the file. ### `seek` `seek(self, offset: SIMD[uint64, 1], whence: SIMD[uint8, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> SIMD[uint64, 1]` Seeks to the given offset in the file. **Args:** * ​offset (`SIMD[uint64, 1]`): The byte offset to seek to from the start of the file. * ​whence (`SIMD[uint8, 1]`): The reference point for the offset: os.SEEK\_SET = 0: start of file (Default). os.SEEK\_CUR = 1: current position. os.SEEK\_END = 2: end of file. **Returns:** The resulting byte offset from the start of the file. **Raises:** An error if this file handle is invalid, or if file seek returned a failure. ### `write` `write[*Ts: Writable](mut self, *args: *Ts)` Write a sequence of Writable arguments to the provided Writer. **Parameters:** * ​\*Ts (`Writable`): Types of the provided argument sequence. **Args:** * ​\*args (`*Ts`): Sequence of arguments to write to this Writer.
### `write_bytes` `write_bytes(mut self, bytes: Span[SIMD[uint8, 1], origin])` Write a span of bytes to the file. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The byte span to write to this file. ### `__enter__` `__enter__(owned self) -> Self` The function to call when entering the context. **Returns:** The file handle. --- ## TemporaryDirectory `struct TemporaryDirectory` A temporary directory. ## Fields * ​name (`String`): The name of the temporary directory. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, suffix: String = __init__[__mlir_type.!kgen.string](""), prefix: String = __init__[__mlir_type.!kgen.string]("tmp"), dir: Optional[String] = Optional(None), ignore_cleanup_errors: Bool = False)` Create a temporary directory. Can be used as a context manager. When used as a context manager, the directory is removed when the context manager exits. **Args:** * ​suffix (`String`): Suffix to use for the directory name. * ​prefix (`String`): Prefix to use for the directory name. * ​dir (`Optional[String]`): Directory in which the directory will be created. * ​ignore\_cleanup\_errors (`Bool`): Whether to ignore cleanup errors. ### `__enter__` `__enter__(self) -> String` The function to call when entering the context. **Returns:** The temporary directory name. ### `__exit__` `__exit__(self)` Called when exiting the context with no error. `__exit__(self, err: Error) -> Bool` Called when exiting the context with an error. **Args:** * ​err (`Error`): The error raised inside the context. **Returns:** True if the temporary directory was removed successfully. --- ## gettempdir `gettempdir() -> Optional[String]` Return the default directory to use for temporary files. **Returns:** The name of the default temporary directory. --- ## tempfile Implements tempfile methods. You can import a method from the `tempfile` package. For example: ```mojo from tempfile import gettempdir ``` ## Aliases ### `TMP_MAX` `alias TMP_MAX = 10000` ## Structs * [​`NamedTemporaryFile`](/mojo/stdlib/tempfile/tempfile/NamedTemporaryFile): A handle to a temporary file. * [​`TemporaryDirectory`](/mojo/stdlib/tempfile/tempfile/TemporaryDirectory): A temporary directory. ## Functions * [​`gettempdir`](/mojo/stdlib/tempfile/tempfile/gettempdir): Return the default directory to use for temporary files. * [​`mkdtemp`](/mojo/stdlib/tempfile/tempfile/mkdtemp): Create a temporary directory. Caller is responsible for deleting the directory when done with it. --- ## mkdtemp `mkdtemp(suffix: String = __init__[__mlir_type.!kgen.string](""), prefix: String = __init__[__mlir_type.!kgen.string]("tmp"), dir: Optional[String] = Optional(None)) -> String` Create a temporary directory. Caller is responsible for deleting the directory when done with it. **Args:** * ​suffix (`String`): Suffix to use for the directory name. * ​prefix (`String`): Prefix to use for the directory name. * ​dir (`Optional[String]`): Directory in which the directory will be created. **Returns:** The name of the created directory. **Raises:** If the directory can not be created. --- ## testing Implements the testing package. ## Modules * [​`testing`](/mojo/stdlib/testing/testing/): Implements various testing utils. 
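The assertion functions documented below follow a common pattern: each raises an `Error` on failure and returns nothing on success, so they compose naturally with `try`/`except` or a test runner. A minimal sketch (the checked values are illustrative):

```mojo
from testing import assert_equal, assert_true, assert_raises

def main():
    assert_equal(2 + 2, 4)
    assert_true(3 > 2, msg="ordering should hold")

    # The block must raise an error containing the given text.
    with assert_raises(contains="SomeError"):
        raise "SomeError"

    print("all assertions passed")
```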
--- ## assert_almost_equal `assert_almost_equal[dtype: DType, size: Int](lhs: SIMD[dtype, size], rhs: SIMD[dtype, size], msg: String = __init__[__mlir_type.!kgen.string](""), *, atol: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1.0E-8), rtol: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1.0000000000000001E-5), equal_nan: Bool = False, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are equal up to a tolerance. If it is not then an Error is raised. When the type is boolean or integral, then equality is checked. When the type is floating-point, then this checks if the two input values are numerically close using the $abs(lhs - rhs) <= max(rtol * max(abs(lhs), abs(rhs)), atol)$ formula. **Parameters:** * ​dtype (`DType`): The dtype of the left- and right-hand-side SIMD vectors. * ​size (`Int`): The width of the left- and right-hand-side SIMD vectors. **Args:** * ​lhs (`SIMD[dtype, size]`): The lhs of the equality. * ​rhs (`SIMD[dtype, size]`): The rhs of the equality. * ​msg (`String`): The message to print. * ​atol (`SIMD[float64, 1]`): The absolute tolerance. * ​rtol (`SIMD[float64, 1]`): The relative tolerance. * ​equal\_nan (`Bool`): Whether to treat NaNs as equal. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. --- ## assert_equal `assert_equal[T: EqualityComparable & Stringable, //](lhs: T, rhs: T, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are equal. If it is not then an Error is raised. **Parameters:** * ​T (`EqualityComparable & Stringable`): The type of the input values. **Args:** * ​lhs (`T`): The lhs of the equality. * ​rhs (`T`): The rhs of the equality. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_equal(lhs: String, rhs: String, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are equal. If it is not then an Error is raised. **Args:** * ​lhs (`String`): The lhs of the equality. * ​rhs (`String`): The rhs of the equality. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_equal[dtype: DType, size: Int](lhs: SIMD[dtype, size], rhs: SIMD[dtype, size], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are equal. If it is not then an Error is raised. **Parameters:** * ​dtype (`DType`): The dtype of the left- and right-hand-side SIMD vectors. * ​size (`Int`): The width of the left- and right-hand-side SIMD vectors. **Args:** * ​lhs (`SIMD[dtype, size]`): The lhs of the equality. * ​rhs (`SIMD[dtype, size]`): The rhs of the equality. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise.
`assert_equal[T: Copyable & Movable & EqualityComparable & Representable, //](lhs: List[T], rhs: List[T], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that two lists are equal. **Parameters:** * ​T (`Copyable & Movable & EqualityComparable & Representable`): The type of the elements in the lists. **Args:** * ​lhs (`List[T]`): The left-hand side list. * ​rhs (`List[T]`): The right-hand side list. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_equal[O1: ImmutableOrigin, O2: ImmutableOrigin](lhs: List[StringSlice[O1]], rhs: List[StringSlice[O2]], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that two lists are equal. **Parameters:** * ​O1 (`ImmutableOrigin`): The origin of lhs. * ​O2 (`ImmutableOrigin`): The origin of rhs. **Args:** * ​lhs (`List[StringSlice[O1]]`): The left-hand side list. * ​rhs (`List[StringSlice[O2]]`): The right-hand side list. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_equal[D: DType](lhs: List[SIMD[D, 1]], rhs: List[SIMD[D, 1]], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that two lists are equal. **Parameters:** * ​D (`DType`): A DType. **Args:** * ​lhs (`List[SIMD[D, 1]]`): The left-hand side list. * ​rhs (`List[SIMD[D, 1]]`): The right-hand side list. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_equal(lhs: PythonObject, rhs: PythonObject, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are equal. If it is not then an Error is raised. **Args:** * ​lhs (`PythonObject`): The lhs of the equality. * ​rhs (`PythonObject`): The rhs of the equality. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails. --- ## assert_false `assert_false[T: Boolable, //](val: T, msg: String = __init__[__mlir_type.!kgen.string]("condition was unexpectedly True"), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input value is False and raises an Error if it's not. **Parameters:** * ​T (`Boolable`): The type of the value argument. **Args:** * ​val (`T`): The value to assert to be False. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise.
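As with `assert_true`, the default message surfaces in the raised `Error`. A minimal sketch that catches and prints it (the arithmetic is illustrative):

```mojo
from testing import assert_false

def main():
    try:
        assert_false(1 + 1 == 2)
    except e:
        # The error message includes "condition was unexpectedly True".
        print(e)
```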
--- ## assert_is `assert_is[T: Stringable & Identifiable](lhs: T, rhs: T, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values have the same identity. If they do not then an Error is raised. **Parameters:** * ​T (`Stringable & Identifiable`): A Stringable and Identifiable type. **Args:** * ​lhs (`T`): The lhs of the `is` statement. * ​rhs (`T`): The rhs of the `is` statement. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. --- ## assert_is_not `assert_is_not[T: Stringable & Identifiable](lhs: T, rhs: T, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values have different identities. If they do not then an Error is raised. **Parameters:** * ​T (`Stringable & Identifiable`): A Stringable and Identifiable type. **Args:** * ​lhs (`T`): The lhs of the `is not` statement. * ​rhs (`T`): The rhs of the `is not` statement. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. --- ## assert_not_equal `assert_not_equal[T: EqualityComparable & Stringable, //](lhs: T, rhs: T, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are not equal. If they are equal then an Error is raised. **Parameters:** * ​T (`EqualityComparable & Stringable`): The type of the input values. **Args:** * ​lhs (`T`): The lhs of the inequality. * ​rhs (`T`): The rhs of the inequality. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_not_equal(lhs: String, rhs: String, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are not equal. If they are equal then an Error is raised. **Args:** * ​lhs (`String`): The lhs of the inequality. * ​rhs (`String`): The rhs of the inequality. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_not_equal[dtype: DType, size: Int](lhs: SIMD[dtype, size], rhs: SIMD[dtype, size], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are not equal. If they are equal then an Error is raised. **Parameters:** * ​dtype (`DType`): The dtype of the left- and right-hand-side SIMD vectors. * ​size (`Int`): The width of the left- and right-hand-side SIMD vectors. **Args:** * ​lhs (`SIMD[dtype, size]`): The lhs of the inequality. * ​rhs (`SIMD[dtype, size]`): The rhs of the inequality. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`).
**Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_not_equal[T: Copyable & Movable & EqualityComparable & Representable, //](lhs: List[T], rhs: List[T], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that two lists are not equal. **Parameters:** * ​T (`Copyable & Movable & EqualityComparable & Representable`): The type of the elements in the lists. **Args:** * ​lhs (`List[T]`): The left-hand side list. * ​rhs (`List[T]`): The right-hand side list. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. --- ## assert_raises `struct assert_raises` Context manager that asserts that the block raises an exception. You can use this to test expected error cases, and to test that the correct errors are raised. For instance:

```mojo
from testing import assert_raises

# Good! Caught the raised error, test passes
with assert_raises():
    raise "SomeError"

# Also good!
with assert_raises(contains="Some"):
    raise "SomeError"

# This will assert, we didn't raise
with assert_raises():
    pass

# This will let the underlying error propagate, failing the test
with assert_raises(contains="Some"):
    raise "OtherError"
```

## Fields * ​message\_contains (`Optional[String]`): If present, check that the error message contains this literal string. * ​call\_location (`_SourceLocation`): Assigned the value returned by `__call_location()` at `Self.__init__`. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, *, location: Optional[_SourceLocation] = Optional(None))` Construct a context manager with no message pattern. **Args:** * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). `__init__(out self, *, contains: String, location: Optional[_SourceLocation] = Optional(None))` Construct a context manager matching specific errors. **Args:** * ​contains (`String`): The test will only pass if the error message includes the literal text passed. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). ### `__enter__` `__enter__(self)` Enter the context manager. ### `__exit__` `__exit__(self)` Exit the context manager with no error. **Raises:** AssertionError: Always. The block must raise to pass the test. `__exit__(self, error: Error) -> Bool` Exit the context manager with an error. **Args:** * ​error (`Error`): The error raised. **Returns:** True if the error message contained the expected string. **Raises:** Error: If the error raised doesn't include the expected string. --- ## assert_true `assert_true[T: Boolable, //](val: T, msg: String = __init__[__mlir_type.!kgen.string]("condition was unexpectedly False"), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input value is True and raises an Error if it's not. **Parameters:** * ​T (`Boolable`): The type of the value argument. **Args:** * ​val (`T`): The value to assert to be True. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. --- ## testing Implements various testing utils.
You can import these APIs from the `testing` package. For example:

```mojo
from testing import assert_true

def main():
    x = 1
    y = 2
    try:
        assert_true(x==1)
        assert_true(y==2)
        assert_true((x+y)==3)
        print("All assertions succeeded")
    except e:
        print("At least one assertion failed:")
        print(e)
```

## Structs * [​`assert_raises`](/mojo/stdlib/testing/testing/assert_raises): Context manager that asserts that the block raises an exception. ## Functions * [​`assert_almost_equal`](/mojo/stdlib/testing/testing/assert_almost_equal): Asserts that the input values are equal up to a tolerance. If it is not then an Error is raised. * [​`assert_equal`](/mojo/stdlib/testing/testing/assert_equal): Asserts that the input values are equal. If it is not then an Error is raised. * [​`assert_false`](/mojo/stdlib/testing/testing/assert_false): Asserts that the input value is False and raises an Error if it's not. * [​`assert_is`](/mojo/stdlib/testing/testing/assert_is): Asserts that the input values have the same identity. If they do not then an Error is raised. * [​`assert_is_not`](/mojo/stdlib/testing/testing/assert_is_not): Asserts that the input values have different identities. If they do not then an Error is raised. * [​`assert_not_equal`](/mojo/stdlib/testing/testing/assert_not_equal): Asserts that the input values are not equal. If they are equal then an Error is raised. * [​`assert_true`](/mojo/stdlib/testing/testing/assert_true): Asserts that the input value is True and raises an Error if it's not. --- ## time Implements the time package. ## Modules * [​`time`](/mojo/stdlib/time/time/): Implements basic utils for working with time. --- ## time Implements basic utils for working with time. You can import these APIs from the `time` package. For example:

```mojo
from time import perf_counter_ns
```

## Functions * [​`monotonic`](/mojo/stdlib/time/time/monotonic): Returns the current monotonic time in nanoseconds. This function queries the current platform's monotonic clock, making it useful for measuring time differences, but the significance of the returned value varies depending on the underlying implementation. * [​`perf_counter`](/mojo/stdlib/time/time/perf_counter): Return the value (in fractional seconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide. The reference point of the returned value is undefined, so that only the difference between the results of two calls is valid. * [​`perf_counter_ns`](/mojo/stdlib/time/time/perf_counter_ns): Return the value (in nanoseconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide. The reference point of the returned value is undefined, so that only the difference between the results of two calls is valid. * [​`sleep`](/mojo/stdlib/time/time/sleep): Suspends the current thread for the seconds specified. * [​`time_function`](/mojo/stdlib/time/time/time_function): Measures the time spent in the function. --- ## monotonic `monotonic() -> UInt` Returns the current monotonic time in nanoseconds. This function queries the current platform's monotonic clock, making it useful for measuring time differences, but the significance of the returned value varies depending on the underlying implementation. **Returns:** The current time in ns.
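A minimal timing sketch using the nanosecond counter described below (the sleep duration is illustrative):

```mojo
from time import perf_counter_ns, sleep

def main():
    var start = perf_counter_ns()
    sleep(0.01)  # suspend for roughly 10 ms
    var elapsed = perf_counter_ns() - start
    # Only the difference between two calls is meaningful.
    print("elapsed ns:", elapsed)
```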
--- ## perf_counter `perf_counter() -> SIMD[float64, 1]` Return the value (in fractional seconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide. The reference point of the returned value is undefined, so that only the difference between the results of two calls is valid. **Returns:** The current time in fractional seconds. --- ## perf_counter_ns `perf_counter_ns() -> UInt` Return the value (in nanoseconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide. The reference point of the returned value is undefined, so that only the difference between the results of two calls is valid. **Returns:** The current time in ns. --- ## sleep `sleep(sec: SIMD[float64, 1])` Suspends the current thread for the seconds specified. **Args:** * ​sec (`SIMD[float64, 1]`): The number of seconds to sleep for. `sleep(sec: UInt)` Suspends the current thread for the seconds specified. **Args:** * ​sec (`UInt`): The number of seconds to sleep for. --- ## time_function `time_function[: origin.set, //, func: fn() raises capturing -> None]() -> UInt` Measures the time spent in the function. **Parameters:** * ​func (`fn() raises capturing -> None`): The function to time. **Returns:** The time elapsed in the function in ns. `time_function[: origin.set, //, func: fn() capturing -> None]() -> UInt` Measures the time spent in the function. **Parameters:** * ​func (`fn() capturing -> None`): The function to time. **Returns:** The time elapsed in the function in ns. --- ## utils Implements the utils package. ## Modules * [​`index`](/mojo/stdlib/utils/index_/): Implements `IndexList` which is commonly used to represent N-D indices. * [​`lock`](/mojo/stdlib/utils/lock/): * [​`numerics`](/mojo/stdlib/utils/numerics/): Defines utilities to work with numeric types. * [​`static_tuple`](/mojo/stdlib/utils/static_tuple/): Implements StaticTuple, a statically-sized uniform container. * [​`variant`](/mojo/stdlib/utils/variant/): Defines a Variant type. * [​`write`](/mojo/stdlib/utils/write/): Establishes the contract between `Writer` and `Writable` types. --- ## Index `Index[T0: Intable, //, *, dtype: DType = int64](x: T0) -> IndexList[1, element_type=dtype]` Constructs a 1-D Index from the given value. **Parameters:** * ​T0 (`Intable`): The type of the 1st argument. * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`T0`): The initial value. **Returns:** The constructed IndexList. `Index[*, dtype: DType = int64](x: UInt) -> IndexList[1, element_type=dtype]` Constructs a 1-D Index from the given value. **Parameters:** * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`UInt`): The initial value. **Returns:** The constructed IndexList. `Index[T0: Intable, T1: Intable, //, *, dtype: DType = int64](x: T0, y: T1) -> IndexList[2, element_type=dtype]` Constructs a 2-D Index from the given values. **Parameters:** * ​T0 (`Intable`): The type of the 1st argument. * ​T1 (`Intable`): The type of the 2nd argument. * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`T0`): The 1st initial value. * ​y (`T1`): The 2nd initial value. **Returns:** The constructed IndexList. `Index[*, dtype: DType = int64](x: UInt, y: UInt) -> IndexList[2, element_type=dtype]` Constructs a 2-D Index from the given values.
**Parameters:** * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`UInt`): The 1st initial value. * ​y (`UInt`): The 2nd initial value. **Returns:** The constructed IndexList. `Index[T0: Intable, T1: Intable, T2: Intable, //, *, dtype: DType = int64](x: T0, y: T1, z: T2) -> IndexList[3, element_type=dtype]` Constructs a 3-D Index from the given values. **Parameters:** * ​T0 (`Intable`): The type of the 1st argument. * ​T1 (`Intable`): The type of the 2nd argument. * ​T2 (`Intable`): The type of the 3rd argument. * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`T0`): The 1st initial value. * ​y (`T1`): The 2nd initial value. * ​z (`T2`): The 3rd initial value. **Returns:** The constructed IndexList. `Index[T0: Intable, T1: Intable, T2: Intable, T3: Intable, //, *, dtype: DType = int64](x: T0, y: T1, z: T2, w: T3) -> IndexList[4, element_type=dtype]` Constructs a 4-D Index from the given values. **Parameters:** * ​T0 (`Intable`): The type of the 1st argument. * ​T1 (`Intable`): The type of the 2nd argument. * ​T2 (`Intable`): The type of the 3rd argument. * ​T3 (`Intable`): The type of the 4th argument. * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`T0`): The 1st initial value. * ​y (`T1`): The 2nd initial value. * ​z (`T2`): The 3rd initial value. * ​w (`T3`): The 4th initial value. **Returns:** The constructed IndexList. `Index[T0: Intable, T1: Intable, T2: Intable, T3: Intable, T4: Intable, //, *, dtype: DType = int64](x: T0, y: T1, z: T2, w: T3, v: T4) -> IndexList[5, element_type=dtype]` Constructs a 5-D Index from the given values. **Parameters:** * ​T0 (`Intable`): The type of the 1st argument. * ​T1 (`Intable`): The type of the 2nd argument. * ​T2 (`Intable`): The type of the 3rd argument. * ​T3 (`Intable`): The type of the 4th argument. * ​T4 (`Intable`): The type of the 5th argument. * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`T0`): The 1st initial value. * ​y (`T1`): The 2nd initial value. * ​z (`T2`): The 3rd initial value. * ​w (`T3`): The 4th initial value. * ​v (`T4`): The 5th initial value. **Returns:** The constructed IndexList. --- ## IndexList `@register_passable(trivial)` `struct IndexList[size: Int, *, element_type: DType = int64]` A base struct that implements size agnostic index functions. ## Parameters * ​size (`Int`): The size of the tuple. * ​element\_type (`DType`): The underlying dtype of the integer element value. ## Fields * ​data (`StaticTuple[SIMD[element_type, 1], size]`): The underlying storage of the tuple value. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `EqualityComparable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Methods ### `__init__` `__init__() -> Self` Constructs a static int tuple of the given size. `@implicit` `__init__(data: StaticTuple[SIMD[element_type, 1], size]) -> Self` Constructs a static int tuple of the given size. **Args:** * ​data (`StaticTuple[SIMD[element_type, 1], size]`): The StaticTuple to construct the IndexList from. `@implicit` `__init__(elems: Tuple[Int, Int]) -> Self` Constructs a static int tuple given a tuple of integers. **Args:** * ​elems (`Tuple[Int, Int]`): The tuple to copy from. `@implicit` `__init__(elems: Tuple[Int, Int, Int]) -> Self` Constructs a static int tuple given a tuple of integers. 
**Args:** * ​elems (`Tuple[Int, Int, Int]`): The tuple to copy from. `@implicit` `__init__(elems: Tuple[Int, Int, Int, Int]) -> Self` Constructs a static int tuple given a tuple of integers. **Args:** * ​elems (`Tuple[Int, Int, Int, Int]`): The tuple to copy from. `@implicit` `__init__(*elems: Int, *, __list_literal__: Tuple[] = Tuple()) -> Self` Constructs a static int tuple given a set of arguments. **Args:** * ​\*elems (`Int`): The elements to construct the tuple. * ​\_\_list\_literal\_\_ (`Tuple[]`): Specifies that this constructor can be used for list literals. `@implicit` `__init__(elem: Int) -> Self` Constructs a static int tuple by splatting a single value to all elements. **Args:** * ​elem (`Int`): The value to splat into the tuple. `__init__(*, other: Self) -> Self` Copy constructor. **Args:** * ​other (`Self`): The other tuple to copy from. `@implicit` `__init__(values: VariadicList[Int]) -> Self` Creates a tuple constant using the specified values. **Args:** * ​values (`VariadicList[Int]`): The list of values. ### `__getitem__` `__getitem__[idx: Int](self) -> Int` Gets an element from the tuple by index. **Parameters:** * ​idx (`Int`): The element index. **Returns:** The tuple element value. `__getitem__[I: Indexer](self, idx: I) -> Int` Gets an element from the tuple by index. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The element index. **Returns:** The tuple element value. ### `__setitem__` `__setitem__[idx: Int](mut self, val: Int)` Sets an element in the tuple at the given static index. **Parameters:** * ​idx (`Int`): The element index. **Args:** * ​val (`Int`): The value to store. `__setitem__[idx: Int](mut self, val: SIMD[element_type, 1])` Sets an element in the tuple at the given static index. **Parameters:** * ​idx (`Int`): The element index. **Args:** * ​val (`SIMD[element_type, 1]`): The value to store. `__setitem__(mut self, idx: Int, val: Int)` Sets an element in the tuple at the given index. **Args:** * ​idx (`Int`): The element index. * ​val (`Int`): The value to store. ### `__lt__` `__lt__(self, rhs: Self) -> Bool` Compares this tuple to another tuple using LT comparison. A tuple is less than another tuple if all corresponding elements of lhs are less than those of rhs. Note: This is **not** a lexical comparison. **Args:** * ​rhs (`Self`): Right hand side tuple. **Returns:** The comparison result. ### `__le__` `__le__(self, rhs: Self) -> Bool` Compares this tuple to another tuple using LE comparison. A tuple is less than or equal to another tuple if all corresponding elements of lhs are less than or equal to those of rhs. Note: This is **not** a lexical comparison. **Args:** * ​rhs (`Self`): Right hand side tuple. **Returns:** The comparison result. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compares this tuple to another tuple for equality. The tuples are equal if all corresponding elements are equal. **Args:** * ​rhs (`Self`): The other tuple. **Returns:** The comparison result. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Compares this tuple to another tuple for non-equality. The tuples are non-equal if at least one element of LHS isn't equal to the corresponding element from RHS. **Args:** * ​rhs (`Self`): The other tuple. **Returns:** The comparison result. ### `__gt__` `__gt__(self, rhs: Self) -> Bool` Compares this tuple to another tuple using GT comparison. A tuple is greater than another tuple if all corresponding elements of lhs are greater than those of rhs. Note: This is **not** a lexical comparison.
**Args:** * ​rhs (`Self`): Right hand side tuple. **Returns:** The comparison result. ### `__ge__` `__ge__(self, rhs: Self) -> Bool` Compares this tuple to another tuple using GE comparison. A tuple is greater than or equal to another tuple if all corresponding elements of lhs are greater than or equal to those of rhs. Note: This is **not** a lexical comparison. **Args:** * ​rhs (`Self`): Right hand side tuple. **Returns:** The comparison result. ### `__add__` `__add__(self, rhs: Self) -> Self` Performs element-wise integer add. **Args:** * ​rhs (`Self`): Right hand side operand. **Returns:** The resulting index tuple. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Performs element-wise integer subtract. **Args:** * ​rhs (`Self`): Right hand side operand. **Returns:** The resulting index tuple. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Performs element-wise integer multiply. **Args:** * ​rhs (`Self`): Right hand side operand. **Returns:** The resulting index tuple. ### `__floordiv__` `__floordiv__(self, rhs: Self) -> Self` Performs element-wise integer floor division. **Args:** * ​rhs (`Self`): The elementwise divisor. **Returns:** The resulting index tuple. ### `__rfloordiv__` `__rfloordiv__(self, rhs: Self) -> Self` Floor divides rhs by this object. **Args:** * ​rhs (`Self`): The value to elementwise divide by self. **Returns:** The resulting index tuple. ### `__len__` `__len__(self) -> Int` Returns the size of the tuple. **Returns:** The tuple size. ### `as_tuple` `as_tuple(self) -> StaticTuple[Int, size]` Converts this IndexList to StaticTuple. **Returns:** The corresponding StaticTuple object. ### `canonicalize` `canonicalize(self) -> IndexList[size]` Canonicalizes the IndexList. **Returns:** The canonicalized IndexList. ### `flattened_length` `flattened_length(self) -> Int` Returns the flattened length of the tuple. **Returns:** The flattened length of the tuple. ### `remu` `remu(self, rhs: Self) -> Self` Performs element-wise integer unsigned modulo. **Args:** * ​rhs (`Self`): Right hand side operand. **Returns:** The resulting index tuple. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this IndexList value to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__str__` `__str__(self) -> String` Get the tuple as a string. **Returns:** A string representation. ### `cast` `cast[dtype: DType](self) -> IndexList[size, element_type=dtype]` Casts to the target DType. **Parameters:** * ​dtype (`DType`): The dtype to cast towards. **Returns:** The list cast to the target type. ### `__hash__` `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with the underlying bytes. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. --- ## index Implements `IndexList` which is commonly used to represent N-D indices. You can import these APIs from the `utils` package. For example: ```mojo from utils import IndexList ``` ## Structs * [​`IndexList`](/mojo/stdlib/utils/index_/IndexList): A base struct that implements size agnostic index functions. ## Functions * [​`Index`](/mojo/stdlib/utils/index_/Index-function): Constructs a 1-D Index from the given value. * [​`product`](/mojo/stdlib/utils/index_/product): Computes a product of values in the tuple up to the given index.
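The following usage sketch is not part of the original reference; it exercises `Index`, `IndexList`, and the element-wise arithmetic described above. The shape and stride values are arbitrary illustrations:

```mojo
from utils import Index, IndexList

def main():
    # A 3-D shape and its row-major strides (values chosen for illustration).
    var shape = Index(2, 3, 4)
    var strides = IndexList[3](12, 4, 1)
    print(shape + strides)           # element-wise add: 14, 7, 5
    print(shape * strides)           # element-wise multiply: 24, 12, 4
    print(shape.flattened_length())  # 2 * 3 * 4 = 24
```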
--- ## product `product[size: Int](tuple: IndexList[size, element_type=element_type], end_idx: Int = size) -> Int` Computes a product of values in the tuple up to the given index. **Parameters:** * ​size (`Int`): The tuple size. **Args:** * ​tuple (`IndexList[size, element_type=element_type]`): The tuple to get a product of. * ​end\_idx (`Int`): The end index. **Returns:** The product of all tuple elements in the given range. `product[size: Int](tuple: IndexList[size, element_type=element_type], start_idx: Int, end_idx: Int) -> Int` Computes a product of values in the tuple in the given index range. **Parameters:** * ​size (`Int`): The tuple size. **Args:** * ​tuple (`IndexList[size, element_type=element_type]`): The tuple to get a product of. * ​start\_idx (`Int`): The start index of the range. * ​end\_idx (`Int`): The end index of the range. **Returns:** The product of all tuple elements in the given range. --- ## BlockingScopedLock `struct BlockingScopedLock` A scope adapter for BlockingSpinLock. ## Fields * ​lock (`UnsafePointer[BlockingSpinLock]`): The underlying lock instance. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `LockType` `alias LockType = BlockingSpinLock` The type of the lock. ## Methods ### `__init__` `__init__(out self, lock: UnsafePointer[BlockingSpinLock])` Primary constructor. **Args:** * ​lock (`UnsafePointer[BlockingSpinLock]`): A pointer to the underlying lock. `__init__(out self, mut lock: BlockingSpinLock)` Secondary constructor. **Args:** * ​lock (`BlockingSpinLock`): A mutable reference to the underlying lock. ### `__enter__` `__enter__(mut self)` Acquire the lock on entry. This is done by setting the owner of the lock to its own address. ### `__exit__` `__exit__(mut self)` Release the lock on exit. Reset the address on the underlying lock. --- ## BlockingSpinLock `struct BlockingSpinLock` A basic locking implementation that uses an integer to represent the owner of the lock. ## Fields * ​counter (`Atomic[int64]`): The atomic counter implementing the spin lock. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Aliases ### `UNLOCKED` `alias UNLOCKED = -1` A value of -1 means unlocked; any other value is the owner token and means locked. ## Methods ### `__init__` `__init__(out self)` Default constructor. ### `lock` `lock(mut self, owner: Int)` Acquires the lock. **Args:** * ​owner (`Int`): The lock's owner (usually an address). ### `unlock` `unlock(mut self, owner: Int) -> Bool` Releases the lock. **Args:** * ​owner (`Int`): The lock's owner (usually an address). **Returns:** Whether the lock was successfully released. --- ## SpinWaiter `struct SpinWaiter` A proxy for the C++ runtime's SpinWaiter type. ## Fields * ​storage (`UnsafePointer[NoneType]`): Pointer to the underlying SpinWaiter instance. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initializes a SpinWaiter instance. ### `__del__` `__del__(owned self)` Destroys the SpinWaiter instance. ### `wait` `wait(self)` Blocks the current task for a duration determined by the underlying policy. --- ## lock ## Structs * [​`BlockingScopedLock`](/mojo/stdlib/utils/lock/BlockingScopedLock): A scope adapter for BlockingSpinLock. * [​`BlockingSpinLock`](/mojo/stdlib/utils/lock/BlockingSpinLock): A basic locking implementation that uses an integer to represent the owner of the lock. * [​`SpinWaiter`](/mojo/stdlib/utils/lock/SpinWaiter): A proxy for the C++ runtime's SpinWaiter type.
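A minimal sketch of the `BlockingSpinLock` API above, not from the original reference. It assumes the `utils.lock` import path implied by the listing, and the owner token is an arbitrary integer chosen for illustration (an address is more typical):

```mojo
from utils.lock import BlockingSpinLock

def main():
    var lock = BlockingSpinLock()
    var owner = 42  # arbitrary owner token for illustration
    lock.lock(owner)
    # ... critical section ...
    var released = lock.unlock(owner)  # True on successful release
    print("released:", released)
```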
--- ## FPUtils `struct FPUtils[dtype: DType, *, _constraint: NoneType = NoneType(_constrain_fp_type[::DType]())]` Collection of utility functions for working with FP values. **Constraints:** The dtype is floating point. ## Parameters * ​dtype (`DType`): The concrete FP dtype (FP32/FP64/etc). * ​\_constraint (`NoneType`): Implements the constraint. Do not pass explicitly. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `integral_type` `alias integral_type = _integral_type_of[::DType]()` The equivalent integer dtype of the float type. ### `uint_type` `alias uint_type = _unsigned_integral_type_of[::DType]()` The equivalent uint dtype of the float type. ## Methods ### `mantissa_width` `static mantissa_width() -> Int` Returns the mantissa width of a floating point type. **Returns:** The mantissa width. ### `max_exponent` `static max_exponent() -> Int` Returns the max exponent of a floating point dtype without accounting for inf representations. This is not the maximum representable exponent, which is generally equal to the exponent\_bias. **Returns:** The max exponent. ### `exponent_width` `static exponent_width() -> Int` Returns the exponent width of a floating point type. **Returns:** The exponent width. ### `mantissa_mask` `static mantissa_mask() -> Int` Returns the mantissa mask of a floating point type. **Returns:** The mantissa mask. ### `exponent_bias` `static exponent_bias() -> Int` Returns the exponent bias of a floating point type. **Returns:** The exponent bias. ### `sign_mask` `static sign_mask() -> Int` Returns the sign mask of a floating point type. It is computed by `1 << (exponent_width + mantissa_width)`. **Returns:** The sign mask. ### `exponent_mask` `static exponent_mask() -> Int` Returns the exponent mask of a floating point type. It is computed by `~(sign_mask | mantissa_mask)`. **Returns:** The exponent mask. ### `exponent_mantissa_mask` `static exponent_mantissa_mask() -> Int` Returns the exponent and mantissa mask of a floating point type. It is computed by `exponent_mask | mantissa_mask`. **Returns:** The exponent and mantissa mask. ### `quiet_nan_mask` `static quiet_nan_mask() -> Int` Returns the quiet NaN mask for a floating point type. The mask sets all exponent bits plus the most significant mantissa bit, i.e. `exponent_mask | (1 << (mantissa_width - 1))`. **Returns:** The quiet NaN mask. ### `bitcast_to_integer` `static bitcast_to_integer(value: SIMD[dtype, 1]) -> Int` Bitcasts the floating-point value to an integer. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point type. **Returns:** An integer representation of the floating-point value. ### `bitcast_to_uint` `static bitcast_to_uint(value: SIMD[dtype, 1]) -> SIMD[_unsigned_integral_type_of[::DType](), 1]` Bitcasts the floating-point value to an unsigned integer. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point type. **Returns:** An unsigned integer representation of the floating-point value. ### `bitcast_from_integer` `static bitcast_from_integer(value: Int) -> SIMD[dtype, 1]` Bitcasts the floating-point value from an integer. **Args:** * ​value (`Int`): The int value. **Returns:** A floating-point representation of the Int. ### `get_sign` `static get_sign(value: SIMD[dtype, 1]) -> Bool` Returns the sign of the floating point value. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point type. **Returns:** True if the sign is set and False otherwise. ### `set_sign` `static set_sign(value: SIMD[dtype, 1], sign: Bool) -> SIMD[dtype, 1]` Sets the sign of the floating point value. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point value. * ​sign (`Bool`): True to set the sign and False otherwise.
**Returns:** The floating point value with the sign set. ### `get_exponent` `static get_exponent(value: SIMD[dtype, 1]) -> Int` Returns the exponent bits of the floating-point value. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point value. **Returns:** The exponent bits. ### `get_exponent_biased` `static get_exponent_biased(value: SIMD[dtype, 1]) -> Int` Returns the biased exponent of the floating-point value as an Int. This is how the value is stored before subtracting the exponent bias. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point value. **Returns:** The biased exponent as an Int. ### `set_exponent` `static set_exponent(value: SIMD[dtype, 1], exponent: Int) -> SIMD[dtype, 1]` Sets the exponent bits of the floating-point value. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point value. * ​exponent (`Int`): The exponent bits. **Returns:** The floating-point value with the exponent bits set. ### `get_mantissa` `static get_mantissa(value: SIMD[dtype, 1]) -> Int` Gets the mantissa bits of the floating-point value. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point value. **Returns:** The mantissa bits. ### `get_mantissa_uint` `static get_mantissa_uint(value: SIMD[dtype, 1]) -> SIMD[_unsigned_integral_type_of[::DType](), 1]` Gets the mantissa bits of the floating-point value. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point value. **Returns:** The mantissa bits. ### `set_mantissa` `static set_mantissa(value: SIMD[dtype, 1], mantissa: Int) -> SIMD[dtype, 1]` Sets the mantissa bits of the floating-point value. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point value. * ​mantissa (`Int`): The mantissa bits. **Returns:** The floating-point value with the mantissa bits set. ### `pack` `static pack(sign: Bool, exponent: Int, mantissa: Int) -> SIMD[dtype, 1]` Construct a floating-point value from its constituent sign, exponent, and mantissa. **Args:** * ​sign (`Bool`): The sign of the floating-point value. * ​exponent (`Int`): The exponent of the floating-point value. * ​mantissa (`Int`): The mantissa of the floating-point value. **Returns:** The floating-point value. --- ## FlushDenormals `struct FlushDenormals` Denormals are flushed to zero within the context, and the state is restored to the prior value on exit. ## Fields * ​state (`SIMD[int32, 1]`): The current state. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initializes the FlushDenormals. ### `__enter__` `__enter__(self)` Enters the context. This will set denormals to zero. ### `__exit__` `__exit__(self)` Exits the context. This will restore the prior FPState. --- ## get_accum_type `get_accum_type[dtype: DType, *, preferred_accum_type: DType = float32]() -> DType` Returns the recommended dtype for accumulation operations. Half-precision and float8 types can introduce numerical error if they are used in reduction/accumulation operations. This method returns a higher-precision dtype to use for accumulation if a half-precision type is provided, otherwise it returns the original dtype. The rules are as follows: * If the dtype is a float8 type, return a float16 type. * If the dtype is a bfloat16 type, return a float32 type. * If the dtype is a float16 type, return a float32 dtype if the preferred\_accum\_type is float32, otherwise return a float16 type. * Otherwise, return the original dtype.
**Parameters:** * ​dtype (`DType`): The dtype of some accumulation operation. * ​preferred\_accum\_type (`DType`): The preferred dtype for accumulation. **Returns:** The recommended dtype for accumulation operations based on the input dtype and the preferred accumulation type. --- ## numerics Defines utilities to work with numeric types. You can import these APIs from the `utils` package. For example: ```mojo from utils.numerics import FPUtils ``` ## Structs * [​`FlushDenormals`](/mojo/stdlib/utils/numerics/FlushDenormals): Denormals are flushed to zero within the context, and the state is restored to the prior value on exit. * [​`FPUtils`](/mojo/stdlib/utils/numerics/FPUtils): Collection of utility functions for working with FP values. ## Functions * [​`get_accum_type`](/mojo/stdlib/utils/numerics/get_accum_type): Returns the recommended dtype for accumulation operations. * [​`inf`](/mojo/stdlib/utils/numerics/inf): Gets a +inf value for the given dtype. * [​`isfinite`](/mojo/stdlib/utils/numerics/isfinite): Checks if the value is not infinite. * [​`isinf`](/mojo/stdlib/utils/numerics/isinf): Checks if the value is infinite. * [​`isnan`](/mojo/stdlib/utils/numerics/isnan): Checks if the value is Not a Number (NaN). * [​`max_finite`](/mojo/stdlib/utils/numerics/max_finite): Returns the maximum finite value of type. * [​`max_or_inf`](/mojo/stdlib/utils/numerics/max_or_inf): Returns the maximum (potentially infinite) value of type. * [​`min_finite`](/mojo/stdlib/utils/numerics/min_finite): Returns the minimum (lowest) finite value of type. * [​`min_or_neg_inf`](/mojo/stdlib/utils/numerics/min_or_neg_inf): Returns the minimum (potentially negative infinite) value of type. * [​`nan`](/mojo/stdlib/utils/numerics/nan): Gets a NaN value for the given dtype. * [​`neg_inf`](/mojo/stdlib/utils/numerics/neg_inf): Gets a -inf value for the given dtype. * [​`nextafter`](/mojo/stdlib/utils/numerics/nextafter): Computes next representable value of `arg0` in the direction of `arg1`. --- ## inf `inf[dtype: DType]() -> SIMD[dtype, 1]` Gets a +inf value for the given dtype. **Constraints:** Can only be used for FP dtypes. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The +inf value of the given dtype. --- ## isfinite `isfinite[dtype: DType, simd_width: Int](val: SIMD[dtype, simd_width]) -> SIMD[bool, simd_width]` Checks if the value is not infinite. This is always True for non-FP data types. **Parameters:** * ​dtype (`DType`): The value dtype. * ​simd\_width (`Int`): The width of the SIMD vector. **Args:** * ​val (`SIMD[dtype, simd_width]`): The value to check. **Returns:** True if val is finite and False otherwise. --- ## isinf `isinf[dtype: DType, simd_width: Int](val: SIMD[dtype, simd_width]) -> SIMD[bool, simd_width]` Checks if the value is infinite. This is always False for non-FP data types. **Parameters:** * ​dtype (`DType`): The value dtype. * ​simd\_width (`Int`): The width of the SIMD vector. **Args:** * ​val (`SIMD[dtype, simd_width]`): The value to check. **Returns:** True if val is infinite and False otherwise. --- ## isnan `isnan[dtype: DType, simd_width: Int](val: SIMD[dtype, simd_width]) -> SIMD[bool, simd_width]` Checks if the value is Not a Number (NaN). **Parameters:** * ​dtype (`DType`): The value dtype. * ​simd\_width (`Int`): The width of the SIMD vector. **Args:** * ​val (`SIMD[dtype, simd_width]`): The value to check. **Returns:** True if val is NaN and False otherwise.
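A short sketch, not part of the original reference, exercising the special-value helpers above:

```mojo
from utils.numerics import inf, isfinite, isinf, isnan, nan

def main():
    var x = nan[DType.float32]()  # a quiet NaN
    var y = inf[DType.float32]()  # positive infinity
    print(isnan(x))     # True
    print(isinf(y))     # True
    print(isfinite(y))  # False
```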
--- ## max_finite `max_finite[dtype: DType]() -> SIMD[dtype, 1]` Returns the maximum finite value of type. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The maximum representable value of the type. Does not include infinity for floating-point types. --- ## max_or_inf `max_or_inf[dtype: DType]() -> SIMD[dtype, 1]` Returns the maximum (potentially infinite) value of type. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The maximum representable value of the type. Can include infinity for floating-point types. --- ## min_finite `min_finite[dtype: DType]() -> SIMD[dtype, 1]` Returns the minimum (lowest) finite value of type. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The minimum representable value of the type. Does not include negative infinity for floating-point types. --- ## min_or_neg_inf `min_or_neg_inf[dtype: DType]() -> SIMD[dtype, 1]` Returns the minimum (potentially negative infinite) value of type. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The minimum representable value of the type. Can include negative infinity for floating-point types. --- ## nan `nan[dtype: DType]() -> SIMD[dtype, 1]` Gets a NaN value for the given dtype. **Constraints:** Can only be used for FP dtypes. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The NaN value of the given dtype. --- ## neg_inf `neg_inf[dtype: DType]() -> SIMD[dtype, 1]` Gets a -inf value for the given dtype. **Constraints:** Can only be used for FP dtypes. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The -inf value of the given dtype. --- ## nextafter `nextafter[dtype: DType, simd_width: Int](arg0: SIMD[dtype, simd_width], arg1: SIMD[dtype, simd_width]) -> SIMD[dtype, simd_width]` Computes next representable value of `arg0` in the direction of `arg1`. **Constraints:** The element dtype of the input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​simd\_width (`Int`): The width of the input and output SIMD vector. **Args:** * ​arg0 (`SIMD[dtype, simd_width]`): The first input argument. * ​arg1 (`SIMD[dtype, simd_width]`): The second input argument. **Returns:** The `nextafter` of the inputs. --- ## StaticTuple `@register_passable(trivial)` `struct StaticTuple[element_type: AnyTrivialRegType, size: Int]` A statically sized tuple type which contains elements of homogeneous types. ## Parameters * ​element\_type (`AnyTrivialRegType`): The type of the elements in the tuple. * ​size (`Int`): The size of the tuple. ## Fields * ​array (`array<size, element_type>`): The underlying storage for the static tuple. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `Sized`, `UnknownDestructibility` ## Aliases ### `type` `alias type = array<size, element_type>` ## Methods ### `__init__` `__init__() -> Self` Constructs an empty (undefined) tuple. `@implicit` `__init__(array: array<size, element_type>) -> Self` Constructs from an array type. **Args:** * ​array (`array<size, element_type>`): Underlying MLIR array type. `@implicit` `__init__(*elems: element_type) -> Self` Constructs a static tuple given a set of arguments. **Args:** * ​\*elems (`element_type`): The element types. `@implicit` `__init__(values: VariadicList[element_type]) -> Self` Creates a tuple constant using the specified values. **Args:** * ​values (`VariadicList[element_type]`): The list of values. `__init__(*, other: Self) -> Self` Explicitly copy the provided StaticTuple.
**Args:** * ​other (`Self`): The StaticTuple to copy. ### `__getitem__` `__getitem__[index: Int](self) -> element_type` Returns the value of the tuple at the given index. **Parameters:** * ​index (`Int`): The index into the tuple. **Returns:** The value at the specified position. `__getitem__[I: Indexer, //](self, idx: I) -> element_type` Returns the value of the tuple at the given dynamic index. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The index into the tuple. **Returns:** The value at the specified position. ### `__setitem__` `__setitem__[I: Indexer, //](mut self, idx: I, val: element_type)` Stores a single value into the tuple at the specified dynamic index. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The index into the tuple. * ​val (`element_type`): The value to store. `__setitem__[idx: Int](mut self, val: element_type)` Stores a single value into the tuple at the specified index. **Parameters:** * ​idx (`Int`): The index into the tuple. **Args:** * ​val (`element_type`): The value to store. ### `__len__` `__len__(self) -> Int` Returns the length of the tuple. This is a known constant value. **Returns:** The size of the tuple. --- ## static_tuple Implements StaticTuple, a statically-sized uniform container. You can import these APIs from the `utils` package. For example: ```mojo from utils import StaticTuple ``` ## Structs * [​`StaticTuple`](/mojo/stdlib/utils/static_tuple/StaticTuple): A statically sized tuple type which contains elements of homogeneous types. --- ## Variant `struct Variant[*Ts: Copyable & Movable]` A runtime-variant type. Data for this type is stored internally. Currently, its size is the largest size of any of its variants plus a 16-bit discriminant. You can: * use `isa[T]()` to check what type a variant is. * use `unsafe_take[T]()` to take a value from the variant. * use `[T]` to get a value out of a variant. This currently does an extra copy/move until we have origins, and it temporarily requires the value to be mutable. * use `set[T](owned new_value: T)` to reset the variant to a new value. * use `is_type_supported[T]()` to check if the variant permits the type `T`. Example: ```mojo from utils import Variant alias IntOrString = Variant[Int, String] fn to_string(mut x: IntOrString) -> String: if x.isa[String](): return x[String] # x.isa[Int]() return String(x[Int]) # They have to be mutable for now, and implement Copyable & Movable var an_int = IntOrString(4) var a_string = IntOrString(String("I'm a string!")) var who_knows = IntOrString(0) import random if random.random_ui64(0, 1): who_knows.set[String]("I'm actually a string too!") print(to_string(an_int)) print(to_string(a_string)) print(to_string(who_knows)) ``` ## Parameters * ​\*Ts (`Copyable & Movable`): The elements of the variadic. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, *, unsafe_uninitialized: Tuple[])` Unsafely create an uninitialized Variant. **Args:** * ​unsafe\_uninitialized (`Tuple[]`): Marker argument indicating this initializer is unsafe. `@implicit` `__init__[T: Copyable & Movable](out self, owned value: T)` Create a variant with one of the types. **Parameters:** * ​T (`Copyable & Movable`): The type to initialize the variant to. Generally this should be able to be inferred from the call type, eg. `Variant[Int, String](4)`.
**Args:** * ​value (`T`): The value to initialize the variant with. ### `__copyinit__` `__copyinit__(out self, other: Self)` Creates a deep copy of an existing variant. **Args:** * ​other (`Self`): The variant to copy from. ### `__moveinit__` `__moveinit__(out self, owned other: Self)` Move initializer for the variant. **Args:** * ​other (`Self`): The variant to move. ### `__del__` `__del__(owned self)` Destroy the variant. ### `__getitem__` `__getitem__[T: Copyable & Movable](ref self) -> ref [self] T` Get the value out of the variant as a type-checked type. This explicitly checks that your value is of that type! If you haven't verified the type correctness at runtime, the program will abort! For now this has the limitation that it requires the variant value to be mutable. **Parameters:** * ​T (`Copyable & Movable`): The type of the value to get out. **Returns:** A reference to the internal data. ### `copy` `copy(self, out copy: Self)` Explicitly creates a deep copy of an existing variant. **Returns:** A copy of the value. ### `take` `take[T: Copyable & Movable](mut self) -> T` Take the current value of the variant with the provided type. The caller takes ownership of the underlying value. This explicitly checks that your value is of that type! If you haven't verified the type correctness at runtime, the program will abort! **Parameters:** * ​T (`Copyable & Movable`): The type to take out. **Returns:** The underlying data to be taken out as an owned value. ### `unsafe_take` `unsafe_take[T: Copyable & Movable](mut self) -> T` Unsafely take the current value of the variant with the provided type. The caller takes ownership of the underlying value. This doesn't explicitly check that your value is of that type! If you haven't verified the type correctness at runtime, you'll get a type that *looks* like your type, but has potentially unsafe and garbage member data. **Parameters:** * ​T (`Copyable & Movable`): The type to take out. **Returns:** The underlying data to be taken out as an owned value. ### `replace` `replace[Tin: Copyable & Movable, Tout: Copyable & Movable](mut self, owned value: Tin) -> Tout` Replace the current value of the variant with the provided type. The caller takes ownership of the underlying value. This explicitly checks that your value is of that type! If you haven't verified the type correctness at runtime, the program will abort! **Parameters:** * ​Tin (`Copyable & Movable`): The type to put in. * ​Tout (`Copyable & Movable`): The type to take out. **Args:** * ​value (`Tin`): The value to put in. **Returns:** The underlying data to be taken out as an owned value. ### `unsafe_replace` `unsafe_replace[Tin: Copyable & Movable, Tout: Copyable & Movable](mut self, owned value: Tin) -> Tout` Unsafely replace the current value of the variant with the provided type. The caller takes ownership of the underlying value. This doesn't explicitly check that your value is of that type! If you haven't verified the type correctness at runtime, you'll get a type that *looks* like your type, but has potentially unsafe and garbage member data. **Parameters:** * ​Tin (`Copyable & Movable`): The type to put in. * ​Tout (`Copyable & Movable`): The type to take out. **Args:** * ​value (`Tin`): The value to put in. **Returns:** The underlying data to be taken out as an owned value. ### `set` `set[T: Copyable & Movable](mut self, owned value: T)` Set the variant value. This will call the destructor on the old value, and update the variant's internal type and data to the new value.
**Parameters:** * ​T (`Copyable & Movable`): The new variant type. Must be one of the Variant's type arguments. **Args:** * ​value (`T`): The new value to set the variant to. ### `isa` `isa[T: Copyable & Movable](self) -> Bool` Check if the variant contains the required type. **Parameters:** * ​T (`Copyable & Movable`): The type to check. **Returns:** True if the variant contains the requested type. ### `unsafe_get` `unsafe_get[T: Copyable & Movable](ref self) -> ref [self] T` Get the value out of the variant as a type-checked type. This doesn't explicitly check that your value is of that type! If you haven't verified the type correctness at runtime, you'll get a type that *looks* like your type, but has potentially unsafe and garbage member data. For now this has the limitation that it requires the variant value to be mutable. **Parameters:** * ​T (`Copyable & Movable`): The type of the value to get out. **Returns:** A reference to the internal data. ### `is_type_supported` `static is_type_supported[T: Copyable & Movable]() -> Bool` Check if a type can be used by the `Variant`. Example: ```mojo from utils import Variant def takes_variant(mut arg: Variant): if arg.is_type_supported[Float64](): arg = Float64(1.5) def main(): var x = Variant[Int, Float64](1) takes_variant(x) if x.isa[Float64](): print(x[Float64]) # 1.5 ``` For example, the `Variant[Int, Bool]` permits `Int` and `Bool`. **Parameters:** * ​T (`Copyable & Movable`): The type of the value to check support for. **Returns:** `True` if type `T` is supported by the `Variant`. --- ## variant Defines a Variant type. You can use this type to implement variant/sum types. For example: ```mojo from utils import Variant alias IntOrString = Variant[Int, String] fn to_string(mut x: IntOrString) -> String: if x.isa[String](): return x[String] # x.isa[Int]() return String(x[Int]) # They have to be mutable for now, and implement Copyable & Movable var an_int = IntOrString(4) var a_string = IntOrString(String("I'm a string!")) var who_knows = IntOrString(0) import random if random.random_ui64(0, 1): who_knows.set[String]("I'm actually a string too!") print(to_string(an_int)) print(to_string(a_string)) print(to_string(who_knows)) ``` ## Structs * [​`Variant`](/mojo/stdlib/utils/variant/Variant): A runtime-variant type. --- ## Writable The `Writable` trait describes how a type is written into a `Writer`. You must implement `write_to` which takes `self` and a type conforming to `Writer`: ```mojo struct Point(Writable): var x: Float64 var y: Float64 fn write_to[W: Writer](self, mut writer: W): var string = "Point" # Write a single `Span[Byte]`: writer.write_bytes(string.as_bytes()) # Pass multiple args that can be converted to a `Span[Byte]`: writer.write("(", self.x, ", ", self.y, ")") ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `write_to` `write_to[W: Writer](self: _Self, mut writer: W)` Formats the string representation of this type to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to.
There is also a general `write` that takes multiple args that implement `write_to`. Example: ```mojo from memory import Span @fieldwise_init struct NewString(Writer, Writable, Copyable, Movable): var s: String # Writer requirement to write a Span of Bytes fn write_bytes(mut self, bytes: Span[Byte, _]): self.s._iadd(bytes) # Writer requirement to take multiple args fn write[*Ts: Writable](mut self, *args: *Ts): @parameter for i in range(args.__len__()): args[i].write_to(self) # Also make it Writable to allow `print` to write the inner String fn write_to[W: Writer](self, mut writer: W): writer.write(self.s) @fieldwise_init struct Point(Writable, Copyable, Movable): var x: Int var y: Int # Pass multiple args to the Writer. The Int and StaticString types # call `writer.write_bytes` in their own `write_to` implementations. fn write_to[W: Writer](self, mut writer: W): writer.write("Point(", self.x, ", ", self.y, ")") # Enable conversion to a String using `String(point)` fn __str__(self) -> String: return String.write(self) fn main(): var point = Point(1, 2) var new_string = NewString(String(point)) new_string.write("\n", Point(3, 4)) print(new_string) ``` Output: ```plaintext Point(1, 2) Point(3, 4) ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `write_bytes` `write_bytes(mut self: _Self, bytes: Span[SIMD[uint8, 1], origin])` Write a `Span[Byte]` to this `Writer`. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The string slice to write to this Writer. Must NOT be null-terminated. ### `write` `write[*Ts: Writable](mut self: _Self, *args: *Ts)` Write a sequence of Writable arguments to the provided Writer. **Parameters:** * ​\*Ts (`Writable`): Types of the provided argument sequence. **Args:** * ​\*args (`*Ts`): Sequence of arguments to write to this Writer. --- ## write Establishes the contract between `Writer` and `Writable` types. ## Aliases ### `HEAP_BUFFER_BYTES` `alias HEAP_BUFFER_BYTES = env_get_int[::StringSlice[::Bool()` How much memory to pre-allocate for the heap buffer, will abort if exceeded. ### `STACK_BUFFER_BYTES` `alias STACK_BUFFER_BYTES = env_get_int[::StringSlice[::Bool()` The size of the stack buffer for IO operations from CPU. ## Traits * [​`Writable`](/mojo/stdlib/utils/write/Writable): The `Writable` trait describes how a type is written into a `Writer`. * [​`Writer`](/mojo/stdlib/utils/write/Writer): Describes a type that can be written to by any type that implements the `write_to` function.