# Mojo API Documentation > The Mojo API reference. This file contains all documentation content in a single document following the llmtxt.org standard. ## max The MAX Mojo API reference. The MAX API provides a state-of-the-art graph compiler and runtime library that executes AI models with incredible speed on a wide range of hardware. ## Packages * [​`tensor`](/max/api/mojo/tensor/): APIs to create and manage tensors in a graph. --- ## tensor APIs to create and manage tensors in a graph. ## Modules * [​`io_spec`](/max/api/mojo/tensor/io_spec/): * [​`managed_tensor_slice`](/max/api/mojo/tensor/managed_tensor_slice/): Implements the `ManagedTensorSlice` type - a view of a tensor that doesn't own the underlying data. This type is used to build custom graph operations. * [​`tensor_spec`](/max/api/mojo/tensor/tensor_spec/): You can import these APIs from the `max.tensor` package. * [​`transitional`](/max/api/mojo/tensor/transitional/): Utilities for the transitional period during NDBuffer deprecation. --- ## IO `@register_passable(trivial)` `struct IO` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `FusedInput` `alias FusedInput = IO(2)` ### `FusedOutput` `alias FusedOutput = IO(3)` ### `Input` `alias Input = IO(1)` ### `Output` `alias Output = IO(0)` ### `Unknown` `alias Unknown = IO(-1)` ## Methods ### `__init__` `__init__(value: Int) -> Self` ### `__eq__` `__eq__(self, other: Self) -> Bool` --- ## IOSpec `@register_passable(trivial)` `struct IOSpec[mut: Bool, input: IO]` Parameter used to encode whether a particular tensor argument to a DPS kernel is an output, input, or mutable input. ```mojo Input == IOSpec[False, IO.Input]() Output == IOSpec[True, IO.Output]() MutableInput == IOSpec[True, IO.Input]() FusedInput == IOSpec[False, IO.FusedInput]() FusedOutput == IOSpec[True, IO.FusedOutput]() ``` ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` --- ## io_spec ## Aliases ### `FusedInput` `alias FusedInput = IOSpec()` ### `FusedOutput` `alias FusedOutput = IOSpec()` ### `Input` `alias Input = IOSpec()` ### `IOUnknown` `alias IOUnknown = IOSpec()` ### `MutableInput` `alias MutableInput = IOSpec()` ### `Output` `alias Output = IOSpec()` ## Structs * [​`IO`](/max/api/mojo/tensor/io_spec/IO): * [​`IOSpec`](/max/api/mojo/tensor/io_spec/IOSpec): Parameter used to encode whether a particular tensor argument to a DPS kernel is an output, input, or mutable input. --- ## DynamicTensor `struct DynamicTensor[dtype: DType, rank: Int]` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `Type` `alias Type = ManagedTensorSlice[IOSpec(), static_spec=create_unknown()]` --- ## ManagedTensorSlice `@register_passable(trivial)` `struct ManagedTensorSlice[mut: Bool, input: IO, dtype: DType, rank: Int, //, io_spec: IOSpec[mut, input], *, static_spec: StaticTensorSpec[dtype, rank]]` A view of a tensor that does not own the underlying allocated pointer. When the object lifetime ends it does not free the underlying pointer. Conversely, if a `ManagedTensorSlice` is created, it will not extend the life of the underlying pointer. Therefore, the user must take care to keep the pointer alive until the last use of a `ManagedTensorSlice` instance. This type is useful for writing custom operations where memory is managed by an external runtime, such as MAX's inference stack.
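For example, here is a minimal sketch of that lifetime caveat, using the `DynamicTensor` convenience alias. The import paths are assumptions, and the snippet is illustrative rather than canonical:

```mojo
from max.tensor import DynamicTensor  # assumed import path
from memory import UnsafePointer
from utils.index import IndexList

fn main():
    # The caller owns the buffer; the slice is only a view of it.
    var data = UnsafePointer[Float32].alloc(6)
    for i in range(6):
        data[i] = Float32(i)
    var view = DynamicTensor[DType.float32, 2].Type(data, IndexList[2](2, 3))
    print(view[1, 2])  # 5.0: reads through to `data`
    # Dropping `view` neither frees `data` nor extends its lifetime;
    # `view` must not be used after the buffer is freed.
    data.free()
```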
## Implemented traits `AnyType`, `Copyable`, `DevicePassable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `address_space` `alias address_space = static_spec.address_space` ### `alignment` `alias alignment = static_spec.alignment` ### `device_type` `alias device_type = LayoutTensor[dtype, static_spec.to_layout(), MutableAnyOrigin]` ### `exclusive` `alias exclusive = static_spec.exclusive` ## Methods ### `__init__` `__init__(ptr: UnsafePointer[SIMD[dtype, 1]], slices: InlineArray[Slice, rank], slicer_spec: RuntimeTensorSpec[dtype, rank]) -> Self` Initializes a ManagedTensorSlice from a pointer, an array of slices, and a tensor spec. In general, custom operations should not create `ManagedTensorSlice` instances, but instead use the ones provided by the MAX inference engine. `__init__(ptr: UnsafePointer[SIMD[dtype, 1]], shape: IndexList[rank]) -> Self` Initializes a ManagedTensorSlice from a pointer and shape. In general, custom operations should not create `ManagedTensorSlice` instances, but instead use the ones provided by the MAX inference engine. `__init__(ptr: UnsafePointer[SIMD[dtype, 1]], shape: IndexList[rank], strides: IndexList[rank]) -> Self` Initializes a ManagedTensorSlice from a pointer, shape, and strides. In general, custom operations should not create `ManagedTensorSlice` instances, but instead use the ones provided by the MAX inference engine. ### `__getitem__` `__getitem__(self, indices: IndexList[rank]) -> SIMD[dtype, 1]` Gets the value at the specified indices. **Args:** * ​indices (`IndexList[rank]`): The indices of the value to retrieve. **Returns:** The value at the specified indices. `__getitem__(self, *indices: Int) -> SIMD[dtype, 1]` Gets the value at the specified indices. **Args:** * ​\*indices (`Int`): The indices of the value to retrieve. **Returns:** The value at the specified indices. ### `__setitem__` `__setitem__(self, *indices: Int, *, val: SIMD[dtype, 1])` Stores the value at the specified indices. **Args:** * ​\*indices (`Int`): The indices of the value to store. * ​val (`SIMD[dtype, 1]`): The value to store. `__setitem__(self, indices: IndexList[rank], val: SIMD[dtype, 1])` Stores the value at the specified indices. **Args:** * ​indices (`IndexList[rank]`): The indices of the value to store. * ​val (`SIMD[dtype, 1]`): The value to store. ### `get_type_name` `static get_type_name() -> String` ### `get_device_type_name` `static get_device_type_name() -> String` ### `spec` `spec(self) -> RuntimeTensorSpec[dtype, rank]` Gets the `RuntimeTensorSpec` of this tensor slice, which provides metadata about the tensor slice. **Returns:** The `RuntimeTensorSpec` for this tensor slice. ### `shape` `shape(self) -> IndexList[rank]` Gets the shape of this tensor slice, as an `IndexList`. **Returns:** The shape of this tensor slice. ### `dim_size` `dim_size(self, index: Int) -> Int` Gets the size of a given dimension of this tensor slice using a runtime value. **Args:** * ​index (`Int`): The zero-based index of the dimension. **Returns:** The size of the tensor slice in the given dimension. `dim_size[index: Int](self) -> Int` Gets the size of a given dimension of this tensor slice using a compile-time value. **Parameters:** * ​index (`Int`): The zero-based index of the dimension. **Returns:** The size of the tensor slice in the given dimension. ### `strides` `strides(self) -> IndexList[rank]` Gets the strides of this tensor slice, as an `IndexList`. **Returns:** The strides of this tensor slice.
### `stride_length` `stride_length(self, index: Int) -> Int` Gets the length of the stride of a given dimension of this tensor slice using a runtime value. **Args:** * ​index (`Int`): The zero-based index of the dimension. **Returns:** The stride of the tensor slice in the given dimension. `stride_length[index: Int](self) -> Int` Gets the length of the stride of a given dimension of this tensor slice using a compile-time value. **Parameters:** * ​index (`Int`): The zero-based index of the dimension. **Returns:** The stride of the tensor slice in the given dimension. ### `size` `size(self) -> Int` Computes the tensor slice's number of elements. **Returns:** The total number of elements in the tensor slice. ### `unsafe_ptr` `unsafe_ptr[__type: DType = dtype](self) -> UnsafePointer[SIMD[__type, 1]]` Get the pointer stored in this tensor slice. Because this method exposes the pointer stored in this tensor slice, callers can break the invariants of the tensor slice and cause unexpected behavior. Use it with caution. **Parameters:** * ​\_\_type (`DType`): The type of the `UnsafePointer` in this tensor slice. **Returns:** The `UnsafePointer` which contains the data for this tensor slice. ### `load` `load[width: Int, _rank: Int](self, index: IndexList[_rank]) -> SIMD[dtype, width]` Gets data from this tensor slice as a `SIMD`. **Parameters:** * ​width (`Int`): The width of the `SIMD` value. This must be large enough to contain the data from this tensor slice. * ​\_rank (`Int`): The rank of the tensor slice. **Args:** * ​index (`IndexList[_rank]`): An `IndexList` of size `_rank` to indicate the dimension of the tensor slice to obtain data from. **Returns:** Data from this tensor slice at dimension `index`. ### `store` `store[width: Int, _rank: Int, element_alignment: Int = 1](self: ManagedTensorSlice[io_spec, static_spec=static_spec], index: IndexList[_rank], val: SIMD[dtype, width])` Sets data in this tensor slice from a `SIMD`. **Parameters:** * ​width (`Int`): The width of the `SIMD` value. * ​\_rank (`Int`): The rank of the tensor slice. * ​element\_alignment (`Int`): Indicates the alignment of the pointer stored to memory. This is needed to issue vector stores for GPUs with strict alignment requirements. **Args:** * ​index (`IndexList[_rank]`): An `IndexList` of size `_rank` to indicate the dimension of the tensor slice to set data in. * ​val (`SIMD[dtype, width]`): The data to set into this tensor slice. ### `with_layout` `with_layout[new_rank: Int, //, new_static_shape: DimList, new_static_strides: DimList](self, new_runtime_shape: IndexList[new_rank], new_runtime_strides: IndexList[new_rank], offset_ptr: OptionalReg[UnsafePointer[SIMD[dtype, 1]]] = OptionalReg[UnsafePointer[SIMD[dtype, 1]]]({:i1 0, 1})) -> ManagedTensorSlice[io_spec, static_spec=static_spec.with_layout[::Int](new_static_shape, new_static_strides)]` ### `to_layout_tensor` `to_layout_tensor(self) -> LayoutTensor[dtype, static_spec.to_layout(), MutableAnyOrigin]` ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this buffer to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__repr__` `__repr__(self) -> String` Gets the buffer as a string. **Returns:** A compact string representation of the buffer. ### `__str__` `__str__(self) -> String` Gets the buffer as a string. **Returns:** A compact string representation of the buffer.
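As a concrete illustration of the accessors above, here is a hedged sketch of element access inside a custom op's `execute` body. The registration boilerplate is omitted, and `x` and `out` are assumed to be rank-2 input and output slices of the same dtype and shape, as provided by the MAX inference engine:

```mojo
# Sketch only: `x` (input) and `out` (output) are assumed rank-2 slices
# with identical dtype and shape; `IndexList` comes from `utils.index`.

# Scalar path: copy one element at a time via __getitem__/__setitem__.
for i in range(x.dim_size(0)):
    for j in range(x.dim_size[1]()):  # compile-time-index overload
        out[i, j] = x[i, j]

# Vector path: move four contiguous elements of row 0 with load/store.
var v = x.load[4](IndexList[2](0, 0))  # SIMD[dtype, 4]
out.store[4](IndexList[2](0, 0), v)
```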
--- ## VariadicTensors `@register_passable(trivial)` `struct VariadicTensors[mut: Bool, input: IO, //, dtype: DType, rank: Int, size: Int, io_spec: IOSpec[mut, input], *, static_specs: StaticTuple[StaticTensorSpec[dtype, rank], size]]` A tuple-like container of tensors representing variadic arguments from the graph compiler. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__getitem__` `__getitem__[index: Int](self) -> ManagedTensorSlice[io_spec, static_spec=static_specs.__getitem__[::Indexer](index)]` Returns the tensor at the given position in the variadic argument pack. **Parameters:** * ​index (`Int`): The index into the variadic tensor arguments. **Returns:** The tensor at the specified index. ### `__len__` `__len__(self) -> Int` Returns the number of variadic arguments in the pack. **Returns:** The number of variadic arguments. --- ## foreach `foreach[dtype: DType, rank: Int, //, func: fn[Int](IndexList[rank]) capturing -> SIMD[dtype, $0], *, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), simd_width: Int = get_kernel_simd_width[::DType,::StringSlice[::Bool(), _synchronous: Bool = False, _trace_name: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("mogg.for_each")](tensor: ManagedTensorSlice[io_spec, static_spec=static_spec], ctx: DeviceContextPtr = DeviceContextPtr())` Apply the function `func` to each element of the tensor slice. **Parameters:** * ​dtype (`DType`): The data type of the elements in the tensor slice. * ​rank (`Int`): The rank of the tensor slice. * ​func (`fn[Int](IndexList[rank]) capturing -> SIMD[dtype, $0]`): The function to apply to each element of the tensor slice. * ​target (`StringSlice[StaticConstantOrigin]`): Indicates the type of the target device (e.g. "cpu", "gpu"). * ​simd\_width (`Int`): The SIMD width for the target (usually leave this as its default value). * ​\_synchronous (`Bool`): True to run the custom op synchronously in the runtime (defaults to False). * ​\_trace\_name (`StringSlice[StaticConstantOrigin]`): Name of the executed operation displayed in the trace\_description. **Args:** * ​tensor (`ManagedTensorSlice[io_spec, static_spec=static_spec]`): The output tensor slice which receives the return values from `func`. * ​ctx (`DeviceContextPtr`): The call context (forward this from the custom operation). `foreach[: origin.set, dtype: DType, rank: Int, //, func: fn[Int](IndexList[rank]) capturing -> SIMD[dtype, $0], out_func: fn[Int](IndexList[rank]) capturing -> None, *, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), simd_width: Int = get_kernel_simd_width[::DType,::StringSlice[::Bool(), _synchronous: Bool = False, _trace_name: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("mogg.for_each")](tensor: ManagedTensorSlice[io_spec, static_spec=static_spec], ctx: DeviceContextPtr = DeviceContextPtr())` Apply the function `func` to each element of the tensor slice. **Parameters:** * ​dtype (`DType`): The data type of the elements in the tensor slice. * ​rank (`Int`): The rank of the tensor slice. * ​func (`fn[Int](IndexList[rank]) capturing -> SIMD[dtype, $0]`): The function to apply to each element of the tensor slice. * ​out\_func (`fn[Int](IndexList[rank]) capturing -> None`): The function to apply on each output element. * ​target (`StringSlice[StaticConstantOrigin]`): Indicates the type of the target device (e.g. "cpu", "gpu").
* ​simd\_width (`Int`): The SIMD width for the target (usually leave this as its default value). * ​\_synchronous (`Bool`): True to run the custom op synchronously in the runtime (defaults to False). * ​\_trace\_name (`StringSlice[StaticConstantOrigin]`): Name of the executed operation displayed in the trace\_description. **Args:** * ​tensor (`ManagedTensorSlice[io_spec, static_spec=static_spec]`): The input tensor slice whose values are consumed. * ​ctx (`DeviceContextPtr`): The call context (forward this from the custom operation). --- ## managed_tensor_slice Implements the `ManagedTensorSlice` type - a view of a tensor that doesn't own the underlying data. This type is used to build custom graph operations. ## Aliases ### `InputTensor` `alias InputTensor = ManagedTensorSlice[IOSpec(), static_spec=?]` ### `InputVariadicTensors` `alias InputVariadicTensors = VariadicTensors[?, ?, ?, IOSpec(), static_specs=?]` ### `OutputTensor` `alias OutputTensor = ManagedTensorSlice[IOSpec(), static_spec=?]` ### `OutputVariadicTensors` `alias OutputVariadicTensors = VariadicTensors[?, ?, ?, IOSpec(), static_specs=?]` ## Structs * [​`DynamicTensor`](/max/api/mojo/tensor/managed_tensor_slice/DynamicTensor): * [​`ManagedTensorSlice`](/max/api/mojo/tensor/managed_tensor_slice/ManagedTensorSlice): A view of a tensor that does not own the underlying allocated pointer. When the object lifetime ends it does not free the underlying pointer. Conversely, if a `ManagedTensorSlice` is created, it will not extend the life of the underlying pointer. * [​`VariadicTensors`](/max/api/mojo/tensor/managed_tensor_slice/VariadicTensors): A tuple-like container of tensors representing variadic arguments from the graph compiler. ## Functions * [​`foreach`](/max/api/mojo/tensor/managed_tensor_slice/foreach): Apply the function `func` to each element of the tensor slice. * [​`rebuild_mix_precision_static_tensor_specs_with_input_lambda`](/max/api/mojo/tensor/managed_tensor_slice/rebuild_mix_precision_static_tensor_specs_with_input_lambda): * [​`rebuild_mix_precision_static_tensor_specs_with_output_lambda`](/max/api/mojo/tensor/managed_tensor_slice/rebuild_mix_precision_static_tensor_specs_with_output_lambda): * [​`rebuild_static_tensor_specs_with_input_lambda`](/max/api/mojo/tensor/managed_tensor_slice/rebuild_static_tensor_specs_with_input_lambda): * [​`rebuild_static_tensor_specs_with_output_lambda`](/max/api/mojo/tensor/managed_tensor_slice/rebuild_static_tensor_specs_with_output_lambda): * [​`trace_slice_arg`](/max/api/mojo/tensor/managed_tensor_slice/trace_slice_arg): Helper to stringify the type and shape of a kernel argument for tracing.
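Putting this module together, the following is a hedged sketch of a complete elementwise custom op built on `foreach` and the `InputTensor`/`OutputTensor` aliases. It is modeled on the MAX custom-operation examples; the `@compiler.register` decorator and import paths are assumptions and may differ across MAX versions:

```mojo
import compiler
from max.tensor import InputTensor, OutputTensor, foreach  # assumed import path
from runtime.asyncrt import DeviceContextPtr
from utils.index import IndexList

@compiler.register("add_one")
struct AddOne:
    @staticmethod
    fn execute[target: StaticString](
        output: OutputTensor,
        x: InputTensor[dtype = output.dtype, rank = output.rank],
        ctx: DeviceContextPtr,
    ) raises:
        @parameter
        @always_inline
        fn add_one[width: Int](idx: IndexList[x.rank]) -> SIMD[x.dtype, width]:
            # Elementwise body: loads a SIMD group from `x` and adds one.
            return x.load[width](idx) + 1

        # `foreach` writes the result of `add_one` into every element of
        # `output`, dispatching to the device named by `target`.
        foreach[add_one, target=target](output, ctx)
```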
--- ## rebuild_mix_precision_static_tensor_specs_with_input_lambda `rebuild_mix_precision_static_tensor_specs_with_input_lambda[func_type: AnyTrivialRegType, //, src_type: DType, dst_type: DType, rank: Int](spec: StaticTensorSpec[src_type, rank], in_lambda: func_type) -> StaticTensorSpec[dst_type, rank]` --- ## rebuild_mix_precision_static_tensor_specs_with_output_lambda `rebuild_mix_precision_static_tensor_specs_with_output_lambda[func_type: AnyTrivialRegType, //, src_rank: Int, src_shape: DimList, src_type: DType](spec: StaticTensorSpec[dtype, rank], out_lambda: func_type) -> StaticTensorSpec[src_type, src_rank]` --- ## rebuild_static_tensor_specs_with_input_lambda `rebuild_static_tensor_specs_with_input_lambda[func_type: AnyTrivialRegType, //, dtype: DType, rank: Int](spec: StaticTensorSpec[dtype, rank], in_lambda: func_type) -> StaticTensorSpec[dtype, rank]` --- ## rebuild_static_tensor_specs_with_output_lambda `rebuild_static_tensor_specs_with_output_lambda[func_type: AnyTrivialRegType, //, dtype: DType, rank: Int](spec: StaticTensorSpec[dtype, rank], out_lambda: func_type) -> StaticTensorSpec[dtype, rank]` --- ## trace_slice_arg `trace_slice_arg(name: String, buf: ManagedTensorSlice[io_spec, static_spec=static_spec]) -> String` Helper to stringify the type and shape of a kernel argument for tracing. **Args:** * ​name (`String`): The name of the argument. * ​buf (`ManagedTensorSlice[io_spec, static_spec=static_spec]`): The tensor slice to trace. **Returns:** A string representation of the buffer with its shape and data type. --- ## RuntimeTensorSpec `@register_passable(trivial)` `struct RuntimeTensorSpec[type: DType, rank: Int]` ## Fields * ​shape (`IndexList[rank]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__getitem__` `__getitem__(self, idx: Int) -> Int` ### `bytecount` `bytecount(self) -> Int` Gets the total byte count. **Returns:** The total byte count. --- ## tensor_spec You can import these APIs from the `max.tensor` package. For example: ```mojo from max.tensor import RuntimeTensorSpec ``` ## Structs * [​`RuntimeTensorSpec`](/max/api/mojo/tensor/tensor_spec/RuntimeTensorSpec): --- ## transitional Utilities for the transitional period during NDBuffer deprecation. ## Functions * [​`managed_tensor_slice_to_ndbuffer`](/max/api/mojo/tensor/transitional/managed_tensor_slice_to_ndbuffer): --- ## managed_tensor_slice_to_ndbuffer `managed_tensor_slice_to_ndbuffer[: DType, : Int, spec: StaticTensorSpec[$0, $1], //](tensor: ManagedTensorSlice[io_spec, static_spec=spec]) -> NDBuffer[dtype, rank, MutableAnyOrigin, spec.shape, spec.strides, alignment=spec.alignment, address_space=spec.address_space, exclusive=spec.exclusive]` --- ## kv_cache Contains implementations for several types of key-value caches. [KV caches](/glossary/ai/kv-cache) are used in transformer models to store key-value tensors output from self-attention layers. These APIs are used in the higher-level functions in the [`nn`](/mojo/kernels/nn) package. ## Modules * [​`types`](./types/): This module contains the types for the key-value cache APIs. --- ## ContinuousBatchingKVCache `@register_passable(trivial)` `struct ContinuousBatchingKVCache[type_: DType, kv_params_: KVCacheStaticParams]` Wrapper for the ContinuousKVCache of a given layer in the transformer model. This abstracts the Pointer indirection for accessing the ContinuousKVCache for a given batch entry. This is the type that is passed to the KV projection and flash attention kernels.
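For orientation before the field and method listings, here is a hedged sketch of how a kernel might read a single cached key element through the documented accessors (the helper name and import path are hypothetical):

```mojo
from kv_cache.types import ContinuousBatchingKVCache, KVCacheStaticParams  # assumed import path

# Hypothetical helper: read one key element for batch entry `b`,
# head `h`, token `t`, head-dim position 0.
fn read_key[
    dtype: DType, kv_params: KVCacheStaticParams
](k_cache: ContinuousBatchingKVCache[dtype, kv_params], b: Int, h: Int, t: Int) -> Scalar[dtype]:
    # Only tokens below the cache length for this batch entry are valid.
    if t < k_cache.cache_length(b):
        return k_cache.load[width=1](b, h, t, 0)
    return 0
```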
## Fields * ​blocks (`NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]`): * ​cache\_lengths (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​lookup\_table (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​max\_seq\_length (`SIMD[uint32, 1]`): * ​max\_cache\_length (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `KVCacheT`, `Movable`, `UnknownDestructibility` ## Aliases ### `blocks_shape` `alias blocks_shape = __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size))` ### `blocks_stride` `alias blocks_stride = _strides_from_shape[::DimList,::Int]()` ### `blocks_type` `alias blocks_type = NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]` ### `kv_params` `alias kv_params = kv_params_` ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(blocks: NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()], cache_lengths: NDBuffer[uint32, 1, MutableAnyOrigin], lookup_table: NDBuffer[uint32, 1, MutableAnyOrigin], max_seq_length: SIMD[uint32, 1], max_cache_length: SIMD[uint32, 1]) -> Self` ### `max_tile_size` `static max_tile_size() -> Int` Returns the maximum tile size for the KVCache. ### `cache_lengths_nd` `cache_lengths_nd(self) -> NDBuffer[uint32, 1, MutableAnyOrigin]` ### `cache_length` `cache_length(self, batch_idx: Int) -> Int` ### `load` `load[width: Int](self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int) -> SIMD[type_, width]` ### `store` `store(self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int, val: SIMD[type_, size])` ### `empty_cache` `empty_cache(self) -> Bool` Returns true if the cache\_lengths for all requests is 0, false otherwise. ### `max_prompt_length` `max_prompt_length(self) -> SIMD[uint32, 1]` Returns the maximum sequence length across all batches of the current request. ### `max_context_length` `max_context_length(self) -> SIMD[uint32, 1]` Returns the maximum cache length used across all batches of the current request. ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self, batch_idx: Int, start_tok_idx: Int, head_idx: Int, head_dim_idx: Int = 0) -> UnsafePointer[SIMD[type_, 1]]` --- ## ContinuousBatchingKVCacheCollection `struct ContinuousBatchingKVCacheCollection[type_: DType, kv_params_: KVCacheStaticParams]` This is a "view" of the cache for the given sequences in the batch. This object does not own the underlying buffers in k\_cache and v\_cache; it borrows them from the BlockWrappers in our KVCacheManager.
It does own the Pointer\[NDBuffer\[type, 3]] and the valid\_lengths buffer. ## Fields * ​cache\_lengths (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​lookup\_table (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​blocks (`NDBuffer[type_, 6, MutableAnyOrigin, DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]`): * ​max\_seq\_length (`SIMD[uint32, 1]`): * ​max\_cache\_length (`SIMD[uint32, 1]`): * ​kv\_cache\_dynamic\_shape (`IndexList[4]`): * ​kv\_cache\_dynamic\_strides (`IndexList[4]`): ## Implemented traits `AnyType`, `Copyable`, `KVCollectionT`, `Movable`, `UnknownDestructibility` ## Aliases ### `blocks_shape` `alias blocks_shape = DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size))` ### `blocks_stride` `alias blocks_stride = _strides_from_shape[::DimList,::Int]()` ### `blocks_type` `alias blocks_type = NDBuffer[type_, 6, MutableAnyOrigin, DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]` ### `CacheType` `alias CacheType = ContinuousBatchingKVCache[type_, kv_params_]` ### `kv_params` `alias kv_params = kv_params_` ### `name_str` `alias name_str = "continuous_batching"` ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(out self, blocks: NDBuffer[type_, 6, MutableAnyOrigin], cache_lengths: NDBuffer[uint32, 1, MutableAnyOrigin], lookup_table: NDBuffer[uint32, 1, MutableAnyOrigin], max_seq_length: SIMD[uint32, 1], max_cache_length: SIMD[uint32, 1])` ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value. ### `get_key_cache` `get_key_cache(self, layer_idx: Int) -> ContinuousBatchingKVCache[type_, kv_params_]` ### `get_value_cache` `get_value_cache(self, layer_idx: Int) -> ContinuousBatchingKVCache[type_, kv_params_]` ### `cache_length` `cache_length(self, bs_idx: Int) -> Int` --- ## KVCacheStaticParams `@register_passable(trivial)` `struct KVCacheStaticParams` ## Fields * ​num\_heads (`UInt`): * ​head\_size (`UInt`): ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` --- ## KVCacheT Trait for different KVCache types and implementations. Represents a single (key or value) cache. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `kv_params` `alias kv_params` ### `type` `alias type` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `cache_lengths_nd` `cache_lengths_nd(self: _Self) -> NDBuffer[uint32, 1, MutableAnyOrigin]` Returns the cache lengths as an NDBuffer. ### `cache_length` `cache_length(self: _Self, batch_idx: Int) -> Int` Returns the length of the cache for a given batch index. ### `load` `load[width: Int](self: _Self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int) -> SIMD[get_vtable_entry(:trait _Self, "type"), width]` Loads an element from the given index.
### `store` `store(self: _Self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int, val: SIMD[get_vtable_entry(:trait _Self, "type"), size])` Stores an element at the given index. ### `empty_cache` `empty_cache(self: _Self) -> Bool` Returns true if the cache\_lengths for all requests is 0, false otherwise. ### `max_prompt_length` `max_prompt_length(self: _Self) -> SIMD[uint32, 1]` Returns the maximum sequence length across all batches of the current request. ### `max_context_length` `max_context_length(self: _Self) -> SIMD[uint32, 1]` Returns the maximum cache length used across all batches of the current request. ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self: _Self, batch_idx: Int, start_tok_idx: Int, head_idx: Int, head_dim_idx: Int = 0) -> UnsafePointer[SIMD[get_vtable_entry(:trait _Self, "type"), 1]]` Returns a pointer to the KVCache block at the given index. Paged KVCache implementations must have a block\_size that is a multiple of, and greater than, the layout's first dimension. ### `max_tile_size` `static max_tile_size() -> Int` Returns the maximum tile size for the KVCache. --- ## KVCollectionT Trait for a pair of caches (keys and values). ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `CacheType` `alias CacheType` ### `kv_params` `alias kv_params` ### `name_str` `alias name_str` ### `type` `alias type` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `get_key_cache` `get_key_cache(self: _Self, layer_idx: Int) -> get_vtable_entry(:trait _Self, "CacheType")` ### `get_value_cache` `get_value_cache(self: _Self, layer_idx: Int) -> get_vtable_entry(:trait _Self, "CacheType")` ### `cache_length` `cache_length(self: _Self, bs_idx: Int) -> Int` --- ## PagedKVCache `@register_passable(trivial)` `struct PagedKVCache[type_: DType, kv_params_: KVCacheStaticParams, page_size: Int]` The PagedKVCache is a wrapper around the KVCache blocks for a given layer. It is used to access the KVCache blocks for PagedAttention.
## Fields * ​blocks (`NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]`): * ​cache\_lengths (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​lookup\_table (`NDBuffer[uint32, 2, MutableAnyOrigin]`): * ​max\_seq\_length (`SIMD[uint32, 1]`): * ​max\_cache\_length (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `KVCacheT`, `Movable`, `UnknownDestructibility` ## Aliases ### `blocks_shape` `alias blocks_shape = __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size))` ### `blocks_stride` `alias blocks_stride = _strides_from_shape[::DimList,::Int]()` ### `blocks_type` `alias blocks_type = NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]` ### `kv_params` `alias kv_params = kv_params_` ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(blocks: NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()], cache_lengths: NDBuffer[uint32, 1, MutableAnyOrigin], lookup_table: NDBuffer[uint32, 2, MutableAnyOrigin], max_seq_length: SIMD[uint32, 1], max_cache_length: SIMD[uint32, 1]) -> Self` ### `max_tile_size` `static max_tile_size() -> Int` Returns the maximum tile size for the KVCache. ### `cache_lengths_nd` `cache_lengths_nd(self) -> NDBuffer[uint32, 1, MutableAnyOrigin]` ### `cache_length` `cache_length(self, batch_idx: Int) -> Int` Returns the length of the cache for a given batch index. ### `load` `load[width: Int](self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int) -> SIMD[type_, width]` Loads an element from the given index. ### `store` `store(self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int, val: SIMD[type_, size])` Stores an element at the given index. ### `empty_cache` `empty_cache(self) -> Bool` Returns true if the cache\_lengths for all requests is 0, false otherwise. ### `max_prompt_length` `max_prompt_length(self) -> SIMD[uint32, 1]` Returns the maximum sequence length across all batches of the current request. ### `max_context_length` `max_context_length(self) -> SIMD[uint32, 1]` Returns the maximum cache length used across all batches of the current request. 
### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self, batch_idx: Int, start_tok_idx: Int, head_idx: Int, head_dim_idx: Int = 0) -> UnsafePointer[SIMD[type_, 1]]` --- ## PagedKVCacheCollection `struct PagedKVCacheCollection[type_: DType, kv_params_: KVCacheStaticParams, page_size: Int]` ## Fields * ​blocks (`NDBuffer[type_, 6, MutableAnyOrigin, DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]`): * ​cache\_lengths (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​lookup\_table (`NDBuffer[uint32, 2, MutableAnyOrigin]`): * ​max\_seq\_length (`SIMD[uint32, 1]`): * ​max\_cache\_length (`SIMD[uint32, 1]`): * ​kv\_cache\_dynamic\_shape (`IndexList[4]`): * ​kv\_cache\_dynamic\_strides (`IndexList[4]`): ## Implemented traits `AnyType`, `Copyable`, `KVCollectionT`, `Movable`, `UnknownDestructibility` ## Aliases ### `blocks_shape` `alias blocks_shape = DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size))` ### `blocks_stride` `alias blocks_stride = _strides_from_shape[::DimList,::Int]()` ### `blocks_type` `alias blocks_type = NDBuffer[type_, 6, MutableAnyOrigin, DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]` ### `CacheType` `alias CacheType = PagedKVCache[type_, kv_params_, page_size]` ### `kv_params` `alias kv_params = kv_params_` ### `name_str` `alias name_str = "paged"` ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(out self, blocks: NDBuffer[type_, 6, MutableAnyOrigin], cache_lengths: NDBuffer[uint32, 1, MutableAnyOrigin], lookup_table: NDBuffer[uint32, 2, MutableAnyOrigin], max_seq_length: SIMD[uint32, 1], max_cache_length: SIMD[uint32, 1])` ### `__copyinit__` `__copyinit__(out self, other: Self)` ### `__moveinit__` `__moveinit__(out self, owned other: Self)` ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value. ### `get_key_cache` `get_key_cache(self, layer_idx: Int) -> PagedKVCache[type_, kv_params_, page_size]` ### `get_value_cache` `get_value_cache(self, layer_idx: Int) -> PagedKVCache[type_, kv_params_, page_size]` ### `cache_length` `cache_length(self, bs_idx: Int) -> Int` --- ## types This module contains the types for the key-value cache APIs. The module includes structs implementing several different types of [KV caches](/glossary/ai/kv-cache). This module also defines two traits that specify the roles of the different structs: * `KVCacheT`: Defines the interface for a single (key or value) cache. * `KVCollectionT`: Defines the interface for a pair of caches (keys and values). ## Structs * [​`ContinuousBatchingKVCache`](./ContinuousBatchingKVCache): Wrapper for the ContinuousKVCache of a given layer in the transformer model. * [​`ContinuousBatchingKVCacheCollection`](./ContinuousBatchingKVCacheCollection): This is a "view" of the cache for the given sequences in the batch. * [​`KVCacheStaticParams`](./KVCacheStaticParams): * [​`PagedKVCache`](./PagedKVCache): The PagedKVCache is a wrapper around the KVCache blocks for a given layer. It is used to access the KVCache blocks for PagedAttention. * [​`PagedKVCacheCollection`](./PagedKVCacheCollection): ## Traits * [​`KVCacheT`](./KVCacheT): Trait for different KVCache types and implementations. * [​`KVCollectionT`](./KVCollectionT): Trait for a pair of caches (keys and values).
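Because both collection types above conform to `KVCollectionT`, kernels can be written generically against the traits rather than a concrete cache type. A minimal hedged sketch (the helper name is hypothetical and the import path assumed):

```mojo
from kv_cache.types import KVCollectionT  # assumed import path

# Hypothetical helper: find the longest cached sequence in a batch using
# only the trait's documented `cache_length` method.
fn longest_cached_sequence[C: KVCollectionT](kv: C, batch_size: Int) -> Int:
    var longest = 0
    for b in range(batch_size):
        longest = max(longest, kv.cache_length(b))
    return longest
```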
--- ## Element `struct Element[dtype: DType, layout: Layout, /, index_type: DType = _get_index_type(layout)]` A wrapper around SIMD types that provides layout-driven vectorized operations. The `Element` struct extends SIMD types with layout-aware load and store operations, enabling efficient vectorized access to multi-dimensional data. It maps between logical tensor coordinates and physical memory locations according to the specified layout. ## Parameters * ​dtype (`DType`): The data type of the elements. * ​layout (`Layout`): The memory layout describing how elements are organized. * ​index\_type (`DType`): The integer type of the index pointing to each element. ## Fields * ​element\_data (`SIMD[dtype, layout.size()]`): The actual SIMD data stored in this element. This field contains the vectorized data values that can be processed efficiently using SIMD operations. * ​runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): The runtime layout information for memory access patterns. This field stores the layout information needed to map between logical tensor coordinates and physical memory locations, supporting both compile-time and runtime-determined access patterns. ## Implemented traits `AnyType`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `element_data_type` `alias element_data_type = SIMD[dtype, layout.size()]` The SIMD type used to store and process the element data. This type alias defines a SIMD vector with the specified data type and size matching the layout's total element count, enabling efficient vectorized operations. ## Methods ### `__init__` `@implicit` `__init__(out self, element_data: SIMD[dtype, layout.size()])` Initializes an Element with the given SIMD data. **Args:** * ​element\_data (`SIMD[dtype, layout.size()]`): The SIMD data to initialize the element with. `__init__(out self, element_data: SIMD[dtype, layout.size()], runtime_layout: RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type])` Initializes an Element with the given SIMD data and runtime layout. **Args:** * ​element\_data (`SIMD[dtype, layout.size()]`): The SIMD data to initialize the element with. * ​runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): The runtime layout to use for memory access. ### `load` `static load(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], runtime_layout: RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type] = RuntimeLayout()) -> Self` Loads data from memory according to the specified layout. This method loads data from memory using the layout information to determine the memory access pattern. It supports both rank-1 and rank-2 layouts with various stride patterns, optimizing for contiguous memory access when possible. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to load from. * ​runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): The runtime layout to use for memory access. **Returns:** A new `Element` containing the loaded data. ### `masked_load` `static masked_load(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], runtime_layout: RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type] = RuntimeLayout()) -> Self` Loads data from memory with masking for partial loads. 
This method loads data from memory using the layout information, but also handles cases where the runtime dimensions are smaller than the static layout dimensions. It ensures that only valid memory locations are accessed. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to load from. * ​runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): The runtime layout to use for memory access. **Returns:** A new `Element` containing the loaded data, with zeros in positions beyond the runtime dimensions. ### `store` `store(self, ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin])` Stores element data to memory according to the specified layout. This method performs a layout-aware store operation, writing data to memory following the access patterns defined by the layout. It optimizes memory writes based on the layout's stride patterns to maximize performance. The method handles different memory layout patterns: * For rank-1 tensors with contiguous memory (stride=1), it uses vectorized stores * For rank-2 tensors with contiguous rows or columns, it uses optimized slice-based stores * For non-contiguous memory layouts, it performs element-by-element stores Unlike `masked_store()`, this method assumes the full static dimensions will be written and does not perform runtime dimension boundary checking. Note: This method is constrained to layouts with rank <= 2. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): Mutable pointer to the memory location where data will be stored. ### `masked_store` `masked_store(self, ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin])` Stores element data to memory with masking for partial stores. This method performs a layout-aware store operation with boundary checking. It ensures that only valid memory locations are written to when the runtime dimensions are smaller than the static layout dimensions, preventing out-of-bounds memory access. The method optimizes for different memory layouts: * For contiguous memory (stride=1), it uses vectorized stores when possible * For non-contiguous memory, it performs element-by-element stores * For all patterns, it respects runtime dimension bounds Note: This method is constrained to layouts with rank <= 2. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): Pointer to the memory location where data will be stored. ### `__str__` `__str__(self) -> String` Returns a string representation of the element. **Returns:** A string representation of the element's data. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the element to the specified writer. **Parameters:** * ​W (`Writer`): Type parameter representing a Writer implementation. **Args:** * ​writer (`W`): The writer to output the element representation to. --- ## MemoryElement `struct MemoryElement[dtype: DType, layout: Layout, address_space: AddressSpace, alignment: Int, /, *, index_type: DType = _get_index_type(layout, address_space)]` Represents data in memory organized according to a specific layout. The `MemoryElement` struct provides a high-level interface for accessing data in memory with a specific layout. It encapsulates a pointer to the memory location and the runtime layout information needed to access the data correctly.
This abstraction enables efficient memory operations that respect the underlying memory organization, supporting vectorized loads and stores while handling different memory layouts transparently. ## Parameters * ​dtype (`DType`): The data type of the elements. * ​layout (`Layout`): The memory layout describing how elements are organized. * ​address\_space (`AddressSpace`): The memory address space where the data is located. * ​alignment (`Int`): The memory alignment requirement for the data. * ​index\_type (`DType`): The integer type of the index pointing to each memory element. ## Fields * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment]`): Pointer to the memory location where the data is stored. This pointer provides access to the underlying memory with the specified address space and alignment requirements. It points to the first element of the data structure in memory. * ​runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): Runtime layout information used for memory access calculations. This field stores the runtime layout information needed to compute memory offsets for accessing elements according to the specified layout pattern. It handles both compile-time known dimensions and runtime-determined dimensions. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment], runtime_layout: RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type])` Initializes a `MemoryElement` with the given pointer and runtime layout. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment]`): Pointer to the memory location of the element. * ​runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): The runtime layout to use for memory access. ### `load` `load(self, out result: Element[dtype, layout, index_type])` Loads data from memory according to the specified layout. This method performs a layout-aware load operation, reading data from memory following the access patterns defined by the layout. It optimizes memory reads based on the layout's stride patterns to maximize performance. The method leverages the underlying `Element.load` implementation which handles different memory layout patterns including contiguous and strided access. **Returns:** An `Element` containing the loaded data organized according to the layout. ### `store` `store(self, src: Element[dtype, layout, index_type])` Stores element data to the memory location of this MemoryElement. This method performs a layout-aware store operation, writing data to memory following the access patterns defined by the layout. It optimizes memory writes based on the layout's stride patterns to maximize performance. The method delegates to the `Element.store` implementation which handles different memory layout patterns including vectorized stores for contiguous memory and element-by-element stores for non-contiguous layouts. **Args:** * ​src (`Element[dtype, layout, index_type]`): The `Element` containing the data to store. ### `transfer` `transfer(self, src: MemoryElement[dtype, layout, address_space, alignment, index_type=index_type])` Transfers data from another `MemoryElement` to this one. This method efficiently transfers data between memory locations with potentially different layouts and data types. It performs the following operations: 1. 
Loads data from the source `MemoryElement` using its layout 2. Converts the data to the destination data type if necessary 3. Stores the converted data to the destination memory location using its layout This provides a high-performance way to copy and convert data between different memory representations while respecting both source and destination memory layouts. **Args:** * ​src (`MemoryElement[dtype, layout, address_space, alignment, index_type=index_type]`): The source `MemoryElement` to transfer data from. --- ## element Provides element-based access to memory using layout-driven vectorization. This module implements efficient memory access patterns for multi-dimensional data using the layout system. It provides abstractions for loading and storing data with specific memory layouts, enabling vectorized operations that respect the underlying memory organization. Key components: * `Element`: A wrapper around SIMD types that provides layout-driven vectorized operations * `MemoryElement`: Represents data in memory organized according to a specific layout These components enable efficient tensor operations by ensuring memory accesses follow optimal patterns defined by the layout system. ## Structs * [​`Element`](./Element): A wrapper around SIMD types that provides layout-driven vectorized operations. * [​`MemoryElement`](./MemoryElement): Represents data in memory organized according to a specific layout. --- ## layout Provides layout and layout tensor types, which abstract memory layout for multidimensional data. * The [`Layout`](/mojo/kernels/layout/layout/Layout) type represents a mapping between a set of logical coordinates and a linear index. It can be used, for example, to map logical tensor coordinates to a memory address, or to map GPU threads to tiles of data. * The [`LayoutTensor`](/mojo/kernels/layout/layout_tensor/LayoutTensor) type is a high-performance tensor with explicit memory layout via a `Layout`. ## Modules * [​`element`](./element/): Provides element-based access to memory using layout-driven vectorization. * [​`int_tuple`](./int_tuple/): Hierarchical integer tuple data structures for high-performance tensor operations. * [​`layout`](./layout/): Provides a high-performance tensor layout system for memory mapping and indexing. * [​`layout_tensor`](./layout_tensor/): Provides the `LayoutTensor` type for representing multidimensional data. * [​`math`](./math/): Implements math methods that work on layout tensors. * [​`runtime_layout`](./runtime_layout/): Provides the `RuntimeLayout` type and functions for working with it. You can use `RuntimeLayout` to define a layout where the dimensions are not known at compile time. * [​`runtime_tuple`](./runtime_tuple/): Provides the `RuntimeTuple` data structure and related utility functions for handling tuple-like data with both compile-time and runtime elements. `RuntimeTuple` is designed for high-performance tensor operations, supporting efficient manipulation of multi-dimensional data structures like shapes, indices, and coordinates. * [​`swizzle`](./swizzle/): Defines swizzle layouts for optimizing memory access patterns. 
* [​`tensor_builder`](./tensor_builder/): Tensor Builder Module * [​`tensor_core`](./tensor_core/): Tensor Core Module for High-Performance Matrix Operations * [​`tensor_core_async`](./tensor_core_async/): Tensor Core Async Module * [​`tma_async`](./tma_async/): Tensor Memory Accelerator (TMA) Asynchronous Operations Module --- ## IntArray `@register_passable` `struct IntArray` A memory-efficient, register-passable array of integers. `IntArray` provides a low-level implementation of a dynamically-sized integer array with direct memory management. It supports both owned and non-owned (view) modes for efficient memory sharing without copying. This struct serves as the underlying storage mechanism for `IntTuple` and related data structures, optimized for high-performance tensor operations. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(size: Int = 0) -> Self` Initialize a new owned `IntArray` with the specified size. **Args:** * ​size (`Int`): Number of integers to allocate space for. Defaults to 0. `__init__(*, non_owned: Self, offset: Int = 0) -> Self` Create a non-owned view into another `IntArray`. Creates a view starting at the specified offset in the source array. The resulting array doesn't own the memory and won't free it when destroyed. **Args:** * ​non\_owned (`Self`): The source array to create a view into. * ​offset (`Int`): Starting position in the source array. Defaults to 0. ### `__copyinit__` `__copyinit__(existing: Self) -> Self` Initialize by copying an existing `IntArray`. For owned arrays, this performs a deep copy of the data. For non-owned arrays, this creates another view of the same data (zero-copy operation). **Args:** * ​existing (`Self`): The source array to copy from. ### `__del__` `__del__(owned self)` Destroy the `IntArray` and free its memory if owned. Only frees memory for owned arrays (positive \_size) to prevent double-free errors with views. ### `__getitem__` `__getitem__(self, idx: Int) -> Int` Access an element at the specified index. Note: Bounds checking is only performed when `INT_TUPLE_VALIDATION` is enabled. **Args:** * ​idx (`Int`): Zero-based index of the element to access. **Returns:** The integer value at the specified index. ### `__setitem__` `__setitem__(mut self, idx: Int, value: Int)` Set the value at the specified index. Note: Bounds checking is only performed when `INT_TUPLE_VALIDATION` is enabled. **Args:** * ​idx (`Int`): Zero-based index of the element to modify. * ​value (`Int`): The integer value to store at the specified index. ### `owning` `owning(self) -> Bool` Check if this `IntArray` owns its memory. **Returns:** True if this array owns its memory (positive \_size), False if it's a view (negative \_size). ### `size` `size(self) -> Int` Get the number of elements in the array. **Returns:** The number of elements in the array, regardless of ownership status. ### `copy_from` `copy_from(mut self, offset: Int, source: Self, size: Int)` Copy elements from another `IntArray`. **Args:** * ​offset (`Int`): Destination offset in this array. * ​source (`Self`): Source array to copy from. * ​size (`Int`): Number of elements to copy. `copy_from(mut self, dst_offset: Int, source: Self, src_offset: Int, size: Int)` Copy elements from another IntArray with source offset. **Args:** * ​dst\_offset (`Int`): Destination offset in this array. * ​source (`Self`): Source array to copy from. * ​src\_offset (`Int`): Source offset in the source array. * ​size (`Int`): Number of elements to copy. 
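A short hedged example of `IntArray`'s owned and view modes, using only the constructors and accessors documented above (the import path is assumed):

```mojo
from layout.int_tuple import IntArray  # assumed import path

fn main():
    var arr = IntArray(4)  # owned allocation of four Ints
    for i in range(arr.size()):
        arr[i] = i * 10
    # A zero-copy view starting at offset 1; it never frees the memory.
    var view = IntArray(non_owned=arr, offset=1)
    print(view[0])  # 10
    print(arr.owning(), view.owning())  # True False
```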
--- ## IntTuple `struct IntTuple[origin: ImmutableOrigin = {}]` A hierarchical, nested tuple of integers with efficient memory management. IntTuple provides a flexible data structure for representing multi-dimensional shapes, indices, and other nested integer collections. It supports both flat and hierarchical representations with efficient memory sharing. This structure is fundamental for tensor operations, layout specifications, and dimension handling in high-performance computing contexts. ## Parameters * ​origin (`ImmutableOrigin`): Origin tracking for memory safety. Defaults to the current origin. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `EqualityComparable`, `Intable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `MinimumValue` `alias MinimumValue = -65534` Minimum allowed value for integers in an `IntTuple`. This constant defines the lower bound for integer values that can be stored directly in an `IntTuple`. Values below this threshold are reserved for internal use to represent structural information like sub-tuple offsets. ## Methods ### `__init__` `__init__(out self)` Initialize an empty IntTuple. Creates an `IntTuple` with zero elements, which can be used as a starting point for building tuples incrementally with `append` or `extend`. Performance: * Minimal allocation (just a single element for length). * Structure validation only performed when `INT_TUPLE_VALIDATION` is enabled. `__init__(out self, *, num_elems: Int)` Initialize an `IntTuple` with a specified number of uninitialized elements. Creates an `IntTuple` with space for the specified number of elements, but does not initialize the elements themselves. Note: Structure validation only performed when `INT_TUPLE_VALIDATION` is enabled. **Args:** * ​num\_elems (`Int`): The number of elements to allocate space for. `@implicit` `__init__(out self, *elements: Int)` Initialize an `IntTuple` with a variadic list of integers. Creates an `IntTuple` containing the provided integer values. This constructor is implicit, allowing direct conversion from integer lists. **Args:** * ​\*elements (`Int`): Variable number of integer values to store in the tuple. `__init__(out self, elements: VariadicList[Int])` Initialize an `IntTuple` with a list of integers. Creates an `IntTuple` containing the provided integer values. This constructor is implicit, allowing direct conversion from integer lists. Notes: * Pre-allocates exact memory needed for efficiency. * Validates that all values are above `MinimumValue`. If any value is less than `MinimumValue`, aborts with an error message. * Structure validation only performed when `INT_TUPLE_VALIDATION` is enabled. **Args:** * ​elements (`VariadicList[Int]`): List of integer values to store in the tuple. `@implicit` `__init__(out self, value: Int)` Initialize an `IntTuple` with a single integer value. Creates an `IntTuple` containing a single integer element. **Args:** * ​value (`Int`): The integer value to store in the tuple. `__init__(out self, *elements: IntTuple[origin], *, __list_literal__: Tuple[] = Tuple())` Initialize an `IntTuple` with nested IntTuples. Creates a hierarchical `IntTuple` containing the provided `IntTuple` elements, preserving their nested structure. **Args:** * ​\*elements (`IntTuple[origin]`): Variable number of `IntTuple` values to store in the tuple. * ​`__list_literal__` (`Tuple[]`): Specifies that this constructor can be used for list literals.
`__init__(out self, *, non_owned: IntArray)` Initialize an `IntTuple` with a non-owned `IntArray`. Creates an `IntTuple` that uses the provided `IntArray` as its storage without taking ownership. This allows creating views into existing `IntTuple` data without copying. **Args:** * ​non\_owned (`IntArray`): The `IntArray` to use as storage without taking ownership. `__init__(out self, existing: Self, rng: _StridedRange)` Initialize an `IntTuple` as a slice of an existing `IntTuple`. Creates a new `IntTuple` containing only the elements from the existing `IntTuple` that are specified by the range. Notes: * Preserves nested structure of elements in the slice. * Structure validation only performed when `INT_TUPLE_VALIDATION` is enabled. **Args:** * ​existing (`Self`): The source `IntTuple` to slice from. * ​rng (`_StridedRange`): The range of indices to include in the new `IntTuple`. `__init__(out self, dimlist: DimList)` Initialize an `IntTuple` from a DimList. Creates an `IntTuple` containing the dimensions from a DimList, handling both defined and undefined dimensions appropriately. Notes: * Converts undefined dimensions to `UNKNOWN_VALUE`. * Validates that all values are above `MinimumValue`. If any value is less than `MinimumValue`, aborts with an error message. **Args:** * ​dimlist (`DimList`): The DimList containing dimension information. `@implicit` `__init__(out self, zipper: _zip[origin, 2])` Initialize an `IntTuple` from a zip iterator. Creates an `IntTuple` by appending each element from the zip iterator. This constructor is implicit, allowing direct conversion from zip iterators. Note: This implementation is not optimized and may be improved in future versions. **Args:** * ​zipper (`_zip[origin, 2]`): A zip iterator containing pairs of elements to append. ### `__copyinit__` `__copyinit__(out self, existing: Self)` Initialize by copying an existing `IntTuple`. Creates a deep copy of the provided `IntTuple`, copying all its data into newly allocated memory. Note: There is a Mojo bug where this method unnecessarily propagates the origin of self to the new copy. **Args:** * ​existing (`Self`): The `IntTuple` to copy from. ### `__getitem__` `__getitem__(self, _idx: Int) -> IntTuple[self]` Retrieves an element at the specified index from the `IntTuple`. Supports negative indexing (e.g., `-1` for the last element). Notes: If index validation is enabled and the index is out of bounds, aborts with an error message. **Args:** * ​\_idx (`Int`): The index of the element to retrieve. **Returns:** An `IntTuple` containing either a single value or a sub-tuple. `__getitem__(self, span: Slice) -> Self` Retrieves a slice of elements from the `IntTuple`. Creates a new `IntTuple` containing the elements specified by the slice. **Args:** * ​span (`Slice`): A slice object specifying the range of elements to retrieve. **Returns:** A new `IntTuple` containing the specified elements. ### `__lt__` `__lt__(self, rhs: IntTuple[origin]) -> Bool` Compare two `IntTuple`s lexicographically. This function performs element-wise comparison of two `IntTuple`s and determines if the first is lexicographically less than the second. It compares corresponding elements until it finds a pair where the elements differ. Example: ```mojo from layout.int_tuple import IntTuple var tuple1 = IntTuple(1, 2, 3) var tuple2 = IntTuple(1, 2, 4) var result = tuple1 < tuple2 ``` **Args:** * ​rhs (`IntTuple[origin]`): The other `IntTuple` to compare. **Returns:** True if `self` is lexicographically less than `rhs`, False otherwise.
### `__eq__` `__eq__(self, other: Self) -> Bool` Equality operator for `IntTuple`. **Args:** * ​other (`Self`): The `IntTuple` to compare with. **Returns:** True if the `IntTuple`s are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Inequality operator for `IntTuple`. **Args:** * ​other (`Self`): The `IntTuple` to compare with. **Returns:** True if the `IntTuple`s are not equal, False otherwise. ### `elements_size` `static elements_size[origin: ImmutableOrigin](elements: VariadicListMem[IntTuple[origin], origin, is_owned]) -> Int` Calculate the total storage size needed for a list of IntTuples. Computes the sum of sizes for all elements, accounting for both direct integer values and nested sub-tuples. **Parameters:** * ​origin (`ImmutableOrigin`): Origin of the elements in the `IntTuple`. **Args:** * ​elements (`VariadicListMem[IntTuple[origin], origin, is_owned]`): List of `IntTuple` elements to measure. **Returns:** The total storage size required for all elements. `static elements_size[origin: ImmutableOrigin, n: Int](elements: InlineArray[Pointer[IntTuple, origin], n], idx: Int) -> Int` Calculate the total storage size needed for IntTuples at a specific index. Computes the sum of sizes for all elements at the given index in an array of `IntTuple` pointers. **Parameters:** * ​origin (`ImmutableOrigin`): Origin tracking for memory safety. * ​n (`Int`): Size of the inline array. **Args:** * ​elements (`InlineArray[Pointer[IntTuple, origin], n]`): Array of pointers to `IntTuple`s. * ​idx (`Int`): Index to access in each `IntTuple`. **Returns:** The total storage size required for all elements at the specified index. ### `owned_copy` `owned_copy(self) -> IntTuple` Create a deep copy of this `IntTuple` with its own memory ownership. This method creates a completely independent copy of the `IntTuple` with newly allocated memory. Unlike `__copyinit__`, this method can be called on an existing instance to create a separate copy. Example: ```mojo from layout import IntTuple var original = IntTuple(1, 2, 3) var copy = original.owned_copy() # Modifying copy will not affect original ``` . **Returns:** A new `IntTuple` containing the same data as this one but with independent memory ownership. ### `replace_entry` `replace_entry(self, idx: Int, value: IntTuple[origin]) -> IntTuple` Replace an entry in the tuple with another `IntTuple`. Creates a new `IntTuple` with the element at the specified index replaced by the provided `IntTuple`. Note: If the index is out of bounds and `INT_TUPLE_VALIDATION` is enabled, aborts with an error message. **Args:** * ​idx (`Int`): The index of the element to replace. * ​value (`IntTuple[origin]`): The `IntTuple` to insert at the specified index. **Returns:** A new `IntTuple` with the replacement applied. `replace_entry(mut self, idx: Int, *, int_value: Int)` Replace an integer value at the specified index in-place. Directly modifies the tuple by replacing the integer value at the given index. This is more efficient than creating a new tuple when only a single value needs to be changed. Note: If the index is out of bounds and `INT_TUPLE_VALIDATION` is enabled, aborts with an error message. **Args:** * ​idx (`Int`): The index of the element to replace. * ​int\_value (`Int`): The integer value to insert at the specified index. ### `count_values` `count_values(self) -> Int` Count the total number of integer values in this tuple hierarchy. Recursively traverses the nested tuple structure and counts all integer values. 
This is useful for determining the size needed for flattened representations. Note: For a flat tuple, this will return the same value as `len(self)`. For nested tuples, it counts all leaf integer values. **Returns:** The total count of integer values in this tuple and all nested tuples. ### `flatten` `flatten(self) -> IntTuple` Flatten a nested `IntTuple` into a single-level `IntTuple`. This function converts a hierarchical `IntTuple` structure into a flat sequence of integer values, preserving the order of elements. **Returns:** A new `IntTuple` containing all integer values in a flat structure. ### `all_known` `all_known(self) -> Bool` Check if all values in this tuple hierarchy are known (not `UNKNOWN_VALUE`). Recursively traverses the nested tuple structure and checks if any value is equal to `UNKNOWN_VALUE`. **Returns:** True if all values in this tuple and nested tuples are known, False if any value is `UNKNOWN_VALUE`. ### `append` `append(mut self, *elements: IntTuple[origin])` Append one or more `IntTuple` elements to this tuple. This method modifies the tuple in-place by adding the provided elements to the end of the tuple. It handles both value tuples and nested tuples. Notes: * This operation requires reallocating the underlying `IntArray` storage to accommodate the new elements, which may impact performance for large tuples. * Aborts if called on a non-owning (sub-tuple) instance. **Args:** * ​\*elements (`IntTuple[origin]`): Variable number of `IntTuple` objects to append to this tuple. ### `extend` `extend(mut self, tuple: IntTuple[origin])` Extends this tuple by appending all elements from another tuple. This method modifies the tuple in-place by adding all elements from the provided tuple to the end of this tuple. It efficiently handles both value elements and nested tuples. Notes: * This operation requires reallocating the underlying `IntArray` storage to accommodate the new elements, which may impact performance for large tuples. * Aborts if called on a non-owning (sub-tuple) instance. * If the input tuple is empty, this method returns without making any changes. **Args:** * ​tuple (`IntTuple[origin]`): The `IntTuple` whose elements will be appended to this tuple. ### `size` `size(self) -> Int` Returns the total size of the `IntTuple` in memory. For owning tuples, returns the size of the underlying `IntArray`. For non-owning tuples, calculates the size recursively. **Returns:** The total size in memory units. ### `tuple_size` `static tuple_size(data: IntArray) -> Int` Recursively calculates the size of a tuple represented by an `IntArray`. This method traverses the tuple structure, accounting for both direct values and nested sub-tuples to compute the total memory footprint. **Args:** * ​data (`IntArray`): `IntArray` containing the tuple data. **Returns:** The total size of the tuple in memory units. ### `validate_structure` `validate_structure(self)` Validates the internal structure of the `IntTuple`. Ensures that the actual size of the underlying data matches the computed size based on the tuple's structure. This helps detect memory corruption or implementation errors. Aborts execution with an error message if validation fails. ### `__len__` `__len__(self) -> Int` Returns the number of elements in the `IntTuple`. This is the logical length of the tuple, not its memory size. **Returns:** The number of elements in the tuple. ### `__iter__` `__iter__(self) -> _IntTupleIter[self, origin]` Returns an iterator over the elements of the `IntTuple`. 
This enables iteration through the tuple using for-loops. **Returns:** An iterator object for this `IntTuple`. ### `is_value` `is_value(self) -> Bool` Determines if this `IntTuple` represents a single value rather than a tuple. **Returns:** True if this `IntTuple` contains exactly one element that is a value, False otherwise. `is_value(self, i: Int) -> Bool` Determines if the element at the specified index is a value rather than a tuple. Notes: If index validation is enabled and the index is out of bounds, aborts with an error message. **Args:** * ​i (`Int`): The index of the element to check. **Returns:** True if the element at index i is a value, False if it's a tuple. ### `is_tuple` `is_tuple(self) -> Bool` Determines if this `IntTuple` represents a tuple rather than a single value. **Returns:** True if this `IntTuple` is a tuple (not a single value), False otherwise. `is_tuple(self, i: Int) -> Bool` Determines if the element at the specified index is a tuple rather than a value. Notes: This is the complement of is\_value(i). **Args:** * ​i (`Int`): The index of the element to check. **Returns:** True if the element at index i is a tuple, False if it's a value. ### `value` `value(self) -> Int` Retrieves the value of this `IntTuple` if it represents a single value. This method should only be called if `is_value()` returns True. **Returns:** The integer value stored in this `IntTuple`. `value(self, i: Int) -> Int` Retrieves the value of the element at the specified index. This method should only be called if `is_value(i)` returns True. Notes: If the element is not a value, the behavior is undefined. **Args:** * ​i (`Int`): The index of the element to retrieve. **Returns:** The integer value stored at the specified index. ### `tuple` `tuple(ref self) -> ref [self] Self` Returns a reference to this `IntTuple` as a tuple. Notes: This method is used to access the current `IntTuple` as a tuple without creating a copy of the data. **Returns:** A reference to this `IntTuple` to avoid unnecessary copying. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes a string representation of this `IntTuple` to the provided writer. Notes: For single values, writes just the value. For tuples, writes a comma-separated list of elements enclosed in parentheses. **Parameters:** * ​W (`Writer`): A type that conforms to the Writer trait. **Args:** * ​writer (`W`): The writer to output the string representation to. ### `__str__` `__str__(self) -> String` Returns a string representation of this `IntTuple`. **Returns:** A string representation of the `IntTuple`, using the `write_to` method. ### `is_equal` `static is_equal(a: IntTuple[origin], b: IntTuple[origin]) -> Bool` Compares two `IntTuple`s for equality. Notes: Handles nested tuples and special cases where a single-element tuple is equivalent to its contained value. **Args:** * ​a (`IntTuple[origin]`): The first `IntTuple` to compare. * ​b (`IntTuple[origin]`): The second `IntTuple` to compare. **Returns:** True if the `IntTuple`s are equal in structure and values, False otherwise. ### `__repr__` `__repr__(self) -> String` Returns a string representation of this `IntTuple` for debugging. **Returns:** A string representation of the `IntTuple`, same as `__str__`. ### `__int__` `__int__(self) -> Int` Converts this `IntTuple` to an integer. This method should only be called if `is_value()` returns True. Notes: If the `IntTuple` is not a single value, the behavior is undefined. **Returns:** The integer value stored in this `IntTuple`. 
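Before moving on to the free functions, a brief sketch that exercises several of the inspection methods together; the outputs in the comments are what the documented semantics suggest.

```mojo
from layout import IntTuple

var t = IntTuple(2, IntTuple(3, 4))

print(len(t))            # 2: top-level elements
print(t.count_values())  # 3: leaf integer values
print(t.flatten())       # (2, 3, 4)
print(t.is_tuple(1))     # True: the second element is a sub-tuple
print(t.value(0))        # 2: safe because is_value(0) is True

for elem in t:
    print(elem)          # 2, then (3, 4)
```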
--- ## abs `abs(t: IntTuple[origin]) -> IntTuple` Compute the absolute value of each element in an `IntTuple`. This function applies the absolute value operation to each integer in a potentially nested `IntTuple` structure. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to transform. **Returns:** A new `IntTuple` with the same structure but with absolute values. --- ## apply `apply[: origin.set, //, func: fn(Int) capturing -> Int](t: IntTuple[origin]) -> IntTuple` Apply a function to each integer value in an `IntTuple`. This function recursively applies the given function to each integer value in a potentially nested `IntTuple` structure, preserving the structure. **Parameters:** * ​func (`fn(Int) capturing -> Int`): Function to apply to each integer value. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to transform. **Returns:** A new `IntTuple` with the same structure but with each integer value transformed by the function. --- ## apply_predicate `apply_predicate[predicate: fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> Bool](a: IntTuple[origin], b: IntTuple[origin]) -> Bool` Apply a predicate function recursively to two `IntTuple`s. This function traverses two `IntTuple`s with the same structure and applies a predicate function to corresponding elements. The predicate is applied only to the leaf nodes (integer values). Note: If the structures of the two `IntTuple`s don't match (different nesting or length), the function returns False without applying the predicate. **Parameters:** * ​predicate (`fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> Bool`): A function that takes two `IntTuple`s (containing integer values) and returns a boolean result. **Args:** * ​a (`IntTuple[origin]`): First `IntTuple` to compare. * ​b (`IntTuple[origin]`): Second `IntTuple` to compare. **Returns:** True if the predicate returns True for all corresponding elements and the structures match, False otherwise. --- ## apply_zip `apply_zip[func: fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> IntTuple](t1: IntTuple[origin], t2: IntTuple[origin]) -> IntTuple` Apply a function to pairs of elements from two `IntTuple`s. This function zips two `IntTuple`s together and applies the given function to each pair of elements, creating a new `IntTuple` with the results. **Parameters:** * ​func (`fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> IntTuple`): Function that takes two `IntTuple`s and returns an `IntTuple`. **Args:** * ​t1 (`IntTuple[origin]`): First `IntTuple`. * ​t2 (`IntTuple[origin]`): Second `IntTuple`. **Returns:** A new `IntTuple` containing the results of applying func to each pair. `apply_zip[: origin.set, //, func: fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) capturing -> IntTuple](t1: IntTuple[origin], t2: IntTuple[origin]) -> IntTuple` Apply a capturing function to pairs of elements from two `IntTuple`s. This overload allows the function to capture variables from its environment. **Parameters:** * ​func (`fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) capturing -> IntTuple`): Capturing function that takes two `IntTuple`s and returns an `IntTuple`. **Args:** * ​t1 (`IntTuple[origin]`): First `IntTuple`. * ​t2 (`IntTuple[origin]`): Second `IntTuple`. **Returns:** A new `IntTuple` containing the results of applying func to each pair. 
`apply_zip[func: fn[ImmutableOrigin, ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1], IntTuple[$2]) -> IntTuple](t1: IntTuple[origin], t2: IntTuple[origin], t3: IntTuple[origin]) -> IntTuple` Apply a function to triplets of elements from three `IntTuple`s. This function zips three `IntTuple`s together and applies the given function to each triplet of elements, creating a new `IntTuple` with the results. **Parameters:** * ​func (`fn[ImmutableOrigin, ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1], IntTuple[$2]) -> IntTuple`): Function that takes three `IntTuple`s and returns an `IntTuple`. **Args:** * ​t1 (`IntTuple[origin]`): First `IntTuple`. * ​t2 (`IntTuple[origin]`): Second `IntTuple`. * ​t3 (`IntTuple[origin]`): Third `IntTuple`. **Returns:** A new `IntTuple` containing the results of applying func to each triplet. `apply_zip[: origin.set, //, func: fn[ImmutableOrigin, ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1], IntTuple[$2]) capturing -> IntTuple](t1: IntTuple[origin], t2: IntTuple[origin], t3: IntTuple[origin]) -> IntTuple` Apply a capturing function to triplets of elements from three `IntTuple`s. This overload allows the function to capture variables from its environment. **Parameters:** * ​func (`fn[ImmutableOrigin, ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1], IntTuple[$2]) capturing -> IntTuple`): Capturing function that takes three `IntTuple`s and returns an `IntTuple`. **Args:** * ​t1 (`IntTuple[origin]`): First `IntTuple`. * ​t2 (`IntTuple[origin]`): Second `IntTuple`. * ​t3 (`IntTuple[origin]`): Third `IntTuple`. **Returns:** A new `IntTuple` containing the results of applying func to each triplet. --- ## compact_order `compact_order(shape: IntTuple[origin], order: IntTuple[origin]) -> IntTuple` Create a compact stride based on shape and order. This function generates a stride tuple where lower order numbers imply faster varying strides. The resulting shape and stride form a bijective layout. Performance: * Always inlined for optimal performance in tight loops. * Flattens inputs and re-nests results for consistent behavior. Example: ```mojo from layout import IntTuple from layout.int_tuple import compact_order # Create a compact layout with dimensions (2,3,4,5) and ordering (1,4,3,5) var x = compact_order(IntTuple(2,3,4,5), IntTuple(1,4,3,5)) # returns (1,8,2,24) # Create a compact layout with nested dimensions and corresponding ordering var y = compact_order(IntTuple(2,IntTuple(3,4),5), IntTuple(1,IntTuple(2,3),4)) # returns (1,(2,6),24) ``` . **Args:** * ​shape (`IntTuple[origin]`): The shape tuple defining dimensions. * ​order (`IntTuple[origin]`): The order tuple defining the relative ordering of dimensions. **Returns:** A stride tuple that creates a compact memory layout according to the specified order. --- ## compatible `compatible(a: IntTuple[origin], b: IntTuple[origin]) -> Bool` Test if two shapes are compatible for tensor operations. This function checks if shape A is compatible with shape B, meaning: 1. The total size of A and B are the same 2. Any coordinate into A can also be used as a coordinate into B Compatible can also be thought of as a partial order on A and B: A <= B. **Args:** * ​a (`IntTuple[origin]`): The first `IntTuple` to compare. * ​b (`IntTuple[origin]`): The second `IntTuple` to compare. **Returns:** True if shape A is compatible with shape B, False otherwise. --- ## congruent `congruent(a: IntTuple[origin], b: IntTuple[origin]) -> Bool` Test if two `IntTuple`s have the same hierarchical structure.
This function checks if two `IntTuple`s have identical nesting patterns, regardless of the actual integer values they contain. **Args:** * ​a (`IntTuple[origin]`): First `IntTuple` to compare. * ​b (`IntTuple[origin]`): Second `IntTuple` to compare. **Returns:** True if both `IntTuple`s have the same hierarchical structure, False otherwise. --- ## crd2idx `crd2idx(crd: IntTuple[origin], shape: IntTuple[origin]) -> Int` Map a logical coordinate to a linear index. This function converts a multi-dimensional coordinate to a linear index based on the shape. It uses default strides computed from the shape. **Args:** * ​crd (`IntTuple[origin]`): The coordinate tuple to convert. * ​shape (`IntTuple[origin]`): The shape of the tensor/array. **Returns:** The linear index corresponding to the coordinate. `crd2idx(crd: IntTuple[origin], shape: IntTuple[origin], _stride: IntTuple[origin]) -> Int` Map a logical coordinate to a linear index with custom strides. This function converts a multi-dimensional coordinate to a linear index based on the shape and stride information. If no stride is provided, it computes default strides from the shape. The function handles various input combinations: * Tuple coordinates with tuple shapes and strides * Single integer coordinate with tuple shapes and strides * Single integer coordinate with single integer shape and stride Aborts: * If coordinate and shape dimensions don't match. * If shape and stride dimensions don't match. * If input type combinations are invalid. **Args:** * ​crd (`IntTuple[origin]`): The coordinate(s) to convert, can be a single value or a tuple of coordinates. * ​shape (`IntTuple[origin]`): The shape of the tensor/array, can be a single value or a tuple of dimensions. * ​\_stride (`IntTuple[origin]`): Optional custom strides; if empty, strides are computed from the shape using `prefix_product`. **Returns:** The linear index corresponding to the coordinate. --- ## depth `depth(src: IntTuple[origin]) -> Int` Calculates the maximum nesting depth of an `IntTuple`. This function recursively traverses the `IntTuple` structure to determine its maximum nesting depth. A scalar value has depth 0, a flat tuple has depth 1, and nested tuples increase the depth accordingly. Example: ```mojo from layout import IntTuple from layout.int_tuple import depth print(depth(IntTuple(1))) # prints 0 print(depth(IntTuple(1, 2))) # prints 1 print(depth(IntTuple(IntTuple(1, 2)))) # prints 2 ``` . **Args:** * ​src (`IntTuple[origin]`): The `IntTuple` to measure the depth of. **Returns:** An integer representing the maximum nesting depth. --- ## fill_like `fill_like(src: IntTuple[origin], val: Int) -> IntTuple` Creates an `IntTuple` with the same structure as the source but filled with a specified value. This function recursively traverses the source `IntTuple` and creates a new `IntTuple` with identical structure, but with all leaf values replaced by the specified value. **Args:** * ​src (`IntTuple[origin]`): The source `IntTuple` whose structure will be copied. * ​val (`Int`): The integer value to fill the new `IntTuple` with. **Returns:** A new `IntTuple` with the same structure as src but filled with val. --- ## flatten `flatten(t: IntTuple[origin]) -> IntTuple` Flatten a nested `IntTuple` into a single-level `IntTuple`. This function converts a hierarchical `IntTuple` structure into a flat sequence of integer values, preserving the order of elements. **Args:** * ​t (`IntTuple[origin]`): The nested `IntTuple` to flatten. **Returns:** A new `IntTuple` containing all integer values in a flat structure.
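The structural helpers `depth`, `fill_like`, and `flatten` compose naturally. A minimal sketch:

```mojo
from layout import IntTuple
from layout.int_tuple import depth, fill_like, flatten

var shape = IntTuple(2, IntTuple(3, 4))

print(depth(shape))         # 2: one level of nesting below the top
print(flatten(shape))       # (2, 3, 4)
print(fill_like(shape, 0))  # (0, (0, 0)): same structure, zero-filled
```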
--- ## idx2crd `idx2crd(idx: IntTuple[origin], shape: IntTuple[origin]) -> IntTuple` Converts a linear index to a coordinate tuple within a given shape. This function splits an index into a coordinate within a Shape via a colexicographical enumeration of coordinates in Shape. **Args:** * ​idx (`IntTuple[origin]`): The linear index to convert. * ​shape (`IntTuple[origin]`): The shape of the tensor/array. **Returns:** A new `IntTuple` containing the coordinates corresponding to the linear index. `idx2crd(idx: IntTuple[origin], shape: IntTuple[origin], _stride: IntTuple[origin]) -> IntTuple` Converts a linear index to a coordinate tuple within a given shape using custom strides. **Args:** * ​idx (`IntTuple[origin]`): The linear index to convert. * ​shape (`IntTuple[origin]`): The shape of the tensor/array. * ​\_stride (`IntTuple[origin]`): Custom strides to use for the conversion. **Returns:** A new `IntTuple` containing the coordinates corresponding to the linear index. --- ## idx2crd2 `idx2crd2(idx: IntTuple[origin], shape: IntTuple[origin], _stride: IntTuple[origin]) -> IntTuple` Convert a linear index to coordinates. This function handles the actual conversion logic for different input combinations. Notes: * Handles four cases: tuple-tuple-tuple, tuple-int-int, int-tuple-tuple, and int-int-int. * When input shapes don't match, `abort()` will be called. **Args:** * ​idx (`IntTuple[origin]`): The linear index to convert. * ​shape (`IntTuple[origin]`): The shape of the tensor/array. * ​\_stride (`IntTuple[origin]`): Custom strides to use for the conversion. If empty, strides are computed from the shape using prefix\_product. **Returns:** A new IntTuple containing the coordinates corresponding to the linear index. --- ## int_tuple Hierarchical integer tuple data structures for high-performance tensor operations. This module provides a flexible, memory-efficient implementation of nested integer tuples optimized for tensor shape, stride, and index operations in high-performance computing. The core data structures support both flat and hierarchical representations with efficient memory sharing and zero-copy views. 
Key components: * `IntArray`: Low-level register-passable array with direct memory management * `IntTuple`: Hierarchical nested tuple with efficient memory layout and operations * Utility functions for tensor shape manipulation, coordinate transformations, and layout operations Performance features: * Register-passable data structures for optimal compiler optimizations * Zero-copy views for efficient memory sharing * Specialized memory layout for nested structures * Optimized algorithms for common tensor operations Common operations: * Shape manipulation: `flatten`, `to_nest`, `apply`, `product`, `sum` * Coordinate transformations: `idx2crd`, `crd2idx` * Layout operations: `compact_order`, `prefix_product` * Structural comparisons: `congruent`, `compatible`, `weakly_congruent` Example usage: ```mojo from layout import IntTuple from layout.int_tuple import flatten, compact_order, size # Create nested tuples var shape = IntTuple(2, IntTuple(3, 4), 5) # Represents shape (2, (3, 4), 5) # Flatten a nested tuple var flat = flatten(shape) # Results in (2, 3, 4, 5) # Create compact strides for a given shape and order var order = IntTuple(1, IntTuple(2, 3), 4) var strides = compact_order(shape, order) # Results in (1, (2, 6), 24) # Calculate total size (product of all elements) var total_size = size(shape) # Results in 120 ``` ## Aliases ### `INT_TUPLE_VALIDATION` `alias INT_TUPLE_VALIDATION = False` ### `IntList` `alias IntList = List[Int, True]` A type alias for a List of integers with ownership. This alias defines a List that contains Int values and has ownership of its data. It's used throughout the module for storing and manipulating collections of integers, particularly for operations like permutations and indices. ### `UNKNOWN_VALUE` `alias UNKNOWN_VALUE = -1` Special value indicating an unknown or unspecified dimension. This constant is used throughout the `IntTuple` system to represent dimensions that are not known at compile time or have not been specified. ## Structs * [​`IntArray`](./IntArray): A memory-efficient, register-passable array of integers. * [​`IntTuple`](./IntTuple): A hierarchical, nested tuple of integers with efficient memory management. ## Functions * [​`abs`](./abs): Compute the absolute value of each element in an `IntTuple`. * [​`apply`](./apply): Apply a function to each integer value in an `IntTuple`. * [​`apply_predicate`](./apply_predicate): Apply a predicate function recursively to two `IntTuple`s. * [​`apply_zip`](./apply_zip): Apply a function to pairs of elements from two `IntTuple`s. * [​`compact_order`](./compact_order): Create a compact stride based on shape and order. * [​`compatible`](./compatible): Test if two shapes are compatible for tensor operations. * [​`congruent`](./congruent): Test if two `IntTuple`s have the same hierarchical structure. * [​`crd2idx`](./crd2idx): Map a logical coordinate to a linear index. * [​`depth`](./depth): Calculates the maximum nesting depth of an `IntTuple`. * [​`fill_like`](./fill_like): Creates an `IntTuple` with the same structure as the source but filled with a specified value. * [​`flatten`](./flatten): Flatten a nested `IntTuple` into a single-level `IntTuple`. * [​`idx2crd`](./idx2crd): Converts a linear index to a coordinate tuple within a given shape. * [​`idx2crd2`](./idx2crd2): Convert a linear index to coordinates. * [​`inner_product`](./inner_product): Compute the inner product of two `IntTuple`s. * [​`is_flat`](./is_flat): Check if an `IntTuple` is flat. 
* [​`is_int`](./is_int): Check if an `IntTuple` represents a single integer value. * [​`is_tuple`](./is_tuple): Check if an `IntTuple` represents a nested tuple. * [​`mul`](./mul): Multiply each element in an `IntTuple` by a scalar value. * [​`prefix_product`](./prefix_product): Compute the exclusive prefix product of an `IntTuple`. * [​`product`](./product): Calculate the product of all values in an `IntTuple`. * [​`product_each`](./product_each): Compute the product of elements in each sub-tuple of an `IntTuple`. * [​`propagate_unknown`](./propagate_unknown): Propagates unknown dimensions from the target `IntTuple` to the source `IntTuple`. * [​`reduce`](./reduce): Apply a reduction function to an `IntTuple` with an initial value. * [​`reverse`](./reverse): Reverses the order of elements in an `IntTuple`, recursively. * [​`shallow_apply`](./shallow_apply): Apply a function to each top-level element of an `IntTuple`. * [​`shape_div`](./shape_div): Performs division operation between shape tuples. * [​`signum`](./signum): Calculate the sign of an integer. * [​`size`](./size): Calculate the total size (product of all elements) of an `IntTuple`. * [​`sorted`](./sorted): Sort an IntTuple using the provided comparison function. * [​`sum`](./sum): Calculate the sum of all values in an `IntTuple`. * [​`to_nest`](./to_nest): Nests a flat `IntTuple` according to the structure of a nested `IntTuple`. * [​`to_unknown`](./to_unknown): Create an `IntTuple` with the same structure but filled with `UNKNOWN_VALUE`. * [​`tuple_max`](./tuple_max): Calculate the maximum value in an `IntTuple`. * [​`tuple_min`](./tuple_min): Compute the element-wise minimum of two `IntTuple`s. * [​`weakly_compatible`](./weakly_compatible): Test if shape A is weakly compatible with shape B. * [​`weakly_congruent`](./weakly_congruent): Test if two IntTuples have similar hierarchical structures. * [​`zip`](./zip): Create a zip iterator from an array of `IntTuple` pointers. --- ## inner_product `inner_product(a: IntTuple[origin], b: IntTuple[origin]) -> Int` Compute the inner product of two `IntTuple`s. For flat tuples, this is the sum of element-wise products. For nested tuples, the function recurses into corresponding nested elements. Note: If the input tuples have different lengths, `abort()` will be called. **Args:** * ​a (`IntTuple[origin]`): First `IntTuple`. * ​b (`IntTuple[origin]`): Second `IntTuple`. **Returns:** The inner product as an `Int`. --- ## is_flat `is_flat(t: IntTuple[origin]) -> Bool` Check if an `IntTuple` is flat. This function checks if the `IntTuple` is flat, meaning it has no nested elements. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to check. **Returns:** True if the `IntTuple` is flat, False otherwise. --- ## is_int `is_int(t: IntTuple[origin]) -> Bool` Check if an `IntTuple` represents a single integer value. This function determines whether the given `IntTuple` contains a single integer value rather than a nested tuple structure. Example: ```mojo from layout.int_tuple import is_int, IntTuple var single_value = IntTuple(5) var nested_tuple = IntTuple(1, 2, 3) var result1 = is_int(single_value) # Returns True var result2 = is_int(nested_tuple) # Returns False ``` . **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to check. **Returns:** True if the `IntTuple` contains a single integer value, False if it's a nested tuple. --- ## is_tuple `is_tuple(t: IntTuple[origin]) -> Bool` Check if an `IntTuple` represents a nested tuple. 
This function determines whether the given `IntTuple` contains nested elements rather than a single integer value. It is the complement of the `is_int` function. Example: ```mojo from layout.int_tuple import is_tuple, IntTuple var single_value = IntTuple(5) var nested_tuple = IntTuple(1, 2, 3) var result1 = is_tuple(single_value) # Returns False var result2 = is_tuple(nested_tuple) # Returns True ``` . **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to check. **Returns:** True if the `IntTuple` contains nested elements, False if it's a single integer value. --- ## mul `mul(lhs: IntTuple[origin], rhs: Int) -> IntTuple` Multiply each element in an `IntTuple` by a scalar value. This function creates a new `IntTuple` where each element (at any nesting level) is multiplied by the provided integer value. **Args:** * ​lhs (`IntTuple[origin]`): The `IntTuple` whose elements will be multiplied. * ​rhs (`Int`): The scalar integer to multiply each element by. **Returns:** A new `IntTuple` with the same structure as the input but with all elements multiplied by the scalar value. --- ## prefix_product `prefix_product(a: IntTuple[origin]) -> IntTuple` Compute the exclusive prefix product of an `IntTuple`. This is a convenience wrapper that initializes the prefix product with 1. **Args:** * ​a (`IntTuple[origin]`): The input `IntTuple` to compute the prefix product for. **Returns:** A new `IntTuple` containing the exclusive prefix product of the input. `prefix_product(a: IntTuple[origin], init: Int) -> IntTuple` Compute the exclusive prefix product of an `IntTuple` with an initial value. This function delegates to the implementation in prefix\_product2. **Args:** * ​a (`IntTuple[origin]`): The input `IntTuple` to compute the prefix product for. * ​init (`Int`): The initial value(s) for the prefix product, defaults to 1. **Returns:** A new `IntTuple` containing the exclusive prefix product of the input. --- ## product `product(t: IntTuple[origin]) -> Int` Calculate the product of all values in an `IntTuple`. This function recursively computes the product of all integer values in a potentially nested `IntTuple` structure. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to multiply. **Returns:** The product of all integer values, or `UNKNOWN_VALUE` if any value in the tuple is `UNKNOWN_VALUE`. --- ## product_each `product_each(t: IntTuple[origin]) -> IntTuple` Compute the product of elements in each sub-tuple of an `IntTuple`. For each immediate child of the input tuple, this function computes the product of all elements within that child. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` containing sub-tuples. **Returns:** A new `IntTuple` where each element is the product of the corresponding sub-tuple in the input. --- ## propagate_unknown `propagate_unknown(src: IntTuple[origin], target: IntTuple[origin]) -> IntTuple` Propagates unknown dimensions from the target `IntTuple` to the source `IntTuple`. This function creates a new `IntTuple` by combining the source and target `IntTuple`s, preserving unknown dimensions (UNKNOWN\_VALUE) from the target while using values from the source for known dimensions. **Args:** * ​src (`IntTuple[origin]`): The source `IntTuple` containing known dimension values. * ​target (`IntTuple[origin]`): The target `IntTuple` that may contain unknown dimensions (UNKNOWN\_VALUE). **Returns:** A new `IntTuple` with unknown dimensions from target and known dimensions from src. 
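A small sketch tying `prefix_product`, `product`, and `propagate_unknown` together, assuming `UNKNOWN_VALUE` prints as its underlying `-1`:

```mojo
from layout import IntTuple
from layout.int_tuple import UNKNOWN_VALUE, prefix_product, product, propagate_unknown

var shape = IntTuple(3, 4, 5)

print(prefix_product(shape))  # (1, 3, 12): exclusive running product
print(product(shape))         # 60

# Unknown dimensions in the target survive; known values come from the source.
var target = IntTuple(3, UNKNOWN_VALUE, 5)
print(propagate_unknown(shape, target))  # (3, -1, 5)
```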
--- ## reduce `reduce[: origin.set, //, reducer: fn[ImmutableOrigin](a: Int, b: IntTuple[$0]) capturing -> Int](t: IntTuple[origin], initializer: Int) -> Int` Apply a reduction function to an `IntTuple` with an initial value. This function iterates through each element of the `IntTuple` and applies the provided reduction function cumulatively, starting with the initializer. **Parameters:** * ​reducer (`fn[ImmutableOrigin](a: Int, b: IntTuple[$0]) capturing -> Int`): A function that combines the accumulated result with the next element. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to reduce. * ​initializer (`Int`): The initial value for the reduction operation. **Returns:** The final accumulated result after applying the reduction function to all elements in the `IntTuple`. --- ## reverse `reverse(src: IntTuple[origin]) -> IntTuple` Reverses the order of elements in an `IntTuple`, recursively. This function reverses the top-level elements of the `IntTuple` and recursively reverses any nested `IntTuple`s. Example: ```mojo from layout.int_tuple import IntTuple, reverse var t = IntTuple(1, 2, IntTuple(3, 4)) var reversed = reverse(t) # returns ((4, 3), 2, 1) ``` . **Args:** * ​src (`IntTuple[origin]`): The source `IntTuple` to reverse. **Returns:** A new `IntTuple` with elements in reversed order. --- ## shallow_apply `shallow_apply[func: fn[ImmutableOrigin](IntTuple[$0]) -> Int](t: IntTuple[origin]) -> IntTuple` Apply a function to each top-level element of an `IntTuple`. Unlike `apply()`, this function only operates on the immediate children of the input tuple without recursing into nested tuples. **Parameters:** * ​func (`fn[ImmutableOrigin](IntTuple[$0]) -> Int`): Function that takes an `IntTuple` and returns an `Int`. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` whose elements will be transformed. **Returns:** A new `IntTuple` with the function applied to each top-level element. --- ## shape_div `shape_div(a: IntTuple[origin], b: IntTuple[origin]) -> IntTuple` Performs division operation between shape tuples. Handles four cases: 1. tuple-tuple: Performs shape\_div element-wise when dimensions match 2. tuple-int: Folds the division of b across each element of a Example: `shape_div((4,5,6),40)` -> `shape_div((1,5,6),10)` -> `shape_div((1,1,6),2)` -> `(1,1,3)` 3. int-tuple: Returns `shape_div(a, product(b))` 4. int-int: Enforces the divisibility condition `a % b == 0 || b % a == 0` when possible. Returns `a / b` with rounding away from `0` (that is, `1` or `-1` when `a < b`). **Args:** * ​a (`IntTuple[origin]`): The dividend `IntTuple`. * ​b (`IntTuple[origin]`): The divisor `IntTuple`. **Returns:** A new `IntTuple` containing the result of the division operation. --- ## signum `signum(a: Int) -> Int` Calculate the sign of an integer. This function determines the sign of the input integer and returns a corresponding indicator value. Example: ```mojo from layout.int_tuple import signum var result1 = signum(5) # Returns 1 var result2 = signum(-10) # Returns -1 var result3 = signum(0) # Returns 0 ``` . **Args:** * ​a (`Int`): The integer value to determine the sign of. **Returns:** 1 if `a` > 0, -1 if `a` < 0, and 0 if `a` == 0. --- ## size `size(a: IntTuple[origin]) -> Int` Calculate the total size (product of all elements) of an `IntTuple`. This function computes the product of all integer values in the `IntTuple`, regardless of nesting level. **Args:** * ​a (`IntTuple[origin]`): The `IntTuple` whose elements will be multiplied together. **Returns:** The product of all elements in the `IntTuple`.
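The folding rule of `shape_div` and the `size` helper can be seen end to end in a sketch that mirrors the worked example above:

```mojo
from layout import IntTuple
from layout.int_tuple import shape_div, size

var shape = IntTuple(4, 5, 6)

print(size(shape))                     # 120: product of all elements
print(shape_div(shape, IntTuple(40)))  # (1, 1, 3): divisor folded left to right
```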
--- ## sorted `sorted[cmp: fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> Bool = __lt__](tuple: IntTuple[origin]) -> IntTuple` Sort an IntTuple using the provided comparison function. This function implements a merge sort algorithm to efficiently sort the elements of an IntTuple. The sorting is stable and has `O(n log n)` time complexity. **Parameters:** * ​cmp (`fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> Bool`): A comparison function that takes two `IntTuple` elements and returns True if the first should come before the second. Defaults to the `__lt__` function, which performs lexicographical ordering. **Args:** * ​tuple (`IntTuple[origin]`): The `IntTuple` to be sorted. **Returns:** A new `IntTuple` containing the same elements as the input but sorted according to the comparison function. --- ## sum `sum(t: IntTuple[origin]) -> Int` Calculate the sum of all values in an `IntTuple`. This function recursively computes the sum of all integer values in a potentially nested `IntTuple` structure. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to sum. **Returns:** The sum of all integer values, or `UNKNOWN_VALUE` if any value in the tuple is `UNKNOWN_VALUE`. --- ## to_nest `to_nest(nested: IntTuple[origin], flat: IntTuple[origin]) -> IntTuple` Nests a flat `IntTuple` according to the structure of a nested `IntTuple`. This function reshapes a flat sequence of values into a hierarchical structure that matches the pattern of a template nested `IntTuple`. Example: ```mojo from layout import IntTuple from layout.int_tuple import to_nest var result = to_nest(IntTuple(2, IntTuple(3, 4), 5), IntTuple(1, 2, 3, 4)) # returns IntTuple(1, (2, 3), 4) ``` . **Args:** * ​nested (`IntTuple[origin]`): The template `IntTuple` defining the desired structure. * ​flat (`IntTuple[origin]`): The flat `IntTuple` containing the values to be nested. **Returns:** A new `IntTuple` with the values from flat arranged in the structure of nested. --- ## to_unknown `to_unknown(t: IntTuple[origin]) -> IntTuple` Create an `IntTuple` with the same structure but filled with `UNKNOWN_VALUE`. This function preserves the hierarchical structure of the input `IntTuple` but replaces all integer values with `UNKNOWN_VALUE`. **Args:** * ​t (`IntTuple[origin]`): The template `IntTuple` defining the structure. **Returns:** A new `IntTuple` with the same structure as t but with all values replaced by `UNKNOWN_VALUE`. --- ## tuple_max `tuple_max(t: IntTuple[origin]) -> Int` Calculate the maximum value in an `IntTuple`. This function recursively finds the maximum integer value in a potentially nested `IntTuple` structure. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to search. **Returns:** The maximum integer value found in the tuple. --- ## tuple_min `tuple_min(a: IntTuple[origin], b: IntTuple[origin]) -> IntTuple` Compute the element-wise minimum of two `IntTuple`s. This function compares corresponding elements of two `IntTuple`s and returns a new `IntTuple` containing the minimum value at each position. Aborts: If the input tuples have different lengths. Note: If either input contains `UNKNOWN_VALUE`, the result will be `UNKNOWN_VALUE`. **Args:** * ​a (`IntTuple[origin]`): First `IntTuple`. * ​b (`IntTuple[origin]`): Second `IntTuple`. **Returns:** A new `IntTuple` with each element being the minimum of the corresponding elements in a and b.
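A minimal sketch of the two reduction helpers above on flat tuples:

```mojo
from layout import IntTuple
from layout.int_tuple import tuple_max, tuple_min

var a = IntTuple(2, 7, 1)
var b = IntTuple(5, 3, 4)

print(tuple_max(a))     # 7: largest leaf value
print(tuple_min(a, b))  # (2, 3, 1): element-wise minimum
```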
--- ## weakly_compatible `weakly_compatible(a: IntTuple[origin], b: IntTuple[origin]) -> Bool` Test if shape A is weakly compatible with shape B. A shape A is weakly compatible with shape B if there exists a shape C congruent to A such that compatible(elem\_scale(A,C), B). This establishes a partial order relation between shapes where A <= B. **Args:** * ​a (`IntTuple[origin]`): The first `IntTuple` to compare. * ​b (`IntTuple[origin]`): The second `IntTuple` to compare. **Returns:** True if shape A is weakly compatible with shape B, False otherwise. --- ## weakly_congruent `weakly_congruent(a: IntTuple[origin], b: IntTuple[origin]) -> Bool` Test if two IntTuples have similar hierarchical structures. This function establishes a partial order relation between IntTuples based on their hierarchical structure. It's less strict than congruent. **Args:** * ​a (`IntTuple[origin]`): First IntTuple to compare. * ​b (`IntTuple[origin]`): Second IntTuple to compare. **Returns:** True if a's structure is compatible with b's structure, False otherwise. --- ## zip `zip[origin: ImmutableOrigin, n: Int](ts: InlineArray[Pointer[IntTuple, origin], n]) -> _zip[origin, n]` Create a zip iterator from an array of `IntTuple` pointers. This function creates a zip iterator that allows simultaneous traversal of multiple `IntTuple` collections. **Parameters:** * ​origin (`ImmutableOrigin`): The origin tracking parameter for memory safety. * ​n (`Int`): The number of `IntTuple` collections being zipped together. **Args:** * ​ts (`InlineArray[Pointer[IntTuple, origin], n]`): Array of pointers to the `IntTuple` collections to zip. **Returns:** A `_zip` object that can be iterated over. `zip(a: IntTuple[origin], b: IntTuple[origin], out result: _zip[{a, b}, 2])` Create a zip iterator for two `IntTuple`s. This function creates a zip iterator that allows simultaneous traversal of two `IntTuple`s, yielding pairs of corresponding elements. **Args:** * ​a (`IntTuple[origin]`): First `IntTuple` to zip. * ​b (`IntTuple[origin]`): Second `IntTuple` to zip. **Returns:** The resulting zip iterator for the input `IntTuple`s. `zip(a: IntTuple[origin], b: IntTuple[origin], c: IntTuple[origin], out result: _zip[{a, b, c}, 3])` Create a zip iterator for three `IntTuple`s. This function creates a zip iterator that allows simultaneous traversal of three `IntTuple`s, yielding triplets of corresponding elements. **Args:** * ​a (`IntTuple[origin]`): First `IntTuple` to zip. * ​b (`IntTuple[origin]`): Second `IntTuple` to zip. * ​c (`IntTuple[origin]`): Third `IntTuple` to zip. **Returns:** The resulting zip iterator for the input `IntTuple`s. --- ## Layout `struct Layout` Represents a memory layout for multi-dimensional data. The Layout struct is the primary implementation of the LayoutTrait, providing a concrete representation of memory layouts using shape and stride information. It maps between logical coordinates and linear memory indices, enabling efficient access to multi-dimensional data. A Layout consists of: * shape: Defines the dimensions of the logical coordinate space * stride: Defines the step sizes in memory for each dimension The Layout struct supports various operations including: * Creation of row-major and column-major layouts * Conversion between coordinates and indices * Composition with other layouts * Iteration over sub-layouts Layouts can be hierarchical, with nested shapes and strides, allowing for complex memory access patterns like blocked or tiled layouts. ## Fields * ​shape (`IntTuple`): The dimensions of the layout.
This field defines the size of each dimension in the logical coordinate space. For example, a shape of (3, 4) represents a 3×4 grid of elements. * ​stride (`IntTuple`): The memory step sizes for each dimension. This field defines how many elements to skip in memory when moving one unit in each dimension. For example, in a row-major 3×4 layout, the strides might be (4, 1), meaning moving one unit in the first dimension requires skipping 4 elements in memory, while moving one unit in the second dimension requires skipping 1 element. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `EqualityComparable`, `LayoutTrait`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `has_shape` `alias has_shape = True` Indicates whether the layout has a valid shape. ## Methods ### `__init__` `__init__(out self)` Initializes an empty layout with no dimensions. Creates a layout with empty shape and stride tuples, which can be populated later using append operations. `@implicit` `__init__(out self, shape: IntTuple[origin])` Initializes a layout with the given shape and column-major strides. Creates a layout with the specified shape and automatically calculates column-major strides (where the first dimension varies fastest in memory). **Args:** * ​shape (`IntTuple[origin]`): The dimensions of the layout. `__init__(out self, shape: IntTuple[origin], stride: IntTuple[origin])` Initializes a layout with the given shape and stride. Creates a layout with explicitly specified shape and stride values. If an empty stride is provided, column-major strides are calculated. **Args:** * ​shape (`IntTuple[origin]`): The dimensions of the layout. * ​stride (`IntTuple[origin]`): The memory step size for each dimension, or empty for column-major. `__init__(out self, *, other: Self)` Explicitly constructs a deep copy of the provided layout. **Args:** * ​other (`Self`): The layout to copy. ### `__getitem__` `__getitem__(self, index: Int) -> Self` Returns a sub-layout for the specified dimension. **Args:** * ​index (`Int`): The dimension index to extract. **Returns:** A Layout containing the shape and stride for the specified dimension. ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if this layout is equal to another layout. Two layouts are considered equal if they have identical shape and stride tuples. **Args:** * ​other (`Self`): The layout to compare with. **Returns:** True if the layouts are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if this layout is not equal to another layout. **Args:** * ​other (`Self`): The layout to compare with. **Returns:** True if the layouts are not equal, False otherwise. ### `idx2crd` `idx2crd(self, idx: IntTuple[origin]) -> IntTuple` Converts a linear index to logical coordinates. This is the inverse operation of the `__call__` method, mapping from a memory index back to the corresponding logical coordinates. **Args:** * ​idx (`IntTuple[origin]`): The linear index to convert. **Returns:** The logical coordinates corresponding to the given index. ### `col_major` `static col_major(*dims: Int) -> Self` Creates a column-major layout with the specified dimensions. In a column-major layout, the first dimension varies fastest in memory, which is the default layout in languages like Fortran and MATLAB. Example: ```mojo from layout import Layout # Create a 3x4 column-major layout var layout = Layout.col_major(3, 4) # Result: Layout with shape (3,4) and stride (1,3) ``` .
**Args:** * ​\*dims (`Int`): Variable number of dimension sizes. **Returns:** A column-major Layout with the specified dimensions. `static col_major(shape: IntTuple[origin]) -> Self` Creates a column-major layout with the specified shape. In a column-major layout, the first dimension varies fastest in memory, which is the default layout in languages like Fortran and MATLAB. Example: ```mojo from layout import Layout from layout.int_tuple import IntTuple # Create a 3x4 column-major layout var layout = Layout.col_major(IntTuple(3, 4)) # Result: Layout with shape (3,4) and stride (1,3) ``` . **Args:** * ​shape (`IntTuple[origin]`): An IntTuple specifying the dimensions. **Returns:** A column-major Layout with the specified shape. ### `row_major` `static row_major(*dims: Int) -> Self` Creates a row-major layout with the specified dimensions. In a row-major layout, the last dimension varies fastest in memory, which is the default layout in languages like C, C++, and Python. Example: ```mojo from layout import Layout # Create a 3x4 row-major layout var layout = Layout.row_major(3, 4) # Result: Layout with shape (3,4) and stride (4,1) ``` . **Args:** * ​\*dims (`Int`): Variable number of dimension sizes. **Returns:** A row-major Layout with the specified dimensions. `static row_major[rank: Int](dims: DimList) -> Self` Creates a row-major layout from a DimList with compile-time rank. This method creates a row-major layout where the last dimension varies fastest in memory. It handles both known and unknown dimensions at compile time, properly calculating strides for each dimension. If any dimension is unknown, subsequent strides will also be marked as unknown. Example: ```mojo from layout import Layout from layout.layout import DimList # Create a row-major layout with compile-time rank var dims = DimList(3, 4) var layout = Layout.row_major[2](dims) # Result: Layout with shape (3,4) and stride (4,1) ``` . **Parameters:** * ​rank (`Int`): The compile-time rank (number of dimensions) of the layout. **Args:** * ​dims (`DimList`): A DimList containing the dimensions of the layout. **Returns:** A row-major Layout with the specified dimensions and computed strides. `static row_major[rank: Int](tuple: IndexList[rank]) -> Self` Creates a row-major layout from an IndexList with compile-time rank. This method creates a row-major layout where the last dimension varies fastest in memory, computing the stride for each dimension from the provided index list. Example: ```mojo from layout import Layout from utils import IndexList # Create a row-major layout from an IndexList var dims = IndexList[2](3, 4) var layout = Layout.row_major[2](dims) # Result: Layout with shape (3,4) and stride (4,1) ``` . **Parameters:** * ​rank (`Int`): The compile-time rank (number of dimensions) of the layout. **Args:** * ​tuple (`IndexList[rank]`): An IndexList containing the dimensions of the layout. **Returns:** A row-major Layout with the specified dimensions and computed strides. `static row_major[rank: Int]() -> Self` Creates a row-major layout with unknown values for each axis from a compile-time rank. Example: ```mojo from layout import Layout var layout = Layout.row_major[2]() # Result: Layout with shape (UNKNOWN_VALUE, UNKNOWN_VALUE) ``` **Parameters:** * ​rank (`Int`): The compile-time rank (number of dimensions) of the layout. **Returns:** A row-major Layout with the given rank.
`static row_major(shape: IntTuple[origin]) -> Self` Creates a row-major layout from an IntTuple of dimensions. In a row-major layout, the last dimension varies fastest in memory. This method computes the appropriate strides for a row-major layout given the input shape. Example: ```mojo from layout import Layout from layout.int_tuple import IntTuple # Create a row-major layout from a shape tuple var shape = IntTuple(3, 4) var layout = Layout.row_major(shape) # Result: Layout with shape (3,4) and stride (4,1) ``` . **Args:** * ​shape (`IntTuple[origin]`): An IntTuple containing the dimensions of the layout. **Returns:** A row-major Layout with the specified shape and computed strides. ### `make_shape_unknown` `make_shape_unknown[axis: Int = -1](self) -> Self` Creates a new Layout with unknown shape dimensions. This method creates a copy of the current Layout but marks either all dimensions or a specific dimension as unknown, while preserving the original strides. This is useful for tiling tensors with runtime sizes where the tile's shape is unknown but the memory layout (strides) remains constant. Example: ```mojo from layout import Layout from layout.int_tuple import IntTuple # Mark all dimensions as unknown var layout = Layout(IntTuple(2, 3)) var unknown = layout.make_shape_unknown() # Result: Layout with shape (?, ?) and original strides # Mark only first dimension as unknown var partial = layout.make_shape_unknown[0]() # Result: Layout with shape (?, 3) and original strides ``` . **Parameters:** * ​axis (`Int`): The dimension to mark as unknown. If UNKNOWN\_VALUE (default), all dimensions are marked as unknown. **Returns:** A new Layout with the specified dimension(s) marked as unknown and original strides preserved. ### `copy` `copy(self) -> Self` Explicitly constructs a copy of this layout. Creates a deep copy of the layout, including its shape and stride tuples. **Returns:** A new Layout instance with identical shape and stride values. ### `__str__` `__str__(self) -> String` Converts the layout to a string representation. **Returns:** A string representation of the layout in the format "(shape:stride)". ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the layout to the specified writer. Formats the layout as "(shape:stride)" and writes it to the provided writer. **Parameters:** * ​W (`Writer`): Type parameter representing a Writer implementation. **Args:** * ​writer (`W`): The writer to output the layout representation to. ### `__len__` `__len__(self) -> Int` Returns the number of dimensions in the layout. **Returns:** The number of elements in the shape tuple. ### `__iter__` `__iter__(self) -> _LayoutIter[self]` Returns an iterator over the layout's dimensions. Each iteration yields a Layout containing the shape and stride for one dimension. **Returns:** An iterator over the layout's dimensions. ### `size` `size(self) -> Int` Returns the total number of elements in the layout's domain. Calculates the product of all dimensions in the shape. **Returns:** The total number of elements in the layout. ### `cosize` `cosize(self) -> Int` Returns the size of the memory region spanned by the layout. Calculates the maximum memory index plus one, representing the total memory footprint required by the layout. **Returns:** The size of the memory region required by the layout. ### `rank` `rank(self) -> Int` Returns the number of dimensions in the layout. This is equivalent to `__len__` and returns the number of elements in the shape tuple.
**Returns:** The number of dimensions in the layout. ### `__call__` `__call__(self, idx: IntTuple[origin]) -> Int` Maps logical coordinates to a linear memory index. This is the core functionality of a layout, converting multi-dimensional coordinates to a linear memory location. **Args:** * ​idx (`IntTuple[origin]`): The logical coordinates to map. **Returns:** The linear memory index corresponding to the given coordinates. ### `append` `append(mut self, item: Self)` Appends another layout to this layout. This method adds the shape and stride from the provided layout to this layout, effectively increasing its dimensionality. **Args:** * ​item (`Self`): The layout to append to this layout. ### `all_dims_known` `all_dims_known(self) -> Bool` Checks if all dimensions in the layout have known values. A dimension is considered unknown if its shape or stride is set to the special `UNKNOWN_VALUE` constant. **Returns:** True if all dimensions have known shape and stride values, False otherwise. ### `known_shape` `known_shape(self) -> Bool` Checks if all shape dimensions in the layout have known values. A dimension is considered unknown if its shape is set to the special `UNKNOWN_VALUE` constant. This method only checks shapes, not strides. **Returns:** True if all shape dimensions have known values, False otherwise. --- ## LayoutTrait Defines the interface for mapping between logical coordinates and memory indices. The `LayoutTrait` provides a common interface for all layout types, including basic layouts, swizzles, and composed layouts. It enables mapping from multi-dimensional logical coordinates to linear memory indices, which is essential for tensor operations. Implementations of this trait must provide methods for: 1. Mapping coordinates to indices via the `__call__` method 2. Calculating the total size of the layout's domain 3. Calculating the size of the layout's codomain (memory footprint) 4. Indicating whether the layout has a valid shape This trait serves as the foundation for the layout system, allowing different layout implementations to be used interchangeably in algorithms. ## Implemented traits `AnyType`, `Copyable`, `UnknownDestructibility` ## Aliases ### `has_shape` `alias has_shape` Indicates whether the layout has a valid shape. Layouts and ComposedLayouts with at least one Layout have valid shapes and can be used in layout algebra. Swizzles don't have shapes and should be excluded from layout algebra. ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__call__` `__call__(self: _Self, index: IntTuple[origin]) -> Int` Maps a logical coordinate to a linear memory index. **Args:** * ​index (`IntTuple[origin]`): An IntTuple representing the logical coordinates to map. **Returns:** The linear memory index corresponding to the given coordinates. ### `size` `size(self: _Self) -> Int` Returns the total number of elements in the layout's domain. For a layout with shape (m, n), this returns m \* n, representing the total number of valid coordinates in the layout. **Returns:** The total number of elements in the layout. ### `cosize` `cosize(self: _Self) -> Int` Returns the size of the memory region spanned by the layout. For a layout with shape `(m, n)` and stride `(r, s)`, this returns `(m-1)*r + (n-1)*s + 1`, representing the memory footprint. **Returns:** The size of the memory region required by the layout. 
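To ground the `LayoutTrait` contract, here is a short sketch using the concrete `Layout` from the previous section; the numbers follow the `size` and `cosize` formulas given above.

```mojo
from layout import Layout, IntTuple

var layout = Layout.row_major(3, 4)

print(layout)           # ((3, 4):(4, 1)): "(shape:stride)" formatting
print(layout.size())    # 12: coordinates in the domain
print(layout.cosize())  # 12: (3-1)*4 + (4-1)*1 + 1

# Map logical coordinates to a linear index, then invert the mapping.
print(layout(IntTuple(1, 2)))       # 6 == 1*4 + 2*1
print(layout.idx2crd(IntTuple(6)))  # (1, 2)
```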
--- ## MakeLayoutList `MakeLayoutList(v0: Layout, v1: Layout) -> List[Layout]` Creates a list containing two layouts. This is a convenience function for creating a LayoutList with two elements. **Args:** * ​v0 (`Layout`): The first layout to include in the list. * ​v1 (`Layout`): The second layout to include in the list. **Returns:** A LayoutList containing the two provided layouts. --- ## MakeTileLayoutList `MakeTileLayoutList[*tile_sizes: Int]() -> List[Layout]` Creates a list of layouts for tiling operations. This function creates a list of simple layouts, each with a shape from the provided tile\_sizes and a stride of 1. These layouts can be used for tiling operations. **Parameters:** * ​\*tile\_sizes (`Int`): Variable number of integer tile dimensions. **Returns:** A LayoutList containing layouts for each tile size. --- ## apply_tiler `apply_tiler[func: fn(Layout, Layout) -> Layout](layout_a: Layout, tiler: List[Layout]) -> Layout` Applies a layout transformation function to each element of a layout with a tiler. This utility function applies the specified transformation function to each corresponding pair of elements from the layout and tiler list. It's a generic mechanism for implementing various tiling operations. Example: ```mojo from layout import Layout, LayoutList, IntTuple from layout.layout import apply_tiler, logical_divide # Apply logical_divide to each element of a layout with a tiler var base = Layout.row_major(6, 8) var tilers = LayoutList() tilers.append(Layout(IntTuple(2, 2), IntTuple(1, 2))) var result = apply_tiler[logical_divide](base, tilers) ``` . **Parameters:** * ​func (`fn(Layout, Layout) -> Layout`): A function that takes two layouts and returns a transformed layout. **Args:** * ​layout\_a (`Layout`): The base layout to transform. * ​tiler (`List[Layout]`): A list of layouts to use in the transformation. **Returns:** A new layout resulting from applying the transformation function to each pair. --- ## blocked_product `blocked_product(layout_a: Layout, layout_b: Layout) -> Layout` Creates a blocked layout by combining two layouts. This function creates a hierarchical blocked layout by combining a block layout with a base layout. The result is a layout where each element of the base layout is replaced by a block defined by the first argument, as the example below shows. This is particularly useful for creating tiled layouts for efficient cache utilization in tensor operations like matrix multiplication. Example: ```mojo from layout import Layout from layout.layout import blocked_product # Create a 2x3 matrix layout var matrix = Layout.row_major(2, 3) # Define 2x2 blocks var block = Layout.row_major(2, 2) # Create a blocked layout with 2x2 blocks var blocked = blocked_product(block, matrix) ``` Output: ```plaintext (((2, 2), (2, 3)):((2, 12), (1, 4))) 0 1 2 3 4 5 +----+----+----+----+----+----+ 0 | 0 | 1 | 4 | 5 | 8 | 9 | +----+----+----+----+----+----+ 1 | 2 | 3 | 6 | 7 | 10 | 11 | +----+----+----+----+----+----+ 2 | 12 | 13 | 16 | 17 | 20 | 21 | +----+----+----+----+----+----+ 3 | 14 | 15 | 18 | 19 | 22 | 23 | +----+----+----+----+----+----+ ``` . **Args:** * ​layout\_a (`Layout`): The block layout that defines the structure within each block. * ​layout\_b (`Layout`): The base layout to be blocked, which defines how the blocks are arranged. **Returns:** A new layout representing the blocked structure. --- ## coalesce `coalesce(layout: Layout, keep_rank: Bool = False) -> Layout` Simplifies a layout by combining dimensions with contiguous strides.
This function reduces the rank of a layout by merging dimensions that have contiguous memory layouts, resulting in a simpler but equivalent layout. Example: ```mojo from layout import Layout, IntTuple from layout.layout import coalesce # A layout with shape (2, (1, 4)) and stride (1, (4, 2)) can be coalesced var layout = Layout(IntTuple(2, IntTuple(1, 4)), IntTuple(1, IntTuple(4, 2))) var coalesced = coalesce(layout) # Result: Layout with shape (8) and stride (1) ``` . **Args:** * ​layout (`Layout`): The layout to coalesce. * ​keep\_rank (`Bool`): If True, maintains the original rank of the layout. Default is False. **Returns:** A simplified layout with reduced rank where possible. --- ## complement `complement(layout: Layout, size: Int = 1) -> Layout` Computes the complement layout for a given layout. This function creates a layout that represents the "gaps" or complementary structure of the input layout. It's useful for creating hierarchical layouts where you need to fill in the spaces between existing layout elements. Example: ```mojo from layout import Layout, IntTuple from layout.layout import complement # Compute the complement of a layout var base = Layout(IntTuple(2, 3), IntTuple(3, 1)) var comp = complement(base, 10) # Result: A layout that fills the gaps in the original layout ``` . **Args:** * ​layout (`Layout`): The input layout to compute the complement for. * ​size (`Int`): The total size of the memory region to consider. Defaults to 1. **Returns:** A new layout representing the complement of the input layout. --- ## composition `composition(layout_a: Layout, layout_b: Layout) -> Layout` Composes two layouts to create a new layout. This function creates a new layout by composing two layouts, where the first layout defines the outer structure and the second layout defines the inner structure. The new layout is compatible with `layout_b` (that is, it has the same `size` and every set of coordinates in `layout_b` has an equivalent in the new layout). You can think of `layout_b` as selecting a subset of elements from `layout_a`. Example: ```mojo from layout.layout import Layout, IntTuple from layout.layout import composition # Compose a row-major layout with a tiling layout var base = Layout.row_major(6, 8) var tiling = Layout(IntTuple(3, 2), IntTuple(1, 3)) var composed = composition(base, tiling) # Result: A layout that represents a 3x2 tile from # layout_a ``` . **Args:** * ​layout\_a (`Layout`): The outer layout. * ​layout\_b (`Layout`): The inner layout. **Returns:** A new layout representing the composition of the two layouts. `composition(layout_a: Layout, tiler: List[Layout]) -> Layout` Composes a layout with a list of layouts to create a hierarchical layout. This function creates a new layout by composing each element of the first layout with the corresponding element in the tiler list. If the tiler list is shorter than the layout, the remaining elements from the layout are appended unchanged. Example: ```mojo from layout import Layout, LayoutList, IntTuple from layout.layout import composition # Compose a layout with a list of tiling layouts var base = Layout.row_major(6, 8) var tilers = LayoutList() tilers.append(Layout(IntTuple(2, 2), IntTuple(1, 2))) tilers.append(Layout(IntTuple(3, 3), IntTuple(1, 3))) var composed = composition(base, tilers) # Result: A layout with hierarchical tiling based on the tiler list ``` . **Args:** * ​layout\_a (`Layout`): The base layout to compose with the tiler. 
* ​tiler (`List[Layout]`): A list of layouts to compose with the base layout. **Returns:** A new layout representing the composition of the base layout with the tiler. --- ## cosize `cosize(l: Layout) -> Int` Returns the size of the memory region spanned by the layout. This is a standalone function equivalent to the Layout.cosize() method. **Args:** * ​l (`Layout`): The layout to calculate the cosize for. **Returns:** The size of the memory region required by the layout. --- ## downcast `downcast(layout: Layout, factor: Int) -> Layout` Splits elements in a layout to create a finer layout without changing the total number of elements so that the alignment is preserved. This function is useful for converting between different data type granularities, such as from uint128 to bf16. **Args:** * ​layout (`Layout`): The layout to downcast. * ​factor (`Int`): The number of elements to split into. **Returns:** A new layout with adjusted shape and stride for the finer granularity. --- ## expand_modes_alike `expand_modes_alike(shape_a: IntTuple[origin], stride_a: IntTuple[origin], shape_b: IntTuple[origin], stride_b: IntTuple[origin]) -> InlineArray[IntTuple, 3]` Aligns two shape-stride pairs to have the same hierarchical structure. This function is used to make two layouts compatible for operations by ensuring they have the same hierarchical structure, expanding scalar values into tuples as needed. **Args:** * ​shape\_a (`IntTuple[origin]`): The first shape tuple. * ​stride\_a (`IntTuple[origin]`): The first stride tuple. * ​shape\_b (`IntTuple[origin]`): The second shape tuple. * ​stride\_b (`IntTuple[origin]`): The second stride tuple. **Returns:** An array containing three tuples: the common shape, the expanded stride\_a, and the expanded stride\_b. `expand_modes_alike(layout_a: Layout, layout_b: Layout) -> InlineArray[Layout, 2]` Aligns two layouts to have the same hierarchical structure. This function tiles both layouts so they mirror each other's structure, making them compatible for operations that require matching hierarchies. Example: Given layouts with different structures: * layout\_0: (((3, (5, 2)), 4):((1, (24, 12)), 3)) * layout\_1: ((30, (2, 2)):(2, (60, 1))) The result would be two layouts with matching structures: * (((3, (5, 2)), (2, 2)):((1, (24, 12)), (3, 6))) * (((3, (5, 2)), (2, 2)):((2, (6, 30)), (60, 1))) ```mojo from layout import Layout, IntTuple from layout.layout import expand_modes_alike alias layout_0 = Layout( IntTuple(IntTuple(3, IntTuple(5, 2)), 4), IntTuple(IntTuple(1, IntTuple(24, 12)), 3), ) alias layout_1 = Layout( IntTuple(30, IntTuple(2, 2)), IntTuple(2, IntTuple(60, 1)) ) alias uc = expand_modes_alike(layout_0, layout_1) print(uc[0]) # (((3, (5, 2)), (2, 2)):((1, (24, 12)), (3, 6))) print(uc[1]) # (((3, (5, 2)), (2, 2)):((2, (6, 30)), (60, 1))) ``` . **Args:** * ​layout\_a (`Layout`): The first layout to align. * ​layout\_b (`Layout`): The second layout to align. **Returns:** An array containing two layouts with matching hierarchical structures. --- ## expand_strides `expand_strides(shape: IntTuple[origin], stride: Int) -> IntTuple` Expands a scalar stride into a stride tuple matching a shape tuple. This function creates a stride tuple that matches the structure of a shape tuple, with each stride value calculated based on the cumulative product of shape dimensions. **Args:** * ​shape (`IntTuple[origin]`): The shape tuple to match. * ​stride (`Int`): The base stride value to expand. **Returns:** A stride tuple matching the structure of the shape tuple. 
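To make the cumulative-product rule concrete, here is a small sketch; the printed result reflects our reading of the rule described above rather than verified output:

```mojo
from layout.int_tuple import IntTuple
from layout.layout import expand_strides

# Expand a scalar base stride of 2 across shape (2, 3). The first mode
# keeps the base stride, and each following mode advances by the base
# stride times the product of the preceding dimension sizes: (2, 2 * 2).
var strides = expand_strides(IntTuple(2, 3), 2)
print(strides)  # expected: (2, 4)
```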
--- ## format_layout `format_layout[W: Writer](layout: Layout, mut writer: W)` Formats a 2D layout as a table and writes it to the specified writer. This function creates a visual representation of a 2D layout as a table showing the memory indices for each logical coordinate. **Parameters:** * ​W (`Writer`): Type parameter representing a Writer implementation. **Args:** * ​layout (`Layout`): The 2D layout to format. * ​writer (`W`): The writer to output the formatted layout to. --- ## hierarchical_unzip `hierarchical_unzip(layout_a: Layout, tiler: List[Layout]) -> Layout` Hierarchically unzips a layout according to a list of layouts. This function creates a hierarchical layout by unzipping the first layout according to the layouts in the tiler list. It's useful for decomposing a layout into hierarchical components for more efficient memory access patterns or to enable specialized tensor operations. Example: ```mojo from layout import Layout, LayoutList, IntTuple from layout.layout import hierarchical_unzip # Create a layout to unzip var base = Layout.row_major(6, 8) var tilers = LayoutList() tilers.append(Layout(IntTuple(2, 2))) var result = hierarchical_unzip(base, tilers) ``` . **Args:** * ​layout\_a (`Layout`): The layout to be unzipped. * ​tiler (`List[Layout]`): A list of layouts defining the unzipping patterns. **Returns:** A new layout representing the hierarchical unzipping with components from both the original layout and the tiler layouts. `hierarchical_unzip(layout_a: Layout, layout_b: Layout) -> Layout` Hierarchically unzips a layout according to another layout. This function creates a hierarchical layout by unzipping the first layout according to the second layout. It's a fundamental operation for decomposing a layout into hierarchical components, which enables more efficient memory access patterns for various tensor operations. Example: ```mojo from layout import Layout, IntTuple from layout.layout import hierarchical_unzip # Create layouts var base = Layout.row_major(6, 8) var pattern = Layout(IntTuple(2, 2)) var result = hierarchical_unzip(base, pattern) ``` . **Args:** * ​layout\_a (`Layout`): The layout to be unzipped. * ​layout\_b (`Layout`): The layout defining the unzipping pattern. **Returns:** A new layout representing the hierarchical unzipping of layout\_a according to the pattern defined by layout\_b. --- ## layout Provides a high-performance tensor layout system for memory mapping and indexing. The layout module implements a comprehensive system for describing memory layouts of multi-dimensional tensors, enabling efficient mapping between logical tensor coordinates and physical memory locations. This is a critical component for high-performance tensor operations in machine learning and scientific computing. These low-level primitives require careful use to avoid errors. Understanding the relationship between tensor shapes, strides, and memory layout is essential for effective use. 
Key components: * `LayoutTrait`: Core trait defining the interface for all layout types * `Layout`: Primary struct implementing memory layout with shape and stride information * Layout algebra: Functions for composing, dividing, and transforming layouts * Tiling operations: Functions for hierarchical decomposition of layouts Performance features: * Zero-cost abstractions for mapping between logical and physical indices * Support for both compile-time and runtime-determined shapes * Efficient memory access patterns through layout transformations * Hierarchical tiling for cache-friendly memory access Common use cases: * Defining memory layouts for tensors with different storage formats (row-major, column-major) * Implementing efficient tensor operations with optimal memory access patterns * Supporting hardware-specific memory layouts for accelerators * Enabling zero-copy tensor views and reshaping operations Example: ```mojo from layout import Layout, IntTuple from layout.layout import blocked_product # Create a 3x4 row-major layout var layout = Layout.row_major(3, 4) # Access the memory location for logical coordinates (1, 2) var memory_idx = layout(IntTuple(1, 2)) # Create a tiled layout for blocked matrix multiplication var tiled = blocked_product(Layout(IntTuple(2, 2)), layout) ``` ## Aliases ### `LayoutList` `alias LayoutList = List[Layout]` ## Structs * [​`Layout`](./Layout): Represents a memory layout for multi-dimensional data. ## Traits * [​`LayoutTrait`](./LayoutTrait): Defines the interface for mapping between logical coordinates and memory indices. ## Functions * [​`apply_tiler`](./apply_tiler): Applies a layout transformation function to each element of a layout with a tiler. * [​`blocked_product`](./blocked_product): Creates a blocked layout by combining two layouts. * [​`coalesce`](./coalesce): Simplifies a layout by combining dimensions with contiguous strides. * [​`complement`](./complement): Computes the complement layout for a given layout. * [​`composition`](./composition): Composes two layouts to create a new layout. * [​`cosize`](./cosize): Returns the size of the memory region spanned by the layout. * [​`downcast`](./downcast): Splits elements in a layout to create a finer layout without changing the total number of elements so that the alignment is preserved. * [​`expand_modes_alike`](./expand_modes_alike): Aligns two shape-stride pairs to have the same hierarchical structure. * [​`expand_strides`](./expand_strides): Expands a scalar stride into a stride tuple matching a shape tuple. * [​`format_layout`](./format_layout): Formats a 2D layout as a table and writes it to the specified writer. * [​`hierarchical_unzip`](./hierarchical_unzip): Hierarchically unzips a layout according to a list of layouts. * [​`is_contiguous_dim`](./is_contiguous_dim): Checks if a flat layout is contiguous in a specific dimension. * [​`is_row_major`](./is_row_major): Checks if a layout has row-major ordering for the specified rank. * [​`logical_divide`](./logical_divide): Divides a layout into blocks according to another layout. * [​`logical_product`](./logical_product): Creates a product of two layouts. * [​`make_layout`](./make_layout): Creates a composite layout by concatenating multiple layouts. * [​`make_ordered_layout`](./make_ordered_layout): Creates a layout with strides ordered according to a specified traversal order. * [​`MakeLayoutList`](./MakeLayoutList): Creates a list containing two layouts. * [​`MakeTileLayoutList`](./MakeTileLayoutList): Creates a list of layouts for tiling operations.
* [​`print_layout`](./print_layout): Prints a 2D layout to the standard output. * [​`right_inverse`](./right_inverse): Creates a right inverse of a layout. * [​`size`](./size): Returns the total number of elements in the layout's domain. * [​`sublayout`](./sublayout): Creates a sublayout by selecting specific dimensions from a layout. * [​`tile_to_shape`](./tile_to_shape): Creates a layout by tiling a base layout to match a target shape. * [​`upcast`](./upcast): Fuses consecutive elements in a layout to create a coarser layout. * [​`zip_modes`](./zip_modes): Combines corresponding modes from two layouts. * [​`zipped_divide`](./zipped_divide): Divides a layout into blocks according to another layout. --- ## is_contiguous_dim `is_contiguous_dim(layout: Layout, dim: Int) -> Bool` Checks if a flat layout is contiguous in a specific dimension. This function checks if a flat layout is contiguous in a specified dimension, considering both positive strides and zero strides with a single element. The latter case is necessary for coalesced layouts. **Args:** * ​layout (`Layout`): The layout to check. * ​dim (`Int`): The dimension to check. **Returns:** True if the layout is contiguous in the specified dimension, False otherwise. --- ## is_row_major `is_row_major[rank: Int](layout: Layout) -> Bool` Checks if a layout has row-major ordering for the specified rank. A row-major layout has strides that decrease from left to right, with the rightmost dimension having a stride of 1. **Parameters:** * ​rank (`Int`): The expected rank of the layout. **Args:** * ​layout (`Layout`): The layout to check. **Returns:** True if the layout has row-major ordering for the specified rank, False otherwise. --- ## logical_divide `logical_divide(layout_a: Layout, _layout_b: Layout) -> Layout` Divides a layout into blocks according to another layout. This function creates a hierarchical layout by dividing the first layout according to the second layout. It's useful for creating blocked or tiled representations of tensors. **Args:** * ​layout\_a (`Layout`): The layout to be divided. * ​\_layout\_b (`Layout`): The layout defining the division pattern. **Returns:** A new layout representing the hierarchical division. `logical_divide(layout_a: Layout, tiler: List[Layout]) -> Layout` Divides a layout into blocks according to a list of layouts. This is a variant of logical\_divide that works with a list of layouts for more complex tiling patterns. **Args:** * ​layout\_a (`Layout`): The layout to be divided. * ​tiler (`List[Layout]`): A list of layouts defining the division patterns. **Returns:** A new layout representing the hierarchical division. --- ## logical_product `logical_product(_layout_a: Layout, layout_b: Layout) -> Layout` Creates a product of two layouts. This function creates a hierarchical layout by taking the logical product of two layouts. It's a fundamental operation for creating blocked or tiled layouts. **Args:** * ​\_layout\_a (`Layout`): The first layout. * ​layout\_b (`Layout`): The second layout. **Returns:** A new layout representing the logical product of the two layouts. `logical_product(layout_a: Layout, tiler: List[Layout]) -> Layout` Creates a product of a layout with a list of layouts. This is a variant of logical\_product that works with a list of layouts for more complex tiling patterns. It applies the logical\_product operation to each element of the layout with the corresponding element in the tiler list. 
Example: ```mojo from layout import Layout, LayoutList, IntTuple from layout.layout import logical_product # Create a product of a layout with a list of layouts var base = Layout.row_major(6, 8) var tilers = LayoutList() tilers.append(Layout(IntTuple(2, 2))) var result = logical_product(base, tilers) ``` . **Args:** * ​layout\_a (`Layout`): The base layout to create products with. * ​tiler (`List[Layout]`): A list of layouts defining the product patterns. **Returns:** A new layout representing the logical product with the tiler layouts. --- ## make_layout `make_layout(*layouts: Layout) -> Layout` Creates a composite layout by concatenating multiple layouts. This function combines multiple layouts into a single layout by concatenating their shapes and strides. The resulting layout represents a hierarchical structure where each input layout becomes a component of the output layout. Example: ```mojo from layout import Layout, IntTuple from layout.layout import make_layout var layout1 = Layout(IntTuple(2, 3), IntTuple(3, 1)) var layout2 = Layout(IntTuple(4, 5), IntTuple(5, 1)) var combined = make_layout(layout1, layout2) # Result: Layout with shape ((2, 3), (4, 5)) and stride ((3, 1), (5, 1)) ``` . **Args:** * ​\*layouts (`Layout`): Variable number of `Layout` objects to combine. **Returns:** A new Layout with concatenated shapes and strides from the input layouts. `make_layout(layout_a: Layout, layout_b: Layout) -> Layout` Creates a composite layout from two layouts. This is a specialized version of make\_layout that takes exactly two layouts and combines them into a single layout. This function exists as a workaround for compiler limitations. **Args:** * ​layout\_a (`Layout`): The first layout to include in the composite. * ​layout\_b (`Layout`): The second layout to include in the composite. **Returns:** A new `Layout` with concatenated shapes and strides from the input layouts. --- ## make_ordered_layout `make_ordered_layout(shape: IntTuple[origin], order: IntTuple[origin]) -> Layout` Creates a layout with strides ordered according to a specified traversal order. This function generates a compact (bijective) layout where the stride values follow the traversal order specified by the order parameter. This allows creating layouts with custom memory traversal patterns while maintaining a compact memory representation. Example: ```mojo from layout import IntTuple, Layout from layout.layout import make_ordered_layout # Create a layout with shape (2,3,4,5) where dimensions are traversed # in the order: dim0, dim3, dim2, dim1 var layout = make_ordered_layout( IntTuple(2, 3, 4, 5), IntTuple(1, 4, 3, 2) ) # Result: Layout with shape (2,3,4,5) and stride (1,24,6,2) ``` . **Args:** * ​shape (`IntTuple[origin]`): The shape of the layout. * ​order (`IntTuple[origin]`): The traversal order priority (lower values indicate higher priority). **Returns:** A `Layout` with the specified shape and strides ordered according to the traversal order. --- ## print_layout `print_layout(layout: Layout)` Prints a 2D layout to the standard output. This function visualizes a 2D layout by printing a formatted table showing the memory indices for each logical coordinate. **Args:** * ​layout (`Layout`): The 2D layout to print. --- ## right_inverse `right_inverse(layout: Layout) -> Layout` Creates a right inverse of a layout. The right inverse of a layout maps memory indices back to logical coordinates. This is useful for converting between different memory layouts. **Args:** * ​layout (`Layout`): The layout to invert. 
**Returns:** A new layout representing the right inverse of the input layout. --- ## size `size(l: Layout) -> Int` Returns the total number of elements in the layout's domain. This is a standalone function equivalent to the Layout.size() method. **Args:** * ​l (`Layout`): The layout to calculate the size for. **Returns:** The total number of elements in the layout. --- ## sublayout `sublayout(layout: Layout, *modes: Int) -> Layout` Creates a sublayout by selecting specific dimensions from a layout. This function extracts a subset of dimensions from a layout to create a new layout with lower rank. For example, from a 3D layout, you could extract a 2D layout containing only the first and third dimensions. Example: From a layout with shape (3,4,5), sublayout(layout, 0, 2) would create a layout with shape (3,5). **Args:** * ​layout (`Layout`): The source layout to extract dimensions from. * ​\*modes (`Int`): The indices of dimensions to include in the sublayout. **Returns:** A new layout containing only the specified dimensions. --- ## tile_to_shape `tile_to_shape(tile: Layout, target_shape: IntTuple[origin], order: Optional[IntTuple] = Optional(None)) -> Layout` Creates a layout by tiling a base layout to match a target shape. This function creates a hierarchical layout by repeating a tile layout to match a target shape. It calculates how many times the tile needs to be repeated in each dimension to reach the target shape, and creates a tiler layout with this information. Example: ```mojo from layout import Layout, IntTuple from layout.layout import tile_to_shape # Create a 2x2 tile layout var tile = Layout.row_major(2, 2) # Tile it to create a 6x4 layout var tiled = tile_to_shape(tile, IntTuple(6, 4)) # Result: A layout with 3x2 tiles of size 2x2 each ``` . **Args:** * ​tile (`Layout`): The base layout to be tiled. * ​target\_shape (`IntTuple[origin]`): The desired final shape to tile to. * ​order (`Optional[IntTuple]`): Optional memory ordering for the tiler layout. If None, defaults to column-major ordering. **Returns:** A new layout representing the tiled structure that matches the target shape. --- ## upcast `upcast(layout: Layout, factor: Int) -> Layout` Fuses consecutive elements in a layout to create a coarser layout. This function is useful for converting between different data type granularities, such as from bytes to larger data types like bfloat16 or tf32. **Args:** * ​layout (`Layout`): The layout to upcast. * ​factor (`Int`): The number of consecutive elements to fuse into one. **Returns:** A new layout with adjusted shape and stride for the coarser granularity. --- ## zip_modes `zip_modes(layout_a: Layout, layout_b: Layout) -> Layout` Combines corresponding modes from two layouts. This function creates a new layout by combining corresponding dimensions from two layouts. If a dimension in layout\_b has a non-positive shape, the corresponding dimension from layout\_a is used directly. **Args:** * ​layout\_a (`Layout`): The first layout. * ​layout\_b (`Layout`): The second layout. **Returns:** A new layout with combined dimensions from both input layouts. --- ## zipped_divide `zipped_divide(layout_a: Layout, layout_b: Layout) -> Layout` Divides a layout into blocks according to another layout. This function creates a hierarchical layout by dividing the first layout according to the second layout. It's an alias for hierarchical\_unzip that provides a more intuitive name for the division operation. This is useful for creating blocked or tiled representations of tensors. 
Example: ```mojo from layout import Layout, IntTuple from layout.layout import zipped_divide # Create layouts var base = Layout.row_major(6, 8) var pattern = Layout(IntTuple(2, 2)) var result = zipped_divide(base, pattern) ``` . **Args:** * ​layout\_a (`Layout`): The layout to be divided. * ​layout\_b (`Layout`): The layout defining the division pattern. **Returns:** A new layout representing the hierarchical division of layout\_a according to layout\_b. `zipped_divide(layout_a: Layout, tiler: List[Layout]) -> Layout` Divides a layout into blocks according to a list of layouts. This function creates a hierarchical layout by dividing the first layout according to the layouts in the tiler list. It's an alias for hierarchical\_unzip that provides a more intuitive name for the division operation when working with multiple tiling patterns. Example: ```mojo from layout import Layout, LayoutList, IntTuple from layout.layout import zipped_divide # Create layouts var base = Layout.row_major(6, 8) var tilers = LayoutList() tilers.append(Layout(IntTuple(2, 2))) var result = zipped_divide(base, tilers) ``` . **Args:** * ​layout\_a (`Layout`): The layout to be divided. * ​tiler (`List[Layout]`): A list of layouts defining the division patterns. **Returns:** A new layout representing the hierarchical division of layout\_a according to the patterns in tiler. --- ## LayoutTensor `@register_passable(trivial)` `struct LayoutTensor[mut: Bool, //, dtype: DType, layout: Layout, origin: Origin[mut], /, *, address_space: AddressSpace = AddressSpace(0), element_layout: Layout = Layout(IntTuple(1), IntTuple(1)), layout_int_type: DType = _get_layout_type(layout, address_space), linear_idx_type: DType = _get_index_type(layout, address_space), masked: Bool = False, alignment: Int = alignof[dtype]()]` A high-performance tensor with explicit memory layout and hardware-optimized access patterns. `LayoutTensor` provides a powerful abstraction for multi-dimensional data with precise control over memory organization. It supports various memory layouts (row-major, column-major, tiled), hardware-specific optimizations, and efficient parallel access patterns. Example: ```mojo from layout import Layout, LayoutTensor # Create tensor on CPU using InlineArray to allocate storage space. var storage = InlineArray[Scalar[DType.float32], 5 * 4](uninitialized = True) var tensor_5x4 = LayoutTensor[DType.float32, Layout.row_major(5, 4)](storage) ``` ## Parameters * ​mut (`Bool`): The inferred mutability of the underlying pointer. * ​dtype (`DType`): The data type of the underlying pointer. * ​layout (`Layout`): The memory layout of the tensor. * ​origin (`Origin[mut]`): The origin of the underlying pointer. * ​address\_space (`AddressSpace`): The address space of the underlying pointer. * ​element\_layout (`Layout`): The memory layout of each element in the tensor. * ​layout\_int\_type (`DType`): The integer type of each dimension of runtime layout. * ​linear\_idx\_type (`DType`): The integer type of the index pointing to memory locations. * ​masked (`Bool`): If true the tensor is masked and runtime layouts determine the shape. * ​alignment (`Int`): Alignment of the data pointer. ## Fields * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the underlying memory buffer containing the tensor data.
This pointer respects the specified address space, alignment, mutability, and origin tracking for memory safety and performance optimization. * ​runtime\_layout (`RuntimeLayout[layout, element_type=layout_int_type, linear_idx_type=linear_idx_type]`): Runtime representation of the tensor's memory layout. Handles both compile-time and runtime-determined dimensions, enabling efficient mapping between logical tensor coordinates and physical memory locations. * ​runtime\_element\_layout (`RuntimeLayout[element_layout, element_type=int32, linear_idx_type=linear_idx_type]`): Runtime representation of each element's internal layout. Used when elements themselves have structure, such as in blocked or tiled layouts. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`, `_Expable` ## Aliases ### `element_size` `alias element_size = element_layout.size()` The number of scalar values in each element of the tensor. ### `element_type` `alias element_type = SIMD[dtype, element_layout.size()]` The SIMD vector type used for vectorized operations on tensor elements. ### `rank` `alias rank = layout.rank()` The number of dimensions in the tensor's layout. ## Methods ### `__init__` `@implicit` `__init__(span: Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment]) -> Self` Create a `LayoutTensor` with a `Span`. **Constraints:** Layout must be fully static. **Args:** * ​span (`Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment]`): The `Span` pointing to the underlying data. `__init__(span: Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> Self` Create a `LayoutTensor` with a `Span` and a runtime layout for the tensor. The runtime layout element type will be cast to the layout tensor layout integer type. **Constraints:** * Element layout must be fully static. **Args:** * ​span (`Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment]`): The `Span` pointing to the underlying data. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`. `__init__(span: Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type], element_runtime_layout: RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> Self` Create a `LayoutTensor` with a `Span`, a runtime layout of the tensor, and the runtime layout of each element. The runtime layout element type will be cast to the layout tensor layout integer type. **Constraints:** * Runtime layout and `LayoutTensor` must have the same bitwidth and index type. **Args:** * ​span (`Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment]`): The `Span` pointing to the underlying data. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`. * ​element\_runtime\_layout (`RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of each element.
`@implicit` `__init__(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> Self` Create a `LayoutTensor` with an `UnsafePointer`. **Constraints:** Layout must be fully static. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The `UnsafePointer` pointing to the underlying data. `__init__(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> Self` Create a `LayoutTensor` with an `UnsafePointer` and a runtime layout for the tensor. The runtime layout element type will be cast to the layout tensor layout integer type. **Constraints:** Element layout must be fully static. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The `UnsafePointer` pointing to the underlying data. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`. `__init__(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type], element_runtime_layout: RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> Self` Create a `LayoutTensor` with an `UnsafePointer`, a runtime layout for the tensor, and the runtime layout of each element. The runtime layout element type will be cast to the layout tensor layout integer type. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The `UnsafePointer` pointing to the underlying data. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`. * ​element\_runtime\_layout (`RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of each element. `@implicit` `__init__(ref [origin] device_buffer: DeviceBuffer[dtype]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Create a `LayoutTensor` from a `DeviceBuffer`. The layout must have statically known dimensions. Note that the device buffer memory is on the accelerator device (GPU global memory). Code running on the CPU can use the [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext) to allocate a `DeviceBuffer` and use that to construct a `LayoutTensor` that can be accessed on the GPU. You cannot directly access data in the `DeviceBuffer` or `LayoutTensor` from the CPU. The following example shows a typical pattern for using `DeviceBuffer` to construct a `LayoutTensor` that you can use on the GPU.
```mojo from gpu.host import DeviceContext, DeviceBuffer from layout import Layout, LayoutTensor alias dtype = DType.float32 var ctx = DeviceContext() # Allocate buffers var dev_buf = ctx.enqueue_create_buffer[dtype](16) var host_buf = ctx.enqueue_create_host_buffer[dtype](16) # Ensure buffers have been created ctx.synchronize() # Initialize host buffer and copy to device buffer for i in range(16): host_buf[i] = i ctx.enqueue_copy(dev_buf, host_buf) # Create LayoutTensor to use on device alias layout = Layout.row_major(4, 4) var tensor = LayoutTensor[dtype, layout](dev_buf) ... ``` **Constraints:** * Layout must be fully static. **Args:** * ​device\_buffer (`DeviceBuffer[dtype]`): Contains the underlying data to point to. `@implicit` `__init__(ref [origin] host_buffer: HostBuffer[dtype]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Create a `LayoutTensor` from a `HostBuffer`. The layout must have statically known dimensions. The resulting tensor's data can only be accessed on the CPU. ```mojo from gpu.host import DeviceContext, HostBuffer from layout import Layout, LayoutTensor alias dtype = DType.float32 var ctx = DeviceContext() var host_buf = ctx.enqueue_create_host_buffer[dtype](16) alias layout = Layout.row_major(4, 4) var tensor = LayoutTensor[dtype, layout](host_buf) ``` **Constraints:** * Layout must be fully static. **Args:** * ​host\_buffer (`HostBuffer[dtype]`): Contains the underlying data to point to. `__init__(ref [origin] device_buffer: DeviceBuffer[dtype], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Create a `LayoutTensor` from a `DeviceBuffer` and a runtime layout. The runtime layout element type will be cast to the layout tensor layout integer type. The resulting tensor's data can only be accessed on the GPU. **Constraints:** * Element layout must be fully static. **Args:** * ​device\_buffer (`DeviceBuffer[dtype]`): The `DeviceBuffer` containing the underlying data. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`. `__init__(ref [origin] host_buffer: HostBuffer[dtype], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Create a `LayoutTensor` from a `HostBuffer` and a runtime layout. The runtime layout element type will be cast to the layout tensor layout integer type. The resulting tensor's data can only be accessed on the CPU. **Constraints:** * Element layout must be fully static. **Args:** * ​host\_buffer (`HostBuffer[dtype]`): The `HostBuffer` containing the underlying data. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`.
`__init__(ref [origin] device_buffer: DeviceBuffer[dtype], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type], element_runtime_layout: RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Create a `LayoutTensor` from a `DeviceBuffer`, a runtime layout for the tensor, and the runtime layout of each element. The runtime layout element type will be cast to the layout tensor layout integer type. The resulting tensor's data can only be accessed on the GPU. **Args:** * ​device\_buffer (`DeviceBuffer[dtype]`): The `DeviceBuffer` containing the underlying data. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`. * ​element\_runtime\_layout (`RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of each element. `__init__(ref [origin] host_buffer: HostBuffer[dtype], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type], element_runtime_layout: RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Create a `LayoutTensor` from a `HostBuffer`, a runtime layout for the tensor, and the runtime layout of each element. The runtime layout element type will be cast to the layout tensor layout integer type. The resulting tensor's data can only be accessed on the CPU. **Args:** * ​host\_buffer (`HostBuffer[dtype]`): The `HostBuffer` containing the underlying data. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`. * ​element\_runtime\_layout (`RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of each element. ### `__getitem__` `__getitem__(self, *dims: Int) -> SIMD[dtype, element_layout.size()]` Retrieves a single element from the tensor at the specified indices. This method provides array-like indexing for the tensor. The number of indices provided must match the rank of the tensor, otherwise an error will occur at runtime. **Args:** * ​\*dims (`Int`): The indices specifying the element's position in each dimension. For example, in a 3D tensor, you would use (i, j, k). **Returns:** The element at the specified position with the tensor's data type. `__getitem__(self, crd: RuntimeTuple[S, element_type=element_type]) -> SIMD[dtype, element_layout.size()]` Retrieves a single element from the tensor at the specified indices. This method provides array-like indexing for the tensor. The number of indices provided must match the rank of the tensor, otherwise an error will occur at runtime. **Args:** * ​crd (`RuntimeTuple[S, element_type=element_type]`): The coordinate specifying the element's position in each dimension. For example, in a 3D tensor, you would use (i, j, k). **Returns:** The element at the specified position with the tensor's data type.
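A minimal sketch of round-tripping a value through these indexing operators, using the `InlineArray`-backed constructor shown earlier (the assignment relies on `__setitem__`, documented next):

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 2 * 3](uninitialized = True)
var t = LayoutTensor[DType.float32, Layout.row_major(2, 3)](storage)

# Write an element, then read it back with array-style indexing.
t[1, 2] = 42.0
print(t[1, 2])  # 42.0
```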
### `__setitem__` `__setitem__(self, d0: Int, val: SIMD[dtype, element_layout.size()])` Sets a single element in a rank-1 tensor at the specified index. This method provides array-like element assignment for rank-1 tensors. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. **Args:** * ​d0 (`Int`): The index along the first dimension. * ​val (`SIMD[dtype, element_layout.size()]`): The value to write to the tensor at the specified position. `__setitem__(self, d0: Int, d1: Int, val: SIMD[dtype, element_layout.size()])` Sets a single element in a rank-2 tensor at the specified indices. This method provides array-like element assignment for rank-2 tensors. Performance: * Direct memory access with minimal overhead. * Memory access pattern follows the tensor's stride configuration. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. **Args:** * ​d0 (`Int`): The index along the first dimension. * ​d1 (`Int`): The index along the second dimension. * ​val (`SIMD[dtype, element_layout.size()]`): The value to write to the tensor at the specified position. `__setitem__(self, d0: Int, d1: Int, d2: Int, val: SIMD[dtype, element_layout.size()])` Sets a single element in a rank-3 tensor at the specified indices. This method provides array-like element assignment for rank-3 tensors. Performance: * Direct memory access with minimal overhead. * Memory access pattern follows the tensor's stride configuration. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. **Args:** * ​d0 (`Int`): The index along the first dimension. * ​d1 (`Int`): The index along the second dimension. * ​d2 (`Int`): The index along the third dimension. * ​val (`SIMD[dtype, element_layout.size()]`): The value to write to the tensor at the specified position. `__setitem__(self, d0: Int, d1: Int, d2: Int, d3: Int, val: SIMD[dtype, element_layout.size()])` Sets a single element in a rank-4 tensor at the specified indices. This method provides array-like element assignment for rank-4 tensors. Performance: * Direct memory access with minimal overhead. * Memory access pattern follows the tensor's stride configuration. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. **Args:** * ​d0 (`Int`): The index along the first dimension. * ​d1 (`Int`): The index along the second dimension. * ​d2 (`Int`): The index along the third dimension. * ​d3 (`Int`): The index along the fourth dimension. * ​val (`SIMD[dtype, element_layout.size()]`): The value to write to the tensor at the specified position. `__setitem__(self, d0: Int, d1: Int, d2: Int, d3: Int, d4: Int, val: SIMD[dtype, element_layout.size()])` Sets a single element in a rank-5 tensor at the specified indices. This method provides array-like element assignment for rank-5 tensors. Performance: * Direct memory access with minimal overhead. * Memory access pattern follows the tensor's stride configuration. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. **Args:** * ​d0 (`Int`): The index along the first dimension. * ​d1 (`Int`): The index along the second dimension. * ​d2 (`Int`): The index along the third dimension. * ​d3 (`Int`): The index along the fourth dimension. * ​d4 (`Int`): The index along the fifth dimension. 
* ​val (`SIMD[dtype, element_layout.size()]`): The value to write to the tensor at the specified position. ### `__add__` `__add__(self, other: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Add a scalar value to each element of the tensor. Performs an elementwise addition operation, adding the scalar value to each element in the tensor. This operation creates a new tensor with the results. Performance: * This operation creates a copy of the tensor before performing the addition. * For in-place addition, use the `__iadd__` method instead (`+=` operator). **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to add to each element. **Returns:** A new tensor containing the results of the addition operation. `__add__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Add another tensor to this tensor elementwise. Performs an elementwise addition between this tensor and another tensor. This operation creates a new tensor with the results. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation creates a copy of the tensor before performing the addition. * For in-place addition, use the `__iadd__` method instead (`+=` operator). **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to add to this tensor. **Returns:** A new tensor containing the results of the addition operation. ### `__sub__` `__sub__(self, other: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Subtract a scalar value from each element of the tensor. Performs an elementwise subtraction operation, subtracting the scalar value from each element in the tensor. This operation creates a new tensor with the results. Performance: * This operation creates a copy of the tensor before performing the subtraction. * For in-place subtraction, use the `__isub__` method instead (`-=` operator). **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to subtract from each element. **Returns:** A new tensor containing the results of the subtraction operation. 
`__sub__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Subtract another tensor from this tensor elementwise. Performs an elementwise subtraction between this tensor and another tensor. This operation creates a new tensor with the results. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation creates a copy of the tensor before performing the subtraction. * For in-place subtraction, use the `__isub__` method instead (`-=` operator). **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to subtract from this tensor. **Returns:** A new tensor containing the results of the subtraction operation. ### `__mul__` `__mul__(self, other: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Multiply each element of the tensor by a scalar value. Performs an elementwise multiplication operation, multiplying each element in the tensor by the scalar value. This operation creates a new tensor with the results. Performance: * This operation creates a copy of the tensor before performing the multiplication. * For in-place multiplication, use the `__imul__` method instead (`*=` operator). **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to multiply with each element. **Returns:** A new tensor containing the results of the multiplication operation. `__mul__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Multiply this tensor with another tensor elementwise. Performs an elementwise multiplication (Hadamard product) between this tensor and another tensor. This operation creates a new tensor with the results. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Note: This is NOT a matrix multiplication operation. For matrix multiplication, use the appropriate matmul function instead. Performance: * This operation creates a copy of the tensor before performing the multiplication. * For in-place multiplication, use the `__imul__` method instead (`*=` operator). 
**Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to multiply with this tensor. **Returns:** A new tensor containing the results of the elementwise multiplication. ### `__truediv__` `__truediv__(self, other: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Divide each element of the tensor by a scalar value. Performs an elementwise division operation, dividing each element in the tensor by the scalar value. This operation creates a new tensor with the results. Performance: * This operation creates a copy of the tensor before performing the division. * For in-place division, use the `__itruediv__` method instead (`/=` operator). Notes: * Division by zero will result in undefined behavior or errors depending on the dtype. * For integer dtypes, this performs integer division. **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to divide each element by. **Returns:** A new tensor containing the results of the division operation. `__truediv__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Divide this tensor by another tensor elementwise. Performs an elementwise division between this tensor and another tensor. This operation creates a new tensor with the results. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation creates a copy of the tensor before performing the division. * For in-place division, use the `__itruediv__` method instead (`/=` operator). Notes: * Division by zero will result in undefined behavior or errors depending on the dtype. * For integer dtypes, this performs integer division. **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to divide this tensor by. **Returns:** A new tensor containing the results of the division operation. ### `__iadd__` `__iadd__(self, other: SIMD[dtype, 1])` Add a scalar value to each element of the tensor in-place. Performs an elementwise addition operation, adding the scalar value to each element in the tensor. This operation modifies the tensor in-place. Performance: * This operation modifies the tensor directly without creating a copy. **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to add to each element. 
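A short sketch contrasting the out-of-place operators with the in-place variant documented above, again using the `InlineArray`-backed constructor from the struct example (the tensor-tensor overload of `__iadd__` follows below):

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 4](uninitialized = True)
var t = LayoutTensor[DType.float32, Layout.row_major(4)](storage)
for i in range(4):
    t[i] = Float32(i)

var shifted = t + 1.0  # out-of-place: returns a new tensor, `t` is unchanged
t += 10.0              # in-place: modifies `t` directly
print(t[3], shifted[3])  # 13.0 4.0
```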
`__iadd__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Add another tensor to this tensor elementwise in-place. Performs an elementwise addition between this tensor and another tensor. This operation modifies the tensor in-place. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation modifies the tensor directly without creating a copy. **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to add to this tensor. ### `__isub__` `__isub__(self, other: SIMD[dtype, 1])` Subtract a scalar value from each element of the tensor in-place. Performs an elementwise subtraction operation, subtracting the scalar value from each element in the tensor. This operation modifies the tensor in-place. Performance: * This operation modifies the tensor directly without creating a copy. **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to subtract from each element. `__isub__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Subtract another tensor from this tensor elementwise in-place. Performs an elementwise subtraction between this tensor and another tensor. This operation modifies the tensor in-place. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation modifies the tensor directly without creating a copy. **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to subtract from this tensor. ### `__imul__` `__imul__(self, other: SIMD[dtype, 1])` Multiply each element of the tensor by a scalar value in-place. Performs an elementwise multiplication operation, multiplying each element in the tensor by the scalar value. This operation modifies the tensor in-place. Performance: * This operation modifies the tensor directly without creating a copy. **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to multiply with each element. `__imul__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Multiply this tensor with another tensor elementwise in-place. Performs an elementwise multiplication (Hadamard product) between this tensor and another tensor. This operation modifies the tensor in-place. 
Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Note: This is NOT a matrix multiplication operation. For matrix multiplication, use the appropriate matmul function instead. Performance: * This operation modifies the tensor directly without creating a copy. **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to multiply with this tensor. ### `__itruediv__` `__itruediv__(self, other: SIMD[dtype, 1])` Divide each element of the tensor by a scalar value in-place. Performs an elementwise division operation, dividing each element in the tensor by the scalar value. This operation modifies the tensor in-place. Performance: * This operation modifies the tensor directly without creating a copy. Notes: * Division by zero will result in undefined behavior or errors depending on the dtype. * For integer dtypes, this performs integer division. **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to divide each element by. `__itruediv__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Divide this tensor by another tensor elementwise in-place. Performs an elementwise division between this tensor and another tensor. This operation modifies the tensor in-place. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation modifies the tensor directly without creating a copy. Notes: * Division by zero will result in undefined behavior or errors depending on the dtype. * For integer dtypes, this performs integer division. **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to divide this tensor by. ### `copy` `copy(self) -> Self` Explicitly copy this `LayoutTensor`. **Returns:** A copy of the value. ### `bitcast` `bitcast[new_type: DType, /, address_space: AddressSpace = address_space, element_layout: Layout = element_layout](self) -> LayoutTensor[new_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]` Bitcast the underlying pointer to a new data type. **Parameters:** * ​new\_type (`DType`): The new data type to cast to. * ​address\_space (`AddressSpace`): The address space of the returned `LayoutTensor`. * ​element\_layout (`Layout`): The element layout of the returned `LayoutTensor`. **Returns:** A new `LayoutTensor` with the same memory location but with the specified data type, address space, and element layout.
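As a hedged illustration of `bitcast`, the sketch below reinterprets the storage of a float32 tensor as int32 (both are 4-byte types, so the shape and layout are unchanged). The setup mirrors the `fill` example later in this document; the shape and values are illustrative only:

```mojo
from layout import Layout, LayoutTensor

def main():
    var storage = InlineArray[Float32, 2 * 2](uninitialized=True)
    var floats = LayoutTensor[
        DType.float32,
        Layout.row_major(2, 2),
    ](storage).fill(1.0)

    # View the same memory as int32 without copying. The bit pattern of
    # float32 1.0 is 0x3F800000, i.e. 1065353216 in decimal.
    var bits = floats.bitcast[DType.int32]()
    print(bits)
```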
### `origin_cast` `origin_cast[mut: Bool = mut, origin: Origin[mut] = origin](self) -> LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Changes the origin or mutability of a pointer. **Parameters:** * ​mut (`Bool`): Whether the origin is mutable. * ​origin (`Origin[mut]`): Origin of the destination pointer. **Returns:** A new `LayoutTensor` object with the same type and address as the original `LayoutTensor`, and the new specified mutability and origin. ### `address_space_cast` `address_space_cast[address_space: AddressSpace = address_space](self) -> LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Changes the address space of a pointer. **Parameters:** * ​address\_space (`AddressSpace`): The new address space. **Returns:** A new `LayoutTensor` object with the same type and origin as the original `LayoutTensor`, and the new specified address\_space. ### `get_immutable` `get_immutable(self) -> LayoutTensor[dtype, layout, (muttoimm origin._mlir_origin), address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Return an immutable version of this tensor. **Returns:** A `LayoutTensor` covering the same elements, but without mutability. ### `__exp__` `__exp__(self) -> Self` Computes the element-wise exponential function. Returns a new tensor containing the [element-wise exponential](/mojo/stdlib/math/math/exp/) of the input tensor. **Returns:** A new tensor containing the element-wise exponential. ### `load` `load[width: Int](self, m: Int, n: Int) -> SIMD[dtype, width]` Load a SIMD vector from the tensor at the specified 2D coordinates. Performs a vectorized load operation from the tensor's memory, retrieving `width` consecutive elements starting at position (m, n). This method enables efficient SIMD operations on tensor data. Performance: * Uses unaligned memory access which may be slower on some architectures. * For aligned access, use `aligned_load` instead when data alignment is guaranteed. * The load operation is optimized based on the tensor's memory layout. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. * The elements are loaded according to the tensor's stride configuration. **Parameters:** * ​width (`Int`): The number of elements to load into the SIMD vector. Should match the target hardware's vector width for optimal performance. **Args:** * ​m (`Int`): The row index (first dimension). * ​n (`Int`): The column index (second dimension). **Returns:** A SIMD vector containing 'width' consecutive elements from the tensor. ### `prefetch` `prefetch(self, m: Int, n: Int)` Prefetch tensor data at the specified 2D coordinates into cache. Issues a software prefetch hint to the processor to load the data at position (m, n) into the cache hierarchy. This can improve performance by reducing memory latency for subsequent accesses to the same location. Performance: * Prefetching is a performance hint and does not guarantee data will be cached. * Most effective when issued sufficiently ahead of the actual data access. * Uses high locality prefetch to the data cache, optimized for data that will be accessed multiple times.
* Can reduce memory access latency by 50-90% when used correctly. Notes: * Excessive prefetching can pollute the cache and degrade performance. * Most beneficial for predictable access patterns that would otherwise cause cache misses. * No operation is performed on the prefetched data. **Args:** * ​m (`Int`): The row index (first dimension). * ​n (`Int`): The column index (second dimension). ### `aligned_load` `aligned_load[width: Int](self, m: Int, n: Int) -> SIMD[dtype, width]` Load a SIMD vector with alignment guarantees from the tensor. Performs an aligned vectorized load operation from the tensor's memory, retrieving `width` consecutive elements starting at position (m, n). The alignment is automatically calculated based on the SIMD width and dtype. Performance: * Uses aligned memory access which is faster than unaligned access on most architectures. * The alignment is automatically calculated based on the SIMD width and dtype. * Can be up to 2x faster than unaligned loads on architectures that require alignment. Notes: * The caller must ensure that the memory at (m, n) is properly aligned. Misaligned access with this method may cause hardware exceptions on some architectures. * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. **Parameters:** * ​width (`Int`): The number of elements to load into the SIMD vector. Should match the target hardware's vector width for optimal performance. **Args:** * ​m (`Int`): The row index (first dimension). * ​n (`Int`): The column index (second dimension). **Returns:** A SIMD vector containing 'width' consecutive elements from the tensor. ### `store` `store[width: Int](self, m: Int, n: Int, val: SIMD[dtype, width])` Store a SIMD vector to the tensor at the specified 2D coordinates. Performs a vectorized store operation to the tensor's memory, writing 'width' consecutive elements starting at position (m, n). This method enables efficient SIMD operations on tensor data. Performance: * Uses unaligned memory access which may be slower on some architectures. * For aligned access, use aligned\_store instead when data alignment is guaranteed. * The store operation is optimized based on the tensor's memory layout. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. * The elements are stored according to the tensor's stride configuration. * This operation modifies the tensor's data in-place. **Parameters:** * ​width (`Int`): The number of elements in the SIMD vector to store. Should match the target hardware's vector width for optimal performance. **Args:** * ​m (`Int`): The row index (first dimension) where the store operation begins. * ​n (`Int`): The column index (second dimension) where the store operation begins. * ​val (`SIMD[dtype, width]`): The SIMD vector containing the values to store in the tensor. ### `aligned_store` `aligned_store[width: Int](self, m: Int, n: Int, val: SIMD[dtype, width])` Store a SIMD vector with alignment guarantees to the tensor. Performs an aligned vectorized store operation to the tensor's memory, writing `width` consecutive elements starting at position (m, n). The alignment is automatically calculated based on the SIMD width and dtype. Performance: * Uses aligned memory access which is faster than unaligned access on most architectures. * The alignment is automatically calculated based on the SIMD width and dtype. * Can be up to 2x faster than unaligned stores on architectures that require alignment. 
* Particularly important for streaming stores that bypass the cache. Notes: * The caller must ensure that the memory at (m, n) is properly aligned. Misaligned access with this method may cause hardware exceptions on some architectures. * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. * This operation modifies the tensor's data in-place. **Parameters:** * ​width (`Int`): The number of elements in the SIMD vector to store. Should match the target hardware's vector width for optimal performance. **Args:** * ​m (`Int`): The row index (first dimension) where the store operation begins. * ​n (`Int`): The column index (second dimension) where the store operation begins. * ​val (`SIMD[dtype, width]`): The SIMD vector containing the values to store in the tensor. ### `size` `size(self) -> Int` Get the total number of elements that the tensor can contain. **Returns:** The total number of elements that can be stored in the tensor. ### `stack_allocation` `static stack_allocation[*, alignment: Int = alignment]() -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Allocates stack memory for a `LayoutTensor` with a fully static layout. Creates a new `LayoutTensor` instance with memory allocated on the stack rather than the heap. This provides deterministic memory management and potentially better performance for tensors with known sizes at compile time. Performance: * Stack allocation is typically faster than heap allocation. * Proper alignment can significantly improve memory access performance, especially for vectorized operations. * No dynamic memory management overhead (no malloc/free calls). Notes: * Only works with tensors that have fully static layouts known at compile time. * Stack memory is limited, so this should only be used for reasonably sized tensors. * The allocated memory is automatically freed when the function returns. **Constraints:** * The layout must be fully static (all dimensions known at compile time). * The alignment must be a multiple of the tensor's minimum required alignment. **Parameters:** * ​alignment (`Int`): Memory alignment value for the allocation in bytes. Must be a multiple of the tensor's minimum required alignment. Default is the tensor's natural alignment based on its data type and layout. **Returns:** A new `LayoutTensor` instance with memory allocated on the stack. ### `shape` `static shape[idx: Int]() -> Int` Returns the size of the tensor along the specified dimension. Provides static access to the tensor's shape information. This method returns the size of a specific dimension without requiring an instance of the tensor, as the shape is part of the tensor's static type information. Performance: * This is a compile-time operation with no runtime cost when used with static dimensions. Notes: * This is a static method that operates on the tensor's type information, not on a specific tensor instance. **Parameters:** * ​idx (`Int`): The dimension index to query (0-based). For example, in a 3D tensor with shape \[10, 20, 30]: * `shape[0]()` returns 10 (first dimension). * `shape[1]()` returns 20 (second dimension). * `shape[2]()` returns 30 (third dimension). **Returns:** The size of the tensor along the specified dimension as an integer. ### `stride` `static stride[idx: Int]() -> Int` Returns the memory stride of the tensor along the specified dimension.
Provides static access to the tensor's stride information. The stride represents the number of elements to skip in memory to move one position along a particular dimension. This method returns the stride without requiring an instance of the tensor, as the stride is part of the tensor's static type information. Performance: * This is a compile-time operation with no runtime cost when used with static dimensions. * Understanding stride patterns is crucial for optimizing memory access patterns in performance-critical code. Notes: * Strides depend on the memory layout (row-major, column-major, or custom). * For non-contiguous tensors (e.g., tensor slices), strides may not follow a simple pattern. **Parameters:** * ​idx (`Int`): The dimension index to query (0-based). For example, in a 2D tensor with shape \[10, 20] and row-major layout: * `stride[0]()` might return 20 (moving one row requires skipping 20 elements). * `stride[1]()` might return 1 (moving one column requires skipping 1 element). **Returns:** The memory stride of the tensor along the specified dimension as an integer. ### `dim` `dim(self, idx: Int) -> Int` Returns the runtime dimension size of the tensor along the specified axis. Unlike the parameterized `dim[idx]()` overload below, which takes the dimension index as a compile-time parameter, this method takes the dimension index as a runtime value. **Args:** * ​idx (`Int`): The dimension index to query (0-based). For example, in a 3D tensor with shape `[10, 20, 30]`: * `dim(0)` returns 10 (first dimension). * `dim(1)` returns 20 (second dimension). * `dim(2)` returns 30 (third dimension). **Returns:** The dimension of the tensor along the specified axis as an integer. `dim[idx: Int](self) -> Int` Returns the dimension size of the tensor along the specified axis. Unlike the static `shape` method, this instance method provides access to the tensor's actual dimension sizes. If the dimension is unknown, the runtime layout is used to get the dimension size. Performance: * For static dimensions known at compile time, prefer the static `shape` method when possible for better performance. Notes: * This method works with both static and dynamic dimensions. * For tensors with masked or partial views, this returns the actual size of the view, not the original tensor. **Constraints:** * Only works with tensors that have depth-1 layouts (no nested shapes). **Parameters:** * ​idx (`Int`): The dimension index to query (0-based). For example, in a 3D tensor with shape `[10, 20, 30]`: * `dim[0]()` returns 10 (first dimension). * `dim[1]()` returns 20 (second dimension). * `dim[2]()` returns 30 (third dimension). **Returns:** The size of the tensor along the specified dimension as an integer. ### `coalesce` `coalesce(self) -> LayoutTensor[dtype, coalesce(layout, False), origin, address_space=address_space, element_layout=element_layout]` Creates a tensor with a coalesced memory layout from this tensor. Coalescing a tensor's layout means reorganizing its memory representation to be as contiguous as possible, which can improve memory access patterns and performance. This operation does not move or copy data; it only changes how the same memory is interpreted. Performance: * Coalesced layouts typically provide better cache utilization and memory access patterns. * This operation is zero-cost at runtime as it only changes the layout information, not the actual data. * Particularly beneficial before operations that perform sequential memory access or vectorized operations.
Notes: * The coalesced tensor shares the same memory as the original tensor, so modifications to one will affect the other. * The shape of the tensor remains the same, only the stride information is optimized. * For already optimally coalesced tensors, this operation has no effect. **Returns:** A tensor with the same data but with a coalesced memory layout. The returned tensor has type `LayoutTensor` with the same dtype but with a coalesced layout. ### `tile_type` `static tile_type[*tile_sizes: Int](*tile_coords: Int) -> LayoutTensor[dtype, _compute_tile_layout[*::Int]()[0], origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _tile_is_masked[::Layout,*::Int](), alignment=alignment]` Returns the type of a tile view of the tensor with the specified dimensions and coordinates. **Parameters:** * ​\*tile\_sizes (`Int`): The dimensions of each tile along each axis of the tensor. **Args:** * ​\*tile\_coords (`Int`): The coordinates of the specific tile to extract. **Returns:** The type of a view into the original tensor representing the specified tile. ### `tile` `tile[*tile_sizes: Int](self, *tile_coords: Int) -> LayoutTensor[dtype, _compute_tile_layout[*::Int]()[0], origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _tile_is_masked[::Layout,*::Int](), alignment=alignment]` Extract a tile (sub-tensor) from this tensor with specified dimensions and position. Tiling is a fundamental operation for high-performance tensor computations that divides a tensor into smaller blocks for better cache locality and parallelism. This method extracts a specific tile at the given coordinates without copying data. Example: For a 4×4 tensor with values: ``` [1 2 3 4] [2 3 4 5] [5 4 3 2] [1 1 1 1] ``` `tile[2, 2](1, 0)` will extract the tile: ``` [5 4] [1 1] ``` Performance: * Creates a view without copying data, making it very efficient. * Optimized for both static and dynamic layouts with different code paths. * Properly handles edge cases where tiles may be partially outside the tensor. * Maintains stride information for efficient memory access within the tile. Notes: * The resulting tile is a view into the original tensor, so modifications to the tile will affect the original tensor. * For tiles at the edges of the tensor, the actual dimensions may be smaller than the requested tile\_sizes if masking is enabled. * The implementation automatically selects between static and dynamic tiling based on the tensor's layout properties. **Parameters:** * ​\*tile\_sizes (`Int`): The dimensions of each tile along each axis of the tensor. For example, in a 2D tensor, `tile[32, 32]` creates 32×32 tiles. **Args:** * ​\*tile\_coords (`Int`): The coordinates of the specific tile to extract. For example, `tile[32, 32](1, 2)` extracts the tile at position (1, 2) in the grid of 32×32 tiles. **Returns:** A view into the original tensor representing the specified tile.
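A minimal runnable sketch of `tile`, using the same `InlineArray`-backed setup as the `fill` example later in this document (shapes and values are illustrative). It highlights that a tile is a view, so writes through the tile land in the original tensor:

```mojo
from layout import Layout, LayoutTensor

def main():
    var storage = InlineArray[Float32, 4 * 4](uninitialized=True)
    var t = LayoutTensor[
        DType.float32,
        Layout.row_major(4, 4),
    ](storage).fill(0.0)

    # Tile grid position (1, 0) of 2×2 tiles covers rows 2-3, columns 0-1.
    var block = t.tile[2, 2](1, 0)
    _ = block.fill(7.0)  # writes through the view into `t`
    print(t)
```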
### `tile_with_offset` `tile_with_offset[*tile_sizes: Int](self, *tile_coords: Int, out result: Tuple[LayoutTensor[dtype, _compute_tile_layout[*::Int]()[0], origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _tile_is_masked[::Layout,*::Int](), alignment=alignment], IndexList[len[::Sized](flatten[::Origin[::Bool(layout.shape)), element_type=layout_int_type], SIMD[linear_idx_type, 1]])` Similar to `tile`, but also returns the corner coordinates of the tile as well as the offset. **Parameters:** * ​\*tile\_sizes (`Int`): The dimensions of each tile along each axis of the tensor. **Args:** * ​\*tile\_coords (`Int`): The coordinates of the specific tile to extract. **Returns:** A tuple containing: * The extracted tile as a `LayoutTensor`. * The corner coordinates of the tile. * The offset of the tile. ### `tiled_iterator` `tiled_iterator[*tile_sizes: Int, *, axis: Int = 0](self, *tile_coords: Int) -> LayoutTensorIter[dtype, _compute_tile_layout[*::Int]()[0], origin, address_space=address_space, axis=OptionalReg[Int]({:_stdlib::_builtin::_int::_Int axis, 0}), layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _tile_is_masked[::Layout,*::Int]()]` Create an iterator that traverses tiles along a specified axis. This method creates an iterator that allows efficient traversal of tiles within a tensor. The iterator starts at the specified tile coordinates and can move along the specified axis, providing access to consecutive tiles. Performance: * Provides efficient sequential access to tiles with good cache locality. * Optimized for both static and dynamic layouts with different code paths. * Maintains stride information for efficient memory access within each tile. * Properly handles edge cases where tiles may be partially outside the tensor. Notes: * The iterator provides views into the original tensor, so modifications through the iterator will affect the original tensor. * For tiles at the edges of the tensor, the actual dimensions may be smaller than the requested tile\_sizes if masking is enabled. * The iterator is not circular by default, meaning it will not wrap around when reaching the end of the tensor along the iteration axis. * The implementation automatically selects between static and dynamic tiling based on the tensor's layout properties. Example: ```mojo var iter = tensor.tiled_iterator[16, 16, axis=0](0, 0) for i in range(num_tiles_along_axis): var tile = iter.get() # Process tile iter.next() ``` **Parameters:** * ​\*tile\_sizes (`Int`): The dimensions of each tile along each axis of the tensor. For example, in a 2D tensor, `tiled_iterator[32, 32]` creates an iterator over 32×32 tiles. * ​axis (`Int`): The axis along which the iterator will traverse. Default is 0 (first dimension). For example, with axis=0, the iterator will move vertically through tiles. **Args:** * ​\*tile\_coords (`Int`): The starting coordinates of the tile where iteration begins. **Returns:** A `LayoutTensorIter` that can be used to traverse tiles along the specified axis. ### `split` `split[count: Int, axis: Int = 0](self) -> StaticTuple[LayoutTensor[dtype, _compute_tile_layout[::Int,::Int]()[0], origin, address_space=address_space, element_layout=element_layout, alignment=alignment], count]` Split the `LayoutTensor` along an axis and return a `StaticTuple` of `LayoutTensor`. **Parameters:** * ​count (`Int`): The number of partitions to split the tensor into.
* ​axis (`Int`): The axis along which to split the tensor. **Returns:** A `StaticTuple` containing `count` `LayoutTensors`, each representing an equal-sized partition of the original tensor along the specified axis. Each partition has the same data type and memory characteristics as the original tensor, but with a reduced size along the split axis. `split[axis: Int = 0, alignment: Int = 1](self, count: Int, idx: Int) -> LayoutTensor[dtype, layout.make_shape_unknown[::Int](), origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Retrieve a specific partition of the tensor after splitting along a specified axis. This method divides the tensor into 'count' partitions along the specified axis and returns the partition at index 'idx'. The partitioning is done with alignment considerations to optimize memory access patterns. Unlike the overloaded split method that returns all partitions, this method returns only a single partition, making it more memory-efficient for cases where only one partition is needed at a time. Notes: * The shape along the split axis becomes unknown at compile time. * Only works with dimensions that have statically known sizes. * The last partition may be smaller than others if the dimension size is not evenly divisible by `count`. * Partition sizes are aligned up to the specified alignment value, which can improve performance for vectorized operations. Performance: * Uses aligned partitioning to improve memory access patterns. * Avoids creating all partitions in memory, reducing memory usage. * Maintains the original tensor's stride information for efficient element access within the partition. **Constraints:** * The dimension being split must have a statically known size. * Cannot split dimensions with unknown or dynamic sizes. **Parameters:** * ​axis (`Int`): The axis along which to split the tensor. Defaults to 0 (first dimension). * ​alignment (`Int`): Memory alignment value for the partition size. Defaults to 1. **Args:** * ​count (`Int`): The number of partitions to divide the tensor into. * ​idx (`Int`): The index of the partition to return (0-based). **Returns:** A `LayoutTensor` representing the requested partition. ### `distribute_type` `static distribute_type[threads_layout: Layout, axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1})]() -> LayoutTensor[dtype, _compute_distribute_layout[::Layout,::Layout,::OptionalReg[::Int]]()[1], origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _distribute_is_masked[::Layout,::Layout,::OptionalReg[::Int]]() if is_nvidia_gpu() else False]` Returns the type of the distributed tensor. **Parameters:** * ​threads\_layout (`Layout`): The layout of the threads. * ​axis (`OptionalReg[Int]`): The axis to distribute along. **Returns:** The type of the distributed tensor.
### `distribute` `distribute[threads_layout: Layout, axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), submode_axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1})](self, thread_id: UInt) -> LayoutTensor[dtype, _compute_distribute_layout[::Layout,::Layout,::OptionalReg[::Int]]()[1], origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _distribute_is_masked[::Layout,::Layout,::OptionalReg[::Int]]() if is_nvidia_gpu() else False]` Distribute tensor workload across multiple threads in a structured pattern. This method partitions a tensor across multiple threads for parallel processing, assigning each thread a specific portion of the tensor. The distribution pattern is determined by the threads\_layout parameter, which defines the logical arrangement of threads. Example: For a 4×4 tensor distributed across 4 threads in a 2×2 grid: * Thread 0 might get the top-left quadrant * Thread 1 might get the top-right quadrant * Thread 2 might get the bottom-left quadrant * Thread 3 might get the bottom-right quadrant If axis=0 is specified with the same setup: * Thread 0 and Thread 2 would get the same data (left half) * Thread 1 and Thread 3 would get the same data (right half) Performance: * Creates a view without copying data, making it very efficient for parallel processing. * The swizzle parameter can significantly improve cache locality and memory access patterns. * Optimized for both static and dynamic layouts with different code paths. Notes: * The resulting tensor is a view into the original tensor, so modifications will affect the original tensor. * For optimal performance, the `threads_layout` should match the hardware's thread organization (e.g., warp/wavefront size and shape). * When using swizzling, carefully consider the memory access patterns to avoid cache thrashing or bank conflicts. * This function is particularly useful for GPU programming where threads are organized in structured grids. **Constraints:** * For dynamic layouts, the shape must be known at runtime and the threads\_layout must be fully static. **Parameters:** * ​threads\_layout (`Layout`): Defines the logical arrangement of threads (e.g., 2×2 grid of 4 threads). This layout determines how the tensor is partitioned. * ​axis (`OptionalReg[Int]`): Optional. If specified, restricts distribution to only this axis. For example, with axis=0 in a 2D thread layout, threads that differ only in their second coordinate will receive the same data. * ​swizzle (`OptionalReg[Swizzle]`): Optional. A function that remaps the distribution pattern to improve memory access patterns or cache locality. * ​submode\_axis (`OptionalReg[Int]`): Optional. Specifies an axis for specialized distribution modes. **Args:** * ​thread\_id (`UInt`): The ID of the current thread (0-based). **Returns:** A view into the original tensor representing the portion assigned to this thread. 
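To make the distribution pattern concrete, here is a hedged kernel-side sketch that partitions a 4×4 tensor across a 2×2 logical grid of threads, following the same fragment pattern as the `copy_from_async` example later in this document. The kernel name and shapes are illustrative, not part of the API:

```mojo
from gpu import thread_idx
from layout import Layout, LayoutTensor

fn zero_fragments(
    tensor: LayoutTensor[
        DType.float32, Layout.row_major(4, 4), MutableAnyOrigin
    ]
):
    # Each of the 4 threads in the 2×2 logical grid receives a 2×2
    # fragment. The fragment is a view, so writes land in `tensor`.
    alias thread_layout = Layout.row_major(2, 2)
    var fragment = tensor.distribute[thread_layout](thread_idx.x)
    _ = fragment.fill(0.0)
```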
### `distribute_with_offset` `distribute_with_offset[threads_layout: Layout, axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), submode_axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1})](self, thread_id: UInt, out result: Tuple[LayoutTensor[dtype, _compute_distribute_layout[::Layout,::Layout,::OptionalReg[::Int]]()[1], origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _distribute_is_masked[::Layout,::Layout,::OptionalReg[::Int]]() if is_nvidia_gpu() else False], IndexList[threads_layout.rank(), element_type=layout_int_type], SIMD[linear_idx_type, 1]])` Similar to `distribute`, but also returns the corner coordinates of the tile as well as the offset. **Parameters:** * ​threads\_layout (`Layout`): The layout of the threads. * ​axis (`OptionalReg[Int]`): The axis to distribute along. * ​swizzle (`OptionalReg[Swizzle]`): An optional swizzle function. * ​submode\_axis (`OptionalReg[Int]`): An optional submode axis. **Args:** * ​thread\_id (`UInt`): The ID of the current thread (0-based). **Returns:** A tuple containing: * The distributed tensor. * The corner coordinates of the tile. * The offset of the tile. ### `vectorize_type` `static vectorize_type[*vector_shape: Int]() -> LayoutTensor[dtype, coalesce(_compute_tile_layout[*::Int]()[1], True), origin, address_space=address_space, element_layout=_divide_tiles[*::Int]()[0], layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]` Returns the type of a vectorized view of the tensor with specified vector dimensions. **Parameters:** * ​\*vector\_shape (`Int`): The dimensions of each vector unit along each axis of the tensor. **Returns:** The type of a view into the original tensor with a vectorized layout. ### `vectorize` `vectorize[*vector_shape: Int](self) -> LayoutTensor[dtype, coalesce(_compute_tile_layout[*::Int]()[1], True), origin, address_space=address_space, element_layout=_divide_tiles[*::Int]()[0], layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]` Reshape a tensor into a vectorized form for efficient SIMD operations. This method transforms the tensor's logical layout to enable efficient vectorized processing, treating blocks of elements as vector units. The transformation is particularly useful for SIMD (Single Instruction Multiple Data) operations and hardware acceleration. Example: For a 16×16 tensor, `vectorize[4, 4]` will produce a 4×4 tensor where each element represents a 4×4 block from the original tensor. Performance: * Creates a view without copying data, making it very efficient. * Enables hardware-accelerated vector operations on blocks of data. * Improves cache locality by grouping related elements together. * Particularly beneficial for operations that can leverage SIMD instructions. Notes: * The tensor dimensions must be divisible by the corresponding vector dimensions. * For dimensions with unknown size, the corresponding vector dimension must be 1. * The resulting tensor has the same data but a different logical organization. * Modifications to the vectorized tensor affect the original tensor. * This transformation is particularly useful for GPU and vector processor optimizations. **Constraints:** * Each tensor dimension must be divisible by the corresponding vector dimension. * Vector dimensions must be smaller than or equal to the corresponding tensor dimensions. 
* For dimensions with unknown size, the vector dimension must be 1. **Parameters:** * ​\*vector\_shape (`Int`): The dimensions of each vector unit along each axis of the tensor. For example, in a 2D tensor, `vectorize[4, 4]` treats 4×4 blocks as vector units. **Returns:** A view of the tensor with a vectorized layout, where each element in the resulting tensor represents a vector of elements from the original tensor. ### `slice` `slice[d0_slice: Slice, d1_slice: Slice](self) -> LayoutTensor[dtype, _compute_slice_layout(d0_slice, d1_slice), origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Extract a slice from a rank-2 tensor using slice objects. This method creates a view into a subset of the tensor defined by the slice specifications for each dimension. The slice is a continuous region of the tensor with no gaps (step size must be 1). Example: For a 4×4 tensor `t` with values: ``` [1 2 3 4] [5 6 7 8] [9 10 11 12] [13 14 15 16] ``` ```mojo t.slice[Slice(1, 3), Slice(0, 2)]() ``` will extract: ``` [5 6] [9 10] ``` Performance: * Creates a view without copying data, making it very efficient. * Maintains the original tensor's stride information for efficient memory access. * Zero-cost abstraction at runtime when used with compile-time constant slices. Notes: * The slice is a view into the original tensor, so modifications to the slice will affect the original tensor. * Only supports rank-2 tensors. For higher-rank tensors, use the overloaded version with slice indices. * The step size must be 1 (no gaps allowed in the slice). * Slice bounds are not checked at runtime; accessing out-of-bounds indices will result in undefined behavior. **Constraints:** * Only works with rank-2 tensors. **Parameters:** * ​d0\_slice (`Slice`): Slice specification for the first dimension (rows). Defines the start and end indices for the slice along this dimension. * ​d1\_slice (`Slice`): Slice specification for the second dimension (columns). Defines the start and end indices for the slice along this dimension. **Returns:** A view into the original tensor representing the specified slice. `slice[d0_slice: Slice, d1_slice: Slice, slice_indices: IndexList[2], __offset_dims: Int = (layout.rank() + -2)](self, offsets: IndexList[__offset_dims]) -> LayoutTensor[dtype, _compute_slice_layout(d0_slice, d1_slice, slice_indices.__getitem__[::Indexer](0), slice_indices.__getitem__[::Indexer](1)), origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Extract a 2D slice from a higher-rank tensor at specific indices. This method creates a view into a 2D subset of a higher-rank tensor by: 1. Selecting two dimensions to slice using the slice\_indices parameter. 2. Applying slice specifications to those dimensions. 3. Using fixed offsets for all other dimensions. Example: Given a 3×4×5 tensor, `t`, the following example extracts a 2×2 slice from dimensions 0 and 2, with dimension 1 fixed at index 1. ```mojo var s = t.slice[Slice(1, 3), Slice(0, 2), IndexList[2](0, 2)](1) ``` Performance: * Creates a view without copying data, making it very efficient. * Maintains the original tensor's stride information for efficient memory access. * Zero-cost abstraction at runtime when used with compile-time constant slices. Notes: * The slice is a view into the original tensor, so modifications to the slice will affect the original tensor.
* The slice indices must be ordered (e.g., \[0, 2] is valid, \[2, 0] is not). * The step size must be 1 (no gaps allowed in the slice). * Slice bounds are not checked at runtime; accessing out-of-bounds indices will result in undefined behavior. **Constraints:** * Slice step size must be 1 (no gaps). * Slice indices must be ordered (ascending). * Tensor rank must be at least 2. **Parameters:** * ​d0\_slice (`Slice`): Slice specification for the first selected dimension. * ​d1\_slice (`Slice`): Slice specification for the second selected dimension. * ​slice\_indices (`IndexList[2]`): Indices of the two dimensions to slice (must be ordered). * ​\_\_offset\_dims (`Int`): Internal parameter representing number of fixed dimensions. **Args:** * ​offsets (`IndexList[__offset_dims]`): Fixed index values for all dimensions not being sliced. **Returns:** A 2D view into the original tensor representing the specified slice. ### `slice_1d` `slice_1d[d0_slice: Slice, slice_indices: IndexList[1], __offset_dims: Int = (layout.rank() + -1)](self, offsets: IndexList[__offset_dims]) -> LayoutTensor[dtype, _compute_slice_layout(d0_slice, slice_indices.__getitem__[::Indexer](0)), origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Extract a 1D slice from a higher-rank tensor at a specific index. This method creates a view into a 1D subset of a higher-rank tensor by: 1. Selecting one dimension to slice using the slice\_indices parameter 2. Applying a slice specification to that dimension 3. Using fixed offsets for all other dimensions Example: For a 3×4×5 tensor, `t`, the following example extracts a 1D slice from dimension 0, with dimensions 1 and 2 fixed at indices 1 and 2: ```mojo t.slice_1d[Slice(1, 3), IndexList[1](0)](1, 2) ``` Performance: * Creates a view without copying data, making it very efficient. * Maintains the original tensor's stride information for efficient memory access. * Zero-cost abstraction at runtime when used with compile-time constant slices. Notes: * The slice is a view into the original tensor, so modifications to the slice will affect the original tensor. * The step size must be 1 (no gaps allowed in the slice). * Slice bounds are not checked at runtime; accessing out-of-bounds indices will result in undefined behavior. * This function exists as a workaround for compiler limitations with overloading. **Constraints:** * Slice step size must be 1 (no gaps). * Tensor rank must be at least 1. **Parameters:** * ​d0\_slice (`Slice`): Slice specification for the selected dimension. * ​slice\_indices (`IndexList[1]`): Index of the dimension to slice. * ​\_\_offset\_dims (`Int`): Internal parameter representing number of fixed dimensions. **Args:** * ​offsets (`IndexList[__offset_dims]`): Fixed index values for all dimensions not being sliced. **Returns:** A 1D view into the original tensor representing the specified slice. ### `transpose` `transpose[M: Int = shape[::Int](), N: Int = shape[::Int]()](self) -> LayoutTensor[dtype, composition(layout, __init__[::Origin[::Bool(__init__[::Origin[::Bool(IntTuple(N), IntTuple(M), Tuple()), __init__[::Origin[::Bool(IntTuple(M), IntTuple(1), Tuple()))), origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Create a transposed view of a rank-2 tensor. This method creates a view of the tensor with its dimensions swapped, effectively converting rows to columns and columns to rows.
The transposition is performed without copying data, by adjusting the tensor's layout information. Example: For a 2×3 tensor with values: ``` [1 2 3] [4 5 6] ``` `transpose()` will produce a 3×2 tensor: ``` [1 4] [2 5] [3 6] ``` Performance: * Creates a view without copying data, making it very efficient. * The operation is zero-cost at runtime as it only changes the layout information. * Memory access patterns may be less efficient in the transposed view due to non-contiguous memory access, especially for row-major storage. Notes: * The transposed tensor shares the same memory as the original tensor, so modifications to one will affect the other. * Only works with rank-2 tensors. * For optimal performance when repeatedly accessing the transposed data, consider creating a physical copy with the transposed layout. **Constraints:** * Only works with rank-2 tensors. **Parameters:** * ​M (`Int`): The size of the first dimension (rows) of the original tensor. Defaults to the static shape value of the first dimension. * ​N (`Int`): The size of the second dimension (columns) of the original tensor. Defaults to the static shape value of the second dimension. **Returns:** A view of the tensor with dimensions transposed (rows become columns and vice versa). ### `reshape` `reshape[dst_layout: Layout](self) -> LayoutTensor[dtype, dst_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Create a view of the tensor with a different shape. This method creates a view of the tensor with a new shape, without changing the underlying data. The total number of elements must remain the same. Example: Given a 2×6 row-major tensor, `reshape[Layout.col_major(3, 4)]()` produces a 3×4 tensor with the same elements in column-major order. Performance: * Creates a view without copying data, making it very efficient. * The operation is zero-cost at runtime as it only changes the layout information. * Memory access patterns may change, potentially affecting performance depending on the original and target layouts. Notes: * The reshaped tensor shares the same memory as the original tensor, so modifications to one will affect the other. * The total number of elements must remain the same after reshaping. * The reshape operation assumes a row-major (C-style) memory layout. * For tensors with complex strides or non-contiguous memory, reshaping may not produce the expected results. * Masked tensors cannot be reshaped. **Constraints:** * Cannot reshape masked tensors. * The total number of elements must be the same in both layouts. **Parameters:** * ​dst\_layout (`Layout`): The target layout for the reshaped tensor. Must have the same total number of elements as the original tensor. **Returns:** A view of the tensor with the new shape specified by dst\_layout. ### `composition` `composition[rhs_layout: Layout, dst_layout: Layout = composition(layout, rhs_layout)](self) -> LayoutTensor[dtype, dst_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Create a view of the tensor with a composed layout. This method creates a view of the tensor with a new layout that is the composition of the original layout with another layout. Layout composition allows for complex transformations of the tensor's logical structure without copying data. 
Example: For a 4×4 tensor with a standard row-major layout, composing with a layout that represents a 2×2 tiling would result in a tensor that logically views the data as 2×2 blocks. Performance: * Creates a view without copying data, making it very efficient. * The operation is zero-cost at runtime as it only changes the layout information. * Can be used to optimize memory access patterns for specific algorithms. Notes: * The composed tensor shares the same memory as the original tensor, so modifications to one will affect the other. * Layout composition is a powerful tool for expressing complex data transformations like tiling, transposition, and reshaping in a unified framework. * Understanding the mathematical properties of layout composition is important for correctly using this function. **Constraints:** * The layouts must be compatible for composition. * The total number of elements must remain the same after composition. **Parameters:** * ​rhs\_layout (`Layout`): The layout to compose with the tensor's current layout. * ​dst\_layout (`Layout`): The resulting layout after composition. Defaults to the composition of the tensor's layout with rhs\_layout. **Returns:** A view of the tensor with the composed layout. ### `distance` `distance(self, addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space]) -> SIMD[linear_idx_type, 1]` Calculate the element-wise distance between this tensor's pointer and another pointer. This method computes the number of elements (not bytes) between the tensor's pointer and the provided address. This is useful for determining offsets within a larger memory allocation or for pointer arithmetic operations. Example: If `tensor.ptr` points to an element at index 100 in a buffer, and `addr` points to element at index 50, then `distance(addr)` returns 50. Performance: * This is a lightweight operation that only involves pointer arithmetic. * The operation is optimized based on the address space, using smaller integer types for shared memory to improve efficiency. Notes: * The distance is calculated in elements, not bytes. * The result can be positive or negative depending on the relative positions of the pointers. * This function is particularly useful for GPU programming where understanding memory offsets is critical for performance. * Care should be taken when using this with pointers from different allocations, as the result would be meaningless. **Args:** * ​addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space]`): The target pointer to calculate the distance to. **Returns:** The number of elements between this tensor's pointer and the provided address. The result is of type `linear_idx_type`. `distance[_layout: Layout, _uint_dtype: DType = _get_unsigned_type(_layout, address_space)](self, src: LayoutTensor[dtype, _layout, origin, address_space=address_space]) -> SIMD[_uint_dtype, 1]` Calculate the element-wise distance between this tensor and another tensor. This method computes the number of elements (not bytes) between this tensor's pointer and another tensor's pointer. This is useful for determining the relative positions of tensors within a larger memory allocation. Example: If tensor1 points to element at index 100 in a buffer, and tensor2 points to element at index 50, then `tensor1.distance(tensor2)` would return 50. Performance: * This is a lightweight operation that only involves pointer arithmetic. * The operation is optimized based on the address space and layout, using appropriate integer types for efficiency.
Notes: * The distance is calculated in elements, not bytes. * The result can be positive or negative depending on the relative positions of the tensors. * This function is particularly useful for GPU programming where understanding memory offsets is critical for performance. * Both tensors must be in the same address space for the result to be meaningful. * This overload is more type-safe than the pointer-based version as it ensures the tensors have compatible data types and address spaces. **Parameters:** * ​\_layout (`Layout`): The layout of the source tensor. * ​\_uint\_dtype (`DType`): The unsigned integer type to use for the result. Automatically determined based on the layout and address space. **Args:** * ​src (`LayoutTensor[dtype, _layout, origin, address_space=address_space]`): The source tensor to calculate the distance to. **Returns:** The number of elements between this tensor's pointer and the source tensor's pointer. The result is of type \_uint\_dtype. ### `copy_from` `copy_from(self, other: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Copy data from another tensor to this tensor. This method performs an element-by-element copy from the source tensor to this tensor, respecting the layouts of both tensors. The copy operation handles different memory layouts correctly, ensuring that elements are copied to their proper positions regardless of how the data is arranged in memory. Example: ```mojo from layout import LayoutTensor, Layout var src_storage = InlineArray[Float32, 2 * 3](uninitialized=True) var dst_storage = InlineArray[Float32, 3 * 2](uninitialized=True) var src = LayoutTensor[ DType.float32, Layout([2, 3]), ](src_storage).fill(1.0) var dst = LayoutTensor[ DType.float32, Layout([3, 2]), ](dst_storage) dst.copy_from(src) # Copies all elements from src to dst ``` Performance: * Performs element-by-element copying, which may be less efficient than vectorized or bulk memory operations. * The copy respects the memory layout of both tensors, which may involve non-contiguous memory access patterns. * For optimal performance with large tensors, consider using specialized copy functions that can leverage hardware acceleration. Notes: * Both tensors must have statically known shapes. * The total number of elements must be the same in both tensors. * The element sizes must match between the tensors. * This function handles different memory layouts correctly, making it suitable for copying between tensors with different shapes or strides. * The copy is performed element by element, not as a bulk memory copy. **Args:** * ​other (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor to copy data from. Must have the same total number of elements as this tensor.
### `copy_from_async` `copy_from_async[is_masked: Bool = False, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), fill: Fill = Fill(0), eviction_policy: CacheEviction = CacheEviction(0)](self, src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_idx_bound: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](0), base_offset: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](0))` Asynchronously copy data from another tensor to this tensor using GPU hardware. This method performs an asynchronous copy from the source tensor to this tensor using GPU hardware acceleration. It's specifically designed for copying data from global memory to shared memory in GPU kernels, leveraging hardware-specific asynchronous copy mechanisms for improved performance. For optimal performance, you need to arrange the copy correctly. Use the [`distribute()`](/mojo/kernels/layout/layout_tensor/LayoutTensor/#distribute) method to create thread-local fragments of the source and destination tensors, assigning each thread one or more elements to copy. Optionally, use the [`vectorize()`](/mojo/kernels/layout/layout_tensor/LayoutTensor/#vectorize) method to get vectorized views of both tensors before calling `distribute()`. This allows each thread to copy multiple elements of the tensor. For example: ```mojo var fragment = tensor.vectorize[1, simd_width]().distribute[ thread_layout ](thread_id) ``` The copy operation is asynchronous, so you must call [`async_copy_wait_all()`](/mojo/stdlib/gpu/memory/async_copy_wait_all/) or [`async_copy_wait_group()`](/mojo/stdlib/gpu/memory/async_copy_wait_group/) to ensure the copy has completed before using the data. Example: ```mojo from layout import LayoutTensor, Layout from gpu import thread_idx, block_idx, block_dim, global_idx, grid_dim from gpu.memory import AddressSpace, async_copy_wait_all alias dtype = DType.float32 alias in_size = 128 alias block_size = 16 alias num_blocks = in_size // block_size alias input_layout = Layout.row_major(in_size, in_size) fn kernel(tensor: LayoutTensor[dtype, input_layout, MutableAnyOrigin]): # extract a tile from the input tensor. var global_tile = tensor.tile[block_size, block_size](block_idx.x, block_idx.y) # allocate a shared memory tile alias tile_layout = Layout.row_major(block_size, block_size) var shared_tile = LayoutTensor[ dtype, tile_layout, MutableAnyOrigin, address_space = AddressSpace.SHARED, ].stack_allocation() # Create per-thread tile fragments for copying var tid = thread_idx.y + thread_idx.x * block_dim.x alias thread_layout = Layout.row_major(block_size, block_size) var global_fragment = global_tile.distribute[thread_layout](tid) var shared_fragment = shared_tile.distribute[thread_layout](tid) # async copy to shared memory shared_fragment.copy_from_async(global_fragment) async_copy_wait_all() # ... do something with the shared tile ``` Performance: * Supports vectorized copies for 4, 8, or 16-byte elements for better throughput. * Can bypass L1 cache with appropriate eviction policies for specific access patterns. * Swizzling can improve memory access patterns and reduce bank conflicts. Notes: * For vectorized copies, both tensors must have contiguous element layouts. * Asynchronous copies allow computation to overlap with memory transfers. * A synchronization barrier is required before using the copied data.
**Constraints:**

* Destination must be in shared memory.
* Source and destination data types must match.
* Element size must be 4, 8, or 16 bytes.
* Destination tensor must have a static layout.

**Parameters:**

* is\_masked (`Bool`): Whether to perform a masked copy, where elements outside the `src_idx_bound` are not copied or are filled with zeros.
* swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns.
* fill (`Fill`): Fill policy for elements that are not copied (only used with masked copies).
* eviction\_policy (`CacheEviction`): Cache eviction policy for the source data.

**Args:**

* src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor to copy data from.
* src\_idx\_bound (`SIMD[linear_idx_type, 1]`): For masked copies, the upper bound index for valid source elements.
* base\_offset (`SIMD[linear_idx_type, 1]`): Base offset for swizzling calculations.

### `fill`

`fill[*, use_runtime_layout: Bool = (layout.all_dims_known() ^ True) if (layout.all_dims_known() ^ True) else (layout.size() > ...)](self, val: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`

Fill the entire tensor with a single value.

This method sets all elements of the tensor to the specified value. It works with both statically and dynamically shaped tensors. For statically known layouts, the fill operation is unrolled at compile time. For dynamic layouts, a runtime loop is used. No vectorization is applied, so performance may be suboptimal for large tensors. Consider using hardware-specific fill operations for better performance with large tensors.

This method can be used with tensors of any rank and shape. The fill operation respects the tensor's layout, filling all elements regardless of how they are arranged in memory. For tensors with `element_layout`, all elements within each logical element are filled with the same value.

Example:

```mojo
from layout import Layout, LayoutTensor

def main():
    var storage = InlineArray[Float32, 3 * 4](uninitialized=True)
    var tensor = LayoutTensor[
        DType.float32,
        Layout([3, 4]),
    ](storage).fill(0.0)
    print(tensor)
```

If not using method chaining, you can either reassign the result to the tensor variable, or assign the result to the discard pattern (`_`) to avoid warnings about an unused value:

```mojo
tensor = tensor.fill(0.0)
# or
_ = tensor.fill(0.0)
```

**Parameters:**

* use\_runtime\_layout (`Bool`): Whether to use the runtime layout for filling. Defaults to `True` if the layout is not statically known. If loop bounds are too large, using the runtime layout avoids long compilation times.

**Args:**

* val (`SIMD[dtype, 1]`): The value to fill the tensor with. Must be of the same data type as the tensor.

**Returns:**

The tensor itself (self), allowing for method chaining.

### `__str__`

`__str__(self) -> String`

Convert the tensor to a string representation.

This method converts the tensor to a human-readable string representation by writing its contents to a string. It delegates to the `write_to` method, which formats the tensor appropriately based on its rank and shape.

**Returns:**

A string representation of the tensor.
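As a minimal sketch of capturing that representation (assuming the `String` constructor accepts any `Stringable` value):

```mojo
from layout import Layout, LayoutTensor

def main():
    var storage = InlineArray[Float32, 2 * 2](uninitialized=True)
    var tensor = LayoutTensor[
        DType.float32,
        Layout([2, 2]),
    ](storage).fill(1.0)
    # String(...) invokes __str__, which delegates to write_to.
    var text = String(tensor)
    print(text)
```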
### `write_to` `write_to[W: Writer](self, mut writer: W)` Format and write the tensor's contents to a writer. This method formats the tensor's contents and writes them to the provided writer. For 2D tensors, it formats the output in a 2D grid. For tensors of other ranks, it prints all values in column-major coordinate order. Example: ```mojo from layout import Layout, LayoutTensor def main(): var storage = InlineArray[Float32, 2 * 3](uninitialized=True) var tensor = LayoutTensor[ DType.float32, Layout([2, 3]), ](storage).fill(1.0) print(tensor) # Internally calls `write_to` with a StringWriter ``` Output for a 2×3 tensor: ``` [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]] ``` Notes: * For 2D tensors, the output is formatted as a 2D grid with rows and columns. * For tensors of other ranks, values are printed in column-major coordinate order. * Empty tensors (size 0) produce no output. * This method is used by the `__str__` method to convert the tensor to a string. * The formatting is designed for human readability rather than parsing. * For large tensors, the output may be truncated to avoid excessive output. **Parameters:** * ​W (`Writer`): The writer type that will receive the formatted output. **Args:** * ​writer (`W`): The writer instance to write the formatted output to. --- ## LayoutTensorIter `@register_passable(trivial)` `struct LayoutTensorIter[mut: Bool, //, type: DType, layout: Layout, origin: Origin[mut], /, *, address_space: AddressSpace = AddressSpace(0), alignment: Int = alignof[::DType,__mlir_type.!kgen.target]() if is_nvidia_gpu() else 1, circular: Bool = False, axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), layout_int_type: DType = _get_index_type(address_space), linear_idx_type: DType = _get_index_type(address_space), masked: Bool = False]` Iterator for traversing a memory buffer with a specific layout. `LayoutTensorIter` provides a way to iterate through memory according to a specific layout pattern, constructing layout tensors at each position. This enables efficient traversal of multi-dimensional data structures with custom memory layouts. Notes: The returned layout tensor is NOT vectorized. Users should explicitly vectorize if needed for performance-critical operations. ## Parameters * ​mut (`Bool`): Whether the iterator allows mutation of the underlying data. * ​type (`DType`): The data type of the tensor elements. * ​layout (`Layout`): The memory layout pattern to follow during iteration. * ​origin (`Origin[mut]`): Origin tracking for memory safety. * ​address\_space (`AddressSpace`): The memory address space (`GLOBAL`, `SHARED`, etc.). * ​alignment (`Int`): Memory alignment requirement for the data. * ​circular (`Bool`): Whether iteration wraps around at boundaries. * ​axis (`OptionalReg[Int]`): Optional axis for dimension-specific operations. * ​layout\_int\_type (`DType`): Integer type used for layout indices. * ​linear\_idx\_type (`DType`): Integer type used for indexing into memory. * ​masked (`Bool`): Whether to apply bounds masking during iteration. ## Fields * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory region being iterated, with appropriate type and memory attributes. * ​offset (`SIMD[linear_idx_type, 1]`): Current offset from the base pointer, representing the iterator's position in memory. * ​stride (`SIMD[linear_idx_type, 1]`): Step size between consecutive elements or blocks in memory during iteration. 
* ​bound (`SIMD[linear_idx_type, 1]`): Upper bound of the memory region, limiting the iteration range. * ​runtime\_layout (`RuntimeLayout[layout, element_type=layout_int_type, linear_idx_type=linear_idx_type]`): Runtime representation of the layout pattern used for mapping logical indices to memory locations. * ​dimension\_bound (`SIMD[layout_int_type, 1]`): Boundary value for the current dimension when iterating along a specific axis. * ​idx (`SIMD[linear_idx_type, 1]`): Current logical index position within the iteration sequence. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Aliases ### `layout_uint_type` `alias layout_uint_type = SIMD[layout_int_type, 1]` The unsigned integer type used for layout, based on layout and address space. ### `linear_uint_type` `alias linear_uint_type = SIMD[linear_idx_type, 1]` The unsigned integer type used for indexing into memory. ## Methods ### `__init__` `__init__() -> Self` Initialize an empty iterator. Creates a default iterator with zero values, typically used as a placeholder or default value. `__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], bound: SIMD[linear_idx_type, 1], stride: SIMD[linear_idx_type, 1] = SIMD(layout.size()), offset: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> Self` Initialize an iterator with a pointer and basic parameters. Creates an iterator for a memory region with the specified bounds and stride. **Constraints:** The layout must have all dimensions known at compile time. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the beginning of the memory region. * ​bound (`SIMD[linear_idx_type, 1]`): Upper bound of the memory region. * ​stride (`SIMD[linear_idx_type, 1]`): Step size between consecutive elements (defaults to layout size). * ​offset (`SIMD[linear_idx_type, 1]`): Initial offset from the base pointer. `__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], bound: SIMD[linear_idx_type, 1], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type], stride: SIMD[linear_idx_type, 1] = SIMD(layout.size() if layout.all_dims_known() else -1), offset: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](0), dimension_bound: SIMD[layout_int_type, 1] = __init__[__mlir_type.!pop.int_literal](0), idx: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> Self` Initialize an iterator with a runtime layout. Creates an iterator with a runtime-determined layout, allowing for more flexible memory traversal patterns. **Constraints:** The runtime layout must have the same bitwidth as specified for the iterator. Circular iteration is not supported when an axis is defined. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the beginning of the memory region. * ​bound (`SIMD[linear_idx_type, 1]`): Upper bound of the memory region. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): Layout determined at runtime. * ​stride (`SIMD[linear_idx_type, 1]`): Step size between consecutive elements. * ​offset (`SIMD[linear_idx_type, 1]`): Initial offset from the base pointer. 
* dimension\_bound (`SIMD[layout_int_type, 1]`): Bound for the specified dimension when using masked iteration.
* idx (`SIMD[linear_idx_type, 1]`): Initial index position.

### `__getitem__`

`__getitem__(self) -> LayoutTensor[type, layout, origin, address_space=address_space, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`

Get the layout tensor at the current iterator position.

Operator overload that returns a layout tensor representing the data at the current position of the iterator.

**Returns:**

A layout tensor at the current iterator position.

### `__iadd__`

`__iadd__[T: Intable](mut self, rhs: T)`

Increment the iterator by an integer value.

Advances the iterator by the specified number of positions.

Notes: This function is unsafe. It omits bound checking for performance reasons. The caller must ensure the index doesn't go out of bounds.

**Parameters:**

* T (`Intable`): A type that can be converted to an integer.

**Args:**

* rhs (`T`): The number of positions to advance.

`__iadd__(mut self, rhs: SIMD[linear_idx_type, 1])`

Increment the iterator by an unsigned integer value.

Advances the iterator by the specified number of positions.

Notes: This function is unsafe. It omits bound checking for performance reasons. The caller must ensure the index doesn't go out of bounds.

**Args:**

* rhs (`SIMD[linear_idx_type, 1]`): The number of positions to advance.

### `get`

`get(self) -> LayoutTensor[type, layout, origin, address_space=address_space, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`

Get the layout tensor at the current iterator position.

Returns a layout tensor representing the data at the current position of the iterator.

**Returns:**

A tensor view at the current iterator position with the same type, layout, and memory characteristics as specified by the output parameter.

### `next`

`next[T: Intable](self, rhs: T) -> Self`

Return an iterator pointing to a position ahead by `rhs` steps.

Creates a new iterator that points `rhs` positions ahead of the current one.

**Parameters:**

* T (`Intable`): An integer-convertible type for the step size.

**Args:**

* rhs (`T`): The number of positions to advance.

**Returns:**

A new iterator pointing to the advanced position.

`next(self, rhs: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](1)) -> Self`

Return an iterator pointing to a position ahead by `rhs` steps.

Creates a new iterator that points `rhs` positions ahead of the current one.

**Args:**

* rhs (`SIMD[linear_idx_type, 1]`): The number of positions to advance (defaults to 1).

**Returns:**

A new iterator pointing to the advanced position.

### `next_unsafe`

`next_unsafe(self, rhs: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](1)) -> Self`

Return an iterator pointing to a position ahead by `rhs` steps (unsafe version).

Creates a new iterator that points `rhs` positions ahead of the current one. This is an unsafe version that omits certain checks for performance.

**Constraints:** Cannot be used with masked iterators. The caller must ensure that advancing by `rhs` does not move the iterator out of bounds.

**Args:**

* rhs (`SIMD[linear_idx_type, 1]`): The number of positions to advance (defaults to 1).

**Returns:**

A new iterator pointing to the advanced position.

### `reshape`

`reshape[dst_layout: Layout](self) -> LayoutTensorIter[type, dst_layout, origin, address_space=address_space, alignment=alignment, circular=circular, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]`

Reshape the iterator to a new layout.
This method creates a new iterator with a different layout while preserving the underlying data. The new layout must have the same total size as the original. **Constraints:** * The destination layout must have the same total size as the original. * Both layouts must be contiguous. * Both layouts must have compile-time known dimensions. **Parameters:** * ​dst\_layout (`Layout`): The target layout to reshape to. **Returns:** A new iterator with the specified layout. ### `bitcast` `bitcast[new_type: DType, *, address_space: AddressSpace = address_space, alignment: Int = alignment](self) -> LayoutTensorIter[new_type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]` Reinterpret the iterator's underlying pointer as a different data type. This method performs a bitcast operation, allowing you to view the same memory location as a different data type without copying or converting the data. **Parameters:** * ​new\_type (`DType`): The target data type to cast to. * ​address\_space (`AddressSpace`): The memory address space for the new iterator (defaults to current). * ​alignment (`Int`): Memory alignment requirement for the new iterator (defaults to current). **Returns:** A new LayoutTensorIter with the same layout but different data type. --- ## ThreadScope `@register_passable(trivial)` `struct ThreadScope` Represents the scope of thread operations in GPU programming. This struct defines the scope at which thread operations are performed, particularly for operations like tensor distribution and synchronization. It provides two main scopes: `BLOCK` and `WARP`, which correspond to different levels of thread grouping in GPU programming models. Example: ```mojo from layout.layout_tensor import copy_dram_to_sram, ThreadScope # Distribute tensor at block level (all threads in block participate) copy_dram_to_sram[layout, thread_scope=ThreadScope.BLOCK](dst, src) # Distribute tensor at warp level (only threads in same warp participate) copy_dram_to_sram[layout, thread_scope=ThreadScope.WARP](dst, src) ``` Performance: * WARP scope operations typically have lower synchronization overhead than BLOCK scope operations. * BLOCK scope operations allow coordination across all threads in a block, which is necessary for certain algorithms. * The choice of scope can significantly impact performance and correctness of parallel algorithms. Notes: * The appropriate scope depends on the specific algorithm and hardware. * WARP scope operations may be more efficient for operations that only require coordination within a warp. * BLOCK scope operations are necessary when threads from different warps need to coordinate. * The actual size of a warp or block is hardware-dependent. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `BLOCK` `alias BLOCK = ThreadScope(0)` Represents operations at the thread block level, where all threads in a block participate. ### `WARP` `alias WARP = ThreadScope(1)` Represents operations at the warp level, where only threads within the same warp participate. ## Methods ### `__init__` `@implicit` `__init__(value: Int) -> Self` Initialize a `ThreadScope` with the given integer value. **Args:** * ​value (`Int`): An integer representing the thread scope (0 for `BLOCK`, 1 for `WARP`). ### `__eq__` `__eq__(self, other: Self) -> Bool` Compare two `ThreadScope` objects for equality. 
**Args:** * ​other (`Self`): Another `ThreadScope` object to compare with. **Returns:** True if the thread scopes are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compare two `ThreadScope` objects for inequality. **Args:** * ​other (`Self`): Another `ThreadScope` object to compare with. **Returns:** True if the thread scopes are not equal, False otherwise. ### `__str__` `__str__(self) -> String` Convert the `ThreadScope` to a human-readable string representation. Aborts: If the thread scope has an invalid value. **Returns:** A string representation of the thread scope ("BLOCK" or "WARP"). ### `__int__` `__int__(self) -> Int` Convert the `ThreadScope` to an integer value. **Returns:** The integer value of the thread scope (0 for BLOCK, 1 for WARP). --- ## copy `copy[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), thread_scope: ThreadScope = ThreadScope(0), row_major: Bool = False](dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Synchronously copy data from local memory (registers) to SRAM (shared memory). This function performs a synchronous copy operation from register memory to shared memory in a GPU context, distributing the workload across multiple threads for parallel execution. It's particularly useful for transferring processed data from registers to shared memory for inter-thread communication. Performance: * Distributes the copy workload across multiple threads for parallel execution. * Can use swizzling to optimize memory access patterns and reduce bank conflicts. * Optimized for transferring data from registers to shared memory. * On AMD GPUs, the `row_major` parameter can be used to match the memory access pattern used during prefetching from DRAM to registers. Notes: * The destination tensor must be in `SHARED` address space (SRAM). * The source tensor must be in `LOCAL` address space (registers). * This function is particularly useful in GPU kernels for sharing processed data between threads in the same block. * The `row_major` parameter is specifically designed for AMD GPUs when using a prefetching pattern from DRAM to SRAM via registers. **Constraints:** * Destination tensor must be in SHARED address space. * Source tensor must be in LOCAL address space. * For optimal performance, the thread layout should match the memory access patterns of the tensors. **Parameters:** * ​thread\_layout (`Layout`): Layout defining how threads are organized for the operation. This determines how the workload is distributed among threads. * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. * ​row\_major (`Bool`): Whether to use row-major ordering for the copy operation. This is particularly relevant when prefetching from DRAM to SRAM via registers on AMD GPUs. Defaults to False. 
**Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM). * ​src (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in local memory (registers). --- ## copy_dram_to_local `copy_dram_to_local[src_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_base: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], offset: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}))` Efficiently copy data from global memory (DRAM) to registers for AMD GPUs. This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It utilizes the hardware's buffer\_load intrinsic to efficiently transfer data from global memory to registers while handling bounds checking. The function distributes the copy operation across multiple threads for maximum throughput. Notes: * The offset calculation method significantly impacts performance. Current implementation optimizes for throughput over flexibility. * This function is particularly useful for prefetching data into registers before performing computations, reducing memory access latency. **Constraints:** * Only supported on AMD GPUs. * The destination element layout size must match the SIMD width. * Source fragments must be rank 2 with known dimensions. **Parameters:** * ​src\_thread\_layout (`Layout`): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in register memory (LOCAL address space). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in global memory (DRAM) to be copied. * ​src\_base (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The original global memory tensor from which src is derived. 
This is used to construct the buffer descriptor required by AMD's `buffer_load` intrinsic. * ​offset (`OptionalReg[UInt]`): The offset in the global memory. `copy_dram_to_local[src_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_iter: LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], bounds: SIMD[uint32, 1])` Efficiently copy data from global memory (DRAM) to registers for AMD GPUs. This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It utilizes the hardware's buffer\_load intrinsic to efficiently transfer data from global memory to registers while handling bounds checking. The function distributes the copy operation across multiple threads for maximum throughput. Notes: * The offset calculation method significantly impacts performance. Current implementation optimizes for throughput over flexibility. * This function is particularly useful for prefetching data into registers before performing computations, reducing memory access latency. **Constraints:** * Only supported on AMD GPUs. * The destination element layout size must match the SIMD width. * Source fragments must be rank 2 with known dimensions. **Parameters:** * ​src\_thread\_layout (`Layout`): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in register memory (LOCAL address space). * ​src\_iter (`LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]`): The source tensor iterator. * ​bounds (`SIMD[uint32, 1]`): Bounds of the buffer, based on the ptr of the src\_iter. `copy_dram_to_local[src_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Efficiently copy data from global memory (DRAM) to registers. This function implements an optimized memory transfer operation from global memory to register memory. It distributes the copy operation across multiple threads for maximum throughput while handling bounds checking for safety. **Constraints:** * The source tensor must be in GLOBAL address space (DRAM). 
* The destination tensor must be in LOCAL address space (registers). * Both tensors must have compatible data types. **Parameters:** * ​src\_thread\_layout (`Layout`): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in register memory (LOCAL address space). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in global memory (DRAM). --- ## copy_dram_to_sram `copy_dram_to_sram[src_thread_layout: Layout, dst_thread_layout: Layout = src_thread_layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = src_thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Synchronously copy data from DRAM (global memory) to SRAM (shared memory) in a GPU context. This function performs a synchronous copy operation from global memory (DRAM) to shared memory (SRAM) in a GPU context, distributing the workload across multiple threads for parallel execution. It uses thread affinity mapping to ensure efficient work distribution and supports vectorized memory operations for optimal performance. Performance: * Distributes the copy workload across multiple threads for parallel execution. * Supports vectorized loads and stores for better memory throughput. * Can use swizzling to optimize memory access patterns and reduce bank conflicts. * Thread affinity mapping ensures efficient work distribution. * For masked tensors, performs bounds checking to handle edge cases correctly. Notes: * The source tensor must be in GENERIC or GLOBAL address space (DRAM). * The destination tensor must be in SHARED address space (SRAM). * Both tensors must have the same data type. * This function is synchronous, meaning all threads must complete their copy operations before proceeding. * For optimal performance, the thread layouts should match the memory access patterns of the tensors. * This function is particularly useful in GPU kernels for loading data from global memory to shared memory for faster access. **Constraints:** * Source and destination tensors must have the same data type. * Source tensor must be in GENERIC or GLOBAL address space. * Destination tensor must be in SHARED address space. * For non-masked tensors, the fragment sizes must match. **Parameters:** * ​src\_thread\_layout (`Layout`): Layout defining how threads are organized for the source tensor. This determines how the workload is distributed among threads. 
* ​dst\_thread\_layout (`Layout`): Layout defining how threads are organized for the destination tensor. Defaults to the same as `src_thread_layout` if not specified. * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts. * ​num\_threads (`Int`): Total number of threads participating in the copy operation. Defaults to the size of `src_thread_layout`. * ​thread\_scope (`ThreadScope`): Scope at which thread operations are performed (`BLOCK` or `WARP`). Defaults to `ThreadScope.BLOCK`, where all threads in a block participate. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in global or generic memory (DRAM). `copy_dram_to_sram[src_thread_layout: Layout, dst_thread_layout: Layout = src_thread_layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = src_thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_iter: LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], bound: Int)` Efficiently copy data from global memory (DRAM) to shared memory (SRAM) on AMD GPUs. This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It utilizes the hardware's `buffer_load` intrinsic to efficiently transfer data while handling bounds checking. The function distributes the copy operation across multiple threads for maximum throughput. **Parameters:** * ​src\_thread\_layout (`Layout`): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads. * ​dst\_thread\_layout (`Layout`): The layout used to distribute the destination tensor across threads. Defaults to the same layout as `src_thread_layout`. * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzling pattern to apply when distributing the destination tensor. This can improve memory access patterns and reduce bank conflicts. Defaults to None (no swizzling). * ​num\_threads (`Int`): The total number of threads participating in the copy operation. Defaults to the size of `src_thread_layout`. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in shared memory (SRAM). 
* ​src\_iter (`LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]`): The source tensor iterator in global memory (DRAM) to be copied. * ​bound (`Int`): The bound of the source tensor iterator. `copy_dram_to_sram[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_iter: LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], bound: Int)` Synchronously copy data from DRAM to SRAM using a unified thread layout for AMD GPUs. This is a convenience wrapper around the more general `copy_dram_to_sram()` function that uses the same layout for both source and destination tensors. It's specifically designed for AMD GPUs where the buffer\_load intrinsic requires the original base tensor. Performance: * Simplifies API usage when the same thread layout is appropriate for both source and destination tensors. * Optimized for AMD GPUs using buffer\_load intrinsics for efficient memory transfers. * Distributes the copy workload across multiple threads for parallel execution. Notes: * This function is only supported on AMD GPUs. * The source tensor must be in GENERIC or GLOBAL address space (DRAM). * The destination tensor must be in SHARED address space (SRAM). * Both tensors must have the same data type. **Parameters:** * ​thread\_layout (`Layout`): Layout defining how threads are organized for both source and destination. This determines how the workload is distributed among threads. * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts. * ​num\_threads (`Int`): Total number of threads participating in the copy operation. Defaults to the size of thread\_layout. * ​thread\_scope (`ThreadScope`): Scope at which thread operations are performed (`BLOCK` or `WARP`). Defaults to `BLOCK`, where all threads in a block participate. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM). * ​src\_iter (`LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]`): The source tensor iterator, which must be in global or generic memory (DRAM). * ​bound (`Int`): The bound of the source tensor iterator. 
`copy_dram_to_sram[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])`

Synchronously copy data from DRAM to SRAM using a unified thread layout.

This is a convenience wrapper around the more general `copy_dram_to_sram()` function that uses the same layout for both source and destination tensors. It simplifies the API for the common case where the same thread distribution pattern works well for both tensors.

Performance:

* Simplifies API usage when the same thread layout is appropriate for both source and destination tensors.
* Distributes the copy workload across multiple threads for parallel execution.
* Supports vectorized loads and stores for better memory throughput.
* Can use swizzling to optimize memory access patterns and reduce bank conflicts.

Notes:

* The source tensor must be in `GENERIC` or `GLOBAL` address space (DRAM).
* The destination tensor must be in `SHARED` address space (SRAM).
* Both tensors must have the same data type.
* This function is synchronous, meaning all threads must complete their copy operations before proceeding.

**Parameters:**

* thread\_layout (`Layout`): Layout defining how threads are organized for both source and destination. This determines how the workload is distributed among threads.
* swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts.
* num\_threads (`Int`): Total number of threads participating in the copy operation. Defaults to the size of `thread_layout`.
* thread\_scope (`ThreadScope`): Scope at which thread operations are performed (`BLOCK` or `WARP`). Defaults to `ThreadScope.BLOCK`, where all threads in a block participate.

**Args:**

* dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM).
* src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in global or generic memory (DRAM).
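For example, here is a minimal sketch of this unified-layout overload, patterned after the `copy_from_async` example earlier in this section (the `kernel` function and the tensor and block sizes are illustrative assumptions, not part of the API):

```mojo
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy_dram_to_sram
from gpu import block_idx
from gpu.memory import AddressSpace

alias dtype = DType.float32
alias block_size = 16
alias input_layout = Layout.row_major(128, 128)

fn kernel(tensor: LayoutTensor[dtype, input_layout, MutableAnyOrigin]):
    # Tile of global memory assigned to this thread block.
    var global_tile = tensor.tile[block_size, block_size](
        block_idx.x, block_idx.y
    )

    # Shared-memory destination tile with a static layout.
    alias tile_layout = Layout.row_major(block_size, block_size)
    var shared_tile = LayoutTensor[
        dtype,
        tile_layout,
        MutableAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # One thread per element; the function distributes the
    # workload across the block internally.
    alias thread_layout = Layout.row_major(block_size, block_size)
    copy_dram_to_sram[thread_layout](shared_tile, global_tile)

    # ... compute using shared_tile ...
```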
--- ## copy_dram_to_sram_async `copy_dram_to_sram_async[src_thread_layout: Layout, dst_thread_layout: Layout, swizzle: Bool = False, fill: Fill = Fill(0), eviction_policy: CacheEviction = CacheEviction(0), num_threads: Int = src_thread_layout.size()](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Asynchronously copy data from DRAM (global memory) to SRAM (shared memory) in a GPU context. This function performs an asynchronous copy operation from global memory (DRAM) to shared memory (SRAM) in a GPU context, using NVIDIA's cp.async hardware mechanism. It distributes the workload across multiple threads and allows computation to overlap with memory transfers for improved performance. Performance: * Performs asynchronous transfers, allowing computation to overlap with memory operations. * Distributes the copy workload across multiple threads for parallel execution. * Can use swizzling to optimize memory access patterns and reduce bank conflicts. * Supports different cache eviction policies to optimize memory hierarchy usage. * For masked tensors, performs bounds checking to handle edge cases correctly. Notes: * This function requires NVIDIA GPUs with `cp.async` support (compute capability 8.0+). * The source tensor must be in GENERIC or GLOBAL address space (DRAM). * The destination tensor must be in SHARED address space (SRAM). * Both tensors must have the same data type. * This function is asynchronous, so you must call [`async_copy_wait_all()`](/mojo/stdlib/gpu/memory/async_copy_wait_all/) or [`async_copy_wait_group()`](/mojo/stdlib/gpu/memory/async_copy_wait_group/) to ensure the copy has completed before using the data. * The maximum size of each element that can be copied is 16 bytes. **Constraints:** * Requires NVIDIA GPUs with cp.async support (compute capability 8.0+). * Source tensor must be in `GENERIC` or `GLOBAL` address space. * Destination tensor must be in `SHARED` address space. * Both tensors must have the same data type. * Element size must be 4, 8, or 16 bytes. **Parameters:** * ​src\_thread\_layout (`Layout`): Layout defining how threads are organized for the source tensor. This determines how the workload is distributed among threads. * ​dst\_thread\_layout (`Layout`): Layout defining how threads are organized for the destination tensor. * ​swizzle (`Bool`): Whether to apply swizzling to the destination indices to reduce bank conflicts. Defaults to False. * ​fill (`Fill`): Fill policy for handling out-of-bounds accesses. Options include: * `Fill.NONE`: No special handling (default). * `Fill.ZERO`: Fill out-of-bounds elements with zeros. * ​eviction\_policy (`CacheEviction`): Cache eviction policy for the source data. Options include: * `CacheEviction.EVICT_NORMAL`: Normal eviction (default). * `CacheEviction.EVICT_FIRST`: Evict data after first use. * `CacheEviction.EVICT_LAST`: Keep data in cache until last use. * ​num\_threads (`Int`): Total number of threads participating in the copy operation. Defaults to the size of src\_thread\_layout. 
**Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in global or generic memory (DRAM). `copy_dram_to_sram_async[thread_layout: Layout, swizzle: Bool = False, masked: Bool = False, fill: Fill = Fill(0), eviction_policy: CacheEviction = CacheEviction(0), num_threads: Int = thread_layout.size()](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Asynchronous copy from DRAM to SRAM with thread affinity mapping. This function performs an asynchronous memory transfer from DRAM (global memory) to SRAM (shared memory) using the specified thread layout for distribution. Notes: This is a convenience wrapper around the more general `copy_dram_to_sram_async()` function, using the same thread layout for both source and destination. **Parameters:** * ​thread\_layout (`Layout`): The layout used to distribute work across threads. * ​swizzle (`Bool`): Whether to apply memory access swizzling for better performance. * ​masked (`Bool`): Whether the copy operation should use masking. * ​fill (`Fill`): Fill policy for uninitialized memory regions. * ​eviction\_policy (`CacheEviction`): Cache eviction policy to use during the transfer. * ​num\_threads (`Int`): Number of threads to use for the operation, defaults to the size of `thread_layout`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Destination tensor in SRAM. * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Source tensor in DRAM. --- ## copy_local_to_dram `copy_local_to_dram[dst_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Efficiently copy data from registers (LOCAL) to global memory (DRAM). This function implements a high-performance memory transfer operation from register memory to global memory. It distributes the copy operation across multiple threads for maximum throughput while handling bounds checking for safety. **Constraints:** * The source tensor must be in LOCAL address space (registers). * The destination tensor must be in GENERIC or GLOBAL address space (DRAM). 
* Both tensors must have compatible data types. **Parameters:** * ​dst\_thread\_layout (`Layout`): The layout used to distribute the destination tensor across threads. This determines how the workload is divided among participating threads. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in global memory (DRAM). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in register memory (LOCAL) to be copied. `copy_local_to_dram[dst_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dst_base: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Efficiently copy data from registers (LOCAL) to global memory (DRAM) on AMD GPUs. This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It utilizes the hardware's buffer\_store intrinsic to efficiently transfer data from registers to global memory while handling bounds checking. The function distributes the copy operation across multiple threads for maximum throughput. Notes: * This function is particularly useful for writing computed results from registers back to global memory with minimal latency. * The offset calculation is optimized for performance rather than flexibility. **Constraints:** * Only supported on AMD GPUs. * Destination tensor must be in GLOBAL address space. * Source tensor must be in LOCAL address space. * Data types must match between source and destination tensors. **Parameters:** * ​dst\_thread\_layout (`Layout`): The layout used to distribute the destination tensor across threads. This determines how the workload is divided among participating threads. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in global memory (DRAM). 
* ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in register memory (LOCAL address space) to be copied. * ​dst\_base (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The original global memory tensor from which dst is derived. This is used to construct the buffer descriptor required by AMD's `buffer_store` intrinsic. --- ## copy_local_to_local `copy_local_to_local(dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Synchronously copy data between local memory (register) tensors with type conversion. This function performs a synchronous copy operation between register tensors in a GPU context, with support for converting from float32 to half-precision formats (bfloat16/float16). It's particularly optimized for specific tensor layouts commonly used in matrix multiplication operations. Example: ```mojo from layout import LayoutTensor, Layout from layout.layout_tensor import copy_local_to_local from gpu.memory import AddressSpace fn kernel(): ... var src_reg = LayoutTensor[DType.float32, Layout.row_major(16, 8), MutableAnyOrigin, address_space = AddressSpace.LOCAL, ].stack_allocation().fill(1) var dst_reg = LayoutTensor[DType.bfloat16, Layout.row_major(16, 8), MutableAnyOrigin, address_space = AddressSpace.LOCAL, ].stack_allocation() # Process data in float32 registers # ... # Convert and copy to bfloat16 registers copy_local_to_local(dst_reg, src_reg) ``` Performance: * Optimized for specific 2D tensor layouts with contiguous inner dimensions. * Special fast path for 2D tensors with specific layouts used in matrix multiplication. * For MMA (Matrix Multiply-Accumulate) operations, efficiently handles the conversion between output fragments and input fragments with different layouts. * Falls back to element-wise copy for general cases. Notes: * Both source and destination tensors must be in `LOCAL` address space (registers). * This function currently only supports copying from float32 to half-precision formats. * For 2D tensors with stride\[1] == 1, a specialized fast path is used that's optimized for matrix multiplication patterns. * This function is particularly useful in GPU kernels for converting between different precision formats while keeping data in registers. **Constraints:** * Destination tensor must be in `LOCAL` address space. * Source tensor must be in `LOCAL` address space. * Destination tensor must have a half-precision floating-point data type. * Source tensor must have float32 data type. * Both tensors must have the same total size. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in local memory (registers) and have a half-precision floating-point data type (bfloat16 or float16). 
* ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in local memory (registers) and have float32 data type. --- ## copy_sram_to_dram `copy_sram_to_dram[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = thread_layout.size(), binary_op: OptionalReg[fn[DType, Int](lhs: SIMD[$0, $1], rhs: SIMD[$0, $1]) -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int](lhs: SIMD[$0, $1], rhs: SIMD[$0, $1]) -> SIMD[$0, $1]]({:i1 0, 1})](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Synchronously copy data from SRAM (shared memory) to DRAM (global memory). This function performs a synchronous memory transfer from SRAM (shared memory) to DRAM (global memory) using the specified thread layout for workload distribution. It supports optional swizzling for optimized memory access patterns and binary operations for combining data during the transfer. Performance: * Distributes the copy workload across multiple threads for parallel execution. * Supports vectorized loads and stores for better memory throughput. * Can use swizzling to optimize memory access patterns. * Supports binary operations to combine data during transfer (e.g., for reduction operations). Notes: * The source tensor must be in `SHARED` address space (SRAM). * The destination tensor must be in `GENERIC` or `GLOBAL` address space (DRAM). * Supports FP32 to half-precision downcast during copy if needed. * Handles masked tensors with proper bounds checking. * This function is synchronous, meaning all threads must complete their copy operations before proceeding. **Constraints:** * Source tensor must be in SHARED address space with a static layout. * Destination tensor must be in GENERIC or GLOBAL address space. * For type conversion, only FP32 to half-precision is supported. * For vectorized copy with type conversion, both tensors must have element layouts matching the SIMD width of the destination type. **Parameters:** * ​thread\_layout (`Layout`): Layout defining how threads are organized for both source and destination. This determines how the workload is distributed among threads. * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the source indices, which can improve memory access patterns and reduce bank conflicts. * ​num\_threads (`Int`): Total number of threads participating in the copy operation. Defaults to the size of thread\_layout. * ​binary\_op (`OptionalReg[fn[DType, Int](lhs: SIMD[$0, $1], rhs: SIMD[$0, $1]) -> SIMD[$0, $1]]`): Optional binary operation to apply during the copy, combining source data with existing destination data. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in global or generic memory (DRAM). 
* ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in shared memory (SRAM). --- ## copy_sram_to_local `copy_sram_to_local[src_warp_layout: Layout, axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1})](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Synchronously copy data from SRAM (shared memory) to local memory. This function performs a synchronous memory transfer from SRAM (shared memory) to local memory (registers) using the specified thread layout for workload distribution. Performance: * Distributes the copy workload across multiple threads for parallel execution. * Optimized for transferring data from shared memory to registers. * Supports optional axis-specific distribution for specialized access patterns. **Constraints:** * The source tensor must be in SHARED address space (SRAM). * The destination tensor must be in LOCAL address space (registers). * Both tensors must have the same data type. **Parameters:** * ​src\_warp\_layout (`Layout`): Layout defining how threads are organized for the source tensor. This determines how the workload is distributed among threads. * ​axis (`OptionalReg[Int]`): Optional parameter specifying which axis to distribute along. When provided, distribution happens along the specified axis. When None (default), distribution uses the standard layout pattern. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in local memory (registers). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in shared memory (SRAM). --- ## cp_async_k_major `cp_async_k_major[type: DType, eviction_policy: CacheEviction = CacheEviction(0)](dst: LayoutTensor[type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Asynchronously copy data from DRAM to SRAM using TMA (Tensor Memory Accelerator) with K-major layout. This function performs an asynchronous copy operation from global memory (DRAM) to shared memory (SRAM) using NVIDIA's Tensor Memory Accelerator (TMA) hardware. It optimizes for K-major memory access patterns, which is particularly beneficial for certain tensor operations like matrix multiplications where the inner dimension (K) is accessed contiguously. 
The function automatically determines the optimal tile size and thread distribution based on the tensor shapes and hardware capabilities, leveraging TMA's efficient memory transfer mechanisms. Performance: * Uses TMA hardware acceleration for optimal memory transfer performance. * Optimizes for K-major access patterns, which can significantly improve performance for certain tensor operations like matrix multiplications. * Performs asynchronous transfers, allowing computation to overlap with memory operations. * Automatically determines optimal tile sizes based on tensor dimensions. * Uses hardware-accelerated swizzling to reduce shared memory bank conflicts. Notes: * This function requires NVIDIA GPUs with TMA support (compute capability 9.0+). * The source tensor must be in GENERIC or GLOBAL address space (DRAM). * The destination tensor must be in SHARED address space (SRAM). * Both tensors must have the same data type. * This function is asynchronous, so you must call [`async_copy_wait_all()`](/mojo/stdlib/gpu/memory/async_copy_wait_all/) or [`async_copy_wait_group()`](/mojo/stdlib/gpu/memory/async_copy_wait_group/) to ensure the copy has completed before using the data. * K-major layout is particularly beneficial for matrix multiplication operations where the inner dimension (K) is accessed contiguously. **Constraints:** * Requires NVIDIA GPUs with TMA support (compute capability 9.0+). * Source tensor must be in GENERIC or GLOBAL address space. * Destination tensor must be in SHARED address space. * Both tensors must have the same data type. * Source and destination tensors must be 2D. **Parameters:** * ​type (`DType`): The data type of the tensor elements. * ​eviction\_policy (`CacheEviction`): The cache eviction policy to use. Default is `CacheEviction.EVICT_NORMAL`. **Args:** * ​dst (`LayoutTensor[type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM). * ​src (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in global or generic memory (DRAM). --- ## cp_async_mn_major `cp_async_mn_major[type: DType, eviction_policy: CacheEviction = CacheEviction(0)](dst: LayoutTensor[type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Asynchronously copy data from DRAM to SRAM using TMA (Tensor Memory Accelerator) with MN-major layout. This function performs an asynchronous copy operation from global memory (DRAM) to shared memory (SRAM) using NVIDIA's Tensor Memory Accelerator (TMA) hardware. It optimizes for MN-major memory access patterns, which is particularly beneficial for tensor operations where the outer dimensions (M, N) are accessed contiguously. The function automatically determines the optimal tile size and thread distribution based on the tensor shapes and hardware capabilities, leveraging TMA's efficient memory transfer mechanisms. Performance: * Uses TMA hardware acceleration for optimal memory transfer performance. 
* Optimizes for MN-major access patterns, which can significantly improve performance for certain tensor operations where outer dimensions are accessed contiguously. * Performs asynchronous transfers, allowing computation to overlap with memory operations. * Automatically determines optimal tile sizes based on tensor dimensions. * Uses hardware-accelerated swizzling to reduce shared memory bank conflicts. Notes: * This function requires NVIDIA GPUs with TMA support (compute capability 9.0+). * The source tensor must be in `GENERIC` or `GLOBAL` address space (DRAM). * The destination tensor must be in `SHARED` address space (SRAM). * Both tensors must have the same data type. * This function is asynchronous, so you must call [`async_copy_wait_all()`](/mojo/stdlib/gpu/memory/async_copy_wait_all/) or [`async_copy_wait_group()`](/mojo/stdlib/gpu/memory/async_copy_wait_group/) to ensure the copy has completed before using the data. * MN-major layout is particularly beneficial for operations where the outer dimensions are accessed contiguously, such as certain convolution operations. **Constraints:** * Requires NVIDIA GPUs with TMA support (compute capability 9.0+). * Source tensor must be in `GENERIC` or `GLOBAL` address space. * Destination tensor must be in `SHARED` address space. * Both tensors must have the same data type. * Source and destination tensors must be 2D. **Parameters:** * ​type (`DType`): The data type of the tensor elements. * ​eviction\_policy (`CacheEviction`): The cache eviction policy to use. Default is `CacheEviction.EVICT_NORMAL`. **Args:** * ​dst (`LayoutTensor[type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM). * ​src (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in global or generic memory (DRAM). --- ## layout_tensor Provides the `LayoutTensor` type for representing multidimensional data. ## Aliases ### `binary_op_type` `alias binary_op_type = fn[DType, Int](lhs: SIMD[$0, $1], rhs: SIMD[$0, $1]) -> SIMD[$0, $1]` Type alias for binary operations on SIMD vectors. This type represents a function that takes two SIMD vectors of the same type and width and returns a SIMD vector of the same type and width. **Args:** * ​type (`DType`): The data type of the SIMD vector elements. * ​width (`Int`): The width of the SIMD vector. * ​lhs (`SIMD[type, width]`): Left-hand side SIMD vector operand. * ​rhs (`SIMD[type, width]`): Right-hand side SIMD vector operand. **Returns:** A SIMD vector containing the result of the binary operation. ## Structs * [​`LayoutTensor`](./LayoutTensor): A high-performance tensor with explicit memory layout and hardware-optimized access patterns. * [​`LayoutTensorIter`](./LayoutTensorIter): Iterator for traversing a memory buffer with a specific layout. * [​`ThreadScope`](./ThreadScope): Represents the scope of thread operations in GPU programming. ## Functions * [​`copy`](./copy): Synchronously copy data from local memory (registers) to SRAM (shared memory). * [​`copy_dram_to_local`](./copy_dram_to_local): Efficiently copy data from global memory (DRAM) to registers for AMD GPUs. * [​`copy_dram_to_sram`](./copy_dram_to_sram): Synchronously copy data from DRAM (global memory) to SRAM (shared memory) in a GPU context.
* [​`copy_dram_to_sram_async`](./copy_dram_to_sram_async): Asynchronously copy data from DRAM (global memory) to SRAM (shared memory) in a GPU context. * [​`copy_local_to_dram`](./copy_local_to_dram): Efficiently copy data from registers (LOCAL) to global memory (DRAM). * [​`copy_local_to_local`](./copy_local_to_local): Synchronously copy data between local memory (register) tensors with type conversion. * [​`copy_sram_to_dram`](./copy_sram_to_dram): Synchronously copy data from SRAM (shared memory) to DRAM (global memory). * [​`copy_sram_to_local`](./copy_sram_to_local): Synchronously copy data from SRAM (shared memory) to local memory. * [​`cp_async_k_major`](./cp_async_k_major): Asynchronously copy data from DRAM to SRAM using TMA (Tensor Memory Accelerator) with K-major layout. * [​`cp_async_mn_major`](./cp_async_mn_major): Asynchronously copy data from DRAM to SRAM using TMA (Tensor Memory Accelerator) with MN-major layout. * [​`stack_allocation_like`](./stack_allocation_like): Create a stack-allocated tensor with the same layout as an existing tensor. --- ## stack_allocation_like `stack_allocation_like[layout: Layout, dtype: DType, *, address_space: AddressSpace, target_address_space: AddressSpace = AddressSpace(0)](in_tensor: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=target_address_space, masked=masked]` Create a stack-allocated tensor with the same layout as an existing tensor. This function creates a new tensor on the stack with the same layout, data type, and masking properties as the input tensor, but potentially with a different address space. This is useful for creating temporary tensors that match the structure of existing tensors. Example:

```mojo
from layout import LayoutTensor, Layout
from layout.layout_tensor import stack_allocation_like
from gpu.memory import AddressSpace

var global_tensor = LayoutTensor[
    DType.float32,
    Layout.row_major(10, 10),
    MutableAnyOrigin,
    address_space = AddressSpace.GLOBAL,
].stack_allocation()

var shared_tensor = stack_allocation_like[
    target_address_space = AddressSpace.SHARED
](global_tensor)
```

Performance: * Creates a tensor on the stack, which is typically faster to allocate and access than heap-allocated memory. * Stack allocations have automatic lifetime management, reducing memory management overhead. * Stack size is limited, so be cautious with large tensor allocations. Notes: * The new tensor will have the same layout, data type, and masking properties as the input tensor. * The address space can be changed, which is useful for moving data between different memory regions (e.g., from global to shared memory). * Stack allocations are automatically freed when they go out of scope. * The function uses the stack\_allocation method of the result tensor type. **Parameters:** * ​layout (`Layout`): The layout of the tensor to allocate. * ​dtype (`DType`): The data type of the tensor elements. * ​address\_space (`AddressSpace`): The address space of the input tensor. * ​target\_address\_space (`AddressSpace`): The address space for the new tensor. Defaults to GENERIC. **Args:** * ​in\_tensor (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor to match the layout of.
**Returns:** A new tensor allocated on the stack with the same layout as the input tensor. --- ## math Implements math methods that work on layout tensors. ## Functions * [​`max`](./max): Computes maximum reduction along specified axis. * [​`outer_product_acc`](./outer_product_acc): Updates result tensor with the outer product of two vectors. * [​`sum`](./sum): Computes sum reduction along specified axis. --- ## max `max[axis: Int](inp: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], outp: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Computes maximum reduction along specified axis. Reduces the input tensor by taking maximum elements along the specified axis and stores the result in the output tensor. **Constraints:** All tensors must have statically known shapes. `outp.rank` must equal `inp.rank - 1`. Non-reduction dimensions must match between `inp` and `outp`. Currently only supports rank-2 inputs. **Parameters:** * ​axis (`Int`): The axis to take maximum along. **Args:** * ​inp (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor to reduce. * ​outp (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor to store maximum results. `max[axis: Int](inp: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, _reduce_res_row_major_shape(axis, layout), MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Computes maximum reduction along specified axis, returning a new tensor. Reduces the input tensor by taking maximum elements along the specified axis and returns a new tensor with the results. **Constraints:** All tensors must have statically known shapes. Result will have rank equal to `inp.rank` - 1. Non-reduction dimensions in the result match the input. Currently only supports rank-2 inputs. **Parameters:** * ​axis (`Int`): The axis to take maximum along. **Args:** * ​inp (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor to reduce. **Returns:** A new tensor containing the maximum values along the specified axis. 
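For example, a minimal sketch of this reducing overload, mirroring the `sum` example later in this module (like that example, it relies on the prelude-provided `InlineArray`):

```mojo
from layout import LayoutTensor, Layout
from layout.math import max

fn main():
    var data = InlineArray[Int32, 6](0, 1, 2, 3, 4, 5)
    var tensor = LayoutTensor[DType.int32, Layout.row_major(2, 3)](data)
    # Reduce along axis 0 (down the columns) of the 2x3 tensor.
    print(max[0](tensor))  # expected: 3 4 5
```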
`max[dtype: DType, layout: Layout](x: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], y: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Computes element-wise maximum of two tensors. Returns a new tensor containing the element-wise maximum between the input tensors. **Constraints:** Input tensors must have statically known shapes and matching layouts. **Parameters:** * ​dtype (`DType`): The data type of the input tensors. * ​layout (`Layout`): The layout of the input tensors. **Args:** * ​x (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): First input tensor. * ​y (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Second input tensor. **Returns:** A new tensor containing the element-wise maximum. --- ## outer_product_acc `outer_product_acc(res: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], lhs: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], rhs: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Updates result tensor with the outer product of two vectors. Computes `res += outer(lhs, rhs)` where `lhs` and `rhs` are vectors and `res` is a matrix. **Constraints:** All tensors must have statically known shapes. `res` must be rank 2. `lhs` and `rhs` must be rank 1. `res.shape[0]` `==` `lhs.shape[0]` and `res.shape[1]` `==` `rhs.shape[0]`. **Args:** * ​res (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The result matrix to accumulate into, shape (M, N). * ​lhs (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The left-hand side vector, shape (M,). * ​rhs (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The right-hand side vector, shape (N,). 
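As a minimal sketch of the accumulation (the shapes, fill values, and stack-allocated tensors are illustrative, in the style of the other examples in this document):

```mojo
from layout import LayoutTensor, Layout
from layout.math import outer_product_acc

fn main():
    # res: 2x3 accumulator; lhs: length-2 vector; rhs: length-3 vector.
    var res = LayoutTensor[
        DType.float32, Layout.row_major(2, 3), MutableAnyOrigin
    ].stack_allocation().fill(0)
    var lhs = LayoutTensor[
        DType.float32, Layout.row_major(2), MutableAnyOrigin
    ].stack_allocation().fill(1)
    var rhs = LayoutTensor[
        DType.float32, Layout.row_major(3), MutableAnyOrigin
    ].stack_allocation().fill(2)
    # res += outer(lhs, rhs): each res[i, j] becomes 0 + 1 * 2 = 2.
    outer_product_acc(res, lhs, rhs)
    print(res)
```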
--- ## sum `sum[axis: Int](inp: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], outp: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Computes sum reduction along specified axis. Reduces the input tensor by summing elements along the specified axis and stores the result in the output tensor. Example:

```mojo
from layout import LayoutTensor, Layout
from layout.math import sum

data = InlineArray[Int32, 6](0, 1, 2, 3, 4, 5)
tensor = LayoutTensor[DType.int32, Layout.row_major(2, 3)](data)
print(tensor)
print("-----")
print(sum[0](tensor))
```

Output:

```plaintext
0 1 2
3 4 5
-----
3 5 7
```

**Constraints:** All tensors must have statically known shapes. `outp.rank` must equal `inp.rank - 1`. Non-reduction dimensions must match between `inp` and `outp`. Currently only supports rank-2 inputs. **Parameters:** * ​axis (`Int`): The axis to sum along. **Args:** * ​inp (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor to sum. * ​outp (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor to store sum results. `sum[axis: Int](inp: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, _reduce_res_row_major_shape(axis, layout), MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Computes sum reduction along specified axis, returning a new tensor. Reduces the input tensor by summing elements along the specified axis and returns a new tensor with the results. **Constraints:** All tensors must have statically known shapes. Result will have rank equal to `inp.rank` - 1. Non-reduction dimensions in the result match the input. Currently only supports rank-2 inputs. **Parameters:** * ​axis (`Int`): The axis to sum along. **Args:** * ​inp (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor to sum. **Returns:** A new tensor containing the sum values along the specified axis. --- ## RuntimeLayout `@register_passable(trivial)` `struct RuntimeLayout[layout: Layout, /, *, element_type: DType = int64, linear_idx_type: DType = int64]` A runtime-configurable layout that uses `RuntimeTuple` for storage. This struct provides a layout implementation that can be modified at runtime, unlike the static [`Layout`](/mojo/kernels/layout/layout/Layout) type. It uses [`RuntimeTuple`](/mojo/kernels/layout/runtime_tuple/RuntimeTuple) for shape and stride storage. The layout must have a statically known rank at compile time, but the actual shape and stride values can be modified during execution.
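For example, a sketch of binding a runtime shape to a rank-2 row-major layout (this assumes `UNKNOWN_VALUE` is importable from `layout` to mark dynamic dimensions, and `IndexList` from `utils`):

```mojo
from layout import Layout, UNKNOWN_VALUE
from layout.runtime_layout import RuntimeLayout
from utils import IndexList

fn main():
    # The rank (2-D, row-major) is static; the dimension values are not.
    alias static_layout = Layout.row_major(UNKNOWN_VALUE, UNKNOWN_VALUE)
    # Bind the actual 2 x 8 shape at run time.
    var layout = RuntimeLayout[static_layout].row_major(IndexList[2](2, 8))
    print(layout.size())  # 16
    print(layout.dim(1))  # 8
```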
## Parameters * ​layout (`Layout`): The static `Layout` type to base this runtime layout on. * ​element\_type (`DType`): The integer type of each dimension element. Must be signed. * ​linear\_idx\_type (`DType`): The integer type of the linear index into memory returned by `crd2idx`. Must be signed. ## Fields * ​shape (`RuntimeTuple[layout.shape, element_type=element_type]`): The shape of the layout as a runtime tuple. Stores the size of each dimension using the specified `element_type`. Must match the static layout's shape dimensions. * ​stride (`RuntimeTuple[layout.stride, element_type=linear_idx_type]`): The stride of the layout as a runtime tuple. Stores the stride (step size) for each dimension using the specified `linear_idx_type`, which is 64-bit by default since strides can be large values. Must match the static layout's stride dimensions. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Initialize a `RuntimeLayout` with default values. Creates a new `RuntimeLayout` instance with default shape and stride values. Requires that the static layout has known dimensions at compile time. **Constraints:** The static layout that this runtime layout is based on must have all dimensions known. `__init__(shape: RuntimeTuple[layout.shape, element_type=element_type], stride: RuntimeTuple[layout.stride, element_type=linear_idx_type]) -> Self` Initialize a `RuntimeLayout` with specified shape and stride. **Args:** * ​shape (`RuntimeTuple[layout.shape, element_type=element_type]`): A `RuntimeTuple` containing the dimensions of each axis. * ​stride (`RuntimeTuple[layout.stride, element_type=linear_idx_type]`): A `RuntimeTuple` containing the stride values for each axis. ### `__call__` `__call__(self, idx: Int) -> SIMD[linear_idx_type, 1]` Convert a single index to a flat linear index. **Args:** * ​idx (`Int`): The one-dimensional index to convert. **Returns:** The corresponding flat linear index in the layout. `__call__[: ImmutableOrigin, //, t: IntTuple[$0]](self, idx: RuntimeTuple[t, element_type=element_type]) -> SIMD[linear_idx_type, 1]` Convert a multi-dimensional index to a flat linear index. **Parameters:** * ​t (`IntTuple[$0]`): The `IntTuple` type for the index. **Args:** * ​idx (`RuntimeTuple[t, element_type=element_type]`): A `RuntimeTuple` containing the multi-dimensional coordinates. **Returns:** The corresponding flat linear index in the layout. ### `idx2crd` `idx2crd[: ImmutableOrigin, //, t: IntTuple[$0]](self, idx: RuntimeTuple[t, element_type=element_type]) -> RuntimeTuple[idx2crd[::Origin[::Bool(t, layout.shape, layout.stride), element_type=element_type]` Converts a linear index to logical coordinates. This is the inverse operation of the `__call__` method, mapping from a memory index back to the corresponding logical coordinates. **Parameters:** * ​t (`IntTuple[$0]`): The `IntTuple` type for the index. **Args:** * ​idx (`RuntimeTuple[t, element_type=element_type]`): The linear index to convert. **Returns:** The logical coordinates corresponding to the given index. ### `size` `size(self) -> Int` Calculate the total number of elements in the layout. **Returns:** The product of all dimensions in the shape, representing the total number of elements that can be addressed by this layout. ### `bound_check_required` `bound_check_required(self) -> Bool` Determine if bounds checking is required for this layout.
**Returns:** True if any dimension in the shape differs from the static layout's shape, False otherwise. ### `cast` `cast[element_type: DType, /, *, linear_idx_type: DType = linear_idx_type](self) -> RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]` Cast the layout to use a different element bitwidth. **Parameters:** * ​element\_type (`DType`): The target data type. * ​linear\_idx\_type (`DType`): The target linear idx type. **Returns:** A new `RuntimeLayout` with the shape cast to the specified type. ### `__str__` `__str__(self) -> String` Convert the layout to a string representation. **Returns:** A string representation of the layout. ### `row_major` `static row_major[rank: Int, //](shape: IndexList[rank, element_type=element_type]) -> Self` Create a row-major layout from the given shape. In row-major layout, elements with adjacent rightmost indices are adjacent in memory. **Parameters:** * ​rank (`Int`): The number of dimensions in the layout. **Args:** * ​shape (`IndexList[rank, element_type=element_type]`): An `IndexList` containing the dimensions of each axis. **Returns:** A `RuntimeLayout` with row-major stride ordering. ### `col_major` `static col_major[rank: Int, //](shape: IndexList[rank, element_type=element_type]) -> Self` Create a column-major layout from the given shape. In column-major layout, elements with adjacent leftmost indices are adjacent in memory. **Parameters:** * ​rank (`Int`): The number of dimensions in the layout. **Args:** * ​shape (`IndexList[rank, element_type=element_type]`): An `IndexList` containing the dimensions of each axis. **Returns:** A `RuntimeLayout` with column-major stride ordering. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Write a string representation of the layout to a writer. **Parameters:** * ​W (`Writer`): The `Writer` type. **Args:** * ​writer (`W`): The `Writer` object to write the layout representation to. ### `sublayout` `sublayout[i: Int](self) -> RuntimeLayout[layout[i], element_type=element_type, linear_idx_type=linear_idx_type]` Extract a nested sublayout at the specified index. **Parameters:** * ​i (`Int`): The index of the nested layout to extract. **Returns:** A `RuntimeLayout` representing the nested layout at index i. ### `dim` `dim(self, i: Int) -> Int` Get the size of the dimension at the specified index. **Args:** * ​i (`Int`): The index of the dimension to retrieve. **Returns:** The size of the dimension at index `i`. ### `__len__` `static __len__() -> Int` Get the number of dimensions in the layout. **Returns:** The number of dimensions (rank) of the layout. --- ## coalesce `coalesce[l: Layout, keep_rank: Bool = False](layout: RuntimeLayout[l, element_type=element_type, linear_idx_type=linear_idx_type]) -> RuntimeLayout[coalesce(l, keep_rank), element_type=element_type, linear_idx_type=linear_idx_type]` Coalesce adjacent dimensions in a runtime layout when possible. This optimizes the layout by merging adjacent dimensions when their relationship allows it, potentially reducing the number of dimensions. **Parameters:** * ​l (`Layout`): The static layout type to coalesce. * ​keep\_rank (`Bool`): Whether to maintain the original rank (currently unsupported). **Args:** * ​layout (`RuntimeLayout[l, element_type=element_type, linear_idx_type=linear_idx_type]`): The input `RuntimeLayout` to coalesce. **Returns:** A new `RuntimeLayout` with coalesced dimensions. --- ## runtime_layout Provides the `RuntimeLayout` type and functions for working with it. 
You can use `RuntimeLayout` to define a layout where the dimensions are not known at compile time. You can import these APIs from `layout.runtime_layout`.

```mojo
from layout.runtime_layout import RuntimeLayout, make_layout
```

## Structs * [​`RuntimeLayout`](./RuntimeLayout): A runtime-configurable layout that uses `RuntimeTuple` for storage. ## Functions * [​`coalesce`](./coalesce): Coalesce adjacent dimensions in a runtime layout when possible. * [​`make_layout`](./make_layout): Combine two runtime layouts into a single composite layout. --- ## make_layout `make_layout[l1: Layout, l2: Layout, /, *, linear_idx_type: DType = uint64](a: RuntimeLayout[l1, element_type=element_type, linear_idx_type=linear_idx_type], b: RuntimeLayout[l2, element_type=element_type, linear_idx_type=linear_idx_type]) -> RuntimeLayout[make_layout(l1, l2), element_type=element_type, linear_idx_type=linear_idx_type]` Combine two runtime layouts into a single composite layout. This creates a new layout by concatenating the dimensions and strides of the input layouts. **Parameters:** * ​l1 (`Layout`): The static layout type of `a`. * ​l2 (`Layout`): The static layout type of `b`. * ​linear\_idx\_type (`DType`): The integer type of all indices calculated by the returned runtime layout. **Args:** * ​a (`RuntimeLayout[l1, element_type=element_type, linear_idx_type=linear_idx_type]`): The first `RuntimeLayout` to combine. * ​b (`RuntimeLayout[l2, element_type=element_type, linear_idx_type=linear_idx_type]`): The second `RuntimeLayout` to combine. **Returns:** A new `RuntimeLayout` with dimensions from both input layouts. --- ## RuntimeTuple `@register_passable(trivial)` `struct RuntimeTuple[origin: ImmutableOrigin, //, S: IntTuple[origin] = IntTuple(-1), /, *, element_type: DType = int64]` A struct representing tuple-like data with compile-time and runtime elements. RuntimeTuple combines static (compile-time) and dynamic (runtime) handling of tuple-like data structures, typically used for tensor shapes, indices, and coordinates in high-performance computing contexts. This struct is optimized for parallel execution and hardware acceleration, allowing efficient manipulation of multi-dimensional data. It supports both known compile-time values and runtime-determined values. ## Parameters * ​origin (`ImmutableOrigin`): The origin corresponding to the `IntTuple`. * ​S (`IntTuple[origin]`): `IntTuple` with compile-time known values (or `UNKNOWN_VALUE` for runtime values). * ​element\_type (`DType`): Integer type of the underlying elements. ## Fields * ​value (`IndexList[len[::Sized](flatten[::Origin[::Bool(S)), element_type=element_type]`): Storage for the actual tuple values, implemented as an IndexList with the appropriate size and element type. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Intable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `scalar_length` `alias scalar_length = len[::Sized](flatten[::Origin[::Bool(S))` The total number of scalar elements in this RuntimeTuple after flattening nested tuples. ## Methods ### `__init__` `__init__() -> Self` Initialize a `RuntimeTuple` with default values. For dimensions with known compile-time values in S, uses those values. For unknown dimensions, initializes them to UNKNOWN\_VALUE. `@implicit` `__init__(*values: Int) -> Self` Initialize a `RuntimeTuple` with the provided values. **Args:** * ​\*values (`Int`): Variadic number of integer values to initialize the tuple with.
`@implicit` `__init__[l: Int](values: IndexList[l, element_type=element_type]) -> Self` Initialize a `RuntimeTuple` from an `IndexList`. **Parameters:** * ​l (`Int`): Compile-time length of the input `IndexList`. **Args:** * ​values (`IndexList[l, element_type=element_type]`): `IndexList` to initialize from. Must have same length as the `RuntimeTuple`. The values will be cast to the appropriate element type if needed. ### `__getitem__` `__getitem__[i: Int](self) -> RuntimeTuple[S[i], element_type=element_type]` Retrieves the element at the specified index in the tuple. This method provides array-like indexing for RuntimeTuple, allowing access to individual elements or sub-tuples. It handles the internal offset calculation to access the correct elements in the flattened storage array. **Parameters:** * ​i (`Int`): The index of the element to retrieve. **Returns:** A new `RuntimeTuple` containing the element or sub-tuple at the specified index. ### `__setitem__` `__setitem__[i: Int](mut self, val: SIMD[element_type, 1])` Sets the value of the element at the specified index in the tuple. This method enables array-like assignment for RuntimeTuple elements, handling the internal offset calculation to modify the correct element in the flattened storage array. **Parameters:** * ​i (`Int`): The index of the element to modify. **Args:** * ​val (`SIMD[element_type, 1]`): The new value to assign to the element. ### `offset_until` `static offset_until[i: Int]() -> Int` Calculates the offset in the flattened value array for a given tuple index. This method computes the sum of lengths of all flattened subtuple elements that come before the specified index, which is used for indexing into the internal storage. **Parameters:** * ​i (`Int`): The tuple index to calculate the offset for. **Returns:** The offset in the flattened array where the i-th element begins. ### `get_int` `get_int(self) -> SIMD[element_type, 1]` Returns the integer value of this RuntimeTuple. For tuples with a known compile-time value, returns that value. For tuples with a runtime value, returns the first element of the internal storage array. **Returns:** The integer value of this RuntimeTuple. ### `__str__` `__str__(self) -> String` Converts the RuntimeTuple to its string representation. This method provides a human-readable string representation of the tuple, which is useful for debugging and logging. **Returns:** A string representation of the `RuntimeTuple`. ### `concat` `concat[: ImmutableOrigin, //, R: IntTuple[$0]](self, rhs: RuntimeTuple[R, element_type=element_type]) -> RuntimeTuple[concat[::Origin[::Bool(S, R), element_type=element_type]` Concatenates two `RuntimeTuple`s together. This method combines the current `RuntimeTuple` with another one, preserving both compile-time and runtime values. It handles the complexity of merging the underlying storage arrays while maintaining the proper semantic structure. **Parameters:** * ​R (`IntTuple[$0]`): The `IntTuple` type parameter of the right-hand side RuntimeTuple. **Args:** * ​rhs (`RuntimeTuple[R, element_type=element_type]`): The `RuntimeTuple` to concatenate to the end of this one. **Returns:** A new `RuntimeTuple` containing all elements from both tuples in sequence. ### `flatten` `flatten(self) -> RuntimeTuple[flatten[::Origin[::Bool(S), element_type=element_type]` Flattens a potentially nested `RuntimeTuple` into a single-level tuple. 
This method converts a hierarchical structure of tuples into a flat representation, preserving all values but removing the nested structure. This is useful for operations that need to treat all elements uniformly. **Returns:** A new `RuntimeTuple` containing all elements in a flat (non-nested) structure. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the RuntimeTuple to a Writer object. This method is used by the string conversion system to generate a string representation of the RuntimeTuple. It handles both scalar values and nested tuple structures, producing a properly formatted output. **Parameters:** * ​W (`Writer`): The Writer type to use for output. **Args:** * ​writer (`W`): The Writer object to write the string representation to. ### `__len__` `__len__(self) -> Int` Returns the length (number of top-level elements) of the `RuntimeTuple`. This method provides the standard Python-like len() functionality, giving the number of elements at the top level of the tuple structure. For nested tuples, this returns the number of first-level entries, not the total number of scalar values. **Returns:** The number of top-level elements in the tuple. ### `cast` `cast[type: DType](self) -> RuntimeTuple[S, element_type=type]` Casts the RuntimeTuple to use a different numeric type. This method creates a new RuntimeTuple with the same structure and values but using a different underlying numeric type for storage. This is useful for changing precision or signedness of the data. **Parameters:** * ​type (`DType`): The target DType to cast the elements to. **Returns:** A new `RuntimeTuple` with elements cast to the specified type. ### `__int__` `__int__(self) -> Int` Converts the RuntimeTuple to an integer value. This method enables implicit conversion of a RuntimeTuple to an integer, but is constrained to only work on scalar tuples (those that contain a single value). **Returns:** The integer value of the tuple. --- ## concat `concat(owned lhs: IntTuple[origin], rhs: IntTuple[origin]) -> IntTuple` Concatenates two `IntTuple` instances into a single `IntTuple`. This function appends all elements from the right-hand side tuple to the left-hand side tuple, creating a new combined tuple. The operation preserves the hierarchical structure of both tuples. **Args:** * ​lhs (`IntTuple[origin]`): The left-hand side `IntTuple` that will be modified (owned parameter). * ​rhs (`IntTuple[origin]`): The right-hand side `IntTuple` whose elements will be appended. **Returns:** A new `IntTuple` containing all elements from both tuples in sequence. --- ## crd2idx `crd2idx[: ImmutableOrigin, : ImmutableOrigin, : ImmutableOrigin, //, crd_t: IntTuple[$2], shape_t: IntTuple[$1], stride_t: IntTuple[$0], out_type: DType = uint64](crd: RuntimeTuple[crd_t, element_type=element_type], shape: RuntimeTuple[shape_t, element_type=element_type], stride: RuntimeTuple[stride_t, element_type=element_type]) -> SIMD[out_type, 1]` Converts multi-dimensional coordinates to a linear index. This function is the inverse of idx2crd, transforming a set of coordinates into a flat index based on the provided shape and stride information. This is essential for mapping multi-dimensional tensor elements to linear memory. **Parameters:** * ​crd\_t (`IntTuple[$2]`): Type of the coordinates. * ​shape\_t (`IntTuple[$1]`): Type of the shape. * ​stride\_t (`IntTuple[$0]`): Type of the stride. * ​out\_type (`DType`): The output data type for the index (default: uint64). 
**Args:** * ​crd (`RuntimeTuple[crd_t, element_type=element_type]`): The coordinates to convert. * ​shape (`RuntimeTuple[shape_t, element_type=element_type]`): The shape of the multi-dimensional array. * ​stride (`RuntimeTuple[stride_t, element_type=element_type]`): The stride values for each dimension. **Returns:** A scalar value representing the linear index corresponding to the given coordinates. --- ## idx2crd `idx2crd[: ImmutableOrigin, : ImmutableOrigin, : ImmutableOrigin, //, idx_t: IntTuple[$2], shape_t: IntTuple[$1], stride_t: IntTuple[$0]](idx: RuntimeTuple[idx_t, element_type=element_type], shape: RuntimeTuple[shape_t, element_type=element_type], stride: RuntimeTuple[stride_t, element_type=element_type]) -> RuntimeTuple[idx2crd[::Origin[::Bool(idx_t, shape_t, stride_t), element_type=element_type]` Converts a linear index to multi-dimensional coordinates. This function transforms a flat index into coordinate values based on the provided shape and stride information. This is essential for mapping linear memory accesses to multi-dimensional tensor elements. **Constraints:** The index must be a scalar value (not a tuple). **Parameters:** * ​idx\_t (`IntTuple[$2]`): IntTuple type of the index. * ​shape\_t (`IntTuple[$1]`): IntTuple type of the shape. * ​stride\_t (`IntTuple[$0]`): IntTuple type of the stride. **Args:** * ​idx (`RuntimeTuple[idx_t, element_type=element_type]`): The linear index to convert. * ​shape (`RuntimeTuple[shape_t, element_type=element_type]`): The shape of the multi-dimensional array. * ​stride (`RuntimeTuple[stride_t, element_type=element_type]`): The stride values for each dimension. **Returns:** A `RuntimeTuple` containing the multi-dimensional coordinates. `idx2crd[: ImmutableOrigin, : ImmutableOrigin, //, idx_t: IntTuple[$1], shape_t: IntTuple[$0]](idx: RuntimeTuple[idx_t, element_type=element_type], shape: RuntimeTuple[shape_t, element_type=element_type]) -> RuntimeTuple[idx2crd[::Origin[::Bool(idx_t, shape_t, prefix_product[::Origin[::Bool(shape_t)), element_type=element_type]` Converts a linear index to multi-dimensional coordinates using shape-derived strides. This is a convenience overload of `idx2crd` that automatically calculates the stride values from the shape using `prefix_product`. This is the common case for row-major storage order tensors. **Parameters:** * ​idx\_t (`IntTuple[$1]`): IntTuple type of the index. * ​shape\_t (`IntTuple[$0]`): IntTuple type of the shape. **Args:** * ​idx (`RuntimeTuple[idx_t, element_type=element_type]`): The linear index to convert. * ​shape (`RuntimeTuple[shape_t, element_type=element_type]`): The shape of the multi-dimensional array. **Returns:** A `RuntimeTuple` containing the multi-dimensional coordinates calculated using automatically derived strides from the shape. --- ## runtime_tuple Provides the `RuntimeTuple` data structure and related utility functions for handling tuple-like data with both compile-time and runtime elements. `RuntimeTuple` is designed for high-performance tensor operations, supporting efficient manipulation of multi-dimensional data structures like shapes, indices, and coordinates. 
Key features: * Hybrid compile-time/runtime value handling * Optimized for parallel execution and hardware acceleration * Support for nested tuple structures * Efficient conversion between linear indices and multi-dimensional coordinates * Specialized operations for tensor shape calculations The module includes functions for tuple manipulation (concatenation, flattening), coordinate transformations (`idx2crd`, `crd2idx`), and specialized tensor operations like shape division and prefix products. ## Structs * [​`RuntimeTuple`](./RuntimeTuple): A struct representing tuple-like data with compile-time and runtime elements. RuntimeTuple combines static (compile-time) and dynamic (runtime) handling of tuple-like data structures, typically used for tensor shapes, indices, and coordinates in high-performance computing contexts. This struct is optimized for parallel execution and hardware acceleration, allowing efficient manipulation of multi-dimensional data. It supports both known compile-time values and runtime-determined values. ## Functions * [​`concat`](./concat): Concatenates two `IntTuple` instances into a single `IntTuple`. * [​`crd2idx`](./crd2idx): Converts multi-dimensional coordinates to a linear index. * [​`idx2crd`](./idx2crd): Converts a linear index to multi-dimensional coordinates. This function transforms a flat index into coordinate values based on the provided shape and stride information. This is essential for mapping linear memory accesses to multi-dimensional tensor elements. * [​`is_int`](./is_int): Determines if a `RuntimeTuple` represents a scalar integer value. * [​`is_tuple`](./is_tuple): Determines if a `RuntimeTuple` represents a tuple rather than a scalar value. * [​`prefix_product`](./prefix_product): Computes the prefix products of elements in the `RuntimeTuple`. * [​`product`](./product): Computes the product of all elements in the `RuntimeTuple`. * [​`shape_div`](./shape_div): Performs specialized shape division between `RuntimeTuple`s. * [​`signum`](./signum): Returns the sign of an integer value. --- ## is_int `is_int[: ImmutableOrigin, //, t: IntTuple[$0]](tuple: RuntimeTuple[t, element_type=element_type]) -> Bool` Determines if a `RuntimeTuple` represents a scalar integer value. This function checks if the `RuntimeTuple` holds a single scalar value rather than a tuple structure with multiple elements. **Parameters:** * ​t (`IntTuple[$0]`): The IntTuple type parameter of the RuntimeTuple. **Args:** * ​tuple (`RuntimeTuple[t, element_type=element_type]`): The `RuntimeTuple` to check. **Returns:** True if the `RuntimeTuple` represents a scalar integer, False otherwise. --- ## is_tuple `is_tuple[: ImmutableOrigin, //, t: IntTuple[$0]](tuple: RuntimeTuple[t, element_type=element_type]) -> Bool` Determines if a `RuntimeTuple` represents a tuple rather than a scalar value. This function checks the structure of the underlying IntTuple to determine if it represents a tuple with multiple elements or a single scalar value. **Parameters:** * ​t (`IntTuple[$0]`): The IntTuple type parameter of the RuntimeTuple. **Args:** * ​tuple (`RuntimeTuple[t, element_type=element_type]`): The `RuntimeTuple` to check. **Returns:** True if the `RuntimeTuple` represents a tuple, False if it represents a scalar. --- ## prefix_product `prefix_product[: ImmutableOrigin, //, t: IntTuple[$0]](tuple: RuntimeTuple[t, element_type=element_type]) -> RuntimeTuple[prefix_product[::Origin[::Bool(t)]` Computes the prefix products of elements in the `RuntimeTuple`. 
This function calculates the running product of elements, where each output element is the product of all previous elements in the input. This is commonly used in tensor computations to calculate stride values. **Parameters:** * ​t (`IntTuple[$0]`): The IntTuple type parameter of the input RuntimeTuple. **Args:** * ​tuple (`RuntimeTuple[t, element_type=element_type]`): The input `RuntimeTuple`. **Returns:** A new `RuntimeTuple` containing the prefix products of the input elements. --- ## product `product[: ImmutableOrigin, //, t: IntTuple[$0]](tuple: RuntimeTuple[t, element_type=element_type]) -> Int` Computes the product of all elements in the `RuntimeTuple`. This function multiplies all scalar values in the tuple, including those in nested tuples after flattening. This is commonly used to calculate the total size of a tensor from its shape. **Parameters:** * ​t (`IntTuple[$0]`): The IntTuple type parameter of the input RuntimeTuple. **Args:** * ​tuple (`RuntimeTuple[t, element_type=element_type]`): The input `RuntimeTuple`. **Returns:** The product of all scalar elements in the tuple. --- ## shape_div `shape_div[: ImmutableOrigin, : ImmutableOrigin, //, a_t: IntTuple[$1], b_t: IntTuple[$0]](a: RuntimeTuple[a_t, element_type=element_type], b: RuntimeTuple[b_t, element_type=element_type]) -> RuntimeTuple[shape_div[::Origin[::Bool(a_t, b_t)]` Performs specialized shape division between `RuntimeTuple`s. This function implements a special division operation specifically designed for tensor shape calculations. Unlike standard division, it handles special cases: 1. If shapes are directly divisible (a % b == 0), returns a standard division (a // b) 2. If shapes are inversely divisible (b % a == 0), returns the signed reciprocal 3. If shapes are incompatible, aborts with an error This operation is essential for transformations between tensor layouts and computing broadcasting semantics. **Parameters:** * ​a\_t (`IntTuple[$1]`): Type of the first operand. * ​b\_t (`IntTuple[$0]`): Type of the second operand. **Args:** * ​a (`RuntimeTuple[a_t, element_type=element_type]`): The dividend `RuntimeTuple`. * ​b (`RuntimeTuple[b_t, element_type=element_type]`): The divisor `RuntimeTuple`. **Returns:** A new `RuntimeTuple` containing the result of the shape division. --- ## signum `signum(a: Int) -> Int` Returns the sign of an integer value. This helper function determines whether a number is positive, negative, or zero, returning 1 for positive, -1 for negative, and 0 for zero. **Args:** * ​a (`Int`): The integer value to determine the sign of. **Returns:** 1 if a > 0, -1 if a < 0, 0 if a == 0. --- ## ComposedLayout `struct ComposedLayout[LayoutA: LayoutTrait, LayoutB: LayoutTrait, offset: OptionalReg[Int] = OptionalReg[Int]({:@stdlib::@builtin::@int::@Int {0}, 0})]` Layout composed of two layouts applied sequentially. Combines two layouts. Output of the first (`LayoutA`) is input to the second (`LayoutB`), with optional offset in between. ## Parameters * ​LayoutA (`LayoutTrait`): The first layout to apply. * ​LayoutB (`LayoutTrait`): The second layout to apply. * ​offset (`OptionalReg[Int]`): Optional offset between layouts (default: 0). ## Fields * ​layout\_a (`LayoutA`): The first layout to apply. * ​layout\_b (`LayoutB`): The second layout to apply.
## Implemented traits `AnyType`, `Copyable`, `LayoutTrait`, `UnknownDestructibility` ## Aliases ### `has_shape` `alias has_shape = get_vtable_entry(:trait LayoutA, "has_shape") if get_vtable_entry(:trait LayoutA, "has_shape") else get_vtable_entry(:trait LayoutB, "has_shape")` True if either layout has a shape. ## Methods ### `__init__` `__init__(out self, layout_a: LayoutA, layout_b: LayoutB)` Initialize ComposedLayout with two layouts. **Args:** * ​layout\_a (`LayoutA`): The first layout. * ​layout\_b (`LayoutB`): The second layout. ### `__copyinit__` `__copyinit__(out self, other: Self)` Copy constructor for ComposedLayout. **Args:** * ​other (`Self`): The ComposedLayout to copy from. ### `__call__` `__call__(self, idx: IntTuple[origin]) -> Int` Apply composed layout to an index. Applies `LayoutA`, then adds offset, then applies `LayoutB`. **Args:** * ​idx (`IntTuple[origin]`): The index to transform. **Returns:** The transformed index. `__call__(self, idx: IntTuple[origin], offset_val: Int) -> Int` Apply composed layout with runtime offset. Applies `LayoutA`, then adds runtime `offset_val`, then `LayoutB`. Static offset must not be set when using runtime offset. **Args:** * ​idx (`IntTuple[origin]`): The index to transform. * ​offset\_val (`Int`): Runtime offset to apply. **Returns:** The transformed index. ### `size` `size(self) -> Int` Get the size of the composed layout. Returns the size of the first layout (`LayoutA`). **Returns:** The size of the first layout. ### `cosize` `cosize(self) -> Int` Get the cosize of the composed layout. Returns the cosize of the second layout (`LayoutB`). **Returns:** The cosize of the second layout. --- ## Swizzle `@register_passable(trivial)` `struct Swizzle` Swizzle functor for memory access pattern optimization. Implements a swizzling pattern to reduce bank conflicts in shared memory accesses. It XORs specific bits of memory indices based on configurable parameters. Swizzle operation: Given index `i`, and Swizzle\[bits, base, shift]: 1. Extract `bits` number of bits from `i` starting from position `base + max(0, shift)`. Let's call this `YYY`. 2. Extract `bits` number of bits from `i` starting from position `base - min(0, shift)`. Let's call this `ZZZ`. 3. Result is `i ^ (YYY shifted by 'shift' positions)`. Example (Swizzle\[2, 0, 3]): Input index bits: `xxxxxxxxxxxxxxxxYYxxxxxxxxxZZxxxx` Output index bits: `xxxxxxxxxxxxxxxxYYxxxxxxxxxAAxxxx` where `AA = ZZ ^ YY`. Attributes: * bits (`Int`): Number of bits in the mask (YYY). * base (`Int`): Number of least significant bits to keep constant. * shift (`Int`): Shift distance for the mask (positive: right, negative: left). * yyy\_mask (`Int`): Mask for the bits to be shifted (YYY). * zzz\_mask (`Int`): Mask for the target bits (ZZZ). ## Fields * ​bits (`Int`): Number of bits in the mask. * ​base (`Int`): Number of least significant bits to keep constant. * ​shift (`Int`): Distance to shift the mask (pos right, neg left). * ​yyy\_mask (`Int`): Mask for the bits to be shifted. * ​zzz\_mask (`Int`): Mask for the target bits. ## Implemented traits `AnyType`, `Copyable`, `LayoutTrait`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `has_shape` `alias has_shape = False` Indicates if layout has shape. Swizzle always False. ## Methods ### `__init__` `__init__(bits: Int, base: Int, shift: Int) -> Self` Initialize a Swizzle object. Configures the swizzle operation based on bits, base, and shift parameters. **Args:** * ​bits (`Int`): Number of bits in the mask.
* ​base (`Int`): Least significant bits to keep constant. * ​shift (`Int`): Distance to shift the mask. ### `__call__` `__call__(self, index: IntTuple[origin]) -> Int` Apply swizzle to an IntTuple index. Unwraps the IntTuple and applies the swizzle to the integer value. **Args:** * ​index (`IntTuple[origin]`): The IntTuple index to swizzle. **Returns:** The swizzled index value. `__call__(self, offset: Int) -> Int` Apply swizzle to an integer offset. Performs the swizzle operation on an integer offset to rearrange memory access patterns. **Args:** * ​offset (`Int`): The integer offset to swizzle. **Returns:** The swizzled offset value. `__call__(self, offset: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Apply swizzle to a scalar offset. Scalar version of the swizzle operation. Applies swizzle to a scalar offset. **Args:** * ​offset (`SIMD[dtype, 1]`): The scalar offset to swizzle. **Returns:** The swizzled scalar value. ### `size` `size(self) -> Int` Get the size of the swizzle pattern. Calculates the size of the memory region affected by the swizzle pattern. **Returns:** The size of the swizzle pattern. ### `cosize` `cosize(self) -> Int` Get the cosize of the swizzle pattern. Cosize is the same as size for swizzle layouts, representing the output size. **Returns:** The cosize of the swizzle pattern (same as size). ### `write_to` `write_to[W: Writer](self, mut writer: W)` Write the swizzle parameters to a writer. Outputs the swizzle parameters (bits, base, shift) in a tuple format. **Parameters:** * ​W (`Writer`): The writer type that implements the Writer trait. **Args:** * ​writer (`W`): The writer to write to. ### `__str__` `__str__(self) -> String` Convert the swizzle to a string representation. **Returns:** String representation of the swizzle parameters. --- ## eval_composed `eval_composed[composed_layout: ComposedLayout[Layout, Swizzle]](idx: UInt, offset: UInt = UInt(0)) -> UInt` Evaluate a composed layout with swizzle. Evaluates a `ComposedLayout[Layout, Swizzle]`. Applies the base layout, adds an optional offset, and then applies the swizzle. **Parameters:** * ​composed\_layout (`ComposedLayout[Layout, Swizzle]`): The composed layout to evaluate, consisting of a base Layout and a Swizzle transformation. **Args:** * ​idx (`UInt`): The input index to transform. * ​offset (`UInt`): Optional offset to apply between layouts (default: 0). **Returns:** The transformed index after applying both layouts. --- ## swizzle Defines swizzle layouts for optimizing memory access patterns. This module is designed for use in shared memory, especially in GPU kernels, to reduce bank conflicts. It provides tools to create and apply swizzle transformations to memory indices. Swizzling rearranges memory access order to distribute accesses across different memory banks. This mitigates bank contention and improves memory access efficiency. Module components: * `Swizzle` struct: Represents a swizzle transformation with configurable bits, base, and shift parameters. * Helper functions: `make_ldmatrix_swizzle`, `make_swizzle` create predefined swizzle patterns. These are optimized for scenarios like `ldmatrix` instructions and general 2D memory access. * `ComposedLayout` struct: Combines a base layout with a swizzle layout for complex memory access optimizations. Primary use case: GPU kernel development where shared memory bank conflicts can degrade performance. Applying swizzle layouts optimizes memory access patterns for higher throughput. 
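For example, a minimal sketch that applies the hand-built `Swizzle[2, 0, 3]` pattern described above to a few integer offsets (the printed mappings follow from the documented XOR rule, e.g. 8 -> 9, 16 -> 18, 24 -> 27):

```mojo
from layout.swizzle import Swizzle

fn main():
    # bits=2, base=0, shift=3: XOR bits [3:5) of the index into bits [0:2).
    var sw = Swizzle(2, 0, 3)
    for i in range(4):
        var offset = i * 8
        print(offset, "->", sw(offset))
```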
## Structs * [​`ComposedLayout`](./ComposedLayout): Layout composed of two layouts applied sequentially. * [​`Swizzle`](./Swizzle): Swizzle functor for memory access pattern optimization. ## Functions * [​`eval_composed`](./eval_composed): Evaluate a composed layout with swizzle. * [​`make_ldmatrix_swizzle`](./make_ldmatrix_swizzle): Make swizzle to avoid bank conflict for ldmatrix ops. * [​`make_swizzle`](./make_swizzle): Create a 2D swizzle to avoid bank conflicts. * [​`shiftl`](./shiftl): Shift left or right based on sign of shift amount. * [​`shiftr`](./shiftr): Shift right or left based on sign of shift amount. --- ## make_ldmatrix_swizzle `make_ldmatrix_swizzle[type: DType, row_size: Int, log2_vector_width: Int = 0]() -> Swizzle` Make swizzle to avoid bank conflict for ldmatrix ops. Creates a swizzle pattern optimized for `ldmatrix` operations. Minimizes bank conflicts in shared memory for these operations. Calculates swizzle parameters based on data type and row size. **Parameters:** * ​type (`DType`): The data type of the elements. * ​row\_size (`Int`): Size of each row in elements. * ​log2\_vector\_width (`Int`): Log2 of the vector width (default: 0). **Returns:** A `Swizzle` object configured for `ldmatrix`. --- ## make_swizzle `make_swizzle[num_rows: Int, row_size: Int, access_size: Int]() -> Swizzle` Create a 2D swizzle to avoid bank conflicts. Generates a swizzle pattern for 2D memory layout to minimize bank conflicts in shared memory access. **Parameters:** * ​num\_rows (`Int`): Number of rows in the minimum access pattern. * ​row\_size (`Int`): Size of each row in elements. * ​access\_size (`Int`): Number of elements accessed at once. **Returns:** A `Swizzle` object for 2D memory access. `make_swizzle[type: DType, mode: TensorMapSwizzle]() -> Swizzle` Create swizzle based on predefined swizzle modes. Returns a swizzle pattern based on standard modes (32B, 64B, 128B, none), adjusted for data type. **Parameters:** * ​type (`DType`): The data type of the elements. * ​mode (`TensorMapSwizzle`): The swizzle mode to use (TensorMapSwizzle enum). **Returns:** A `Swizzle` object configured by the specified mode. --- ## shiftl `shiftl(a: Int, s: Int) -> Int` Shift left or right based on sign of shift amount. Performs a left shift if `s` is positive, or a right shift if `s` is negative. **Args:** * ​a (`Int`): The integer value to shift. * ​s (`Int`): The shift amount. Positive for left, negative for right. **Returns:** The shifted integer value. `shiftl(a: SIMD[dtype, 1], s: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Shift left/right based on sign of shift for scalars. Scalar version of `shiftl`. Left shift if `s` is positive, right shift if `s` is negative. **Args:** * ​a (`SIMD[dtype, 1]`): The scalar value to shift. * ​s (`SIMD[dtype, 1]`): The scalar shift amount. Positive for left, negative right. **Returns:** The shifted scalar value. --- ## shiftr `shiftr(a: Int, s: Int) -> Int` Shift right or left based on sign of shift amount. Performs a right shift if `s` is positive, or a left shift if `s` is negative. **Args:** * ​a (`Int`): The integer value to shift. * ​s (`Int`): The shift amount. Positive for right, negative for left. **Returns:** The shifted integer value. `shiftr(a: SIMD[dtype, 1], s: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Shift right/left based on sign of shift for scalars. Scalar version of `shiftr`. Right shift if `s` is positive, left shift if `s` is negative. **Args:** * ​a (`SIMD[dtype, 1]`): The scalar value to shift. * ​s (`SIMD[dtype, 1]`): The scalar shift amount. 
Positive for right, negative left. **Returns:** The shifted scalar value. --- ## LayoutTensorBuild `@register_passable(trivial)` `struct LayoutTensorBuild[dtype: DType, *, __layout: Layout = __init__[::Origin[::Bool(IntTuple(1)), __layout_init: Bool = False, __address_space: AddressSpace = AddressSpace(0), __layout_int_type: DType = _get_layout_type(__layout, __address_space), __index_type: DType = _get_index_type(__layout, __address_space), __circular: Bool = False]` Tensor layout builder providing a fluent interface for constructing tensors with various layouts. ## Parameters * ​dtype (`DType`): Data type of tensor elements. * ​\_\_layout (`Layout`): The tensor's memory layout. * ​\_\_layout\_init (`Bool`): Whether the layout has been initialized. * ​\_\_address\_space (`AddressSpace`): Memory space (generic, shared, local). * ​\_\_layout\_int\_type (`DType`): Layout index type. * ​\_\_index\_type (`DType`): Type used for indexing. * ​\_\_circular (`Bool`): Whether tensor has circular indexing semantics. ## Fields * ​runtime\_layout (`RuntimeLayout[__layout, element_type=__layout_int_type, linear_idx_type=__index_type]`): Runtime representation of the tensor's layout. This field stores the layout information that can be manipulated at runtime, particularly important for tensors with dynamic dimensions. It encapsulates: * The static layout template from `__layout` parameter * The bit width for index calculations * The appropriate index type based on address space ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Initializes a new `LayoutTensorBuild` instance with default values. ### `row_major` `row_major[*shapes: Int](self) -> LayoutTensorBuild[dtype, __layout=row_major[::Origin[::Bool(_to_int_tuple[::VariadicList[::Int]]()), __layout_init=True]` Creates a row-major layout using compile-time dimensions. **Parameters:** * ​\*shapes (`Int`): Variadic parameter specifying the dimensions of the tensor. Each value represents the size of a dimension. **Returns:** `LayoutTensorBuild` - A new builder with row-major layout. `row_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=row_major(dim, dim), __layout_init=True]` Creates a row-major 2D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. **Returns:** `LayoutTensorBuild` - A new builder with row-major layout. `row_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=row_major(dim, dim, dim), __layout_init=True]` Creates a row-major 3D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. * ​shape2 (`ValueOrUnknown[dim]`): Third dimension size. **Returns:** `LayoutTensorBuild` - A new builder with row-major layout. `row_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim], shape3: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=row_major(dim, dim, dim, dim), __layout_init=True]` Creates a row-major 4D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. * ​shape2 (`ValueOrUnknown[dim]`): Third dimension size. 
* ​shape3 (`ValueOrUnknown[dim]`): Fourth dimension size. **Returns:** `LayoutTensorBuild` - A new builder with row-major layout. `row_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim], shape3: ValueOrUnknown[dim], shape4: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=row_major(dim, dim, dim, dim, dim), __layout_init=True]` Creates a row-major 5D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. * ​shape2 (`ValueOrUnknown[dim]`): Third dimension size. * ​shape3 (`ValueOrUnknown[dim]`): Fourth dimension size. * ​shape4 (`ValueOrUnknown[dim]`): Fifth dimension size. **Returns:** `LayoutTensorBuild` - A new builder with row-major layout. ### `col_major` `col_major[*shapes: Int](self) -> LayoutTensorBuild[dtype, __layout=col_major[::Origin[::Bool(_to_int_tuple[::VariadicList[::Int]]()), __layout_init=True]` Creates a column-major layout using compile-time dimensions. **Parameters:** * ​\*shapes (`Int`): Variadic parameter specifying the dimensions of the tensor. Each value represents the size of a dimension. **Returns:** `LayoutTensorBuild` - A new builder with column-major layout. `col_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=col_major(dim, dim), __layout_init=True]` Creates a column-major 2D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. **Returns:** `LayoutTensorBuild` - A new builder with column-major layout. `col_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=col_major(dim, dim, dim), __layout_init=True]` Creates a column-major 3D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. * ​shape2 (`ValueOrUnknown[dim]`): Third dimension size. **Returns:** `LayoutTensorBuild` - A new builder with column-major layout. `col_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim], shape3: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=col_major(dim, dim, dim, dim), __layout_init=True]` Creates a column-major 4D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. * ​shape2 (`ValueOrUnknown[dim]`): Third dimension size. * ​shape3 (`ValueOrUnknown[dim]`): Fourth dimension size. **Returns:** `LayoutTensorBuild` - A new builder with column-major layout. `col_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim], shape3: ValueOrUnknown[dim], shape4: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=col_major(dim, dim, dim, dim, dim), __layout_init=True]` Creates a column-major 5D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. * ​shape2 (`ValueOrUnknown[dim]`): Third dimension size. * ​shape3 (`ValueOrUnknown[dim]`): Fourth dimension size. * ​shape4 (`ValueOrUnknown[dim]`): Fifth dimension size. **Returns:** `LayoutTensorBuild` - A new builder with column-major layout. 
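As a usage illustration of the fluent interface, here is a minimal sketch that builds one static and one dynamic layout (assuming the builder is importable from `layout.tensor_builder`; `alloc`, `view`, and `dynamic` are documented below):

```mojo
from layout.tensor_builder import LayoutTensorBuild, dynamic

fn builder_examples(n: Int):
    # Static 4x8 row-major layout: all dimensions are known at compile
    # time, so the tensor can be allocated directly.
    var static_tensor = LayoutTensorBuild[DType.float32]().row_major[4, 8]().alloc()

    # Runtime-sized 2D column-major layout: each dimension is wrapped
    # with dynamic(). Such a builder is typically paired with view()
    # over existing memory rather than alloc().
    var dyn_builder = LayoutTensorBuild[DType.float32]().col_major(dynamic(n), dynamic(8))
```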
### `layout` `layout[shape0: Int](self) -> LayoutTensorBuild[dtype, __layout=__init__[::Origin[::Bool(IntTuple(shape0)), __layout_init=True]` Creates a 1D layout with a compile-time dimension. **Parameters:** * ​shape0 (`Int`): Size of the single dimension. **Returns:** `LayoutTensorBuild` - A new builder with the specified layout. `layout[rank: Int, shape: IndexList[rank], stride: IndexList[rank]](self) -> LayoutTensorBuild[dtype, __layout=__init__[::Origin[::Bool(_to_int_tuple[::Int](shape), _to_int_tuple[::Int](stride)), __layout_init=True]` Creates a custom layout with compile-time dimensions and strides. **Parameters:** * ​rank (`Int`): Number of dimensions. * ​shape (`IndexList[rank]`): List of dimension sizes. * ​stride (`IndexList[rank]`): List of strides for each dimension. **Returns:** `LayoutTensorBuild` - A new builder with the specified custom layout. `layout[rank: Int](self, shape: IndexList[rank], stride: IndexList[rank]) -> LayoutTensorBuild[dtype, __layout=__init__[::Origin[::Bool(_to_int_tuple[::Int](-1), _to_int_tuple[::Int](-1)), __layout_init=True]` Creates a custom layout with runtime dimensions and strides. **Parameters:** * ​rank (`Int`): Number of dimensions. **Args:** * ​shape (`IndexList[rank]`): List of dimension sizes. * ​stride (`IndexList[rank]`): List of strides for each dimension. **Returns:** `LayoutTensorBuild` - A new builder with the specified custom layout. `layout(self, shape0: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=__init__[::Origin[::Bool(IntTuple(dim)), __layout_init=True]` Creates a 1D layout with a runtime dimension. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): Size of the single dimension. **Returns:** `LayoutTensorBuild` - A new builder with the specified layout. ### `shared` `shared(self) -> LayoutTensorBuild[dtype, __layout=__layout, __layout_init=__layout_init, __address_space=AddressSpace(3)]` Places the tensor in GPU shared memory. **Returns:** `LayoutTensorBuild` - A new builder with shared memory address space. ### `local` `local(self) -> LayoutTensorBuild[dtype, __layout=__layout, __layout_init=__layout_init, __address_space=AddressSpace(5)]` Places the tensor in GPU local memory. **Returns:** `LayoutTensorBuild` - A new builder with local memory address space. ### `alloc` `alloc(self) -> LayoutTensor[dtype, __layout, MutableAnyOrigin, address_space=__address_space]` Allocates a new tensor using the current layout. Note: Fails to compile if layout is not set, dimensions are not known, or tensor is circular. **Returns:** `LayoutTensor` - A newly allocated tensor with the specified layout ### `view` `view[address_space: AddressSpace](self, ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space]) -> LayoutTensor[dtype, __layout, MutableAnyOrigin, address_space=address_space, layout_int_type=__layout_int_type, linear_idx_type=__index_type]` Creates a tensor view over existing memory. Note: Fails to compile if layout is not set, address spaces don't match, or tensor is circular. **Parameters:** * ​address\_space (`AddressSpace`): Memory address space for the tensor (generic, shared, local). **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space]`): Pointer to memory region to create the view over. **Returns:** `LayoutTensor` - A tensor view over the specified memory region with the current layout. ### `circular` `circular(self) -> LayoutTensorBuild[dtype, __layout=__layout, __layout_init=__layout_init, __address_space=__address_space, __circular=True]` Enables circular indexing for the tensor. 
**Returns:** `LayoutTensorBuild` - A new builder with circular indexing enabled. ### `iter` `iter(self, ptr: UnsafePointer[SIMD[dtype, 1], address_space=__address_space], bound: Int) -> LayoutTensorIter[dtype, __layout, MutableAnyOrigin, address_space=__address_space, circular=__circular, layout_int_type=__layout_int_type, linear_idx_type=__index_type]` Creates an iterator over tensor elements. Note: Fails to compile if layout is not set or dimensions are not known. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=__address_space]`): Pointer to memory region. * ​bound (`Int`): Upper bound for iteration. **Returns:** `LayoutTensorIter` - An iterator over tensor elements. --- ## ValueOrUnknown `struct ValueOrUnknown[dim: Int = -1]` Represents either a static dimension (known at compile time) or a dynamic dimension (known at runtime). ## Parameters * ​dim (`Int`): Optional compile-time dimension value. Default is `UNKNOWN_VALUE` for dynamic dimensions. ## Fields * ​value (`Int`): The runtime value of the dimension. For static dimensions, this is set to the compile-time value. For dynamic dimensions, this is set at runtime. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initializes a static dimension with compile-time value. Note: Fails to compile if dim is `UNKNOWN_VALUE`, as dynamic dimensions require a runtime value. `@implicit` `__init__(out self, v: Int)` Initializes a dynamic dimension with runtime value. **Args:** * ​v (`Int`): Runtime value for the dimension. --- ## dynamic `dynamic(d: Int) -> ValueOrUnknown` Creates a dynamic dimension with runtime value. **Args:** * ​d (`Int`): Runtime dimension value. **Returns:** `ValueOrUnknown` - A dynamic dimension with the given value. --- ## tensor_builder Tensor Builder Module Provides a fluent interface for constructing tensors with various layouts and memory configurations. It includes utilities for creating both static (compile-time) and dynamic (runtime) tensor dimensions, supporting row-major, column-major, and custom layouts. The module enables memory placement in different address spaces (generic, shared, local) and supports features like circular indexing. Key components: * `ValueOrUnknown`: Represents static or dynamic tensor dimensions * `LayoutTensorBuild`: Builder class for tensor construction * Helper functions for dimension specification and layout creation ## Structs * [​`LayoutTensorBuild`](./LayoutTensorBuild): Tensor layout builder providing a fluent interface for constructing tensors with various layouts. * [​`ValueOrUnknown`](./ValueOrUnknown): Represents either a static dimension (known at compile time) or a dynamic dimension (known at runtime). ## Functions * [​`dynamic`](./dynamic): Creates a dynamic dimension with runtime value. * [​`static`](./static): Creates a static dimension with compile-time value. --- ## static `static[d: Int]() -> ValueOrUnknown[d]` Creates a static dimension with compile-time value. **Parameters:** * ​d (`Int`): The compile-time dimension value to use. **Returns:** `ValueOrUnknown[d]` - A static dimension with the given value. --- ## TensorCore `struct TensorCore[out_type: DType, in_type: DType, shape: IndexList[3], transpose_b: Bool = False]` TensorCore provides an abstraction for GPU tensor core hardware to perform optimized matrix operations. This struct encapsulates the functionality required to efficiently map matrix operations to Tensor Cores on NVIDIA and AMD GPUs. 
It handles loading matrix fragments, performing matrix multiply-accumulate operations, and storing results with hardware-specific optimizations. Note: Different shapes and data types are supported depending on the GPU hardware. For NVIDIA GPUs: * float32: 16×8×8 or 16×8×4 * half-precision: 16×8×16 * float8: 16×8×32 For AMD GPUs: * float32: 16×16×4 * half-precision: 16×16×16 or 32×32×8 ## Parameters * ​out\_type (`DType`): The data type for output/accumulation operations. * ​in\_type (`DType`): The data type for input matrix elements. * ​shape (`IndexList[3]`): The shape parameters for the matrix operation in the form \[M, N, K] where M×N is the output shape and K is the inner dimension. * ​transpose\_b (`Bool`): Whether to transpose the B matrix before multiplication. Defaults to False. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Aliases ### `a_reg_type` `alias a_reg_type = SIMD[in_type, num_matrix_reg[::Int,::Int]()]` ### `b_reg_type` `alias b_reg_type = SIMD[in_type, num_matrix_reg[::Int,::Int]()]` ### `c_reg_tile_type` `alias c_reg_tile_type = LayoutTensor[out_type, col_major(1, num_matrix_reg[::Int,::Int]()), MutableAnyOrigin, address_space=AddressSpace(5)]` ### `c_reg_type` `alias c_reg_type = SIMD[out_type, num_matrix_reg[::Int,::Int]()]` ### `supported_fp32` `alias supported_fp32 = (shape == IndexList(16, 8, 8, Tuple())) if is_nvidia_gpu() else (shape == IndexList(16, 16, 4, Tuple())) if (in_type is float32) else (in_type is float32)` ### `supported_fp8` `alias supported_fp8 = (shape == IndexList(16, 8, 32, Tuple())) if Tuple(VariadicPack(float8_e4m3fn, float8_e5m2)).__contains__[::EqualityComparable & ::Copyable & ::Movable](in_type) else Tuple(VariadicPack(float8_e4m3fn, float8_e5m2)).__contains__[::EqualityComparable & ::Copyable & ::Movable](in_type)` ### `supported_half` `alias supported_half = (shape == IndexList(16, 8, 16, Tuple())) if is_nvidia_gpu() else Tuple(VariadicPack(IndexList(16, 16, 16, Tuple()), IndexList(32, 32, 8, Tuple()))).__contains__[::EqualityComparable & ::Copyable & ::Movable](shape) if in_type.is_half_float() else in_type.is_half_float()` ## Methods ### `__init__` `__init__(out self)` Initialize a new TensorCore instance. ### `get_shapes` `static get_shapes[out_type: DType, in_type: DType]() -> List[IndexList[3]]` Get supported shapes for given data types. Returns a list of valid shapes for the specified output and input data types. Note: The returned shapes are hardware-dependent. Different shapes are supported for different combinations of input and output types. **Parameters:** * ​out\_type (`DType`): The output/accumulation data type. * ​in\_type (`DType`): The input matrix data type. **Returns:** List\[IndexList\[3]]: Valid shapes for the matrix operations given the specified types. ### `load_a` `load_a[swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1})](self, a: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[in_type, _get_a_reg_tile_layout[::Layout,::IndexList[::Int(), MutableAnyOrigin, address_space=AddressSpace(5)]` Load the A matrix fragments. Loads matrix A from memory into a LayoutTensor suitable for tensor core operations. **Parameters:** * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzle pattern for optimal memory access (AMD only). 
**Args:** * ​a (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source matrix A data. **Returns:** The loaded matrix fragments as a `LayoutTensor`. `load_a[swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1})](self, warp_tile: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], fragments: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], mma_tile_coord_k: UInt = UInt(0))` Load A matrix fragments from shared memory. An optimized variant that loads A matrix fragments from a warp tile in shared memory into register fragments. **Parameters:** * ​swizzle (`OptionalReg[Swizzle]`): Optional memory access pattern to optimize memory bandwidth. **Args:** * ​warp\_tile (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source data in shared memory. * ​fragments (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor for fragments. * ​mma\_tile\_coord\_k (`UInt`): The K coordinate of the MMA tile. Defaults to 0. ### `load_b` `load_b[swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1})](self, b: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[in_type, _get_b_reg_tile_layout[::Layout,::IndexList[::Int(), MutableAnyOrigin, address_space=AddressSpace(5)]` Load the B matrix fragments. Loads matrix B from memory into a `LayoutTensor` suitable for tensor core operations. The function handles different hardware architectures and memory access patterns. Note: If `transpose_b` is `True`, the B matrix will be transposed during loading. This is more efficient than transposing the matrix in memory. **Parameters:** * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzle pattern for optimal memory access (AMD only). Will cause an error if used with NVIDIA GPUs. **Args:** * ​b (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source matrix B data. **Returns:** The loaded matrix fragments as a `LayoutTensor`.
`load_b[swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1})](self, warp_tile: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], fragments: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], mma_tile_coord_k: UInt = UInt(0), warp_tile_coord_n: UInt = UInt(0))` Load B matrix fragments from shared memory into registers for tensor core operations. This function loads matrix B fragments from a warp tile in shared memory into register fragments for use in tensor core matrix multiply operations. It handles hardware-specific optimizations for both NVIDIA and AMD GPUs. Note: The `warp_tile` must be in shared memory. For NVIDIA GPUs, `swizzle` must be `None`. For AMD GPUs, providing an appropriate `swizzle` pattern can improve performance. **Parameters:** * ​swizzle (`OptionalReg[Swizzle]`): Optional memory access pattern for AMD GPUs to optimize memory bandwidth. Must be `None` on NVIDIA GPUs, where swizzling is always applied. **Args:** * ​warp\_tile (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Source `LayoutTensor` in shared memory containing the B matrix data. * ​fragments (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Destination `LayoutTensor` to store the loaded matrix fragments. * ​mma\_tile\_coord\_k (`UInt`): K-dimension coordinate within the warp tile. Defaults to 0. * ​warp\_tile\_coord\_n (`UInt`): N-dimension coordinate within the warp tile. Defaults to 0. `load_b(self, warp_tile: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], fragments: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], scales: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], mma_tile_coord_k: UInt = UInt(0))` Load quantized B matrix fragments from shared memory with dequantization. This function loads int4 quantized matrix B fragments from shared memory, dequantizes them using the provided scales, and stores the result in register fragments for tensor core operations. Notes: * The `warp_tile` must be in shared memory. * The `fragments` and `scales` must be in local memory. * This function only supports half-precision data types (bfloat16, float16). * The quantized data is stored as int4 values packed into int32 elements. * Each thread processes multiple fragments by unpacking and dequantizing the int4 values.
**Args:** * ​warp\_tile (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Source `LayoutTensor` in shared memory containing the quantized B matrix data. * ​fragments (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Destination `LayoutTensor` to store the dequantized matrix fragments. * ​scales (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): `LayoutTensor` containing the scaling factors for dequantization. * ​mma\_tile\_coord\_k (`UInt`): K-dimension coordinate within the warp tile. Defaults to 0. ### `load_c` `load_c(self, c: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[out_type, col_major(1, num_matrix_reg[::Int,::Int]()), MutableAnyOrigin, address_space=AddressSpace(5)]` Load the C matrix fragments. Loads matrix C from memory into a `LayoutTensor` suitable for tensor core operations. The function handles different hardware architectures and memory access patterns. **Args:** * ​c (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source matrix C data. **Returns:** The loaded matrix fragments as a `LayoutTensor`. ### `store_d` `store_d(self, d_dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], d_src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Store matrix D to destination memory. Stores the result matrix D from tensor core computation to the destination memory. **Args:** * ​d\_dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor to store the result. * ​d\_src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor containing the computed result. 
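The loaders above and `mma_op` (below) are typically combined into a single per-warp step. The following sketch assumes the NVIDIA float32 16×8×8 configuration and a `layout.tensor_core` import path; kernel scaffolding and layout setup are elided:

```mojo
from layout import Layout, LayoutTensor
from layout.tensor_core import TensorCore
from utils.index import Index

fn mma_step[
    a_layout: Layout, b_layout: Layout, c_layout: Layout
](
    a: LayoutTensor[DType.float32, a_layout, MutableAnyOrigin],
    b: LayoutTensor[DType.float32, b_layout, MutableAnyOrigin],
    c: LayoutTensor[DType.float32, c_layout, MutableAnyOrigin],
):
    # 16x8x8 MMA with float32 inputs accumulating into float32.
    var mma = TensorCore[DType.float32, DType.float32, Index(16, 8, 8)]()

    # Load per-thread fragments of A, B, and C, multiply-accumulate,
    # then write the result tile back out.
    var a_frag = mma.load_a(a)
    var b_frag = mma.load_b(b)
    var c_frag = mma.load_c(c)
    var d_frag = mma.mma_op(a_frag, b_frag, c_frag)
    mma.store_d(c, d_frag)
```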
### `mma_op` `mma_op(self, a: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], b: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[out_type, col_major(1, num_matrix_reg[::Int,::Int]()), MutableAnyOrigin, address_space=AddressSpace(5)]` Perform matrix multiply-accumulate operation (MMA). Executes `D = A * B + C` using tensor cores. **Args:** * ​a (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The A matrix input. * ​b (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The B matrix input. * ​c (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The C matrix input for accumulation. **Returns:** `Self.c_reg_tile_type`: The result of the MMA operation. ### `mma` `mma(self, a_frag: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], b_frag: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c_frag: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Perform matrix multiply-accumulate operation using tensor cores. Executes C = A \* B + C using tensor cores, where A, B, and C are matrix fragments stored in register memory. This function handles the mapping of fragments to hardware tensor core operations. Notes: * All fragments must be properly loaded using the corresponding load functions. * The function assumes fragments are vectorized layout tensors with dimensions num\_vectors x 1. * The c\_frag shape\[0] must equal num\_m\_mmas \* num\_n\_mmas. * The result is accumulated in-place in c\_frag. **Args:** * ​a\_frag (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix A fragments as a `LayoutTensor`. * ​b\_frag (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix B fragments as a `LayoutTensor`. 
* ​c\_frag (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix C fragments as a `LayoutTensor` for both input and output. --- ## get_fragment_size `get_fragment_size[mma_shape: IndexList[3]]() -> IndexList[3]` Calculates the fragment size per thread for a given MMA shape. For tensor core operations, each thread in a warp handles a portion of the computation. This function determines how many elements each thread needs to process for the A, B, and C/D matrices based on the MMA shape. **Parameters:** * ​mma\_shape (`IndexList[3]`): An `IndexList[3]` containing the MMA dimensions \[M, N, K]. **Returns:** An `IndexList[3]` containing the fragment sizes per thread for matrices A, B, and C/D respectively, calculated as: `[M*K/WARP_SIZE, N*K/WARP_SIZE, M*N/WARP_SIZE]`. --- ## get_mma_shape `get_mma_shape[input_type: DType, accum_type: DType, shape_id: Int = 0]() -> IndexList[3]` Returns the appropriate matrix multiply-accumulate (MMA) shape for tensor core operations. Selects the optimal MMA shape based on the GPU architecture, input data type, accumulation data type, and optional shape identifier. This function handles different configurations for both NVIDIA and AMD GPUs. **Parameters:** * ​input\_type (`DType`): The data type of the input matrices (A and B). * ​accum\_type (`DType`): The data type used for accumulation (C and D). * ​shape\_id (`Int`): Optional identifier to select between multiple valid shapes (default: 0). **Returns:** An `IndexList[3]` containing the MMA dimensions in the format `[M, N, K]`, where `M×N` is the output matrix size and `K` is the reduction dimension. --- ## tensor_core Tensor Core Module for High-Performance Matrix Operations Provides abstractions for using GPU Tensor Cores to perform optimized matrix operations. It supports both NVIDIA and AMD GPU architectures with hardware-specific optimizations. ## Key Components: * `TensorCore`: Core struct that encapsulates tensor core operations with support for various data types and matrix shapes. It handles loading matrix fragments, performing matrix multiply-accumulate operations, and storing results. * Matrix Fragment Management: Functions for loading and storing matrix fragments to/from shared memory with hardware-specific optimizations. * Matrix Multiply-Accumulate (MMA): Optimized implementations of matrix multiplication operations using tensor cores. 
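As a quick illustration of how `get_mma_shape` and `get_fragment_size` (documented above) compose, this sketch selects a hardware MMA shape for given data types and derives the per-thread fragment sizes (import path assumed; the printed values correspond to an NVIDIA GPU with a 32-thread warp):

```mojo
from layout.tensor_core import get_fragment_size, get_mma_shape

fn main():
    # Choose the MMA tile for bfloat16 inputs accumulating in float32.
    alias mma_shape = get_mma_shape[DType.bfloat16, DType.float32]()
    # Per-thread element counts for A, B, and C/D:
    # [M*K/32, N*K/32, M*N/32] = [8, 4, 4] for a 16x8x16 shape.
    alias frag_size = get_fragment_size[mma_shape]()
    print(mma_shape)  # e.g. [16, 8, 16]
    print(frag_size)  # e.g. [8, 4, 4]
```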
## Supported Operations: * Matrix loading with various layouts and swizzling patterns * Matrix multiply-accumulate (D = A \* B + C) * Matrix storing with hardware-specific optimizations ## Supported Data Types: * NVIDIA: float32, bfloat16, float16, float8\_e4m3fn, float8\_e5m2 * AMD: float32, bfloat16, float16 ## Supported Matrix Shapes: * NVIDIA: 16×8×8, 16×8×4, 16×8×16, 8×8×4, 16×8×32 * AMD: 16×16×4, 16×16×16, 32×32×8 ## Aliases ### `shape_16x16x16` `alias shape_16x16x16 = IndexList(16, 16, 16, Tuple())` ### `shape_16x16x4` `alias shape_16x16x4 = IndexList(16, 16, 4, Tuple())` ### `shape_16x8x16` `alias shape_16x8x16 = IndexList(16, 8, 16, Tuple())` ### `shape_16x8x32` `alias shape_16x8x32 = IndexList(16, 8, 32, Tuple())` ### `shape_16x8x4` `alias shape_16x8x4 = IndexList(16, 8, 4, Tuple())` ### `shape_16x8x8` `alias shape_16x8x8 = IndexList(16, 8, 8, Tuple())` ### `shape_32x32x8` `alias shape_32x32x8 = IndexList(32, 32, 8, Tuple())` ### `shape_8x8x4` `alias shape_8x8x4 = IndexList(8, 8, 4, Tuple())` ### `shape_null` `alias shape_null = IndexList(0, 0, 0, Tuple())` ## Structs * [​`TensorCore`](./TensorCore): TensorCore provides an abstraction for GPU tensor core hardware to perform optimized matrix operations. ## Functions * [​`get_fragment_size`](./get_fragment_size): Calculates the fragment size per thread for a given MMA shape. * [​`get_mma_shape`](./get_mma_shape): Returns the appropriate matrix multiply-accumulate (MMA) shape for tensor core operations. * [​`num_matrix_reg`](./num_matrix_reg): Calculates the number of matrix registers required per thread. --- ## num_matrix_reg `num_matrix_reg[dim_1: Int, dim_2: Int]() -> Int` Calculates the number of matrix registers required per thread. Determines how many registers each thread in a warp needs to store a matrix of the given dimensions. This is calculated by dividing the total number of elements (dim\_1 \* dim\_2) by the warp size, as the matrix is distributed across all threads in the warp. **Parameters:** * ​dim\_1 (`Int`): First dimension of the matrix. * ​dim\_2 (`Int`): Second dimension of the matrix. **Returns:** The number of matrix registers needed per thread. --- ## TensorCoreAsync `struct TensorCoreAsync[c_type: DType, a_type: DType, b_type: DType, mma_shape: IndexList[3], /, a_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), b_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), transpose_b: Bool = False]` High-performance asynchronous tensor core operations for matrix multiplication. This struct provides methods for utilizing NVIDIA's Tensor Cores for asynchronous matrix multiplication operations, with support for various data types and swizzling configurations. ## Parameters * ​c\_type (`DType`): Data type of the output matrix C. * ​a\_type (`DType`): Data type of the input matrix A. * ​b\_type (`DType`): Data type of the input matrix B. * ​mma\_shape (`IndexList[3]`): Dimensions for the matrix multiply-accumulate (MMA) operation as \[M, N, K]. * ​a\_swizzle (`TensorMapSwizzle`): Swizzling mode for matrix A (default: SWIZZLE\_NONE). * ​b\_swizzle (`TensorMapSwizzle`): Swizzling mode for matrix B (default: SWIZZLE\_NONE). * ​transpose\_b (`Bool`): Whether to transpose matrix B (default: False). ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initialize the `TensorCoreAsync` instance. Ensures that the provided MMA shape is supported. 
Note: Fails to compile if `mma_shape` is not supported. ### `wgmma` `static wgmma[num_warp_groups: Int = 1, scale_c: Int = 1, scale_a: Int = 1, scale_b: Int = 1](a_smem_tile: LayoutTensor[a_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], b_smem_tile: LayoutTensor[b_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c_reg_tile: LayoutTensor[c_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], wg_idx: Int = 0)` Perform asynchronous matrix multiplication using warp group matrix multiply-accumulate (WGMMA). This method handles the case where both A and B matrices are in shared memory. **Parameters:** * ​num\_warp\_groups (`Int`): Number of warp groups to distribute work across (default: 1). * ​scale\_c (`Int`): Scale factor for matrix C. Valid values are 1 or 0 (default: 1). * ​scale\_a (`Int`): Scale factor for matrix A. Valid values are 1 or -1 (default: 1). * ​scale\_b (`Int`): Scale factor for matrix B. Valid values are 1 or -1 (default: 1). **Args:** * ​a\_smem\_tile (`LayoutTensor[a_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix A in shared memory. * ​b\_smem\_tile (`LayoutTensor[b_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix B in shared memory. * ​c\_reg\_tile (`LayoutTensor[c_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Output matrix C in register memory. * ​wg\_idx (`Int`): Warp group index for multi-warp group scenarios (default: 0). `static wgmma(a_frag_tile: LayoutTensor[a_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], b_smem_tile: LayoutTensor[b_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c_reg_tile: LayoutTensor[c_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Perform asynchronous matrix multiplication using warp group matrix multiply-accumulate (WGMMA). This overloaded method handles the case where matrix A is in register memory and matrix B is in shared memory. **Args:** * ​a\_frag\_tile (`LayoutTensor[a_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix A in register memory. 
* ​b\_smem\_tile (`LayoutTensor[b_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix B in shared memory. * ​c\_reg\_tile (`LayoutTensor[c_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Output matrix C in register memory. ### `arrive` `static arrive()` Ensures memory consistency by creating a fence for WGMMA operations. This method should be called before committing a group to ensure all shared memory accesses are properly aligned and visible. ### `commit_group` `static commit_group()` Commits the current warp group for execution. This synchronizes the warp group and commits all pending WGMMA operations that have been previously issued. ### `wait_group` `static wait_group[group: Int = 0]()` Waits for the completion of a specific warp group's operations. This method blocks until all WGMMA operations from the specified group are complete. **Parameters:** * ​group (`Int`): The group ID to wait for (default: 0). --- ## tensor_core_async Tensor Core Async Module This module provides high-performance abstractions for utilizing NVIDIA's Tensor Cores to perform asynchronous matrix multiplication operations. It implements optimized memory layouts and access patterns for efficient tensor core computations. Key components: * Layout creation functions for K-major and MN-major memory arrangements * Swizzling support for improved memory access patterns * WGMMA (Warp Group Matrix Multiply-Accumulate) descriptor generation * TensorCoreAsync struct with methods for asynchronous matrix multiplication The module supports various data types, matrix dimensions, and memory configurations, enabling efficient implementation of deep learning primitives and other tensor operations that can leverage hardware acceleration. Performance features: * Asynchronous execution model to overlap computation and memory access * Support for different swizzling modes to optimize memory bandwidth * Efficient register and shared memory utilization * Support for multi-warp group execution This implementation is specifically optimized for NVIDIA GPUs with Tensor Core support. ## Aliases ### `WGMMA_K_BYTES` `alias WGMMA_K_BYTES = 32` ## Structs * [​`TensorCoreAsync`](./TensorCoreAsync): High-performance asynchronous tensor core operations for matrix multiplication. ## Functions * [​`select_k_atom`](./select_k_atom): Creates a core matrix layout for tensor core operations. * [​`st_matrix_n_atom`](./st_matrix_n_atom): Creates a layout for N-major `st_matrix` atom in the context of WGMMA C matrix. * [​`st_matrix_n_layout`](./st_matrix_n_layout): Creates a layout for N-major `st_matrix` in the context of WGMMA C matrix. * [​`tile_layout_k_major`](./tile_layout_k_major): Creates a K-major layout for tensor core operations. * [​`tile_layout_mn_major`](./tile_layout_mn_major): Creates an MN-major layout for tensor core operations. * [​`tile_to_descriptor`](./tile_to_descriptor): Transforms a layout into a WGMMA descriptor-compatible layout. * [​`wgmma_c_layout`](./wgmma_c_layout): Generates three layouts for mapping WGMMA C matrix coordinates. * [​`wgmma_c_thread_layout`](./wgmma_c_thread_layout): Returns the thread layout component for WGMMA C matrix. * [​`wgmma_output_layout`](./wgmma_output_layout): Returns the output layout component for WGMMA C matrix. 
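To tie the pieces together, here is a sketch of one consumer iteration of the issue/commit/wait pattern using `TensorCoreAsync`. The 64×64×16 bfloat16 shape, the import paths, and the elided producer/barrier code are all assumptions for illustration:

```mojo
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.tensor_core_async import TensorCoreAsync
from utils.index import Index

fn wgmma_step[
    a_layout: Layout, b_layout: Layout, c_layout: Layout
](
    a_smem: LayoutTensor[
        DType.bfloat16, a_layout, MutableAnyOrigin,
        address_space = AddressSpace.SHARED,
    ],
    b_smem: LayoutTensor[
        DType.bfloat16, b_layout, MutableAnyOrigin,
        address_space = AddressSpace.SHARED,
    ],
    c_reg: LayoutTensor[
        DType.float32, c_layout, MutableAnyOrigin,
        address_space = AddressSpace.LOCAL,
    ],
):
    alias mma = TensorCoreAsync[
        DType.float32, DType.bfloat16, DType.bfloat16, Index(64, 64, 16)
    ]
    mma.arrive()        # fence: make shared-memory writes visible to WGMMA
    mma.wgmma(a_smem, b_smem, c_reg)  # issue the async multiply-accumulate
    mma.commit_group()  # commit all pending WGMMA operations
    mma.wait_group()    # block until the committed group completes
```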
--- ## select_k_atom `select_k_atom[type: DType, swizzle_mode: TensorMapSwizzle]() -> Layout` Creates a core matrix layout for tensor core operations. Constructs the fundamental atomic layout for tensor core operations based on the specified data type and swizzle mode. This layout represents the minimal dense matrix structure that can be efficiently processed by tensor cores. **Parameters:** * ​type (`DType`): Element data type of the tensor. * ​swizzle\_mode (`TensorMapSwizzle`): Memory access pattern swizzling mode. **Returns:** `Layout` - A core matrix layout optimized for tensor core operations. --- ## st_matrix_n_atom `st_matrix_n_atom[num_stmatrix: Int]() -> Layout` Creates a layout for N-major `st_matrix` atom in the context of WGMMA C matrix. The domain of this layout is the warp group local thread index. Thus, the layout takes \[0, 128) as input and returns an offset for a logical array with an element size of 128-bit. **Parameters:** * ​num\_stmatrix (`Int`): Number of N-dimension tiles in the C matrix. **Returns:** `Layout` - A layout that maps warp group local thread index to an offset for a logical array with an element size of 128-bit. --- ## st_matrix_n_layout `st_matrix_n_layout[c_type: DType, WG_BN: Int, num_m_mmas: Int, num_consumer: Int]() -> Layout` Creates a layout for N-major `st_matrix` in the context of WGMMA C matrix. The layout modes are: the warp group local thread index, the N-dimension tiling size `WG_BN // 16`, the number of MMA tiles `num_m_mmas` in the M-dimension, and the number of consumers `num_consumer`. The output is an offset for a logical array with the element type `c_type`. **Parameters:** * ​c\_type (`DType`): Data type of the C matrix. * ​WG\_BN (`Int`): Size of the K dimension in the C matrix in shared memory. * ​num\_m\_mmas (`Int`): Number of MMA tiles in the M dimension. * ​num\_consumer (`Int`): Number of consumers. **Returns:** `Layout` - A layout that maps warp group local thread index to an offset for a logical array with the element type `c_type`. --- ## tile_layout_k_major `tile_layout_k_major[type: DType, BM: Int, BK: Int, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0))]() -> Layout` Creates a K-major layout for tensor core operations. Constructs a layout optimized for K-major access patterns in tensor core operations, with optional swizzling for improved memory access patterns. **Parameters:** * ​type (`DType`): Element data type of the tensor. * ​BM (`Int`): Size of the M dimension in the tile. * ​BK (`Int`): Size of the K dimension in the tile. * ​swizzle\_mode (`TensorMapSwizzle`): Memory access pattern swizzling mode (default: SWIZZLE\_NONE). **Returns:** `Layout` - A K-major layout configured for the specified dimensions and swizzle mode. --- ## tile_layout_mn_major `tile_layout_mn_major[type: DType, mn_dim: Int, k_dim: Int, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0))]() -> Layout` Creates an MN-major layout for tensor core operations. Constructs a unit layout optimized for MN-major access patterns in shared memory, with optional swizzling for improved memory access patterns. Note: This returns the "unit" layout; the actual shared memory layout can be a multiple of this unit. Currently only supports SWIZZLE\_NONE and SWIZZLE\_128B modes. **Parameters:** * ​type (`DType`): Element data type of the tensor. * ​mn\_dim (`Int`): Size of the MN dimension. * ​k\_dim (`Int`): Size of the K dimension. 
* ​swizzle\_mode (`TensorMapSwizzle`): Memory access pattern swizzling mode (default: SWIZZLE\_NONE). **Returns:** `Layout` - An MN-major layout configured for the specified dimensions and swizzle mode. --- ## tile_to_descriptor `tile_to_descriptor[type: DType, layout: Layout, is_k_major: Bool = True]() -> Layout` Transforms a layout into a WGMMA descriptor-compatible layout. Converts a standard layout into a form that can be used with WGMMA descriptors, handling both K-major and MN-major layouts differently. **Parameters:** * ​type (`DType`): Element data type of the tensor. * ​layout (`Layout`): Input layout to transform. * ​is\_k\_major (`Bool`): Whether the layout is K-major (True) or MN-major (False). **Returns:** `Layout` - A transformed layout compatible with WGMMA descriptors. --- ## wgmma_c_layout `wgmma_c_layout[mma_m: Int, mma_n: Int, C: Layout]() -> List[Layout]` Generates three layouts for mapping WGMMA C matrix coordinates. This function creates three layout mappings that are essential for working with WGMMA (Warp Group Matrix Multiply-Accumulate) operations: 1. A projection layout that maps linearized indices to row coordinates (i) 2. A projection layout that maps linearized indices to column coordinates (j) 3. A composite layout that maps thread and vector coordinates to linearized indices across multiple MMA tiles These layouts are particularly useful for operations like attention masking and matrix multiplication epilogues, where register values need to be mapped to the coordinate system of the C matrix. Note: This function enforces constraints on the WGMMA dimensions and ensures the C matrix dimensions are compatible with the WGMMA instruction size. **Parameters:** * ​mma\_m (`Int`): The M dimension (rows) of a single WGMMA instruction, must be 64. * ​mma\_n (`Int`): The N dimension (columns) of a single WGMMA instruction, must be a multiple of 8. * ​C (`Layout`): The layout of the C matrix within a thread block. **Returns:** `List[Layout]` - A list containing three layouts: 1. proj\_i: Maps linearized indices to row coordinates 2. proj\_j: Maps linearized indices to column coordinates 3. TV\_tile\_to\_idx: Maps thread/vector/tile coordinates to linearized indices --- ## wgmma_c_thread_layout `wgmma_c_thread_layout[C: Layout]() -> Layout` Returns the thread layout component for WGMMA C matrix. Generates the first mode of the WGMMA C layout, which maps thread coordinates to linearized indices in the output matrix. **Parameters:** * ​C (`Layout`): The layout of the C matrix. **Returns:** `Layout` - A layout mapping thread coordinates to linearized indices. --- ## wgmma_output_layout `wgmma_output_layout[mma_n: Int, C: Layout]() -> Layout` Returns the output layout component for WGMMA C matrix. Generates the second mode of the WGMMA C layout, which maps output vector coordinates to linearized indices in the output matrix. **Parameters:** * ​mma\_n (`Int`): The N dimension of the WGMMA instruction. * ​C (`Layout`): The layout of the C matrix. **Returns:** `Layout` - A layout mapping output vector coordinates to linearized indices. --- ## PipelineState `@register_passable(trivial)` `struct PipelineState[num_stages: Int]` Manages state for a multi-stage pipeline with circular buffer semantics. PipelineState provides a mechanism for tracking the current stage in a multi-stage pipeline, particularly useful for double or triple buffering in GPU tensor operations.
It maintains an index that cycles through the available stages, a phase bit that toggles when the index wraps around, and a monotonically increasing count. This struct is commonly used with TMA operations to coordinate the use of multiple buffers in a pipeline fashion, allowing for overlapping computation and data transfer. ## Parameters * ​num\_stages (`Int`): The number of stages in the pipeline (e.g., 2 for double buffering, 3 for triple buffering). ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Initialize a PipelineState with default values. Creates a new PipelineState with index 0, phase 0, and count 0. `__init__(index: Int, phase: Int, count: Int) -> Self` Initialize a PipelineState with specific values. Creates a new PipelineState with the specified index, phase, and count. **Args:** * ​index (`Int`): The initial stage index. * ​phase (`Int`): The initial phase value (0 or 1). * ​count (`Int`): The initial count value. ### `index` `index(self) -> Int` Get the current stage index. **Returns:** The current index value, which ranges from 0 to num\_stages-1. ### `phase` `phase(self) -> SIMD[uint32, 1]` Get the current phase bit. **Returns:** The current phase value (0 or 1), which toggles when the index wraps around. ### `step` `step(mut self)` Advance the pipeline state to the next stage. Increments the index and count. When the index reaches num\_stages, it wraps around to 0 and toggles the phase bit. This function is used to move to the next buffer in a multi-buffer pipeline, implementing circular buffer semantics. --- ## SharedMemBarrier `@register_passable(trivial)` `struct SharedMemBarrier` A hardware-accelerated synchronization primitive for GPU shared memory operations. This struct provides a barrier mechanism optimized for coordinating thread execution and memory transfers in GPU kernels, particularly for Tensor Memory Accelerator (TMA) operations. It enables efficient synchronization between threads and memory operations by leveraging hardware-specific barrier instructions. Key features: * Thread synchronization across thread blocks * Memory transfer completion tracking * Hardware-accelerated barrier operations * Support for phased synchronization This barrier is particularly useful for ensuring that shared memory operations complete before dependent computations begin, which is critical for maintaining data consistency in high-performance GPU kernels. ## Fields * ​mbar (`SIMD[int64, 1]`): Shared memory location used for the barrier state. This field stores an 8-byte aligned shared memory location that maintains the state of the barrier. The memory must be in shared address space to be accessible by all threads in a block. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `init` `init(ref [3] self, num_threads: SIMD[int32, 1] = __init__[__mlir_type.!pop.int_literal](1))` Initialize the barrier state with the expected number of threads. Sets up the barrier to expect arrivals from the specified number of threads before it can be satisfied. This is essential for coordinating thread synchronization in GPU kernels. **Args:** * ​num\_threads (`SIMD[int32, 1]`): Number of threads that must arrive at the barrier before it is satisfied. Defaults to 1. ### `expect_bytes` `expect_bytes(ref [3] self, bytes: SIMD[int32, 1])` Configure the barrier to expect a specific number of bytes to be transferred. 
Used with TMA operations to indicate the expected size of data transfer. The barrier will be satisfied when the specified number of bytes has been transferred, enabling efficient coordination of memory operations. **Args:** * ​bytes (`SIMD[int32, 1]`): Number of bytes expected to be transferred. ### `expect_bytes_relaxed` `expect_bytes_relaxed(ref [3] self, bytes: SIMD[int32, 1]) -> SIMD[uint64, 1]` Configure the barrier to expect a specific number of bytes to be transferred. Used with TMA operations to indicate the expected size of data transfer. The barrier will be satisfied when the specified number of bytes has been transferred, enabling efficient coordination of memory operations. **Args:** * ​bytes (`SIMD[int32, 1]`): Number of bytes expected to be transferred. **Returns:** The barrier's state value. ### `wait` `wait(ref [3] self, phase: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0))` Wait until the barrier is satisfied. Blocks the calling thread until the barrier is satisfied, either by the expected number of threads arriving or the expected data transfer completing. This method implements an efficient spin-wait mechanism optimized for GPU execution. Note: Minimizes thread divergence during synchronization by using hardware-accelerated barrier instructions. **Args:** * ​phase (`SIMD[uint32, 1]`): The phase value to check against. Defaults to 0. ### `wait_acquire` `wait_acquire[scope: Scope](ref [3] self, phase: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0))` Acquire and wait until the barrier is satisfied. Blocks the calling thread until the barrier is satisfied, either by the expected number of threads arriving or the expected data transfer completing. This method implements an efficient spin-wait mechanism optimized for GPU execution. Note: Minimizes thread divergence during synchronization by using hardware-accelerated barrier instructions. **Parameters:** * ​scope (`Scope`): The scope of the barrier. **Args:** * ​phase (`SIMD[uint32, 1]`): The phase value to check against. Defaults to 0. ### `wait_relaxed` `wait_relaxed[scope: Scope](ref [3] self, phase: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0))` Wait until the barrier is satisfied with relaxed ordering. Blocks the calling thread until the barrier is satisfied, either by the expected number of threads arriving or the expected data transfer completing. This method implements an efficient spin-wait mechanism optimized for GPU execution. Note: Minimizes thread divergence during synchronization by using hardware-accelerated barrier instructions. **Parameters:** * ​scope (`Scope`): The scope of the barrier. **Args:** * ​phase (`SIMD[uint32, 1]`): The phase value to check against. Defaults to 0. ### `unsafe_ptr` `unsafe_ptr(ref [3] self) -> UnsafePointer[SIMD[int64, 1], address_space=AddressSpace(3), alignment=8, mut=self_is_mut, origin=self_is_origin]` Get an unsafe pointer to the barrier's memory location. Provides low-level access to the shared memory location storing the barrier state. This method is primarily used internally by other barrier operations that need direct access to the underlying memory. **Returns:** An unsafe pointer to the barrier's memory location in shared memory, properly typed and aligned for barrier operations. ### `arrive_cluster` `arrive_cluster(ref [3] self, cta_id: SIMD[uint32, 1], count: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](1))` Signal arrival at the barrier from a specific CTA (Cooperative Thread Array) in a cluster. 
This method is used in multi-CTA scenarios to coordinate barrier arrivals across different CTAs within a cluster. It enables efficient synchronization across thread blocks in clustered execution models. **Args:** * ​cta\_id (`SIMD[uint32, 1]`): The ID of the CTA (Cooperative Thread Array) that is arriving. * ​count (`SIMD[uint32, 1]`): The number of arrivals to signal. Defaults to 1. ### `arrive` `arrive(ref [3] self) -> Int` Signal arrival at the barrier and return the arrival count. This method increments the arrival count at the barrier and returns the updated count. It's used to track how many threads have reached the synchronization point. **Returns:** The updated arrival count after this thread's arrival. --- ## TMATensorTile `struct TMATensorTile[dtype: DType, layout: Layout, desc_layout: Layout = layout]` A hardware-accelerated tensor memory access (TMA) tile for efficient asynchronous data movement. The TMATensorTile struct provides a high-performance interface for asynchronous data transfers between global memory and shared memory in GPU tensor operations. It encapsulates a TMA descriptor that defines the memory access pattern and provides methods for various asynchronous operations. Performance: * Hardware-accelerated memory transfers using TMA instructions * Supports prefetching of descriptors for latency hiding * Enforces 128-byte alignment requirements for optimal memory access ## Parameters * ​dtype (`DType`): The data type of the tensor elements. * ​layout (`Layout`): The layout of the tile in shared memory, typically specified as row\_major. * ​desc\_layout (`Layout`): The layout of the descriptor, which can be different from the shared memory layout to accommodate hardware requirements like WGMMA. Defaults to `layout`. ## Fields * ​descriptor (`TMADescriptor`): The TMA descriptor that defines the memory access pattern. This field stores the hardware descriptor that encodes information about: * The source tensor's memory layout and dimensions * The tile shape and access pattern * Swizzling configuration for optimal memory access The descriptor is used by the GPU's Tensor Memory Accelerator hardware to efficiently transfer data between global and shared memory. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, descriptor: TMADescriptor)` Initializes a new TMATensorTile with the provided TMA descriptor. **Args:** * ​descriptor (`TMADescriptor`): The TMA descriptor that defines the memory access pattern. ### `__copyinit__` `__copyinit__(out self, other: Self)` Copy initializes this `TMATensorTile` from another instance. **Args:** * ​other (`Self`): The other `TMATensorTile` instance to copy from. ### `prefetch_descriptor` `prefetch_descriptor(self)` Prefetches the TMA descriptor into cache to reduce latency. This method helps hide memory access latency by prefetching the descriptor before it's needed for actual data transfers. ### `async_copy` `async_copy[cta_group: Int = 1](self, dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ref [3] mem_barrier: SharedMemBarrier, coords: Tuple[UInt, UInt])` Schedules an asynchronous copy from global memory to shared memory at specified coordinates. This method initiates a hardware-accelerated asynchronous transfer of data from global memory to the specified destination in shared memory. 
The transfer is tracked by the provided memory barrier. **Constraints:** * The destination tensor must be 128-byte aligned in shared memory. * The descriptor layout may be smaller than the shared memory tile shape to accommodate hardware requirements. **Parameters:** * ​cta\_group (`Int`): If the TMA is issued with cta\_group == 2, only the leader CTA needs to be notified upon completion. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in shared memory where data will be copied. Must be 128-byte aligned. * ​mem\_barrier (`SharedMemBarrier`): The memory barrier used to track and synchronize the asynchronous transfer. * ​coords (`Tuple[UInt, UInt]`): The 2D coordinates in the source tensor from which to copy data. ### `async_copy_3d` `async_copy_3d(self, dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ref [3] mem_barrier: SharedMemBarrier, coords: Tuple[UInt, UInt, UInt])` Schedules an asynchronous copy from global memory to shared memory at specified 3D coordinates. This method initiates a hardware-accelerated asynchronous transfer of data from global memory to the specified destination in shared memory for 3D tensors. The transfer is tracked by the provided memory barrier. **Constraints:** * The destination tensor must be 128-byte aligned in shared memory. * The descriptor layout may be smaller than the shared memory tile shape to accommodate hardware requirements. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in shared memory where data will be copied. Must be 128-byte aligned. * ​mem\_barrier (`SharedMemBarrier`): The memory barrier used to track and synchronize the asynchronous transfer. * ​coords (`Tuple[UInt, UInt, UInt]`): The 3D coordinates in the source tensor from which to copy data. ### `async_multicast_load` `async_multicast_load[cta_group: Int = 1](self, dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ref [3] mem_barrier: SharedMemBarrier, coords: Tuple[UInt, UInt], multicast_mask: SIMD[uint16, 1])` Schedules an asynchronous multicast load from global memory to multiple shared memory locations. This method initiates a hardware-accelerated asynchronous transfer of data from global memory to multiple destination locations in shared memory across different CTAs (Cooperative Thread Arrays) as specified by the multicast mask. **Constraints:** The destination tensor must be 128-byte aligned in shared memory. **Parameters:** * ​cta\_group (`Int`): If the TMA is issued with cta\_group == 2, only the leader CTA needs to be notified upon completion. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in shared memory where data will be copied. Must be 128-byte aligned. 
* ​mem\_barrier (`SharedMemBarrier`): The memory barrier used to track and synchronize the asynchronous transfer. * ​coords (`Tuple[UInt, UInt]`): The 2D coordinates in the source tensor from which to copy data. * ​multicast\_mask (`SIMD[uint16, 1]`): A bit mask specifying which CTAs should receive the data. ### `async_store` `async_store(self, src: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], coords: Tuple[UInt, UInt])` Schedules an asynchronous store from shared memory to global memory. This method initiates a hardware-accelerated asynchronous transfer of data from shared memory to global memory at the specified coordinates. **Constraints:** The source tensor must be 128-byte aligned in shared memory. **Args:** * ​src (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in shared memory from which data will be copied. Must be 128-byte aligned. * ​coords (`Tuple[UInt, UInt]`): The 2D coordinates in the destination tensor where data will be stored. ### `async_reduce` `async_reduce[reduction_kind: ReduceOp](self, src: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], coords: Tuple[UInt, UInt])` Schedules an asynchronous reduction operation from shared memory to global memory. This method initiates a hardware-accelerated asynchronous reduction operation that combines data from shared memory with data in global memory using the specified reduction operation. The reduction is performed element-wise at the specified coordinates in the global tensor. **Constraints:** The source tensor must be 128-byte aligned in shared memory. **Parameters:** * ​reduction\_kind (`ReduceOp`): The type of reduction operation to perform (e.g., ADD, MIN, MAX). This determines how values are combined during the reduction. **Args:** * ​src (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in shared memory containing the data to be reduced. Must be 128-byte aligned. * ​coords (`Tuple[UInt, UInt]`): The 2D coordinates in the destination tensor where the reduction will be applied. ### `commit_group` `commit_group(self)` Commits all prior initiated but uncommitted TMA instructions into a group. This function behaves the same as `cp_async_bulk_commit_group`, which creates a synchronization point for bulk TMA transfer. ### `wait_group` `wait_group[n: Int = 0](self)` Wait for the completion of asynchronous copy until a specified number of groups are waiting. This function behaves the same as `cp_async_bulk_wait_group`, which causes the executing thread to wait until at most the specified number of the most recent TMA copies are pending. **Parameters:** * ​n (`Int`): The number of pending groups left. ### `smem_tensormap_init` `smem_tensormap_init(self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3)])` Initializes a TMA descriptor in shared memory from this tensor tile's descriptor. 
This method copies the TMA descriptor from global memory to shared memory, allowing for faster access during kernel execution. The descriptor is copied in 16-byte chunks using asynchronous copy operations for efficiency. Note: * Only one thread should call this method to avoid race conditions * The descriptor is copied in 8 chunks of 16 bytes each (total 128 bytes) **Args:** * ​smem\_tma\_descriptor\_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3)]`): Pointer to the location in shared memory where the descriptor will be stored. Must be properly aligned. ### `replace_tensormap_global_address_in_gmem` `replace_tensormap_global_address_in_gmem[dtype: DType](self, src_ptr: UnsafePointer[SIMD[dtype, 1]])` Replaces the global memory address in the TMA descriptor stored in global memory. This method allows dynamically changing the source tensor for TMA operations without recreating the entire descriptor, which is useful for reusing descriptors with different data sources. The operation modifies the descriptor in global memory directly. Note: A memory fence may be required after this operation to ensure visibility of the changes to other threads. **Parameters:** * ​dtype (`DType`): The data type of the new source tensor. **Args:** * ​src\_ptr (`UnsafePointer[SIMD[dtype, 1]]`): The new source tensor whose address will replace the current one in the descriptor. Must have compatible layout with the original tensor. ### `tensormap_fence_acquire` `tensormap_fence_acquire(self)` Establishes a memory fence for TMA operations with acquire semantics. This method ensures proper ordering of memory operations by creating a barrier that prevents subsequent TMA operations from executing before prior operations have completed. It is particularly important when reading from a descriptor that might have been modified by other threads or processes. The acquire semantics ensure that all memory operations after this fence will observe any modifications made to the descriptor before the fence. Notes: * The entire warp must call this function as the instruction is warp-aligned. * Typically used in pairs with `tensormap_fence_release` for proper synchronization. ### `tensormap_fence_release` `tensormap_fence_release(self)` Establishes a memory fence for TMA operations with release semantics. This method ensures proper ordering of memory operations by creating a barrier that ensures all prior memory operations are visible before subsequent operations can proceed. It is particularly important when modifying a TMA descriptor in global memory that might be read by other threads or processes. The release semantics ensure that all memory operations before this fence will be visible to any thread that observes operations after the fence. Notes: * Typically used after modifying a tensormap descriptor in global memory. * Often paired with `tensormap_fence_acquire` for proper synchronization. ### `replace_tensormap_global_address_in_shared_mem` `replace_tensormap_global_address_in_shared_mem[dtype: DType](self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], src_ptr: UnsafePointer[SIMD[dtype, 1]])` Replaces the global memory address in the TMA descriptor stored in shared memory. This method allows dynamically changing the source tensor for TMA operations without recreating the entire descriptor, which is useful for reusing descriptors with different data sources. 
The operation modifies a descriptor that has been previously copied to shared memory. Notes: * Only one thread should call this method to avoid race conditions. * A memory fence may be required after this operation to ensure visibility of the changes to other threads. * Typically used with descriptors previously initialized with `smem_tensormap_init`. **Parameters:** * ​dtype (`DType`): The data type of the new source tensor. **Args:** * ​smem\_tma\_descriptor\_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the TMA descriptor in shared memory that will be modified. * ​src\_ptr (`UnsafePointer[SIMD[dtype, 1]]`): The new source tensor whose address will replace the current one in the descriptor. ### `tensormap_cp_fence_release` `tensormap_cp_fence_release(self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3)])` Establishes a memory fence for TMA operations with release semantics for shared memory descriptors. This method ensures proper ordering of memory operations by creating a barrier that ensures all prior memory operations are visible before subsequent operations can proceed. It is specifically designed for synchronizing between global memory and shared memory TMA descriptors. The release semantics ensure that all memory operations before this fence will be visible to any thread that observes operations after the fence. Notes: * The entire warp must call this function as the instruction is warp-aligned. * Typically used after modifying a tensormap descriptor in shared memory. * More specialized than the general `tensormap_fence_release` for cross-memory space synchronization. **Args:** * ​smem\_tma\_descriptor\_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3)]`): Pointer to the TMA descriptor in shared memory that is being synchronized with the global memory descriptor. ### `replace_tensormap_global_dim_strides_in_shared_mem` `replace_tensormap_global_dim_strides_in_shared_mem[dtype: DType, only_update_dim_0: Bool, /, *, rank: Int](self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], gmem_dims: IndexList[rank], gmem_strides: IndexList[rank])` Replaces dimensions and strides in a TMA descriptor stored in shared memory. Note: This function is only supported for CUDA versions >= 12.5. This function allows dynamically modifying the dimensions and strides of a TMA descriptor that has been previously initialized in shared memory. If only the first dimension (dim 0) is updated, then updating strides can be skipped. Notes: * Only one thread should call this method to avoid race conditions. * A memory fence may be required after this operation to ensure visibility of the changes to other threads. **Parameters:** * ​dtype (`DType`): The data type of the new source tensor. * ​only\_update\_dim\_0 (`Bool`): If true, only the first dimension (dim 0) is updated and stride updates are skipped. * ​rank (`Int`): The rank of the tensor. **Args:** * ​smem\_tma\_descriptor\_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the TMA descriptor in shared memory that will be modified. * ​gmem\_dims (`IndexList[rank]`): The global dimensions of the tensor to be updated. * ​gmem\_strides (`IndexList[rank]`): The global strides of the tensor to be updated. 
`replace_tensormap_global_dim_strides_in_shared_mem[dtype: DType, tensor_rank: Int, dim_idx: Int](self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], dim_value: SIMD[uint32, 1], dim_stride: Optional[SIMD[uint64, 1]] = Optional(None))` Replaces dimensions and strides in a TMA descriptor stored in shared memory. Note: This function is only supported for CUDA versions >= 12.5. This function allows dynamically modifying the dimensions and strides of a TMA descriptor that has been previously initialized in shared memory. If only the first dimension is updated, then updating strides can be skipped. Notes: * Only one thread should call this method to avoid race conditions. * A memory fence may be required after this operation to ensure visibility of the changes to other threads. **Parameters:** * ​dtype (`DType`): The data type of the source tensor in GMEM. * ​tensor\_rank (`Int`): The rank of the source tensor in GMEM. * ​dim\_idx (`Int`): The index of the dimension to be updated in the TMA descriptor with the provided dimension and stride values at runtime. **Args:** * ​smem\_tma\_descriptor\_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the TMA descriptor in shared memory that will be modified. * ​dim\_value (`SIMD[uint32, 1]`): The new dimension value to be set. * ​dim\_stride (`Optional[SIMD[uint64, 1]]`): The new stride value to be set. --- ## TMATensorTileArray `@register_passable(trivial)` `struct TMATensorTileArray[num_of_tensormaps: Int, dtype: DType, cta_tile_layout: Layout, desc_layout: Layout]` An array of TMA descriptors. ## Parameters * ​num\_of\_tensormaps (`Int`): The number of TMA descriptors (tensor maps). * ​dtype (`DType`): The data type of the tensor elements. * ​cta\_tile\_layout (`Layout`): The layout of the tile in shared memory, typically specified as row\_major. * ​desc\_layout (`Layout`): The layout of the descriptor, which can be different from the shared memory layout to accommodate hardware requirements like WGMMA. ## Fields * ​tensormaps\_ptr (`UnsafePointer[SIMD[uint8, 1]]`): A static tuple of pointers to TMA descriptors. This field stores an array of pointers to `TMATensorTile` instances, where each pointer references a TMA descriptor in device memory. The array has a fixed size determined by the num\_of\_tensormaps parameter. The TMA descriptors are used by the GPU hardware to efficiently transfer data between global and shared memory with specific memory access patterns defined by the layouts. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `descriptor_bytes` `alias descriptor_bytes = 128` Size of the TMA descriptor in bytes. This is a constant value that represents the size of the TMA descriptor in bytes. It is used to calculate the offset of the TMA descriptor in the device memory. ## Methods ### `__init__` `__init__(out self, tensormaps_device: DeviceBuffer[uint8])` Initializes a new TMATensorTileArray. **Args:** * ​tensormaps\_device (`DeviceBuffer[uint8]`): Device buffer to store TMA descriptors. ### `__getitem__` `__getitem__(self, index: Int) -> UnsafePointer[TMATensorTile[dtype, cta_tile_layout, desc_layout]]` Retrieve a TMA descriptor. **Args:** * ​index (`Int`): Index of the TMA descriptor. **Returns:** `UnsafePointer` to the `TMATensorTile` at the specified index. 
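Because `descriptor_bytes` is 128, the backing device buffer must hold `num_of_tensormaps * 128` bytes. The following is a minimal host-side sketch of constructing and indexing such an array; the import paths and the `enqueue_create_buffer` call reflect common `DeviceContext` usage and are assumptions, not guarantees of this reference:

```mojo
from gpu.host import DeviceContext
from layout import Layout
from layout.tma_async import TMATensorTileArray  # assumed module path

def main():
    alias n = 4  # number of tensormaps
    alias tile_layout = Layout.row_major(64, 64)

    var ctx = DeviceContext()
    # One 128-byte descriptor slot per tensormap (descriptor_bytes == 128).
    var descriptors = ctx.enqueue_create_buffer[DType.uint8](n * 128)
    var tiles = TMATensorTileArray[
        n, DType.bfloat16, tile_layout, tile_layout
    ](descriptors)
    # Indexing yields an UnsafePointer to the TMATensorTile at that slot.
    var first_tile = tiles[0]
```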
--- ## create_tma_tile `create_tma_tile[*tile_sizes: Int, *, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0))](ctx: DeviceContext, tensor: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> TMATensorTile[dtype, row_major[::Origin[::Bool(_to_int_tuple[*::Int]())]` Creates a `TMATensorTile` with specified tile dimensions and swizzle mode. This function creates a hardware-accelerated Tensor Memory Access (TMA) descriptor for efficient asynchronous data transfers between global memory and shared memory. It configures the tile dimensions and memory access patterns based on the provided parameters. **Constraints:** * The last dimension's size in bytes must not exceed the swizzle mode's byte limit (32B for SWIZZLE\_32B, 64B for SWIZZLE\_64B, 128B for SWIZZLE\_128B). * Only supports 2D tensors in this overload. **Parameters:** * ​\*tile\_sizes (`Int`): The dimensions of the tile to be transferred. For 2D tensors, this should be \[height, width]. The dimensions determine the shape of data transferred in each TMA operation. * ​swizzle\_mode (`TensorMapSwizzle`): The swizzling mode to use for memory access optimization. Swizzling can improve memory access patterns for specific hardware configurations. **Args:** * ​ctx (`DeviceContext`): The CUDA device context used to create the TMA descriptor. * ​tensor (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor from which data will be transferred. This defines the global memory layout and data type. **Returns:** A `TMATensorTile` configured with the specified tile dimensions and swizzle mode, ready for use in asynchronous data transfer operations. `create_tma_tile[type: DType, rank: Int, tile_shape: IndexList[rank], /, is_k_major: Bool = True, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), *, __tile_layout: Layout = row_major(tile_shape.__getitem__[::Indexer](0), tile_shape.__getitem__[::Indexer](1)), __desc_layout: Layout = _tma_desc_tile_layout[::DType,::Int,::IndexList[$1, ::DType()](ctx: DeviceContext, tensor: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> TMATensorTile[type, __tile_layout, __desc_layout]` Creates a `TMATensorTile` with advanced configuration options for 2D or 3D tensors. This overload provides more control over the TMA descriptor creation, allowing specification of data type, rank, and layout orientation. It supports both 2D and 3D tensors and provides fine-grained control over the memory access patterns. **Constraints:** * Only supports 2D and 3D tensors (rank must be 2 or 3). * For non-SWIZZLE\_NONE modes, the K dimension size in bytes must be a multiple of the swizzle mode's byte size. * For MN-major layout, only SWIZZLE\_128B is supported. * For 3D tensors, only K-major layout is supported. **Parameters:** * ​type (`DType`): The data type of the tensor elements. * ​rank (`Int`): The dimensionality of the tensor (must be 2 or 3). * ​tile\_shape (`IndexList[rank]`): The shape of the tile to be transferred. 
* ​is\_k\_major (`Bool`): Whether the tensor layout is K-major (True) or MN-major (False). K-major is typically used for weight matrices, while MN-major is used for activation matrices in matrix multiplication operations. Defaults to True. * ​swizzle\_mode (`TensorMapSwizzle`): The swizzling mode to use for memory access optimization. Defaults to `TensorMapSwizzle.SWIZZLE_NONE`. * ​\_\_tile\_layout (`Layout`): Internal parameter for the tile layout in shared memory. Defaults to `Layout.row_major(tile_shape[0], tile_shape[1])`. * ​\_\_desc\_layout (`Layout`): Internal parameter for the descriptor layout, which may differ from the tile layout to accommodate hardware requirements. Defaults to `_tma_desc_tile_layout[...]`. **Args:** * ​ctx (`DeviceContext`): The CUDA device context used to create the TMA descriptor. * ​tensor (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor from which data will be transferred. This defines the global memory layout and must match the specified data type. **Returns:** A `TMATensorTile` configured with the specified parameters, ready for use in asynchronous data transfer operations. --- ## tma_async Tensor Memory Accelerator (TMA) Asynchronous Operations Module Provides high-performance abstractions for NVIDIA's Tensor Memory Accelerator (TMA), enabling efficient asynchronous data movement between global and shared memory in GPU kernels. It is designed for use with NVIDIA Hopper architecture and newer GPUs that support TMA instructions. ## Key Components: * `TMATensorTile`: Core struct that encapsulates a TMA descriptor for efficient data transfers between global and shared memory with various access patterns and optimizations. * `SharedMemBarrier`: Synchronization primitive for coordinating asynchronous TMA operations, ensuring data transfers complete before dependent operations begin. * `PipelineState`: Helper struct for managing multi-stage pipeline execution with circular buffer semantics, enabling efficient double or triple buffering techniques. * `create_tma_tile`: Factory functions for creating optimized `TMATensorTile` instances with various configurations for different tensor shapes and memory access patterns. ## Structs * [​`PipelineState`](./PipelineState): Manages state for a multi-stage pipeline with circular buffer semantics. * [​`SharedMemBarrier`](./SharedMemBarrier): A hardware-accelerated synchronization primitive for GPU shared memory operations. * [​`TMATensorTile`](./TMATensorTile): A hardware-accelerated tensor memory access (TMA) tile for efficient asynchronous data movement. * [​`TMATensorTileArray`](./TMATensorTileArray): An array of TMA descriptors. ## Functions * [​`create_tma_tile`](./create_tma_tile): Creates a `TMATensorTile` with specified tile dimensions and swizzle mode. 
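In practice these pieces are combined in a load pipeline: one thread arms the barrier with the expected byte count and issues the copy, and every thread then waits on the barrier before reading the tile. The following is a minimal single-stage sketch; the import paths (`layout.tma_async`, `gpu`, `memory`) and the `stack_allocation` idiom are assumptions based on common kernel code, not a verbatim example from this reference:

```mojo
from gpu import barrier, thread_idx
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.tma_async import PipelineState, SharedMemBarrier, TMATensorTile
from memory import stack_allocation
from sys import sizeof

fn tma_load_kernel[
    dtype: DType, tile_layout: Layout, desc_layout: Layout
](tma_tile: TMATensorTile[dtype, tile_layout, desc_layout]):
    alias M = tile_layout.shape[0].value()
    alias N = tile_layout.shape[1].value()

    # Destination tile in shared memory; TMA requires 128-byte alignment.
    var smem_tile = LayoutTensor[
        dtype,
        tile_layout,
        MutableAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # Barrier in shared memory that tracks the transfer.
    var mbar = stack_allocation[
        1, SharedMemBarrier, address_space = AddressSpace.SHARED, alignment=8
    ]()

    if thread_idx.x == 0:
        # One thread arms the barrier with the expected byte count,
        # then issues the asynchronous copy tracked by that barrier.
        mbar[0].init()
        mbar[0].expect_bytes(M * N * sizeof[dtype]())
        tma_tile.async_copy(smem_tile, mbar[0], (UInt(0), UInt(0)))

    # All threads block until the expected bytes have arrived (phase 0).
    mbar[0].wait()
    barrier()

    # For multi-stage pipelines, a PipelineState supplies the buffer index
    # and phase instead of the constants above:
    #   var state = PipelineState[2]()
    #   mbar[state.index()].wait(state.phase())
    #   state.step()
```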
--- ## accumulate --- ## apple_batched_matmul `apple_batched_matmul[*, transpose_b: Bool = False, elementwise_epilogue_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive])` --- ## apple_gemv `apple_gemv[*, b_packed: Bool, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape])` --- ## apple_matmul `apple_matmul[*, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](cblas_gemm_fn: fn(_CBLASOrder, _CBLASTranspose, _CBLASTranspose, SIMD[int32, 1], SIMD[int32, 1], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1]) -> None, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive])` `apple_matmul[*, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive])` --- ## get_cblas_f32_function `get_cblas_f32_function() -> fn(_CBLASOrder, _CBLASTranspose, _CBLASTranspose, SIMD[int32, 1], SIMD[int32, 1], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1]) -> None` --- ## apple_accelerate ## Aliases ### `APPLE_ACCELERATE` `alias APPLE_ACCELERATE = _Global[__init__[__mlir_type.!kgen.string]("APPLE_ACCELERATE"), _OwnedDLHandle, _init_dylib]` ### `cblas_gemm_type` `alias cblas_gemm_type = fn(_CBLASOrder, _CBLASTranspose, _CBLASTranspose, SIMD[int32, 1], SIMD[int32, 1], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1]) -> None` ### `LIB_ACC_PATH` `alias LIB_ACC_PATH = "/System/Library/Frameworks/Accelerate.framework/Accelerate"` ## 
Functions * [​`apple_batched_matmul`](./apple_batched_matmul): * [​`apple_gemv`](./apple_gemv): * [​`apple_matmul`](./apple_matmul): * [​`get_cblas_f32_function`](./get_cblas_f32_function): * [​`use_apple_accelerate_lib`](./use_apple_accelerate_lib): --- ## use_apple_accelerate_lib `use_apple_accelerate_lib[c_type: DType, a_type: DType, b_type: DType]() -> Bool` --- ## dot_at_b `dot_at_b(c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive])` --- ## dot_at_b_impl `dot_at_b_impl(c: NDBuffer[float32, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(16, 16)))], a: NDBuffer[float32, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(16, 16)))], b: NDBuffer[float32, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(16, 16)))])` `dot_at_b_impl(c: NDBuffer[float16, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(32, 32)))], a: NDBuffer[float16, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(32, 32)))], b: NDBuffer[float16, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(32, 32)))])` --- ## extrx `extrx(gpr: Int)` Extracts a row or moves it to x, result in amx0. --- ## extry `extry(gpr: Int)` Extracts a row or moves it to y, result in amx0. --- ## fma `fma[mode: StringSlice[StaticConstantOrigin], type: DType](z_row_index: Int, x_row_index: Int, y_row_index: Int, clear_z: Bool)` --- ## fma16 `fma16(gpr: Int)` Float16 matrix multiply and add. --- ## fma32 `fma32(gpr: Int)` Float32 matrix multiply and add. --- ## fma64 `fma64(gpr: Int)` Float64 matrix multiply and add. --- ## fms16 `fms16(gpr: Int)` Float16 matrix multiply and subtract. --- ## fsm32 `fsm32(gpr: Int)` Float32 matrix multiply and subtract. --- ## fsm64 `fsm64(gpr: Int)` Float64 matrix multiply and subtract. --- ## genlut `genlut(gpr: Int)` --- ## apple_amx_intrinsics ## Functions * [​`dot_at_b`](./dot_at_b): * [​`dot_at_b_impl`](./dot_at_b_impl): * [​`extrx`](./extrx): Extracts a row or moves it to x, result in amx0. * [​`extry`](./extry): Extracts a row or moves it to y, result in amx0. * [​`fma`](./fma): * [​`fma16`](./fma16): Float16 matrix multiply and add. * [​`fma32`](./fma32): Float32 matrix multiply and add. * [​`fma64`](./fma64): Float64 matrix multiply and add. * [​`fms16`](./fms16): Float16 matrix multiply and subtract. * [​`fsm32`](./fsm32): Float32 matrix multiply and subtract. * [​`fsm64`](./fsm64): Float64 matrix multiply and subtract. * [​`genlut`](./genlut): * [​`ldx`](./ldx): * [​`ldy`](./ldy): * [​`ldz`](./ldz): * [​`ldzi`](./ldzi): * [​`load_z`](./load_z): * [​`mac16`](./mac16): SI16 matrix multiply and add. * [​`matfp`](./matfp): Float16 matrix multiply. * [​`max_int__`](./max_int__): UI16 matrix multiply. * [​`read_x`](./read_x): * [​`read_y`](./read_y): * [​`store_x`](./store_x): * [​`store_y`](./store_y): * [​`store_z`](./store_z): * [​`stx`](./stx): * [​`sty`](./sty): * [​`stz`](./stz): * [​`stzi`](./stzi): * [​`transpose_z_to_x_or_y`](./transpose_z_to_x_or_y): * [​`vec_int__`](./vec_int__): Horizontal ui16 multiply `z0[i] += x0[i] * y0[i]`. * [​`vecfp`](./vecfp): Horizontal float16 multiply `z0[i] += x0[i] * y0[i]`. 
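As a plain-Mojo cross-check of the horizontal forms above, this scalar reference spells out the `z0[i] += x0[i] * y0[i]` semantics that the `vec_int__`/`vecfp` docstrings describe. It is illustrative only; the real intrinsics operate on AMX register state, not on SIMD values:

```mojo
# Reference for the horizontal multiply-accumulate described by the
# `vec_int__` / `vecfp` docstrings: z0[i] += x0[i] * y0[i], applied lanewise.
fn vec_mla_reference[
    dtype: DType, width: Int
](z: SIMD[dtype, width], x: SIMD[dtype, width], y: SIMD[dtype, width]) -> SIMD[
    dtype, width
]:
    return z + x * y
```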
--- ## ldx `ldx(gpr: Int)` --- ## ldy `ldy(gpr: Int)` --- ## ldz `ldz(gpr: Int)` --- ## ldzi `ldzi(gpr: Int)` --- ## load_z `load_z[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)` --- ## mac16 `mac16(gpr: Int)` SI16 matrix multiply and add. --- ## matfp `matfp(gpr: Int)` Float16 matrix multiply. --- ## max_int__ `max_int__(gpr: Int)` UI16 matrix multiply. --- ## read_x `read_x[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)` --- ## read_y `read_y[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)` --- ## store_x `store_x[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)` --- ## store_y `store_y[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)` --- ## store_z `store_z[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)` --- ## stx `stx(gpr: Int)` --- ## sty `sty(gpr: Int)` --- ## stz `stz(gpr: Int)` --- ## stzi `stzi(gpr: Int)` --- ## transpose_z_to_x_or_y `transpose_z_to_x_or_y[destination: StringSlice[StaticConstantOrigin], type: DType](z_col_index: Int, xy_row_index: Int, z_row_suboffset: Int)` --- ## vec_int__ `vec_int__(gpr: Int)` Horizontal ui16 multiply `z0[i] += x0[i] * y0[i]`. --- ## vecfp `vecfp(gpr: Int)` Horizontal float16 multiply `z0[i] += x0[i] * y0[i]`. --- ## batched_matmul `batched_matmul[rank: Int, a_type: DType, b_type: DType, c_type: DType, //, *, transpose_a: Bool, transpose_b: Bool, elementwise_epilogue_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), saturated_vnni: Bool = False, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](c_buf: NDBuffer[c_type, rank, origin], a_buf: NDBuffer[a_type, rank, origin], b_buf: NDBuffer[b_type, rank, origin], *, context: DeviceContextPtr = DeviceContextPtr())` `batched_matmul[rank: Int, a_type: DType, b_type: DType, c_type: DType, //, *, transpose_b: Bool, elementwise_epilogue_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), saturated_vnni: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](c_buf: NDBuffer[c_type, rank, origin], a_buf: NDBuffer[a_type, rank, origin], b_buf: NDBuffer[b_type, rank, origin], *, context: DeviceContextPtr = DeviceContextPtr())` --- ## batched_matmul_kernel `batched_matmul_kernel[rank: Int, c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), accum_type: DType = get_accum_type[::DType,::DType]()](c_buff: NDBuffer[c_type, 3, MutableAnyOrigin, c_shape], a_buff: NDBuffer[a_type, 3, MutableAnyOrigin, a_shape], b_buff: NDBuffer[b_type, 3, MutableAnyOrigin, b_shape], c_buff_nd_shape: IndexList[rank])` --- ## batched_matmul_shape `batched_matmul_shape[rank: Int, a_type: DType, b_type: DType, single_thread_blocking_override: Bool](a_buff: NDBuffer[a_type, rank, origin], b_buff: NDBuffer[b_type, rank, origin]) -> IndexList[rank]` Compute the output shape of a 
`batch_matmul` operation, and assert the inputs are compatible. **Parameters:** * ​rank (`Int`): Rank of the input and output tensors. * ​a\_type (`DType`): Type of the lhs input tensor. * ​b\_type (`DType`): Type of the rhs input tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​a\_buff (`NDBuffer[a_type, rank, origin]`): The lhs input tensor. * ​b\_buff (`NDBuffer[b_type, rank, origin]`): The rhs input tensor. **Returns:** The output shape. --- ## bmm ## Aliases ### `elementwise_epilogue_type` `alias elementwise_epilogue_type = fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None` ## Functions * [​`batched_matmul`](./batched_matmul): * [​`batched_matmul_kernel`](./batched_matmul_kernel): * [​`batched_matmul_shape`](./batched_matmul_shape): Compute the output shape of a `batch_matmul` operation, and assert the inputs are compatible. --- ## create_matmul_configs_ampere `create_matmul_configs_ampere[key: String, a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool]() -> MatmulConfig[a_type, b_type, c_type, transpose_b]` --- ## get_dispatch_table `get_dispatch_table[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool]() -> Dict[String, MatmulConfig[a_type, b_type, c_type, transpose_b]]` --- ## dispatch_table_a100_gpu ## Functions * [​`create_matmul_configs_ampere`](./create_matmul_configs_ampere): * [​`get_dispatch_table`](./get_dispatch_table): --- ## distributed_matmul ## Functions * [​`matmul_allreduce`](./matmul_allreduce): Performs C = matmul(A, B^T) followed by Out = allreduce(C) across multiple GPUs. Split the A or B and C matrices into `num_partitions` submatrices at dimension `partition_dim`. This way we can perform `num_partitions` independent matmul + allreduce kernels, and overlap some of the computation. --- ## matmul_allreduce `matmul_allreduce[ngpus: Int, partition_dim: Int, num_partitions: Int, outputs_lambda: fn[Int, DType, Int, Int, Int](IndexList[$2], SIMD[$1, $3]) capturing -> None, type: DType, a_static_shape: DimList, b_static_shape: DimList, c_static_shape: DimList, out_static_shape: DimList, overlap_with_dpl: Bool = True](a_buffers: InlineArray[NDBuffer[type, 2, MutableAnyOrigin, a_static_shape], ngpus], b_buffers: InlineArray[NDBuffer[type, 2, MutableAnyOrigin, b_static_shape], ngpus], c_temp_buffers: InlineArray[NDBuffer[type, 2, MutableAnyOrigin, c_static_shape], ngpus], output_buffers: InlineArray[NDBuffer[type, 2, MutableAnyOrigin, out_static_shape], ngpus], rank_sigs: InlineArray[UnsafePointer[Signal], 8], ctxs: List[DeviceContext])` Performs C = matmul(A, B^T) followed by Out = allreduce(C) across multiple GPUs. Split the A or B and C matrices into `num_partitions` submatrices at dimension `partition_dim`. This way we can perform `num_partitions` independent matmul + allreduce kernels, and overlap some of the computation. 
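To make the split concrete, the hypothetical helper below computes the half-open range that partition `p` covers along `partition_dim`. It illustrates the indexing only and is not taken from the kernel:

```mojo
# Hypothetical helper: the half-open range [start, end) that partition
# `p` of `num_partitions` covers along a dimension of size `dim_size`.
# The final partition absorbs any remainder from an uneven split.
fn partition_range(dim_size: Int, num_partitions: Int, p: Int) -> Tuple[Int, Int]:
    var chunk = dim_size // num_partitions
    var start = p * chunk
    var end = dim_size if p == num_partitions - 1 else start + chunk
    return (start, end)

def main():
    # Splitting C's rows (M = 1000) into 4 matmul + allreduce slices.
    for p in range(4):
        var bounds = partition_range(1000, 4, p)
        print("partition", p, "covers rows", bounds[0], "to", bounds[1])
```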
--- ## config_in_smem `config_in_smem[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool, //, max_smem: Int](config: MatmulConfig[a_type, b_type, c_type, transpose_b]) -> MatmulConfig[a_type, b_type, c_type, transpose_b]` --- ## dual_gemm `dual_gemm[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool, binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1] = swilu[::DType,::Int], config: OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]] = OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]]({:i1 0, 1}), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b0: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], b1: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], ctx: DeviceContext)` --- ## dual_gemv `dual_gemv[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1] = swilu[::DType,::Int], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b0: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], b1: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], ctx: DeviceContext)` --- ## dual_gemv_kernel `dual_gemv_kernel[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, simd_width: UInt, tile_m: UInt, tile_n: UInt, num_threads: UInt, binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b0: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], b1: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape])` --- ## dual_gemm ## Aliases ### `binary_fn_type` `alias binary_fn_type = fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1]` ## Functions * [​`config_in_smem`](./config_in_smem): * [​`dual_gemm`](./dual_gemm): * [​`dual_gemv`](./dual_gemv): * [​`dual_gemv_kernel`](./dual_gemv_kernel): * [​`multistage_dual_gemm`](./multistage_dual_gemm): * [​`multistage_dual_gemm_kernel`](./multistage_dual_gemm_kernel): * [​`multistage_dual_mma`](./multistage_dual_mma): * [​`swilu`](./swilu): * [​`swishGLU`](./swishGLU): Reference: GLU Variants Improve Transformer by Noam Shazeer. The implementation follows CUTLASS, using one kernel invocation and writing to the destination once. 
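For intuition about the `binary_lambda_fn` slot: assuming `swilu(x, y)` computes `silu(x) * y` as in the SwiGLU formulation (an assumption, since this reference leaves `swilu` undocumented), a scalar sketch looks like:

```mojo
from math import exp

# Assumed semantics of `swilu` per the SwiGLU formulation:
# swilu(x, y) = silu(x) * y, where silu(x) = x * sigmoid(x).
# Sketch only; not taken from the kernel source.
fn swilu_reference[
    dtype: DType, width: Int
](x: SIMD[dtype, width], y: SIMD[dtype, width]) -> SIMD[dtype, width]:
    var sigmoid = 1 / (1 + exp(-x))
    return x * sigmoid * y
```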
--- ## multistage_dual_gemm `multistage_dual_gemm[c_type: DType, c_layout: Layout, a_type: DType, a_layout: Layout, b_type: DType, b_layout: Layout, //, transpose_b: Bool, config: MatmulConfig[a_type, b_type, c_type, transpose_b], binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1] = swilu[::DType,::Int], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: LayoutTensor[c_type, c_layout, origin], a: LayoutTensor[a_type, a_layout, origin], b0: LayoutTensor[b_type, b_layout, origin], b1: LayoutTensor[b_type, b_layout, origin], ctx: DeviceContext)` `multistage_dual_gemm[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool, config: MatmulConfig[a_type, b_type, c_type, transpose_b], binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1] = swilu[::DType,::Int], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), num_k_partitions: Int = 1](c: NDBuffer[c_type, 2, origin, c_shape], a: NDBuffer[a_type, 2, origin, a_shape], b0: NDBuffer[b_type, 2, origin, b_shape], b1: NDBuffer[b_type, 2, origin, b_shape], ctx: DeviceContext)` --- ## multistage_dual_gemm_kernel `multistage_dual_gemm_kernel[c_type: DType, c_layout: Layout, a_type: DType, a_layout: Layout, b_type: DType, b_layout: Layout, transpose_b: Bool, config: MatmulConfig[a_type, b_type, c_type, transpose_b], binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: LayoutTensor[c_type, c_layout, MutableAnyOrigin], a: LayoutTensor[a_type, a_layout, MutableAnyOrigin], b0: LayoutTensor[b_type, b_layout, MutableAnyOrigin], b1: LayoutTensor[b_type, b_layout, MutableAnyOrigin])` --- ## multistage_dual_mma `multistage_dual_mma[c_type: DType, c_layout: Layout, a_type: DType, a_layout: Layout, a_smem_layout: Layout, b_type: DType, b_layout: Layout, b_smem_layout: Layout, //, BM: Int, BN: Int, BK: Int, WM: Int, WN: Int, num_threads: Int, num_pipeline_stages: Int, transpose_b: Bool, /, *, swizzle_a: Bool = True, static_num_iters: Dim = Dim(-31337), k_group_size: UInt = UInt(1)](c0: LayoutTensor[c_type, c_layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c1: LayoutTensor[c_type, c_layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], a_iter_arg: LayoutTensorIter[type, a_layout, MutableAnyOrigin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], b0_iter_arg: LayoutTensorIter[b_type, b_layout, MutableAnyOrigin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], b1_iter_arg: LayoutTensorIter[b_type, b_layout, MutableAnyOrigin, 
address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], a_smem_iter_arg: LayoutTensorIter[a_type, a_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], mut b0_smem_iter: LayoutTensorIter[b_type, b_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], mut b1_smem_iter: LayoutTensorIter[b_type, b_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], num_iters: Int, /, *, num_b_rows: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` --- ## swilu `swilu[type: DType, width: Int](x: SIMD[type, width], y: SIMD[type, width]) -> SIMD[type, width]` --- ## swishGLU `swishGLU[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b0: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], b1: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], ctx: DeviceContextPtr)` Reference: GLU Variants Improve Transformer by Noam Shazeer. The implementation follows CUTLASS, using one kernel invocation and writing to the destination once. --- ## FastDiv `@register_passable(trivial)` `struct FastDiv[type: DType]` Implements fast division for a given type. This struct provides optimized division by a constant divisor, replacing the division operation with a series of shifts and multiplications. This approach significantly improves performance, especially in scenarios where division is a frequent operation. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `uint_type` `alias uint_type = _uint_type_of_width[::Int]()` ## Methods ### `__init__` `@implicit` `__init__(divisor: Int = 1) -> Self` Initializes FastDiv with the divisor. **Constraints:** Fails with a constraint error if the bitwidth of the type is > 32. **Args:** * ​divisor (`Int`): The divisor to use for fast division. Defaults to 1. ### `__rtruediv__` `__rtruediv__(self, other: SIMD[_uint_type_of_width[::Int](), 1]) -> SIMD[_uint_type_of_width[::Int](), 1]` Divides the other scalar by the divisor (true division). Uses the fast division algorithm. **Args:** * ​other (`SIMD[_uint_type_of_width[::Int](), 1]`): The dividend. **Returns:** The result of the division. ### `__rmod__` `__rmod__(self, other: SIMD[_uint_type_of_width[::Int](), 1]) -> SIMD[_uint_type_of_width[::Int](), 1]` Computes the remainder of division. **Args:** * ​other (`SIMD[_uint_type_of_width[::Int](), 1]`): The dividend. **Returns:** The remainder. ### `__rdiv__` `__rdiv__(self, other: SIMD[_uint_type_of_width[::Int](), 1]) -> SIMD[_uint_type_of_width[::Int](), 1]` Divides the other scalar by the divisor. **Args:** * ​other (`SIMD[_uint_type_of_width[::Int](), 1]`): The dividend. **Returns:** The result of the division. 
### `__divmod__` `__divmod__(self, other: SIMD[_uint_type_of_width[::Int](), 1]) -> Tuple[SIMD[_uint_type_of_width[::Int](), 1], SIMD[_uint_type_of_width[::Int](), 1]]` Computes both quotient and remainder. **Args:** * ​other (`SIMD[_uint_type_of_width[::Int](), 1]`): The dividend. **Returns:** A tuple containing the quotient and remainder. --- ## fast_div Implements the fast division algorithm. It replaces division by a constant with a sequence of shifts and multiplications, significantly improving division performance. ## Structs * [​`FastDiv`](./FastDiv): Implements fast division for a given type. --- ## block_reduce `block_reduce[type: DType, //, warps_per_block: Int](val: SIMD[type, 1]) -> SIMD[type, 1]` --- ## fp8_quantization ## Functions * [​`block_reduce`](./block_reduce): * [​`matmul_dynamic_scaled_fp8`](./matmul_dynamic_scaled_fp8): * [​`quantize_dynamic_scaled_fp8`](./quantize_dynamic_scaled_fp8): * [​`quantize_fp8_kernel`](./quantize_fp8_kernel): * [​`quantize_static_scaled_fp8`](./quantize_static_scaled_fp8): --- ## matmul_dynamic_scaled_fp8 `matmul_dynamic_scaled_fp8[c_type: DType, a_type: DType, b_type: DType, a_scales_type: DType, b_scales_type: DType, //, transpose_b: Bool = False, config: OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]] = OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]]({:i1 0, 1}), target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](c: NDBuffer[c_type, 2, origin, shape], a: NDBuffer[a_type, 2, origin, shape], b: NDBuffer[b_type, 2, origin, shape], a_scales: NDBuffer[a_scales_type, 2, origin, shape], b_scales: NDBuffer[b_scales_type, 2, origin, shape], ctx: DeviceContext)` --- ## quantize_dynamic_scaled_fp8 `quantize_dynamic_scaled_fp8[out_dtype: DType, in_dtype: DType, scales_dtype: DType, //, group_size_or_per_token: Int](scaled_output: NDBuffer[out_dtype, 2, origin, shape, strides], scales: NDBuffer[scales_dtype, 2, origin, shape, strides], input: NDBuffer[in_dtype, 2, origin, shape, strides], scale_ub: SIMD[float32, 1], ctx: DeviceContext)` --- ## quantize_fp8_kernel `quantize_fp8_kernel[out_type: DType, scales_type: DType, in_type: DType, warps_per_block: Int, group_size: Int](output: NDBuffer[out_type, 2, MutableAnyOrigin], scales: NDBuffer[scales_type, 2, MutableAnyOrigin], input: NDBuffer[in_type, 2, MutableAnyOrigin], scale_ub: SIMD[scales_type, 1])` --- ## quantize_static_scaled_fp8 `quantize_static_scaled_fp8[out_dtype: DType, in_dtype: DType, is_scale_inverted: Bool = True](out_buffer: NDBuffer[out_dtype, 2, origin, shape, strides], in_buffer: NDBuffer[in_dtype, 2, origin, shape, strides], scale: SIMD[float32, 1], context: DeviceContext)` --- ## GEMVAlgorithm `struct GEMVAlgorithm` ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `GEMV_KERNEL` `alias GEMV_KERNEL = GEMVAlgorithm(0)` ### `GEMV_KERNEL_VECTOR` `alias GEMV_KERNEL_VECTOR = GEMVAlgorithm(1)` ### `GEMV_SPLIT_K` `alias GEMV_SPLIT_K = GEMVAlgorithm(2)` ### `GEVM_KERNEL` `alias GEVM_KERNEL = GEMVAlgorithm(4)` ### `GEVM_KERNEL_VECTOR` `alias GEVM_KERNEL_VECTOR = GEMVAlgorithm(3)` ### `MATMUL_NAIVE` `alias MATMUL_NAIVE = GEMVAlgorithm(5)` ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` ### `__is__` `__is__(self, other: Self) -> Bool` ### `__isnot__` `__isnot__(self, other: Self) -> Bool` --- ## gemv `gemv[parallelize: Bool, c_size: Dim, c_type: DType, a_shape: DimList, a_type: DType, b_size: Dim, b_type: DType, 
elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c_buf: NDBuffer[c_type, 1, origin, __init__[::Intable](c_size)], a_buf: NDBuffer[a_type, 2, origin, a_shape], b_buf: NDBuffer[b_type, 1, origin, __init__[::Intable](b_size)])` --- ## gemv_gpu `gemv_gpu[transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], ctx: DeviceContext)` --- ## gemv_gpu_dispatch `gemv_gpu_dispatch[transpose_b: Bool = False, reduction_method: ReductionMethod = ReductionMethod(1), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](kernel_func: GEMVAlgorithm, c: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], ctx: DeviceContext)` --- ## gemv_kernel `gemv_kernel[c_type: DType, a_type: DType, b_type: DType, *, reduction_method: ReductionMethod, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c: UnsafePointer[SIMD[c_type, 1]], a: UnsafePointer[SIMD[a_type, 1]], b: UnsafePointer[SIMD[b_type, 1]], m: Int, n: Int, k: Int)` --- ## gemv_kernel_vector `gemv_kernel_vector[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, *, reduction_method: ReductionMethod, simd_width: UInt, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], m: UInt, n: UInt, k: UInt)` --- ## gemv_split_k `gemv_split_k[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, simd_width: UInt, tile_m: UInt, tile_n: UInt, num_threads: UInt, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](output: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], act: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], weight: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], m: UInt, n: UInt, k: UInt)` GEMV with tiling in K 
dimension. Assuming the B (weight) matrix is transposed, i.e. row major N x K, this kernel implements a vector (1 x K) times a matrix (N x K). The implementation can handle M > 1, but it is only optimal for tiny M; we use it for M = 1 only. --- ## gevm_kernel `gevm_kernel[c_type: DType, a_type: DType, b_type: DType, *, tile_size: Int, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c: UnsafePointer[SIMD[c_type, 1]], a: UnsafePointer[SIMD[a_type, 1]], b: UnsafePointer[SIMD[b_type, 1]], m: Int, n: Int, k: Int)` --- ## gevm_tc_kernel_vector_8x `gevm_tc_kernel_vector_8x[c_type: DType, a_type: DType, b_type: DType, tile_size: Int, simd_width: Int, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c: NDBuffer[c_type, 2, MutableAnyOrigin], a: NDBuffer[a_type, 2, MutableAnyOrigin], b: NDBuffer[b_type, 2, MutableAnyOrigin], m: UInt, n: UInt, k: UInt)` --- ## gemv ## Structs * [​`GEMVAlgorithm`](./GEMVAlgorithm): ## Functions * [​`gemv`](./gemv): * [​`gemv_gpu`](./gemv_gpu): * [​`gemv_gpu_dispatch`](./gemv_gpu_dispatch): * [​`gemv_kernel`](./gemv_kernel): * [​`gemv_kernel_vector`](./gemv_kernel_vector): * [​`gemv_split_k`](./gemv_split_k): GEMV with tiling in K dimension. Assuming the B (weight) matrix is transposed, i.e. row major N x K, this kernel implements a vector (1 x K) times a matrix (N x K). * [​`gevm_kernel`](./gevm_kernel): * [​`gevm_tc_kernel_vector_8x`](./gevm_tc_kernel_vector_8x): * [​`naive_gemv`](./naive_gemv): * [​`reverse_idx`](./reverse_idx): --- ## naive_gemv `naive_gemv[c_size: Dim, a_shape: DimList, b_size: Dim, type: DType](c_buf: NDBuffer[type, 1, origin, __init__[::Intable](c_size)], a_buf: NDBuffer[type, 2, origin, a_shape], b_buf: NDBuffer[type, 1, origin, __init__[::Intable](b_size)])` --- ## reverse_idx `reverse_idx[transpose: Bool](x: Int, y: Int) -> IndexList[2]` --- ## default_config_sm90 `default_config_sm90[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool, wgmma_shape: IndexList[3]]() -> MatmulConfig[a_type, b_type, c_type, transpose_b, wgmma_shape]` --- ## grouped_matmul `grouped_matmul[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b: NDBuffer[b_type, 3, MutableAnyOrigin, b_shape], a_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], expert_ids: NDBuffer[uint32, 1, MutableAnyOrigin], max_num_tokens_per_expert: Int, num_active_experts: Int, ctx: DeviceContext)` --- ## grouped_matmul_kernel `grouped_matmul_kernel[a_type: DType, b_type: DType, c_type: DType, a_layout: Layout, b_layout: Layout, a_tile_layout: Layout, b_tile_layout: Layout, c_layout: Layout, block_tile_shape: IndexList[3], wgmma_shape: IndexList[3], a_desc_layout: Layout, b_desc_layout: Layout, c_desc_layout: Layout, c_smem_layout: Layout, cluster_shape: StaticTuple[SIMD[int32, 1], 3], a_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), b_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), c_swizzle: TensorMapSwizzle =
TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), transpose_b: Bool = True, num_threads: Int = 128, pipeline_stages: Int = 7, use_tma_store: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](a_tma_op: TMATensorTile[a_type, a_tile_layout, a_desc_layout], b_tma_op: TMATensorTile[b_type, b_tile_layout, b_desc_layout], c_tma_op: TMATensorTile[c_type, c_smem_layout, c_desc_layout], a_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], expert_ids: NDBuffer[uint32, 1, MutableAnyOrigin], c: LayoutTensor[c_type, c_layout, MutableAnyOrigin])` --- ## grouped_matmul_sm90 `grouped_matmul_sm90[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool = True, wgmma_shape: IndexList[3] = Index(64, 256, 16), config: MatmulConfig[a_type, b_type, c_type, transpose_b, wgmma_shape] = default_config_sm90[::DType,::DType,::DType,::Bool,::IndexList[::Int(), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], a_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], max_num_tokens_per_expert: Int, b: NDBuffer[b_type, 3, MutableAnyOrigin, b_shape], expert_ids: NDBuffer[uint32, 1, MutableAnyOrigin], num_active_experts: Int, ctx: DeviceContext)` --- ## grouped_matmul ## Aliases ### `NumWarpPerWarpGroup` `alias NumWarpPerWarpGroup = 4` ### `WARP_GROUP_SIZE` `alias WARP_GROUP_SIZE = 128` ## Functions * [​`default_config_sm90`](./default_config_sm90): * [​`grouped_matmul`](./grouped_matmul): * [​`grouped_matmul_kernel`](./grouped_matmul_kernel): * [​`grouped_matmul_sm90`](./grouped_matmul_sm90): * [​`naive_grouped_matmul`](./naive_grouped_matmul): * [​`naive_grouped_matmul_kernel`](./naive_grouped_matmul_kernel): --- ## naive_grouped_matmul `naive_grouped_matmul[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool = True](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b: NDBuffer[b_type, 3, MutableAnyOrigin, b_shape], a_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], expert_ids: NDBuffer[uint32, 1, MutableAnyOrigin], max_num_tokens_per_expert: Int, num_active_experts: Int, ctx: DeviceContext)` --- ## naive_grouped_matmul_kernel `naive_grouped_matmul_kernel[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b: NDBuffer[b_type, 3, MutableAnyOrigin, b_shape], a_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], expert_ids: NDBuffer[uint32, 1, MutableAnyOrigin])` --- ## linalg Provides CPU and GPU implementations of linear algebra functions. ## Modules * [​`accumulate`](./accumulate/): * [​`apple_accelerate`](./apple_accelerate/): * [​`apple_amx_intrinsics`](./apple_amx_intrinsics/): * [​`bmm`](./bmm/): * [​`dispatch_table_a100_gpu`](./dispatch_table_a100_gpu/): * [​`distributed_matmul`](./distributed_matmul/): * [​`dual_gemm`](./dual_gemm/): * [​`fast_div`](./fast_div/): Implements the fast division algorithm. 
* [​`fp8_quantization`](./fp8_quantization/): * [​`gemv`](./gemv/): * [​`grouped_matmul`](./grouped_matmul/): * [​`intel_amx_intrinsics`](./intel_amx_intrinsics/): * [​`matmul`](./matmul/): * [​`matmul_default`](./matmul_default/): * [​`matmul_gpu`](./matmul_gpu/): * [​`matmul_i8mm`](./matmul_i8mm/): * [​`matmul_neon`](./matmul_neon/): * [​`matmul_sm90`](./matmul_sm90/): * [​`matmul_tile_scheduler`](./matmul_tile_scheduler/): * [​`matmul_vendor`](./matmul_vendor/): * [​`matmul_vnni`](./matmul_vnni/): * [​`matrix_band_part`](./matrix_band_part/): The module implements matrix band part functions. * [​`neon_intrinsics`](./neon_intrinsics/): * [​`packing`](./packing/): * [​`qr_factorization`](./qr_factorization/): * [​`transpose`](./transpose/): The module implements Transpose functions. * [​`utils`](./utils/): * [​`utils_gpu`](./utils_gpu/): * [​`vendor_blas`](./vendor_blas/): * [​`vnni_intrinsics`](./vnni_intrinsics/): --- ## intel_amx_intrinsics ## Aliases ### `void` `alias void = invalid` ## Structs * [​`__tile`](./__tile): An AMX tile representation. * [​`tileconfig`](./tileconfig): ## Functions * [​`init_intel_amx`](./init_intel_amx): --- ## init_intel_amx `init_intel_amx() -> Bool` --- ## tileconfig `struct tileconfig` ## Fields * ​palette\_id (`SIMD[uint8, 1]`): * ​start\_row (`SIMD[uint8, 1]`): * ​reserved (`StaticTuple[scalar, 14]`): * ​colb (`StaticTuple[scalar, 16]`): * ​rows (`StaticTuple[scalar, 16]`): ## Implemented traits `AnyType`, `UnknownDestructibility` --- ## InnerMatmulKernel ## Implemented traits `AnyType`, `Copyable`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__inner_matmul__` `__inner_matmul__[kernel_rows: Int, kernel_cols: Int, simd_size: Int](self: _Self, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b_packed: NDBuffer[type, 3, origin, shape], global_offset: GemmShape, global_bound: GemmShape, tile_n_k: IndexList[2], skip_boundary_check: Bool)` --- ## TiledMatmul `struct TiledMatmul[a_mut: Bool, b_mut: Bool, //, config: KernelConfig, transpose_b: Bool, b_packed: Bool, elementwise_epilogue_enabled: Bool, kernel_id: InnerKernelID, a_type: DType, a_shape: DimList, a_origin: Origin[a_mut], b_type: DType, b_shape: DimList, b_origin: Origin[b_mut], c_type: DType, c_shape: DimList, c_origin: MutableOrigin, algorithm: InnerMatmulKernel]` Tiled matmul implementation integrating packing, inner loop and tile partitions. TODO: add tag based implementation dispatch. TODO: add fusion hooks.
## Fields * ​alg (`algorithm`): * ​c (`NDBuffer[c_type, 2, c_origin, c_shape]`): * ​a (`NDBuffer[a_type, 2, a_origin, a_shape]`): * ​b (`NDBuffer[b_type, 2, b_origin, b_shape]`): * ​tile\_n\_k (`IndexList[2]`): * ​global\_tile\_offset (`GemmShape`): * ​global\_tile\_shape (`GemmShape`): * ​b\_tile\_generator (`BTileGenerator[config, a_type, b_type, c_type, b_shape, transpose_b, b_packed, b_origin]`): * ​elementwise\_epilogue\_fn (`fn(GemmShape, GemmShape) escaping -> None`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` --- ## elementwise_epilogue_c_tile `elementwise_epilogue_c_tile[: origin.set, //, simd_width: Int, type: DType, origin: MutableOrigin, c_shape: DimList, func: fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None](offset: GemmShape, tile_len: GemmShape, c: NDBuffer[type, 2, origin, c_shape])` --- ## matmul ## Structs * [​`TiledMatmul`](./TiledMatmul): Tiled matmul implementation integrating packing, inner loop and tile partitions. ## Traits * [​`InnerMatmulKernel`](./InnerMatmulKernel): ## Functions * [​`elementwise_epilogue_c_tile`](./elementwise_epilogue_c_tile): * [​`matmul`](./matmul): * [​`tiled_matmul_run`](./tiled_matmul_run): Interface function to run tiled matmul on a given sub-tile. --- ## matmul `matmul[transpose_a: Bool = False, transpose_b: Bool = False, b_packed: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1}), saturated_vnni: Bool = False, single_thread_blocking_override: Bool = False, _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape], ctx: DeviceContextPtr = DeviceContextPtr())` `matmul[transpose_a: Bool = False, transpose_b: Bool = False, b_packed: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1}), saturated_vnni: Bool = False, single_thread_blocking_override: Bool = False, _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape], ctx: Optional[DeviceContext])` --- ## tiled_matmul_run `tiled_matmul_run[config: KernelConfig, transpose_b: Bool, b_packed: Bool, simd_size: Int, elementwise_epilogue_enabled: Bool, kernel_id: InnerKernelID, algorithm: InnerMatmulKernel](alg: algorithm, c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape], elementwise_epilogue_fn: fn(GemmShape, GemmShape) escaping -> None, global_tile_shape: GemmShape, 
global_tile_offset: GemmShape)` Interface function to run tiled matmul on a given sub-tile. **Args:** * ​alg (`algorithm`): InnerMatmulKernel algorithm for microkernel. * ​c (`NDBuffer[type, 2, origin, shape]`): Pre-allocated buffer space for result. * ​a (`NDBuffer[type, 2, origin, shape]`): Operand A of the matmul. * ​b (`NDBuffer[type, 2, origin, shape]`): Operand B of the matmul. * ​elementwise\_epilogue\_fn (`fn(GemmShape, GemmShape) escaping -> None`): The elementwise epilogue function. * ​global\_tile\_shape (`GemmShape`): Tile shape this call will process. * ​global\_tile\_offset (`GemmShape`): Tile offset on the original buffer. --- ## Inner_matmul_default `struct Inner_matmul_default` ## Implemented traits `AnyType`, `Copyable`, `InnerMatmulKernel`, `Movable`, `UnknownDestructibility` ## Methods ### `__inner_matmul__` `__inner_matmul__[kernel_rows: Int, kernel_cols: Int, simd_size: Int](self, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b_packed: NDBuffer[type, 3, origin, shape], global_offset: GemmShape, global_bound: GemmShape, tile_n_k: IndexList[2], skip_boundary_check: Bool)` Utility function on the inner loop. Run the inner kernel on the whole (kernel\_rows, TileN, TileK) tile. --- ## matmul_default ## Structs * [​`Inner_matmul_default`](./Inner_matmul_default): --- ## AMDSchedulerTuning `@register_passable(trivial)` `struct AMDSchedulerTuning` ## Fields * ​block\_shape (`IndexList[2]`): * ​tuning\_values (`IndexList[3]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` --- ## matmul_gpu ## Structs * [​`AMDSchedulerTuning`](./AMDSchedulerTuning): ## Functions * [​`__nvvm_ldg_f4`](./__nvvm_ldg_f4): * [​`matmul_kernel`](./matmul_kernel): Matrix Multiplication using shared memory. This version loads blocks of size tile\_size x tile\_size from A and B and updates a tile\_size x tile\_size tile in C. The thread block should have shape (tile\_size, tile\_size, 1). Each thread is mapped to one element in C. The grid should have shape (N/tile\_size, M/tile\_size, 1). N is the first dimension for coalesced access. * [​`matmul_kernel_naive`](./matmul_kernel_naive): * [​`multistage_gemm`](./multistage_gemm): * [​`split_k_reduce`](./split_k_reduce): --- ## matmul_kernel `matmul_kernel[c_type: DType, a_type: DType, b_type: DType, tile_size: Int, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c_ptr: UnsafePointer[SIMD[c_type, 1]], a_ptr: UnsafePointer[SIMD[a_type, 1]], b_ptr: UnsafePointer[SIMD[b_type, 1]], m: Int, n: Int, k: Int)` Matrix Multiplication using shared memory. This version loads blocks of size tile\_size x tile\_size from A and B and updates a tile\_size x tile\_size tile in C. The thread block should have shape (tile\_size, tile\_size, 1). Each thread is mapped to one element in C. The grid should have shape (N/tile\_size, M/tile\_size, 1). N is the first dimension for coalesced access.
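To make the launch geometry concrete, here is a minimal sketch of the grid and block shape calculation described above. It is an illustration, not part of the API: `tile_size`, `m`, and `n` are made-up example values, and rounding up with `ceildiv` is an assumption for shapes that are not multiples of the tile size.

```mojo
from math import ceildiv

fn main():
    # Hypothetical launch-shape calculation for the tiled kernel above.
    alias tile_size = 16
    var m = 512   # rows of C
    var n = 1024  # columns of C
    # Block shape is (tile_size, tile_size, 1): one thread per C element.
    var block_x = tile_size
    var block_y = tile_size
    # Grid shape is (N/tile_size, M/tile_size, 1); N comes first so that
    # consecutive blocks walk the contiguous dimension (coalesced access).
    var grid_x = ceildiv(n, tile_size)
    var grid_y = ceildiv(m, tile_size)
    print(grid_x, grid_y, block_x, block_y)  # 64 32 16 16
```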
--- ## matmul_kernel_naive `matmul_kernel_naive[c_type: DType, a_type: DType, b_type: DType, BLOCK_DIM: Int, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c_ptr: UnsafePointer[SIMD[c_type, 1]], a_ptr: UnsafePointer[SIMD[a_type, 1]], b_ptr: UnsafePointer[SIMD[b_type, 1]], m: Int, n: Int, k: Int)` --- ## multistage_gemm `multistage_gemm[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool, config: MatmulConfig[a_type, b_type, c_type, transpose_b], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), serial_reduction: Bool = False](c: NDBuffer[c_type, 2, origin, c_shape], a: NDBuffer[a_type, 2, origin, a_shape], b: NDBuffer[b_type, 2, origin, b_shape], runtime_config: MatmulConfig[a_type, b_type, c_type, transpose_b], ctx: DeviceContext)` --- ## split_k_reduce `split_k_reduce[c_type: DType, work_space_type: DType, c_shape: DimList, work_space_shape: DimList, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, origin, c_shape], work_space: NDBuffer[work_space_type, 3, origin, work_space_shape], ctx: DeviceContext)` --- ## Inner_matmul_i8mm `struct Inner_matmul_i8mm` ## Implemented traits `AnyType`, `Copyable`, `InnerMatmulKernel`, `Movable`, `UnknownDestructibility` ## Methods ### `__inner_matmul__` `__inner_matmul__[kernel_rows: Int, kernel_cols: Int, simd_size: Int](self, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b_packed: NDBuffer[type, 3, origin, shape], global_offset: GemmShape, global_bound: GemmShape, tile_n_k: IndexList[2], skip_boundary_check: Bool)` Utility function on the inner loop. Run the inner kernel on the whole (kernel\_rows2, TileN, TileK) tile. 
--- ## LoadStore_i8mm `struct LoadStore_i8mm[type: DType, simd_size: Int, single_row: Bool, tile_rows: Int, tile_columns: Int]` ## Fields * ​output\_tile (`_Accumulator[type, tile_rows, ceildiv(tile_columns, simd_size), simd_size]`): * ​skip\_boundary\_check (`Bool`): ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `num_simd_cols` `alias num_simd_cols = ceildiv(tile_columns, simd_size)` ## Methods ### `__init__` `@implicit` `__init__(out self, skip_boundary_check: Bool)` --- ## matmul_i8mm ## Structs * [​`Inner_matmul_i8mm`](./Inner_matmul_i8mm): * [​`LoadStore_i8mm`](./LoadStore_i8mm): --- ## Inner_matmul_neon `struct Inner_matmul_neon` ## Implemented traits `AnyType`, `Copyable`, `InnerMatmulKernel`, `Movable`, `UnknownDestructibility` ## Methods ### `__inner_matmul__` `__inner_matmul__[kernel_rows: Int, kernel_cols: Int, simd_size: Int](self, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b_packed: NDBuffer[type, 3, origin, shape], global_offset: GemmShape, global_bound: GemmShape, tile_n_k: IndexList[2], skip_boundary_check: Bool)` Utility function on the inner loop. Run the inner kernel on the whole (kernel\_rows, TileN, TileK) tile.
--- ## matmul_neon ## Structs * [​`Inner_matmul_neon`](./Inner_matmul_neon): --- ## cluster_size `cluster_size[cluster_shape: StaticTuple[SIMD[int32, 1], 3]]() -> SIMD[int32, 1]` --- ## consumer_main_loop `consumer_main_loop[accum_type: DType, a_type: DType, b_type: DType, c_reg_layout: Layout, a_smem_layout: Layout, b_smem_layout: Layout, wgmma_shape: IndexList[3], a_swizzle: TensorMapSwizzle, b_swizzle: TensorMapSwizzle, transpose_b: Bool, pipeline_stages: Int, /, *, num_k_iters: Int, cluster_shape: StaticTuple[SIMD[int32, 1], 3] = StaticTuple(__init__[__mlir_type.!pop.int_literal](1), __init__[__mlir_type.!pop.int_literal](1), __init__[__mlir_type.!pop.int_literal](1)), promotion_frequency: Int = 1, num_consumer: Int = 1](final_c_reg_tile: LayoutTensor[accum_type, c_reg_layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c_reg_tile: LayoutTensor[accum_type, c_reg_layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], a_smem_iter: LayoutTensorIter[a_type, a_smem_layout, origin, address_space=AddressSpace(3), alignment=128, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], b_smem_iter: LayoutTensorIter[b_type, b_smem_layout, origin, address_space=AddressSpace(3), alignment=128, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], mut read_pipeline_states: PipelineState[pipeline_stages], full: UnsafePointer[SharedMemBarrier, address_space=AddressSpace(3), alignment=8], empty: UnsafePointer[SharedMemBarrier, address_space=AddressSpace(3), alignment=8], wgmma_op: TensorCoreAsync[accum_type, a_type, b_type, wgmma_shape, a_swizzle, b_swizzle, transpose_b], local_warp_group_idx: UInt, warp_group_thread_idx: UInt)` --- ## hopper_matmul_tma_wgmma `hopper_matmul_tma_wgmma[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool, wgmma_shape: IndexList[3], block_tile_shape: IndexList[3]](c_device: NDBuffer[c_type, 2, origin, c_shape], a_device: NDBuffer[a_type, 2, origin, a_shape], b_device: NDBuffer[b_type, 2, origin, b_shape], M: Int, N: Int, K: Int, ctx: DeviceContext)` --- ## hopper_matmul_tma_wgmma_kernel `hopper_matmul_tma_wgmma_kernel[a_type: DType, b_type: DType, c_type: DType, a_layout: Layout, b_layout: Layout, a_tile_layout: Layout, b_tile_layout: Layout, c_layout: Layout, block_tile_shape: IndexList[3], wgmma_shape: IndexList[3], a_desc_layout: Layout, b_desc_layout: Layout, transpose_b: Bool = True, promotion_frequency: Int = 1](a_tma_op: TMATensorTile[a_type, a_tile_layout, a_desc_layout], b_tma_op: TMATensorTile[b_type, b_tile_layout, b_desc_layout], c: LayoutTensor[c_type, c_layout, MutableAnyOrigin])` --- ## matmul_sm90 ## Aliases ### `NumWarpPerWarpGroup` `alias NumWarpPerWarpGroup = 4` ### `WARP_GROUP_SIZE` `alias WARP_GROUP_SIZE = 128` ## Functions * [​`cluster_size`](./cluster_size): * [​`consumer_main_loop`](./consumer_main_loop): * [​`hopper_matmul_tma_wgmma`](./hopper_matmul_tma_wgmma): * [​`hopper_matmul_tma_wgmma_kernel`](./hopper_matmul_tma_wgmma_kernel): * [​`producer_main_loop`](./producer_main_loop): * [​`promote_to_cuda_cores`](./promote_to_cuda_cores): * 
[​`tma_wgmma_warp_specialized_gemm_kernel`](./tma_wgmma_warp_specialized_gemm_kernel): * [​`tma_wgmma_warp_specialized_gemm_kernel_persistent`](./tma_wgmma_warp_specialized_gemm_kernel_persistent): * [​`warp_specialize_gemm_with_multicasting`](./warp_specialize_gemm_with_multicasting): * [​`warp_specialized_gemm_output`](./warp_specialized_gemm_output): --- ## producer_main_loop `producer_main_loop[a_type: DType, b_type: DType, a_tile_layout: Layout, b_tile_layout: Layout, a_smem_layout: Layout, b_smem_layout: Layout, a_desc_layout: Layout, b_desc_layout: Layout, pipeline_stages: Int, /, *, num_k_iters: Int, block_tile_shape: IndexList[3], cluster_shape: StaticTuple[SIMD[int32, 1], 3] = StaticTuple(__init__[__mlir_type.!pop.int_literal](1), __init__[__mlir_type.!pop.int_literal](1), __init__[__mlir_type.!pop.int_literal](1)), partitioned_multicast: Bool = False](a_tma_op: TMATensorTile[a_type, a_tile_layout, a_desc_layout], b_tma_op: TMATensorTile[b_type, b_tile_layout, b_desc_layout], a_smem_iter: LayoutTensorIter[a_type, a_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=128, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], b_smem_iter: LayoutTensorIter[b_type, b_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=128, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], m_coord: UInt, n_coord: UInt, rank_n: UInt, rank_m: UInt, mut write_pipeline_states: PipelineState[pipeline_stages], empty_mbar: UnsafePointer[SharedMemBarrier, address_space=AddressSpace(3), alignment=8], full_mbar: UnsafePointer[SharedMemBarrier, address_space=AddressSpace(3), alignment=8])` --- ## promote_to_cuda_cores `promote_to_cuda_cores[accum_type: DType, layout: Layout](c_reg_tile: LayoutTensor[accum_type, layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], final_c_reg_tile: LayoutTensor[accum_type, layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` --- ## tma_wgmma_warp_specialized_gemm_kernel `tma_wgmma_warp_specialized_gemm_kernel[a_type: DType, b_type: DType, c_type: DType, a_layout: Layout, b_layout: Layout, a_tile_layout: Layout, b_tile_layout: Layout, c_layout: Layout, block_tile_shape: IndexList[3], wgmma_shape: IndexList[3], a_desc_layout: Layout, b_desc_layout: Layout, c_desc_layout: Layout, c_tma_layout: Layout, c_smem_layout: Layout, cluster_shape: StaticTuple[SIMD[int32, 1], 3], a_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), b_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), c_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), transpose_b: Bool = True, num_threads: Int = 128, pipeline_stages: Int = 7, partitioned_multicast: Bool = False, use_tma_store: Bool = False, promotion_frequency: Int = 1, pdl_level: PDLLevel = PDLLevel(), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = 
OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1}), hilbert_swizzle: Bool = False](a_tma_op: TMATensorTile[a_type, a_tile_layout, a_desc_layout], b_tma_op: TMATensorTile[b_type, b_tile_layout, b_desc_layout], c_tma_op: TMATensorTile[c_type, c_tma_layout, c_desc_layout], c: LayoutTensor[c_type, c_layout, MutableAnyOrigin], lut_ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(1)] = UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(1)](0))` --- ## tma_wgmma_warp_specialized_gemm_kernel_persistent `tma_wgmma_warp_specialized_gemm_kernel_persistent[a_type: DType, b_type: DType, c_type: DType, a_layout: Layout, b_layout: Layout, a_tile_layout: Layout, b_tile_layout: Layout, c_layout: Layout, block_tile_shape: IndexList[3], wgmma_shape: IndexList[3], a_desc_layout: Layout, b_desc_layout: Layout, c_desc_layout: Layout, c_tma_layout: Layout, c_smem_layout: Layout, cluster_shape: StaticTuple[SIMD[int32, 1], 3], grid_shape: IndexList[2], schedule: MatmulSchedule, a_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), b_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), c_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), transpose_b: Bool = True, num_threads: Int = 128, pipeline_stages: Int = 7, partitioned_multicast: Bool = False, use_tma_store: Bool = False, promotion_frequency: Int = 1, pdl_level: PDLLevel = PDLLevel(), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1})](a_tma_op: TMATensorTile[a_type, a_tile_layout, a_desc_layout], b_tma_op: TMATensorTile[b_type, b_tile_layout, b_desc_layout], c_tma_op: TMATensorTile[c_type, c_tma_layout, c_desc_layout], c: LayoutTensor[c_type, c_layout, MutableAnyOrigin], problem_shape: IndexList[3])` --- ## warp_specialize_gemm_with_multicasting `warp_specialize_gemm_with_multicasting[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool, wgmma_shape: IndexList[3], config: MatmulConfig[a_type, b_type, c_type, transpose_b, wgmma_shape], grid_shape: OptionalReg[IndexList[2]] = OptionalReg[IndexList[2]]({:i1 0, 1}), use_tma_store: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1}), schedule: MatmulSchedule = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](-1)), hilbert_swizzle: Bool = False](c_device: NDBuffer[c_type, 2, origin, c_shape], a_device: NDBuffer[a_type, 2, origin, a_shape], b_device: NDBuffer[b_type, 2, origin, b_shape], M: Int, N: Int, K: Int, ctx: DeviceContext)` --- ## warp_specialized_gemm_output `warp_specialized_gemm_output[c_type: DType, accum_type: DType, c_layout: Layout, c_smem_layout: Layout, c_tma_layout: Layout, c_reg_layout: Layout, c_desc_layout: Layout, /, 
*, c_tile_shape: IndexList[2], c_swizzle: TensorMapSwizzle, wgmma_shape: IndexList[3], num_consumer: Int = 1, use_tma_store: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1})](c_tma_op: TMATensorTile[c_type, c_tma_layout, c_desc_layout], c: LayoutTensor[c_type, c_layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c_smem_tile: LayoutTensor[c_type, c_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=128], c_reg_tile: LayoutTensor[accum_type, c_reg_layout, MutableAnyOrigin, address_space=AddressSpace(5)], warp_group_thread_idx: UInt, local_warp_group_idx: UInt, local_thread_idx: UInt, block_y: Int, block_x: Int)` --- ## MatmulSchedule `@register_passable(trivial)` `struct MatmulSchedule` ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `DS_SCHEDULER` `alias DS_SCHEDULER = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](2))` ### `NONE` `alias NONE = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](-1))` ### `TILE1D` `alias TILE1D = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](0))` ### `TILE2D` `alias TILE2D = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](1))` ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` --- ## TileScheduler `@register_passable(trivial)` `struct TileScheduler[problem_shape: IndexList[3], tile_shape: IndexList[3], grid_shape: IndexList[2], cluster: IndexList[3] = Index(1, 1, 1), raster_dim: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](1), schedule: MatmulSchedule = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](1))]` ## Fields * ​idx (`SIMD[uint32, 1]`): * ​prob\_shape (`IndexList[3]`): * ​num\_waves\_m (`SIMD[uint32, 1]`): * ​num\_waves\_n (`SIMD[uint32, 1]`): * ​log\_num\_waves\_n (`FastDiv[uint32]`): * ​current\_iter (`Int`): * ​num\_aligned\_m\_blocks (`SIMD[uint32, 1]`): * ​num\_blocks (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `kNum1DBlocksPerGroup` `alias kNum1DBlocksPerGroup = __init__[__mlir_type.!pop.int_literal](16)` ### `kNumNBlocks` `alias kNumNBlocks = SIMD(ceildiv[::CeilDivable](problem_shape.__getitem__[::Indexer](1), tile_shape.__getitem__[::Indexer](1)))` ### `num_grids` `alias num_grids = SIMD((grid_shape.__getitem__[::Indexer](0) * grid_shape.__getitem__[::Indexer](1)))` ### `wave_shape` `alias wave_shape = Index((grid_shape.__getitem__[::Indexer](1) * tile_shape.__getitem__[::Indexer](0)), (grid_shape.__getitem__[::Indexer](0) * tile_shape.__getitem__[::Indexer](1)))` ## Methods ### `__init__` `__init__(prob_shape: IndexList[3]) -> Self` ### `get_current_work_info` `get_current_work_info(mut self) -> WorkInfo` ### `advance` `advance(mut self)` ### `fetch_next_work` `fetch_next_work(mut self) -> WorkInfo` ### `num_output_tiles` `num_output_tiles(self) -> UInt` ### `fetch_next_work_ds` `fetch_next_work_ds(mut self) -> WorkInfo` --- ## WorkInfo `@register_passable(trivial)` `struct WorkInfo` ## Fields * 
​m (`SIMD[uint32, 1]`): * ​n (`SIMD[uint32, 1]`): * ​k\_start (`SIMD[uint32, 1]`): * ​num\_k\_tiles (`SIMD[uint32, 1]`): * ​is\_valid\_tile (`Bool`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `is_valid` `is_valid(self) -> Bool` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` --- ## matmul_tile_scheduler ## Structs * [​`MatmulSchedule`](./MatmulSchedule): * [​`TileScheduler`](./TileScheduler): * [​`WorkInfo`](./WorkInfo): --- ## matmul_vendor ## Functions * [​`matmul`](./matmul): This implements the matmul kernel for the Blackwell architecture. Note that we do not currently have pure Mojo kernels that target the Blackwell architecture, so in their place we call the cuBLAS library. --- ## matmul `matmul[c_type: DType, a_type: DType, b_type: DType, //, use_tensor_core: Bool = False, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), config: OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]] = OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]]({:i1 0, 1}), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](c: NDBuffer[c_type, 2, origin, shape], a: NDBuffer[a_type, 2, origin, shape], b: NDBuffer[b_type, 2, origin, shape], ctx: DeviceContext)` This implements the matmul kernel for the Blackwell architecture. Note that we do not currently have pure Mojo kernels that target the Blackwell architecture, so in their place we call the cuBLAS library. --- ## Inner_matmul_vnni `struct Inner_matmul_vnni[saturated_vnni: Bool]` ## Implemented traits `AnyType`, `Copyable`, `InnerMatmulKernel`, `Movable`, `UnknownDestructibility` ## Methods ### `__inner_matmul__` `__inner_matmul__[kernel_rows: Int, kernel_cols: Int, simd_size: Int](self, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b_packed: NDBuffer[type, 3, origin, shape], global_offset: GemmShape, global_bound: GemmShape, tile_n_k: IndexList[2], skip_boundary_check: Bool)` Utility function on the inner loop. Run the inner kernel on the whole (kernel\_rows, TileN, TileK) tile. --- ## matmul_vnni ## Structs * [​`Inner_matmul_vnni`](./Inner_matmul_vnni): --- ## matrix_band_part The module implements matrix band part functions.
## Functions * [​`matrix_band_part`](./matrix_band_part): --- ## matrix_band_part `matrix_band_part[: origin.set, //, type: DType, int_type: DType, cond_type: DType, rank: Int, input_0_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], simd_width: Int, single_thread_blocking_override: Bool, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[rank], num_lower: NDBuffer[int_type, 1, origin], num_upper: NDBuffer[int_type, 1, origin], exclude_buf: NDBuffer[cond_type, 1, origin], output: NDBuffer[type, rank, origin], ctx: DeviceContextPtr)` --- ## neon_intrinsics --- ## BTileGenerator `struct BTileGenerator[mut: Bool, //, config: KernelConfig, a_type: DType, b_type: DType, c_type: DType, shape: DimList, transpose_b: Bool, b_packed: Bool, origin: Origin[mut]]` Struct to encapsulate a tile of B that supports prepacking. If b\_packed is true, calls to get\_tile will return a buffer view from B. Otherwise, calls to get\_tile will copy a tile from B into a stack allocated scratch buffer and return a view of that. ## Fields * ​b (`NDBuffer[b_type, 2, origin, shape]`): * ​b\_tile\_stack\_ptr (`UnsafePointer[SIMD[b_type, 1]]`): * ​tile\_n\_k (`IndexList[2]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `get` `static get(b: NDBuffer[b_type, 2, origin, shape], tile_n_k: IndexList[2]) -> Self` ### `get_tile` `get_tile[inner_size: Int](self, global_offset: GemmShape, tile_dim_nk: IndexList[2], valid_data_dim_nk: IndexList[2]) -> NDBuffer[b_type, 3, MutableAnyOrigin, config.packed_shape]` Get a packed matrix (B) tile. valid\_data\_dim\_nk is ignored for pre-packing, where the tile is padded to have the shape of tile\_dim\_nk. **Args:** * ​global\_offset (`GemmShape`): Offset in the global M, N, K dimensions. * ​tile\_dim\_nk (`IndexList[2]`): Tile shape based on cache size and matrix dimensions. * ​valid\_data\_dim\_nk (`IndexList[2]`): The upper bounds for N and K dimensions. **Returns:** A view of the packed tile. --- ## PackMatrixCols `struct PackMatrixCols[original_mut: Bool, //, original_shape: DimList, packed_shape: DimList, type: DType, simd_size: Int, column_inner_size: Int, use_vnni: Bool, use_i8mm: Bool, packed_origin: MutableOrigin, original_origin: Origin[original_mut]]` Pack columns from a matrix into the mlas packed layout and extract inner vectors of columns into the packed inner dimension, e.g. extracts \[X, Y] and packs as \[Yo]\[X]\[Yi]. ## Fields * ​packed\_matrix (`NDBuffer[type, 3, packed_origin, packed_shape]`): * ​original\_matrix (`NDBuffer[type, 2, original_origin, original_shape]`): * ​global\_offset (`IndexList[2]`): * ​pack\_tile\_dim (`IndexList[2]`): * ​valid\_data\_dim (`IndexList[2]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `run` `static run(packed_matrix: NDBuffer[type, 3, MutableAnyOrigin, packed_shape], original_matrix: NDBuffer[type, 2, MutableAnyOrigin, original_shape], global_offset: IndexList[2], pack_tile_dim: IndexList[2], valid_data_dim: IndexList[2])` Interface function to run the packing routine. **Args:** * ​packed\_matrix (`NDBuffer`): Pre-allocated buffer space for packed data. * ​original\_matrix (`NDBuffer`): Data buffer containing the original matrix to pack. * ​global\_offset (`IndexList`): Offset to use when indexing the original matrix. * ​pack\_tile\_dim (`IndexList`): 2D dimension tuple describing the size of the packed tile. * ​valid\_data\_dim (`IndexList`): 2D dimension tuple describing the amount of valid data on the global buffer starting from the offset. --- ## PackMatrixRows `struct PackMatrixRows[original_mut: Bool, //, original_shape: DimList, packed_shape: DimList, type: DType, simd_size: Int, row_inner_size: Int, packed_origin: MutableOrigin, original_origin: Origin[original_mut]]` Pack rows from a matrix into the mlas packed layout and extract inner vectors of rows into the packed inner dimension, e.g. extract tile \[X, Y] and pack into \[Xo]\[Y]\[Xi]. ## Fields * ​packed\_matrix (`NDBuffer[type, 3, packed_origin, packed_shape]`): * ​original\_matrix (`NDBuffer[type, 2, original_origin, original_shape]`): * ​global\_offset (`IndexList[2]`): * ​pack\_tile\_dim (`IndexList[2]`): * ​valid\_data\_dim (`IndexList[2]`): * ​valid\_simd\_dim (`IndexList[2]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `run` `static run(packed_matrix: NDBuffer[type, 3, packed_origin, packed_shape], original_matrix: NDBuffer[type, 2, original_origin, original_shape], global_offset: IndexList[2], pack_tile_dim: IndexList[2], valid_data_dim: IndexList[2])` Interface function to run the packing routine. **Args:** * ​packed\_matrix (`NDBuffer`): Pre-allocated buffer space for packed data. * ​original\_matrix (`NDBuffer`): Data buffer containing the original matrix to pack. * ​global\_offset (`IndexList`): Offset to use when indexing the original matrix. * ​pack\_tile\_dim (`IndexList`): 2D dimension tuple describing the size of the packed tile. * ​valid\_data\_dim (`IndexList`): 2D dimension tuple describing the amount of valid data on the global buffer starting from the offset. --- ## packing ## Structs * [​`BTileGenerator`](./BTileGenerator): Struct to encapsulate a tile of B that supports prepacking. * [​`PackMatrixCols`](./PackMatrixCols): Pack columns from a matrix into the mlas packed layout and extract inner vectors of columns into the packed inner dimension, e.g. extracts \[X, Y] and packs as \[Yo]\[X]\[Yi]. * [​`PackMatrixRows`](./PackMatrixRows): Pack rows from a matrix into the mlas packed layout and extract inner vectors of rows into the packed inner dimension, e.g. extract tile \[X, Y] and pack into \[Xo]\[Y]\[Xi]. ## Functions * [​`pack_b`](./pack_b): Utility function to pack the entire B matrix, such that each \[tile\_n // inner\_size, tile\_k, inner\_size] tile of src is contiguous in dst. * [​`pack_b_ndbuffer`](./pack_b_ndbuffer): * [​`pack_matmul_b_shape_func`](./pack_matmul_b_shape_func): * [​`pack_transposed_b_ndbuffer`](./pack_transposed_b_ndbuffer): --- ## pack_b `pack_b[transpose_b: Bool, simd_size: Int, inner_size: Int, a_type: DType, b_type: DType, c_type: DType, src_shape: DimList, dst_shape: DimList](dst: NDBuffer[b_type, 2, origin, dst_shape], src: NDBuffer[b_type, 2, origin, src_shape], tile_n: Int, tile_k: Int)` Utility function to pack the entire B matrix, such that each \[tile\_n // inner\_size, tile\_k, inner\_size] tile of src is contiguous in dst. Tiles (not tile contents) are stored in row major order, so tile\[i, j] is tile\_n \* tile\_k bytes away from tile\[i, j+1].
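As a concrete reading of this layout, the sketch below computes where element `(n, k)` of `src` lands in `dst`. It is an illustration, not the library's packing code: `packed_offset` and `tiles_per_row` are hypothetical names, and it assumes tiles are visited in row-major order with the `n` dimension blocked by `inner_size` inside each tile, as described above.

```mojo
fn packed_offset(n: Int, k: Int, tile_n: Int, tile_k: Int,
                 inner_size: Int, tiles_per_row: Int) -> Int:
    # Which tile (i, j) the element falls in; tiles are row major, so
    # consecutive tiles are one tile footprint (tile_n * tile_k) apart.
    var tile_base = ((n // tile_n) * tiles_per_row + k // tile_k) * (tile_n * tile_k)
    # Position inside the [tile_n // inner_size, tile_k, inner_size] tile.
    var ni = n % tile_n
    var ki = k % tile_k
    return tile_base + (ni // inner_size) * (tile_k * inner_size) + ki * inner_size + (ni % inner_size)

fn main():
    # Example: a 4 x 8 tile with inner_size = 2; element (1, 3) of the
    # first tile maps to offset 3 * 2 + 1 = 7 within its [2, 8, 2] view.
    print(packed_offset(1, 3, 4, 8, 2, 1))  # 7
```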
--- ## pack_b_ndbuffer `pack_b_ndbuffer[b_mut: Bool, //, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, c_type: DType, c_shape: DimList, b_origin: Origin[b_mut], output_origin: MutableOrigin](b_input: NDBuffer[b_type, 2, b_origin, b_shape], output_buffer: NDBuffer[b_type, 2, output_origin])` --- ## pack_matmul_b_shape_func `pack_matmul_b_shape_func[a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, c_type: DType, c_shape: DimList, transpose_in_0: Bool, single_thread_blocking_override: Bool](b_input: NDBuffer[b_type, 2, origin, b_shape]) -> IndexList[2]` --- ## pack_transposed_b_ndbuffer `pack_transposed_b_ndbuffer[a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, c_type: DType, c_shape: DimList](b_input: NDBuffer[b_type, 2, origin, b_shape], output_buffer: NDBuffer[b_type, 2, origin])` --- ## apply_q `apply_q[dtype: DType, element_layout: Layout](sigma: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], A: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], X: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Applies the implicit Q factor stored in `A` and `sigma` after calling `qr_factorization` to the `X` matrix. See `qr_factorization` for more details on the construction of the Householder reflector. --- ## form_q `form_q[dtype: DType, element_layout: Layout](sigma: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], A: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], Q: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Forms the Q factor from the implicit Q factor stored in `A` and `sigma` after calling `qr_factorization` and stores the result in `Q`. --- ## qr_factorization ## Functions * [​`apply_q`](./apply_q): Applies the implicit Q factor stored in `A` and `sigma` after calling `qr_factorization` to the `X` matrix. * [​`form_q`](./form_q): Forms the Q factor from the implicit Q factor stored in `A` and `sigma` after calling `qr_factorization` and stores the result in `Q`. * [​`qr_factorization`](./qr_factorization): Performs QR factorization of a matrix `A` using the Householder reflector method. --- ## qr_factorization `qr_factorization[dtype: DType, element_layout: Layout](sigma: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], A: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Performs QR factorization of a matrix `A` using the Householder reflector method. 
This function computes the QR factorization of matrix `A` in-place using Householder reflections. The result is stored directly in the input matrix `A`, with scaling factors in `sigma`. The implementation follows the LAPACK algorithm for generating Householder reflectors in-place. Algorithm: The Householder reflector is defined as

```
U = I - σww^H
```

where

```
w = (x + νe₁)/ξ
σ = ξ/ν
ξ = x₀ + ν
ν = sign(x₀)‖x‖₂
```

This ensures that U^H x = -νe₁ and U^H U = I. References: \[1] Lehoucq, R. B. (1996). The computation of elementary unitary matrices. ACM Transactions on Mathematical Software, 22(4), 393-400. Note: There is a typo in reference \[lawn72]. The correct result is U^H x = -νe₁. --- ## transpose The module implements Transpose functions. ## Functions * [​`transpose`](./transpose): Permute the axes of `input` based on `perms`, and place the result in `output`. * [​`transpose_2d`](./transpose_2d): * [​`transpose_3d_swap_inner`](./transpose_3d_swap_inner): * [​`transpose_3d_swap_outer`](./transpose_3d_swap_outer): * [​`transpose_4d_swap_middle`](./transpose_4d_swap_middle): * [​`transpose_inplace`](./transpose_inplace): * [​`transpose_strided`](./transpose_strided): * [​`transpose_trivial_memcpy`](./transpose_trivial_memcpy): --- ## transpose `transpose[rank: Int, type: DType, //](output: NDBuffer[type, rank, origin, shape], input: NDBuffer[type, rank, origin, shape], perms: UnsafePointer[SIMD[index, 1]])` Permute the axes of `input` based on `perms`, and place the result in `output`. Example:

```mojo
transpose(output, input, [2, 0, 1]) # guarantees output[x, y, z] = input[z, x, y]
```

**Parameters:** * ​rank (`Int`): The rank of input and output buffers. * ​type (`DType`): The dtype of buffer elements. **Args:** * ​output (`NDBuffer[type, rank, origin, shape]`): The output buffer. * ​input (`NDBuffer[type, rank, origin, shape]`): The input buffer. * ​perms (`UnsafePointer[SIMD[index, 1]]`): Permutation of the input axes.
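To spell out the guarantee above with plain index arithmetic, here is a self-contained sketch of the rank-2 case, `perms = [1, 0]`. It uses flat `List`s rather than the `NDBuffer` API, purely to illustrate the index mapping:

```mojo
fn main():
    # A 2 x 3 row-major source; src[y, x] lives at src[y * 3 + x].
    var src = List[Int](0, 1, 2, 3, 4, 5)
    # The transposed destination is 3 x 2; dst[x, y] lives at dst[x * 2 + y].
    var dst = List[Int](0, 0, 0, 0, 0, 0)
    for x in range(3):
        for y in range(2):
            # perms = [1, 0] guarantees dst[x, y] = src[y, x].
            dst[x * 2 + y] = src[y * 3 + x]
    print(dst[1], "==", src[3])  # dst[0, 1] == src[1, 0] == 3
```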
--- ## transpose_2d `transpose_2d[rank: Int, output_shape: DimList, input_shape: DimList, type: DType](output: NDBuffer[type, rank, origin, output_shape], input: NDBuffer[type, rank, origin, input_shape], perms: UnsafePointer[SIMD[index, 1]], simplified_input_shape: IndexList[rank], simplified_rank: Int, offset: Int)` --- ## transpose_3d_swap_inner `transpose_3d_swap_inner[rank: Int, type: DType, //](output: NDBuffer[type, rank, origin, shape], input: NDBuffer[type, rank, origin, shape], perms: UnsafePointer[SIMD[index, 1]], simplified_input_shape: IndexList[rank], simplified_rank: Int)` --- ## transpose_3d_swap_outer `transpose_3d_swap_outer[rank: Int, output_shape: DimList, input_shape: DimList, type: DType](output: NDBuffer[type, rank, origin, output_shape], input: NDBuffer[type, rank, origin, input_shape], perms: UnsafePointer[SIMD[index, 1]], simplified_input_shape: IndexList[rank], simplified_rank: Int)` --- ## transpose_4d_swap_middle `transpose_4d_swap_middle[rank: Int, type: DType, //](output: NDBuffer[type, rank, origin, shape], input: NDBuffer[type, rank, origin, shape, strides], perms: UnsafePointer[SIMD[index, 1]], simplified_input_shape: IndexList[rank], simplified_rank: Int)` --- ## transpose_inplace `transpose_inplace[rows: Int, cols: Int, type: DType](buf: NDBuffer[type, 2, origin, __init__[::Indexer,::Indexer](rows, cols)])` --- ## transpose_strided `transpose_strided[rank: Int, type: DType, //](output: NDBuffer[type, rank, origin, shape], input: NDBuffer[type, rank, origin, shape], perms: UnsafePointer[SIMD[index, 1]])` --- ## transpose_trivial_memcpy `transpose_trivial_memcpy[rank: Int, output_shape: DimList, input_shape: DimList, type: DType](output: NDBuffer[type, rank, origin, output_shape], input: NDBuffer[type, rank, origin, input_shape])` --- ## GemmShape `@register_passable(trivial)` `struct GemmShape` Helper class to unpack gemm dimension and layout. ## Fields * ​M (`Int`): * ​N (`Int`): * ​K (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(index: IndexList[3]) -> Self` Constructor of a gemm shape record from an index tuple. **Args:** * ​index (`IndexList[3]`): The int tuple containing the index (m, n, k). ### `__getitem__` `__getitem__(self, idx: Int) -> Int` ### `__setitem__` `__setitem__(mut self, idx: Int, value: Int)` ### `__add__` `__add__(self, rhs: Self) -> Self` Coordinate-wise addition of two gemm shape records. **Args:** * ​rhs (`Self`): Another gemm shape record to add with. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Coordinate-wise subtraction of two gemm shape records. **Args:** * ​rhs (`Self`): Another gemm shape record to subtract with. ### `get` `static get[transpose_b: Bool](c: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> Self` Constructor of a gemm shape record from input buffers. M, N, and K are intentionally calculated using `a` and `c` ONLY. This is because `b` may be padded to a multiple of the tile size if it has been pre-packed. **Args:** * ​c (`NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): NDBuffer with allocated output space.
* ​a (`NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): NDBuffer containing matrix operand A. * ​b (`NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): NDBuffer containing matrix operand B. ### `as_index` `as_index(self) -> IndexList[3]` Utility to convert the underlying data to an index tuple, so that utilities such as elementwise add can be used. **Returns:** The constructed index tuple. --- ## InnerKernelID `@register_passable(trivial)` `struct InnerKernelID` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `DEFAULT` `alias DEFAULT = InnerKernelID(0)` ### `I8MM` `alias I8MM = InnerKernelID(3)` ### `NEON` `alias NEON = InnerKernelID(2)` ### `VNNI` `alias VNNI = InnerKernelID(1)` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` --- ## KernelConfig `struct KernelConfig` Static configuration of the matmul inner kernel. ## Fields * ​kernel\_rows (`Int`): * ​kernel\_cols (`Int`): * ​simd\_size (`Int`): * ​packed\_shape (`DimList`): ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, *, kernel_rows: Int, kernel_cols: Int, simd_size: Int, packed_shape: DimList)` --- ## MicroKernelShape `@register_passable(trivial)` `struct MicroKernelShape` Record describing the inner kernel shape. ## Fields * ​simd\_rows (`Int`): * ​simd\_cols (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(rows: Int, cols: Int) -> Self` --- ## SubMatmulConfig `struct SubMatmulConfig` Static configuration of sub-matrices in parallel matmul. ## Fields * ​offset (`IndexList[3]`): * ​shape (`IndexList[3]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `is_valid` `is_valid(self) -> Bool` --- ## apply_epilogue `apply_epilogue[elementwise_lambda: fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None, dst_layout: Layout, dst_element_layout: Layout = __init__[::Origin[::Bool(IntTuple(1), IntTuple(1))](src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], offset: Int)` --- ## calculate_tile_n_k `calculate_tile_n_k[a_type: DType, b_type: DType, c_type: DType, kernel_cols: Int](n: Int, k: Int) -> IndexList[2]` Helper heuristic function to decide on tile size to partition the matmul given the cache size and desired data layout. **Parameters:** * ​a\_type (`DType`): The type of the A tensor. * ​b\_type (`DType`): The type of the B tensor. * ​c\_type (`DType`): The type of the C tensor. * ​kernel\_cols (`Int`): The number of columns of the micro kernel. **Returns:** The calculated tile size to partition the matmul as (TileN, TileK).
`calculate_tile_n_k[a_type: DType, b_type: DType, c_type: DType, kernel_cols: Int](global_tile_shape: GemmShape) -> IndexList[2]` --- ## dispatch_get_kernel_type `dispatch_get_kernel_type[: origin.set, //, func: fn[Bool]() raises capturing -> None](m: Int, n: Int, k: Int)` `dispatch_get_kernel_type[: origin.set, //, func: fn[Bool]() capturing -> None](m: Int, n: Int, k: Int)` --- ## get_kernel_config `get_kernel_config[a_type: DType, b_type: DType, c_type: DType, *, kernel_type: Bool = False]() -> KernelConfig` Utility function to extract matmul configuration parameters for exported functions. TODO: Add target-dependent configuration parameters. --- ## get_kernel_type `get_kernel_type(m: Int, n: Int, k: Int) -> Bool` --- ## get_matmul_arch_factor `get_matmul_arch_factor[use_vnni: Bool, use_i8mm: Bool]() -> Int` --- ## get_matmul_kernel_shape `get_matmul_kernel_shape[a_type: DType, b_type: DType, c_type: DType, kernel_type: Bool]() -> MicroKernelShape` --- ## get_matmul_kernel_shape_ARM `get_matmul_kernel_shape_ARM[a_type: DType, b_type: DType, c_type: DType, kernel_type: Bool]() -> MicroKernelShape` --- ## get_matmul_kernel_shape_x86 `get_matmul_kernel_shape_x86[kernel_type: Bool]() -> MicroKernelShape` --- ## get_matmul_num_tasks `get_matmul_num_tasks[a_type: DType, b_type: DType, c_type: DType, simd_size: Int, kernel_type: Bool](m: Int, n: Int, k: Int, max_num_tasks: Int) -> Int` Compute the number of tasks for parallel matmul. The max number of tasks is typically the number of threads/cores. --- ## get_matmul_prefetch_b_distance_k `get_matmul_prefetch_b_distance_k() -> Int` --- ## get_min_task_size `get_min_task_size() -> Int` --- ## get_packB_unroll_factor `get_packB_unroll_factor() -> Int` --- ## get_pack_data_size `get_pack_data_size[type: DType]() -> Int` Utility to compute the number of elements to pack in each tile. **Returns:** The number of elements to pack. --- ## get_partitioned_matmul `get_partitioned_matmul[a_type: DType, b_type: DType, c_type: DType, kernel_rows: Int, kernel_cols: Int](m: Int, n: Int, k: Int, task_id: Int, num_tasks: Int) -> SubMatmulConfig` --- ## get_partitioned_matmul_mojo `get_partitioned_matmul_mojo[b_type: DType, kernel_rows: Int, kernel_cols: Int, use_i8mm: Bool = False](m: Int, n: Int, k: Int, task_id: Int, num_tasks: Int) -> SubMatmulConfig` --- ## get_partitioned_matmul_mojo_shape `get_partitioned_matmul_mojo_shape[b_type: DType, kernel_rows: Int, kernel_cols: Int, use_i8mm: Bool](m: Int, n: Int, k: Int, num_tasks: Int) -> IndexList[2]` --- ## utils ## Aliases ### `elementwise_compute_lambda_type` `alias elementwise_compute_lambda_type = fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]` ### `elementwise_epilogue_type` `alias elementwise_epilogue_type = fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None` ## Structs * [​`GemmShape`](./GemmShape): Helper class to unpack gemm dimensions and layout. * [​`InnerKernelID`](./InnerKernelID): * [​`KernelConfig`](./KernelConfig): Static configuration of the matmul inner kernel. * [​`MicroKernelShape`](./MicroKernelShape): Record describing the inner kernel shape. * [​`SubMatmulConfig`](./SubMatmulConfig): Static configuration of sub-matrices in parallel matmul. ## Functions * [​`apply_epilogue`](./apply_epilogue): * [​`calculate_tile_n_k`](./calculate_tile_n_k): Helper heuristic function to decide on tile size to partition the matmul given the cache size and desired data layout. 
* [​`dispatch_get_kernel_type`](./dispatch_get_kernel_type): * [​`get_kernel_config`](./get_kernel_config): Utility function to extract matmul configuration parameters for exported functions. * [​`get_kernel_type`](./get_kernel_type): * [​`get_matmul_arch_factor`](./get_matmul_arch_factor): * [​`get_matmul_kernel_shape`](./get_matmul_kernel_shape): * [​`get_matmul_kernel_shape_ARM`](./get_matmul_kernel_shape_ARM): * [​`get_matmul_kernel_shape_x86`](./get_matmul_kernel_shape_x86): * [​`get_matmul_num_tasks`](./get_matmul_num_tasks): Compute the number of tasks for parallel matmul. The max number of tasks is typically the number of threads/cores. * [​`get_matmul_prefetch_b_distance_k`](./get_matmul_prefetch_b_distance_k): * [​`get_min_task_size`](./get_min_task_size): * [​`get_pack_data_size`](./get_pack_data_size): Utility to compute the number of elements to pack in each tile. * [​`get_packB_unroll_factor`](./get_packB_unroll_factor): * [​`get_partitioned_matmul`](./get_partitioned_matmul): * [​`get_partitioned_matmul_mojo`](./get_partitioned_matmul_mojo): * [​`get_partitioned_matmul_mojo_shape`](./get_partitioned_matmul_mojo_shape): * [​`packA_i8mm`](./packA_i8mm): * [​`partition_work`](./partition_work): * [​`select_inner_kernel`](./select_inner_kernel): * [​`use_i8mm_fn`](./use_i8mm_fn): * [​`use_vnni_fn`](./use_vnni_fn): --- ## packA_i8mm `packA_i8mm[a_type: DType](t0: Int, t1: Int, k: Int, a_ptr: UnsafePointer[SIMD[a_type, 1]], a_packed_ptr: UnsafePointer[SIMD[a_type, 1]])` --- ## partition_work `partition_work(task_id: Int, num_tasks: Int, work: Int, work_block_size: Int) -> IndexList[2]` --- ## select_inner_kernel `select_inner_kernel[a_type: DType, b_type: DType, c_type: DType]() -> InnerKernelID` --- ## use_i8mm_fn `use_i8mm_fn[a_type: DType, b_type: DType, c_type: DType]() -> Bool` --- ## use_vnni_fn `use_vnni_fn[a_type: DType, b_type: DType, c_type: DType]() -> Bool` --- ## MatmulConfig `@register_passable(trivial)` `struct MatmulConfig[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool = False, mma_shape: IndexList[3] = get_mma_shape[::DType,::DType,::Int]()]` Static configuration of GPU matmul. 
## Fields * ​block\_tile\_shape (`IndexList[3]`): * ​warp\_tile\_shape (`IndexList[3]`): * ​num\_pipeline\_stages (`UInt`): * ​num\_k\_partitions (`UInt`): * ​k\_group\_size (`UInt`): * ​num\_warp\_k\_partitions (`UInt`): * ​cluster\_shape (`IndexList[3]`): * ​num\_consumer (`UInt`): * ​partitioned\_multicast (`Bool`): * ​scheduler\_hint (`IndexList[3]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `ACCUM_PRECISION` `alias ACCUM_PRECISION = 1` ### `accum_type` `alias accum_type = get_accum_type[::DType,::DType]()` ### `OUTPUT_PRECISION` `alias OUTPUT_PRECISION = 2` ### `split_k_reduction_scheme` `alias split_k_reduction_scheme = env_get_int[::StringSlice[::Bool()` ### `split_k_reduction_type` `alias split_k_reduction_type = c_type if (env_get_int[::StringSlice[::Bool() == 2) else get_accum_type[::DType,::DType]()` ## Methods ### `__init__` `__init__(block_tile_shape: IndexList[3] = Index(128, 128, 32), warp_tile_shape: IndexList[3] = Index(64, 64, 32), cluster_shape: IndexList[3] = Index(1, 1, 1), num_pipeline_stages: UInt = UInt(4), num_k_partitions: UInt = UInt(1), k_group_size: UInt = UInt(1), num_warp_k_partitions: UInt = UInt(1), num_consumer: UInt = UInt(1), partitioned_multicast: Bool = False, scheduler_hint: IndexList[3] = Index(2, 2, 2), pdl_level: PDLLevel = PDLLevel()) -> Self` ### `__eq__` `__eq__(self, rhs: MatmulConfig[a_type, b_type, c_type, transpose_b, mma_shape]) -> Bool` ### `num_warps_m` `num_warps_m(self) -> UInt` ### `num_warps_n` `num_warps_n(self) -> UInt` ### `num_threads` `num_threads(self) -> UInt` ### `shared_mem_usage` `shared_mem_usage(self) -> Int` ### `grid_dim` `grid_dim(self, m: UInt, n: UInt) -> IndexList[3]` ### `block_dim` `block_dim(self) -> IndexList[3]` ### `work_space_size` `work_space_size(self, M: UInt, N: UInt) -> UInt` ### `pdl_level` `pdl_level(self) -> PDLLevel` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` ### `__repr__` `__repr__(self) -> String` ### `__hash__` `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with the underlying bytes. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. --- ## MatmulKernels `@register_passable(trivial)` `struct MatmulKernels[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool = False]` Supported matmul kernels. The configurations are named `<arch>_<block tile shape>_<num pipeline stages>`, for example `ampere_128x128_4`. BK, the MMA shape, and the warp tile shape are decided internally. 
## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ampere_128x128_4` `alias ampere_128x128_4 = MatmulConfig(Index(128, 128, _bk_base[::DType,::Bool]()), Index(64, 64, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(4), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `ampere_256x128_3` `alias ampere_256x128_3 = MatmulConfig(Index(128, 256, (_bk_base[::DType,::Bool]() * 2)), Index(64, 64, (_bk_base[::DType,::Bool]() * 2)), Index(1, 1, 1), UInt(3), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `ampere_256x64_4` `alias ampere_256x64_4 = MatmulConfig(Index(64, 256, _bk_base[::DType,::Bool]()), Index(64, 64, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(4), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `hopper_128x128_4` `alias hopper_128x128_4 = MatmulConfig(Index(128, 128, _bk_base[::DType,::Bool]()), Index(64, 64, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(4), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `mi300x_128x128_1` `alias mi300x_128x128_1 = MatmulConfig(Index(128, 128, _bk_base[::DType,::Bool]()), Index(64, 64, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `mi300x_128x128_2` `alias mi300x_128x128_2 = MatmulConfig(Index(128, 128, _bk_base[::DType,::Bool]()), Index(64, 64, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(2), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `mi300x_128x256_1` `alias mi300x_128x256_1 = MatmulConfig(Index(128, 256, _bk_base[::DType,::Bool]()), Index(64, 128, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 4, 2), PDLLevel())` ### `mi300x_192x256_1` `alias mi300x_192x256_1 = MatmulConfig(Index(192, 256, _bk_base[::DType,::Bool]()), Index(96, 128, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(4, 6, 2), PDLLevel())` ### `mi300x_224x256_1` `alias mi300x_224x256_1 = MatmulConfig(Index(224, 256, _bk_base[::DType,::Bool]()), Index(112, 128, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(4, 7, 2), PDLLevel())` ### `mi300x_256x256_1` `alias mi300x_256x256_1 = MatmulConfig(Index(256, 256, _bk_base[::DType,::Bool]()), Index(128, 128, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(4, 8, 2), PDLLevel())` ### `mi300x_64x64_1` `alias mi300x_64x64_1 = MatmulConfig(Index(64, 64, _bk_base[::DType,::Bool]()), Index(32, 32, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `mi300x_64x64_splitk_1` `alias mi300x_64x64_splitk_1 = MatmulConfig(Index(64, 64, _bk_base[::DType,::Bool]()), Index(32, 32, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(4), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `tuning_config` `alias tuning_config = MatmulConfig(Index(env_get_int[::StringSlice[::Bool(), env_get_int[::StringSlice[::Bool(), env_get_int[::StringSlice[::Bool()), Index(env_get_int[::StringSlice[::Bool(), env_get_int[::StringSlice[::Bool(), env_get_int[::StringSlice[::Bool()), Index(1, 1, 1), UInt(env_get_int[::StringSlice[::Bool()), UInt(env_get_int[::StringSlice[::Bool()), UInt(1), UInt(env_get_int[::StringSlice[::Bool()), UInt(1), False, Index(2, 2, 2), PDLLevel())` 
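To make the naming concrete: in `ampere_128x128_4`, the block tile is BM x BN = 128 x 128 with four pipeline stages, and the alias pairs it with a 64 x 64 warp tile. The sketch below works out the derived quantities one would expect from such a tiling (warps per block, blocks per problem); it is an illustration under standard GEMM-tiling assumptions, not the library's actual `num_warps_m`/`grid_dim` implementation.

```mojo
fn ceil_div(a: Int, b: Int) -> Int:
    # Ceiling division: how many whole tiles are needed to cover a dimension.
    return (a + b - 1) // b

fn main():
    # Tile sizes taken from the ampere_128x128_4 alias above.
    alias BM = 128  # block tile rows
    alias BN = 128  # block tile cols
    alias WM = 64   # warp tile rows
    alias WN = 64   # warp tile cols
    var M = 4096
    var N = 4096
    print("warps per block:", (BM // WM) * (BN // WN))   # 4
    print("blocks:", ceil_div(M, BM) * ceil_div(N, BN))  # 32 * 32 = 1024
```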
--- ## block_swizzle `block_swizzle(block_idx: IndexList[2, element_type=element_type], grid_dim: IndexList[2, element_type=element_type]) -> IndexList[2, element_type=element_type]` --- ## create_hilbert_lut `create_hilbert_lut(ctx: DeviceContext, grid_x: Int, grid_y: Int) -> DeviceBuffer[uint32]` Precompute a Hilbert-curve block-swizzle lookup table for a rectangular grid. The returned device pointer refers to a 1-D UInt32 array of length grid\_x \* grid\_y. For linear (row-major) block id `id`, the packed value at `lut[id]` encodes the swizzled coordinates: upper 16 bits = y, lower 16 bits = x. --- ## get_config_from_shape `get_config_from_shape[a_type: DType, b_type: DType, c_type: DType, static_N: Int, static_K: Int, transpose_b: Bool = False, target: StringSlice[StaticConstantOrigin] = _accelerator_arch()](dyn_M: Int, ctx: DeviceContext) -> MatmulConfig[a_type, b_type, c_type, transpose_b]` --- ## get_hilbert_lut_with_cache `get_hilbert_lut_with_cache(ctx: DeviceContext, grid_x: Int, grid_y: Int) -> DeviceBuffer[uint32]` Get the Hilbert lookup table using a global cache (no struct needed). --- ## utils_gpu ## Structs * [​`MatmulConfig`](./MatmulConfig): Static configuration of GPU matmul. * [​`MatmulKernels`](./MatmulKernels): Supported matmul kernels. ## Functions * [​`block_swizzle`](./block_swizzle): * [​`create_hilbert_lut`](./create_hilbert_lut): Precompute a Hilbert-curve block-swizzle lookup table for a rectangular grid. * [​`get_config_from_shape`](./get_config_from_shape): * [​`get_hilbert_lut_with_cache`](./get_hilbert_lut_with_cache): Get the Hilbert lookup table using a global cache (no struct needed). * [​`select_config`](./select_config): --- ## select_config `select_config[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool = False](M: Int, N: Int, K: Int, ctx: DeviceContext) -> MatmulConfig[a_type, b_type, c_type, transpose_b]` --- ## Backend `@register_passable(trivial)` `struct Backend` ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility`, `Writable` ## Aliases ### `AUTOMATIC` `alias AUTOMATIC = Backend(0)` ### `CUBLAS` `alias CUBLAS = Backend(1)` ### `CUBLASLT` `alias CUBLASLT = Backend(2)` ### `HIPBLASLT` `alias HIPBLASLT = Backend(4)` ### `ROCBLAS` `alias ROCBLAS = Backend(3)` ## Methods ### `__init__` `@implicit` `__init__(value: Int) -> Self` ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` ### `__is__` `__is__(self, other: Self) -> Bool` ### `__isnot__` `__isnot__(self, other: Self) -> Bool` ### `__int__` `__int__(self) -> Int` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` --- ## Handle `struct Handle[backend: Backend = _resolve_backend[linalg::vendor_blas::Backend,::DType]()]` ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `resolved_backend` `alias resolved_backend = _resolve_backend[linalg::vendor_blas::Backend,::DType]()` ### `type` `alias type = Variant[UnsafePointer[NoneType], Handle, UnsafePointer[NoneType]]` ## Methods ### `__init__` `__init__(out self)` ### `__is__` `__is__(self, other: Backend) -> Bool` ### `__isnot__` `__isnot__(self, other: Backend) -> Bool` ### `__enter__` `__enter__(self) -> Self` ### `__exit__` `__exit__(mut self)` --- ## vendor_blas ## Structs * [​`Backend`](./Backend): * [​`Handle`](./Handle): ## Functions * [​`matmul`](./matmul): Matmul using the vendor BLAS library, with a global handle. 
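The packed entries returned by `create_hilbert_lut` (documented above) store the swizzled y coordinate in the upper 16 bits and x in the lower 16 bits. A minimal decoding sketch:

```mojo
fn main():
    # Pack y=3, x=17 the way the docs above describe, then decode it back.
    var packed: UInt32 = (UInt32(3) << 16) | UInt32(17)
    var y = packed >> 16
    var x = packed & 0xFFFF
    print("x =", x, "y =", y)  # x = 17 y = 3
```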
--- ## matmul `matmul[use_tf32: Bool = False](ctx: DeviceContext, c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape], *, c_row_major: Bool = False, transpose_a: Bool = False, transpose_b: Bool = False, alpha: SIMD[float32, 1] = __init__[__mlir_type.!pop.float_literal](1), beta: SIMD[float32, 1] = __init__[__mlir_type.!pop.float_literal](0))` Matmul using the vendor BLAS library, with a global handle. `matmul[use_tf32: Bool = False](ctx: DeviceContext, handle: Handle[backend], c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape], *, c_row_major: Bool = False, transpose_a: Bool = False, transpose_b: Bool = False, alpha: SIMD[float32, 1] = __init__[__mlir_type.!pop.float_literal](1), beta: SIMD[float32, 1] = __init__[__mlir_type.!pop.float_literal](0))` Matmul using the vendor BLAS library, with a global handle. --- ## dot_i16_to_i32_AVX2 `dot_i16_to_i32_AVX2[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` The dot product of the two words in each int32 element of a and b plus an int32 from src. **Constraints:** Requires AVX2. The size of the output vector must be 4, 8 or 16. **Parameters:** * ​width (`Int`): Size of the output SIMD vector. * ​a\_type (`DType`): The DType for a. * ​b\_type (`DType`): The DType for b. * ​c\_type (`DType`): The DType for c. **Args:** * ​src (`SIMD[c_type, width]`): An int32 SIMD vector. * ​a (`SIMD[a_type, width]`): An int16 SIMD vector. * ​b (`SIMD[b_type, width]`): An int16 SIMD vector. **Returns:** A SIMD vector of width elements. --- ## dot_i16_to_i32_x86 `dot_i16_to_i32_x86[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` The dot product of the two words in each int32 element of a and b plus an int32 from src using VNNI or AVX2. **Constraints:** Requires AVX512\_VNNI or AVX2. The size of the output vector must be 4, 8 or 16. **Parameters:** * ​width (`Int`): Size of the output SIMD vector. * ​a\_type (`DType`): The DType for a. * ​b\_type (`DType`): The DType for b. * ​c\_type (`DType`): The DType for c. **Args:** * ​src (`SIMD[c_type, width]`): An int32 SIMD vector. * ​a (`SIMD[a_type, width]`): An int16 SIMD vector. * ​b (`SIMD[b_type, width]`): An int16 SIMD vector. **Returns:** A SIMD vector of width elements. --- ## dot_i8_to_i32_AVX2 `dot_i8_to_i32_AVX2[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` The dot product of the four bytes in each int32 element of a and b plus an int32 from src. **Constraints:** Requires AVX2. The size of the output vector must be 4, 8 or 16. The a argument has range \[0, 255]. The b argument has range \[-128, 127]. **Parameters:** * ​width (`Int`): Size of the output SIMD vector. * ​a\_type (`DType`): The DType for a. * ​b\_type (`DType`): The DType for b. * ​c\_type (`DType`): The DType for c. **Args:** * ​src (`SIMD[c_type, width]`): An int32 SIMD vector. * ​a (`SIMD[a_type, width]`): A uint8 SIMD vector. * ​b (`SIMD[b_type, width]`): An int8 SIMD vector. **Returns:** A SIMD vector of width elements. 
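Each int32 lane of the `dot_i8_to_i32_*` result is `src + a0*b0 + a1*b1 + a2*b2 + a3*b3`, where the four uint8 bytes of `a` and the four int8 bytes of `b` are widened to int32 before multiplying. A scalar per-lane emulation of that semantics (an illustrative sketch, not the intrinsic itself):

```mojo
fn dot_4x_i8_lane(src: Int32, a: SIMD[DType.uint8, 4], b: SIMD[DType.int8, 4]) -> Int32:
    # Widen each byte to int32, multiply pairwise, and accumulate onto src.
    var acc = src
    for i in range(4):
        acc += a[i].cast[DType.int32]() * b[i].cast[DType.int32]()
    return acc

fn main():
    var a = SIMD[DType.uint8, 4](1, 2, 3, 4)
    var b = SIMD[DType.int8, 4](10, -10, 10, -10)
    print(dot_4x_i8_lane(100, a, b))  # 100 + 10 - 20 + 30 - 40 = 80
```

The `saturated` variants below differ in the accepted range of `a` ([0, 127] rather than [0, 255]), per their stated constraints.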
--- ## dot_i8_to_i32_saturated_AVX2 `dot_i8_to_i32_saturated_AVX2[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` The dot product of the four bytes in each int32 element of a and b plus an int32 from src. **Constraints:** Requires AVX2. The size of the output vector must be 4, 8 or 16. The a argument has range \[0, 127], not \[0, 255]. The b argument has range \[-128, 127]. **Parameters:** * ​width (`Int`): Size of the output SIMD vector. * ​a\_type (`DType`): The DType for a. * ​b\_type (`DType`): The DType for b. * ​c\_type (`DType`): The DType for c. **Args:** * ​src (`SIMD[c_type, width]`): An int32 SIMD vector. * ​a (`SIMD[a_type, width]`): A uint8 SIMD vector. * ​b (`SIMD[b_type, width]`): An int8 SIMD vector. **Returns:** A SIMD vector of width elements. --- ## dot_i8_to_i32_saturated_x86 `dot_i8_to_i32_saturated_x86[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` The dot product of the four bytes in each int32 element of a and b plus an int32 from src using VNNI or AVX2. **Constraints:** Requires AVX512\_VNNI or AVX2. The size of the output vector must be 4, 8 or 16. The a argument has range \[0, 127], not \[0, 255]. The b argument has range \[-128, 127]. **Parameters:** * ​width (`Int`): Size of the output SIMD vector. * ​a\_type (`DType`): The DType for a. * ​b\_type (`DType`): The DType for b. * ​c\_type (`DType`): The DType for c. **Args:** * ​src (`SIMD[c_type, width]`): An int32 SIMD vector. * ​a (`SIMD[a_type, width]`): A uint8 SIMD vector. * ​b (`SIMD[b_type, width]`): An int8 SIMD vector. **Returns:** A SIMD vector of width elements. --- ## dot_i8_to_i32_x86 `dot_i8_to_i32_x86[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` The dot product of the four bytes in each int32 element of a and b plus an int32 from src using VNNI or AVX2. **Constraints:** Requires AVX512\_VNNI or AVX2. The size of the output vector must be 4, 8 or 16. The a argument has range \[0, 255]. The b argument has range \[-128, 127]. **Parameters:** * ​width (`Int`): Size of the output SIMD vector. * ​a\_type (`DType`): The DType for a. * ​b\_type (`DType`): The DType for b. * ​c\_type (`DType`): The DType for c. **Args:** * ​src (`SIMD[c_type, width]`): An int32 SIMD vector. * ​a (`SIMD[a_type, width]`): A uint8 SIMD vector. * ​b (`SIMD[b_type, width]`): An int8 SIMD vector. **Returns:** A SIMD vector of width elements. --- ## vnni_intrinsics ## Functions * [​`dot_i16_to_i32_AVX2`](./dot_i16_to_i32_AVX2): The dot product of the two words in each int32 element of a and b plus an int32 from src. * [​`dot_i16_to_i32_x86`](./dot_i16_to_i32_x86): The dot product of the two words in each int32 element of a and b plus an int32 from src using VNNI or AVX2. * [​`dot_i8_to_i32_AVX2`](./dot_i8_to_i32_AVX2): The dot product of the four bytes in each int32 element of a and b plus an int32 from src. * [​`dot_i8_to_i32_saturated_AVX2`](./dot_i8_to_i32_saturated_AVX2): The dot product of the four bytes in each int32 element of a and b plus an int32 from src. * [​`dot_i8_to_i32_saturated_x86`](./dot_i8_to_i32_saturated_x86): The dot product of the four bytes in each int32 element of a and b plus an int32 from src using VNNI or AVX2. 
* [​`dot_i8_to_i32_x86`](./dot_i8_to_i32_x86): The dot product of the four bytes in each int32 element of a and b plus an int32 from src using VNNI or AVX2. * [​`pmaddubs`](./pmaddubs): * [​`pmaddw`](./pmaddw): * [​`vpdpbusd`](./vpdpbusd): * [​`vpdpbusds`](./vpdpbusds): * [​`vpdpwssd`](./vpdpwssd): * [​`vpdpwssds`](./vpdpwssds): --- ## pmaddubs `pmaddubs[width: Int](a: SIMD[int32, width], b: SIMD[int32, width]) -> SIMD[int32, width]` --- ## pmaddw `pmaddw[width: Int](a: SIMD[int32, width], b: SIMD[int32, width]) -> SIMD[int32, width]` --- ## vpdpbusd `vpdpbusd[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` --- ## vpdpbusds `vpdpbusds[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` --- ## vpdpwssd `vpdpwssd[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` --- ## vpdpwssds `vpdpwssds[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` --- ## elu `elu[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]` Compute the Elu Op using the equation $z \text{ if } z \ge 0 \text{ else } \alpha(e^z - 1)$. **Parameters:** * ​type (`DType`): DType used for the computation. * ​simd\_width (`Int`): SIMD width used for the computation. **Args:** * ​x (`SIMD[type, simd_width]`): The value to compute the ELU operation on. **Returns:** The result of the ELU operation. --- ## gelu `gelu[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]` Compute the GELU Op using the equation $0.5 * x * (1 + erf(x / sqrt(2)))$. **Constraints:** Type must be a floating point type. **Parameters:** * ​type (`DType`): DType used for the computation. * ​simd\_width (`Int`): SIMD width used for the computation. **Args:** * ​x (`SIMD[type, simd_width]`): The value to compute the GELU operation on. **Returns:** The result of the GELU operation. --- ## gelu_approximate `gelu_approximate[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]` Compute the approximate GELU Op using the equation $0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))$. **Constraints:** Type must be a floating point type. **Parameters:** * ​type (`DType`): The `DType` used for the computation. * ​simd\_width (`Int`): SIMD width used for the computation. **Args:** * ​x (`SIMD[type, simd_width]`): The value to compute the GELU operation on. **Returns:** The result of the approximate GELU operation. --- ## activations This module contains implementations of activation functions. ## Functions * [​`elu`](./elu): Compute the Elu Op using the equation $z \text{ if } z \ge 0 \text{ else } \alpha(e^z - 1)$. * [​`gelu`](./gelu): Compute the GELU Op using the equation $0.5 * x * (1 + erf(x / sqrt(2)))$. * [​`gelu_approximate`](./gelu_approximate): Compute the approximate GELU Op using the equation $0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))$. * [​`relu`](./relu): Compute the Relu Op using the equation $max(0, x)$. * [​`relu_n1`](./relu_n1): Compute the Relu N1 Op using the equation $max(min(x,1),-1)$. * [​`sign`](./sign): Compute the sign (0, 1) of the input value. 
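As a numeric cross-check of the two GELU formulas above, the exact (erf-based) form and the tanh approximation agree to about three decimal places near x = 1. A standalone sketch in plain Mojo arithmetic, independent of the kernels above:

```mojo
from math import erf, sqrt, tanh

fn gelu_exact(x: Float64) -> Float64:
    # 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + erf(x / sqrt(2.0)))

fn gelu_tanh(x: Float64) -> Float64:
    # 0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))
    var inner = sqrt(2.0 / 3.141592653589793) * (x + 0.044715 * x * x * x)
    return 0.5 * x * (1.0 + tanh(inner))

fn main():
    print(gelu_exact(1.0))  # ~0.84134
    print(gelu_tanh(1.0))   # ~0.84120
```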
--- ## relu `relu[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]` Compute the Relu Op using the equation $max(0, x)$. **Parameters:** * ​type (`DType`): DType used for the computation. * ​simd\_width (`Int`): SIMD width used for the computation. **Args:** * ​x (`SIMD[type, simd_width]`): The value to compute the RELU operation on. **Returns:** The result of the RELU operation. --- ## relu_n1 `relu_n1[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]` Compute the Relu N1 Op using the equation $max(min(x,1),-1)$. **Parameters:** * ​type (`DType`): DType used for the computation. * ​simd\_width (`Int`): SIMD width used for the computation. **Args:** * ​x (`SIMD[type, simd_width]`): The value to compute the RELU N1 operation on. **Returns:** The result of the RELU N1 operation. --- ## sign `sign[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]` Compute the sign (0, 1) of the input value. **Parameters:** * ​type (`DType`): DType used for the computation. * ​simd\_width (`Int`): SIMD width used for the computation. **Args:** * ​x (`SIMD[type, simd_width]`): The value to compute the sign operation on. **Returns:** The result of the sign operation. --- ## arange `arange[type: DType, simd_width: Int](start: SIMD[type, 1], stop: SIMD[type, 1], step: SIMD[type, 1], index: IndexList[1]) -> SIMD[type, simd_width]` --- ## arange_shape `arange_shape[type: DType, single_thread_blocking_override: Bool](start: SIMD[type, 1], stop: SIMD[type, 1], step: SIMD[type, 1]) -> IndexList[1]` --- ## arange ## Functions * [​`arange`](./arange): * [​`arange_shape`](./arange_shape): --- ## arg_nonzero `arg_nonzero[type: DType, output_type: DType](input_buffer: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output_buffer: LayoutTensor[output_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Gather the indices of all non-zero elements in the input buffer, storing the indices in the output\_buffer. **Parameters:** * ​type (`DType`): The element type. * ​output\_type (`DType`): The integer type to store the indices in. **Args:** * ​input\_buffer (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to count the non-zeros in. * ​output\_buffer (`LayoutTensor[output_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The indices of all non-zero elements. --- ## arg_nonzero_shape `arg_nonzero_shape[type: DType, single_thread_blocking_override: Bool](input_buffer: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> IndexList[2]` Return \[NumNonZeros, InputRank] where NumNonZeros is the number of non-zero elements in the input. **Parameters:** * ​type (`DType`): The element type. * ​single\_thread\_blocking\_override (`Bool`): This op can block. 
**Args:** * ​input\_buffer (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to count the non-zeros in. **Returns:** Shape of the arg\_nonzero kernel for this input \[NumNonZeros, InputRank]. --- ## arg_nonzero ## Functions * [​`arg_nonzero`](./arg_nonzero): Gather the indices of all non-zero elements in the input buffer, storing the indices in the output\_buffer. * [​`arg_nonzero_shape`](./arg_nonzero_shape): Return \[NumNonZeros, InputRank] where NumNonZeros is the number of non-zero elements in the input. --- ## argmax `argmax(input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], axis: Int, output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Finds the indices of the maximum element along the specified axis. **Args:** * ​input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * ​axis (`Int`): The axis. * ​output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor. `argmax(input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], axis_buf: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Finds the indices of the maximum element along the specified axis. **Args:** * ​input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * ​axis\_buf (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The axis tensor. * ​output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor. 
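To pin down the semantics: for each position along the non-reduced axes, `argmax` writes the index of the maximum element along the reduced axis into `output`. A scalar sketch over a row-major 2x4 buffer, reducing the innermost axis (illustrative only; the kernel above operates on `LayoutTensor`s):

```mojo
fn main():
    var data = List[Float32](1.0, 9.0, 3.0, 2.0,
                             7.0, 0.0, 7.5, -1.0)
    alias rows = 2
    alias cols = 4
    for r in range(rows):
        var best = 0
        for c in range(1, cols):
            # Keep the index of the largest value seen so far in this row.
            if data[r * cols + c] > data[r * cols + best]:
                best = c
        print("row", r, "argmax:", best)  # row 0 -> 1, row 1 -> 2
```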
--- ## argmin `argmin(input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], axis: Int, output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Finds the indices of the minimum element along the specified axis. **Args:** * ​input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * ​axis (`Int`): The axis. * ​output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor. `argmin(input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], axis_buf: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Finds the indices of the minimum element along the specified axis. **Args:** * ​input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * ​axis\_buf (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The axis tensor. * ​output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor. --- ## argmaxmin ## Functions * [​`argmax`](./argmax): Finds the indices of the maximum element along the specified axis. * [​`argmin`](./argmin): Finds the indices of the minimum element along the specified axis. --- ## argmax_gpu `argmax_gpu[type: DType, output_type: DType, rank: Int](ctx: DeviceContext, input: NDBuffer[type, rank, origin], output: NDBuffer[output_type, rank, origin])` --- ## argmaxmin_gpu `argmaxmin_gpu[type: DType, output_type: DType, rank: Int, largest: Bool](ctx: DeviceContext, input: NDBuffer[type, rank, origin], output: NDBuffer[output_type, rank, origin])` Wraps the Top-K GPU kernel with K=1 to perform argmax on the inner-most dimension. **Parameters:** * ​type (`DType`): DType - The data type of the input tensor. * ​output\_type (`DType`): DType - The data type of the output tensor. * ​rank (`Int`): Int - The rank of the input tensor. * ​largest (`Bool`): Bool - Whether to perform argmax or argmin. 
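`argmaxmin_gpu` folds both reductions into one kernel selected by the `largest` parameter; flipping the comparison is the only difference, as in this scalar sketch (illustrative, not the GPU kernel):

```mojo
fn top1[largest: Bool](xs: List[Float64]) -> Int:
    # Index of the maximum (largest=True) or minimum (largest=False) element.
    var best = 0
    for i in range(1, len(xs)):
        if (xs[i] > xs[best]) if largest else (xs[i] < xs[best]):
            best = i
    return best

fn main():
    var xs = List[Float64](3.0, -1.0, 8.0, 2.0)
    print(top1[True](xs))   # 2 (argmax)
    print(top1[False](xs))  # 1 (argmin)
```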
--- ## argmin_gpu `argmin_gpu[type: DType, output_type: DType, rank: Int](ctx: DeviceContext, input: NDBuffer[type, rank, origin], output: NDBuffer[output_type, rank, origin])` --- ## argmaxmin_gpu ## Functions * [​`argmax_gpu`](./argmax_gpu): * [​`argmaxmin_gpu`](./argmaxmin_gpu): Wraps the Top-K GPU kernel with K=1 to perform argmax on the inner-most dimension. * [​`argmin_gpu`](./argmin_gpu): --- ## argsort `argsort[*, ascending: Bool = True, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ctx: DeviceContext)` Performs argsort on input buffer, storing indices in output buffer. **Parameters:** * ​ascending (`Bool`): Sort direction (True for ascending, False for descending). * ​target (`StringSlice[StaticConstantOrigin]`): Target device ("cpu" or "gpu"). **Args:** * ​output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Buffer to store sorted indices. * ​input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Buffer containing values to sort. * ​ctx (`DeviceContext`): Device context for execution. `argsort[ascending: Bool = True](output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` CPU-only version of argsort. **Parameters:** * ​ascending (`Bool`): Sort direction (True for ascending, False for descending). **Args:** * ​output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Buffer to store sorted indices. * ​input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Buffer containing values to sort. --- ## argsort ## Functions * [​`argsort`](./argsort): Performs argsort on input buffer, storing indices in output buffer. --- ## cpu_bicubic_kernel `cpu_bicubic_kernel[type: DType, rank: Int, //](output_host: NDBuffer[type, rank, origin, shape, strides], input_host: NDBuffer[type, rank, origin, shape, strides])` Perform bicubic interpolation on an NDBuffer of form NCHW. **Args:** * ​output\_host (`NDBuffer[type, rank, origin, shape, strides]`): Output tensor with desired dimensions. * ​input\_host (`NDBuffer[type, rank, origin, shape, strides]`): Input tensor of shape \[B, C, H, W]. 
--- ## cubic_kernel `cubic_kernel(x: SIMD[float32, 1]) -> SIMD[float32, 1]` Cubic interpolation kernel matching PyTorch/torchvision's BICUBIC filter. This uses the Catmull-Rom variant (Robidoux cubic) with a = -0.75, which is what PyTorch uses in get\_cubic\_upsample\_coefficients. ([Source](https://github.com/pytorch/pytorch/blob/59eb61b2d1e4b64debbefa036acd0d8c7d55f0a3/aten/src/ATen/native/UpSample.h#L410-L423)). This also matches OpenCV's [interpolateCubic](https://github.com/opencv/opencv/blob/cf2a3c8e7430cc92569dd7f114609f9377b12d9e/modules/imgproc/src/resize.cpp#L907-L915). **Args:** * ​x (`SIMD[float32, 1]`): Distance from the center point. **Returns:** Weight contribution based on the distance. `cubic_kernel(x: SIMD[dtype, size]) -> SIMD[dtype, size]` Cubic interpolation kernel matching PyTorch/torchvision's BICUBIC filter. This uses the Catmull-Rom variant (Robidoux cubic) with a = -0.75, which is what PyTorch uses in get\_cubic\_upsample\_coefficients. ([Source](https://github.com/pytorch/pytorch/blob/59eb61b2d1e4b64debbefa036acd0d8c7d55f0a3/aten/src/ATen/native/UpSample.h#L410-L423)). This also matches OpenCV's [interpolateCubic](https://github.com/opencv/opencv/blob/cf2a3c8e7430cc92569dd7f114609f9377b12d9e/modules/imgproc/src/resize.cpp#L907-L915). **Args:** * ​x (`SIMD[dtype, size]`): Distance from the center point. **Returns:** Weight contribution based on the distance. --- ## gpu_bicubic_kernel `gpu_bicubic_kernel[type: DType, rank: Int](output: NDBuffer[type, rank, MutableAnyOrigin], input: NDBuffer[type, rank, MutableAnyOrigin])` Perform bicubic interpolation using GPU. **Args:** * ​output (`NDBuffer[type, rank, MutableAnyOrigin]`): Output tensor with desired dimensions on the device. * ​input (`NDBuffer[type, rank, MutableAnyOrigin]`): Input tensor of shape \[B, C, H, W] on the device. --- ## bicubic This module provides CPU and GPU implementations for bicubic interpolation. Bicubic interpolation is a 2D extension of cubic interpolation for resampling digital images. It uses the weighted average of the 4x4 neighborhood of pixels around the target location to compute the interpolated value. ## Functions * [​`cpu_bicubic_kernel`](./cpu_bicubic_kernel): Perform bicubic interpolation on an NDBuffer of form NCHW. * [​`cubic_kernel`](./cubic_kernel): Cubic interpolation kernel matching PyTorch/torchvision's BICUBIC filter. * [​`gpu_bicubic_kernel`](./gpu_bicubic_kernel): Perform bicubic interpolation using GPU. * [​`map_output_to_input_coord`](./map_output_to_input_coord): Map output pixel coordinate to input coordinate using center alignment. * [​`resize_bicubic`](./resize_bicubic): Perform bicubic interpolation. --- ## map_output_to_input_coord `map_output_to_input_coord(output_coord: Int, scale: SIMD[float32, 1]) -> SIMD[float32, 1]` Map output pixel coordinate to input coordinate using center alignment. This implements the standard coordinate mapping for image resizing: input\_coord = (output\_coord + 0.5) \* scale - 0.5. The +0.5 and -0.5 terms ensure pixel centers are aligned properly. **Args:** * ​output\_coord (`Int`): Output pixel coordinate. * ​scale (`SIMD[float32, 1]`): Scale factor (input\_size / output\_size). **Returns:** Corresponding input coordinate as a float. 
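The pieces above compose as follows: each output pixel is mapped back to a fractional input coordinate, and the nearest 4 samples per axis are blended with cubic weights. A standalone sketch of the mapping formula and the a = -0.75 cubic-convolution weight (a restatement under standard assumptions, not the library code):

```mojo
fn map_coord(output_coord: Int, scale: Float32) -> Float32:
    # input_coord = (output_coord + 0.5) * scale - 0.5
    return (Float32(output_coord) + 0.5) * scale - 0.5

fn cubic_weight(x: Float32) -> Float32:
    # Standard cubic convolution kernel with a = -0.75:
    #   (a+2)|x|^3 - (a+3)|x|^2 + 1      for |x| <= 1
    #   a|x|^3 - 5a|x|^2 + 8a|x| - 4a    for 1 < |x| < 2
    #   0                                otherwise
    var a: Float32 = -0.75
    var ax = abs(x)
    if ax <= 1.0:
        return ((a + 2.0) * ax - (a + 3.0)) * ax * ax + 1.0
    elif ax < 2.0:
        return (((ax - 5.0) * ax + 8.0) * ax - 4.0) * a
    return 0.0

fn main():
    # Halving the resolution (scale = 2): output pixel 3 reads around input 6.5.
    print(map_coord(3, 2.0))  # 6.5
    print(cubic_weight(0.0))  # 1.0
    print(cubic_weight(1.0))  # 0.0 (the weight is continuous at the knots)
```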
--- ## resize_bicubic `resize_bicubic[type: DType, rank: Int, //, target: StringSlice[StaticConstantOrigin]](output: NDBuffer[type, rank, origin, shape, strides], input: NDBuffer[type, rank, origin, shape, strides], ctx: DeviceContextPtr)` Perform bicubic interpolation. **Args:** * ​output (`NDBuffer[type, rank, origin, shape, strides]`): Output tensor with desired dimensions on host or device. * ​input (`NDBuffer[type, rank, origin, shape, strides]`): Input tensor of shape \[B, C, H, W] on host or device. * ​ctx (`DeviceContextPtr`): Device context to enqueue GPU kernels on. --- ## broadcast `broadcast[type: DType](output: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` For each axis of `input`, if the dimension is 1, duplicate the data at each index of the corresponding axis in `output`, otherwise copy over the entire axis to the corresponding axis in `output`. **Args:** * ​output (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output buffer. * ​input (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input buffer. --- ## broadcast_impl `broadcast_impl[type: DType](axis: Int, output: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input_prev_axis_stride: Int, output_prev_axis_stride: Int, input_offset: Int, output_offset: Int, rightmost_broadcast_axis: Int)` For each axis of `input` ∈ \[axis, rank), if the dimension is 1, duplicate the data at each index of the corresponding axis in `output`, otherwise copy over the entire axis to the corresponding axis in `output`. **Args:** * ​axis (`Int`): The axis value. * ​output (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output buffer. * ​input (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input buffer. * ​input\_prev\_axis\_stride (`Int`): The stride at axis `axis - 1` for input. * ​output\_prev\_axis\_stride (`Int`): The stride at axis `axis - 1` for output. * ​input\_offset (`Int`): The offset at which we start copying data from. * ​output\_offset (`Int`): The offset at which we start copying data to. * ​rightmost\_broadcast\_axis (`Int`): The largest axis at which we need to duplicate `input` data. --- ## broadcast ## Functions * [​`broadcast`](./broadcast): For each axis of `input`, if the dimension is 1, duplicate the data at each index of the corresponding axis in `output`, otherwise copy over the entire axis to the corresponding axis in `output`. 
* [​`broadcast_impl`](./broadcast_impl): For each axis of `input` ∈ \[axis, rank), if the dimension is 1, duplicate the data at each index of the corresponding axis in `output`, otherwise copy over the entire axis to the corresponding axis in `output`. --- ## concat `concat[rank: Int, type: DType, single_thread_blocking_override: Bool, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), epilogue_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None] = OptionalReg[fn[DType, Int, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]({:i1 0, 1})](output: NDBuffer[type, rank, origin], axis: Int, inputs: StaticTuple[NDBuffer[type, rank, MutableAnyOrigin], size], context: DeviceContextPtr = DeviceContextPtr())` --- ## concat_shape `concat_shape[input_rank: Int, input_type: DType, single_thread_blocking_override: Bool](input_bufs: List[NDBuffer[input_type, input_rank, MutableAnyOrigin]], axis: Int) -> IndexList[input_rank]` Compute the output shape of a `concat` operation, and assert the inputs are compatible. **Parameters:** * ​input\_rank (`Int`): Input\_rank of the input tensor. * ​input\_type (`DType`): Type of the input tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input\_bufs (`List[NDBuffer[input_type, input_rank, MutableAnyOrigin]]`): The input tensors list. * ​axis (`Int`): The axis. **Returns:** The output shape. --- ## fused_concat `fused_concat[type: DType, rank: Int, single_thread_blocking_override: Bool, input_fn: fn[Int, Int, Int](IndexList[$2]) capturing -> SIMD[type, $1], output_0_fn: fn[DType, Int, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](axis: Int, input_shapes: StaticTuple[IndexList[rank], size], output: NDBuffer[type, rank, origin], ctx: DeviceContextPtr)` --- ## concat ## Aliases ### `elementwise_epilogue_type` `alias elementwise_epilogue_type = fn[DType, Int, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None` ## Functions * [​`concat`](./concat): * [​`concat_shape`](./concat_shape): Compute the output shape of a `concat` operation, and assert the inputs are compatible. * [​`fused_concat`](./fused_concat): * [​`memcpy_or_fuse`](./memcpy_or_fuse): --- ## memcpy_or_fuse `memcpy_or_fuse[rank: Int, type: DType, epilogue_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]](dest_data: UnsafePointer[SIMD[int8, 1]], out_byte_offset: Int, src_data: UnsafePointer[SIMD[int8, 1]], n: Int, out_shape: IndexList[rank, element_type=element_type])` --- ## ConvDirectNHWC `struct ConvDirectNHWC[input_mut: Bool, filter_mut: Bool, //, input_rank: Int, filter_rank: Int, output_rank: Int, input_origin: Origin[input_mut], filter_origin: Origin[filter_mut], output_origin: MutableOrigin, input_shape: DimList, filter_shape: DimList, output_shape: DimList, input_type: DType, filter_type: DType, output_type: DType, filter_packed: Bool, conv_attr: ConvInfoStatic[(input_rank + -2)], elementwise_epilogue: OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None] = OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None]({:i1 0, 1})]` Implement the outer loops for direct convolution. Collapse N, HO, WO into one dimension n\_ho\_wo. Tile n\_ho\_wo, C, and F. The tile factors for C and F are chosen by a heuristic prioritizing C. 
n\_ho\_wo is tiled by the micro kernel's height. If n\_ho\_wo is large enough to spill the LLC, we may need to tile n\_ho\_wo as the outermost loop with a factor that fits in the LLC. Assume F is divisible at least by simd\_size. ## Fields * ​output (`NDBuffer[output_type, output_rank, output_origin, output_shape]`): * ​input (`NDBuffer[input_type, input_rank, input_origin, input_shape]`): * ​filter (`NDBuffer[filter_type, filter_rank, filter_origin, filter_shape]`): * ​conv\_shape (`ConvShape[(input_rank + -2)]`): * ​partition (`ConvPartition`): * ​cf\_tile\_size (`IndexList[2]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `packed_and_fully_static` `alias packed_and_fully_static = filter_packed if filter_shape.all_known[::Int]() if output_shape.all_known[::Int,::Int]() if input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else output_shape.all_known[::Int,::Int]() if input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else filter_shape.all_known[::Int]() if output_shape.all_known[::Int,::Int]() if input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else output_shape.all_known[::Int,::Int]() if input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known()` ## Methods ### `run` `static run(output: NDBuffer[output_type, output_rank, output_origin, output_shape], input: NDBuffer[input_type, input_rank, input_origin, input_shape], filter: NDBuffer[filter_type, filter_rank, filter_origin, filter_shape], conv_shape: ConvShape[(input_rank + -2)])` ### `is_new_c_accum` `is_new_c_accum(self, c_idx: Int) -> Bool` ### `update_output_tile_no_padding` `update_output_tile_no_padding[micro_kernel_height: Int, micro_kernel_width: Int, c_fully_cached: Bool, has_residual: Bool, last_c_tile: Bool](self, n: Int, f_tile_offset: Int, f_tile_size: Int, c_tile_offset: Int, c_tile_size: Int, output_flat_coord: Int)` ### `output_space_flat_loop` `output_space_flat_loop[micro_kernel_f_size: Int, has_residual: Bool, last_c_tile: Bool](self, n: Int, f_tile_offset: Int, f_tile_size: Int, c_tile_offset: Int, c_tile_size: Int)` ### `output_space_loop` `output_space_loop[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool](self, n: Int, f_tile_offset: Int, f_tile_size: Int, c_tile_offset: Int, c_tile_size: Int)` ### `output_space_loop_1d` `output_space_loop_1d[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](self, output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], n: Int, first_c_tile_in_group: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, left_pad_impact_end: Int, right_pad_impact_start: Int)` ### `output_space_loop_2d` `output_space_loop_2d[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](self, output: UnsafePointer[SIMD[output_dt, 1]], input: 
UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], n: Int, first_c_tile_in_group: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, left_pad_impact_end: Int, right_pad_impact_start: Int)` ### `output_space_loop_3d` `output_space_loop_3d[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](self, output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], n: Int, first_c_tile_in_group: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, left_pad_impact_end: Int, right_pad_impact_start: Int)` --- ## CuDNNConvMeta `@register_passable` `struct CuDNNConvMeta` ## Fields * ​ptr\_handle (`UnsafePointer[UnsafePointer[NoneType]]`): * ​ptr\_input\_desc (`UnsafePointer[UnsafePointer[NoneType]]`): * ​ptr\_filter\_desc (`UnsafePointer[UnsafePointer[NoneType]]`): * ​ptr\_conv\_desc (`UnsafePointer[UnsafePointer[NoneType]]`): * ​ptr\_output\_desc (`UnsafePointer[UnsafePointer[NoneType]]`): ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` ### `__del__` `__del__(owned self)` --- ## Naive2dConvolution `struct Naive2dConvolution[output_type: DType, input_type: DType, filter_type: DType]` Struct wrapper for naive 2d convolution implementation. ## Fields * ​output (`UnsafePointer[SIMD[output_type, 1]]`): * ​input (`UnsafePointer[SIMD[input_type, 1]]`): * ​filter (`UnsafePointer[SIMD[filter_type, 1]]`): * ​pad\_d (`IndexList[2]`): * ​pad\_h (`IndexList[2]`): * ​pad\_w (`IndexList[2]`): * ​stride (`IndexList[3]`): * ​dilation (`IndexList[3]`): * ​num\_groups (`Int`): * ​output\_shape (`IndexList[5]`): * ​input\_shape (`IndexList[5]`): * ​filter\_shape (`IndexList[5]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, output: UnsafePointer[SIMD[output_type, 1]], input: UnsafePointer[SIMD[input_type, 1]], filter: UnsafePointer[SIMD[filter_type, 1]], output_shape: IndexList[5], input_shape: IndexList[5], filter_shape: IndexList[5], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2], stride: IndexList[3], dilation: IndexList[3], num_groups: Int)` ### `run` `static run(output: UnsafePointer[SIMD[output_type, 1]], input: UnsafePointer[SIMD[input_type, 1]], filter: UnsafePointer[SIMD[filter_type, 1]], output_shape: IndexList[5], input_shape: IndexList[5], filter_shape: IndexList[5], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2], stride: IndexList[3], dilation: IndexList[3], num_groups: Int)` --- ## accumulate_wo_tile_1d `accumulate_wo_tile_1d[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, partial_load_filter: Bool, effected_by_padding: Bool, input_dt: DType, filter_dt: DType](c_tile_size: Int, S: Int, mut acc: _Accumulator[type, num_rows, num_cols, simd_width, row_start, row_stop], input: UnsafePointer[SIMD[input_dt, 1]], input_stride: Int, input_stride_to_nbr: Int, filter: UnsafePointer[SIMD[filter_dt, 1]], filter_stride: Int, filter_stride_to_nbr: Int, partial_load_filter_size: Int, w: Int, W: Int, dilation: Int)` Update one row in the output for a given (c, f) tile. **Parameters:** * ​micro\_kernel\_height (`Int`): Number of input points in register tiling. * ​micro\_kernel\_width (`Int`): Number of SIMD registers assigned to F. * ​simd\_size (`Int`): Number of elements in a SIMD register. 
* ​partial\_load\_filter (`Bool`): Whether to use partial load for the filter. * ​effected\_by\_padding (`Bool`): Whether the tile is affected by padding. * ​input\_dt (`DType`): DType of input. * ​filter\_dt (`DType`): DType of filter. **Args:** * ​c\_tile\_size (`Int`): Tile size in input channel. * ​S (`Int`): Filter window width. * ​acc (`_Accumulator[type, num_rows, num_cols, simd_width, row_start, row_stop]`): Pointer to register tile accumulator. * ​input (`UnsafePointer[SIMD[input_dt, 1]]`): Pointer to the first input point in WO tile. * ​input\_stride (`Int`): Stride between two input points, i.e., C w/ NHWC layout. * ​input\_stride\_to\_nbr (`Int`): Stride between an input point and its neighbor. * ​filter (`UnsafePointer[SIMD[filter_dt, 1]]`): Pointer to the first coef in the filter window. * ​filter\_stride (`Int`): Stride between two segments of size `micro_kernel_width * simd_size`. * ​filter\_stride\_to\_nbr (`Int`): Stride between two neighbor coefs, i.e., CF w/ RSCF layout. * ​partial\_load\_filter\_size (`Int`): Size of partial load for filter. * ​w (`Int`): Coordinate in an input row. * ​W (`Int`): Input width. * ​dilation (`Int`): Convolution dilation. --- ## accumulate_wo_tile_2d `accumulate_wo_tile_2d[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, partial_load_filter: Bool, effected_by_padding: Bool, input_dt: DType, filter_dt: DType](c_tile_size: Int, RS: IndexList[2], mut acc: _Accumulator[type, num_rows, num_cols, simd_width, row_start, row_stop], input: UnsafePointer[SIMD[input_dt, 1]], input_stride: Int, input_stride_to_nbr: IndexList[2], filter: UnsafePointer[SIMD[filter_dt, 1]], filter_stride: Int, filter_stride_to_nbr: IndexList[2], partial_load_filter_size: Int, hw: IndexList[2], HW: IndexList[2], dilation: IndexList[2])` --- ## accumulate_wo_tile_3d `accumulate_wo_tile_3d[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, partial_load_filter: Bool, effected_by_padding: Bool, input_dt: DType, filter_dt: DType](c_tile_size: Int, QRS: IndexList[3], mut acc: _Accumulator[type, num_rows, num_cols, simd_width, row_start, row_stop], input: UnsafePointer[SIMD[input_dt, 1]], input_stride: Int, input_stride_to_nbr: IndexList[3], filter: UnsafePointer[SIMD[filter_dt, 1]], filter_stride: Int, filter_stride_to_nbr: IndexList[3], partial_load_filter_size: Int, dhw: IndexList[3], DHW: IndexList[3], dilation: IndexList[3])` --- ## check_cudnn_error `check_cudnn_error(stat: cudnnStatus_t)` --- ## conv1d_update_wo_tile `conv1d_update_wo_tile[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, filter_packed: Bool, effected_by_padding: Bool, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType, elementwise_epilogue: OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None] = OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None]({:i1 0, 1})](output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], first_c_tile: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, conv_shape: ConvShape[rank], n: Int, wo: Int)` --- ## conv2d_gpu_naive_nhwc_rscf `conv2d_gpu_naive_nhwc_rscf[input_dim: DimList, filter_dim: DimList, output_dim: DimList, input_type: DType, filter_type: DType, output_type: DType, block_size: Int, maybe_epilogue_func: OptionalReg[fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]](input: NDBuffer[input_type, 4, MutableAnyOrigin, input_dim], filter: 
NDBuffer[filter_type, 4, MutableAnyOrigin, filter_dim], output: NDBuffer[output_type, 4, MutableAnyOrigin, output_dim], stride: IndexList[2], dilation: IndexList[2], padding: IndexList[2])` --- ## conv2d_update_wo_tile `conv2d_update_wo_tile[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, filter_packed: Bool, effected_by_padding: Bool, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType, elementwise_epilogue: OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None] = OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None]({:i1 0, 1})](output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], first_c_tile: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, conv_shape: ConvShape[2], n: Int, howo: IndexList[2])` --- ## conv3d_gpu_naive_ndhwc_qrscf `conv3d_gpu_naive_ndhwc_qrscf[input_dim: DimList, filter_dim: DimList, output_dim: DimList, input_type: DType, filter_type: DType, output_type: DType, block_size: Int, maybe_epilogue_func: OptionalReg[fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]](input: NDBuffer[input_type, 5, MutableAnyOrigin, input_dim], filter: NDBuffer[filter_type, 5, MutableAnyOrigin, filter_dim], output: NDBuffer[output_type, 5, MutableAnyOrigin, output_dim], stride: IndexList[3], dilation: IndexList[3], padding: IndexList[3])` --- ## conv3d_update_wo_tile `conv3d_update_wo_tile[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, filter_packed: Bool, effected_by_padding: Bool, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType, elementwise_epilogue: OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None] = OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None]({:i1 0, 1})](output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], first_c_tile: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, conv_shape: ConvShape[3], n: Int, dohowo: IndexList[3])` --- ## conv_cudnn `conv_cudnn[input_type: DType, filter_type: DType, output_type: DType](input: NDBuffer[input_type, 4, MutableAnyOrigin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], filter: NDBuffer[filter_type, 4, MutableAnyOrigin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], output: NDBuffer[output_type, 4, MutableAnyOrigin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], stride: IndexList[2], dilation: IndexList[2], padding: IndexList[2], num_groups: Int, ctx: DeviceContext)` --- ## conv_gpu `conv_gpu[input_rank: Int, filter_rank: Int, input_dim: DimList, filter_dim: DimList, output_dim: DimList, input_type: DType, filter_type: DType, output_type: DType, maybe_epilogue_func: OptionalReg[fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]({:i1 0, 1}), filter_is_fcrs: Bool = False](input: NDBuffer[input_type, input_rank, MutableAnyOrigin, input_dim], filter: NDBuffer[filter_type, filter_rank, MutableAnyOrigin, filter_dim], output: NDBuffer[output_type, input_rank, MutableAnyOrigin, output_dim], stride: IndexList[(input_rank + -2)], dilation: IndexList[(input_rank + -2)], padding: IndexList[(input_rank + -2)], num_groups: Int, ctx: 
DeviceContext)` --- ## conv_nhwc_direct `conv_nhwc_direct[input_rank: Int, filter_rank: Int, input_shape: DimList, filter_shape: DimList, output_shape: DimList, input_type: DType, filter_type: DType, output_type: DType, filter_packed: Bool, conv_info_static: ConvInfoStatic[(input_rank + -2)], lambdas_have_fusion: Bool, elementwise_lambda: fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None](input: NDBuffer[input_type, input_rank, origin, input_shape], filter: NDBuffer[filter_type, filter_rank, origin, filter_shape], output: NDBuffer[output_type, input_rank, origin, output_shape], stride: IndexList[(input_rank + -2)], dilation: IndexList[(input_rank + -2)], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2], num_groups: Int)` --- ## conv_shape `conv_shape[input_rank: Int, filter_rank: Int, input_type: DType, filter_type: DType, strides_type: DType, dilations_type: DType, paddings_type: DType, single_thread_blocking_override: Bool](input_buf: NDBuffer[input_type, input_rank, origin], filter_buf: NDBuffer[filter_type, filter_rank, origin], strides_buf: NDBuffer[strides_type, 1, origin], dilations_buf: NDBuffer[dilations_type, 1, origin], paddings_buf: NDBuffer[paddings_type, 1, origin], num_groups_scalar: SIMD[dtype, 1]) -> IndexList[input_rank]` Compute the output shape of a `conv` operation, and assert the inputs are compatible. **Parameters:** * ​input\_rank (`Int`): Rank of the input tensor. * ​filter\_rank (`Int`): Rank of the filter tensor. * ​input\_type (`DType`): Type of the input tensor. * ​filter\_type (`DType`): Type of the filter tensor. * ​strides\_type (`DType`): Type of the strides tensor. * ​dilations\_type (`DType`): Type of the dilations tensor. * ​paddings\_type (`DType`): Type of the paddings tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input\_buf (`NDBuffer[input_type, input_rank, origin]`): The input tensor. * ​filter\_buf (`NDBuffer[filter_type, filter_rank, origin]`): The filter tensor. * ​strides\_buf (`NDBuffer[strides_type, 1, origin]`): The strides tensor. * ​dilations\_buf (`NDBuffer[dilations_type, 1, origin]`): The dilations tensor. * ​paddings\_buf (`NDBuffer[paddings_type, 1, origin]`): The paddings tensor. * ​num\_groups\_scalar (`SIMD[dtype, 1]`): The num\_groups scalar. **Returns:** The output shape.
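The per-dimension size computed here follows the standard convolution shape rule. The helper below is a minimal sketch of that rule only (an illustrative helper, not the library routine). For example:

```mojo
# Minimal sketch of the standard convolution output-size rule (illustrative
# helper, not part of this API). For each spatial dimension:
#   out = (in + pad_begin + pad_end - dilation * (window - 1) - 1) // stride + 1
fn conv_out_dim(in_dim: Int, window: Int, stride: Int, dilation: Int, pad_begin: Int, pad_end: Int) -> Int:
    # Dilation inflates the effective filter window.
    var effective_window = dilation * (window - 1) + 1
    return (in_dim + pad_begin + pad_end - effective_window) // stride + 1

fn main():
    # A 224-wide input, 3-wide window, stride 2, dilation 1, padding 1 + 1.
    print(conv_out_dim(224, 3, 2, 1, 1, 1))  # prints 112
```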
--- ## get_cudnn_dtype `get_cudnn_dtype[dtype: DType]() -> cudnnDataType_t` Map Mojo DType to cuDNN data type. Supports only floating-point dtypes for now. --- ## conv ## Structs * [​`ConvDirectNHWC`](./ConvDirectNHWC): Implement the outer loops for direct convolution. Collapse N, HO, WO into one dimension n\_ho\_wo. Tile n\_ho\_wo, C, and F. The tile factors for C and F are chosen by a heuristic prioritizing C. n\_ho\_wo is tiled by the micro kernel's height. * [​`CuDNNConvMeta`](./CuDNNConvMeta): * [​`Naive2dConvolution`](./Naive2dConvolution): Struct wrapper for naive 2d convolution implementation. ## Functions * [​`accumulate_wo_tile_1d`](./accumulate_wo_tile_1d): Update one row in the output for a given (c, f) tile. * [​`accumulate_wo_tile_2d`](./accumulate_wo_tile_2d): * [​`accumulate_wo_tile_3d`](./accumulate_wo_tile_3d): * [​`check_cudnn_error`](./check_cudnn_error): * [​`conv1d_update_wo_tile`](./conv1d_update_wo_tile): * [​`conv2d_gpu_naive_nhwc_rscf`](./conv2d_gpu_naive_nhwc_rscf): * [​`conv2d_update_wo_tile`](./conv2d_update_wo_tile): * [​`conv3d_gpu_naive_ndhwc_qrscf`](./conv3d_gpu_naive_ndhwc_qrscf): * [​`conv3d_update_wo_tile`](./conv3d_update_wo_tile): * [​`conv_cudnn`](./conv_cudnn): * [​`conv_gpu`](./conv_gpu): * [​`conv_nhwc_direct`](./conv_nhwc_direct): * [​`conv_shape`](./conv_shape): Compute the output shape of a `conv` operation, and assert the inputs are compatible. * [​`get_cudnn_dtype`](./get_cudnn_dtype): Map Mojo DType to cuDNN data type. * [​`pack_conv_filter_shape`](./pack_conv_filter_shape): Compute the output shape of convolution filter packing. * [​`pack_filter`](./pack_filter): This packs the filter from RSCF to FRSCf. Use the default micro kernel size for dynamic shapes. * [​`pack_filter_shape`](./pack_filter_shape): Compute the shape of packed filter. The packed layout is FRSCf. shape\_ref should be allocated with size 5 outside this kernel. * [​`pack_filter_shape_impl`](./pack_filter_shape_impl): Compute the shape of packed filter. The packed layout is FRSCf. shape\_ref should be allocated with size 5 outside this kernel. --- ## pack_conv_filter_shape `pack_conv_filter_shape[single_thread_blocking_override: Bool](filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], num_groups: Int) -> IndexList[(rank + 1)]` Compute the output shape of convolution filter packing. **Parameters:** * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​filter (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The filter to be packed. * ​num\_groups (`Int`): The number of groups in the convolution. **Returns:** The output shape. --- ## pack_filter `pack_filter(filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], packed_filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], num_groups: Int)` This packs the filter from RSCF to FRSCf. Use the default micro kernel size for dynamic shapes. `pack_filter[simd_size: Int, micro_kernel_f_size: Int](filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], packed_filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], num_groups: Int)` This packs the filter from RSCF to FRSCf. F is first broken down to segments of size micro\_kernel\_f\_size, then the remainder is further divided by simd\_size. The last residual elements, if any, are padded with zeros to fill simd\_size. **Parameters:** * ​simd\_size (`Int`): Can differ from the simd size of the input type. * ​micro\_kernel\_f\_size (`Int`): The size of the last dimension in FRSCf, which equals the size of the micro kernel's F dimension. **Args:** * ​filter (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): Filter in RSCF layout (if 2D). 
* ​packed\_filter (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): Packed filter in FRSCf layout (if 2D). F - the index of continuous segments in micro kernel. R, S, C - original R, S, C. f - the index within a continuous segment. * ​num\_groups (`Int`): The number of groups in the convolution.
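The segmentation described above fixes the packed F extent. A minimal sketch of that arithmetic (an illustrative helper, not the library routine):

```mojo
# Illustrative sketch of how F is segmented for the FRSCf packed layout:
# whole micro-kernel-F segments first, then simd_size chunks over the
# remainder, zero-padding the final chunk.
fn packed_f_extent(F: Int, micro_kernel_f_size: Int, simd_size: Int) -> Int:
    var num_full = F // micro_kernel_f_size
    var packed = num_full * micro_kernel_f_size
    var residual = F - packed
    # Round the residual up to a whole number of simd_size chunks.
    var num_simd_chunks = (residual + simd_size - 1) // simd_size
    return packed + num_simd_chunks * simd_size

fn main():
    # F = 70 with micro kernel F = 16 and simd_size = 4: 4 full segments (64)
    # plus ceil(6 / 4) = 2 simd chunks (8), so the packed extent is 72.
    print(packed_f_extent(70, 16, 4))  # prints 72
```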
--- ## pack_filter_shape `pack_filter_shape[filter_type: DType, input_shape: DimList, filter_shape: DimList, output_shape: DimList, strides: DimList, dilations: DimList, paddings: DimList, num_groups: Int, single_thread_blocking_override: Bool](filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> IndexList[(rank + 1)]` Compute the shape of packed filter. The packed layout is FRSCf. shape\_ref should be allocated with size 5 outside this kernel. **Returns:** The output shape. --- ## pack_filter_shape_impl `pack_filter_shape_impl[filter_type: DType](Q: Int, R: Int, S: Int, C: Int, F: Int, num_groups: Int) -> IndexList[6]` Compute the shape of packed filter. The packed layout is FRSCf. shape\_ref should be allocated with size 5 outside this kernel. **Args:** * ​Q (`Int`): Original Q filter dimension. * ​R (`Int`): Original R filter dimension. * ​S (`Int`): Original S filter dimension. * ​C (`Int`): Original C filter dimension. * ​F (`Int`): Original F filter dimension. * ​num\_groups (`Int`): Number of groups in the convolution. **Returns:** The output shape. --- ## ConvTransposedPacked `struct ConvTransposedPacked[input_mut: Bool, filter_mut: Bool, //, input_rank: Int, filter_rank: Int, output_rank: Int, input_origin: Origin[input_mut], filter_origin: Origin[filter_mut], output_origin: MutableOrigin, input_shape: DimList, filter_shape: DimList, output_shape: DimList, input_type: DType, filter_type: DType, output_type: DType, conv_attr: ConvInfoStatic[(input_rank + -2)], elementwise_epilogue: OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None] = OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None]({:i1 0, 1})]` ## Fields * ​output (`NDBuffer[output_type, output_rank, output_origin, output_shape]`): * ​input (`NDBuffer[input_type, input_rank, input_origin, input_shape]`): * ​filter (`NDBuffer[filter_type, filter_rank, filter_origin, filter_shape]`): * ​conv\_shape (`ConvShape[(input_rank + -2)]`): * ​partition (`ConvPartition`): * ​cf\_tile\_size (`IndexList[2]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `run` `static run(output: NDBuffer[output_type, output_rank, output_origin, output_shape], input: NDBuffer[input_type, input_rank, input_origin, input_shape], filter: NDBuffer[filter_type, filter_rank, filter_origin, filter_shape], conv_shape: ConvShape[(input_rank + -2)])` ### `input_space_loop` `input_space_loop[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool](self, n: Int, f_tile_offset: Int, f_tile_size: Int, c_tile_offset: Int, c_tile_size: Int)` ### `input_space_loop_2d` `input_space_loop_2d[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](self, output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], n: Int, first_c_tile_in_group: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, left_pad_impact_end: Int, right_pad_impact_start: Int)` ### `input_space_loop_3d` `input_space_loop_3d[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](self, output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], n: Int, first_c_tile_in_group: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, left_pad_impact_end: Int, right_pad_impact_start: Int)` ### `apply_epilogue` `apply_epilogue(self, n: Int, g: Int)` --- ## accumulate_wo_tile `accumulate_wo_tile[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, partial_load: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](c_tile_size: Int, output: UnsafePointer[SIMD[output_dt, 1]], output_stride: Int, input: UnsafePointer[SIMD[input_dt, 1]], input_stride: Int, filter: UnsafePointer[SIMD[filter_dt, 1]], filter_stride: Int, partial_load_size: Int)` --- ## conv_transpose_naive `conv_transpose_naive[type: DType](output: NDBuffer[type, 5, MutableAnyOrigin], input: NDBuffer[type, 5, MutableAnyOrigin], filter: NDBuffer[type, 5, MutableAnyOrigin], stride: IndexList[3], dilation: IndexList[3], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2])` Implements the ConvTranspose operator from the MO spec. **Parameters:** * ​type (`DType`): Type of the input, output, and kernel tensors. **Args:** * ​output (`NDBuffer[type, 5, MutableAnyOrigin]`): Output data tensor that contains the result of the convolution. * ​input (`NDBuffer[type, 5, MutableAnyOrigin]`): Input data tensor from previous layer, with size of (N x H x W x C), where N is the batch size, C is the number of channels, and H and W are the height and width. * ​filter (`NDBuffer[type, 5, MutableAnyOrigin]`): The weight (kernel) tensor, with size of (kH x kW x M/groups x C), where C is the number of channels, kH and kW are the height and width of the kernel, and M is the number of feature maps. * ​stride (`IndexList[3]`): Stride along each spatial axis. * ​dilation (`IndexList[3]`): Dilation value along each spatial axis of the filter. * ​pad\_d (`IndexList[2]`): Padding in depth dimension. * ​pad\_h (`IndexList[2]`): Padding in height dimension. * ​pad\_w (`IndexList[2]`): Padding in width dimension. --- ## conv_transpose_shape `conv_transpose_shape[input_rank: Int, kernel_rank: Int, type: DType, strides_type: DType, dilations_type: DType, pads_type: DType, output_pads_type: DType, single_thread_blocking_override: Bool](input: NDBuffer[type, input_rank, origin], kernel: NDBuffer[type, kernel_rank, origin], strides: NDBuffer[strides_type, 1, origin], dilations: NDBuffer[dilations_type, 1, origin], pads: NDBuffer[pads_type, 1, origin], output_pads: NDBuffer[output_pads_type, 1, origin]) -> IndexList[input_rank]` Compute the output shape of a `conv-transpose` operation, and assert the inputs are compatible. **Parameters:** * ​input\_rank (`Int`): Rank of the input tensor. * ​kernel\_rank (`Int`): Rank of the kernel tensor. * ​type (`DType`): Element type of the input and kernel tensor. * ​strides\_type (`DType`): Element type of the strides tensor. * ​dilations\_type (`DType`): Element type of the dilations tensor. * ​pads\_type (`DType`): Element type of the pads tensor. * ​output\_pads\_type (`DType`): Element type of the output\_pads tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input (`NDBuffer[type, input_rank, origin]`): The input tensor. 
* ​kernel (`NDBuffer[type, kernel_rank, origin]`): The kernel tensor. * ​strides (`NDBuffer[strides_type, 1, origin]`): The strides tensor. * ​dilations (`NDBuffer[dilations_type, 1, origin]`): The dilations tensor. * ​pads (`NDBuffer[pads_type, 1, origin]`): The paddings tensor. * ​output\_pads (`NDBuffer[output_pads_type, 1, origin]`): The output paddings tensor. **Returns:** The output shape.
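The per-dimension size follows the standard transposed-convolution shape rule (as in the ONNX ConvTranspose spec). A minimal sketch of that rule (an illustrative helper, not the library routine):

```mojo
# Minimal sketch of the standard transposed-convolution output-size rule
# (illustrative helper, not part of this API). For each spatial dimension:
#   out = stride * (in - 1) + dilation * (window - 1) + 1 + output_pad
#         - pad_begin - pad_end
fn conv_transpose_out_dim(in_dim: Int, window: Int, stride: Int, dilation: Int, pad_begin: Int, pad_end: Int, output_pad: Int) -> Int:
    return stride * (in_dim - 1) + dilation * (window - 1) + 1 + output_pad - pad_begin - pad_end

fn main():
    # Upsampling a 56-wide input with a 4-wide window, stride 2, padding 1 + 1.
    print(conv_transpose_out_dim(56, 4, 2, 1, 1, 1, 0))  # prints 112
```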
--- ## conv_transposed_cpu `conv_transposed_cpu[input_rank: Int, filter_rank: Int, input_shape: DimList, filter_shape: DimList, output_shape: DimList, input_type: DType, filter_type: DType, output_type: DType, filter_packed: Bool, filter_is_cfrs: Bool, lambdas_have_fusion: Bool, elementwise_lambda: fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None](output: NDBuffer[output_type, input_rank, origin, output_shape], input: NDBuffer[input_type, input_rank, origin, input_shape], filter: NDBuffer[filter_type, filter_rank, origin, filter_shape], stride: IndexList[(input_rank + -2)], dilation: IndexList[(input_rank + -2)], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2])` --- ## conv_transposed_cudnn `conv_transposed_cudnn[input_type: DType, filter_type: DType, output_type: DType](input: NDBuffer[input_type, 4, MutableAnyOrigin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], filter: NDBuffer[filter_type, 4, MutableAnyOrigin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], output: NDBuffer[output_type, 4, MutableAnyOrigin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], stride: IndexList[2], dilation: IndexList[2], padding: IndexList[2], ctx: DeviceContext)` --- ## conv_transposed_gpu `conv_transposed_gpu[input_rank: Int, filter_rank: Int, input_shape: DimList, filter_shape: DimList, output_shape: DimList, input_type: DType, filter_type: DType, output_type: DType, elementwise_epilogue: OptionalReg[fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]({:i1 0, 1})](output: NDBuffer[output_type, input_rank, origin, output_shape], input: NDBuffer[input_type, input_rank, origin, input_shape], filter: NDBuffer[filter_type, filter_rank, origin, filter_shape], stride: IndexList[(input_rank + -2)], dilation: IndexList[(input_rank + -2)], padding: IndexList[(input_rank + -2)], ctx: DeviceContext)` --- ## get_num_partitions `get_num_partitions[micro_kernel_height: Int, micro_kernel_f_size: Int](num_threads: Int, conv_shape: ConvShape[rank]) -> IndexList[4]` Partition the workload in (batch\&group, C, F, H) dimensions. HOWO is the combination of HO and WO dimensions. The actual number of tasks is the product of the returned num\_partitions. --- ## get_partition `get_partition(task_id: Int, num_partitions: IndexList[4], conv_shape: ConvShape[rank], micro_kernel_height: Int, micro_kernel_f_size: Int) -> ConvPartition` --- ## conv_transpose ## Structs * [​`ConvTransposedPacked`](./ConvTransposedPacked): ## Functions * [​`accumulate_wo_tile`](./accumulate_wo_tile): * [​`conv_transpose_naive`](./conv_transpose_naive): Implements the ConvTranspose operator from the MO spec. * [​`conv_transpose_shape`](./conv_transpose_shape): Compute the output shape of a `conv-transpose` operation, and assert the inputs are compatible. * [​`conv_transposed_cpu`](./conv_transposed_cpu): * [​`conv_transposed_cudnn`](./conv_transposed_cudnn): * [​`conv_transposed_gpu`](./conv_transposed_gpu): * [​`get_num_partitions`](./get_num_partitions): Partition the workload in (batch\&group, C, F, H) dimensions. HOWO is the combination of HO and WO dimensions. The actual number of tasks is the product of the returned num\_partitions. * [​`get_partition`](./get_partition): * [​`pack_filter`](./pack_filter): This packs the filter from RSFC to FRSCf. * [​`pack_filter_shape`](./pack_filter_shape): Compute the output shape of transposed convolution filter packing. * [​`update_w_tile_2d`](./update_w_tile_2d): * [​`update_w_tile_3d`](./update_w_tile_3d): --- ## pack_filter `pack_filter(filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], packed_filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], num_groups: Int)` This packs the filter from RSFC to FRSCf. --- ## pack_filter_shape `pack_filter_shape(filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], num_groups: Int) -> IndexList[(rank + 1)]` Compute the output shape of transposed convolution filter packing. **Args:** * ​filter (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The filter to be packed. * ​num\_groups (`Int`): The number of groups in the convolution. **Returns:** The output shape. --- ## update_w_tile_2d `update_w_tile_2d[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, effected_by_padding: Bool, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], _init_output: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, conv_shape: ConvShape[2], n: Int, hw: IndexList[2])` --- ## update_w_tile_3d `update_w_tile_3d[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, effected_by_padding: Bool, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], _init_output: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, conv_shape: ConvShape[3], n: Int, hw: IndexList[3])` --- ## ConvAlgorithm `@register_passable(trivial)` `struct ConvAlgorithm` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `Default` `alias Default = ConvAlgorithm(0)` ### `Direct` `alias Direct = ConvAlgorithm(2)` ### `Im2Col` `alias Im2Col = ConvAlgorithm(1)` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` --- ## ConvInfoStatic `struct ConvInfoStatic[rank: Int]` ## Fields * ​pad (`DimList`): * ​stride (`DimList`): * ​dilation (`DimList`): * ​num\_groups (`Dim`): ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` `__init__(out self, pad: DimList, stride: DimList, dilation: DimList, num_groups: Dim)` `__init__(out self, pad: DimList, stride: DimList, dilation: DimList, input_c: Dim, filter_c: Dim)` ### `all_known` `all_known(self) -> Bool` ### `pad_left` `pad_left(self) -> Int` 
### `pad_bottom` `pad_bottom(self) -> Int` ### `strides` `strides(self) -> IndexList[2]` ### `dilations` `dilations(self) -> IndexList[2]` --- ## ConvPartition `@register_passable(trivial)` `struct ConvPartition` Work range for a partition. ## Fields * ​ng\_offset (`Int`): * ​ng\_size (`Int`): * ​f\_offset (`Int`): * ​f\_size (`Int`): * ​ho\_or\_howo\_offset (`Int`): * ​ho\_or\_howo\_size (`Int`): * ​c\_offset (`Int`): * ​c\_size (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `empty` `empty(self) -> Bool` --- ## ConvShape `@register_passable(trivial)` `struct ConvShape[rank: Int]` A shape struct describing the convolution dimensions. ## Fields * ​n (`Int`): * ​input\_dims (`IndexList[rank]`): * ​output\_dims (`IndexList[rank]`): * ​filter\_dims (`IndexList[rank]`): * ​c (`Int`): * ​f (`Int`): * ​stride (`IndexList[rank]`): * ​dilation (`IndexList[rank]`): * ​pad\_d (`IndexList[2]`): * ​pad\_h (`IndexList[2]`): * ​pad\_w (`IndexList[2]`): * ​num\_groups (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `d` `d(self) -> Int` Input depth. ### `h` `h(self) -> Int` Input height. ### `w` `w(self) -> Int` Input width. ### `do` `do(self) -> Int` Output depth. ### `ho` `ho(self) -> Int` Output height. ### `wo` `wo(self) -> Int` Output width. ### `q` `q(self) -> Int` Filter window depth. ### `r` `r(self) -> Int` Filter window height. ### `s` `s(self) -> Int` Filter window width. ### `filter_window_flat_size` `filter_window_flat_size(self) -> Int` ### `input_image_flat_size` `input_image_flat_size(self) -> Int` ### `output_image_flat_size` `output_image_flat_size(self) -> Int` ### `output_space_dims` `output_space_dims(self) -> IndexList[rank]` ### `output_flat_coord_to_input_offset` `output_flat_coord_to_input_offset(self, n: Int, output_flat_coord: Int) -> Int` ### `matmul_M` `matmul_M(self) -> Int` ### `matmul_N` `matmul_N(self) -> Int` ### `matmul_K` `matmul_K(self) -> Int` ### `padded` `padded(self) -> Bool` ### `c_per_group` `c_per_group(self) -> Int` Returns the number of channels per group. Channel count must be divisible by group size. ### `f_per_group` `f_per_group(self) -> Int` Returns the number of filters per group. Filter count must be divisible by group size. ### `f_to_group` `f_to_group(self, f_idx: Int) -> Int` Given a global filter idx, returns the group idx of the group the filter belongs to. ### `c_to_group` `c_to_group(self, c_idx: Int) -> Int` Given a global channel idx, returns the group idx of the group the channel belongs to. ### `f_in_group` `f_in_group(self, f_idx: Int) -> Int` Given a global filter idx, returns the offset of the filter in its group. ### `c_in_group` `c_in_group(self, c_idx: Int) -> Int` Given a global channel idx, returns the offset of the channel in its group.
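The group helpers above reduce to simple integer arithmetic. A minimal sketch, assuming the usual convention that the filter count F divides evenly into num\_groups (illustrative helpers, not the library implementation):

```mojo
# Illustrative grouped-convolution index arithmetic, assuming the filter
# count F divides evenly into num_groups (names are illustrative only).
fn f_per_group(F: Int, num_groups: Int) -> Int:
    return F // num_groups

fn f_to_group(f_idx: Int, F: Int, num_groups: Int) -> Int:
    # Group that global filter index f_idx belongs to.
    return f_idx // f_per_group(F, num_groups)

fn f_in_group(f_idx: Int, F: Int, num_groups: Int) -> Int:
    # Offset of global filter index f_idx within its group.
    return f_idx % f_per_group(F, num_groups)

fn main():
    # 64 filters in 4 groups of 16: filter 37 is filter 5 of group 2.
    print(f_to_group(37, 64, 4), f_in_group(37, 64, 4))  # prints 2 5
```

The channel-side helpers (`c_per_group`, `c_to_group`, `c_in_group`) follow the same pattern with C in place of F.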
--- ## align_down_residual `align_down_residual(value: Int, alignment: Int) -> Int` Returns the remainder after aligning down value to alignment. **Args:** * ​value (`Int`): The value to align. * ​alignment (`Int`): Value to align to. **Returns:** The remainder after aligning down value to the closest multiple of alignment. In other words, value - align\_down(value, alignment). --- ## append_shape `append_shape[rank: Int](in_shape: IndexList[rank], last2nd: Int, last: Int) -> IndexList[(rank + 2)]` Append input shape by inserting `last2nd` and `last` at the end. --- ## extend_shape `extend_shape[rank: Int](in_shape: IndexList[rank], first: Int, last: Int) -> IndexList[(rank + 2)]` Extend input shape by inserting `first` and `last` at both ends. --- ## get_conv2d_shape `get_conv2d_shape[output_shape: DimList, input_shape: DimList, filter_shape: DimList, type: DType, data_layout: Image2DLayout, filter_layout: Image2DLayout](output: NDBuffer[type, 4, origin, output_shape], input: NDBuffer[type, 4, origin, input_shape], filter: NDBuffer[type, 4, origin, filter_shape], pad_h: IndexList[2], pad_w: IndexList[2], stride: IndexList[2], dilation: IndexList[2], num_groups: Int) -> ConvShape[2]` `get_conv2d_shape[filter_rank: Int, output_shape: DimList, input_shape: DimList, filter_shape: DimList, type: DType, data_layout: Image2DLayout, filter_layout: Image2DLayout](output: NDBuffer[type, 4, origin, output_shape], input: NDBuffer[type, 4, origin, input_shape], filter: NDBuffer[type, filter_rank, origin, filter_shape], pad_h: IndexList[2], pad_w: IndexList[2], stride: IndexList[2], dilation: IndexList[2], num_groups: Int) -> ConvShape[2]` --- ## get_conv_num_partitions `get_conv_num_partitions[micro_kernel_w: Int, micro_kernel_f: Int](num_threads: Int, conv_shape: ConvShape[rank]) -> IndexList[4]` Partition the workload in (batch, C, F, HOWO) dimensions. HOWO is the combination of HO and WO dimensions. The actual number of tasks is the product of the returned num\_partitions. --- ## get_conv_num_tasks `get_conv_num_tasks(num_threads: Int, conv_shape: ConvShape[rank]) -> Int` --- ## get_conv_shape `get_conv_shape[rank: Int, filter_packed: Bool](output: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], input: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], stride: IndexList[rank], dilation: IndexList[rank], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2], num_groups: Int) -> ConvShape[rank]` --- ## get_conv_tile_shape `get_conv_tile_shape[type: DType](c: Int, filter_window_size: Int, micro_kernel_width: Int) -> IndexList[2]` Compute the (c, f) tile shape in L2. Assuming NHWC layout, the tile shape is (R, S, c\_tile, f\_tile). R and S are by default fully covered. The heuristic tries to block in C as much as possible; if C is small, it starts to block F. --- ## get_conv_tile_size `get_conv_tile_size[type: DType]() -> Int` --- ## get_direct_conv_micro_kernel_height `get_direct_conv_micro_kernel_height() -> Int` --- ## get_direct_conv_micro_kernel_width `get_direct_conv_micro_kernel_width() -> Int` --- ## get_micro_kernel_shape `get_micro_kernel_shape[rank: Int, WO: Dim, F: Dim, conv_attr: ConvInfoStatic[rank], simd_size: Int]() -> IndexList[2]` --- ## get_partition `get_partition(task_id: Int, num_partitions: IndexList[4], conv_shape: ConvShape[rank], micro_kernel_height: Int, micro_kernel_f_size: Int) -> ConvPartition` --- ## conv_utils ## Aliases ### `elementwise_epilogue_type` `alias elementwise_epilogue_type = fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None` ### `elementwise_simd_epilogue_type` `alias elementwise_simd_epilogue_type = fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None` ## Structs * [​`ConvAlgorithm`](./ConvAlgorithm): * [​`ConvInfoStatic`](./ConvInfoStatic): * [​`ConvPartition`](./ConvPartition): Work range for a partition. 
* [​`ConvShape`](./ConvShape): A shape struct describing the convolution dimensions. ## Functions * [​`align_down_residual`](./align_down_residual): Returns the remainder after aligning down value to alignment. * [​`append_shape`](./append_shape): Append input shape by inserting `last2nd` and `last` at the end. * [​`extend_shape`](./extend_shape): Extend input shape by inserting `first` and `last` at both ends. * [​`get_conv2d_shape`](./get_conv2d_shape): * [​`get_conv_num_partitions`](./get_conv_num_partitions): Partition the workload in (batch, C, F, HOWO) dimensions. HOWO is the combination of HO and WO dimensions. The actual number of tasks is the product of the returned num\_partitions. * [​`get_conv_num_tasks`](./get_conv_num_tasks): * [​`get_conv_shape`](./get_conv_shape): * [​`get_conv_tile_shape`](./get_conv_tile_shape): Compute the (c, f) tile shape in L2. Assuming NHWC layout, the tile shape is (R, S, c\_tile, f\_tile). R and S are by default fully covered. The heuristic tries to block in C as much as possible; if C is small, it starts to block F. * [​`get_direct_conv_micro_kernel_height`](./get_direct_conv_micro_kernel_height): * [​`get_direct_conv_micro_kernel_width`](./get_direct_conv_micro_kernel_width): * [​`get_micro_kernel_shape`](./get_micro_kernel_shape): * [​`get_partition`](./get_partition): * [​`reorder_padding`](./reorder_padding): --- ## reorder_padding `reorder_padding[rank: Int](pad: DimList) -> DimList` --- ## cumsum `cumsum[rank: Int, type: DType, exclusive: Bool, reverse: Bool](output: NDBuffer[type, rank, origin], input: NDBuffer[type, rank, origin], axis: Int)` Implements the CumSum operator from the ONNX spec: Computes cumulative sum of the input elements along the given axis. Cumulative sum can be inclusive or exclusive of the top element, and normal or reverse (direction along a given axis). **Parameters:** * ​rank (`Int`): Rank of the input and output tensors. * ​type (`DType`): Type of the input and output tensors. * ​exclusive (`Bool`): If set to True, return exclusive sum (top element not included). * ​reverse (`Bool`): If set to True, perform cumsum operation in reverse direction. **Args:** * ​output (`NDBuffer[type, rank, origin]`): The output tensor. * ​input (`NDBuffer[type, rank, origin]`): The input tensor. * ​axis (`Int`): The axis on which to perform the cumsum operation. --- ## cumsum ## Functions * [​`cumsum`](./cumsum): Implements the CumSum operator from the ONNX spec: Computes cumulative sum of the input elements along the given axis. Cumulative sum can be inclusive or exclusive of the top element, and normal or reverse (direction along a given axis).
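To make the four variants concrete, the following is a minimal 1-D sketch of the semantics (illustrative only; the kernel applies this along one axis of an `NDBuffer`):

```mojo
# Illustrative 1-D cumsum semantics: inclusive vs. exclusive, forward vs.
# reverse. The real kernel applies this along one axis of an NDBuffer.
fn cumsum_1d(values: List[Int], exclusive: Bool, reverse: Bool) -> List[Int]:
    var n = len(values)
    var out = List[Int](capacity=n)
    for _ in range(n):
        out.append(0)
    var total = 0
    for step in range(n):
        # Walk backward when reverse is set.
        var i = n - 1 - step if reverse else step
        if exclusive:
            out[i] = total  # top element excluded
            total += values[i]
        else:
            total += values[i]
            out[i] = total  # top element included
    return out^

fn main():
    var x = List[Int](1, 2, 3, 4)
    var inclusive = cumsum_1d(x, False, False)  # [1, 3, 6, 10]
    var exclusive = cumsum_1d(x, True, False)   # [0, 1, 3, 6]
    for i in range(len(inclusive)):
        print(inclusive[i], exclusive[i])
```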
--- ## flash_attention `flash_attention[type: DType, rank: Int, mask_rank: Int, //, input_k_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_v_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_mask_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](q: NDBuffer[type, rank, origin, shape, strides], k_shape: IndexList[rank], v_shape: IndexList[rank], mask_shape: IndexList[mask_rank], output: NDBuffer[type, rank, origin, shape, strides], scale: SIMD[float32, 1])` --- ## flash_attention_kv_cache `flash_attention_kv_cache[type: DType, cache_t: KVCacheT, //](q: NDBuffer[type, 4, origin, shape, strides], k: cache_t, v: cache_t, mask: NDBuffer[type, rank, origin, shape, strides], scale: SIMD[float32, 1], output: NDBuffer[type, 4, origin, shape, strides])` `flash_attention_kv_cache[type: DType, cache_t: KVCacheT, mask_t: MHAMask, //](q: NDBuffer[type, 4, origin, shape, strides], k: cache_t, v: cache_t, mask: mask_t, scale: SIMD[float32, 1], output: NDBuffer[type, 4, origin, shape, strides])` `flash_attention_kv_cache[type: DType, cache_t: KVCacheT, mask_t: MHAMask, //](q: NDBuffer[type, 3, origin, shape, strides], q_input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], k: cache_t, v: cache_t, mask: mask_t, scale: SIMD[float32, 1], output: NDBuffer[type, 3, origin, shape, strides])` Entrypoint for ragged tensors. --- ## flash_attention_split_kv `flash_attention_split_kv[type: DType, rank: Int, mask_rank: Int, //, input_k_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_v_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_k_cache_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_v_cache_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_mask_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](q: NDBuffer[type, rank, origin, shape, strides], k_shape: IndexList[rank], v_shape: IndexList[rank], k_cache_shape: IndexList[(rank + 1)], v_cache_shape: IndexList[(rank + 1)], mask_shape: IndexList[mask_rank], output: NDBuffer[type, rank, origin, shape, strides], scale: SIMD[float32, 1])` Variant of flash attention that takes the previous KV cache `input_{k,v}_cache_fn` and the current KV tensors `input_k_fn` and `input_v_fn` as separate arguments. This works around the fact that fusion can't currently look through concat. So this kernel does an in-place concat fusion by changing the input lambdas `input_{k,v}_cache_fn_wrapper` to take previous sequence KV elements from the KV cache, and current KV elements from tensors `k` and `v`. --- ## flash_attention ## Functions * [​`flash_attention`](./flash_attention): * [​`flash_attention_kv_cache`](./flash_attention_kv_cache): * [​`flash_attention_split_kv`](./flash_attention_split_kv): Variant of flash attention that takes the previous KV cache `input_{k,v}_cache_fn` and the current KV tensors `input_k_fn` and `input_v_fn` as separate arguments.
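These entry points compute scaled-dot-product attention, `softmax(scale * Q @ K^T + mask) @ V`, with the flash-attention algorithm tiling the work so the full score matrix is never materialized. For orientation only, here is a minimal, non-tiled sketch of that math for a single query vector (illustrative helper names; not how the kernels are implemented):

```mojo
from math import exp

# Reference semantics for one query row: softmax over scale * q.k_i + mask_i,
# then a weighted sum of the value vectors. Illustrative only; the real
# kernels tile this computation and never materialize the full score vector.
fn attend_one_query(q: List[Float64], keys: List[List[Float64]], vals: List[List[Float64]], mask: List[Float64], scale: Float64) -> List[Float64]:
    var n = len(keys)
    var scores = List[Float64](capacity=n)
    var max_score: Float64 = -1e30
    for i in range(n):
        var dot = 0.0
        for d in range(len(q)):
            dot += q[d] * keys[i][d]
        var s = scale * dot + mask[i]
        scores.append(s)
        if s > max_score:
            max_score = s
    # Numerically stable softmax: subtract the max before exponentiating.
    var denom = 0.0
    for i in range(n):
        scores[i] = exp(scores[i] - max_score)
        denom += scores[i]
    var out = List[Float64]()
    for d in range(len(vals[0])):
        var acc = 0.0
        for i in range(n):
            acc += (scores[i] / denom) * vals[i][d]
        out.append(acc)
    return out^

fn main():
    var q = List[Float64](1.0, 0.0)
    var keys = List[List[Float64]]()
    keys.append(List[Float64](1.0, 0.0))
    keys.append(List[Float64](0.0, 1.0))
    var vals = List[List[Float64]]()
    vals.append(List[Float64](10.0, 0.0))
    vals.append(List[Float64](0.0, 10.0))
    var mask = List[Float64](0.0, 0.0)
    var out = attend_one_query(q, keys, vals, mask, 1.0)
    print(out[0], out[1])  # roughly 7.31 and 2.69
```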
--- ## fold `fold[dtype: DType, input_dim: DimList, output_dim: DimList, //, stride: Tuple[Int, Int], dilation: Tuple[Int, Int], padding: Tuple[Int, Int], target: StringSlice[StaticConstantOrigin]](input: NDBuffer[dtype, 3, MutableAnyOrigin, input_dim], output: NDBuffer[dtype, 4, MutableAnyOrigin, output_dim], output_size: IndexList[2], kernel_size: IndexList[2], ctx: DeviceContextPtr)` Folds an array of sliding local blocks into a single output tensor. **Parameters:** * ​dtype (`DType`): The data type for the input and output. * ​input\_dim (`DimList`): The static shape of the input NDBuffer. * ​output\_dim (`DimList`): The static shape of the output NDBuffer. * ​stride (`Tuple[Int, Int]`): Stride of the sliding blocks. * ​dilation (`Tuple[Int, Int]`): Dilation of the sliding blocks. * ​padding (`Tuple[Int, Int]`): Zero-padding to be added on both sides of the inputs. * ​target (`StringSlice[StaticConstantOrigin]`): The target architecture to compile for. **Args:** * ​input (`NDBuffer[dtype, 3, MutableAnyOrigin, input_dim]`): Input tensor to fold, shape \[N, C x kernel size, num\_blocks]. * ​output (`NDBuffer[dtype, 4, MutableAnyOrigin, output_dim]`): Output tensor to write to, shape \[N, C, H, W]. * ​output\_size (`IndexList[2]`): Spatial shape of the output tensor (H, W). * ​kernel\_size (`IndexList[2]`): Size of the sliding blocks. * ​ctx (`DeviceContextPtr`): DeviceContextPtr. --- ## fold_shape `fold_shape[dtype: DType, input_dim: DimList](input: NDBuffer[dtype, 3, MutableAnyOrigin, input_dim], output_size: IndexList[2], kernel_size: IndexList[2]) -> IndexList[4]` Returns the shape of the output tensor of the fold operation. --- ## fold Implements the fold operation. ## Functions * [​`fold`](./fold): Folds an array of sliding local blocks into a single output tensor. * [​`fold_shape`](./fold_shape): Returns the shape of the output tensor of the fold operation.
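The `[N, C x kernel size, num_blocks]` input layout implies a fixed relationship between `output_size`, `kernel_size`, `stride`, `dilation`, `padding`, and `num_blocks`. A minimal sketch of that relationship, following the usual unfold/fold convention (illustrative helper, not the library routine):

```mojo
# Illustrative: number of sliding-block positions along one spatial dimension,
# following the usual unfold/fold convention. The input's num_blocks dimension
# must equal the product of these counts over H and W.
fn num_blocks_1d(out_size: Int, kernel: Int, stride: Int, dilation: Int, padding: Int) -> Int:
    return (out_size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

fn main():
    # An 8x8 output folded from 2x2 blocks at stride 2 with no padding:
    # 4 positions per axis, so num_blocks must be 16.
    var nh = num_blocks_1d(8, 2, 2, 1, 0)
    var nw = num_blocks_1d(8, 2, 2, 1, 0)
    print(nh * nw)  # prints 16
```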
--- ## fused_qk_rope `fused_qk_rope[type: DType, collection_t: KVCollectionT, //, cache_t: KVCacheT, *, interleaved: Bool, target: StringSlice[StaticConstantOrigin]](q_proj: NDBuffer[type, 4, origin, shape, strides], kv_collection: collection_t, freqs_cis: NDBuffer[type, 2, origin, shape, strides], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 4, origin, shape, strides], context: Optional[DeviceContext])` --- ## fused_qk_rope_ragged `fused_qk_rope_ragged[type: DType, collection_t: KVCollectionT, //, cache_t: KVCacheT, *, interleaved: Bool, target: StringSlice[StaticConstantOrigin]](q_proj: NDBuffer[type, 3, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: collection_t, freqs_cis: NDBuffer[type, 2, origin, shape, strides], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: Optional[DeviceContext])` Applies RoPE (Rotary Position Embedding) to query and key tensors. This function applies RoPE only to the last `rope_dim` elements of each head, leaving the first `unroped_dim` elements unchanged. This is required for DeepSeek models where only part of each head undergoes rotary transformation. --- ## get_identity_rope_coeff `get_identity_rope_coeff[width: Int, type: DType]() -> SIMD[type, width]` --- ## get_safetensors_idx `get_safetensors_idx(head_dim_idx: Int, head_size: Int) -> Tuple[Int, Int]` --- ## fused_qk_rope ## Functions * [​`fused_qk_rope`](./fused_qk_rope): * [​`fused_qk_rope_ragged`](./fused_qk_rope_ragged): Applies RoPE (Rotary Position Embedding) to query and key tensors. * [​`get_identity_rope_coeff`](./get_identity_rope_coeff): * [​`get_safetensors_idx`](./get_safetensors_idx): * [​`rope_k_cache`](./rope_k_cache): * [​`rope_q_proj`](./rope_q_proj): --- ## rope_k_cache `rope_k_cache[type: DType, cache_t: KVCacheT, width: Int, //, *, interleaved: Bool](k_cache: cache_t, b_idx: Int, h_idx: Int, s_idx: Int, d_idx: Int, freq_val: SIMD[type, width], head_size: Int)` --- ## rope_q_proj `rope_q_proj[type: DType, rank: Int, width: Int, //, *, interleaved: Bool](q_proj: NDBuffer[type, rank, origin, shape, strides], output: NDBuffer[type, rank, origin, shape, strides], idx: IndexList[rank], freq_val: SIMD[type, width], head_size: Int)`
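Both `rope_q_proj` and `rope_k_cache` apply the same underlying transformation: each pair of elements in a head is rotated by a position-dependent angle derived from `freqs_cis`. A minimal sketch of rotating one pair (illustrative only; the `interleaved` parameter controls whether pairs are adjacent elements or split across the two halves of the head):

```mojo
from math import cos, sin

# Illustrative RoPE rotation of one (x1, x2) pair by angle theta. In an
# interleaved layout the pair is (head[2*i], head[2*i + 1]); otherwise it is
# (head[i], head[i + head_size // 2]).
fn rope_rotate_pair(x1: Float64, x2: Float64, theta: Float64) -> Tuple[Float64, Float64]:
    var c = cos(theta)
    var s = sin(theta)
    return (x1 * c - x2 * s, x1 * s + x2 * c)

fn main():
    # Rotating (1, 0) by pi/2 gives approximately (0, 1).
    var r = rope_rotate_pair(1.0, 0.0, 1.5707963267948966)
    print(r[0], r[1])
```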
--- ## Axis `@register_passable(trivial)` `struct Axis` ## Fields * ​axis (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Indexer`, `Intable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(axis: Int) -> Self` `__init__(out self, axis: Int, rank: Int)` ### `__int__` `__int__(self) -> Int` ### `__index__` `__index__(self) -> index` Convert to index. **Returns:** The corresponding \_\_mlir\_type.index value. --- ## gather `gather[type: DType, indices_type: DType, //, *, axis: Int, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](output: NDBuffer[type, rank, origin, shape, strides], input: NDBuffer[type, rank, origin, shape, strides], indices: NDBuffer[indices_type, rank, origin, shape, strides], *, context: DeviceContext)` Gather operation as defined in the ONNX spec. Note that this is NOT the same as the default PyTorch gather (which is equivalent to ONNX GatherElements). `gather[type: DType, indices_type: DType, //, *, axis: Int, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](output: NDBuffer[type, rank, origin, shape, strides], input: NDBuffer[type, rank, origin, shape, strides], indices: NDBuffer[indices_type, rank, origin, shape, strides], *, context: DeviceContextPtr = DeviceContextPtr())` Gather operation as defined in the ONNX spec. Note that this is NOT the same as the default PyTorch gather (which is equivalent to ONNX GatherElements). `gather[*, type: DType, indices_type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], indices_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[indices_type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, prefetch_fn: OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None] = OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None]({:i1 0, 1}), target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), single_thread_blocking_override: Bool = False](axis: Axis, input_shape: IndexList[size, element_type=element_type], indices_shape: IndexList[size, element_type=element_type], output_shape: IndexList[size, element_type=element_type], *, context: DeviceContext)` Gather operation as defined in the ONNX spec. Note that this is NOT the same as the default PyTorch gather (which is equivalent to ONNX GatherElements). `gather[*, type: DType, indices_type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], indices_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[indices_type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, prefetch_fn: OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None] = OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None]({:i1 0, 1}), target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), single_thread_blocking_override: Bool = False](axis: Axis, input_shape: IndexList[size, element_type=element_type], indices_shape: IndexList[size, element_type=element_type], output_shape: IndexList[size, element_type=element_type], *, context: DeviceContextPtr = DeviceContextPtr())` Gather operation as defined in the ONNX spec. Note that this is NOT the same as the default PyTorch gather (which is equivalent to ONNX GatherElements). --- ## gather_elements `gather_elements[rank: Int, input_type: DType, indices_type: DType](input: NDBuffer[input_type, rank, origin], indices: NDBuffer[indices_type, rank, origin], _axis: Int, output: NDBuffer[input_type, rank, origin])` Implements ONNX GatherElements op which is equivalent to Pytorch gather. --- ## gather_elementwise_fn_wrapper `gather_elementwise_fn_wrapper[*, type: DType, indices_type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], indices_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[indices_type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, simd_width: Int, prefetch_fn: OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None] = OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None]({:i1 0, 1})](axis: Axis, input_shape: IndexList[size, element_type=element_type], indices_shape: IndexList[size, element_type=element_type], output_shape: IndexList[size, element_type=element_type], coords: IndexList[size, element_type=element_type])` --- ## gather_guards `gather_guards(axis: Axis, input_shape: IndexList[size, element_type=element_type], indices_shape: IndexList[size, element_type=element_type], output_shape: IndexList[size, element_type=element_type])` --- ## gather_nd `gather_nd[type: DType, indices_type: DType, data_rank: Int, indices_rank: Int, output_rank: Int, batch_dims: Int, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), single_thread_blocking_override: Bool = False](data: NDBuffer[type, data_rank, origin], indices: NDBuffer[indices_type, indices_rank, origin], output: NDBuffer[type, output_rank, origin], ctx: DeviceContextPtr)` GatherND operation as defined in the ONNX spec. **Parameters:** * ​type (`DType`): Type of data tensor. * ​indices\_type (`DType`): Type of indices tensor. * ​data\_rank (`Int`): Rank of data tensor (data\_rank >= 1). * ​indices\_rank (`Int`): Rank of indices tensor (indices\_rank >= 1). * ​output\_rank (`Int`): Rank of output tensor. * ​batch\_dims (`Int`): Number of batch dimensions. Gather indexing starts from the dimensions of data\[batch\_dims:]. * ​target (`StringSlice[StaticConstantOrigin]`): The target architecture to execute on. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​data (`NDBuffer[type, data_rank, origin]`): Tensor of rank data\_rank >= 1. * ​indices (`NDBuffer[indices_type, indices_rank, origin]`): Tensor of rank indices\_rank >= 1. 
All index values are expected to be within bounds \[-s, s-1] along axis of size s. It is an error if any of the index values are out of bounds. * ​output (`NDBuffer[type, output_rank, origin]`): Tensor of rank data\_rank + indices\_rank - indices\_shape\[-1] - 1 - batch\_dims. * ​ctx (`DeviceContextPtr`): The DeviceContextPtr as prepared by the graph compiler. --- ## gather_nd_shape `gather_nd_shape[input_rank: Int, indices_rank: Int, output_rank: Int, input_type: DType, indices_type: DType, batch_dims: Int, single_thread_blocking_override: Bool = True](input_buf: NDBuffer[input_type, input_rank, origin], indices_buf: NDBuffer[indices_type, indices_rank, origin]) -> IndexList[output_rank]` Compute the output shape of a `gather` operation, and assert the inputs are compatible. **Parameters:** * ​input\_rank (`Int`): Rank of the input tensor. * ​indices\_rank (`Int`): Rank of the indices tensor. * ​output\_rank (`Int`): Rank of the output tensor. * ​input\_type (`DType`): Type of the input tensor. * ​indices\_type (`DType`): Type of the indices tensor. * ​batch\_dims (`Int`): Batch dimensions. * ​single\_thread\_blocking\_override (`Bool`): If True, then reduction is run synchronously using a single thread. **Args:** * ​input\_buf (`NDBuffer[input_type, input_rank, origin]`): The input tensor. * ​indices\_buf (`NDBuffer[indices_type, indices_rank, origin]`): The indices tensor. **Returns:** The output shape.
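The rank expression above is the usual GatherND shape rule: the output shape is `indices.shape[:-1]` followed by whatever data dimensions remain after the batch dimensions and the indexed dimensions. A minimal sketch (illustrative helper, not the library routine):

```mojo
# Illustrative GatherND output-shape rule: indices.shape[:-1] followed by the
# data dims left over after the batch dims and the m indexed dims, where m is
# indices.shape[-1].
fn gather_nd_output_shape(data_shape: List[Int], indices_shape: List[Int], batch_dims: Int) -> List[Int]:
    var out = List[Int]()
    for i in range(len(indices_shape) - 1):
        out.append(indices_shape[i])
    var m = indices_shape[len(indices_shape) - 1]
    for i in range(batch_dims + m, len(data_shape)):
        out.append(data_shape[i])
    return out^

fn main():
    # data [2, 3, 4] gathered with indices [5, 2] and batch_dims = 0:
    # output shape is [5] ++ [4] = [5, 4].
    var out = gather_nd_output_shape(List[Int](2, 3, 4), List[Int](5, 2), 0)
    for i in range(len(out)):
        print(out[i])
```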
--- ## gather_reduce `gather_reduce[type: DType, gather_axis: Int, reduce_axis: Int, simd_width: Int, reduce_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1], output_rank: Int, output_shape: DimList, input_rank: Int, input_shape: DimList, indices_rank: Int, indices_shape: DimList](output: NDBuffer[type, output_rank, origin, output_shape], input: NDBuffer[type, input_rank, origin, input_shape], indices: NDBuffer[int32, indices_rank, origin, indices_shape], reduce_init: SIMD[type, 1])` Computes output\[i, j, k] = input\[indices\[i, j], k] and simultaneously reduces the output across axis 1 to produce output\[i, k]. The motivating use-case for this is multi-hot embeddings in recommender models. This provides similar functionality to Torch's EmbeddingBag layer. In that context, i is the batch dimension, j is the multi-hot dimension, and k is the embedding dimension. --- ## gather_shape `gather_shape[output_rank: Int, input_rank: Int, indices_rank: Int, input_type: DType, indices_type: DType, single_thread_blocking_override: Bool = False](input_buf: NDBuffer[input_type, input_rank, origin], indices_buf: NDBuffer[indices_type, indices_rank, origin], axis: Int) -> IndexList[output_rank]` Compute the output shape of a `gather` operation, and assert the inputs are compatible. **Parameters:** * ​output\_rank (`Int`): Rank of the output tensor. * ​input\_rank (`Int`): Rank of the input tensor. * ​indices\_rank (`Int`): Rank of the indices tensor. * ​input\_type (`DType`): Type of the input tensor. * ​indices\_type (`DType`): Type of the indices tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input\_buf (`NDBuffer[input_type, input_rank, origin]`): The input tensor. * ​indices\_buf (`NDBuffer[indices_type, indices_rank, origin]`): The indices tensor. * ​axis (`Int`): The axis. **Returns:** The output shape. --- ## gather_scatter ## Structs * [​`Axis`](./Axis): ## Functions * [​`gather`](./gather): Gather operation as defined in the ONNX spec. * [​`gather_elements`](./gather_elements): Implements ONNX GatherElements op which is equivalent to Pytorch gather. * [​`gather_elementwise_fn_wrapper`](./gather_elementwise_fn_wrapper): * [​`gather_guards`](./gather_guards): * [​`gather_nd`](./gather_nd): GatherND operation as defined in the ONNX spec. * [​`gather_nd_shape`](./gather_nd_shape): Compute the output shape of a `gather` operation, and assert the inputs are compatible. * [​`gather_reduce`](./gather_reduce): Computes output\[i, j, k] = input\[indices\[i, j], k] and simultaneously reduces the output across axis 1 to produce output\[i, k]. * [​`gather_shape`](./gather_shape): Compute the output shape of a `gather` operation, and assert the inputs are compatible. * [​`normalize_neg_index`](./normalize_neg_index): Indices passed to gather and scatter ops may be negative. This performs a normalization so that they can be used to index into a buffer. * [​`scatter_elements`](./scatter_elements): Implements ONNX ScatterElements op which is equivalent to Pytorch scatter. * [​`scatter_elements_shape`](./scatter_elements_shape): Compute the output shape of a `scatter_elements` operation, and assert the inputs are compatible. * [​`scatter_nd`](./scatter_nd): Scatter\_nd operation without any reduction. * [​`scatter_nd_generator`](./scatter_nd_generator): Implements the ONNX ScatterND operation. * [​`scatter_nd_shape`](./scatter_nd_shape): Compute the output shape of a `scatter_nd` operation, and assert the inputs are compatible. * [​`scatter_set_constant`](./scatter_set_constant): Scatter the fill\_value into the data at the specified indices. --- ## normalize_neg_index `normalize_neg_index(idx: Int, dim_size: Int) -> Int` Indices passed to gather and scatter ops may be negative. This performs a normalization so that they can be used to index into a buffer. Returns val + dim if val < 0, else val. `normalize_neg_index[type: DType, width: Int, out_type: DType = index](idx: SIMD[type, width], dim_size: Int) -> SIMD[out_type, width]` Indices passed to gather and scatter ops may be negative. This performs a normalization so that they can be used to index into a buffer. Returns val + dim if val < 0, else val.
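A minimal sketch of the normalization (illustrative helper, not the library routine):

```mojo
# Illustrative negative-index normalization: map idx in [-dim, dim) into
# [0, dim), the way gather and scatter ops expect.
fn normalize_neg_index_sketch(idx: Int, dim_size: Int) -> Int:
    return idx + dim_size if idx < 0 else idx

fn main():
    print(normalize_neg_index_sketch(-1, 4))  # prints 3
    print(normalize_neg_index_sketch(2, 4))   # prints 2
```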
--- ## scatter_elements `scatter_elements[reduce_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1], rank: Int, input_type: DType, indices_type: DType](input: ManagedTensorSlice[io_spec, static_spec=static_spec], indices: ManagedTensorSlice[io_spec, static_spec=static_spec], updates: ManagedTensorSlice[io_spec, static_spec=static_spec], _axis: Int, output: ManagedTensorSlice[io_spec, static_spec=static_spec])` Implements ONNX ScatterElements op which is equivalent to Pytorch scatter. --- ## scatter_elements_shape `scatter_elements_shape[rank: Int, input_type: DType, indices_type: DType, //, *, single_thread_blocking_override: Bool](input: NDBuffer[input_type, rank, origin], updates: NDBuffer[input_type, rank, origin], indices: NDBuffer[indices_type, rank, origin], axis: Int) -> IndexList[rank]` Compute the output shape of a `scatter_elements` operation, and assert the inputs are compatible. **Parameters:** * ​rank (`Int`): Rank of the input tensor. * ​input\_type (`DType`): Type of the input tensor. * ​indices\_type (`DType`): Type of the indices tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input (`NDBuffer[input_type, rank, origin]`): The input tensor. * ​updates (`NDBuffer[input_type, rank, origin]`): The updates tensor. * ​indices (`NDBuffer[indices_type, rank, origin]`): The indices tensor. * ​axis (`Int`): The axis. **Returns:** The output shape. --- ## scatter_nd `scatter_nd[output_type: DType, indices_type: DType, data_rank: Int, indices_rank: Int, updates_rank: Int, single_thread_blocking_override: Bool, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](data: NDBuffer[output_type, data_rank, origin], indices: NDBuffer[indices_type, indices_rank, origin], updates: NDBuffer[output_type, updates_rank, origin], output: NDBuffer[output_type, data_rank, origin], context: DeviceContextPtr = DeviceContextPtr())` Scatter\_nd operation without any reduction. --- ## scatter_nd_generator `scatter_nd_generator[output_type: DType, indices_type: DType, data_rank: Int, indices_rank: Int, updates_rank: Int, single_thread_blocking_override: Bool, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), /, reduce_fn: OptionalReg[fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1}), *, _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("scatter_nd")](data: NDBuffer[output_type, data_rank, origin], indices: NDBuffer[indices_type, indices_rank, origin], updates: NDBuffer[output_type, updates_rank, origin], output: NDBuffer[output_type, data_rank, origin], context: DeviceContextPtr = DeviceContextPtr())` Implements the ONNX ScatterND operation. **Parameters:** * ​output\_type (`DType`): Type of data, updates, and output tensors. * ​indices\_type (`DType`): Type of the indices tensor. * ​data\_rank (`Int`): Rank of input (data) tensor (data\_rank >= 1). * ​indices\_rank (`Int`): Rank of the indices tensor (indices\_rank >= 1). * ​updates\_rank (`Int`): Rank of updates tensor (updates\_rank = data\_rank + indices\_rank - indices\_shape\[-1] - 1). * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​target (`StringSlice[StaticConstantOrigin]`): Target cpu or cuda. * ​reduce\_fn (`OptionalReg[fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]`): Reduction function to apply: none (default), add, mul, max, min. * ​\_trace\_description (`StringSlice[StaticConstantOrigin]`): A description of the function, used for profiling and tracing. **Args:** * ​data (`NDBuffer[output_type, data_rank, origin]`): Tensor of rank data\_rank >= 1. * ​indices (`NDBuffer[indices_type, indices_rank, origin]`): Tensor of rank indices\_rank containing indices for the scatter operation. * ​updates (`NDBuffer[output_type, updates_rank, origin]`): Tensor containing values to update output tensor based on indices tensor. * ​output (`NDBuffer[output_type, data_rank, origin]`): Tensor of rank data\_rank, shaped the same as data tensor. * ​context (`DeviceContextPtr`): Pointer to DeviceContext.
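To make the reduction behavior concrete, here is a minimal 1-D sketch of the scatter semantics with an optional add reduction (illustrative only; the kernel generalizes this over arbitrary ranks and index tuples):

```mojo
# Illustrative 1-D ScatterND semantics: copy data, then scatter updates at
# the given indices. Without a reduction the update overwrites; with an add
# reduction it accumulates.
fn scatter_1d(data: List[Int], indices: List[Int], updates: List[Int], use_add: Bool) -> List[Int]:
    var out = List[Int](capacity=len(data))
    for i in range(len(data)):
        out.append(data[i])
    for j in range(len(indices)):
        var idx = indices[j]
        if use_add:
            out[idx] = out[idx] + updates[j]
        else:
            out[idx] = updates[j]
    return out^

fn main():
    var result = scatter_1d(List[Int](1, 1, 1, 1), List[Int](1, 3), List[Int](5, 7), True)
    for i in range(len(result)):
        print(result[i])  # prints 1, 6, 1, 8
```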
**Parameters:** * ​input\_rank (`Int`): Rank of the input tensor. * ​updates\_rank (`Int`): Rank of the updates tensor. * ​indices\_rank (`Int`): Rank of the indices tensor. * ​input\_type (`DType`): Type of the input tensor. * ​indices\_type (`DType`): Type of the indices tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input (`NDBuffer[input_type, input_rank, origin]`): The input tensor. * ​updates (`NDBuffer[input_type, updates_rank, origin]`): The updates tensor. * ​indices (`NDBuffer[indices_type, indices_rank, origin]`): The indices tensor. **Returns:** The output shape. --- ## scatter_set_constant `scatter_set_constant[data_type: DType, index_type: DType, //, target: StringSlice[StaticConstantOrigin], single_thread_blocking_override: Bool = False](data: LayoutTensor[data_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], indices: LayoutTensor[index_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], fill_value: SIMD[data_type, 1], ctx: DeviceContextPtr)` Scatter the fill\_value into the data at the specified indices. Example: Suppose we have a 3x3 matrix `data` initialized to zeros: data = [\[0, 0, 0], \[0, 0, 0], \[0, 0, 0]] And `indices` is a 2D tensor with shape \[2, 2]: indices = [\[0, 1], \[2, 0]] If `fill_value` is 5, after calling `scatter_set_constant`, `data` will be: data = [\[0, 5, 0], \[0, 0, 0], \[5, 0, 0]] **Args:** * ​data: The data to scatter the updates into. * ​indices: The indices to scatter the updates into. * ​fill\_value: The value to fill the data with. * ​ctx: The device context. --- ## Image2DLayout `@register_passable(trivial)` `struct Image2DLayout` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `FRSCf` `alias FRSCf = Image2DLayout(3)` ### `NCHW` `alias NCHW = Image2DLayout(1)` ### `NHWC` `alias NHWC = Image2DLayout(0)` ### `RSCF` `alias RSCF = Image2DLayout(2)` ### `UNKNOWN` `alias UNKNOWN = Image2DLayout(-1)` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` --- ## ImageData `@register_passable(trivial)` `struct ImageData[shape: DimList, type: DType, static_layout: Image2DLayout, origin: MutableOrigin]` Utility class that generalizes conv2d data and filter tensors with a given data layout. ## Fields * ​data (`NDBuffer[type, 4, origin, shape]`): * ​dynamic\_layout (`Image2DLayout`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(data: NDBuffer[type, 4, origin, shape], layout: Image2DLayout) -> Self` Constructs an image data instance with a dynamic layout parameter. **Args:** * ​data (`NDBuffer[type, 4, origin, shape]`): A 4d buffer containing the actual data. * ​layout (`Image2DLayout`): Data layout tag. `@implicit` `__init__(data: NDBuffer[type, 4, origin, shape]) -> Self` ### `__getitem__` `__getitem__(self, n: Int, c: Int, h: Int, w: Int) -> SIMD[type, 1]` Reads the underlying data buffer based on the tensor index and underlying data layout. **Args:** * ​n (`Int`): Index on the batch dimension. * ​c (`Int`): Index on the channel dimension. * ​h (`Int`): Index on the height dimension.
* ​w (`Int`): Index on the width dimension. **Returns:** The value stored at the given index position. ### `__setitem__` `__setitem__(self, n: Int, c: Int, h: Int, w: Int, value: SIMD[type, 1])` Writes the underlying data buffer based on the tensor index and underlying data layout. **Args:** * ​n (`Int`): Index on the batch dimension. * ​c (`Int`): Index on the channel dimension. * ​h (`Int`): Index on the height dimension. * ​w (`Int`): Index on the width dimension. * ​value (`SIMD[type, 1]`): The value to store at the given index position. ### `to_static_layout` `to_static_layout[new_static_layout: Image2DLayout](self) -> ImageData[shape, type, new_static_layout, origin]` Conversion utility from a fully dynamic data structure (e.g., from a C shim) to one with a compile-time-known data layout. **Returns:** The image data with static data layout. ### `get_layout` `get_layout(self) -> Image2DLayout` Returns the underlying data layout, resolved from either statically or dynamically provided information. **Returns:** The resolved data layout tag for this image instance. ### `get_flat_index` `get_flat_index(self, n: Int, c: Int, h: Int, w: Int) -> Int` Converts the dimension index to the flat index of the underlying data based on the tensor layout. **Args:** * ​n (`Int`): Index on the batch dimension. * ​c (`Int`): Index on the channel dimension. * ​h (`Int`): Index on the height dimension. * ​w (`Int`): Index on the width dimension. **Returns:** An integer containing the index based on the underlying data layout. ### `get_tuple_index` `get_tuple_index(self, idx: Int) -> IndexList[4]` Converts the flat index to the dimension index of the underlying data based on the tensor layout. **Args:** * ​idx (`Int`): Flat index. **Returns:** An IndexList containing the index in NCHW order. ### `num_elements` `num_elements(self) -> Int` --- ## ImageShape `@register_passable(trivial)` `struct ImageShape` A data-layout agnostic representation of tensor shapes used in conv2d. ## Fields * ​N (`Int`): * ​C (`Int`): * ​H (`Int`): * ​W (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__[shape: DimList, type: DType, layout: Image2DLayout](image_data: ImageData[shape, type, layout, origin]) -> Self` Constructs an ImageShape instance from an ImageData. **Args:** * ​image\_data (`ImageData[shape, type, layout, origin]`): The image data instance to extract shape info from. --- ## PadHandling `@register_passable(trivial)` `struct PadHandling` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `EXCLUDE_PAD` `alias EXCLUDE_PAD = PadHandling(0)` ### `INCLUDE_PAD` `alias INCLUDE_PAD = PadHandling(2)` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` --- ## image ## Structs * [​`Image2DLayout`](./Image2DLayout): * [​`ImageData`](./ImageData): Utility class that generalizes conv2d data and filter tensors with a given data layout. * [​`ImageShape`](./ImageShape): A data-layout agnostic representation of tensor shapes used in conv2d. * [​`PadHandling`](./PadHandling): --- ## nn Provides neural network operators for deep learning models. ## Modules * [​`activations`](./activations/): The module contains implementations of activation functions.
* [​`arange`](./arange/): * [​`arg_nonzero`](./arg_nonzero/): * [​`argmaxmin`](./argmaxmin/): * [​`argmaxmin_gpu`](./argmaxmin_gpu/): * [​`argsort`](./argsort/): * [​`bicubic`](./bicubic/): This module provides CPU and GPU implementations for bicubic interpolation. * [​`broadcast`](./broadcast/): * [​`concat`](./concat/): * [​`conv`](./conv/): * [​`conv_transpose`](./conv_transpose/): * [​`conv_utils`](./conv_utils/): * [​`cumsum`](./cumsum/): * [​`flash_attention`](./flash_attention/): * [​`fold`](./fold/): Implements the fold operation. * [​`fused_qk_rope`](./fused_qk_rope/): * [​`gather_scatter`](./gather_scatter/): * [​`image`](./image/): * [​`index_tensor`](./index_tensor/): * [​`irfft`](./irfft/): Inverse real FFT kernel using cuFFT. * [​`kv_cache`](./kv_cache/): * [​`kv_cache_ragged`](./kv_cache_ragged/): * [​`mha`](./mha/): * [​`mha_cross`](./mha_cross/): * [​`mha_mask`](./mha_mask/): * [​`mha_operand`](./mha_operand/): * [​`mha_score_mod`](./mha_score_mod/): * [​`mha_sm90`](./mha_sm90/): * [​`mha_tile_scheduler`](./mha_tile_scheduler/): * [​`mha_utils`](./mha_utils/): * [​`mla`](./mla/): * [​`moe`](./moe/): * [​`nms`](./nms/): * [​`normalization`](./normalization/): * [​`pad`](./pad/): * [​`pad_gpu`](./pad_gpu/): * [​`pool`](./pool/): * [​`rand_uniform`](./rand_uniform/): * [​`randn`](./randn/): * [​`repeat_interleave`](./repeat_interleave/): * [​`reshape`](./reshape/): * [​`resize`](./resize/): * [​`roi_align`](./roi_align/): * [​`sampling`](./sampling/): * [​`shapes`](./shapes/): * [​`slice`](./slice/): * [​`softmax`](./softmax/): * [​`split`](./split/): * [​`tile`](./tile/): * [​`topk`](./topk/): * [​`toppminp`](./toppminp/): * [​`toppminp_gpu`](./toppminp_gpu/): --- ## advanced_indexing_getitem `advanced_indexing_getitem[input_rank: Int, index_rank: Int, input_type: DType, index_type: DType, //, start_axis: Int, num_index_tensors: Int, target: StringSlice[StaticConstantOrigin], single_thread_blocking_override: Bool, trace_description: StringSlice[StaticConstantOrigin], input_tensor_fn: fn[Int](IndexList[input_rank]) capturing -> SIMD[input_type, $0], indices_fn: fn[Int](IndexList[index_rank]) capturing -> SIMD[index_type, 1]](out_tensor: NDBuffer[input_type, ((num_index_tensors * -1) + index_rank + input_rank), origin], in_tensor_strides: IndexList[input_rank], ctx: DeviceContextPtr)` Implements basic numpy-style advanced indexing. This is designed to be fused with other view-producing operations to implement full numpy-indexing semantics. This assumes the dimensions in `input_tensor` not indexed by index tensors are ":", i.e., selecting all indices along the slice. For example in numpy: ``` # rank(indices1) == 3 # rank(indices2) == 3 out_tensor = input_tensor[:, :, :, indices1, indices2, :, :] ``` We calculate the following for all valid values of the indexing variables: ``` out_tensor[a, b, c, i, j, k, d, e] = input_tensor[ a, b, c, indices1[i, j, k], indices2[i, j, k], d, e ] ``` In this example `start_axis = 3` and `num_index_tensors = 2`. TODO(GEX-1951): Support boolean tensor masks. TODO(GEX-1952): Support the non-contiguous indexing tensor case. TODO(GEX-1953): Support fusion (especially view-fusion). **Parameters:** * ​input\_rank (`Int`): The rank of the input tensor. * ​index\_rank (`Int`): The rank of the indexing tensors. * ​input\_type (`DType`): The dtype of the input tensor. * ​index\_type (`DType`): The dtype of the indexing tensors. * ​start\_axis (`Int`): The first dimension in input where the indexing tensors are applied.
It is assumed the indexing tensors are applied in consecutive dimensions. * ​num\_index\_tensors (`Int`): The number of indexing tensors. * ​target (`StringSlice[StaticConstantOrigin]`): The target architecture to operate on. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​trace\_description (`StringSlice[StaticConstantOrigin]`): For profiling, the trace name the operation will appear under. * ​input\_tensor\_fn (`fn[Int](IndexList[input_rank]) capturing -> SIMD[input_type, $0]`): Fusion lambda for the input tensor. * ​indices\_fn (`fn[Int](IndexList[index_rank]) capturing -> SIMD[index_type, 1]`): Fusion lambda for the indices tensors. **Args:** * ​out\_tensor (`NDBuffer[input_type, ((num_index_tensors * -1) + index_rank + input_rank), origin]`): The output tensor to write to. * ​in\_tensor\_strides (`IndexList[input_rank]`): The strides of the input tensor. * ​ctx (`DeviceContextPtr`): The DeviceContextPtr as prepared by the graph compiler. --- ## advanced_indexing_getitem_shape `advanced_indexing_getitem_shape[input_rank: Int, index_rank: Int, //, start_axis: Int, num_index_tensors: Int](input_shape: IndexList[input_rank], index_shape: IndexList[index_rank]) -> IndexList[((num_index_tensors * -1) + index_rank + input_rank)]` Calculate the output shape from advanced indexing. **Parameters:** * ​input\_rank (`Int`): The rank of the input tensor. * ​index\_rank (`Int`): The rank of the indexing tensors. * ​start\_axis (`Int`): The first dimension in input where the indexing tensors are applied. It is assumed the indexing tensors are applied in consecutive dimensions. * ​num\_index\_tensors (`Int`): The number of indexing tensors. **Args:** * ​input\_shape (`IndexList[input_rank]`): The shape of the input tensor in the operation. * ​index\_shape (`IndexList[index_rank]`): The shape of the indexing tensors in the operation. **Returns:** The shape of the output tensor. --- ## advanced_indexing_setitem_inplace `advanced_indexing_setitem_inplace[input_rank: Int, index_rank: Int, updates_rank: Int, input_type: DType, index_type: DType, //, start_axis: Int, num_index_tensors: Int, target: StringSlice[StaticConstantOrigin], single_thread_blocking_override: Bool, trace_description: StringSlice[StaticConstantOrigin], updates_tensor_fn: fn[Int](IndexList[updates_rank]) capturing -> SIMD[input_type, $0], indices_fn: fn[Int](IndexList[index_rank]) capturing -> SIMD[index_type, 1]](input_tensor: NDBuffer[input_type, input_rank, origin], index_tensor_shape: IndexList[index_rank, element_type=element_type], updates_tensor_strides: IndexList[updates_rank], ctx: DeviceContextPtr)` Implements basic numpy-style advanced indexing with assignment. This is designed to be fused with other view-producing operations to implement full numpy-indexing semantics. This assumes the dimensions in `input_tensor` not indexed by index tensors are ":", i.e., selecting all indices along the slice. For example in numpy: ``` # rank(indices1) == 2 # rank(indices2) == 2 # rank(updates) == 2 input_tensor[:, :, :, indices1, indices2, :, :] = updates ``` We calculate the following for all valid values of the indexing variables: ``` input_tensor[ a, b, c, indices1[i, j], indices2[i, j], d, e ] = updates[i, j] ``` In this example `start_axis = 3` and `num_index_tensors = 2`. In terms of implementation details, our strategy is to iterate over all indices in a common iteration range. The idea is that we can map indices in this range to the write location in `input_tensor` as well as the data location in `updates`.
An example illustrates this best: Imagine the `input_tensor` shape is \[A, B, C, D] and we have indexing tensors I1 and I2 with shape \[M, N, K]. Assume I1 and I2 are applied to dimensions 1 and 2. An appropriate common iteration range is then (A, M, N, K, D). Note we expect `updates` to have the shape \[A, M, N, K, D]. We can show this by providing the mappings into `updates` and `input_tensor`: Consider an arbitrary set of indices in this range (a, m, n, k, d): \- The index into `updates` is (a, m, n, k, d). \- The index into `input_tensor` is (a, I1\[m, n, k], I2\[m, n, k], d). TODO(GEX-1951): Support boolean tensor masks. TODO(GEX-1952): Support the non-contiguous indexing tensor case. TODO(GEX-1953): Support fusion (especially view-fusion). TODO(GEX-1954): Unify getitem and setitem using generic views. (Requires non-strided view functions). **Parameters:** * ​input\_rank (`Int`): The rank of the input tensor. * ​index\_rank (`Int`): The rank of the indexing tensors. * ​updates\_rank (`Int`): The rank of the updates tensor. * ​input\_type (`DType`): The dtype of the input tensor. * ​index\_type (`DType`): The dtype of the indexing tensors. * ​start\_axis (`Int`): The first dimension in input where the indexing tensors are applied. It is assumed the indexing tensors are applied in consecutive dimensions. * ​num\_index\_tensors (`Int`): The number of indexing tensors. * ​target (`StringSlice[StaticConstantOrigin]`): The target architecture to operate on. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​trace\_description (`StringSlice[StaticConstantOrigin]`): For profiling, the trace name the operation will appear under. * ​updates\_tensor\_fn (`fn[Int](IndexList[updates_rank]) capturing -> SIMD[input_type, $0]`): Fusion lambda for the update tensor. * ​indices\_fn (`fn[Int](IndexList[index_rank]) capturing -> SIMD[index_type, 1]`): Fusion lambda for the indices tensors. **Args:** * ​input\_tensor (`NDBuffer[input_type, input_rank, origin]`): The input tensor being indexed into and modified in-place. * ​index\_tensor\_shape (`IndexList[index_rank, element_type=element_type]`): The shape of each index tensor. * ​updates\_tensor\_strides (`IndexList[updates_rank]`): The strides of the update tensor. * ​ctx (`DeviceContextPtr`): The DeviceContextPtr as prepared by the graph compiler. --- ## index_tensor ## Functions * [​`advanced_indexing_getitem`](./advanced_indexing_getitem): Implements basic numpy-style advanced indexing. * [​`advanced_indexing_getitem_shape`](./advanced_indexing_getitem_shape): Calculate the output shape from advanced indexing (see the sketch after this list). * [​`advanced_indexing_setitem_inplace`](./advanced_indexing_setitem_inplace): Implements basic numpy-style advanced indexing with assignment. * [​`index_tensor`](./index_tensor): Index\_tensor operation, based on a modified implementation of gather\_nd. * [​`index_tensor_shape`](./index_tensor_shape): Compute the output shape of an `index_tensor` operation, and assert the inputs are compatible.
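The output-rank rule used by the advanced indexing functions above is visible in their signatures: `(num_index_tensors * -1) + index_rank + input_rank`. A minimal, self-contained Mojo sketch of that arithmetic (not part of this API; `expected_out_rank` is a hypothetical helper):

```mojo
fn expected_out_rank(input_rank: Int, index_rank: Int, num_index_tensors: Int) -> Int:
    # The `num_index_tensors` indexed dimensions collapse into the
    # `index_rank` dimensions of the broadcast index tensors:
    # out_rank = input_rank + index_rank - num_index_tensors.
    return input_rank + index_rank - num_index_tensors

fn main():
    # The getitem docstring example: a rank-7 input indexed by two rank-3
    # index tensors yields out_tensor[a, b, c, i, j, k, d, e], i.e. rank 8.
    print(expected_out_rank(7, 3, 2))  # prints 8
```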
--- ## index_tensor `index_tensor[type: DType, indices_type: DType, data_rank: Int, indices_rank: Int, output_rank: Int, batch_dims: Int, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), single_thread_blocking_override: Bool = False](data: NDBuffer[type, data_rank, origin], indices: NDBuffer[indices_type, indices_rank, origin], output: NDBuffer[type, output_rank, origin], ctx: DeviceContextPtr)` Index\_tensor operation, based on a modified implementation of gather\_nd. **Parameters:** * ​type (`DType`): Type of data tensor. * ​indices\_type (`DType`): Type of indices tensor. * ​data\_rank (`Int`): Rank of data tensor (data\_rank >= 1). * ​indices\_rank (`Int`): Rank of indices tensor (indices\_rank >= 1). * ​output\_rank (`Int`): Rank of output tensor. * ​batch\_dims (`Int`): Number of batch dimensions. Indexing starts from the dimensions of data\[batch\_dims:]. * ​target (`StringSlice[StaticConstantOrigin]`): The target architecture to execute on. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​data (`NDBuffer[type, data_rank, origin]`): Tensor of rank data\_rank >= 1. * ​indices (`NDBuffer[indices_type, indices_rank, origin]`): Tensor of rank indices\_rank >= 1. All index values are expected to be within bounds \[-s, s-1] along axis of size s. It is an error if any of the index values are out of bounds. * ​output (`NDBuffer[type, output_rank, origin]`): Tensor of rank data\_rank + indices\_rank - indices\_shape\[-1] - 1 - batch\_dims. * ​ctx (`DeviceContextPtr`): The DeviceContextPtr as prepared by the graph compiler. --- ## index_tensor_shape `index_tensor_shape[input_rank: Int, indices_rank: Int, output_rank: Int, input_type: DType, indices_type: DType, batch_dims: Int, single_thread_blocking_override: Bool = True](input_buf: NDBuffer[input_type, input_rank, origin], indices_buf: NDBuffer[indices_type, indices_rank, origin]) -> IndexList[output_rank]` Compute the output shape of an `index_tensor` operation, and assert the inputs are compatible. **Parameters:** * ​input\_rank (`Int`): Rank of the input tensor. * ​indices\_rank (`Int`): Rank of the indices tensor. * ​output\_rank (`Int`): Rank of the output tensor. * ​input\_type (`DType`): Type of the input tensor. * ​indices\_type (`DType`): Type of the indices tensor. * ​batch\_dims (`Int`): Batch dimensions. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input\_buf (`NDBuffer[input_type, input_rank, origin]`): The input tensor. * ​indices\_buf (`NDBuffer[indices_type, indices_rank, origin]`): The indices tensor. **Returns:** The output shape. --- ## global_cache_insert `global_cache_insert(key: String, value: UnsafePointer[NoneType])` --- ## global_cache_lookup `global_cache_lookup(key: String) -> UnsafePointer[NoneType]` --- ## irfft Inverse real FFT kernel using cuFFT. ## Functions * [​`global_cache_insert`](./global_cache_insert): * [​`global_cache_lookup`](./global_cache_lookup): * [​`irfft`](./irfft): Compute the inverse real FFT of the input tensor. --- ## irfft `irfft[input_rank: Int, input_type: DType, output_type: DType](input: NDBuffer[input_type, input_rank, origin], output: NDBuffer[output_type, input_rank, origin], n: Int, ctx: DeviceContext)` Compute the inverse real FFT of the input tensor. Currently, the transform is only applied to the last dimension.
**Args:** * ​input (`NDBuffer[input_type, input_rank, origin]`): Complex input tensor (NDBuffer). * ​output (`NDBuffer[output_type, input_rank, origin]`): Real output tensor (NDBuffer). * ​n (`Int`): Output signal size. * ​ctx (`DeviceContext`): Device context. --- ## generic_flash_attention_kv_cache_padded `generic_flash_attention_kv_cache_padded[collection_t: KVCollectionT, type: DType, //, *, target: StringSlice[StaticConstantOrigin], mask_str: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], local_window_size: Int = -1, num_heads: Int = -1](q: NDBuffer[type, 4, origin, shape, strides], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], valid_lengths: ManagedTensorSlice[io_spec, static_spec=static_spec], scale: SIMD[float32, 1], output: NDBuffer[type, 4, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_flash_attention_kv_cache_padded_materialized_mask `generic_flash_attention_kv_cache_padded_materialized_mask[collection_t: KVCollectionT, type: DType, //, *, target: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], local_window_size: Int = -1, num_heads: Int = -1](q: NDBuffer[type, 4, origin, shape, strides], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], mask: NDBuffer[type, rank, origin, shape, strides], valid_lengths: ManagedTensorSlice[io_spec, static_spec=static_spec], scale: SIMD[float32, 1], output: NDBuffer[type, 4, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_fused_qk_rope_bshd_continuous_batch `generic_fused_qk_rope_bshd_continuous_batch[type: DType, //, *, interleaved: Bool, target: StringSlice[StaticConstantOrigin]](q_proj: NDBuffer[type, 4, origin, shape, strides], kv_collection: ContinuousBatchingKVCacheCollection[type_, kv_params_], freqs_cis: NDBuffer[type, 2, origin, shape, strides], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 4, origin, shape, strides], context: DeviceContextPtr = DeviceContextPtr())` Performs a fused RoPE projection for Q and K projections. We have a manually fused QKV projection with mo.opaque types in our Llama model. Due to a limitation in custom op definitions, we can't declare both a tensor and opaque type as output from a custom kernel. This requires us to only note Q\_proj as an output from the QKV projection. If we immediately follow the QKV proj kernel with a RoPE kernel applied to K, we'll get a race condition because the graph compiler doesn't know about the dependency between these kernels in the graph definition. Here we fuse the RoPE kernel applied to Q\_proj with K\_proj, so K\_proj RoPE is only executed after QKV completes. --- ## generic_fused_qkv_matmul_kv_cache_bshd_continuous_batch `generic_fused_qkv_matmul_kv_cache_bshd_continuous_batch[type: DType, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](hidden_state: NDBuffer[type, 3, origin, shape], weight: NDBuffer[type, 2, origin, shape], kv_collection: ContinuousBatchingKVCacheCollection[type_, kv_params_], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 3, origin, shape], ctx: DeviceContextPtr)` Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. **Args:** * ​hidden\_state (`NDBuffer[type, 3, origin, shape]`): Tensor with shape (batch\_size, seq\_len, num\_heads \* head\_size). * ​weight (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size).
* ​kv\_collection (`ContinuousBatchingKVCacheCollection[type_, kv_params_]`): The historical KVCache for keys and values. The KVCache for this layer is retrieved via layer\_idx. * ​layer\_idx (`SIMD[uint32, 1]`): The index of the layer being executed. Used to retrieve the KVCache for the given layer from kv\_collection. * ​output (`NDBuffer[type, 3, origin, shape]`): The pre-allocated output buffer for Q projections. K and V projections are written in-place to k\_cache and v\_cache. * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## generic_get_continuous_cache `generic_get_continuous_cache[type: DType, kv_params: KVCacheStaticParams](blocks: NDBuffer[type, 6, origin], cache_lengths: NDBuffer[uint32, 1, origin], lookup_table: NDBuffer[uint32, 1, origin], max_lengths: NDBuffer[uint32, 2, origin]) -> ContinuousBatchingKVCacheCollection[type, kv_params]` --- ## generic_get_paged_cache `generic_get_paged_cache[type: DType, kv_params: KVCacheStaticParams, page_size: Int](blocks: NDBuffer[type, 6, origin], cache_lengths: NDBuffer[uint32, 1, origin], lookup_table: NDBuffer[uint32, 2, origin], max_lengths: NDBuffer[uint32, 2, origin], out result: PagedKVCacheCollection[type, kv_params, page_size])` --- ## kv_cache ## Aliases ### `embed_fn_type` `alias embed_fn_type = fn[DType, Int](IndexList[4], SIMD[$0, $1]) capturing -> SIMD[$0, $1]` ## Functions * [​`generic_flash_attention_kv_cache_padded`](./generic_flash_attention_kv_cache_padded): * [​`generic_flash_attention_kv_cache_padded_materialized_mask`](./generic_flash_attention_kv_cache_padded_materialized_mask): * [​`generic_fused_qk_rope_bshd_continuous_batch`](./generic_fused_qk_rope_bshd_continuous_batch): Performs a fused RoPE projection for Q and K projections. * [​`generic_fused_qkv_matmul_kv_cache_bshd_continuous_batch`](./generic_fused_qkv_matmul_kv_cache_bshd_continuous_batch): Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. * [​`generic_get_continuous_cache`](./generic_get_continuous_cache): * [​`generic_get_paged_cache`](./generic_get_paged_cache): * [​`managed_tensor_slice_to_ndbuffer`](./managed_tensor_slice_to_ndbuffer): * [​`print_kv_cache_cont_batch_generic_cpu`](./print_kv_cache_cont_batch_generic_cpu): * [​`print_kv_cache_cont_batch_generic_gpu`](./print_kv_cache_cont_batch_generic_gpu): * [​`print_kv_cache_paged_generic_cpu`](./print_kv_cache_paged_generic_cpu): * [​`print_kv_cache_paged_generic_gpu`](./print_kv_cache_paged_generic_gpu): * [​`rms_norm_kv_cache_ragged_continuous_batching`](./rms_norm_kv_cache_ragged_continuous_batching): Performs RMSNorm in place on new entries in the key cache. * [​`rms_norm_kv_cache_ragged_paged`](./rms_norm_kv_cache_ragged_paged): Performs RMSNorm in place on new entries in the key cache. 
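The padded flash-attention entry points listed above take a `valid_lengths` argument with one entry per batch row, giving each sequence's un-padded length. A minimal sketch of that bookkeeping, with hypothetical values and a plain `List` standing in for the real tensor types:

```mojo
def main():
    # Hypothetical valid lengths for a batch of 3 sequences,
    # padded out to max_seq_len = 5.
    var valid_lengths = List[Int](2, 5, 3)
    var max_seq_len = 5
    for b in range(len(valid_lengths)):
        # Positions [0, valid_lengths[b]) hold real tokens; the remainder
        # is padding that the kernel must not attend to.
        var pad = max_seq_len - valid_lengths[b]
        print("batch", b, "real tokens:", valid_lengths[b], "padding:", pad)
```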
--- ## managed_tensor_slice_to_ndbuffer `managed_tensor_slice_to_ndbuffer[: DType, : Int, spec: StaticTensorSpec[$0, $1], //](tensor: ManagedTensorSlice[io_spec, static_spec=spec]) -> NDBuffer[dtype, rank, MutableAnyOrigin, spec.shape, spec.strides, alignment=spec.alignment, address_space=spec.address_space, exclusive=spec.exclusive]` --- ## print_kv_cache_cont_batch_generic_cpu `print_kv_cache_cont_batch_generic_cpu[target: StringSlice[StaticConstantOrigin], type: DType, kv_params: KVCacheStaticParams](valid_lengths: NDBuffer[uint32, 1, origin], kv_collection: ContinuousBatchingKVCacheCollection[type, kv_params], layer_idx: SIMD[uint32, 1], is_print_compact: Bool, context: DeviceContextPtr)` --- ## print_kv_cache_cont_batch_generic_gpu `print_kv_cache_cont_batch_generic_gpu[target: StringSlice[StaticConstantOrigin], type: DType, kv_params: KVCacheStaticParams](valid_lengths: NDBuffer[uint32, 1, origin], kv_collection: ContinuousBatchingKVCacheCollection[type, kv_params], layer_idx: SIMD[uint32, 1], is_print_compact: Bool, context: DeviceContextPtr)` --- ## print_kv_cache_paged_generic_cpu `print_kv_cache_paged_generic_cpu[target: StringSlice[StaticConstantOrigin], type: DType, kv_params: KVCacheStaticParams, page_size: Int](valid_lengths: NDBuffer[uint32, 1, origin], kv_collection: PagedKVCacheCollection[type, kv_params, page_size], layer_idx: SIMD[uint32, 1], is_print_compact: Bool, context: DeviceContextPtr)` --- ## print_kv_cache_paged_generic_gpu `print_kv_cache_paged_generic_gpu[target: StringSlice[StaticConstantOrigin], type: DType, kv_params: KVCacheStaticParams, page_size: Int](valid_lengths: NDBuffer[uint32, 1, origin], kv_collection: PagedKVCacheCollection[type, kv_params, page_size], layer_idx: SIMD[uint32, 1], is_print_compact: Bool, context: DeviceContextPtr)` --- ## rms_norm_kv_cache_ragged_continuous_batching `rms_norm_kv_cache_ragged_continuous_batching[type: DType, num_heads: Int, head_dim: Int, //, target: StringSlice[StaticConstantOrigin], multiply_before_cast: Bool, per_head_norm: Bool](kv_collection: ContinuousBatchingKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim))], gamma: NDBuffer[type, 1, origin, shape, strides], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], layer_idx: SIMD[uint32, 1], total_seq_len: SIMD[uint32, 1], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], context: DeviceContextPtr)` Performs RMSNorm in place on new entries in the key cache. This is done by first creating a ragged tensor with shape (total\_seq\_len, num\_heads, head\_dim) over the new token entries. To do this we need to pass in `total_seq_len` on the host. Then, using `input_row_offsets` we find the corresponding batch and token index, and use that together with the static head and channel indices to store to/load from the key cache. This uses the input/output lambdas on the RMSNorm kernel. This function can apply RMSNorm to a subset of dimensions in each head, determined by the size of the gamma tensor.
In this case, it operates on a ragged tensor view of the key cache with shape (total\_seq\_len, num\_heads, rms\_norm\_cols), where rms\_norm\_cols is the length of gamma and must be no larger than head\_dim. --- ## rms_norm_kv_cache_ragged_paged `rms_norm_kv_cache_ragged_paged[type: DType, num_heads: Int, head_dim: Int, //, target: StringSlice[StaticConstantOrigin], multiply_before_cast: Bool, per_head_norm: Bool](kv_collection: PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size], gamma: NDBuffer[type, 1, origin, shape, strides], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], layer_idx: SIMD[uint32, 1], total_seq_len: SIMD[uint32, 1], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], context: DeviceContextPtr)` Performs RMSNorm in place on new entries in the key cache. This is done by first creating a ragged tensor with shape (total\_seq\_len, num\_heads, head\_dim) over the new token entries. To do this we need to pass in `total_seq_len` on the host. Then, using `input_row_offsets` we find the corresponding batch and token index, and use that together with the static head and channel indices to store to/load from the key cache. This uses the input/output lambdas on the RMSNorm kernel. This function can apply RMSNorm to a subset of dimensions in each head, determined by the size of the gamma tensor. In this case, it operates on a ragged tensor view of the key cache with shape (total\_seq\_len, num\_heads, rms\_norm\_cols), where rms\_norm\_cols is the length of gamma and must be no larger than head\_dim. --- ## generic_cross_attention_kv_cache `generic_cross_attention_kv_cache[collection_t: KVCollectionT, type: DType, //, target: StringSlice[StaticConstantOrigin], mask_str: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], local_window_size: Int = -1](q: NDBuffer[type, 3, origin, shape, strides], q_input_row_offsets: ManagedTensorSlice[io_spec, static_spec=static_spec], q_max_seq_len: NDBuffer[uint32, 1, origin, shape, strides], kv_input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], scale: SIMD[float32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_flare_mla_decode_kv_cache_ragged `generic_flare_mla_decode_kv_cache_ragged[collection_t: KVCollectionT, type: DType, //, mask_str: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], target: StringSlice[StaticConstantOrigin], local_window_size: Int = -1](q: NDBuffer[type, 3, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], scale: SIMD[float32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_flare_mla_decompress_k_cache_ragged_paged `generic_flare_mla_decompress_k_cache_ragged_paged[target: StringSlice[StaticConstantOrigin], type: DType](buffer_row_offsets_1d: NDBuffer[uint32, 1, origin, shape, strides], cache_offsets_1d: NDBuffer[uint32, 1, origin, shape, strides], buffer_length: SIMD[int32, 1], weight: NDBuffer[type, 2, origin, shape, strides], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size], layer_idx: SIMD[uint32, 1], k_latent_buffer: NDBuffer[type, 2, origin, shape, strides], k_buffer: NDBuffer[type, 2, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_flare_mla_prefill_kv_cache_ragged `generic_flare_mla_prefill_kv_cache_ragged[collection_t: KVCollectionT, type:
DType, //, softmax_type: DType, write_softmax_info: Bool, use_cascade_attention: Bool, mask_str: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], target: StringSlice[StaticConstantOrigin], local_window_size: Int = -1](q: NDBuffer[type, 3, origin, shape, strides], k: NDBuffer[type, 3, origin, shape, strides], v: NDBuffer[type, 3, origin, shape, strides], buffer_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], cache_offsets: NDBuffer[uint32, 1, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], scale: SIMD[float32, 1], output: NDBuffer[type, 3, origin, shape, strides], softmax_info: NDBuffer[softmax_type, 3, MutableAnyOrigin], context: DeviceContextPtr, prev_output: OptionalReg[NDBuffer[type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[type, 3, MutableAnyOrigin]]({:i1 0, 1}), prev_softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}))` --- ## generic_flare_mla_prefill_ragged_paged_plan `generic_flare_mla_prefill_ragged_paged_plan[target: StringSlice[StaticConstantOrigin]](input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size], layer_idx: SIMD[uint32, 1], buffer_token_size: SIMD[uint32, 1], buffer_row_offsets: NDBuffer[uint32, 2, origin, shape, strides], cache_offsets: NDBuffer[uint32, 2, origin, shape, strides], buffer_lengths: NDBuffer[int32, 1, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_flash_attention_kv_cache_ragged `generic_flash_attention_kv_cache_ragged[collection_t: KVCollectionT, type: DType, //, *, target: StringSlice[StaticConstantOrigin], mask_str: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], local_window_size: Int = -1](q: NDBuffer[type, 3, origin, shape, strides], input_row_offsets: ManagedTensorSlice[io_spec, static_spec=static_spec], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], scale: SIMD[float32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_fused_qk_rope_bshd_continuous_batch_ragged `generic_fused_qk_rope_bshd_continuous_batch_ragged[type: DType, //, *, interleaved: Bool, target: StringSlice[StaticConstantOrigin]](q_proj: NDBuffer[type, 3, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: ContinuousBatchingKVCacheCollection[type_, kv_params_], freqs_cis: NDBuffer[type, 2, origin, shape, strides], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_fused_qk_rope_bshd_paged_ragged `generic_fused_qk_rope_bshd_paged_ragged[type: DType, //, *, interleaved: Bool, target: StringSlice[StaticConstantOrigin]](q_proj: NDBuffer[type, 3, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size], freqs_cis: NDBuffer[type, 2, origin, shape, strides], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: DeviceContextPtr = DeviceContextPtr())` Performs a fused RoPE projection for Q and K projections. We have a manually fused QKV projection with mo.opaque types in our Llama model. Due to a limitation in custom op definitions, we can't declare both a tensor and opaque type as output from a custom kernel. 
This requires us to only note Q\_proj as an output from the QKV projection. If we immediately follow the QKV proj kernel with a RoPE kernel applied to K, we'll get a race condition because the graph compiler doesn't know about the dependency between these kernels in the graph definition. Here we fuse the RoPE kernel applied to Q\_proj with K\_proj, so K\_proj RoPE is only executed after QKV completes. --- ## generic_fused_qkv_matmul_kv_cache_cont_batch_ragged `generic_fused_qkv_matmul_kv_cache_cont_batch_ragged[type: DType, //, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](hidden_state: NDBuffer[type, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[type, 2, origin, shape], kv_collection: ContinuousBatchingKVCacheCollection[type_, kv_params_], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 2, origin, shape], ctx: DeviceContextPtr)` Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,). The value at each index is the start\_idx of the corresponding batch in hidden\_state. * ​weight (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`ContinuousBatchingKVCacheCollection[type_, kv_params_]`): The object storing the KVCache for this layer. * ​layer\_idx (`SIMD[uint32, 1]`): The current layer, used to retrieve the KVCache object from kv\_collection. * ​output (`NDBuffer[type, 2, origin, shape]`): The pre-allocated output buffer for Q projections. K and V projections are written in-place to k\_cache and v\_cache. Shape: (sum(seq\_lens), num\_heads \* head\_size). * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## generic_fused_qkv_matmul_kv_cache_paged_ragged `generic_fused_qkv_matmul_kv_cache_paged_ragged[type: DType, weight_type: DType, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), group_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), has_zp: OptionalReg[Bool] = OptionalReg[Bool]({:i1 0, 1})](hidden_state: NDBuffer[type, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[weight_type, 2, origin, shape], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 2, origin, shape], ctx: DeviceContextPtr)` Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,). The value at each index is the start\_idx of the corresponding batch in hidden\_state. * ​weight (`NDBuffer[weight_type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`PagedKVCacheCollection[type_, kv_params_, page_size]`): The object storing the KVCache for this layer. 
* ​layer\_idx (`SIMD[uint32, 1]`): The current layer, used to retrieve the KVCache object from kv\_collection. * ​output (`NDBuffer[type, 2, origin, shape]`): The pre-allocated output buffer for Q projections. K and V projections are written in-place to k\_cache and v\_cache. Shape: (sum(seq\_lens), num\_heads \* head\_size). * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## generic_fused_qkv_matmul_kv_cache_paged_ragged_bias `generic_fused_qkv_matmul_kv_cache_paged_ragged_bias[type: DType, weight_type: DType, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), group_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), has_zp: OptionalReg[Bool] = OptionalReg[Bool]({:i1 0, 1})](hidden_state: NDBuffer[type, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[weight_type, 2, origin, shape], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 2, origin, shape], bias: NDBuffer[type, 1, origin], ctx: DeviceContextPtr)` Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,). The value at each index is the start\_idx of the corresponding batch in hidden\_state. * ​weight (`NDBuffer[weight_type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`PagedKVCacheCollection[type_, kv_params_, page_size]`): The object storing the KVCache for this layer. * ​layer\_idx (`SIMD[uint32, 1]`): The current layer, used to retrieve the KVCache object from kv\_collection. * ​output (`NDBuffer[type, 2, origin, shape]`): The pre-allocated output buffer for Q projections. K and V projections are written in-place to k\_cache and v\_cache. Shape: (sum(seq\_lens), num\_heads \* head\_size). * ​bias (`NDBuffer[type, 1, origin]`): Bias to be added to the QKV tensor; the bias is the concatenation of the q, k, and v biases. Rank 1. * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## generic_fused_qkv_matmul_kv_cache_paged_ragged_scale `generic_fused_qkv_matmul_kv_cache_paged_ragged_scale[type: DType, weight_type: DType, output_type: DType, scale_type: DType, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](hidden_state: NDBuffer[type, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[weight_type, 2, origin, shape], input_scale: NDBuffer[scale_type, 2, origin, shape], weight_scale: NDBuffer[scale_type, 2, origin, shape], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size], layer_idx: SIMD[uint32, 1], output: NDBuffer[output_type, 2, origin, shape], ctx: DeviceContextPtr)` Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,). The value at each index is the start\_idx of the corresponding batch in hidden\_state.
* ​weight (`NDBuffer[weight_type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​input\_scale (`NDBuffer[scale_type, 2, origin, shape]`): Scale to be multiplied with the input tensor. * ​weight\_scale (`NDBuffer[scale_type, 2, origin, shape]`): Scale to be multiplied with the weight tensor. * ​kv\_collection (`PagedKVCacheCollection[type_, kv_params_, page_size]`): The object storing the KVCache for this layer. * ​layer\_idx (`SIMD[uint32, 1]`): The current layer, used to retrieve the KVCache object from kv\_collection. * ​output (`NDBuffer[output_type, 2, origin, shape]`): The pre-allocated output buffer for Q projections. K and V projections are written in-place to k\_cache and v\_cache. Shape: (sum(seq\_lens), num\_heads \* head\_size). * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## kv_cache_ragged ## Functions * [​`generic_cross_attention_kv_cache`](./generic_cross_attention_kv_cache): * [​`generic_flare_mla_decode_kv_cache_ragged`](./generic_flare_mla_decode_kv_cache_ragged): * [​`generic_flare_mla_decompress_k_cache_ragged_paged`](./generic_flare_mla_decompress_k_cache_ragged_paged): * [​`generic_flare_mla_prefill_kv_cache_ragged`](./generic_flare_mla_prefill_kv_cache_ragged): * [​`generic_flare_mla_prefill_ragged_paged_plan`](./generic_flare_mla_prefill_ragged_paged_plan): * [​`generic_flash_attention_kv_cache_ragged`](./generic_flash_attention_kv_cache_ragged): * [​`generic_fused_qk_rope_bshd_continuous_batch_ragged`](./generic_fused_qk_rope_bshd_continuous_batch_ragged): * [​`generic_fused_qk_rope_bshd_paged_ragged`](./generic_fused_qk_rope_bshd_paged_ragged): Performs a fused RoPE projection for Q and K projections. * [​`generic_fused_qkv_matmul_kv_cache_cont_batch_ragged`](./generic_fused_qkv_matmul_kv_cache_cont_batch_ragged): Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. * [​`generic_fused_qkv_matmul_kv_cache_paged_ragged`](./generic_fused_qkv_matmul_kv_cache_paged_ragged): Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. * [​`generic_fused_qkv_matmul_kv_cache_paged_ragged_bias`](./generic_fused_qkv_matmul_kv_cache_paged_ragged_bias): Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. * [​`generic_fused_qkv_matmul_kv_cache_paged_ragged_scale`](./generic_fused_qkv_matmul_kv_cache_paged_ragged_scale): Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. * [​`k_matmul_ragged_paged`](./k_matmul_ragged_paged): Performs a matmul, writing the output into a mutable PagedKVCacheCollection object. * [​`kv_matmul_ragged_paged`](./kv_matmul_ragged_paged): Performs a matmul, writing the output into a mutable PagedKVCacheCollection object. * [​`unfused_qkv_matmul_ragged_paged_gguf_quantized`](./unfused_qkv_matmul_ragged_paged_gguf_quantized): Performs a quantized matmul, writing the output into a mutable PagedKVCacheCollection object.
* [​`valid_length_managed_tensor_slice_to_ndbuffer`](./valid_length_managed_tensor_slice_to_ndbuffer): --- ## k_matmul_ragged_paged `k_matmul_ragged_paged[type: DType, num_heads: Int, head_dim: Int, page_size: Int, //, target: StringSlice[StaticConstantOrigin]](hidden_state: NDBuffer[type, 2, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[type, 2, origin, shape, strides], kv_collection: PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size], layer_idx: SIMD[uint32, 1], ctx: DeviceContextPtr)` Performs a matmul, writing the output into a mutable PagedKVCacheCollection object. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape, strides]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,) denoting the start of each sequence along the seq\_len dimension. * ​weight (`NDBuffer[type, 2, origin, shape, strides]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size]`): The historical KVCache for keys and values. The KVCache for this layer is retrieved via layer\_idx. * ​layer\_idx (`SIMD[uint32, 1]`): The index of the layer being executed. Used to retrieve the KVCache for the given layer from kv\_collection. * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## kv_matmul_ragged_paged `kv_matmul_ragged_paged[type: DType, num_heads: Int, head_dim: Int, page_size: Int, //, target: StringSlice[StaticConstantOrigin]](hidden_state: NDBuffer[type, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[type, 2, origin, shape], kv_collection: PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size], layer_idx: SIMD[uint32, 1], ctx: DeviceContextPtr)` Performs a matmul, writing the output into a mutable PagedKVCacheCollection object. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,) denoting the start of each sequence along the seq\_len dimension. * ​weight (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size]`): The historical KVCache for keys and values. The KVCache for this layer is retrieved via layer\_idx. * ​layer\_idx (`SIMD[uint32, 1]`): The index of the layer being executed. Used to retrieve the KVCache for the given layer from kv\_collection. * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler.
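As documented above, `input_row_offsets` has shape (batch\_size + 1,) and stores where each sequence starts in the ragged `hidden_state`. A minimal sketch (hypothetical values; a plain `List` stands in for the real `NDBuffer`) of how per-sequence lengths fall out of the offsets:

```mojo
def main():
    # Ragged batch of 3 sequences with lengths 4, 1, and 7,
    # so sum(seq_lens) == 12 and the final offset is 12.
    var input_row_offsets = List[Int](0, 4, 5, 12)
    for i in range(len(input_row_offsets) - 1):
        # Sequence i occupies rows [offsets[i], offsets[i + 1]) of hidden_state.
        var start = input_row_offsets[i]
        var seq_len = input_row_offsets[i + 1] - input_row_offsets[i]
        print("sequence", i, "starts at row", start, "and spans", seq_len, "rows")
```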
--- ## unfused_qkv_matmul_ragged_paged_gguf_quantized `unfused_qkv_matmul_ragged_paged_gguf_quantized[type: DType, num_heads: Int, head_dim: Int, page_size: Int, //, quantization_encoding_q: StringSlice[StaticConstantOrigin], quantization_encoding_k: StringSlice[StaticConstantOrigin], quantization_encoding_v: StringSlice[StaticConstantOrigin]](hidden_state: NDBuffer[float32, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], q_weight: NDBuffer[uint8, 2, origin, shape], k_weight: NDBuffer[uint8, 2, origin, shape], v_weight: NDBuffer[uint8, 2, origin, shape], kv_collection: PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size], layer_idx: SIMD[uint32, 1], output: NDBuffer[float32, 2, origin, shape], ctx: DeviceContextPtr)` Performs a quantized matmul, writing the output into a mutable PagedKVCacheCollection object. Unlike the un-quantized version (kv\_matmul\_ragged\_continuous\_batching), this implementation does not concatenate the q, k, and v weights. Instead, it performs three matmuls. This allows the q, k, and v weights to have different quantization encodings. This is only supported on CPU. **Args:** * ​hidden\_state (`NDBuffer[float32, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,) denoting the start of each sequence along the seq\_len dimension. * ​q\_weight (`NDBuffer[uint8, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​k\_weight (`NDBuffer[uint8, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​v\_weight (`NDBuffer[uint8, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size]`): The Collection object storing KVCache entries. * ​layer\_idx (`SIMD[uint32, 1]`): The index of the layer being executed. Used to retrieve the KVCache for the given layer from kv\_collection. * ​output (`NDBuffer[float32, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_kv\_heads \* head\_size). This is the output buffer for the Q matmul. * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler.
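Since the q, k, and v weights stay separate here (so each can carry its own GGUF encoding), the kernel amounts to three independent matmuls over the same ragged token rows. A rough bookkeeping sketch, with every dimension hypothetical rather than taken from this API:

```mojo
def main():
    # Hypothetical sizes; the real shapes come from the kernel's arguments.
    var num_tokens = 12   # sum(seq_lens) across the ragged batch
    var q_cols = 8 * 64   # columns of the Q projection result
    var kv_cols = 2 * 64  # columns of the K and V projection results
    # One matmul per projection: Q lands in `output`, K and V are written
    # into the paged KV cache rather than returned.
    print("q matmul result:", num_tokens, "x", q_cols, "-> output buffer")
    print("k matmul result:", num_tokens, "x", kv_cols, "-> key cache")
    print("v matmul result:", num_tokens, "x", kv_cols, "-> value cache")
```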
--- ## valid_length_managed_tensor_slice_to_ndbuffer `valid_length_managed_tensor_slice_to_ndbuffer(tensor: ManagedTensorSlice[io_spec, static_spec=static_spec]) -> NDBuffer[uint32, 1, MutableAnyOrigin]` --- ## flash_attention `flash_attention[rank: Int, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), UInt(1), FlashAttentionAlgorithm()), decoding_warp_split_k: Bool = False, naive_kernel: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: NDBuffer[type, rank, origin, shape, strides], v: NDBuffer[type, rank, origin, shape, strides], mask: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], scale: SIMD[float32, 1], context: DeviceContextPtr = DeviceContextPtr(), num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` `flash_attention[rank: Int, cache_t: KVCacheT, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), UInt(1), FlashAttentionAlgorithm()), ragged: Bool = False, decoding_warp_split_k: Bool = False, naive_kernel: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: cache_t, v: cache_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: ManagedTensorSlice[io_spec, static_spec=static_spec], scale: SIMD[float32, 1], ctx: DeviceContext, q_max_seq_len: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` Flash attention 2 algorithm. Compute: (1) Transpose (Q) BSHD -> BHSD; (2) Transpose (K) BSHD -> BHSD; (3) Transpose (V) BSHD -> BHSD; (4) P = Bmm(Q, K), P is also called "score"; (5) P = P \* scale + mask; (6) P = softmax(P); (7) O = Bmm(P, V); (8) Output = Transpose(O). B, S, H, D denote batch size, sequence length, head count and depth, respectively. (1), (2), (3) happen while loading the data into shared memory. (8) happens when writing output to global memory. All inputs (query, key, and value) must have BSHD layout. The mask can be BSS or BHSS. This kernel also handles the grouped attention optimization. In this case the shapes of K and V are BShD where h = H / num\_groups.
This kernel handles batches with different valid lengths (i.e., before padding). Such lengths are passed in the valid\_length argument. `flash_attention[rank: Int, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), UInt(1), FlashAttentionAlgorithm()), decoding_warp_split_k: Bool = False, _use_valid_length: Bool = False, _padded_ndbuffer: Bool = False, naive_kernel: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: NDBuffer[type, rank, origin, shape, strides], v: NDBuffer[type, rank, origin, shape, strides], mask_functor: mask_t, score_mod_functor: score_mod_t, scale: SIMD[float32, 1], ctx: DeviceContext, num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), valid_length: OptionalReg[ManagedTensorSlice[IOSpec(), static_spec=create_unknown()]] = OptionalReg[ManagedTensorSlice[IOSpec(), static_spec=create_unknown()]]({:i1 0, 1}))` --- ## flash_attention_dispatch `flash_attention_dispatch[rank: Int, k_t: MHAOperand, v_t: MHAOperand, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, kv_num_heads: Int, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), UInt(1), FlashAttentionAlgorithm()), ragged: Bool = False, _is_flash_attention_applicable: Bool = True, _is_cache_length_accurate: Bool = False, _use_valid_length: Bool = True, _padded_ndbuffer: Bool = False, decoding_warp_split_k: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: k_t, v: v_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: ManagedTensorSlice[io_spec, static_spec=static_spec], max_prompt_len: Int, max_cache_valid_length: Int, scale: SIMD[float32, 1], is_token_generation: Bool, ctx: DeviceContext, kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` --- ## flash_attention_hw_supported `flash_attention_hw_supported[qkv_type: DType]() -> Bool` --- ## get_mha_decoding_num_partitions `get_mha_decoding_num_partitions[num_heads: Int, group: Int](batch_size: Int, num_keys: Int, ctx: DeviceContext) -> Int` --- ## mha ## Functions * [​`flash_attention`](./flash_attention): * [​`flash_attention_dispatch`](./flash_attention_dispatch): *
[​`flash_attention_hw_supported`](./flash_attention_hw_supported): * [​`get_mha_decoding_num_partitions`](./get_mha_decoding_num_partitions): * [​`managed_tensor_slice_to_ndbuffer`](./managed_tensor_slice_to_ndbuffer): * [​`mha`](./mha): * [​`mha_decoding`](./mha_decoding): * [​`mha_decoding_single_batch`](./mha_decoding_single_batch): Flash attention v2 algorithm. * [​`mha_decoding_single_batch_pipelined`](./mha_decoding_single_batch_pipelined): Flash attention v2 algorithm. * [​`mha_gpu_naive`](./mha_gpu_naive): * [​`mha_single_batch`](./mha_single_batch): MHA for token gen where seqlen = 1 and num\_keys >= 1. * [​`mha_single_batch_pipelined`](./mha_single_batch_pipelined): MHA for token gen where seqlen = 1 and num\_keys >= 1. * [​`mha_splitk_reduce`](./mha_splitk_reduce): * [​`scale_and_mask_helper`](./scale_and_mask_helper): --- ## managed_tensor_slice_to_ndbuffer `managed_tensor_slice_to_ndbuffer[: DType, : Int, spec: StaticTensorSpec[$0, $1], //](tensor: ManagedTensorSlice[io_spec, static_spec=spec]) -> NDBuffer[dtype, rank, MutableAnyOrigin, spec.shape, spec.strides, alignment=spec.alignment, address_space=spec.address_space, exclusive=spec.exclusive]` --- ## mha `mha[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, config: MHAConfig, group: Int = 1, use_score_mod: Bool = False, ragged: Bool = False, is_shared_kv: Bool = False, _use_valid_length: Bool = False, _is_cache_length_accurate: Bool = False, _padded_ndbuffer: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], scale: SIMD[float32, 1], batch_size: Int, seq_len_arg: Int, num_keys_arg: Int, valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]], mask: mask_t, score_mod: score_mod_t)` --- ## mha_decoding `mha_decoding[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, BM: UInt, BN: UInt, BK: UInt, WM: UInt, WN: UInt, depth: UInt, num_heads: UInt, num_threads: UInt, num_pipeline_stages: UInt, group: UInt = UInt(1), use_score_mod: Bool = False, ragged: Bool = False, is_shared_kv: Bool = False, _use_valid_length: Bool = False, _is_cache_length_accurate: Bool = False, decoding_warp_split_k: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], batch_size: Int, num_partitions: Int, max_cache_valid_length: Int, valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], mask: mask_t, score_mod: score_mod_t)` --- ## mha_decoding_single_batch `mha_decoding_single_batch[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, BM: UInt, BN: UInt, BK: UInt, WM: UInt, WN: UInt, depth: UInt, num_heads: UInt, num_threads: UInt, num_pipeline_stages: UInt, group: UInt = UInt(1), use_score_mod: Bool = False, decoding_warp_split_k: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], num_keys: UInt, num_partitions: UInt, max_cache_valid_length: UInt, mask: mask_t, score_mod: 
score_mod_t, batch_idx: Int)` Flash attention v2 algorithm. --- ## mha_decoding_single_batch_pipelined `mha_decoding_single_batch_pipelined[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, BM: UInt, BN: UInt, BK: UInt, WM: UInt, WN: UInt, depth: UInt, num_heads: UInt, num_threads: UInt, num_pipeline_stages: UInt, group: UInt = UInt(1), use_score_mod: Bool = False, decoding_warp_split_k: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], num_keys: UInt, num_partitions: UInt, max_cache_valid_length: UInt, mask: mask_t, score_mod: score_mod_t, batch_idx: Int)` Flash attention v2 algorithm. --- ## mha_gpu_naive `mha_gpu_naive[output_type: DType, k_t: MHAOperand, v_t: MHAOperand, mask_t: MHAMask, rank: Int, //, ragged: Bool = False, _use_valid_length: Bool = False, _is_cache_length_accurate: Bool = False](q: NDBuffer[type, rank, origin, shape, strides], k: k_t, v: v_t, mask_functor: mask_t, output: NDBuffer[output_type, rank, origin, shape, strides], valid_length: ManagedTensorSlice[io_spec, static_spec=static_spec], scale: SIMD[float32, 1], batch_size: Int, max_prompt_len: Int, max_cache_size: Int, num_heads: Int, depth: Int, group: Int, ctx: DeviceContext)` `mha_gpu_naive[q_type: DType, k_type: DType, v_type: DType, output_type: DType, rank: Int, mask_type: DType, mask_rank: Int, //](q: NDBuffer[q_type, rank, origin, shape, strides], k: NDBuffer[k_type, rank, origin, shape, strides], v: NDBuffer[v_type, rank, origin, shape, strides], mask: NDBuffer[mask_type, mask_rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], output: NDBuffer[output_type, rank, origin, shape, strides], scale: SIMD[float32, 1], batch_size: Int, seq_len: Int, num_keys: Int, num_heads: Int, depth: Int, group: Int, ctx: DeviceContext)` `mha_gpu_naive[q_type: DType, output_type: DType, cache_t: KVCacheT, mask_t: MHAMask, rank: Int, //, ragged: Bool = False](q: NDBuffer[q_type, rank, origin, shape, strides], k: cache_t, v: cache_t, mask_functor: mask_t, output: NDBuffer[output_type, rank, origin, shape, strides], valid_length: ManagedTensorSlice[io_spec, static_spec=static_spec], scale: SIMD[float32, 1], batch_size: Int, max_prompt_len: Int, max_cache_size: Int, num_heads: Int, depth: Int, group: Int, ctx: DeviceContext)` --- ## mha_single_batch `mha_single_batch[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, config: MHAConfig, group: Int = 1, use_score_mod: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], scale: SIMD[float32, 1], seq_len: Int, max_seq_len: Int, start_pos: SIMD[uint32, 1], num_keys: Int, mask_tensor_col: Int, mask: mask_t, score_mod: score_mod_t, batch_idx: Int)` MHA for token gen where seqlen = 1 and num\_keys >= 1. The general data layout and steps conform to flash attention, with two exceptions: (1) partitioning is across B, H, and num\_keys (TODO; the last one is split-K and will need a separate reduction kernel at the end); (2) the first bmm becomes a gemv and the second bmm becomes a gevm. TODO: use more optimized kernels for them.
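To see why the two batched matmuls degenerate in the token-generation case, note that each head's query is a single row when seqlen = 1. With $Q \in \mathbb{R}^{1 \times D}$ and $K, V \in \mathbb{R}^{S \times D}$ (where $S$ = num\_keys):

$$
P = Q K^{\top} \in \mathbb{R}^{1 \times S} \;\;\text{(gemv)}, \qquad O = \operatorname{softmax}(P)\, V \in \mathbb{R}^{1 \times D} \;\;\text{(gevm)}.
$$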
--- ## mha_single_batch_pipelined `mha_single_batch_pipelined[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, config: MHAConfig, group: Int = 1, use_score_mod: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], scale: SIMD[float32, 1], seq_len: Int, max_seq_len: Int, start_pos: SIMD[uint32, 1], num_keys: Int, mask_tensor_col: Int, mask: mask_t, score_mod: score_mod_t, batch_idx: Int)` MHA for token gen where seqlen = 1 and num\_keys >= 1. The general data layout and steps conform to flash attention, with two exceptions: (1) partitioning is across B, H, and num\_keys (TODO; the last one is split-K and will need a separate reduction kernel at the end); (2) the first bmm becomes a gemv and the second bmm becomes a gevm. TODO: use more optimized kernels for them. --- ## mha_splitk_reduce `mha_splitk_reduce[output_type: DType, depth: UInt, num_heads: UInt, num_threads: UInt, group: UInt = UInt(1), use_exp2: Bool = False](intermediate_ptr: UnsafePointer[SIMD[output_type, 1]], output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], batch_size: Int, num_partitions: Int)` --- ## scale_and_mask_helper `scale_and_mask_helper[p_type: DType, p_layout: Layout, mask_t: MHAMask, score_mod_t: ScoreModTrait, group: Int, num_n_mmas: Int, WN: Int, MMA_N: Int, simd_width: Int, use_score_mod: Bool = False](p_reg_tile: LayoutTensor[p_type, p_layout, origin, address_space=AddressSpace(5)], scale: SIMD[float32, 1], num_keys: UInt, bound: UInt, lane: UInt, warp: UInt, mask: mask_t, score_mod: score_mod_t, kv_tile_start_row: Int, mask_stride: UInt, max_seq_len: Int)` --- ## mha_cross ## Functions * [​`mha_cross_gpu_naive`](./mha_cross_gpu_naive): Naive cross attention on GPU. --- ## mha_cross_gpu_naive `mha_cross_gpu_naive[cache_t: KVCacheT, mask_t: MHAMask, type: DType, q_shape: DimList, //, rank: Int](output: NDBuffer[type, rank, MutableAnyOrigin, shape, strides], q: NDBuffer[type, rank, MutableAnyOrigin, q_shape, strides], q_input_row_offsets: NDBuffer[uint32, 1, MutableAnyOrigin, shape, strides], q_max_seq_len: Int, k: cache_t, v: cache_t, kv_input_row_offsets: NDBuffer[uint32, 1, MutableAnyOrigin, shape, strides], mask_functor: mask_t, scale: SIMD[float32, 1], ctx: DeviceContext)` Naive cross attention on GPU. Note that this assumes ragged tensor inputs and uses a mask functor. Computes: (1) Transpose (Q) BSHD -> BHSD; (2) Transpose (K) BSHD -> BHSD; (3) Transpose (V) BSHD -> BHSD; (4) P = Bmm(Q, K), P is also called "score"; (5) P = P \* scale + mask; (6) P = softmax(P); (7) O = Bmm(P, V); (8) Output = Transpose(O). B, S, H, D denote batch size, sequence length, head count, and depth, respectively. Steps (1), (2), and (3) happen while loading the data into shared memory; step (8) happens when writing the output to global memory. All inputs (query, key, and value) must have BSHD layout. The mask can be BSS or BHSS. This kernel also handles the grouped-attention optimization, in which case the shapes of K and V are BShD, where h = H / num\_groups. --- ## AndMask `@register_passable(trivial)` `struct AndMask[T: MHAMask, S: MHAMask, //, lhs: T, rhs: S]` Mask that's the AND of two masks.
## Implemented traits `AnyType`, `Copyable`, `MHAMask`, `Movable`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask = get_vtable_entry(:trait T, "apply_log2e_after_mask") if get_vtable_entry(:trait T, "apply_log2e_after_mask") else get_vtable_entry(:trait S, "apply_log2e_after_mask")` ### `mask_out_of_bound` `alias mask_out_of_bound = get_vtable_entry(:trait T, "mask_out_of_bound") if get_vtable_entry(:trait T, "mask_out_of_bound") else get_vtable_entry(:trait S, "mask_out_of_bound")` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds = get_vtable_entry(:trait S, "mask_safe_out_of_bounds") if get_vtable_entry(:trait T, "mask_safe_out_of_bounds") else get_vtable_entry(:trait T, "mask_safe_out_of_bounds")` ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` ### `status` `status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` --- ## CausalMask `@register_passable(trivial)` `struct CausalMask` MHA causal mask ensures a token is only affected by previous tokens. ## Implemented traits `AnyType`, `Copyable`, `MHAMask`, `Movable`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask = False` ### `mask_out_of_bound` `alias mask_out_of_bound = is_nvidia_gpu()` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds = True` ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` ### `status` `status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` --- ## ChunkedCausalMask `ChunkedCausalMask[local_window_size: Int]() -> OrMask[CausalMask(), ChunkedMask()]` Mask implementing Chunked Causal attention for Llama4 models. This groups the mask into chunks of size `local_window_size` and performs causal attention within each local chunk. Considering the following case: * Q\_len = 7 * K\_len = 10 * start\_pos = 3 * local\_window\_size = 4 The mask will be applied as follows:

```
K > 0 1 2 3 4 5 6 7 8 9
Q v x--------------------x
0 | 1 1 1 1 0 0 0 0 0 0
1 | 0 0 0 0 1 0 0 0 0 0
2 | 0 0 0 0 1 1 0 0 0 0
3 | 0 0 0 0 1 1 1 0 0 0
4 | 0 0 0 0 1 1 1 1 0 0
5 | 0 0 0 0 0 0 0 0 1 0
6 | 0 0 0 0 0 0 0 0 1 1
```
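A minimal usage sketch (hypothetical: the import path for the mask types depends on where the kernels package lives in your checkout, and in practice the surrounding kernel constructs the mask for you):

```mojo
from utils.index import IndexList

# Hypothetical helper; any kernel parameterized on MHAMask could do this.
fn apply_chunked_causal(scores: SIMD[DType.float32, 4]) -> SIMD[DType.float32, 4]:
    var m = ChunkedCausalMask[local_window_size=4]()
    # coord is (batch, head, q_idx, k_idx); the 4 lanes cover k_idx..k_idx+3.
    # For q_idx = 6 and k_idx = 4, keys 4..6 share the query's chunk
    # (positions 4-7) and are causally visible, while key 7 is masked out.
    return m.mask(IndexList[4, element_type = DType.uint32](0, 0, 6, 4), scores)
```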
--- ## ChunkedMask `@register_passable(trivial)` `struct ChunkedMask[local_window_size: Int]` Mask implementing Chunked attention. This groups the mask into chunks of size `local_window_size`. Considering the following case: * Q\_len = 7 * K\_len = 10 * local\_window\_size = 4 The mask will be applied as follows:

```
K > 0 1 2 3 4 5 6 7 8 9
Q v x--------------------x
0 | 1 1 1 1 0 0 0 0 0 0
1 | 0 0 0 0 1 1 1 1 0 0
2 | 0 0 0 0 1 1 1 1 0 0
3 | 0 0 0 0 1 1 1 1 0 0
4 | 0 0 0 0 1 1 1 1 0 0
5 | 0 0 0 0 0 0 0 0 1 1
6 | 0 0 0 0 0 0 0 0 1 1
```

## Implemented traits `AnyType`, `Copyable`, `MHAMask`, `Movable`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask = False` ### `mask_out_of_bound` `alias mask_out_of_bound = True` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds = True` ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` ### `status` `status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` --- ## MHAMask The MHAMask trait describes masks for MHA kernels, such as the causal mask. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask` Does the mask require `log2e` to be applied after the mask, or can it be fused with the scaling? ### `mask_out_of_bound` `alias mask_out_of_bound` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds` Is the mask safe to read out of bounds? ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self: _Self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` Return the mask vector at the given coordinates. Arguments: coord is (seq\_id, head, q\_idx, k\_idx); score\_vec is the vector at `coord` of the score matrix. The functor could capture a mask tensor and add it to the score, e.g. for Replit. ### `status` `status[*, element_type: DType = uint32](self: _Self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` Given a tile's index range, return its masking status. --- ## MaskName `struct MaskName` The name of an attention mask variant. ## Fields * ​name (`String`): ## Implemented traits `AnyType`, `Stringable`, `UnknownDestructibility` ## Aliases ### `CAUSAL` `alias CAUSAL = MaskName(__init__[__mlir_type.!kgen.string]("causal"))` ### `CHUNKED` `alias CHUNKED = MaskName(__init__[__mlir_type.!kgen.string]("chunked"))` ### `CHUNKED_CAUSAL` `alias CHUNKED_CAUSAL = MaskName(__init__[__mlir_type.!kgen.string]("chunked_causal"))` ### `MATERIALIZED` `alias MATERIALIZED = MaskName(__init__[__mlir_type.!kgen.string]("materialized"))` ### `NULL` `alias NULL = MaskName(__init__[__mlir_type.!kgen.string]("null"))` ### `SLIDING_WINDOW_CAUSAL` `alias SLIDING_WINDOW_CAUSAL = MaskName(__init__[__mlir_type.!kgen.string]("sliding_window_causal"))` ## Methods ### `__init__` `__init__(out self, name: String)` ### `__eq__` `__eq__(self, rhs: Self) -> Bool` `__eq__(self, rhs: String) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` ### `__str__` `__str__(self) -> String` --- ## MaterializedMask `@register_passable(trivial)` `struct MaterializedMask[type_: DType, rank_: Int, shape_: DimList]` Mask that's backed by a materialized tensor.
## Fields * ​mask\_tensor (`NDBuffer[type_, rank_, MutableAnyOrigin, shape_]`): * ​start\_pos (`OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]`): * ​is\_multiple\_of\_2 (`Bool`): ## Implemented traits `AnyType`, `Copyable`, `MHAMask`, `Movable`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask = True` ### `mask_out_of_bound` `alias mask_out_of_bound = True` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds = False` ### `MaskType` `alias MaskType = NDBuffer[type_, rank_, MutableAnyOrigin, shape_]` ## Methods ### `__init__` `__init__(mask_tensor: NDBuffer[type_, rank_, MutableAnyOrigin, shape_], start_pos: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1})) -> Self` ### `get_start_pos` `get_start_pos(self, batch_idx: Int) -> Int` ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` ### `status` `status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` --- ## NullMask `@register_passable(trivial)` `struct NullMask` Mask that's effectively a noop. ## Implemented traits `AnyType`, `Copyable`, `MHAMask`, `Movable`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask = False` ### `mask_out_of_bound` `alias mask_out_of_bound = True` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds = True` ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` ### `status` `status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` --- ## OrMask `@register_passable(trivial)` `struct OrMask[T: MHAMask, S: MHAMask, //, lhs: T, rhs: S]` Mask that's the OR of two masks. ## Implemented traits `AnyType`, `Copyable`, `MHAMask`, `Movable`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask = get_vtable_entry(:trait T, "apply_log2e_after_mask") if get_vtable_entry(:trait T, "apply_log2e_after_mask") else get_vtable_entry(:trait S, "apply_log2e_after_mask")` ### `mask_out_of_bound` `alias mask_out_of_bound = get_vtable_entry(:trait S, "mask_out_of_bound") if get_vtable_entry(:trait T, "mask_out_of_bound") else get_vtable_entry(:trait T, "mask_out_of_bound")` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds = get_vtable_entry(:trait S, "mask_safe_out_of_bounds") if get_vtable_entry(:trait T, "mask_safe_out_of_bounds") else get_vtable_entry(:trait T, "mask_safe_out_of_bounds")` ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` ### `status` `status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` --- ## SlidingWindowCausalMask `@register_passable(trivial)` `struct SlidingWindowCausalMask[window_size: Int]` Mask implementing Sliding Window attention. 
Considering the following case: * Q\_len = 7 * K\_len = 7 * window\_size = 3 The mask will be applied as follows:

```
K > 0 1 2 3 4 5 6
Q v x------------x
0 | 1 0 0 0 0 0 0
1 | 1 1 0 0 0 0 0
2 | 1 1 1 0 0 0 0
3 | 0 1 1 1 0 0 0
4 | 0 0 1 1 1 0 0
5 | 0 0 0 1 1 1 0
6 | 0 0 0 0 1 1 1
```

## Implemented traits `AnyType`, `Copyable`, `MHAMask`, `Movable`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask = False` ### `mask_out_of_bound` `alias mask_out_of_bound = True` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds = True` ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` ### `status` `status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` --- ## TileMaskStatus `@register_passable(trivial)` `struct TileMaskStatus` A tile's masking status. ## Fields * ​status (`SIMD[uint8, 1]`): ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `FULL_MASK` `alias FULL_MASK = TileMaskStatus(__init__[__mlir_type.!pop.int_literal](3))` ### `NO_MASK` `alias NO_MASK = TileMaskStatus(__init__[__mlir_type.!pop.int_literal](0))` ### `PARTIAL_MASK` `alias PARTIAL_MASK = TileMaskStatus(__init__[__mlir_type.!pop.int_literal](1))` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` ### `__is__` `__is__(self, rhs: Self) -> Bool` ### `__and__` `__and__(self, rhs: Self) -> Self` ### `__or__` `__or__(self, rhs: Self) -> Self` ### `__is_not__` `__is_not__(self, rhs: Self) -> Bool` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` --- ## mha_mask ## Aliases ### `MASK_VALUE` `alias MASK_VALUE = -10000` ## Structs * [​`AndMask`](./AndMask): Mask that's the AND of two masks. * [​`CausalMask`](./CausalMask): MHA causal mask ensures a token is only affected by previous tokens. * [​`ChunkedMask`](./ChunkedMask): Mask implementing Chunked attention. * [​`MaskName`](./MaskName): The name of an attention mask variant. * [​`MaterializedMask`](./MaterializedMask): Mask that's backed by a materialized tensor. * [​`NullMask`](./NullMask): Mask that's effectively a noop. * [​`OrMask`](./OrMask): Mask that's the OR of two masks. * [​`SlidingWindowCausalMask`](./SlidingWindowCausalMask): Mask implementing Sliding Window attention. * [​`TileMaskStatus`](./TileMaskStatus): A tile's masking status. ## Traits * [​`MHAMask`](./MHAMask): The MHAMask trait describes masks for MHA kernels, such as the causal mask. ## Functions * [​`ChunkedCausalMask`](./ChunkedCausalMask): Mask implementing Chunked Causal attention for Llama4 models.
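The `status` API exists so kernels can skip whole score tiles. A sketch of that pattern follows (hypothetical helper; a (q, k) ordering for `tile_offset` is assumed here, and imports from the kernels package are omitted):

```mojo
from utils.index import IndexList

# Hypothetical tile loop over the KV dimension for one query tile.
fn process_kv_tiles[M: MHAMask](m: M, num_keys: Int, q_start: Int):
    alias BN = 64  # assumed KV tile width
    var kv_start = 0
    while kv_start < num_keys:
        var st = m.status(
            IndexList[2, element_type = DType.uint32](q_start, kv_start),
            IndexList[2, element_type = DType.uint32](1, BN),
        )
        if st is TileMaskStatus.FULL_MASK:
            kv_start += BN  # fully masked: skip this tile entirely
            continue
        # ... compute scores; apply m.mask(...) only when st is
        # TileMaskStatus.PARTIAL_MASK ...
        kv_start += BN
```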
--- ## KVCacheMHAOperand `@register_passable(trivial)` `struct KVCacheMHAOperand[cache_t: KVCacheT]` An implementation for `mo.opaque` KVCacheT arguments to MHA kernels. We can eventually remove this trait and just add it as a sub-trait in the KVCacheT type, but we need to solve some cyclic dependencies first. ## Fields * ​cache (`cache_t`): ## Implemented traits `AnyType`, `Copyable`, `MHAOperand`, `Movable`, `UnknownDestructibility` ## Aliases ### `type` `alias type = get_vtable_entry(:trait cache_t, "type")` ## Methods ### `__init__` `__init__(cache: cache_t) -> Self` ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self, batch_idx: SIMD[uint32, 1], start_tok_idx: SIMD[uint32, 1], head_idx: SIMD[uint32, 1], head_dim_idx: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> UnsafePointer[SIMD[get_vtable_entry(:trait cache_t, "type"), 1]]` ### `cache_length` `cache_length(self, batch_idx: Int) -> Int` ### `max_context_length` `max_context_length(self) -> SIMD[uint32, 1]` --- ## MHAOperand This serves as the trait to support arguments to our MHA kernel. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `type` `alias type` ## Methods ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self: _Self, batch_idx: SIMD[uint32, 1], start_tok_idx: SIMD[uint32, 1], head_idx: SIMD[uint32, 1], head_dim_idx: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> UnsafePointer[SIMD[get_vtable_entry(:trait _Self, "type"), 1]]` ### `cache_length` `cache_length(self: _Self, batch_idx: Int) -> Int` Returns the length of the cache for a given batch index. ### `max_context_length` `max_context_length(self: _Self) -> SIMD[uint32, 1]` Returns the maximum cache length across the batch. --- ## NDBufferMHAOperand `@register_passable(trivial)` `struct NDBufferMHAOperand[type_: DType, rank: Int, shape: DimList, stride: DimList]` An implementation for NDBuffer arguments to MHA kernels. ## Fields * ​buffer (`NDBuffer[type_, rank, MutableAnyOrigin, shape, stride]`): ## Implemented traits `AnyType`, `Copyable`, `MHAOperand`, `Movable`, `UnknownDestructibility` ## Aliases ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(buffer: NDBuffer[type_, rank, MutableAnyOrigin, shape, stride]) -> Self` ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self, batch_idx: SIMD[uint32, 1], start_tok_idx: SIMD[uint32, 1], head_idx: SIMD[uint32, 1], head_dim_idx: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> UnsafePointer[SIMD[type_, 1]]` ### `cache_length` `cache_length(self, batch_idx: Int) -> Int` ### `max_context_length` `max_context_length(self) -> SIMD[uint32, 1]` --- ## RaggedMHAOperand `@register_passable(trivial)` `struct RaggedMHAOperand[type_: DType, shape: DimList, stride: DimList]` An implementation for ragged NDBuffer arguments to MHA kernels. ## Fields * ​buffer (`NDBuffer[type_, 3, MutableAnyOrigin, shape, stride]`): * ​cache\_row\_offsets (`NDBuffer[uint32, 1, MutableAnyOrigin]`): ## Implemented traits `AnyType`, `Copyable`, `MHAOperand`, `Movable`, `UnknownDestructibility` ## Aliases ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(buffer: NDBuffer[type_, 3, MutableAnyOrigin, shape, stride], cache_row_offsets: NDBuffer[uint32, 1, MutableAnyOrigin, shape, strides]) -> Self` ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self, batch_idx: SIMD[uint32, 1], start_tok_idx: SIMD[uint32, 1], head_idx: SIMD[uint32, 1], head_dim_idx: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> UnsafePointer[SIMD[type_, 1]]` ### `cache_length` `cache_length(self, batch_idx: Int) -> Int` ### `max_context_length` `max_context_length(self) -> SIMD[uint32, 1]` --- ## mha_operand ## Structs * [​`KVCacheMHAOperand`](./KVCacheMHAOperand): An implementation for `mo.opaque` KVCacheT arguments to MHA kernels. * [​`NDBufferMHAOperand`](./NDBufferMHAOperand): An implementation for NDBuffer arguments to MHA kernels. * [​`RaggedMHAOperand`](./RaggedMHAOperand): An implementation for ragged NDBuffer arguments to MHA kernels. ## Traits * [​`MHAOperand`](./MHAOperand): This serves as the trait to support arguments to our MHA kernel.
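A sketch of how a kernel can consume any MHAOperand, walking one batch/head of the KV cache in fixed-size tiles (hypothetical helper; imports from the kernels package omitted):

```mojo
fn walk_keys[T: MHAOperand](kv: T, batch_idx: UInt32, head_idx: UInt32):
    alias tile_size = 64  # assumed tile width
    var num_keys = kv.cache_length(Int(batch_idx))
    var tok: UInt32 = 0
    while Int(tok) < num_keys:
        # Pointer to the first element of this tile for the given head.
        var ptr = kv.block_paged_ptr[tile_size](batch_idx, tok, head_idx)
        # ... consume up to tile_size keys starting at ptr ...
        tok += tile_size
```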
--- ## AlibiScoreMod `@register_passable(trivial)` `struct AlibiScoreMod[num_heads: Int]` AlibiScoreMod adds the appropriate ALiBi constant bias to the attention score. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `ScoreModTrait`, `UnknownDestructibility` ## Aliases ### `name_str` `alias name_str = __init__[__mlir_type.!kgen.string]("alibi")` ## Methods ### `score_mod` `score_mod[type: DType, width: Int, //, *, element_type: DType = int32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width], max_prompt_len: Int) -> SIMD[type, width]` --- ## IdentityScoreMod `@register_passable(trivial)` `struct IdentityScoreMod` IdentityScoreMod simply returns the attention score. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `ScoreModTrait`, `UnknownDestructibility` ## Aliases ### `name_str` `alias name_str = __init__[__mlir_type.!kgen.string]("no_pos")` ## Methods ### `score_mod` `score_mod[type: DType, width: Int, //, *, element_type: DType = int32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width], max_prompt_len: Int = 0) -> SIMD[type, width]` --- ## ScoreModTrait The ScoreModTrait trait describes a score\_mod for MHA kernels, such as the ALiBi bias. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `name_str` `alias name_str` ## Methods ### `score_mod` `score_mod[type: DType, width: Int, //, *, element_type: DType = int32](self: _Self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width], max_prompt_len: Int = 0) -> SIMD[type, width]` Return the score vector at the given coordinates after applying the score\_mod. Arguments: coord is (seq\_id, head, q\_idx, k\_idx); score\_vec is the vector at `coord` of the score matrix. The score\_mod functor computes a bias tensor and adds it to score\_vec. --- ## mha_score_mod ## Structs * [​`AlibiScoreMod`](./AlibiScoreMod): AlibiScoreMod adds the appropriate ALiBi constant bias to the attention score. * [​`IdentityScoreMod`](./IdentityScoreMod): IdentityScoreMod simply returns the attention score. ## Traits * [​`ScoreModTrait`](./ScoreModTrait): The ScoreModTrait trait describes a score\_mod for MHA kernels, such as the ALiBi bias.
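For reference, ALiBi in its standard formulation adds a head-specific linear distance penalty to each score, which is the bias AlibiScoreMod contributes:

$$
\text{score}'_{h,i,j} = \text{score}_{h,i,j} - m_h \,(i - j),
$$

where $i$ and $j$ are the query and key positions and the slope $m_h$ follows a geometric sequence in the head index $h$.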
--- ## MHAPosition `@register_passable(trivial)` `struct MHAPosition[BM: Int, BN: Int, depth: Int, num_heads: Int, group: Int, decoding: Bool]` Position of the MHA kernel. When `decoding=False`, `q_head_stride == num_heads`. When `decoding=True`, `q_head_stride == 1`. ## Fields * ​q\_out\_offset (`Int`): * ​num\_keys (`SIMD[uint32, 1]`): * ​start\_pos (`SIMD[uint32, 1]`): * ​seq\_len (`SIMD[uint32, 1]`): * ​head\_idx (`SIMD[uint32, 1]`): * ​prompt\_offset (`SIMD[uint32, 1]`): * ​prompt\_idx (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `q_output_gmem_layout` `alias q_output_gmem_layout = __init__[::Origin[::Bool(IntTuple(BM, depth), IntTuple(depth if decoding else (depth * num_heads), 1))` ### `q_stride` `alias q_stride = depth if decoding else (depth * num_heads)` ## Methods ### `__init__` `__init__(q_out_offset: Int, num_keys: SIMD[uint32, 1], start_pos: SIMD[uint32, 1], seq_info: SeqInfo) -> Self` ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` ### `q_head_idx` `q_head_idx(self) -> SIMD[uint32, 1]` ### `kv_head_idx` `kv_head_idx(self) -> SIMD[uint32, 1]` ### `write_to` `write_to[W: Writer](self, mut writer: W)` ### `q_tile_num_rows` `q_tile_num_rows(self) -> SIMD[uint32, 1]` ### `q_out_gmem_tensor` `q_out_gmem_tensor[dtype: DType](self, ptr: UnsafePointer[SIMD[dtype, 1]]) -> LayoutTensor[dtype, __init__[::Origin[::Bool(IntTuple(BM, depth), IntTuple(depth if decoding else (depth * num_heads), 1)), MutableAnyOrigin, layout_int_type=int32, linear_idx_type=int32, masked=True]` ### `mask_status` `mask_status[mask_t: MHAMask](self, mask: mask_t, kv_tile_start_row: SIMD[uint32, 1]) -> TileMaskStatus` ### `exp_sum_qk_max_ptr` `exp_sum_qk_max_ptr[partition_t: MHAPartitionScheme](self, partition: partition_t, batch_size: SIMD[uint32, 1]) -> Tuple[UnsafePointer[SIMD[get_vtable_entry(:trait partition_t, "accum_dtype"), 1]], UnsafePointer[SIMD[get_vtable_entry(:trait partition_t, "accum_dtype"), 1]]]` ### `get_start_and_end_for_partitions` `get_start_and_end_for_partitions[partition_t: MHAPartitionScheme, //, BN: Int](self, partition: partition_t) -> Tuple[SIMD[uint32, 1], SIMD[uint32, 1]]` --- ## mha_sm90 ## Structs * [​`MHAPosition`](./MHAPosition): Position of the MHA kernel. When `decoding=False`, `q_head_stride == num_heads`. When `decoding=True`, `q_head_stride == 1`.
## Functions * [​`mha_sm90_dispatch`](./mha_sm90_dispatch): * [​`valid_length_managed_tensor_slice_to_ndbuffer`](./valid_length_managed_tensor_slice_to_ndbuffer): --- ## mha_sm90_dispatch `mha_sm90_dispatch[k_t: MHAOperand, v_t: MHAOperand, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, output_type: DType, max_prompt_len_t: OptionallyStaticInt, partition_t: MHAPartitionScheme, //, config: MHAConfig, group: Int, use_score_mod: Bool, ragged: Bool, _is_cache_length_accurate: Bool](output: UnsafePointer[SIMD[output_type, 1]], q: UnsafePointer[SIMD[type, 1]], k: k_t, v: v_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: ManagedTensorSlice[io_spec, static_spec=static_spec], max_prompt_len_arg: max_prompt_len_t, max_cache_valid_length_arg: Int, scale: SIMD[float32, 1], kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]], batch_size_arg: Int, partition: partition_t, ctx: DeviceContext)` --- ## valid_length_managed_tensor_slice_to_ndbuffer `valid_length_managed_tensor_slice_to_ndbuffer(tensor: ManagedTensorSlice[io_spec, static_spec=static_spec]) -> NDBuffer[uint32, 1, MutableAnyOrigin]` --- ## MHASchedule `@register_passable(trivial)` `struct MHASchedule` ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `DEFAULT` `alias DEFAULT = MHASchedule(__init__[__mlir_type.!pop.int_literal](0))` ### `PROMPT_ROTATE` `alias PROMPT_ROTATE = MHASchedule(__init__[__mlir_type.!pop.int_literal](1))` ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` --- ## MHASchedulerSynchronization `@register_passable(trivial)` `struct MHASchedulerSynchronization` ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ALL` `alias ALL = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](2))` ### `DEFAULT` `alias DEFAULT = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))` ### `NONE` `alias NONE = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](0))` ### `PRODUCER` `alias PRODUCER = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))` ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` --- ## MHATileScheduler The MHATileScheduler trait describes a schedule for the persistent kernel. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `may_advance` `alias may_advance` ### `mha_schedule` `alias mha_schedule` ## Methods ### `get_current_work_info` `get_current_work_info(self: _Self, ts: MHATileSummary, state: MHATileState) -> WorkInfo` Returns the current `WorkInfo`. ### `advance` `advance[ragged: Bool, producer: Bool, sync: MHASchedulerSynchronization = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))](self: _Self, ts: MHATileSummary, mut state: MHATileState, pipeline_idx: SIMD[uint32, 1]) -> OptionalReg[SeqInfo]` Advance the state to the next work item. Returns a `SeqInfo` if there is more work, or an empty optional otherwise. ### `grid_dim` `static grid_dim(batch_size: SIMD[uint32, 1], max_num_prompt_tiles: SIMD[uint32, 1]) -> Tuple[Int, Int, Int]` Return the grid\_dim required for the kernel. ### `initial_state` `initial_state(self: _Self, ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)], tile_summary: MHATileSummary) -> MHATileState` Create the initial state object. ### `unsafe_seq_info` `unsafe_seq_info[ragged: Bool](self: _Self, ts: MHATileSummary, state: MHATileState) -> SeqInfo`
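A sketch of the persistent-kernel pattern these methods support (hypothetical wrapper; the shared-memory pointer and MHATileSummary are assumed to be set up by the launching kernel, and imports are omitted):

```mojo
fn work_loop[S: MHATileScheduler](
    scheduler: S,
    ts: MHATileSummary,
    smem: UnsafePointer[UInt32, address_space = AddressSpace(3)],
):
    var state = scheduler.initial_state(smem, ts)
    while True:
        var work = scheduler.get_current_work_info(ts, state)
        if not work.is_valid():
            break
        # ... compute one (prompt_offset, head_idx, prompt_idx) work item ...
        @parameter
        if not S.may_advance:
            break  # one work item per block, e.g. TransientScheduler
        _ = scheduler.advance[ragged=False, producer=True](ts, state, 0)
```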
--- ## MHATileState `@register_passable(trivial)` `struct MHATileState` ## Fields * ​idx (`SIMD[uint32, 1]`): * ​sidx\_ptr (`UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)]`): * ​max\_idx (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(idx: SIMD[uint32, 1], sidx_ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)], max_idx: SIMD[uint32, 1]) -> Self` ### `is_valid` `is_valid(self, idx: SIMD[uint32, 1]) -> Bool` `is_valid(self) -> Bool` --- ## MHATileSummary `@register_passable(trivial)` `struct MHATileSummary` ## Fields * ​batch\_size (`SIMD[uint32, 1]`): * ​max\_num\_prompt\_tiles (`SIMD[uint32, 1]`): * ​valid\_length (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​max\_seq\_len (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(batch_size: SIMD[uint32, 1], max_num_prompt_tiles: SIMD[uint32, 1], valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], max_seq_len: SIMD[uint32, 1]) -> Self` ### `get_current_work_info` `get_current_work_info[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], schedule: MHASchedule](self, idx: SIMD[uint32, 1]) -> WorkInfo` `get_current_work_info[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], schedule: MHASchedule](self, idx: MHATileState) -> WorkInfo` ### `unsafe_get_current_work_info` `unsafe_get_current_work_info[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], schedule: MHASchedule](self, idx: SIMD[uint32, 1]) -> WorkInfo` ### `max_idx` `max_idx(self, num_heads: SIMD[uint32, 1]) -> SIMD[uint32, 1]` ### `grid_dim` `static grid_dim[num_heads: SIMD[uint32, 1]](max_num_prompt_tiles: SIMD[uint32, 1], batch_size: SIMD[uint32, 1]) -> Tuple[Int, Int, Int]` ### `seq_info` `seq_info[ragged: Bool](self, work: WorkInfo) -> SeqInfo` ### `unsafe_seq_info` `unsafe_seq_info[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], ragged: Bool, schedule: MHASchedule](self, idx: SIMD[uint32, 1]) -> SeqInfo` `unsafe_seq_info[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], ragged: Bool, schedule: MHASchedule](self, state: MHATileState) -> SeqInfo` --- ## QueuedTileScheduler `@register_passable(trivial)` `struct QueuedTileScheduler[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], /, decoding: Bool, num_ctas: SIMD[uint32, 1] = SIMD(Info(__init__[__mlir_type.!kgen.string]("H100"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("hopper"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](9), __init__[__mlir_type.!kgen.string]("sm_90a"), 132, 32, 2048, 32, 64, 2048, 32, 233472, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)), schedule: MHASchedule = MHASchedule(__init__[__mlir_type.!pop.int_literal](0))]` If `decoding == False`, then `num_heads` is `q_num_heads`. If `decoding == True`, then `num_heads` is `kv_num_heads`.
## Fields * ​gidx\_ptr (`UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(1)]`): ## Implemented traits `AnyType`, `Copyable`, `MHATileScheduler`, `Movable`, `UnknownDestructibility` ## Aliases ### `may_advance` `alias may_advance = True` ### `mha_schedule` `alias mha_schedule = schedule` ## Methods ### `__init__` `__init__(gidx_ptr: UnsafePointer[SIMD[uint32, 1]]) -> Self` ### `get_current_work_info` `get_current_work_info(self, ts: MHATileSummary, state: MHATileState) -> WorkInfo` ### `advance` `advance[ragged: Bool, producer: Bool, sync: MHASchedulerSynchronization = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))](self, ts: MHATileSummary, mut state: MHATileState, pipeline_idx: SIMD[uint32, 1]) -> OptionalReg[SeqInfo]` Advances to the next work item. Returns a `SeqInfo` when the new index corresponds to a valid `WorkInfo`, or an empty optional when there is no more work. Note that if `MHASchedulerSynchronization` is `NONE`, then we assume it is only called by `thread_idx.x==0`. ### `grid_dim` `static grid_dim(batch_size: SIMD[uint32, 1], max_num_prompt_tiles: SIMD[uint32, 1]) -> Tuple[Int, Int, Int]` ### `initial_state` `initial_state(self, ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)], tile_summary: MHATileSummary) -> MHATileState` ### `unsafe_seq_info` `unsafe_seq_info[ragged: Bool](self, ts: MHATileSummary, state: MHATileState) -> SeqInfo` --- ## SeqInfo `@register_passable(trivial)` `struct SeqInfo` ## Fields * ​seq\_len (`SIMD[uint32, 1]`): * ​start\_of\_seq (`SIMD[uint32, 1]`): * ​prompt\_offset (`SIMD[uint32, 1]`): * ​head\_idx (`SIMD[uint32, 1]`): * ​prompt\_idx (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(seq_len: SIMD[uint32, 1], start_of_seq: SIMD[uint32, 1], work: WorkInfo) -> Self` ### `is_valid` `is_valid(self) -> Bool` ### `create` `static create[ragged: Bool](work: WorkInfo, valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], max_seq_len: SIMD[uint32, 1]) -> Self` --- ## TileScheduler `@register_passable(trivial)` `struct TileScheduler[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], /, num_ctas: SIMD[uint32, 1] = SIMD(Info(__init__[__mlir_type.!kgen.string]("H100"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("hopper"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](9), __init__[__mlir_type.!kgen.string]("sm_90a"), 132, 32, 2048, 32, 64, 2048, 32, 233472, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)), schedule: MHASchedule = MHASchedule(__init__[__mlir_type.!pop.int_literal](0))]` ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `MHATileScheduler`, `Movable`, `UnknownDestructibility` ## Aliases ### `may_advance` `alias may_advance = True` ### `mha_schedule` `alias mha_schedule = schedule` ## Methods ### `__init__` `__init__() -> Self` ### `get_current_work_info` `get_current_work_info(self, ts: MHATileSummary, state: MHATileState) -> WorkInfo` ### `fetch_next_work` `fetch_next_work(self, ts: MHATileSummary, mut state: MHATileState) -> WorkInfo` ### `advance` `advance[ragged: Bool, producer: Bool, sync: MHASchedulerSynchronization = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))](self, ts: MHATileSummary, mut state: MHATileState, pipeline_idx: SIMD[uint32, 1]) -> OptionalReg[SeqInfo]` ### 
`grid_dim` `static grid_dim(batch_size: SIMD[uint32, 1], max_num_prompt_tiles: SIMD[uint32, 1]) -> Tuple[Int, Int, Int]` ### `initial_state` `initial_state(self, ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)], tile_summary: MHATileSummary) -> MHATileState` ### `unsafe_seq_info` `unsafe_seq_info[ragged: Bool](self, ts: MHATileSummary, state: MHATileState) -> SeqInfo` --- ## TransientScheduler `@register_passable(trivial)` `struct TransientScheduler[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1]]` ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `MHATileScheduler`, `Movable`, `UnknownDestructibility` ## Aliases ### `may_advance` `alias may_advance = False` ### `mha_schedule` `alias mha_schedule = MHASchedule(__init__[__mlir_type.!pop.int_literal](0))` ## Methods ### `__init__` `__init__() -> Self` ### `get_current_work_info` `get_current_work_info(self) -> WorkInfo` `get_current_work_info(self, ts: MHATileSummary, state: MHATileState) -> WorkInfo` ### `advance` `advance[ragged: Bool, producer: Bool, sync: MHASchedulerSynchronization = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))](self, ts: MHATileSummary, mut state: MHATileState, pipeline_idx: SIMD[uint32, 1]) -> OptionalReg[SeqInfo]` ### `grid_dim` `static grid_dim(batch_size: SIMD[uint32, 1], max_num_prompt_tiles: SIMD[uint32, 1]) -> Tuple[Int, Int, Int]` ### `initial_state` `initial_state(self, ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)], tile_summary: MHATileSummary) -> MHATileState` ### `unsafe_seq_info` `unsafe_seq_info[ragged: Bool](self, ts: MHATileSummary, state: MHATileState) -> SeqInfo` --- ## WorkInfo `@register_passable(trivial)` `struct WorkInfo` ## Fields * ​prompt\_offset (`SIMD[uint32, 1]`): * ​head\_idx (`SIMD[uint32, 1]`): * ​prompt\_idx (`SIMD[uint32, 1]`): * ​is\_valid\_tile (`Bool`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `is_valid` `is_valid(self) -> Bool` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` --- ## mha_tile_scheduler ## Structs * [​`MHASchedule`](./MHASchedule): * [​`MHASchedulerSynchronization`](./MHASchedulerSynchronization): * [​`MHATileState`](./MHATileState): * [​`MHATileSummary`](./MHATileSummary): * [​`QueuedTileScheduler`](./QueuedTileScheduler): If `decoding == False`, then `num_heads` is `q_num_heads`. If `decoding == True`, then `num_heads` is `kv_num_heads`. 
* [​`SeqInfo`](./SeqInfo): * [​`TileScheduler`](./TileScheduler): * [​`TransientScheduler`](./TransientScheduler): * [​`WorkInfo`](./WorkInfo): ## Traits * [​`MHATileScheduler`](./MHATileScheduler): --- ## DynamicInt `@register_passable(trivial)` `struct DynamicInt` ## Fields * ​value (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Intable`, `Movable`, `OptionallyStaticInt`, `UnknownDestructibility` ## Aliases ### `static_value` `alias static_value = OptionalReg[Int]({:i1 0, 1})` ## Methods ### `__init__` `__init__(value: Int) -> Self` ### `__int__` `__int__(self) -> Int` ### `as_uint32` `as_uint32(self) -> SIMD[uint32, 1]` --- ## FlashAttentionAlgorithm `@register_passable(trivial)` `struct FlashAttentionAlgorithm` ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `FLASH_ATTENTION_1` `alias FLASH_ATTENTION_1 = FlashAttentionAlgorithm(1)` ### `FLASH_ATTENTION_2` `alias FLASH_ATTENTION_2 = FlashAttentionAlgorithm(2)` ### `FLASH_ATTENTION_3` `alias FLASH_ATTENTION_3 = FlashAttentionAlgorithm(3)` ### `NAIVE` `alias NAIVE = FlashAttentionAlgorithm(0)` ## Methods ### `__init__` `__init__() -> Self` `@implicit` `__init__(value: Int) -> Self` ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` --- ## MHAConfig `@register_passable(trivial)` `struct MHAConfig` ## Fields * ​type (`DType`): * ​num\_heads (`UInt`): * ​depth (`UInt`): * ​num\_queries\_per\_block (`UInt`): * ​num\_keys\_per\_block (`UInt`): * ​BK (`UInt`): * ​WM (`UInt`): * ​WN (`UInt`): * ​num\_pipeline\_stages (`UInt`): * ​k\_group\_size (`UInt`): * ​algorithm (`FlashAttentionAlgorithm`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(type: DType, num_heads: UInt, depth: UInt, num_queries_per_block: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}), num_keys_per_block: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}), BK: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}), WM: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}), WN: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}), num_pipeline_stages: UInt = UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), k_group_size: UInt = UInt(1), algorithm: FlashAttentionAlgorithm = FlashAttentionAlgorithm()) -> Self` ### `block_m` `block_m(self) -> UInt` ### `block_n` `block_n(self) -> UInt` ### `block_k` `block_k(self) -> UInt` ### `warp_m` `warp_m(self) -> UInt` ### `warp_n` `warp_n(self) -> UInt` ### `num_warps_m` `num_warps_m(self) -> UInt` ### `num_warps_n` `num_warps_n(self) -> UInt` ### `num_consumer_threads` `num_consumer_threads(self) -> UInt` ### `num_producer_threads` `num_producer_threads[producer_consumer_kernel: Bool = False](self) -> UInt` ### `num_threads` `num_threads[producer_consumer_kernel: Bool = False](self) -> UInt` ### `q_smem_size` `q_smem_size(self, fa3: Bool = False) -> UInt` ### `kv_smem_size` `kv_smem_size(self, fa3: Bool = False) -> UInt` ### `k_smem_size` `k_smem_size(self, sm_90: Bool = False) -> UInt` ### `v_smem_size` `v_smem_size(self, sm_90: Bool = False) -> UInt` ### 
`p_smem_size` `p_smem_size(self) -> UInt` ### `warp_scratch_smem_size` `warp_scratch_smem_size(self) -> UInt` ### `shared_mem_bytes` `shared_mem_bytes[shared_kv: Bool = False, sm_90: Bool = False](self) -> UInt` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` --- ## MHAPartitionScheme ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `accum_dtype` `alias accum_dtype` ### `do_partition` `alias do_partition` ## Methods ### `num_partitions` `num_partitions(self: _Self) -> SIMD[uint32, 1]` ### `get_exp_sum_qk_max_pointer` `get_exp_sum_qk_max_pointer(self: _Self) -> UnsafePointer[SIMD[get_vtable_entry(:trait _Self, "accum_dtype"), 1]]` --- ## NoPartition `@register_passable(trivial)` `struct NoPartition[dtype: DType]` ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `MHAPartitionScheme`, `Movable`, `UnknownDestructibility` ## Aliases ### `accum_dtype` `alias accum_dtype = dtype` ### `do_partition` `alias do_partition = False` ## Methods ### `__init__` `__init__() -> Self` ### `num_partitions` `num_partitions(self) -> SIMD[uint32, 1]` ### `get_exp_sum_qk_max_pointer` `get_exp_sum_qk_max_pointer(self) -> UnsafePointer[SIMD[dtype, 1]]` --- ## OptionallyStaticInt ## Implemented traits `AnyType`, `Copyable`, `Intable`, `Movable`, `UnknownDestructibility` ## Aliases ### `static_value` `alias static_value` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `as_uint32` `as_uint32(self: _Self) -> SIMD[uint32, 1]` ### `__int__` `__int__(self: _Self) -> Int` Get the integral representation of the value. **Returns:** The integral representation of the value. 
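A minimal sketch of how code can be generic over OptionallyStaticInt, specializing when the value is known at compile time (hypothetical function; StaticInt and DynamicInt are the two implementations documented here):

```mojo
fn describe[T: OptionallyStaticInt](max_prompt_len: T):
    @parameter
    if T.static_value:
        # The length is a compile-time constant: specialize on it.
        alias n = T.static_value.value()
    else:
        # Fall back to the runtime value via the Intable conformance.
        var n = Int(max_prompt_len)

# describe(StaticInt[128]())  # static_value holds 128
# describe(DynamicInt(512))   # static_value is empty
```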
--- ## SplitKPartition `@register_passable(trivial)` `struct SplitKPartition[dtype: DType]` ## Fields * ​ptr (`UnsafePointer[SIMD[dtype, 1]]`): * ​num\_partitions\_value (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `MHAPartitionScheme`, `Movable`, `UnknownDestructibility` ## Aliases ### `accum_dtype` `alias accum_dtype = dtype` ### `do_partition` `alias do_partition = True` ## Methods ### `__init__` `__init__(ptr: UnsafePointer[SIMD[dtype, 1]], num_partitions_value: SIMD[uint32, 1]) -> Self` ### `num_partitions` `num_partitions(self) -> SIMD[uint32, 1]` ### `get_exp_sum_qk_max_pointer` `get_exp_sum_qk_max_pointer(self) -> UnsafePointer[SIMD[dtype, 1]]` --- ## StaticInt `@register_passable(trivial)` `struct StaticInt[value: Int]` ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Intable`, `Movable`, `OptionallyStaticInt`, `UnknownDestructibility` ## Aliases ### `static_value` `alias static_value = OptionalReg[Int]({:@stdlib::@builtin::@int::@Int value, 0})` ## Methods ### `__init__` `__init__() -> Self` ### `__int__` `__int__(self) -> Int` ### `as_uint32` `as_uint32(self) -> SIMD[uint32, 1]` --- ## dispatch_mask_and_score_mod `dispatch_mask_and_score_mod[mask_type: String, score_mod_type: String, callback_fn: fn[MHAMask, ScoreModTrait](mask: $0, score_mod: $1) raises capturing -> None, local_window_size: Int = -1, num_heads: Int = -1]()` --- ## dispatch_materialized_mask_and_score_mod `dispatch_materialized_mask_and_score_mod[score_mod_type: String, callback_fn: fn[MHAMask, ScoreModTrait](mask: $0, score_mod: $1) raises capturing -> None, num_heads: Int = -1](mask_nd: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], start_pos_nd: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}))` --- ## get_start_and_end_for_partitions `get_start_and_end_for_partitions[tile_size: Int](num_keys: Int, num_partitions: Int, partition_idx: Int) -> Tuple[Int, Int]` Calculate start and end indices for a partition. **Args:** * ​num\_keys (`Int`): Total number of keys (sequence length). * ​num\_partitions (`Int`): Number of partitions to split keys into. * ​partition\_idx (`Int`): Index of current partition (0 to num\_partitions-1). **Returns:** Tuple of (start\_idx, end\_idx) for the partition, aligned to tile\_size. 
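A quick worked example (the exact assignment of tiles to partitions is implementation-defined; the bounds are tile\_size-aligned except possibly at num\_keys):

```mojo
fn example():
    # 1000 keys split across 4 partitions with tile_size = 128:
    # ceil(1000 / 128) = 8 tiles, so roughly two tiles (256 keys) each.
    var bounds = get_start_and_end_for_partitions[128](1000, 4, 0)
    var start = bounds[0]  # e.g. 0
    var end = bounds[1]    # e.g. 256 -- a multiple of tile_size
```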
--- ## mha_utils ## Aliases ### `callback_fn_type` `alias callback_fn_type = fn[MHAMask, ScoreModTrait](mask: $0, score_mod: $1) raises capturing -> None` ### `is_sm100` `alias is_sm100 = _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100"))` ### `is_sm90` `alias is_sm90 = _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90"))` ### `is_sm90or100` `alias is_sm90or100 = _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100"))` ## Structs * [​`DynamicInt`](./DynamicInt): * [​`FlashAttentionAlgorithm`](./FlashAttentionAlgorithm): * [​`MHAConfig`](./MHAConfig): * [​`NoPartition`](./NoPartition): * [​`SplitKPartition`](./SplitKPartition): * [​`StaticInt`](./StaticInt): ## Traits * [​`MHAPartitionScheme`](./MHAPartitionScheme): * [​`OptionallyStaticInt`](./OptionallyStaticInt): ## Functions * [​`dispatch_mask_and_score_mod`](./dispatch_mask_and_score_mod): * [​`dispatch_materialized_mask_and_score_mod`](./dispatch_materialized_mask_and_score_mod): * [​`get_start_and_end_for_partitions`](./get_start_and_end_for_partitions): Calculate start and end indices for a partition. --- ## flare_mla_decoding `flare_mla_decoding[rank: Int, cache_t: KVCacheT, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), UInt(1), FlashAttentionAlgorithm()), ragged: Bool = False, decoding_warp_split_k: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: cache_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: NDBuffer[uint32, 1, origin, shape, strides], scale: SIMD[float32, 1], ctx: DeviceContext, q_max_seq_len: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` MLA decoding kernel that is only called in the optimized compute graph. The Q input has a shape of \[seq\_len, num\_heads, depth]. The K input has a shape of \[seq\_len, 1, depth]. The V tensor is derived by reusing K, where V = K\[:, :, :depth\_v]. Specifically, for DeepSeek V2/3, depth = 576 and depth\_v = 512. This kernel computes attention without needing to load V twice. This kernel only handles decoding requests, in which case q\_max\_seq\_len = 1. This kernel handles batches with different valid lengths (i.e., before the padding); such lengths are passed in the valid\_length argument.
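A sketch of the shapes involved for the DeepSeek V2/3 case described above ($d = 576$, $d_v = 512$, $S$ = sequence length, $H$ = num\_heads):

$$
Q \in \mathbb{R}^{1 \times H \times 576}, \qquad K \in \mathbb{R}^{S \times 1 \times 576}, \qquad V = K[:, :, :512] \in \mathbb{R}^{S \times 1 \times 512},
$$

so the score $QK^{\top}$ consumes all 576 channels of K while $\operatorname{softmax}(\cdot)\,V$ reads only the first 512, letting the kernel load K once for both roles.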
`flare_mla_decoding[rank: Int, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), UInt(1), FlashAttentionAlgorithm()), decoding_warp_split_k: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: NDBuffer[type, rank, origin, shape, strides], mask_functor: mask_t, score_mod_functor: score_mod_t, scale: SIMD[float32, 1], ctx: DeviceContext, num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` --- ## flare_mla_decoding_dispatch `flare_mla_decoding_dispatch[rank: Int, k_t: MHAOperand, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, kv_num_heads: Int, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), UInt(1), FlashAttentionAlgorithm()), ragged: Bool = False, _is_cache_length_accurate: Bool = False, _use_valid_length: Bool = True, decoding_warp_split_k: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: k_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: NDBuffer[uint32, 1, origin, shape, strides], max_prompt_len: Int, max_cache_valid_length: Int, scale: SIMD[float32, 1], ctx: DeviceContext, kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` --- ## flare_mla_prefill `flare_mla_prefill[rank: Int, cache_t: KVCacheT, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, output_type: DType, softmax_type: DType, q_shape: DimList, //, use_score_mod: Bool = False, write_softmax_info: Bool = False, use_cascade_attention: Bool = False](output: NDBuffer[output_type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: NDBuffer[type, 3, origin, shape, strides], v: NDBuffer[type, 3, origin, shape, strides], k_rope: cache_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: NDBuffer[uint32, 1, origin, shape, strides], cache_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], scale: SIMD[float32, 1], ctx: DeviceContext, q_max_seq_len: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}), cache_offsets: 
OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), prev_output: OptionalReg[NDBuffer[output_type, rank, MutableAnyOrigin]] = OptionalReg[NDBuffer[output_type, rank, MutableAnyOrigin]]({:i1 0, 1}), prev_softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}))` MLA prefill kernel that would only be called in the optimized compute graph. Only supports ragged Q/K/V inputs. The Q input has a shape of \[seq\_len, num\_heads, q\_depth]. The K and V inputs each have a shape of \[cache\_len, num\_heads, depth]. The K\_rope input is retrieved from the KV cache, with a shape of \[cache\_len, 1, q\_depth - depth]. Specifically, for DeepSeek V2/3, depth = 128 and q\_depth = 192. When computing attention scores (Q @ K), each head of K is smaller than each head of Q. The missing 64 elements of K are retrieved from the K cache, and broadcast to all the heads. This kernel also handles the case where the output has a reduced dimension compared to the input Q. This kernel handles batches with different valid lengths (i.e., before the padding). Such lengths are passed in the valid\_length argument. `flare_mla_prefill[rank: Int, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, softmax_type: DType, q_shape: DimList, //, use_score_mod: Bool = False, write_softmax_info: Bool = False, use_cascade_attention: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: NDBuffer[type, 3, origin, shape, strides], v: NDBuffer[type, 3, origin, shape, strides], k_rope: NDBuffer[type, 4, origin, shape, strides], mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: NDBuffer[uint32, 1, origin, shape, strides], cache_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], scale: SIMD[float32, 1], ctx: DeviceContext, q_max_seq_len: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}), cache_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}))` --- ## flare_mla_prefill_dispatch `flare_mla_prefill_dispatch[rank: Int, k_t: MHAOperand, v_t: MHAOperand, k_rope_t: MHAOperand, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, output_type: DType, softmax_type: DType, q_shape: DimList, //, kv_num_heads: Int, use_score_mod: Bool = False, write_softmax_info: Bool = False, use_cascade_attention: Bool = False, q_depth: Int = 192, cache_depth: Int = 576, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":100")) else 4), UInt(1), FlashAttentionAlgorithm()), _ndbuffer_mha_operand: Bool = False](output: NDBuffer[output_type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: k_t, v: v_t, k_rope: k_rope_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: NDBuffer[uint32, 1, origin, shape,
strides], max_prompt_len: Int, scale: SIMD[float32, 1], ctx: DeviceContext, softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}), cache_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), prev_output: OptionalReg[NDBuffer[output_type, rank, MutableAnyOrigin]] = OptionalReg[NDBuffer[output_type, rank, MutableAnyOrigin]]({:i1 0, 1}), prev_softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}))` --- ## mla ## Functions * [​`flare_mla_decoding`](./flare_mla_decoding): MLA decoding kernel that would only be called in the optimized compute graph. * [​`flare_mla_decoding_dispatch`](./flare_mla_decoding_dispatch): * [​`flare_mla_prefill`](./flare_mla_prefill): MLA prefill kernel that would only be called in the optimized compute graph. Only supports ragged Q/K/V inputs. * [​`flare_mla_prefill_dispatch`](./flare_mla_prefill_dispatch): * [​`mla_decoding`](./mla_decoding): * [​`mla_decoding_single_batch`](./mla_decoding_single_batch): Flash attention v2 algorithm. * [​`mla_prefill`](./mla_prefill): * [​`mla_prefill_plan`](./mla_prefill_plan): This calls a GPU kernel that plans how to process a batch of sequences with varying lengths using a fixed-size buffer. * [​`mla_prefill_plan_kernel`](./mla_prefill_plan_kernel): * [​`mla_prefill_single_batch`](./mla_prefill_single_batch): MLA for encoding where seqlen > 1. --- ## mla_decoding `mla_decoding[q_type: DType, k_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, BM: UInt, BN: UInt, BK: UInt, WM: UInt, WN: UInt, depth: UInt, num_heads: UInt, num_threads: UInt, num_pipeline_stages: UInt, group: UInt = UInt(1), use_score_mod: Bool = False, ragged: Bool = False, _use_valid_length: Bool = False, _is_cache_length_accurate: Bool = False, decoding_warp_split_k: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], batch_size: Int, num_partitions: Int, max_cache_valid_length: Int, valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], mask: mask_t, score_mod: score_mod_t)` --- ## mla_decoding_single_batch `mla_decoding_single_batch[q_type: DType, k_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, BM: UInt, BN: UInt, BK: UInt, WM: UInt, WN: UInt, depth: UInt, depth_v: UInt, num_heads: UInt, num_threads: UInt, num_pipeline_stages: UInt, group: UInt = UInt(1), use_score_mod: Bool = False, decoding_warp_split_k: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], num_keys: UInt, num_partitions: UInt, max_cache_valid_length: UInt, mask: mask_t, score_mod: score_mod_t, batch_idx: Int)` Flash attention v2 algorithm. 
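As a rough illustration of the flash-attention-style computation above, here is a scalar sketch of the online-softmax update such kernels apply per tile; the function and variable names are illustrative only, and real kernels operate on SIMD vectors with warp reductions:

```mojo
from math import exp

fn online_softmax_update(
    mut running_max: Float32,
    mut exp_sum: Float32,
    mut acc: Float32,  # running exp-weighted sum of V for one output element
    score: Float32,    # newly computed attention score
    v: Float32,        # V value paired with that score
):
    var new_max = max(running_max, score)
    var correction = exp(running_max - new_max)
    # Rescale previously accumulated state to the new running maximum, then
    # fold in the new score; softmax stays stable without storing all scores.
    exp_sum = exp_sum * correction + exp(score - new_max)
    acc = acc * correction + exp(score - new_max) * v
    running_max = new_max
```

The `exp_sum_ptr` and `qk_max_ptr` arguments above appear to carry this kind of per-partition state so that split-k partitions can be merged afterwards.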
--- ## mla_prefill `mla_prefill[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, k_rope_t: MHAOperand, output_type: DType, softmax_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, config: MHAConfig, group: Int = 128, q_depth: Int = 192, cache_depth: Int = 576, use_score_mod: Bool = False, write_softmax_info: Bool = False, use_cascade_attention: Bool = False, _ndbuffer_mha_operand: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, k_rope: k_rope_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], softmax_info_ptr: UnsafePointer[SIMD[softmax_type, 1]], prev_output_ptr: UnsafePointer[SIMD[output_type, 1]], prev_softmax_info_ptr: UnsafePointer[SIMD[softmax_type, 1]], scale: SIMD[float32, 1], batch_size: Int, seq_len_arg: Int, valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], cache_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]], mask: mask_t, score_mod: score_mod_t)` --- ## mla_prefill_plan `mla_prefill_plan[cache_t: KVCacheT](buffer_row_offsets: NDBuffer[uint32, 2, origin, shape, strides], cache_offsets: NDBuffer[uint32, 2, origin, shape, strides], buffer_lengths: NDBuffer[int32, 1, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], k_cache: cache_t, buffer_token_size: SIMD[uint32, 1], ctx: DeviceContext)` This calls a GPU kernel that plans how to process a batch of sequences with varying lengths using a fixed-size buffer. Each sequence in the batch has some existing cached tokens and new input tokens. The kernel divides the total tokens into chunks of buffer\_token\_size. For each chunk (iteration), it calculates:

1. Buffer offsets for each sequence in each chunk.
2. Cache offsets for each sequence in each chunk.
3. Total buffer lengths for each processing iteration.

--- ## mla_prefill_plan_kernel `mla_prefill_plan_kernel[buffer_lengths_shape: DimList, cache_t: KVCacheT](buffer_row_offsets: NDBuffer[uint32, 2, MutableAnyOrigin], cache_offsets: NDBuffer[uint32, 2, MutableAnyOrigin], buffer_lengths: NDBuffer[int32, 1, MutableAnyOrigin, buffer_lengths_shape], input_row_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], k_cache: cache_t, buffer_token_size: SIMD[uint32, 1])` --- ## mla_prefill_single_batch `mla_prefill_single_batch[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, k_rope_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, config: MHAConfig, group: Int = 1, q_depth: Int = 192, cache_depth: Int = 576, use_score_mod: Bool = False, write_softmax_info: Bool = False, use_cascade_attention: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, k_rope: k_rope_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], softmax_info_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], prev_output_ptr: UnsafePointer[SIMD[output_type, 1]], prev_softmax_info_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], seq_len: Int, max_seq_len: Int, start_pos: SIMD[uint32, 1], cache_start_pos: SIMD[uint32, 1], num_keys: Int, mask: mask_t, score_mod: score_mod_t, batch_idx: Int)` MLA for encoding where seqlen > 1.
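A toy illustration of the chunking arithmetic described for `mla_prefill_plan` above; all names and sizes here are made up, and the real kernel additionally tracks per-sequence buffer and cache offsets:

```mojo
from math import ceildiv

def main():
    var total_tokens = 10000
    var buffer_token_size = 4096
    var iterations = ceildiv(total_tokens, buffer_token_size)  # 3 fixed-size chunks
    for it in range(iterations):
        var start = it * buffer_token_size
        var length = min(buffer_token_size, total_tokens - start)
        print(it, start, length)  # chunk index, first token, tokens covered
```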
--- ## moe ## Functions * [​`moe_create_indices`](./moe_create_indices): * [​`moe_create_indices_kernel`](./moe_create_indices_kernel): --- ## moe_create_indices `moe_create_indices[input_type: DType, //, target: StringSlice[StaticConstantOrigin]](token_expert_order: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], expert_start_indices: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], restore_token_order: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], expert_ids: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], expert_usage_stats: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], topk_ids: LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], context: DeviceContextPtr)` --- ## moe_create_indices_kernel `moe_create_indices_kernel[input_type: DType, num_threads: Int, token_expert_order_layout: Layout, expert_start_indices_layout: Layout, restore_token_order_layout: Layout, expert_ids_layout: Layout, expert_usage_stats_layout: Layout, indices_padded_layout: Layout, padded_input_layout: Layout, topk_ids_layout: Layout](token_expert_order: LayoutTensor[uint32, token_expert_order_layout, MutableAnyOrigin], expert_start_indices: LayoutTensor[uint32, expert_start_indices_layout, MutableAnyOrigin], restore_token_order: LayoutTensor[uint32, restore_token_order_layout, MutableAnyOrigin], expert_ids: LayoutTensor[uint32, expert_ids_layout, MutableAnyOrigin], expert_usage_stats: LayoutTensor[uint32, expert_usage_stats_layout, MutableAnyOrigin], indices_padded: LayoutTensor[uint32, indices_padded_layout, MutableAnyOrigin], topk_ids_padded: LayoutTensor[input_type, padded_input_layout, MutableAnyOrigin], topk_ids: LayoutTensor[input_type, topk_ids_layout, MutableAnyOrigin])` --- ## BoundingBox `struct BoundingBox[type: DType]` ## Fields * ​nw (`SIMD[type, 2]`): * ​se (`SIMD[type, 2]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, y1: SIMD[type, 1], x1: SIMD[type, 1], y2: SIMD[type, 1], x2: SIMD[type, 1])` ### `iou` `iou(self, other: Self) -> SIMD[type, 1]` ### `intersection_area` `intersection_area(self, other: Self) -> SIMD[type, 1]` ### `area` `area(self) -> SIMD[type, 1]` --- ## nms ## Structs * [​`BoundingBox`](./BoundingBox): ## Functions * [​`non_max_suppression`](./non_max_suppression): Buffer semantic overload. * [​`non_max_suppression_shape_func`](./non_max_suppression_shape_func): Overload to compute the output shape. Can be removed once the graph compiler supports value semantic kernels that allocate their own output. 
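A small usage sketch of `BoundingBox` with made-up coordinates; boxes are given as (y1, x1, y2, x2) corners, and the import path depends on how the package is laid out:

```mojo
# from ...nms import BoundingBox  # adjust to your package layout

def main():
    var a = BoundingBox[DType.float32](0.0, 0.0, 2.0, 2.0)
    var b = BoundingBox[DType.float32](1.0, 1.0, 3.0, 3.0)
    print(a.area())                # 4.0
    print(a.intersection_area(b))  # 1.0
    print(a.iou(b))                # 1.0 / (4.0 + 4.0 - 1.0) ≈ 0.143
```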
--- ## non_max_suppression `non_max_suppression[type: DType](boxes: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], scores: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[int64, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], max_output_boxes_per_class: Int, iou_threshold: SIMD[float32, 1], score_threshold: SIMD[float32, 1])` Buffer semantic overload. `non_max_suppression[: origin.set, //, type: DType, func: fn(SIMD[int64, 1], SIMD[int64, 1], SIMD[int64, 1]) capturing -> None](boxes: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], scores: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], max_output_boxes_per_class: Int, iou_threshold: SIMD[float32, 1], score_threshold: SIMD[float32, 1])` Implements the NonMaxSuppression operator from the ONNX spec. --- ## non_max_suppression_shape_func `non_max_suppression_shape_func[type: DType](boxes: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], scores: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], max_output_boxes_per_class: Int, iou_threshold: SIMD[float32, 1], score_threshold: SIMD[float32, 1]) -> IndexList[2]` Overload to compute the output shape. Can be removed once the graph compiler supports value semantic kernels that allocate their own output.
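For intuition, here is a schematic CPU version of the greedy per-class selection loop that NonMaxSuppression performs; this is an illustration of the algorithm under simplified assumptions (single class, no score threshold), not the kernel's implementation:

```mojo
fn greedy_nms(
    boxes: List[BoundingBox[DType.float32]],
    scores: List[Float32],
    iou_threshold: Float32,
    max_boxes: Int,
) -> List[Int]:
    var suppressed = List[Bool]()
    for _ in range(len(boxes)):
        suppressed.append(False)
    var kept = List[Int]()
    while len(kept) < max_boxes:
        # Pick the highest-scoring surviving box.
        var best = -1
        var best_score: Float32 = -1e30
        for i in range(len(boxes)):
            if not suppressed[i] and scores[i] > best_score:
                best = i
                best_score = scores[i]
        if best < 0:
            break
        kept.append(best)
        suppressed[best] = True
        # Drop every remaining box that overlaps the kept box too much.
        for i in range(len(boxes)):
            if not suppressed[i] and boxes[best].iou(boxes[i]) > iou_threshold:
                suppressed[i] = True
    return kept
```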
--- ## block_reduce `block_reduce[type: DType, max_warps_per_block: Int](val: SIMD[type, 1]) -> SIMD[type, 1]` --- ## group_norm `group_norm[type: DType, rank: Int, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], gamma_fn: fn[Int](IndexList[1]) capturing -> SIMD[type, $0], beta_fn: fn[Int](IndexList[1]) capturing -> SIMD[type, $0], /, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("gpu")](shape: IndexList[rank], epsilon: SIMD[type, 1], groups: SIMD[int32, 1], output: NDBuffer[type, rank, origin, shape, strides], ctx: DeviceContextPtr)` --- ## group_norm_gpu `group_norm_gpu[type: DType, rank: Int, //, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], gamma_fn: fn[Int](IndexList[1]) capturing -> SIMD[type, $0], beta_fn: fn[Int](IndexList[1]) capturing -> SIMD[type, $0]](shape: IndexList[rank, element_type=element_type], epsilon: SIMD[type, 1], output: NDBuffer[type, rank, origin, shape, strides], num_groups: Int, ctx: DeviceContext)` --- ## group_norm_gpu_block `group_norm_gpu_block[type: DType, simd_width: UInt, input_fn: fn[Int](row: Int, col: Int) capturing -> SIMD[type, $0], gamma_fn: fn[Int](IndexList[1]) capturing -> SIMD[type, $0], beta_fn: fn[Int](IndexList[1]) capturing -> SIMD[type, $0]](output: NDBuffer[type, 2, MutableAnyOrigin], epsilon: SIMD[type, 1], num_groups: Int, channels_per_group: Int, spatial: Int)` --- ## group_norm_gpu_warp_tiling `group_norm_gpu_warp_tiling[type: DType, simd_width: UInt, input_fn: fn[Int](row: Int, col: Int) capturing -> SIMD[type, $0], gamma_fn: fn[Int](IndexList[1]) capturing -> SIMD[type, $0], beta_fn: fn[Int](IndexList[1]) capturing -> SIMD[type, $0]](output: NDBuffer[type, 2, MutableAnyOrigin], epsilon: SIMD[type, 1], num_groups: Int, channels_per_group: Int, spatial: Int)` --- ## group_norm_reshape `group_norm_reshape[type: DType, rank: Int](shape: IndexList[rank, element_type=element_type], buf: NDBuffer[type, rank, origin, shape, strides], channels_per_group: Int, spatial: Int) -> NDBuffer[type, 2, origin]` Reshapes an input buffer for group normalization by flattening all dimensions except the group dimension. Returns a 2D buffer of shape (num\_groups \* N, group\_size), where group\_size is the product of channels\_per\_group and spatial. --- ## group_norm_shape `group_norm_shape[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], gamma: NDBuffer[type, 1, origin], beta: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], num_groups: SIMD[int32, 1]) -> IndexList[rank]` --- ## normalization ## Functions * [​`block_reduce`](./block_reduce): * [​`group_norm`](./group_norm): * [​`group_norm_gpu`](./group_norm_gpu): * [​`group_norm_gpu_block`](./group_norm_gpu_block): * [​`group_norm_gpu_warp_tiling`](./group_norm_gpu_warp_tiling): * [​`group_norm_reshape`](./group_norm_reshape): Reshapes an input buffer for group normalization by flattening all dimensions except the group dimension. Returns a 2D buffer of shape (num\_groups \* N, group\_size), where group\_size is the product of channels\_per\_group and spatial. * [​`group_norm_shape`](./group_norm_shape): * [​`layer_norm`](./layer_norm): * [​`layer_norm_cpu`](./layer_norm_cpu): Computes layernorm(elementwise\_fn(x)) across the last dimension of x, where layernorm is defined as $(x - \text{mean}(x)) / \sqrt{\text{var}(x) + \epsilon} \cdot \gamma + \beta$.
* [​`layer_norm_gpu`](./layer_norm_gpu): * [​`layer_norm_gpu_block`](./layer_norm_gpu_block): * [​`layer_norm_gpu_warp_tiling`](./layer_norm_gpu_warp_tiling): * [​`layer_norm_reshape`](./layer_norm_reshape): * [​`layer_norm_shape`](./layer_norm_shape): Compute the output shape of a `layer_norm` operation. * [​`rms_norm`](./rms_norm): * [​`rms_norm_cpu`](./rms_norm_cpu): * [​`rms_norm_gpu`](./rms_norm_gpu): * [​`rms_norm_gpu_block`](./rms_norm_gpu_block): * [​`rms_norm_gpu_warp_tiling`](./rms_norm_gpu_warp_tiling): * [​`rms_norm_shape`](./rms_norm_shape): * [​`welford_block_all_reduce`](./welford_block_all_reduce): * [​`welford_combine`](./welford_combine): * [​`welford_update`](./welford_update): * [​`welford_warp_all_reduce`](./welford_warp_all_reduce): * [​`welford_warp_reduce`](./welford_warp_reduce): --- ## layer_norm `layer_norm[type: DType, rank: Int, input_0_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_1_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], /, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](shape: IndexList[rank], gamma_shape: IndexList[1], beta: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], output: NDBuffer[type, rank, origin, shape, strides], ctx: DeviceContextPtr)` --- ## layer_norm_cpu `layer_norm_cpu[type: DType, //, input_fn: fn[Int](Int, Int) capturing -> SIMD[type, $0], gamma_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](out_buf: NDBuffer[type, 2, origin, shape], beta: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1])` Computes layernorm(elementwise\_fn(x)) across the last dimension of x, where layernorm is defined as $(x - \text{mean}(x)) / \sqrt{\text{var}(x) + \epsilon} \cdot \gamma + \beta$. Currently performs 3 passes over the input data. This can be reduced to 2 by fusing the add, mean, and variance loops using Welford's algorithm. **Parameters:** * ​type (`DType`): The element dtype of the x and out buffers. * ​input\_fn (`fn[Int](Int, Int) capturing -> SIMD[type, $0]`): Function called to generate an input value. * ​gamma\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): Function called to generate a gamma value. **Args:** * ​out\_buf (`NDBuffer[type, 2, origin, shape]`): The output buffer. * ​beta (`NDBuffer[type, 1, origin]`): The beta value to use in the layernorm calculation. * ​epsilon (`SIMD[type, 1]`): The eps value to use in the layernorm calculation.
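Since the note above mentions Welford's algorithm (which the `welford_update`/`welford_combine` helpers implement in SIMD and warp-shuffle form), here is a scalar sketch of the single-pass update; the names are illustrative:

```mojo
fn welford_update_scalar(
    val: Float64,
    mut mean: Float64,
    mut m2: Float64,   # running sum of squared deviations
    mut count: Float64,
):
    count += 1.0
    var delta = val - mean
    mean += delta / count
    m2 += delta * (val - mean)  # uses the already-updated mean

def main():
    var mean: Float64 = 0.0
    var m2: Float64 = 0.0
    var count: Float64 = 0.0
    var data = List[Float64](1.0, 2.0, 3.0, 4.0)
    for i in range(len(data)):
        welford_update_scalar(data[i], mean, m2, count)
    print(mean)        # 2.5
    print(m2 / count)  # population variance: 1.25
```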
`layer_norm_cpu[type: DType, rank: Int, //, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], gamma_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](shape: IndexList[rank, element_type=element_type], beta: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], output: NDBuffer[type, rank, origin, shape, strides])` --- ## layer_norm_gpu `layer_norm_gpu[type: DType, rank: Int, //, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], gamma_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](shape: IndexList[rank, element_type=element_type], beta: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], output: NDBuffer[type, rank, origin, shape, strides], *, ctx: DeviceContext)` --- ## layer_norm_gpu_block `layer_norm_gpu_block[type: DType, //, simd_width: UInt, input_fn: fn[Int](row: Int, col: Int) capturing -> SIMD[type, $0], gamma_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](output: NDBuffer[type, 2, MutableAnyOrigin], beta: NDBuffer[type, 1, MutableAnyOrigin], epsilon: SIMD[type, 1])` --- ## layer_norm_gpu_warp_tiling `layer_norm_gpu_warp_tiling[type: DType, //, simd_width: UInt, input_fn: fn[Int](row: Int, col: Int) capturing -> SIMD[type, $0], gamma_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](output: NDBuffer[type, 2, MutableAnyOrigin], beta: NDBuffer[type, 1, MutableAnyOrigin], epsilon: SIMD[type, 1])` --- ## layer_norm_reshape `layer_norm_reshape[type: DType, rank: Int, //, output_rank: Int](shape: IndexList[rank, element_type=element_type], buf: NDBuffer[type, rank, origin, shape, strides]) -> NDBuffer[type, output_rank, origin]` --- ## layer_norm_shape `layer_norm_shape[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], gamma: NDBuffer[type, 1, origin, __init__[::Intable](1)], beta: NDBuffer[type, 1, origin, __init__[::Intable](1)], epsilon: SIMD[type, 1]) -> IndexList[rank]` Compute the output shape of a `layer_norm` operation. **Parameters:** * ​type (`DType`): Type of the input tensors. * ​rank (`Int`): Rank of the input tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input (`NDBuffer[type, rank, origin]`): The input tensor. * ​gamma (`NDBuffer[type, 1, origin, __init__[::Intable](1)]`): The tensor for gamma coefficient. * ​beta (`NDBuffer[type, 1, origin, __init__[::Intable](1)]`): The tensor for beta coefficient. * ​epsilon (`SIMD[type, 1]`): The epsilon coefficient. **Returns:** The output shape.
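To make the layernorm definition above concrete, here is a naive single-row reference; for brevity, gamma and beta are scalars here, whereas the kernels above read per-element gamma values through `gamma_fn`:

```mojo
from math import sqrt

fn layer_norm_row(
    mut row: List[Float32],
    gamma: Float32,
    beta: Float32,
    eps: Float32,
):
    var n = Float32(len(row))
    var mean: Float32 = 0.0
    for i in range(len(row)):
        mean += row[i]
    mean /= n
    var variance: Float32 = 0.0
    for i in range(len(row)):
        var d = row[i] - mean
        variance += d * d
    variance /= n
    var inv_std = 1.0 / sqrt(variance + eps)
    for i in range(len(row)):
        row[i] = (row[i] - mean) * inv_std * gamma + beta
```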
--- ## rms_norm `rms_norm[type: DType, rank: Int, input_0_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], /, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), multiply_before_cast: Bool = True](shape: IndexList[rank], gamma: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], output: NDBuffer[type, rank, origin], ctx: DeviceContextPtr)` --- ## rms_norm_cpu `rms_norm_cpu[type: DType, //, input_fn: fn[Int](Int, Int) capturing -> SIMD[type, $0], output_fn: fn[Int](Int, Int, SIMD[type, $0]) capturing -> None, multiply_before_cast: Bool](gamma: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], out_shape: IndexList[2])` `rms_norm_cpu[type: DType, rank: Int, //, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int](IndexList[rank], SIMD[type, $0]) capturing -> None, multiply_before_cast: Bool](shape: IndexList[rank], gamma: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1])` --- ## rms_norm_gpu `rms_norm_gpu[type: DType, rank: Int, //, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int](IndexList[rank], SIMD[type, $0]) capturing -> None, multiply_before_cast: Bool](shape: IndexList[rank, element_type=element_type], gamma: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], ctx: DeviceContext)` --- ## rms_norm_gpu_block `rms_norm_gpu_block[type: DType, //, simd_width: Int, max_warps_per_block: Int, input_fn: fn[Int](row: Int, col: Int) capturing -> SIMD[type, $0], output_fn: fn[Int](row: Int, col: Int, val: SIMD[type, $0]) capturing -> None, multiply_before_cast: Bool](gamma: NDBuffer[type, 1, MutableAnyOrigin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], num_cols: Int)` --- ## rms_norm_gpu_warp_tiling `rms_norm_gpu_warp_tiling[type: DType, //, simd_width: Int, max_warps_per_block: Int, input_fn: fn[Int](row: Int, col: Int) capturing -> SIMD[type, $0], output_fn: fn[Int](row: Int, col: Int, val: SIMD[type, $0]) capturing -> None, multiply_before_cast: Bool](gamma: NDBuffer[type, 1, MutableAnyOrigin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], num_cols: Int)` --- ## rms_norm_shape `rms_norm_shape[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], gamma: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1]) -> IndexList[rank]` --- ## welford_block_all_reduce `welford_block_all_reduce[type: DType, //](thread_mean: SIMD[type, 1], thread_m2: SIMD[type, 1], thread_count: SIMD[type, 1], mut res_mean: SIMD[type, 1], mut res_m2: SIMD[type, 1], mut res_count: SIMD[type, 1])` --- ## welford_combine `welford_combine[type: DType, //](mean: SIMD[type, 1], m2: SIMD[type, 1], count: SIMD[type, 1], mut res_mean: SIMD[type, 1], mut res_m2: SIMD[type, 1], mut res_count: SIMD[type, 1])` --- ## welford_update `welford_update[type: DType, //](val: SIMD[type, 1], mut mean: SIMD[type, 1], mut m2: SIMD[type, 1], mut count: SIMD[type, 1])` --- ## welford_warp_all_reduce `welford_warp_all_reduce[type: DType, //](thread_mean: SIMD[type, 1], thread_m2: SIMD[type, 1], thread_count: SIMD[type, 1], mut res_mean: SIMD[type, 1], mut res_m2: SIMD[type, 1], mut res_count: SIMD[type, 1])` --- ## welford_warp_reduce `welford_warp_reduce[type: DType, //](thread_mean: SIMD[type, 1], thread_m2: SIMD[type, 1], thread_count: SIMD[type, 1], mut res_mean: SIMD[type, 1], mut res_m2: SIMD[type, 1], mut res_count: 
SIMD[type, 1])` --- ## pad ## Functions * [​`pad_constant`](./pad_constant): Fill `output` with values from `input`, and edges padded with `constant` based on `paddings`. * [​`pad_reflect`](./pad_reflect): Fill `output` with values from `input`, and edges padded with reflected values from the unpadded region. * [​`pad_repeat`](./pad_repeat): Fill `output` with values from `input`, and edges padded with boundary values from the unpadded region. * [​`pad_shape`](./pad_shape): Compute the output shape of a `pad` operation, and assert the inputs are compatible. --- ## pad_constant `pad_constant[output_layout: Layout, input_layout: Layout, type: DType, paddings_type: DType, constant_type: DType](output: LayoutTensor[type, output_layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[type, input_layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings: UnsafePointer[SIMD[paddings_type, 1]], constant: SIMD[constant_type, 1])` Fill `output` with values from `input`, and edges padded with `constant` based on `paddings`. Example:

```mojo
var input_shape = (X, Y, Z)
var paddings = [x0, x1, y0, y1, z0, z1]

out[x, y, z] = input[x - x0, y - y0, z - z0] if x ∈ [x0, x0 + X] && y ∈ [y0, y0 + Y] && z ∈ [z0, z0 + Z] else constant
```

**Args:** * ​output (`LayoutTensor[type, output_layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output buffer. * ​input (`LayoutTensor[type, input_layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input buffer. * ​paddings (`UnsafePointer[SIMD[paddings_type, 1]]`): Ordered (before, after) padding sizes for each axis. * ​constant (`SIMD[constant_type, 1]`): The constant to pad output with. --- ## pad_reflect `pad_reflect[output_layout: Layout, input_layout: Layout, type: DType, paddings_type: DType](output: LayoutTensor[type, output_layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[type, input_layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings: UnsafePointer[SIMD[paddings_type, 1]])` Fill `output` with values from `input`, and edges padded with reflected values from the unpadded region. Example:

```mojo
var input = [[1, 2], [3, 4]]
var paddings = [2, 2, 1, 0]
```

Yields:

```mojo
output = [[2, 1, 2], [4, 3, 4], [2, 1, 2], [4, 3, 4], [2, 1, 2], [4, 3, 4]]
```

**Args:** * ​output (`LayoutTensor[type, output_layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output buffer. * ​input (`LayoutTensor[type, input_layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input buffer. * ​paddings (`UnsafePointer[SIMD[paddings_type, 1]]`): Ordered (before, after) padding sizes for each axis.
--- ## pad_repeat `pad_repeat[output_layout: Layout, input_layout: Layout, type: DType, paddings_type: DType](output: LayoutTensor[type, output_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[type, input_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings: UnsafePointer[SIMD[paddings_type, 1]])` Fill `output` with values from `input`, and edges padded with boundary values from the unpadded region. Example:

```mojo
var input = [[1, 2], [3, 4]]
var paddings = [2, 2, 1, 0]
```

Yields:

```mojo
output = [[1, 1, 2], [1, 1, 2], [1, 1, 2], [3, 3, 4], [3, 3, 4], [3, 3, 4]]
```

**Parameters:** * ​output\_layout (`Layout`): Layout of the output buffer. * ​input\_layout (`Layout`): Layout of the input buffer. * ​type (`DType`): DType of the input/output buffer. * ​paddings\_type (`DType`): DType of the paddings buffer. **Args:** * ​output (`LayoutTensor[type, output_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output buffer. * ​input (`LayoutTensor[type, input_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input buffer. * ​paddings (`UnsafePointer[SIMD[paddings_type, 1]]`): Ordered (before, after) padding sizes for each axis. --- ## pad_shape `pad_shape[input_type: DType, paddings_type: DType, single_thread_blocking_override: Bool](input_buf: LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings_buf: LayoutTensor[paddings_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> IndexList[layout.rank()]` Compute the output shape of a `pad` operation, and assert the inputs are compatible. **Parameters:** * ​input\_type (`DType`): Type of the input tensor. * ​paddings\_type (`DType`): Type of the padding tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input\_buf (`LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to pad. * ​paddings\_buf (`LayoutTensor[paddings_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The paddings tensor, of shape (input\_rank, 2). **Returns:** The output shape.
--- ## get_padding_output_shape `get_padding_output_shape[rank: Int](input_shape: IndexList[rank], paddings: LayoutTensor[index, __init__[::Origin[::Bool(IntTuple((rank * 2))), origin]) -> IndexList[rank]` --- ## pad_gpu ## Functions * [​`get_padding_output_shape`](./get_padding_output_shape): * [​`pad_constant`](./pad_constant): Fill `output` with values from `input`, and edges padded with `constant` based on `paddings`. --- ## pad_constant `pad_constant[rank: Int, type: DType, padding_type: DType](output: UnsafePointer[SIMD[type, 1]], output_shape: IndexList[rank], input: UnsafePointer[SIMD[type, 1]], input_shape: IndexList[rank], paddings: UnsafePointer[SIMD[padding_type, 1]], constant: SIMD[type, 1], ctx: DeviceContext)` Fill `output` with values from `input`, and edges padded with `constant` based on `paddings`. Example:

```mojo
var input_shape = (X, Y, Z)
var paddings = [x0, x1, y0, y1, z0, z1]

out[x, y, z] = input[x - x0, y - y0, z - z0] if x ∈ [x0, x0 + X] && y ∈ [y0, y0 + Y] && z ∈ [z0, z0 + Z] else constant
```

**Args:** * ​output (`UnsafePointer[SIMD[type, 1]]`): The output buffer. * ​output\_shape (`IndexList[rank]`): The output shape. * ​input (`UnsafePointer[SIMD[type, 1]]`): The input buffer. * ​input\_shape (`IndexList[rank]`): The input shape. * ​paddings (`UnsafePointer[SIMD[padding_type, 1]]`): Ordered (before, after) padding sizes for each axis. * ​constant (`SIMD[type, 1]`): The constant to pad output with. * ​ctx (`DeviceContext`): Device context for the participating GPU. --- ## PoolMethod `@register_passable(trivial)` `struct PoolMethod` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `AVG` `alias AVG = PoolMethod(1)` ### `MAX` `alias MAX = PoolMethod(0)` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` --- ## avg_pool `avg_pool[type: DType, int_type: DType, rank: Int = 4, count_boundary: Bool = False](input: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], filter: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], strides: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dilations: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ceil_mode: Bool = False)` Computes the average pool. **Parameters:** * ​count\_boundary (`Bool`): Whether to count the boundary in the average computation.
**Args:** * ​input (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Batched image input to the pool2d operator. * ​filter (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Filter size on height and width dimensions with assumed tuple def (filter\_h, filter\_w). * ​strides (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Strides on height and width dimensions with assumed tuple def (stride\_h, stride\_w). * ​dilations (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Dilations on height and width dimensions with assumed tuple def (dilation\_h, dilation\_w). * ​paddings (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Paddings on height and width dimensions with assumed tuple def (pad\_h\_before, pad\_h\_after, pad\_w\_before, pad\_w\_after). * ​output (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Pre-allocated output tensor space. * ​ceil\_mode (`Bool`): Ceiling mode defines the output shape and implicit padding. --- ## avg_pool_gpu `avg_pool_gpu[type: DType, int_type: DType, count_boundary: Bool = False](ctx: DeviceContext, input: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], filter: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], strides: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dilations: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ceil_mode: Bool = False)` Computes the average pool on GPU. **Parameters:** * ​count\_boundary (`Bool`): Whether to count the boundary in the average computation. **Args:** * ​ctx (`DeviceContext`): The DeviceContext to use for GPU execution.
* ​input (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On device) Batched image input to the pool2d operator. * ​filter (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On host) Filter size on height and width dimensions with assumed tuple def (filter\_h, filter\_w). * ​strides (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On host) Strides on height and width dimensions with assumed tuple def (stride\_h, stride\_w). * ​dilations (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On host) Dilations on height and width dimensions with assumed tuple def (dilation\_h, dilation\_w). * ​paddings (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On host) Paddings on height and width dimensions with assumed tuple def (pad\_h\_before, pad\_h\_after, pad\_w\_before, pad\_w\_after). * ​output (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On device) Pre-allocated output tensor space. * ​ceil\_mode (`Bool`): Ceiling mode defines the output shape and implicit padding. --- ## pool ## Structs * [​`PoolMethod`](./PoolMethod): ## Functions * [​`avg_pool`](./avg_pool): Computes the average pool. * [​`avg_pool_gpu`](./avg_pool_gpu): Computes the average pool on GPU. * [​`max_pool`](./max_pool): Computes fp32 pooling. * [​`max_pool_gpu`](./max_pool_gpu): Computes max pooling on GPU. * [​`pool_shape`](./pool_shape): * [​`pool_shape_ceil`](./pool_shape_ceil): * [​`pool_shape_impl`](./pool_shape_impl): Compute the output shape of a pooling operation, and assert the inputs are compatible. Works for 2D pool operations only in the NHWC format.
--- ## max_pool `max_pool[type: DType, int_type: DType](input: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], filter: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], strides: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dilations: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ceil_mode: Bool = False)` Computes fp32 pooling. **Args:** * ​input (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Batched image input to the pool2d operator. * ​filter (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Filter size on height and width dimensions with assumed tuple def (filter\_h, filter\_w). * ​strides (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Strides on height and width dimensions with assumed tuple def (stride\_h, stride\_w). * ​dilations (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Dilations on height and width dimensions with assumed tuple def (dilation\_h, dilation\_w). * ​paddings (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Paddings on height and width dimensions with assumed tuple def (pad\_h\_before, pad\_h\_after, pad\_w\_before, pad\_w\_after). * ​output (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Pre-allocated output tensor space. * ​ceil\_mode (`Bool`): Ceiling mode defines the output shape and implicit padding.
--- ## max_pool_gpu `max_pool_gpu[type: DType, int_type: DType](ctx: DeviceContext, input: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], filter: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], strides: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dilations: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings: LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ceil_mode: Bool = False)` Computes max pooling on GPU. **Args:** * ​ctx (`DeviceContext`): The DeviceContext to use for GPU execution. * ​input (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On device) Batched image input to the pool2d operator. * ​filter (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On host) Filter size on height and width dimensions with assumed tuple def (filter\_h, filter\_w). * ​strides (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On host) Strides on height and width dimensions with assumed tuple def (stride\_h, stride\_w). * ​dilations (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On host) Dilations on height and width dimensions with assumed tuple def (dilation\_h, dilation\_w). * ​paddings (`LayoutTensor[int_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On host) Paddings on height and width dimensions with assumed tuple def (pad\_h\_before, pad\_h\_after, pad\_w\_before, pad\_w\_after). * ​output (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): (On device) Pre-allocated output tensor space. * ​ceil\_mode (`Bool`): Ceiling mode defines the output shape and implicit padding.
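Before the `pool_shape` helpers below, it may help to see the standard 2D pooling output-shape arithmetic they compute per spatial dimension; this sketch assumes the usual convention (effective filter span with dilation, floor or ceil rounding per `ceil_mode`), and its names are illustrative rather than part of this API:

```mojo
fn pool_out_dim(
    in_dim: Int,
    filter: Int,
    stride: Int,
    dilation: Int,
    pad_before: Int,
    pad_after: Int,
    ceil_mode: Bool,
) -> Int:
    var effective_filter = dilation * (filter - 1) + 1
    var span = in_dim + pad_before + pad_after - effective_filter
    if ceil_mode:
        return (span + stride - 1) // stride + 1  # round up
    return span // stride + 1  # round down

def main():
    # 224x224 input, 3x3 filter, stride 2, dilation 1, padding 1 on each side.
    print(pool_out_dim(224, 3, 2, 1, 1, 1, False))  # 112
```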
--- ## pool_shape `pool_shape[input_type: DType, filter_type: DType, strides_type: DType, dilations_type: DType, paddings_type: DType, single_thread_blocking_override: Bool](input_buf: LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], filter_buf: LayoutTensor[filter_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], strides_buf: LayoutTensor[strides_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dilations_buf: LayoutTensor[dilations_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings_buf: LayoutTensor[paddings_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> IndexList[layout.rank()]` --- ## pool_shape_ceil `pool_shape_ceil[input_type: DType, filter_type: DType, strides_type: DType, dilations_type: DType, paddings_type: DType, single_thread_blocking_override: Bool](input_buf: LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], filter_buf: LayoutTensor[filter_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], strides_buf: LayoutTensor[strides_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dilations_buf: LayoutTensor[dilations_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings_buf: LayoutTensor[paddings_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> IndexList[layout.rank()]` --- ## pool_shape_impl `pool_shape_impl[input_type: DType, filter_type: DType, strides_type: DType, dilations_type: DType, paddings_type: DType, single_thread_blocking_override: Bool, ceil_mode: Bool](input_buf: LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], filter_buf: LayoutTensor[filter_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], strides_buf: LayoutTensor[strides_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dilations_buf: LayoutTensor[dilations_type, layout, origin, address_space=address_space, 
element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], paddings_buf: LayoutTensor[paddings_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> IndexList[layout.rank()]` Compute the output shape of a pooling operation, and assert the inputs are compatible. Works for 2D pool operations only in the NHWC format. **Parameters:** * ​input\_type (`DType`): Type of the input tensor. * ​filter\_type (`DType`): Type of the filter tensor. * ​strides\_type (`DType`): Type of the strides tensor. * ​dilations\_type (`DType`): Type of the dilations tensor. * ​paddings\_type (`DType`): Type of the paddings tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​ceil\_mode (`Bool`): Defines the rounding mode for the shape calculation. **Args:** * ​input\_buf (`LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * ​filter\_buf (`LayoutTensor[filter_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The filter size buffer. * ​strides\_buf (`LayoutTensor[strides_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The strides size buffer. * ​dilations\_buf (`LayoutTensor[dilations_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The dilations size buffer. * ​paddings\_buf (`LayoutTensor[paddings_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The paddings size buffer. **Returns:** The output shape. --- ## rand_uniform ## Functions * [​`random_uniform`](./random_uniform): Call `output_fn` with values generated from a uniform distribution on \[lower\_bound, upper\_bound] for floating-point types or \[lower\_bound, upper\_bound) for integer types. --- ## random_uniform `random_uniform[: origin.set, dtype: DType, rank: Int, //, output_fn: fn[Int, Int](idx: IndexList[$1], val: SIMD[dtype, $0]) capturing -> None, target: StringSlice[StaticConstantOrigin]](shape: IndexList[rank], lower_bound: SIMD[dtype, 1], upper_bound: SIMD[dtype, 1], seed_value: SIMD[uint64, 1], ctx: DeviceContextPtr)` Call `output_fn` with values generated from a uniform distribution on \[lower\_bound, upper\_bound] for floating-point types or \[lower\_bound, upper\_bound) for integer types. **Parameters:** * ​dtype (`DType`): The data type to generate. * ​rank (`Int`): The rank of the underlying buffer. * ​output\_fn (`fn[Int, Int](idx: IndexList[$1], val: SIMD[dtype, $0]) capturing -> None`): The function which stores the generated values. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * ​shape (`IndexList[rank]`): The shape of the output being stored into by output\_fn.
* ​lower\_bound (`SIMD[dtype, 1]`): The lower bound on the uniform range. * ​upper\_bound (`SIMD[dtype, 1]`): The upper bound on the uniform range. * ​seed\_value (`SIMD[uint64, 1]`): Seed value used to initialize the random number generator. * ​ctx (`DeviceContextPtr`): The device context. --- ## randn ## Functions * [​`random_normal`](./random_normal): Fill `output` with values generated from a Normal(mean, variance) distribution. --- ## random_normal `random_normal[type: DType, mean: SIMD[float64, 1], variance: SIMD[float64, 1]](output: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Fill `output` with values generated from a Normal(mean, variance) distribution. **Args:** * ​output (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output buffer. --- ## repeat_interleave ## Functions * [​`repeat_interleave`](./repeat_interleave): Fill `output` by repeating values from `input` along `axis` based on the values in the `repeats` buffer. * [​`repeat_interleave_shape`](./repeat_interleave_shape): --- ## repeat_interleave `repeat_interleave[type: DType, rank: Int, type_repeats: DType](input: NDBuffer[type, rank, origin], repeats: NDBuffer[type_repeats, 1, origin], axis: Int, output: NDBuffer[type, rank, origin])` Fill `output` by repeating values from `input` along `axis` based on the values in the `repeats` buffer. This is intended to implement the same functionality as `torch.repeat_interleave`. **Args:** * ​input (`NDBuffer[type, rank, origin]`): The input buffer. * ​repeats (`NDBuffer[type_repeats, 1, origin]`): The number of repetitions for each element in input. * ​axis (`Int`): The axis along which to repeat values. * ​output (`NDBuffer[type, rank, origin]`): The output buffer.
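To make the semantics concrete, here is a minimal pure-Mojo sketch of the same per-element repetition over a single axis; `repeat_interleave_1d` is a hypothetical helper, not part of this API, and it ignores the buffer and device machinery of the real kernel.

```mojo
fn repeat_interleave_1d(input: List[Int], repeats: List[Int]) -> List[Int]:
    # Each input[i] is copied repeats[i] times, preserving order, which
    # mirrors torch.repeat_interleave along a single axis. The output
    # length is the sum of the entries in repeats.
    var out = List[Int]()
    for i in range(len(input)):
        for _ in range(repeats[i]):
            out.append(input[i])
    return out
```

For example, an input of `[4, 7]` with repeats `[2, 3]` yields `[4, 4, 7, 7, 7]`.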
--- ## repeat_interleave_shape `repeat_interleave_shape[type_repeats: DType](input: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], repeats: NDBuffer[type_repeats, 1, origin], axis: Int) -> IndexList[rank]` --- ## reshape ## Functions * [​`ndbuffer_reshape`](./ndbuffer_reshape): * [​`reshape`](./reshape): * [​`reshape_shape`](./reshape_shape): --- ## ndbuffer_reshape `ndbuffer_reshape[rank: Int, output_rank: Int, type: DType, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], new_shape: IndexList[output_rank]) -> NDBuffer[type, output_rank, origin]` --- ## reshape `reshape[rank: Int, type: DType, //, output_rank: Int, single_thread_blocking_override: Bool = True](input: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], new_shape: IndexList[output_rank]) -> NDBuffer[type, output_rank, origin]` --- ## reshape_shape `reshape_shape[input_rank: Int, output_rank: Int, input_type: DType, target_shape_type: DType, single_thread_blocking_override: Bool](input_buf: NDBuffer[input_type, input_rank, origin], target_shape_buf: NDBuffer[target_shape_type, 1, origin]) -> IndexList[output_rank]` --- ## CoordinateTransformationMode `struct CoordinateTransformationMode` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `AlignCorners` `alias AlignCorners = CoordinateTransformationMode(1)` ### `Asymmetric` `alias Asymmetric = CoordinateTransformationMode(2)` ### `HalfPixel` `alias HalfPixel = CoordinateTransformationMode(0)` ### `HalfPixel1D` `alias HalfPixel1D = CoordinateTransformationMode(3)` ## Methods ### `__init__` `@implicit` `__init__(out self, value: Int)` ### `__eq__` `__eq__(self, other: Self) -> Bool` --- ## InterpolationMode `struct InterpolationMode` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `Linear` `alias Linear = InterpolationMode(0)` ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` --- ## Interpolator `@register_passable(trivial)` `struct Interpolator[mode: InterpolationMode]` ## Fields * ​cubic\_coeff (`SIMD[float32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(cubic_coeff: SIMD[float32, 1]) -> Self` `__init__() -> Self` ### `filter_length` `static filter_length() -> Int` ### `filter` `filter(self, x: SIMD[float32, 1]) -> SIMD[float32, 1]` --- ## RoundMode `struct RoundMode` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `Ceil` `alias Ceil = RoundMode(3)` ### `Floor` `alias Floor = RoundMode(2)` ### `HalfDown` `alias HalfDown = RoundMode(0)` ### `HalfUp` `alias HalfUp = RoundMode(1)` ## Methods ### `__init__` `@implicit` `__init__(out self, value: Int)` ### `__eq__` `__eq__(self, other: Self) -> Bool` --- ## coord_transform `coord_transform[mode: CoordinateTransformationMode](out_coord: Int, in_dim: Int, out_dim: Int, scale: SIMD[float32, 1]) -> SIMD[float32, 1]` --- ## resize ## Structs * [​`CoordinateTransformationMode`](./CoordinateTransformationMode): * [​`InterpolationMode`](./InterpolationMode): * [​`Interpolator`](./Interpolator): * [​`RoundMode`](./RoundMode): ## Functions * [​`coord_transform`](./coord_transform): * [​`interpolate_point_1d`](./interpolate_point_1d): * 
[​`linear_filter`](./linear_filter): This is a tent filter. * [​`resize_linear`](./resize_linear): Resizes input to output shape using linear interpolation. * [​`resize_nearest_neighbor`](./resize_nearest_neighbor): --- ## interpolate_point_1d `interpolate_point_1d[coordinate_transformation_mode: CoordinateTransformationMode, antialias: Bool, rank: Int, type: DType, interpolation_mode: InterpolationMode](interpolator: Interpolator[interpolation_mode], dim: Int, out_coords: IndexList[rank], scale: SIMD[float32, 1], input: NDBuffer[type, rank, origin], output: NDBuffer[type, rank, origin])` --- ## linear_filter `linear_filter(x: SIMD[float32, 1]) -> SIMD[float32, 1]` This is a tent filter: f(x) = 1 + x for -1 <= x < 0, f(x) = 1 - x for 0 <= x <= 1, and f(x) = 0 otherwise. --- ## resize_linear `resize_linear[coordinate_transformation_mode: CoordinateTransformationMode, antialias: Bool, rank: Int, type: DType](input: NDBuffer[type, rank, origin], output: NDBuffer[type, rank, origin])` Resizes input to output shape using linear interpolation. Does not use an anti-aliasing filter for downsampling (coming soon). **Parameters:** * ​coordinate\_transformation\_mode (`CoordinateTransformationMode`): How to map a coordinate in output to a coordinate in input. * ​antialias (`Bool`): Whether or not to use an antialiasing linear/cubic filter, which, when downsampling, uses more points to avoid aliasing artifacts. Effectively stretches the filter by a factor of 1 / scale. * ​rank (`Int`): Rank of the input and output. * ​type (`DType`): Type of input and output. **Args:** * ​input (`NDBuffer[type, rank, origin]`): The input to be resized. * ​output (`NDBuffer[type, rank, origin]`): The output containing the resized input. --- ## resize_nearest_neighbor `resize_nearest_neighbor[coordinate_transformation_mode: CoordinateTransformationMode, round_mode: RoundMode, rank: Int, type: DType](input: NDBuffer[type, rank, origin], output: NDBuffer[type, rank, origin])` --- ## Weighted2DPoint `@register_passable(trivial)` `struct Weighted2DPoint[type: DType]` Utility class to wrap 2-d point coordinates and a floating point weight for bilinear interpolation. ## Fields * ​y (`Int`): * ​x (`Int`): * ​w (`SIMD[type, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(y: Int, x: Int, weight: SIMD[type, 1]) -> Self` --- ## roi_align ## Structs * [​`Weighted2DPoint`](./Weighted2DPoint): Utility class to wrap 2-d point coordinates and a floating point weight for bilinear interpolation. ## Functions * [​`roi_align_nhwc`](./roi_align_nhwc): Compute ROIAlign over a batch of ROIs of shape \[M, 5], where the first element is the batch index, followed by the region box coordinates (y0, x0), (y1, x1). For inputs in NHWC format. The output shape is \[M, output\_height, output\_width, C].
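The bilinear weights that `Weighted2DPoint` carries follow the standard scheme: a continuous sample point contributes to its four integer neighbors in proportion to the opposite fractional areas. A minimal sketch of that weight computation (a hypothetical free function, not part of this module):

```mojo
from math import floor

fn bilinear_corner_weights(y: Float64, x: Float64) -> SIMD[DType.float64, 4]:
    # Fractional offsets of the sample point inside its pixel cell.
    var dy = y - floor(y)
    var dx = x - floor(x)
    # Weights for (top-left, top-right, bottom-left, bottom-right);
    # they are non-negative and always sum to 1.
    return SIMD[DType.float64, 4](
        (1.0 - dy) * (1.0 - dx),
        (1.0 - dy) * dx,
        dy * (1.0 - dx),
        dy * dx,
    )
```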
--- ## roi_align_nhwc `roi_align_nhwc[type: DType, output_layout: Layout, input_layout: Layout, roi_layout: Layout, //, aligned: Bool, mode: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("AVG")](output: LayoutTensor[type, output_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[type, input_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], rois: LayoutTensor[type, roi_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output_height: Int, output_width: Int, in_spatial_scale: SIMD[dtype, 1], in_sampling_ratio: SIMD[dtype, 1])` Compute ROIAlign over a batch of ROIs of shape \[M, 5], where the first element is the batch index, followed by the region box coordinates (y0, x0), (y1, x1). For inputs in NHWC format. The output shape is \[M, output\_height, output\_width, C]. **Parameters:** * ​type (`DType`): Type of the input tensor. * ​output\_layout (`Layout`): The output layout. * ​input\_layout (`Layout`): The input layout. * ​roi\_layout (`Layout`): The layout of the regions of interest (ROI). * ​aligned (`Bool`): If not true, offset the ROIs by 0.5. * ​mode (`StringSlice[StaticConstantOrigin]`): The pooling mode: "AVG" for average and "MAX" for max pooling. --- ## apply_penalties_to_logits `apply_penalties_to_logits[logit_type: DType, penalty_type: DType, //, target: StringSlice[StaticConstantOrigin]](logits: LayoutTensor[logit_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], compressed_frequency_data: LayoutTensor[int32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], frequency_offsets: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], frequency_penalty: LayoutTensor[penalty_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], presence_penalty: LayoutTensor[penalty_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], repetition_penalty: LayoutTensor[penalty_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ctx: DeviceContextPtr)` Apply penalties to the logits based on the frequency of the tokens in the batch. The frequency data is stored in a CSR format, where frequency\_offsets holds the starting index of each sequence in the frequency\_data array.
The frequency\_data array is a 2D array, where: * frequency\_data\[i, 0] is the token id * frequency\_data\[i, 1] is the frequency of the token in the sequence --- ## sampling ## Functions * [​`apply_penalties_to_logits`](./apply_penalties_to_logits): Apply penalties to the logits based on the frequency of the tokens in the batch. * [​`update_frequency_data`](./update_frequency_data): Update the frequency data for the given new tokens. --- ## update_frequency_data `update_frequency_data[token_type: DType, //, target: StringSlice[StaticConstantOrigin]](compressed_frequency_data: LayoutTensor[int32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], frequency_offsets: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], new_tokens: LayoutTensor[token_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ctx: DeviceContextPtr)` Update the frequency data for the given new tokens. The frequency data is stored in a CSR format. This kernel expects there will be enough padding for each sequence to store the new tokens. --- ## get_sliding_window_out_dim `get_sliding_window_out_dim[ceil_mode: Bool = False](in_dim: Int, ft_dim: Int, dilation: Int, stride: Int, pad: Int) -> Int` Return output dimension for a sliding window operation along some dimension. **Parameters:** * ​ceil\_mode (`Bool`): Define rounding mode for shape calculation. **Args:** * ​in\_dim (`Int`): The size of the input dimension. * ​ft\_dim (`Int`): The size of the corresponding filter dimension. * ​dilation (`Int`): The dilation for the sliding window operation. * ​stride (`Int`): The stride for the sliding window operation. * ​pad (`Int`): The total padding for the sliding window operation. **Returns:** The size of the output dimension. --- ## shapes ## Functions * [​`get_sliding_window_out_dim`](./get_sliding_window_out_dim): Return output dimension for a sliding window operation along some dimension. 
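The arithmetic behind `get_sliding_window_out_dim` is the usual convolution/pooling shape formula; below is a minimal runnable sketch under that assumption, with `ceil_mode` as a runtime flag rather than a compile-time parameter and a hypothetical function name.

```mojo
fn sliding_window_out_dim(
    in_dim: Int, ft_dim: Int, dilation: Int, stride: Int, pad: Int, ceil_mode: Bool
) -> Int:
    # Extent of the filter once dilation spreads its taps out.
    var effective_filter = dilation * (ft_dim - 1) + 1
    # Number of valid starting positions, before dividing by the stride.
    var span = in_dim + pad - effective_filter
    if ceil_mode:
        # Round the division up so a trailing partial window still yields an output.
        return (span + stride - 1) // stride + 1
    return span // stride + 1

fn main():
    # A 3-wide filter with stride 2 over 10 elements and no padding: 4 outputs.
    print(sliding_window_out_dim(10, 3, 1, 2, 0, False))
```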
--- ## copy_to_slice `copy_to_slice[type: DType, start_type: DType, end_type: DType, step_type: DType, in_rank: Int, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](buffer: NDBuffer[type, in_rank, origin], in_slice: NDBuffer[type, in_rank, origin], start: NDBuffer[start_type, 1, origin], end: NDBuffer[end_type, 1, origin], step: NDBuffer[step_type, 1, origin], context: DeviceContextPtr = DeviceContextPtr())` --- ## slice ## Functions * [​`copy_to_slice`](./copy_to_slice): * [​`slice_as_copy`](./slice_as_copy): * [​`slice_as_view`](./slice_as_view): * [​`slice_dim_as_view`](./slice_dim_as_view): * [​`slice_shape`](./slice_shape): --- ## slice_as_copy `slice_as_copy[type: DType, index_type: DType, in_rank: Int](output: NDBuffer[type, in_rank, origin], tensor: NDBuffer[type, in_rank, origin], start: NDBuffer[index_type, 1, origin], end: NDBuffer[index_type, 1, origin], step: NDBuffer[index_type, 1, origin])` --- ## slice_as_view `slice_as_view[type: DType, start_type: DType, end_type: DType, step_type: DType, rank: Int](tensor: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], starts: NDBuffer[start_type, 1, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], ends: NDBuffer[end_type, 1, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], steps: NDBuffer[step_type, 1, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> NDBuffer[type, rank, origin]` --- ## slice_dim_as_view `slice_dim_as_view[type: DType, rank: Int, dim: Int](tensor: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], start: Int, end: Int, step: Int) -> NDBuffer[type, rank, origin]` --- ## slice_shape `slice_shape[input_rank: Int, input_type: DType, start_type: DType, stop_type: DType, step_type: DType, single_thread_blocking_override: Bool](input_buf: NDBuffer[input_type, input_rank, origin], start_buf: NDBuffer[start_type, 1, origin], stop_buf: NDBuffer[stop_type, 1, origin], step_buf: NDBuffer[step_type, 1, origin]) -> IndexList[input_rank]` --- ## identity `identity(x: SIMD[dtype, size]) -> SIMD[dtype, size]` --- ## softmax ## Functions * [​`identity`](./identity): * [​`logsoftmax`](./logsoftmax): Performs an unbatched logsoftmax on an input tensor using the three-pass algorithm. * [​`mul`](./mul): * [​`reciprocal`](./reciprocal): * [​`reduce_add_simd`](./reduce_add_simd): This function adds val to either the scalar value or the vector value depending on the step\_simd\_width. This is useful when the simd\_width varies between iterations as in vectorize. * [​`softmax`](./softmax): * [​`softmax_2_pass`](./softmax_2_pass): Performs an unbatched softmax on an input tensor using the two-pass online algorithm. * [​`softmax_3_pass`](./softmax_3_pass): Performs an unbatched softmax on an input tensor using the three-pass algorithm. * [​`softmax_kernel`](./softmax_kernel): * [​`sub`](./sub): --- ## logsoftmax `logsoftmax[simd_width: Int, buffer_size: Dim, type: DType, origins: origin.set, input_fn_1d: fn[Int](Int) capturing -> SIMD[type, $0]](output: NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)])` Performs an unbatched logsoftmax on an input tensor using the three-pass algorithm.
The unbatched three-pass softmax is defined as: procedure SoftmaxUnbatched(Input) maxVal = -∞ accum = 0 STEP 1: find the max value in each batch for i = 0 to N do maxVal = max(maxVal, Input\[b, i]) end for STEP 2: compute the sum of exponentials of each batch for i = 0 to N do Output\[b, i] = Input\[b, i] - maxVal accum += exp(Output\[b, i]) end for STEP 3: normalize each batch for i = 0 to N do Output\[b, i] -= log(accum) end for **Parameters:** * ​simd\_width (`Int`): The simd\_width to use in vectorization. * ​buffer\_size (`Dim`): The size of the input and output buffers. * ​type (`DType`): The type of the input and output buffers. * ​origins (`origin.set`): The OriginSet of captured arguments by the input\_fn\_1d. * ​input\_fn\_1d (`fn[Int](Int) capturing -> SIMD[type, $0]`): The elementwise input lambda. **Args:** * ​output (`NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)]`): The output buffer in which to store the softmax values. `logsoftmax[: origin.set, //, type: DType, simd_width: Int, rank: Int, static_shape: DimList, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](shape: IndexList[rank], output: NDBuffer[type, rank, origin, static_shape], axis: Int)` `logsoftmax[type: DType, simd_width: Int, rank: Int, static_shape: DimList](input: NDBuffer[type, rank, origin, static_shape], output: NDBuffer[type, rank, origin, static_shape], axis: Int)` --- ## mul `mul(x: SIMD[dtype, size], y: SIMD[dtype, size]) -> SIMD[dtype, size]` --- ## reciprocal `reciprocal(x: SIMD[dtype, size]) -> SIMD[dtype, size]` --- ## reduce_add_simd `reduce_add_simd[simd_width: Int, step_simd_width: Int, type: DType](mut scalar: SIMD[type, 1], mut vector: SIMD[type, simd_width], val: SIMD[type, step_simd_width])` This function adds val to either the scalar value or the vector value depending on the step\_simd\_width. This is useful when the simd\_width varies between iterations as in vectorize. --- ## softmax `softmax[type: DType, simd_width: Int, rank: Int, static_shape: DimList](input: NDBuffer[type, rank, origin, static_shape], output: NDBuffer[type, rank, origin, static_shape], axis: Int)` `softmax[: origin.set, //, type: DType, simd_width: Int, rank: Int, static_shape: DimList, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](shape: IndexList[rank], output: NDBuffer[type, rank, origin, static_shape], axis: Int, context: DeviceContextPtr = DeviceContextPtr())` --- ## softmax_2_pass `softmax_2_pass[simd_width: Int, buffer_size: Dim, type: DType](output: NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)], input: NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)])` Performs an unbatched softmax on an input tensor using the two-pass online algorithm. The unbatched two-pass online softmax is described in "Online normalizer calculation for softmax" and "A full-stack search technique for domain optimized deep learning accelerators", and is defined as: procedure SoftmaxUnbatched(Input) runningMax = -∞ runningSum = 0 STAGE 1: for i = 0 to N do newMax = max(runningMax, Input\[i]) runningSum = runningSum\*exp(runningMax-newMax) + exp(Input\[i]-newMax) runningMax = newMax end for for i = 0 to N do Output\[i] = exp(Input\[i] - runningMax) / runningSum end for **Parameters:** * ​simd\_width (`Int`): The simd\_width to use in vectorization. * ​buffer\_size (`Dim`): The size of the input and output buffers.
* ​type (`DType`): The type of the input and output buffers. **Args:** * ​output (`NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)]`): The output buffer in which to store the softmax values. * ​input (`NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)]`): The input buffer used to compute the softmax. --- ## softmax_3_pass `softmax_3_pass[simd_width: Int, buffer_size: Dim, type: DType, origins: origin.set, input_fn_1d: fn[Int](Int) capturing -> SIMD[type, $0]](output: NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)])` Performs an unbatched softmax on an input tensor using the three-pass algorithm. The unbatched three-pass softmax is defined as: procedure SoftmaxUnbatched(Input) maxVal = -∞ denom = 0 STEP 1: find the max value in each batch for i = 0 to N do maxVal = max(maxVal, Input\[b, i]) end for STEP 2: compute the exponential for each batch for i = 0 to N do Output\[b, i] = exp(Input\[b, i] - maxVal) denom += Output\[b, i] end for STEP 3: normalize each batch for i = 0 to N do Output\[b, i] /= denom end for **Parameters:** * ​simd\_width (`Int`): The simd\_width to use in vectorization. * ​buffer\_size (`Dim`): The size of the input and output buffers. * ​type (`DType`): The type of the input and output buffers. * ​origins (`origin.set`): The OriginSet of captured arguments by the input\_fn\_1d. * ​input\_fn\_1d (`fn[Int](Int) capturing -> SIMD[type, $0]`): The elementwise input lambda. **Args:** * ​output (`NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)]`): The output buffer in which to store the softmax values. --- ## softmax_kernel `softmax_kernel[: origin.set, //, BLOCK_SIZE: Int, input_fn: fn[DType, Int, Int](IndexList[$2]) capturing -> SIMD[$0, $1], type: DType, rank: Int, accum_type: DType = get_accum_type[::DType,::DType]()](shape: IndexList[rank], output: NDBuffer[type, rank, MutableAnyOrigin])` --- ## sub `sub(x: SIMD[dtype, size], y: SIMD[dtype, size]) -> SIMD[dtype, size]` --- ## split ## Functions * [​`split`](./split): --- ## split `split[type: DType, num_outputs: Int, target: StringSlice[StaticConstantOrigin], trace_description: StringSlice[StaticConstantOrigin], outputs_origin: MutableOrigin, outputs_layout: Layout](input: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], axis: Int, outputs: StaticTuple[LayoutTensor[type, outputs_layout, outputs_origin], num_outputs], ctx: DeviceContext)` --- ## tile ## Functions * [​`tile`](./tile): Implements the `Tile` operator from the ONNX spec. This behaves like NumPy tile, but without broadcast. * [​`tile_shape`](./tile_shape): Compute the output shape of a `tile` operation, and assert the inputs are compatible. --- ## tile `tile[type: DType, type_repeats: DType](input: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], repeats: LayoutTensor[type_repeats, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Implements the `Tile` operator from the ONNX spec. This behaves like NumPy tile, but without broadcast.
**Parameters:** * ​type (`DType`): Type of the input and output tensors. * ​type\_repeats (`DType`): Type of the repeats tensor. **Args:** * ​input (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * ​repeats (`LayoutTensor[type_repeats, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): One-dimensional tensor that specifies the number of repeated copies along each of the input's dimensions. Length equals input tensor rank. * ​output (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor. Has the same dimensions and type as input. --- ## tile_shape `tile_shape[input_type: DType, repeats_type: DType, single_thread_blocking_override: Bool](input_buf: LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], repeats_buf: LayoutTensor[repeats_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> IndexList[layout.rank()]` Compute the output shape of a `tile` operation, and assert the inputs are compatible. **Parameters:** * ​input\_type (`DType`): Type of the input tensor. * ​repeats\_type (`DType`): Type of the repeats tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input\_buf (`LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * ​repeats\_buf (`LayoutTensor[repeats_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The repeats tensor. **Returns:** The output shape.
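The shape rule that `tile_shape` enforces is simply an elementwise product of the input shape and the repeats vector; a small list-based sketch of that rule (`tiled_out_shape` is a hypothetical helper, not part of this module):

```mojo
fn tiled_out_shape(input_shape: List[Int], repeats: List[Int]) -> List[Int]:
    # Output dim i is input dim i multiplied by repeats[i]; the two
    # lists must have the same length (the input rank).
    var out = List[Int]()
    for i in range(len(input_shape)):
        out.append(input_shape[i] * repeats[i])
    return out
```

So a `[2, 3]` input tiled with repeats `[2, 1]` produces a `[4, 3]` output.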
--- ## TopK_2 `@register_passable(trivial)` `struct TopK_2[T: DType, largest: Bool = True]` ## Fields * ​p (`Int`): * ​u (`SIMD[T, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` ### `insert` `insert(mut self, elem: SIMD[T, 1], elem_id: Int)` --- ## bottom_k_shape `bottom_k_shape[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], max_k: Int, axis: Int) -> IndexList[rank]` --- ## fused_token_sampling_cpu `fused_token_sampling_cpu[type: DType, rank: Int, out_idx_type: DType](max_k: Int, input: NDBuffer[type, rank, origin], out_idxs: NDBuffer[out_idx_type, rank, origin], k: OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]]({:i1 0, 1}), temperature: OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]({:i1 0, 1}), top_p: OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]({:i1 0, 1}), seed: OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]]({:i1 0, 1}))` Generalized implementation of the Top K algorithm with sampling. Returns the sampled index from the innermost dimension of the input tensor for each row/subvolume. **Parameters:** * ​type (`DType`): Data type of the input buffer. * ​rank (`Int`): Rank of the input. * ​out\_idx\_type (`DType`): Data type of the output indices. **Args:** * ​max\_k (`Int`): Largest number of top elements. * ​input (`NDBuffer[type, rank, origin]`): NDBuffer\[type, rank] (Any shape) - The input tensor. * ​out\_idxs (`NDBuffer[out_idx_type, rank, origin]`): NDBuffer\[out\_idx\_type, rank] (shape of \[input\_shape\[:-1]] + \[1]) - The output indices. * ​k (`OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]]`): Optional device buffer of top elements to keep for each batch element. * ​temperature (`OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]`): The temperature based scaling. * ​top\_p (`OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]`): Only use the tokens whose cumulative probability exceeds this threshold. * ​seed (`OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]]`): The seed to use for the random number generator. --- ## fused_token_sampling_gpu `fused_token_sampling_gpu[type: DType, rank: Int, out_idx_type: DType, //](ctx: DeviceContext, max_k: Int, input: NDBuffer[type, rank, origin], out_idxs: NDBuffer[out_idx_type, rank, origin], block_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), num_blocks_per_input: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), k: OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]]({:i1 0, 1}), temperature: OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]({:i1 0, 1}), top_p: OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]({:i1 0, 1}), seed: OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]]({:i1 0, 1}))` Top K algorithm with fused sampling. Returns the sampled indices from the Top-K of the innermost dimension of the input tensor for each row/subvolume.
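As a reference point for what "fused" means here, the following is a naive, single-threaded sketch of the same pipeline: select the top-k logits, apply temperature-scaled softmax over just those, then inverse-CDF sample. All names are hypothetical; the real kernels operate on device buffers and batch dimensions.

```mojo
from math import exp
from random import random_float64

fn sample_from_top_k(logits: List[Float64], k: Int, temperature: Float64) -> Int:
    # Pick the indices of the k largest logits (selection scan; fine for a sketch).
    var picked = List[Int]()
    var used = List[Bool]()
    for _ in range(len(logits)):
        used.append(False)
    for _ in range(k):
        var best = -1
        for i in range(len(logits)):
            if not used[i] and (best == -1 or logits[i] > logits[best]):
                best = i
        used[best] = True
        picked.append(best)
    # Temperature-scaled softmax over the survivors, stabilized by the
    # largest logit (picked[0] was selected first, so it holds the max).
    var weights = List[Float64]()
    var total = 0.0
    for j in range(k):
        var w = exp((logits[picked[j]] - logits[picked[0]]) / temperature)
        weights.append(w)
        total += w
    # Inverse-CDF sampling over the renormalized top-k distribution.
    var r = random_float64() * total
    var acc = 0.0
    for j in range(k):
        acc += weights[j]
        if r <= acc:
            return picked[j]
    return picked[k - 1]
```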
--- ## topk ## Structs * [​`TopK_2`](./TopK_2): ## Functions * [​`bottom_k_shape`](./bottom_k_shape): * [​`fused_token_sampling_cpu`](./fused_token_sampling_cpu): Generalized implementation of the Top K algorithm with sampling. Returns the sampled index from the innermost dimension of the input tensor for each row/subvolume. * [​`fused_token_sampling_gpu`](./fused_token_sampling_gpu): Top K algorithm with fused sampling. Returns the sampled indices from the Top-K of the innermost dimension of the input tensor for each row/subvolume. * [​`top_k`](./top_k): Implementation of the Top K algorithm. Returns the top or bottom K elements and their index along a specified axis. * [​`top_k_shape`](./top_k_shape): * [​`top_k_shape_impl`](./top_k_shape_impl): Compute the output shape of a top/bottom k operation. * [​`topk_gpu`](./topk_gpu): Generalized implementation of the Top K algorithm with/without sampling. Returns the sampled index from the innermost dimension of the input tensor for each row/subvolume or the top K values and indices across the tensor. --- ## top_k `top_k[rank: Int, type: DType, out_idx_type: DType, //, largest: Bool = True, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input: NDBuffer[type, rank, origin], max_k: Int, axis: Int, out_vals: NDBuffer[type, rank, origin], out_idxs: NDBuffer[out_idx_type, rank, origin], sorted: Bool, ctx: DeviceContextPtr, k: OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]]({:i1 0, 1}))` Implementation of the Top K algorithm. Returns the top or bottom K elements and their index along a specified axis. **Parameters:** * ​rank (`Int`): Rank of the input. * ​type (`DType`): Data type of the input buffer. * ​out\_idx\_type (`DType`): The data type of the output indices (default is DType.int64). * ​largest (`Bool`): Whether to find the maximum (top k) or minimum value (bottom k). * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * ​input (`NDBuffer[type, rank, origin]`): The input tensor. * ​max\_k (`Int`): The largest number of top elements. * ​axis (`Int`): The axis along which to operate. * ​out\_vals (`NDBuffer[type, rank, origin]`): Output values. * ​out\_idxs (`NDBuffer[out_idx_type, rank, origin]`): Output indices. * ​sorted (`Bool`): Indicates if the top/bottom K elements are in (stable) sorted order. * ​ctx (`DeviceContextPtr`): The device call context. * ​k (`OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]]`): Per batch element k value. --- ## top_k_shape `top_k_shape[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], max_k: Int, axis: Int) -> IndexList[rank]` --- ## top_k_shape_impl `top_k_shape_impl[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], max_k: Int, axis: Int) -> IndexList[rank]` Compute the output shape of a top/bottom k operation. **Parameters:** * ​type (`DType`): Data type of the input buffer. * ​rank (`Int`): Rank of the input. * ​single\_thread\_blocking\_override (`Bool`): If this function can block. **Args:** * ​input (`NDBuffer[type, rank, origin]`): The input tensor. * ​max\_k (`Int`): The maximum K value. * ​axis (`Int`): The axis value in a tensor. **Returns:** The output shape. 
--- ## topk_gpu `topk_gpu[type: DType, rank: Int, out_idx_type: DType, //, sampling: Bool = True, largest: Bool = True](ctx: DeviceContext, max_k: Int, input: NDBuffer[type, rank, origin], out_vals: NDBuffer[type, rank, origin], out_idxs: NDBuffer[out_idx_type, rank, origin], block_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), num_blocks_per_input: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), k: OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]]({:i1 0, 1}), temperature: OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]({:i1 0, 1}), top_p: OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]({:i1 0, 1}), seed: OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]]({:i1 0, 1}))` Generalized implementation of the Top K algorithm with/without sampling. Returns the sampled index from the innermost dimension of the input tensor for each row/subvolume or the top K values and indices across the tensor. **Parameters:** * ​type (`DType`): DType - The data type of the input tensor. * ​rank (`Int`): Int - The rank of the input tensor. * ​out\_idx\_type (`DType`): DType - The data type of the output indices (default is DType.index). * ​sampling (`Bool`): Bool - Whether to return token samples from topK dist (default is True). * ​largest (`Bool`): Bool - Whether to find the maximum or minimum value. **Args:** * ​ctx (`DeviceContext`): DeviceContext The context for GPU execution. * ​max\_k (`Int`): Int Largest number of top elements to keep for each batch element. * ​input (`NDBuffer[type, rank, origin]`): NDBuffer\[type, rank] Input tensor as a device NDBuffer. * ​out\_vals (`NDBuffer[type, rank, origin]`): NDBuffer\[type, rank] Output buffer on device for the K largest values. * ​out\_idxs (`NDBuffer[out_idx_type, rank, origin]`): NDBuffer\[DType.index, rank] Output buffer on device for the indices of the K largest values, or sampled token indices. Last dimension is 1 if sampling is True, otherwise K. * ​block\_size (`OptionalReg[Int]`): Int The number of threads per block (default is 256 from TRT and empirical testing). * ​num\_blocks\_per\_input (`OptionalReg[Int]`): OptionalReg\[Int] Number of blocks per input (default computed from input size and block size). This is the equivalent of "BLOCKS\_PER\_BEAM" in TRT-LLM kernel allowing for much larger batch sizes through packing several elements per thread in the first stage. * ​k (`OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]]`): Optional NDBuffer\[DType.int64, 1, MutableAnyOrigin] Device buffer of top elements to keep for each batch element. * ​temperature (`OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]`): The temperature based scaling. * ​top\_p (`OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]`): Only use the tokens whose cumulative probability exceeds this threshold. * ​seed (`OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]]`): The seed to use for the random number generator. --- ## toppminp ## Functions * [​`merge`](./merge): Merge two sorted subarrays into one sorted array. * [​`merge_sort_recursive`](./merge_sort_recursive): Recursive merge sort implementation. * [​`min_p_sampling`](./min_p_sampling): Naive CPU implementation of Min-P sampling for token selection. This function applies temperature scaling, softmax, a merge sort, and then samples tokens based on the calculated probability threshold (Min-P). 
* [​`sort_buf_descending`](./sort_buf_descending): Sort each batch separately in descending order using parallel merge sort. * [​`top_p_sampling`](./top_p_sampling): Naive CPU implementation of Top-P sampling for token selection. This function applies temperature scaling, softmax, a merge sort, and then samples tokens based on the cumulative probability mass (Top-P). --- ## merge `merge[type: DType, out_idx_type: DType](mut buf_keys: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], mut buf_ids: LayoutTensor[out_idx_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], start: Int, mid: Int, end: Int)` Merge two sorted subarrays into one sorted array. --- ## merge_sort_recursive `merge_sort_recursive[type: DType, out_idx_type: DType](mut buf_keys: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], mut buf_ids: LayoutTensor[out_idx_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], start: Int, end: Int)` Recursive merge sort implementation. --- ## min_p_sampling `min_p_sampling[type: DType, out_idx_type: DType, //, _test_sort: Bool = False](min_ps: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input_logits: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], out_token_ids: LayoutTensor[out_idx_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], temperature: SIMD[type, 1] = __init__[__mlir_type.!pop.int_literal](1))` Naive CPU implementation of Min-P sampling for token selection. This function applies temperature scaling, softmax, a merge sort, and then samples tokens based on the calculated probability threshold (Min-P). --- ## sort_buf_descending `sort_buf_descending[type: DType, out_idx_type: DType](mut buf_keys: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], mut buf_ids: LayoutTensor[out_idx_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], vocab_size: Int)` Sort each batch separately in descending order using parallel merge sort. 
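The Min-P cut that `min_p_sampling` applies after sorting can be stated in a few lines: the threshold is `min_p` times the largest probability, and only tokens at or above it stay in the candidate pool. A sketch under that assumption (`min_p_keep_count` is a hypothetical helper; the real kernel also handles temperature scaling and the final sampling step):

```mojo
fn min_p_keep_count(sorted_probs: List[Float64], min_p: Float64) -> Int:
    # sorted_probs must be in descending order, as produced by
    # sort_buf_descending; everything below min_p * p_max is cut.
    var threshold = min_p * sorted_probs[0]
    var kept = 0
    for i in range(len(sorted_probs)):
        if sorted_probs[i] >= threshold:
            kept += 1
    return kept
```

Because the input is sorted, the scan could stop at the first probability below the threshold; it is written as a full pass for clarity.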
--- ## top_p_sampling `top_p_sampling[type: DType, out_idx_type: DType, //, _test_sort: Bool = False](top_ps: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input_logits: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], out_token_ids: LayoutTensor[out_idx_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], temperature: SIMD[type, 1] = __init__[__mlir_type.!pop.int_literal](1))` Naive CPU implementation of Top-P sampling for token selection. This function applies temperature scaling, softmax, a merge sort, and then samples tokens based on the cumulative probability mass (Top-P). --- ## toppminp_gpu ## Aliases ### `DEBUG_FILE` `alias DEBUG_FILE = False` ### `SEED` `alias SEED = 42` ## Functions * [​`min_p_sampling_gpu`](./min_p_sampling_gpu): GPU implementation of Min-P sampling for token selection. This function applies temperature scaling, softmax, a radix sort, and then samples tokens based on the calculated probability threshold (Min-P). * [​`normalize`](./normalize): * [​`normalize_u32`](./normalize_u32): * [​`radix_sort_pairs_kernel`](./radix_sort_pairs_kernel): Radix pair sort kernel for (default) descending order. * [​`run_radix_sort_pairs_gpu`](./run_radix_sort_pairs_gpu): * [​`top_p_sampling_gpu`](./top_p_sampling_gpu): GPU implementation of Top-P sampling for token selection. This function applies temperature scaling, softmax, a radix sort, and then samples tokens based on the cumulative probability mass (Top-P). * [​`topk_wrapper`](./topk_wrapper): Copy of `Kernels/mojo/nn/topk.mojo:_topk_stage1` with the addition of max\_vals and p\_threshold arguments to determine if sorting is needed for top-p/min-p sampling. * [​`topp_minp_sampling_kernel`](./topp_minp_sampling_kernel): Top P-Min P sampling kernel. --- ## min_p_sampling_gpu `min_p_sampling_gpu[type: DType, rank: Int, out_idx_type: DType, //, _test_sort: Bool = False](ctx: DeviceContext, min_ps: NDBuffer[type, 1, origin], input_logits: NDBuffer[type, rank, origin], out_token_ids: NDBuffer[out_idx_type, rank, origin], temperature: SIMD[type, 1] = __init__[__mlir_type.!pop.int_literal](1))` GPU implementation of Min-P sampling for token selection. This function applies temperature scaling, softmax, a radix sort, and then samples tokens based on the calculated probability threshold (Min-P). --- ## normalize `normalize(value: SIMD[bfloat16, 1]) -> SIMD[uint16, 1]` `normalize(value: SIMD[int32, 1]) -> SIMD[uint32, 1]` `normalize(value: SIMD[uint16, 1]) -> SIMD[uint16, 1]` `normalize(value: SIMD[float32, 1]) -> SIMD[uint32, 1]` `normalize(value: SIMD[dtype, 1]) -> SIMD[_uint_type_of_width[::Int](), 1]` Normalize the value to the appropriate unsigned integer type. This is needed for radix sort to work correctly. 
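The idea behind these `normalize` overloads is the standard radix-sort key trick: reinterpret the float's bits as an unsigned integer and remap them so that unsigned comparison agrees with floating-point order. A sketch of one common mapping, assuming the raw IEEE-754 bits are already in hand as a `UInt32` (the actual overloads may differ in detail):

```mojo
fn orderable_key(bits: UInt32) -> UInt32:
    # Negative floats: flip every bit, so more-negative values sort lower.
    if (bits & 0x80000000) != 0:
        return ~bits
    # Non-negative floats: set the sign bit, placing them above all negatives.
    return bits | 0x80000000
```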
--- ## normalize_u32 `normalize_u32(value: SIMD[uint32, 1]) -> SIMD[uint32, 1]` --- ## radix_sort_pairs_kernel `radix_sort_pairs_kernel[type: DType, out_idx_type: DType, current_bit: Int, ascending: Bool = False, BLOCK_SIZE: Int = 256, NUM_BITS_PER_PASS: Int = 4](input_keys_: UnsafePointer[SIMD[type, 1]], output_keys_: UnsafePointer[SIMD[type, 1]], input_key_ids_: UnsafePointer[SIMD[out_idx_type, 1]], output_key_ids_: UnsafePointer[SIMD[out_idx_type, 1]], num_keys: Int, skip_sort: UnsafePointer[SIMD[bool, 1]])` Radix pair sort kernel for (default) descending order. Implementation based on: AMD. Introduction to GPU Radix Sort. GPUOpen, 2017. **Parameters:** * ​type (`DType`): DType - Data type. * ​out\_idx\_type (`DType`): DType - Output index type. * ​current\_bit (`Int`): Int - Current bit to start sorting NUM\_BITS\_PER\_PASS bits at. * ​ascending (`Bool`): Bool - Whether to sort in ascending order. * ​BLOCK\_SIZE (`Int`): Int - Block size. * ​NUM\_BITS\_PER\_PASS (`Int`): Int - Number of bits per pass. **Args:** * ​input\_keys\_ (`UnsafePointer[SIMD[type, 1]]`): Input tensor values to sort. * ​output\_keys\_ (`UnsafePointer[SIMD[type, 1]]`): Output tensor values sorted in (default) descending order. * ​input\_key\_ids\_ (`UnsafePointer[SIMD[out_idx_type, 1]]`): Input tensor indices. * ​output\_key\_ids\_ (`UnsafePointer[SIMD[out_idx_type, 1]]`): Output tensor indices sorted in (default) descending order. * ​num\_keys (`Int`): Number of keys to sort per batch. * ​skip\_sort (`UnsafePointer[SIMD[bool, 1]]`): Whether sorting is skipped for this batch. --- ## run_radix_sort_pairs_gpu `run_radix_sort_pairs_gpu[type: DType, out_idx_type: DType, rank: Int, ascending: Bool = False, BLOCK_SIZE: Int = 256, NUM_BITS_PER_PASS: Int = 4](ctx: DeviceContext, mut input_keys: NDBuffer[type, rank, MutableAnyOrigin], mut output_keys: NDBuffer[type, rank, MutableAnyOrigin], mut input_key_ids: NDBuffer[out_idx_type, rank, MutableAnyOrigin], mut output_key_ids: NDBuffer[out_idx_type, rank, MutableAnyOrigin], skip_sort: NDBuffer[bool, rank, origin])` --- ## top_p_sampling_gpu `top_p_sampling_gpu[type: DType, rank: Int, out_idx_type: DType, //, _test_sort: Bool = False](ctx: DeviceContext, top_ps: NDBuffer[type, 1, origin], input_logits: NDBuffer[type, rank, origin], out_token_ids: NDBuffer[out_idx_type, rank, origin], temperature: SIMD[type, 1] = __init__[__mlir_type.!pop.int_literal](1))` GPU implementation of Top-P sampling for token selection. This function applies temperature scaling, softmax, a radix sort, and then samples tokens based on the cumulative probability mass (Top-P). --- ## topk_wrapper `topk_wrapper[T: DType, out_idx_type: DType, is_top_p: Bool, largest: Bool = True, _test_sort: Bool = False](K: Int, num_elements: Int, num_blocks_per_input: Int, in_buffer: UnsafePointer[SIMD[T, 1]], local_topk_vals: UnsafePointer[SIMD[T, 1]], local_topk_idxs: UnsafePointer[SIMD[out_idx_type, 1]], p_threshold: UnsafePointer[SIMD[T, 1]], skip_sort: UnsafePointer[SIMD[bool, 1]])` Copy of `Kernels/mojo/nn/topk.mojo:_topk_stage1` with the addition of max\_vals and p\_threshold arguments to determine if sorting is needed for top-p/min-p sampling.
Arguments: * ​K (`Int`): Number of top elements to select per block. * ​num\_elements (`Int`): Size of last dimension of input buffer (vocab size). * ​num\_blocks\_per\_input (`Int`): Number of blocks used to process the input data. * ​in\_buffer (`UnsafePointer[Scalar[T]]`): Input buffer containing the elements to process. * ​local\_topk\_vals (`UnsafePointer[Scalar[T]]`): Output buffer to store the local top-K values. * ​local\_topk\_idxs (`UnsafePointer[Scalar[out_idx_type]]`): Output buffer to store the indices of local top-K elements. * ​p\_threshold (`UnsafePointer[Scalar[T]]`): Threshold for top-p sampling if is\_top\_p is True, else the min-p coefficient. * ​skip\_sort (`UnsafePointer[Scalar[DType.bool]]`): Output buffer to store whether sorting is needed. **Parameters:** * ​T (`DType`): DType - The data type of the elements. * ​out\_idx\_type (`DType`): DType - The data type of the output indices. * ​is\_top\_p (`Bool`): Bool - Whether this is for top-p sampling or min-p sampling. * ​largest (`Bool`): Bool - Whether to find the maximum or minimum value. * ​\_test\_sort (`Bool`): Bool - An internal test flag to not skip sort if testing. --- ## topp_minp_sampling_kernel `topp_minp_sampling_kernel[type: DType, out_idx_type: DType, is_top_p: Bool](p_thresholds_: UnsafePointer[SIMD[type, 1]], sorted_probs_: UnsafePointer[SIMD[type, 1]], sorted_ids_: UnsafePointer[SIMD[out_idx_type, 1]], out_token_ids: UnsafePointer[SIMD[out_idx_type, 1]], skip_sort: UnsafePointer[SIMD[bool, 1]], vocab_size: Int)` Top P-Min P sampling kernel. **Parameters:** * ​type (`DType`): DType - scalar values dtype. * ​out\_idx\_type (`DType`): DType - output index type. * ​is\_top\_p (`Bool`): Bool - Whether to use Top-P (True) or Min-P (False) sampling. --- ## nvml Implements wrappers around the NVIDIA Management Library (nvml). ## Modules * [​`nvml`](./nvml/): Implements wrappers around the NVIDIA Management Library (nvml). --- ## ClockType `@register_passable(trivial)` `struct ClockType` ## Fields * ​code (`SIMD[int32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility` ## Aliases ### `GRAPHICS` `alias GRAPHICS = ClockType(__init__[__mlir_type.!pop.int_literal](0))` Graphics clock domain ### `MEM` `alias MEM = ClockType(__init__[__mlir_type.!pop.int_literal](2))` Memory clock domain ### `SM` `alias SM = ClockType(__init__[__mlir_type.!pop.int_literal](1))` SM clock domain ### `VIDEO` `alias VIDEO = ClockType(__init__[__mlir_type.!pop.int_literal](2))` Video clock domain ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` --- ## Device `struct Device` ## Fields * ​idx (`Int`): * ​device (`_DeviceImpl`): ## Implemented traits `AnyType`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self, idx: Int = 0)` ### `__copyinit__` `__copyinit__(out self, existing: Self)` ### `get_driver_version` `get_driver_version(self) -> DriverVersion` Returns NVIDIA driver version. ### `max_mem_clock` `max_mem_clock(self) -> Int` ### `max_graphics_clock` `max_graphics_clock(self) -> Int` ### `mem_clocks` `mem_clocks(self) -> List[Int, True]` ### `graphics_clocks` `graphics_clocks(self, memory_clock_mhz: Int) -> List[Int, True]` ### `set_clock` `set_clock(self, mem_clock: Int, graphics_clock: Int)` ### `gpu_turbo_enabled` `gpu_turbo_enabled(self) -> Bool` Returns True if the gpu turbo is enabled. ### `set_gpu_turbo` `set_gpu_turbo(self, enabled: Bool = True)` Sets the GPU turbo state.
### `get_persistence_mode` `get_persistence_mode(self) -> Bool` Returns True if the gpu persistence mode is enabled. ### `set_persistence_mode` `set_persistence_mode(self, enabled: Bool = True)` Sets the persistence mode. ### `set_max_gpu_clocks` `set_max_gpu_clocks(device)` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` ### `__repr__` `__repr__(self) -> String` --- ## DriverVersion `struct DriverVersion` ## Implemented traits `AnyType`, `Copyable`, `Movable`, `StringableRaising`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, value: List[String])` ### `major` `major(self) -> Int` ### `minor` `minor(self) -> Int` ### `patch` `patch(self) -> Int` ### `__str__` `__str__(self) -> String` --- ## EnableState `@register_passable(trivial)` `struct EnableState` ## Fields * ​code (`SIMD[int32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility` ## Aliases ### `DISABLED` `alias DISABLED = EnableState(__init__[__mlir_type.!pop.int_literal](0))` Feature disabled ### `ENABLED` `alias ENABLED = EnableState(__init__[__mlir_type.!pop.int_literal](1))` Feature enabled ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` --- ## Result `@register_passable(trivial)` `struct Result` ## Fields * ​code (`SIMD[int32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `Stringable`, `UnknownDestructibility` ## Aliases ### `ALREADY_INITIALIZED` `alias ALREADY_INITIALIZED = Result(__init__[__mlir_type.!pop.int_literal](5))` Deprecated: Multiple initializations are now allowed through ref counting ### `ARGUMENT_VERSION_MISMATCH` `alias ARGUMENT_VERSION_MISMATCH = Result(__init__[__mlir_type.!pop.int_literal](25))` The provided version is invalid/unsupported ### `CORRUPTED_INFOROM` `alias CORRUPTED_INFOROM = Result(__init__[__mlir_type.!pop.int_literal](14))` infoROM is corrupted ### `DEPRECATED` `alias DEPRECATED = Result(__init__[__mlir_type.!pop.int_literal](26))` The requested functionality has been deprecated ### `DRIVER_NOT_LOADED` `alias DRIVER_NOT_LOADED = Result(__init__[__mlir_type.!pop.int_literal](9))` NVIDIA driver is not loaded ### `FREQ_NOT_SUPPORTED` `alias FREQ_NOT_SUPPORTED = Result(__init__[__mlir_type.!pop.int_literal](24))` Ran out of critical resources, other than memory ### `FUNCTION_NOT_FOUND` `alias FUNCTION_NOT_FOUND = Result(__init__[__mlir_type.!pop.int_literal](13))` Local version of NVML doesn't implement this function ### `GPU_IS_LOST` `alias GPU_IS_LOST = Result(__init__[__mlir_type.!pop.int_literal](15))` The GPU has fallen off the bus or has otherwise become inaccessible ### `GPU_NOT_FOUND` `alias GPU_NOT_FOUND = Result(__init__[__mlir_type.!pop.int_literal](28))` No GPUs were found ### `IN_USE` `alias IN_USE = Result(__init__[__mlir_type.!pop.int_literal](19))` An operation cannot be performed because the GPU is currently in use ### `INSUFFICIENT_POWER` `alias INSUFFICIENT_POWER = Result(__init__[__mlir_type.!pop.int_literal](8))` A device's external power cables are not properly attached ### `INSUFFICIENT_RESOURCES` `alias INSUFFICIENT_RESOURCES = Result(__init__[__mlir_type.!pop.int_literal](23))` Ran out of critical resources, other than memory ### `INSUFFICIENT_SIZE` `alias INSUFFICIENT_SIZE = Result(__init__[__mlir_type.!pop.int_literal](7))` An input argument is not large enough ### `INVALID_ARGUMENT` `alias INVALID_ARGUMENT = 
Result(__init__[__mlir_type.!pop.int_literal](2))` A supplied argument is invalid ### `IRQ_ISSUE` `alias IRQ_ISSUE = Result(__init__[__mlir_type.!pop.int_literal](11))` NVIDIA Kernel detected an interrupt issue with a GPU ### `LIB_RM_VERSION_MISMATCH` `alias LIB_RM_VERSION_MISMATCH = Result(__init__[__mlir_type.!pop.int_literal](18))` RM detects a driver/library version mismatch ### `LIBRARY_NOT_FOUND` `alias LIBRARY_NOT_FOUND = Result(__init__[__mlir_type.!pop.int_literal](12))` NVML Shared Library couldn't be found or loaded ### `MEMORY` `alias MEMORY = Result(__init__[__mlir_type.!pop.int_literal](20))` Insufficient memory ### `NO_DATA` `alias NO_DATA = Result(__init__[__mlir_type.!pop.int_literal](21))` No data ### `NO_PERMISSION` `alias NO_PERMISSION = Result(__init__[__mlir_type.!pop.int_literal](4))` The current user does not have permission for operation ### `NOT_FOUND` `alias NOT_FOUND = Result(__init__[__mlir_type.!pop.int_literal](6))` A query to find an object was unsuccessful ### `NOT_READY` `alias NOT_READY = Result(__init__[__mlir_type.!pop.int_literal](27))` The system is not ready for the request ### `NOT_SUPPORTED` `alias NOT_SUPPORTED = Result(__init__[__mlir_type.!pop.int_literal](3))` The requested operation is not available on target device ### `OPERATING_SYSTEM` `alias OPERATING_SYSTEM = Result(__init__[__mlir_type.!pop.int_literal](17))` The GPU control device has been blocked by the operating system/cgroups ### `RESET_REQUIRED` `alias RESET_REQUIRED = Result(__init__[__mlir_type.!pop.int_literal](16))` The GPU requires a reset before it can be used again ### `SUCCESS` `alias SUCCESS = Result(__init__[__mlir_type.!pop.int_literal](0))` The operation was successful ### `TIMEOUT` `alias TIMEOUT = Result(__init__[__mlir_type.!pop.int_literal](10))` User provided timeout passed ### `UNINITIALIZED` `alias UNINITIALIZED = Result(__init__[__mlir_type.!pop.int_literal](1))` NVML was not first initialized with nvmlInit() ### `UNKNOWN` `alias UNKNOWN = Result(__init__[__mlir_type.!pop.int_literal](999))` An internal driver error occurred ### `VGPU_ECC_NOT_SUPPORTED` `alias VGPU_ECC_NOT_SUPPORTED = Result(__init__[__mlir_type.!pop.int_literal](22))` The requested vgpu operation is not available on target device, because ECC is enabled ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` ### `__str__` `__str__(self) -> String` --- ## nvml Implements wrappers around the NVIDIA Management Library (nvml). ## Aliases ### `CUDA_NVML_LIBRARY` `alias CUDA_NVML_LIBRARY = _Global[__init__[__mlir_type.!kgen.string]("CUDA_NVML_LIBRARY"), _OwnedDLHandle, _init_dylib]` ### `CUDA_NVML_LIBRARY_BASE_NAME` `alias CUDA_NVML_LIBRARY_BASE_NAME = "libnvidia-ml"` ### `CUDA_NVML_LIBRARY_DIR` `alias CUDA_NVML_LIBRARY_DIR = __init__[__mlir_type.!kgen.string]("/usr/lib/x86_64-linux-gnu")` ### `CUDA_NVML_LIBRARY_EXT` `alias CUDA_NVML_LIBRARY_EXT = ".so"` ## Structs * [​`ClockType`](./ClockType): * [​`Device`](./Device): * [​`DriverVersion`](./DriverVersion): * [​`EnableState`](./EnableState): * [​`Result`](./Result): --- ## quantization This package contains a set of APIs for quantizing tensor data. Quantization is a technique used to reduce the precision of floating-point numbers, which are used in most neural networks. Quantization is a type of lossy compression, which means that some precision is lost, but the resulting tensors take less memory and computations are faster. 
## Modules * [​`per_channel_grouped_4bit`](./per_channel_grouped_4bit/): * [​`qmatmul`](./qmatmul/): * [​`qmatmul_gpu`](./qmatmul_gpu/): * [​`qmatmul_k`](./qmatmul_k/): --- ## Q4sym `struct Q4sym[group_size: Int, float_dtype: DType = float32]` Q4sym: compresses values of type `float_dtype` to 4-bit unsigned integers which have been dynamically and symmetrically quantized with the given scale factor. `group_size` determines the number of elements which share quantization parameters. Values are stored in a strided fashion. For example, assume `group_size = 8` and we want to pack the uint4 numbers A, B, C, D, E, F, G, H, whose bits are aaaa, bbbb, cccc, and so on. The four storage bytes are laid out as:

```plaintext
eeeeaaaa|ffffbbbb|ggggcccc|hhhhdddd
```

To uncompress to floating point, take the decoded uint4 value, subtract the implicit zero-point of 8 (half the uint4 range of 2^4 = 16), and multiply by the scale factor (see the decoding sketch below). ## Parameters * ​group\_size (`Int`): The number of encoded numbers stored in this struct. * ​float\_dtype (`DType`): The floating point dtype this struct works with. ## Fields * ​scale (`StaticTuple[SIMD[uint8, 1], 2]`): The FP16 scale of the group, stored as individual bytes. * ​bits (`StaticTuple[SIMD[uint8, 1], (div_s(#lit.struct.extract, 2) + -1) if ((group_size , 2) == 0) ^ True)) else div_s(#lit.struct.extract, 2)]`): The bits of the encoded uint4 numbers. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Construct a default initialized Q4sym. `@implicit` `__init__(out self, data: SIMD[float_dtype, group_size])` Construct an encoded Q4sym from data. **Args:** * ​data (`SIMD[float_dtype, group_size]`): The floating point data to encode and store. ### `decode_scale` `decode_scale(mut self) -> SIMD[float16, 1]` Obtain the scale factor. **Returns:** The decoded scale factor. ### `decode_unsigned` `decode_unsigned(mut self) -> SIMD[uint8, group_size]` Decode the stored uint4 numbers to uint8. **Returns:** The decoded stored numbers as uint8 numbers. These have an implicit zero-point of 8. ### `decode_signed` `decode_signed(mut self) -> SIMD[int8, group_size]` Decode the stored uint4 numbers to requantized int4 numbers. This is done by simply subtracting an implicit zero-point of 8 from the unsigned decoding. **Returns:** The decoded stored numbers as int8 numbers. These have a zero-point of 0. ### `decode_fully` `decode_fully(mut self) -> SIMD[float_dtype, group_size]` Decode the stored numbers into floating point representation. **Returns:** The decoded numbers. ### `quantize_and_write_to_tensor` `static quantize_and_write_to_tensor[rank: Int](input_tensor: NDBuffer[float_dtype, rank, origin], output_tensor: NDBuffer[uint8, rank, origin], input_shape: IndexList[rank])` Encodes the floating point numbers in `input_tensor` along the inner-most dimension and writes the result to output\_tensor. **Parameters:** * ​rank (`Int`): The rank of the input and output tensors. **Args:** * ​input\_tensor (`NDBuffer[float_dtype, rank, origin]`): The input tensor we are encoding. * ​output\_tensor (`NDBuffer[uint8, rank, origin]`): The output tensor containing the encoded input. The shape of the output should be the same as the input except along the inner dimension where if the original inner dimension was `d`, the corresponding output dimension should be: ceil(`d` / group\_size) \* sizeof(self). * ​input\_shape (`IndexList[rank]`): The shape of the input tensor. 
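To make the byte layout above concrete, here is a minimal, hypothetical decoding sketch. It is illustrative only (the struct's `decode_*` methods are the real API): it pulls the low and high nibbles out of one stored byte, subtracts the implicit zero-point of 8, and applies the scale.

```mojo
fn decode_nibbles(byte: UInt8, scale: Float32) -> Tuple[Float32, Float32]:
    # In the strided layout, the low nibble holds element i and the high
    # nibble holds element i + group_size/2 (e.g. aaaa and eeee in eeeeaaaa).
    var lo = Int(byte & 0x0F)
    var hi = Int((byte >> 4) & 0x0F)
    # Subtract the implicit zero-point (8), then rescale to floating point.
    return (Float32(lo - 8) * scale, Float32(hi - 8) * scale)
```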
### `dequantize_and_write_to_tensor` `static dequantize_and_write_to_tensor[rank: Int, //](input_tensor: NDBuffer[uint8, rank, origin], output_tensor: NDBuffer[float_dtype, rank, origin], output_shape: IndexList[rank])` Decodes the quantized numbers in `input_tensor` along the inner-most dimension and writes the dequantized floating point result to output\_tensor. **Parameters:** * ​rank (`Int`): The rank of the input and output tensors. **Args:** * ​input\_tensor (`NDBuffer[uint8, rank, origin]`): The input tensor we are decoding. * ​output\_tensor (`NDBuffer[float_dtype, rank, origin]`): The output tensor containing the decoded input. * ​output\_shape (`IndexList[rank]`): The shape of the output tensor. --- ## block_Q4_K `struct block_Q4_K` ## Fields * ​base\_scale (`SIMD[float16, 1]`): * ​base\_min (`SIMD[float16, 1]`): * ​q\_scales\_and\_mins (`InlineArray[SIMD[uint8, 1], 12]`): * ​q\_bits (`InlineArray[SIMD[uint8, 1], 128]`): ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `group_count` `alias group_count = 8` ### `group_size` `alias group_size = 32` --- ## block_Q6_K `struct block_Q6_K` ## Fields * ​q\_bits\_lo (`InlineArray[SIMD[uint8, 1], 128]`): * ​q\_bits\_hi (`InlineArray[SIMD[uint8, 1], 64]`): * ​q\_scales (`InlineArray[SIMD[int8, 1], 16]`): * ​base\_scale (`SIMD[float16, 1]`): ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `group_count` `alias group_count = 16` ### `group_size` `alias group_size = 16` --- ## block_QK_K `struct block_QK_K` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `quantized_k` `alias quantized_k = 256` --- ## calculate_symmetric_vector `calculate_symmetric_vector[input_dtype: DType, simd_width: Int, output_bits: Int](data: SIMD[input_dtype, simd_width]) -> Tuple[SIMD[uint8, simd_width], SIMD[input_dtype, 1]]` Symmetrically quantizes the given SIMD vector `data` with input type `input_dtype` and `simd_width` elements, assuming we want the results to fit in an unsigned integer of size `output_bits`. **Parameters:** * ​input\_dtype (`DType`): The dtype of the input tensor. * ​simd\_width (`Int`): The width of the SIMD input. * ​output\_bits (`Int`): The bits we want to fit the unsigned integral result in. **Args:** * ​data (`SIMD[input_dtype, simd_width]`): The input SIMD we want to quantize. **Returns:** A tuple of the vector of quantized values and the associated scale factor. --- ## per_channel_grouped_4bit ## Structs * [​`block_Q4_K`](./block_Q4_K): * [​`block_Q6_K`](./block_Q6_K): * [​`block_QK_K`](./block_QK_K): * [​`Q4sym`](./Q4sym): Q4sym: compresses values of type `float_dtype` to 4-bit unsigned integers which have been dynamically and symmetrically quantized with the given scale factor. ## Functions * [​`calculate_symmetric_vector`](./calculate_symmetric_vector): Symmetrically quantizes the given SIMD vector `data` with input type `input_dtype` and `simd_width` elements, assuming we want the results to fit in an unsigned integer of size `output_bits`. 
* [​`q4_k_dequantize_impl`](./q4_k_dequantize_impl): * [​`q6_k_dequantize_impl`](./q6_k_dequantize_impl): * [​`scale_min_k4`](./scale_min_k4): --- ## q4_k_dequantize_impl `q4_k_dequantize_impl(input_tensor: NDBuffer[uint8, 2, origin], output_tensor: NDBuffer[float32, 2, origin])` --- ## q6_k_dequantize_impl `q6_k_dequantize_impl(input_tensor: NDBuffer[uint8, 2, origin], output_tensor: NDBuffer[float32, 2, origin], output_shape: IndexList[2])` --- ## scale_min_k4 `scale_min_k4(src_ptr: UnsafePointer[block_Q4_K], g: Int) -> Tuple[SIMD[float32, 1], SIMD[float32, 1]]` --- ## qmatmul ## Aliases ### `K_BATCH_SIZE` `alias K_BATCH_SIZE = 512` Defines the batch size of K used to pack A and unpack B weights. ## Functions * [​`matmul_qint4`](./matmul_qint4): * [​`matmul_qint4_pack_b`](./matmul_qint4_pack_b): --- ## matmul_qint4 `matmul_qint4[group_size: Int, b_static_shape: DimList = create_unknown[::Int](), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](a: NDBuffer[float32, 2, origin], b: NDBuffer[uint8, 2, origin, b_static_shape], c: NDBuffer[float32, 2, origin])` --- ## matmul_qint4_pack_b `matmul_qint4_pack_b[group_size: Int](b: NDBuffer[uint8, 2, origin], b_rot: NDBuffer[uint8, 2, origin])` --- ## args_to_tuple `args_to_tuple[swap: Bool](arg_0: Int, arg_1: Int) -> Tuple[Int, Int]` --- ## gpu_qint4_repack_GPTQ `gpu_qint4_repack_GPTQ[b_shape: DimList, b_packed_shape: DimList, //, group_size: Int, target: StringSlice[StaticConstantOrigin]](b: NDBuffer[uint8, 2, origin, b_shape], b_packed: NDBuffer[uint8, 2, origin, b_packed_shape], perm_idx: OptionalReg[NDBuffer[int32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[int32, 1, MutableAnyOrigin]]({:i1 0, 1}), ctx: DeviceContextPtr = DeviceContextPtr())` --- ## gpu_qint4_repack_Q4_0 `gpu_qint4_repack_Q4_0[b_shape: DimList, //, target: StringSlice[StaticConstantOrigin]](b: NDBuffer[uint8, 2, origin, b_shape], b_packed: NDBuffer[uint8, 2, origin, b_shape], ctx: DeviceContextPtr = DeviceContextPtr())` --- ## qmatmul_gpu ## Functions * [​`args_to_tuple`](./args_to_tuple): * [​`gpu_qint4_repack_GPTQ`](./gpu_qint4_repack_GPTQ): * [​`gpu_qint4_repack_Q4_0`](./gpu_qint4_repack_Q4_0): * [​`matmul_gpu_qint4`](./matmul_gpu_qint4): * [​`matmul_gpu_qint4_impl`](./matmul_gpu_qint4_impl): * [​`multistage_gemm_q`](./multistage_gemm_q): * [​`multistage_mma_q`](./multistage_mma_q): * [​`multistage_qgemm_kernel`](./multistage_qgemm_kernel): * [​`pack_Q_tile`](./pack_Q_tile): * [​`q_smem_usage`](./q_smem_usage): * [​`repack_GPTQ_for_sm8x`](./repack_GPTQ_for_sm8x): * [​`repack_Q4_0_for_sm8x`](./repack_Q4_0_for_sm8x): * [​`unpack_4bit_int`](./unpack_4bit_int): --- ## matmul_gpu_qint4 `matmul_gpu_qint4[c_type: DType, a_type: DType, //, group_size: Int, target: StringSlice[StaticConstantOrigin], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, origin, shape], a: NDBuffer[a_type, 2, origin, shape], b: NDBuffer[uint8, 2, origin, shape], ctx: DeviceContextPtr = DeviceContextPtr())` --- ## matmul_gpu_qint4_impl `matmul_gpu_qint4_impl[c_type: DType, a_type: DType, //, group_size: Int, target: StringSlice[StaticConstantOrigin], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, 
Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, origin, shape], a: NDBuffer[a_type, 2, origin, shape], b: NDBuffer[uint8, 2, origin, shape], ctx: Optional[DeviceContext])` --- ## multistage_gemm_q `multistage_gemm_q[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, group_size: Int, pack_factor: Int, config: MatmulConfig[a_type, b_type, c_type, True], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, origin, c_shape], a: NDBuffer[a_type, 2, origin, a_shape], b: NDBuffer[b_type, 2, origin, b_shape], runtime_config: MatmulConfig[a_type, b_type, c_type, True], ctx: DeviceContext)` --- ## multistage_mma_q `multistage_mma_q[BM: Int, BN: Int, BK: Int, WM: Int, WN: Int, num_threads: Int, num_pipeline_stages: Int, transpose_b: Bool, group_size: Int, pack_factor: Int, c_type: DType, c_layout: Layout, a_type: DType, a_layout: Layout, a_smem_layout: Layout, b_type: DType, b_layout: Layout, b_smem_layout: Layout, scales_type: DType, scales_layout: Layout, scales_smem_layout: Layout, /, *, swizzle_a: Bool = True, static_num_iters: Dim = Dim(-31337), prefetch_init: Bool = True, continue_prefetch_b: Bool = False, transpose_b_next: Bool = False, b_next_gmem_layout: Layout = Layout(), b_next_smem_layout: Layout = Layout(), next_op_b_iter_alignment: Int = alignof[::DType,__mlir_type.!kgen.target]()](c: LayoutTensor[c_type, c_layout, origin, address_space=AddressSpace(5)], a_iter_arg: LayoutTensorIter[type, a_layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], b_iter_arg: LayoutTensorIter[b_type, b_layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], a_smem_iter_arg: LayoutTensorIter[a_type, a_smem_layout, origin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], mut b_smem_iter: LayoutTensorIter[b_type, b_smem_layout, origin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], scales_smem_iter_arg: LayoutTensorIter[scales_type, scales_smem_layout, origin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], scales_iter_arg: LayoutTensorIter[scales_type, scales_layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], num_iters: Int, /, *, num_b_rows: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` --- ## multistage_qgemm_kernel `multistage_qgemm_kernel[c_type: DType, c_layout: Layout, a_type: DType, a_layout: Layout, b_packed_type: DType, b_layout: Layout, group_size: Int, pack_factor: Int, transpose_b: Bool, config: MatmulConfig[a_type, b_packed_type, c_type, transpose_b], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> 
None]({:i1 0, 1})](c: LayoutTensor[c_type, c_layout, MutableAnyOrigin], a: LayoutTensor[a_type, a_layout, MutableAnyOrigin], b_packed: LayoutTensor[b_packed_type, b_layout, MutableAnyOrigin])` --- ## pack_Q_tile `pack_Q_tile(input: SIMD[uint8, 16]) -> SIMD[uint32, 4]` --- ## q_smem_usage `q_smem_usage[: DType, : DType, : DType, : Bool, : IndexList[3], //, config: MatmulConfig[$0, $1, $2, $3, $4], group_size: Int]() -> Int` --- ## repack_GPTQ_for_sm8x `repack_GPTQ_for_sm8x[in_layout: Layout, out_layout: Layout, scales_type: DType, group_size: Int, has_perm: Bool, *, perm_layout: Layout = Layout()](in_tensor: LayoutTensor[uint8, in_layout, MutableAnyOrigin], out_tensor: LayoutTensor[uint8, out_layout, MutableAnyOrigin], perm_idx: LayoutTensor[int32, perm_layout, MutableAnyOrigin])` --- ## repack_Q4_0_for_sm8x `repack_Q4_0_for_sm8x[q_layout: Layout, repack_layout: Layout, scales_type: DType](q_weight: LayoutTensor[uint8, q_layout, MutableAnyOrigin], q_packed_weight: LayoutTensor[uint8, repack_layout, MutableAnyOrigin])` --- ## unpack_4bit_int `unpack_4bit_int(val: SIMD[uint32, size], idx: Int) -> SIMD[uint8, 1]` --- ## qmatmul_k ## Functions * [​`matmul_Q4_K`](./matmul_Q4_K): * [​`matmul_Q4_K_pack_b`](./matmul_Q4_K_pack_b): * [​`matmul_Q6_K`](./matmul_Q6_K): * [​`matmul_Q6_K_pack_b`](./matmul_Q6_K_pack_b): --- ## matmul_Q4_K `matmul_Q4_K[elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](a: NDBuffer[float32, 2, origin], b: NDBuffer[uint8, 2, origin], c: NDBuffer[float32, 2, origin])` --- ## matmul_Q4_K_pack_b `matmul_Q4_K_pack_b[b_origin: MutableOrigin, b_packed_origin: MutableOrigin](b: NDBuffer[uint8, 2, b_origin], b_packed: NDBuffer[uint8, 2, b_packed_origin])` --- ## matmul_Q6_K `matmul_Q6_K[elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](a: NDBuffer[float32, 2, origin], b: NDBuffer[uint8, 2, origin], c: NDBuffer[float32, 2, origin])` --- ## matmul_Q6_K_pack_b `matmul_Q6_K_pack_b[b_origin: MutableOrigin, b_packed_origin: MutableOrigin](b: NDBuffer[uint8, 2, b_origin], b_packed: NDBuffer[uint8, 2, b_packed_origin])` --- ## elementwise `elementwise[: origin.set, //, func: fn[Int, Int](IndexList[$1]) capturing -> None, simd_width: Int, *, use_blocking_impl: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](shape: Int)` Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed. **Parameters:** * ​func (`fn[Int, Int](IndexList[$1]) capturing -> None`): The body function. * ​simd\_width (`Int`): The SIMD vector width to use. * ​use\_blocking\_impl (`Bool`): Do not invoke the function using asynchronous calls. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. * ​\_trace\_description (`StringSlice[StaticConstantOrigin]`): Description of the trace. **Args:** * ​shape (`Int`): The shape of the buffer. 
`elementwise[: origin.set, rank: Int, //, func: fn[Int, Int](IndexList[$1]) capturing -> None, simd_width: Int, *, use_blocking_impl: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](shape: IndexList[rank, element_type=element_type])` Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed. **Parameters:** * ​rank (`Int`): The rank of the buffer. * ​func (`fn[Int, Int](IndexList[$1]) capturing -> None`): The body function. * ​simd\_width (`Int`): The SIMD vector width to use. * ​use\_blocking\_impl (`Bool`): Do not invoke the function using asynchronous calls. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. * ​\_trace\_description (`StringSlice[StaticConstantOrigin]`): Description of the trace. **Args:** * ​shape (`IndexList[rank, element_type=element_type]`): The shape of the buffer. `elementwise[: origin.set, //, func: fn[Int, Int](IndexList[$1]) capturing -> None, simd_width: Int, *, use_blocking_impl: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](shape: Int, context: DeviceContext)` Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed. **Parameters:** * ​func (`fn[Int, Int](IndexList[$1]) capturing -> None`): The body function. * ​simd\_width (`Int`): The SIMD vector width to use. * ​use\_blocking\_impl (`Bool`): Do not invoke the function using asynchronous calls. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. * ​\_trace\_description (`StringSlice[StaticConstantOrigin]`): Description of the trace. **Args:** * ​shape (`Int`): The shape of the buffer. * ​context (`DeviceContext`): The device context to use. `elementwise[: origin.set, rank: Int, //, func: fn[Int, Int](IndexList[$1]) capturing -> None, simd_width: Int, *, use_blocking_impl: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](shape: IndexList[rank, element_type=element_type], context: DeviceContext)` Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed. **Parameters:** * ​rank (`Int`): The rank of the buffer. * ​func (`fn[Int, Int](IndexList[$1]) capturing -> None`): The body function. * ​simd\_width (`Int`): The SIMD vector width to use. * ​use\_blocking\_impl (`Bool`): Do not invoke the function using asynchronous calls. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. * ​\_trace\_description (`StringSlice[StaticConstantOrigin]`): Description of the trace. **Args:** * ​shape (`IndexList[rank, element_type=element_type]`): The shape of the buffer. * ​context (`DeviceContext`): The device context to use. 
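To illustrate the call shape shared by these overloads, here is a minimal sketch. It is a hedged example, not an official snippet: `printer` is a hypothetical body function, and the `from utils import IndexList` import path is an assumption about your toolchain.

```mojo
from algorithm.functional import elementwise
from utils import IndexList

fn main():
    # The body receives the SIMD width chosen for the call and the rank of
    # the domain; here we just print each visited index.
    @parameter
    fn printer[width: Int, rank: Int](idx: IndexList[rank]):
        print(idx)

    # Cover a 2x3 domain with scalar (width 1) invocations.
    elementwise[printer, 1](IndexList[2](2, 3))
```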
`elementwise[: origin.set, rank: Int, //, func: fn[Int, Int](IndexList[$1]) capturing -> None, simd_width: Int, *, use_blocking_impl: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](shape: IndexList[rank, element_type=element_type], context: DeviceContextPtr)` Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed. **Parameters:** * ​rank (`Int`): The rank of the buffer. * ​func (`fn[Int, Int](IndexList[$1]) capturing -> None`): The body function. * ​simd\_width (`Int`): The SIMD vector width to use. * ​use\_blocking\_impl (`Bool`): Do not invoke the function using asynchronous calls. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. * ​\_trace\_description (`StringSlice[StaticConstantOrigin]`): Description of the trace. **Args:** * ​shape (`IndexList[rank, element_type=element_type]`): The shape of the buffer. * ​context (`DeviceContextPtr`): The device context to use. --- ## functional Implements higher-order functions. You can import these APIs from the `algorithm` package. For example:

```mojo
from algorithm import map
```

## Aliases ### `BinaryTile1DTileUnitFunc` `alias BinaryTile1DTileUnitFunc = fn[Int](Int, Int) capturing -> None` Signature of a tiled function that performs some work with a dynamic tile size and a secondary static tile size. ### `Dynamic1DTileUnitFunc` `alias Dynamic1DTileUnitFunc = fn(Int, Int) capturing -> None` Signature of a 1d tiled function that performs some work with a dynamic tile size and an offset. i.e. func(offset: Int, tile\_size: Int) ### `Dynamic1DTileUnswitchUnitFunc` `alias Dynamic1DTileUnswitchUnitFunc = fn[Bool](Int, Int, Int) capturing -> None` ### `Static1DTileUnitFunc` `alias Static1DTileUnitFunc = fn[Int](Int) capturing -> None` Signature of a 1d tiled function that performs some work with a static tile size and an offset. i.e. `func<tile_size: Int>(offset: Int)` ### `Static1DTileUnitFuncWithFlag` `alias Static1DTileUnitFuncWithFlag = fn[Int, Bool](Int) capturing -> None` ### `Static1DTileUnitFuncWithFlags` `alias Static1DTileUnitFuncWithFlags = fn[Int, Bool, Bool](Int) capturing -> None` ### `Static1DTileUnswitchUnitFunc` `alias Static1DTileUnswitchUnitFunc = fn[Int, Bool](Int, Int) capturing -> None` Signature of a tiled function that performs some work with a static tile size and an offset. i.e. `func<tile_size: Int>(offset: Int)` ### `Static2DTileUnitFunc` `alias Static2DTileUnitFunc = fn[Int, Int](Int, Int) capturing -> None` Signature of a 2d tiled function that performs some work with a static tile size and an offset. i.e. 
`func<tile_size_x: Int, tile_size_y: Int>(offset_x: Int, offset_y: Int)` ### `stencil` `alias stencil = _stencil_impl_cpu[__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,::DType,::DType,::DType,::Int,::Int,::IndexList[$10, $6],::Int,::DType,fn[::DType]` ### `stencil_gpu` `alias stencil_gpu = _stencil_impl_gpu[__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,::DType,::DType,::DType,::Int,::Int,::IndexList[$10, $6],::Int,::DType,fn[::DType]` ### `SwitchedFunction` `alias SwitchedFunction = fn[Bool]() raises capturing -> None` ### `SwitchedFunction2` `alias SwitchedFunction2 = fn[Bool, Bool]() capturing -> None` ## Functions * [​`elementwise`](/mojo/stdlib/algorithm/functional/elementwise): Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed. * [​`map`](/mojo/stdlib/algorithm/functional/map): Maps a function over a range from 0 to size. * [​`parallelize`](/mojo/stdlib/algorithm/functional/parallelize): Executes func(0) ... func(num\_work\_items-1) as sub-tasks in parallel, and returns when all are complete. * [​`parallelize_over_rows`](/mojo/stdlib/algorithm/functional/parallelize_over_rows): Parallelize func over non-axis dims of shape. * [​`sync_parallelize`](/mojo/stdlib/algorithm/functional/sync_parallelize): Executes func(0) ... func(num\_work\_items-1) as parallel sub-tasks, and returns when all are complete. * [​`tile`](/mojo/stdlib/algorithm/functional/tile): A generator that launches work groups in the specified list of tile sizes. * [​`tile_and_unswitch`](/mojo/stdlib/algorithm/functional/tile_and_unswitch): Performs a tile and unswitch functional transformation. * [​`tile_middle_unswitch_boundaries`](/mojo/stdlib/algorithm/functional/tile_middle_unswitch_boundaries): Divides 1d iteration space into three parts and tiles them with different steps. * [​`unswitch`](/mojo/stdlib/algorithm/functional/unswitch): Performs a functional unswitch transformation. * [​`vectorize`](/mojo/stdlib/algorithm/functional/vectorize): Simplifies SIMD optimized loops by mapping a function across a range from 0 to `size`, incrementing by `simd_width` at each step. The remainder of `size % simd_width` will run in separate iterations. --- ## map `map[origins: origin.set, //, func: fn(Int) capturing -> None](size: Int)` Maps a function over a range from 0 to size. **Parameters:** * ​origins (`origin.set`): The capture origins. * ​func (`fn(Int) capturing -> None`): Function to map. **Args:** * ​size (`Int`): The number of elements. --- ## parallelize `parallelize[origins: origin.set, //, func: fn(Int) capturing -> None](num_work_items: Int)` Executes func(0) ... func(num\_work\_items-1) as sub-tasks in parallel, and returns when all are complete. **Parameters:** * ​origins (`origin.set`): The capture origins. * ​func (`fn(Int) capturing -> None`): The function to invoke. **Args:** * ​num\_work\_items (`Int`): Number of parallel tasks. `parallelize[origins: origin.set, //, func: fn(Int) capturing -> None](num_work_items: Int, num_workers: Int)` Executes func(0) ... func(num\_work\_items-1) as sub-tasks in parallel, and returns when all are complete. **Parameters:** * ​origins (`origin.set`): The capture origins. * ​func (`fn(Int) capturing -> None`): The function to invoke. 
**Args:** * ​num\_work\_items (`Int`): Number of parallel tasks. * ​num\_workers (`Int`): The number of workers to use for execution. --- ## parallelize_over_rows `parallelize_over_rows[: origin.set, //, func: fn(Int, Int) capturing -> None](shape: IndexList[size, element_type=element_type], axis: Int, grain_size: Int)` Parallelize func over non-axis dims of shape. **Parameters:** * ​func (`fn(Int, Int) capturing -> None`): Function to call on range of rows. **Args:** * ​shape (`IndexList[size, element_type=element_type]`): Shape to parallelize over. * ​axis (`Int`): Rows are slices along the axis dimension of shape. * ​grain\_size (`Int`): The minimum number of elements to warrant using an additional thread. --- ## sync_parallelize `sync_parallelize[origins: origin.set, //, func: fn(Int) capturing -> None](num_work_items: Int)` Executes func(0) ... func(num\_work\_items-1) as parallel sub-tasks, and returns when all are complete. **Parameters:** * ​origins (`origin.set`): The capture origins. * ​func (`fn(Int) capturing -> None`): The function to invoke. **Args:** * ​num\_work\_items (`Int`): Number of parallel tasks. `sync_parallelize[origins: origin.set, //, func: fn(Int) raises capturing -> None](num_work_items: Int)` Executes func(0) ... func(num\_work\_items-1) as parallel sub-tasks, and returns when all are complete. TODO: Currently exceptions raised by func will cause a trap rather than be propagated back to the caller. **Parameters:** * ​origins (`origin.set`): The capture origins. * ​func (`fn(Int) raises capturing -> None`): The function to invoke. **Args:** * ​num\_work\_items (`Int`): Number of parallel tasks. --- ## tile `tile[: origin.set, //, workgroup_function: fn[Int](Int) capturing -> None, tile_size_list: VariadicList[Int]](offset: Int, upperbound: Int)` A generator that launches work groups in the specified list of tile sizes. A workgroup function is a function that can process a configurable consecutive "tile" of workload. E.g. `work_on[3](5)` should launch computation on items 5, 6, 7, and should be semantically equivalent to `work_on[1](5)`, `work_on[1](6)`, `work_on[1](7)`. This generator will try to proceed with the given list of tile sizes in the listed order. E.g. `tile[func, (3,2,1)](offset, upperbound)` will try to call `func[3]` starting from offset until remaining work is less than 3 from upperbound and then try `func[2]`, and then `func[1]`, etc., as illustrated in the sketch below. **Parameters:** * ​workgroup\_function (`fn[Int](Int) capturing -> None`): Workgroup function that processes one tile of workload. * ​tile\_size\_list (`VariadicList[Int]`): List of tile sizes to launch work. **Args:** * ​offset (`Int`): The initial index to start the work from. * ​upperbound (`Int`): The runtime upperbound that the work function should not exceed. `tile[: origin.set, //, workgroup_function: fn(Int, Int) capturing -> None](offset: Int, upperbound: Int, tile_size_list: VariadicList[Int])` A generator that launches work groups in the specified list of tile sizes. This is the version of the tile generator for the case where the workgroup function can take the tile size as a runtime value. **Parameters:** * ​workgroup\_function (`fn(Int, Int) capturing -> None`): Workgroup function that processes one tile of workload. **Args:** * ​offset (`Int`): The initial index to start the work from. * ​upperbound (`Int`): The runtime upperbound that the work function should not exceed. * ​tile\_size\_list (`VariadicList[Int]`): List of tile sizes to launch work. 
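For instance, here is a minimal sketch of the static-tile-size form described above. `work` is a hypothetical workgroup function, and passing `VariadicList[Int](...)` as the tile-size parameter is an assumption based on the signature shown:

```mojo
from algorithm.functional import tile

fn main():
    # Each invocation handles `tile_size` consecutive items starting at `offset`.
    @parameter
    fn work[tile_size: Int](offset: Int):
        print("processing tile of", tile_size, "at offset", offset)

    # Greedily covers [0, 10): tiles of 4 at offsets 0 and 4, then a tile of 2 at 8.
    tile[work, VariadicList[Int](4, 2, 1)](0, 10)
```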
`tile[: origin.set, //, secondary_tile_size_list: VariadicList[Int], secondary_cleanup_tile: Int, workgroup_function: fn[Int](Int, Int) capturing -> None](offset: Int, upperbound: Int, primary_tile_size_list: VariadicList[Int], primary_cleanup_tile: Int)` A generator that launches work groups in the specified list of tile sizes until the sum of primary\_tile\_sizes has exceeded the upperbound. **Parameters:** * ​secondary\_tile\_size\_list (`VariadicList[Int]`): List of static tile sizes to launch work. * ​secondary\_cleanup\_tile (`Int`): Last static tile to use when primary tile sizes don't fit exactly within the upperbound. * ​workgroup\_function (`fn[Int](Int, Int) capturing -> None`): Workgroup function that processes one tile of workload. **Args:** * ​offset (`Int`): The initial index to start the work from. * ​upperbound (`Int`): The runtime upperbound that the work function should not exceed. * ​primary\_tile\_size\_list (`VariadicList[Int]`): List of dynamic tile sizes to launch work. * ​primary\_cleanup\_tile (`Int`): Last dynamic tile to use when primary tile sizes don't fit exactly within the upperbound. `tile[: origin.set, //, workgroup_function: fn[Int, Int](Int, Int) capturing -> None, tile_sizes_x: VariadicList[Int], tile_sizes_y: VariadicList[Int]](offset_x: Int, offset_y: Int, upperbound_x: Int, upperbound_y: Int)` Launches workgroup\_function using the largest tile sizes possible in each dimension, starting from the x and y offset, until the x and y upperbounds are reached. **Parameters:** * ​workgroup\_function (`fn[Int, Int](Int, Int) capturing -> None`): Function that is invoked for each tile and offset. * ​tile\_sizes\_x (`VariadicList[Int]`): List of tile sizes to use for the first parameter of workgroup\_function. * ​tile\_sizes\_y (`VariadicList[Int]`): List of tile sizes to use for the second parameter of workgroup\_function. **Args:** * ​offset\_x (`Int`): Initial x offset passed to workgroup\_function. * ​offset\_y (`Int`): Initial y offset passed to workgroup\_function. * ​upperbound\_x (`Int`): Max offset in x dimension passed to workgroup function. * ​upperbound\_y (`Int`): Max offset in y dimension passed to workgroup function. --- ## tile_and_unswitch `tile_and_unswitch[: origin.set, //, workgroup_function: fn[Int, Bool](Int, Int) capturing -> None, tile_size_list: VariadicList[Int]](offset: Int, upperbound: Int)` Performs a tile and unswitch functional transformation. A variant of static tile given a workgroup function that can be unswitched. This generator is a fused version of tile and unswitch, where the static unswitch is true throughout the "inner" portion of the workload and is false only on the residue tile. **Parameters:** * ​workgroup\_function (`fn[Int, Bool](Int, Int) capturing -> None`): Workgroup function that processes one tile of workload. * ​tile\_size\_list (`VariadicList[Int]`): List of tile sizes to launch work. **Args:** * ​offset (`Int`): The initial index to start the work from. * ​upperbound (`Int`): The runtime upperbound that the work function should not exceed. `tile_and_unswitch[: origin.set, //, workgroup_function: fn[Bool](Int, Int, Int) capturing -> None](offset: Int, upperbound: Int, tile_size_list: VariadicList[Int])` Performs a tile and unswitch functional transformation. A variant of dynamic tile given a workgroup function that can be unswitched. This generator is a fused version of tile and unswitch, where the static unswitch is true throughout the "inner" portion of the workload and is false only on the residue tile. 
**Parameters:** * ​workgroup\_function (`fn[Bool](Int, Int, Int) capturing -> None`): Workgroup function that processes one tile of workload. **Args:** * ​offset (`Int`): The initial index to start the work from. * ​upperbound (`Int`): The runtime upperbound that the work function should not exceed. * ​tile\_size\_list (`VariadicList[Int]`): List of tile sizes to launch work. --- ## tile_middle_unswitch_boundaries `tile_middle_unswitch_boundaries[: origin.set, //, work_fn: fn[Int, Bool](Int) capturing -> None, middle_tile_sizes: VariadicList[Int], left_tile_size: Int = 1, right_tile_size: Int = 1](left_boundary_start: Int, left_boundary_end: Int, right_boundary_start: Int, right_boundary_end: Int)` Divides 1d iteration space into three parts and tiles them with different steps. The 1d iteration space is divided into: 1\. \[left\_boundary\_start, left\_boundary\_end), affected by the left boundary. 2\. \[left\_boundary\_end, right\_boundary\_start), not affected by any boundary. 3\. \[right\_boundary\_start, right\_boundary\_end), affected by the right boundary. work\_fn's switch is true for the left and right boundaries, implying boundary conditions like padding in convolution. The middle part is tiled with static tile sizes with the switch as false. `middle_tile_sizes` should be in descending order for optimal performance. (A larger tile size appearing later in the list would fail the while-loop.) **Parameters:** * ​work\_fn (`fn[Int, Bool](Int) capturing -> None`): Work function that processes one tile of workload. * ​middle\_tile\_sizes (`VariadicList[Int]`): List of tile sizes for the middle part. * ​left\_tile\_size (`Int`): Tile size for the left boundary region. * ​right\_tile\_size (`Int`): Tile size for the right boundary region. **Args:** * ​left\_boundary\_start (`Int`): Start index of the left boundary. * ​left\_boundary\_end (`Int`): End index of the left boundary. * ​right\_boundary\_start (`Int`): Start index of the right boundary. * ​right\_boundary\_end (`Int`): End index of the right boundary. `tile_middle_unswitch_boundaries[: origin.set, //, work_fn: fn[Int, Bool, Bool](Int) capturing -> None, tile_size: Int, size: Int]()` Tile 1d iteration space with boundary conditions at both ends. This generator is primarily for convolution with static shapes. `work_fn`'s flags hint the function to handle padding at the boundary. The size is the static output row size, i.e., the WO dimension. **Parameters:** * ​work\_fn (`fn[Int, Bool, Bool](Int) capturing -> None`): Work function that updates one tile. It has two flags for left and right boundaries, respectively. * ​tile\_size (`Int`): 1D Tile size. * ​size (`Int`): Iteration range is \[0, size). --- ## unswitch `unswitch[: origin.set, //, switched_func: fn[Bool]() raises capturing -> None](dynamic_switch: Bool)` Performs a functional unswitch transformation. Unswitch is a simple pattern that is similar in idea to the loop unswitching pass but extended to functional patterns. The pattern facilitates the following code transformation that reduces the number of branches in the generated code (pseudocode, with a loop-invariant condition):

```
Before:
    for i in range(...):
        if dynamic_switch:
            fast_path()
        else:
            slow_path()

After:
    if dynamic_switch:
        for i in range(...):
            fast_path()
    else:
        for i in range(...):
            slow_path()
```

**Parameters:** * ​switched\_func (`fn[Bool]() raises capturing -> None`): The function containing the inner loop logic that can be unswitched. **Args:** * ​dynamic\_switch (`Bool`): The dynamic condition that enables the unswitched code path. `unswitch[: origin.set, //, switched_func: fn[Bool]() capturing -> None](dynamic_switch: Bool)` Performs a functional unswitch transformation. 
Unswitch is a simple pattern that is similar in idea to the loop unswitching pass but extended to functional patterns. The pattern facilitates the same code transformation shown for the overload above, reducing the number of branches in the generated code. **Parameters:** * ​switched\_func (`fn[Bool]() capturing -> None`): The function containing the inner loop logic that can be unswitched. **Args:** * ​dynamic\_switch (`Bool`): The dynamic condition that enables the unswitched code path. `unswitch[: origin.set, //, switched_func: fn[Bool, Bool]() capturing -> None](dynamic_switch_a: Bool, dynamic_switch_b: Bool)` Performs a functional 2-predicates unswitch transformation. **Parameters:** * ​switched\_func (`fn[Bool, Bool]() capturing -> None`): The function containing the inner loop logic that has 2 predicates which can be unswitched. **Args:** * ​dynamic\_switch\_a (`Bool`): The first dynamic condition that enables the outer unswitched code path. * ​dynamic\_switch\_b (`Bool`): The second dynamic condition that enables the inner unswitched code path. --- ## vectorize `vectorize[origins: origin.set, //, func: fn[Int](Int) capturing -> None, simd_width: Int, /, *, unroll_factor: Int = 1](size: Int)` Simplifies SIMD optimized loops by mapping a function across a range from 0 to `size`, incrementing by `simd_width` at each step. The remainder of `size % simd_width` will run in separate iterations. The below example demonstrates how you could improve the performance of a loop, by setting multiple values at the same time using SIMD registers on the machine:

```mojo
from algorithm.functional import vectorize
from memory import UnsafePointer
from sys import simdwidthof

# The amount of elements to loop through
alias size = 10
# How many DType.int32 elements fit into the SIMD register (4 on 128bit)
alias simd_width = simdwidthof[DType.int32]()  # assumed to be 4 in this example

fn main():
    var p = UnsafePointer[Int32].alloc(size)

    # @parameter allows the closure to capture the `p` pointer
    @parameter
    fn closure[width: Int](i: Int):
        print("storing", width, "els at pos", i)
        p.store[width=width](i, i)

    vectorize[closure, simd_width](size)
    print(p.load[width=simd_width]())
    print(p.load[width=simd_width](simd_width))
```

On a machine with a SIMD register size of 128, this will set 4xInt32 values on each iteration. The remainder of 10 % 4 is 2, so those last two elements will be set in two separate iterations:

```plaintext
storing 4 els at pos 0
storing 4 els at pos 4
storing 1 els at pos 8
storing 1 els at pos 9
[0, 0, 0, 0, 4, 4, 4, 4, 8, 9]
```

You can also unroll the loop to potentially improve performance at the cost of binary size:

```
vectorize[closure, simd_width, unroll_factor=2](size)
```

In the generated assembly the function calls will be repeated, resulting in fewer arithmetic, comparison, and conditional jump operations. The assembly would look like this in pseudocode:

```
closure[4](0)
closure[4](4)
# Remainder loop won't unroll unless `size` is passed as a parameter
for i in range(8, 10):
    closure[1](i)
```

You can pass `size` as a parameter if it's known at compile time to reduce the iterations for the remainder. This only occurs if the remainder is a power of 2 (2, 4, 8, 16, ...). The remainder loop will still unroll for performance improvements if not a power of 2. **Parameters:** * ​origins (`origin.set`): The capture origins. * ​func (`fn[Int](Int) capturing -> None`): The function that will be called in the loop body. * ​simd\_width (`Int`): The SIMD vector width. 
* ​unroll\_factor (`Int`): The unroll factor for the main loop (Default 1). **Args:** * ​size (`Int`): The upper limit for the loop. `vectorize[origins: origin.set, //, func: fn[Int](Int) capturing -> None, simd_width: Int, /, *, size: Int, unroll_factor: Int = size if is_nvidia_gpu() else 1]()` Simplifies SIMD optimized loops by mapping a function across a range from 0 to `size`, incrementing by `simd_width` at each step. The remainder of `size % simd_width` will run in a single iteration if it's a power of 2. The below example demonstrates how you could improve the performance of a loop, by setting multiple values at the same time using SIMD registers on the machine:

```mojo
from algorithm.functional import vectorize
from memory import UnsafePointer
from sys import simdwidthof

# The amount of elements to loop through
alias size = 10
# How many DType.int32 elements fit into the SIMD register (4 on 128bit)
alias simd_width = simdwidthof[DType.int32]()  # assumed to be 4 in this example

fn main():
    var p = UnsafePointer[Int32].alloc(size)

    # @parameter allows the closure to capture the `p` pointer
    @parameter
    fn closure[width: Int](i: Int):
        print("storing", width, "els at pos", i)
        p.store[width=width](i, i)

    vectorize[closure, simd_width](size)
    print(p.load[width=simd_width]())
    print(p.load[width=simd_width](simd_width))
```

On a machine with a SIMD register size of 128, this will set 4xInt32 values on each iteration. The remainder of 10 % 4 is 2, so those last two elements will be set in a single iteration:

```plaintext
storing 4 els at pos 0
storing 4 els at pos 4
storing 2 els at pos 8
[0, 0, 0, 0, 4, 4, 4, 4, 8, 8]
```

If the remainder is not a power of 2 (2, 4, 8, 16, ...) there will be a separate iteration for each element. However, passing `size` as a parameter also allows the loop for the remaining elements to be unrolled. You can also unroll the main loop to potentially improve performance at the cost of binary size:

```
vectorize[closure, simd_width, size=size, unroll_factor=2]()
```

In the generated assembly the function calls will be repeated, resulting in fewer arithmetic, comparison, and conditional jump operations. The assembly would look like this in pseudocode:

```
closure[4](0)
closure[4](4)
closure[2](8)
```

**Parameters:** * ​origins (`origin.set`): The capture origins. * ​func (`fn[Int](Int) capturing -> None`): The function that will be called in the loop body. * ​simd\_width (`Int`): The SIMD vector width. * ​size (`Int`): The upper limit for the loop. * ​unroll\_factor (`Int`): The unroll factor for the main loop (defaults to `size` on NVIDIA GPUs, otherwise 1). --- ## algorithm Implements the algorithm package. ## Modules * [​`functional`](/mojo/stdlib/algorithm/functional/): Implements higher-order functions. * [​`memory`](/mojo/stdlib/algorithm/memory/): Implements `parallel_memcpy`. * [​`reduction`](/mojo/stdlib/algorithm/reduction/): Implements SIMD reductions. --- ## memory Implements `parallel_memcpy`. You can import these APIs from the `algorithm` package. For example:

```mojo
from algorithm import parallel_memcpy
```

## Functions * [​`parallel_memcpy`](/mojo/stdlib/algorithm/memory/parallel_memcpy): Copies `count` elements from a memory buffer `src` to `dest` in parallel by spawning `num_tasks` tasks each copying `count_per_task` elements. 
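A minimal usage sketch for the copy helper summarized above, ahead of its detailed entry below. The buffer size is arbitrary, and letting the 3-argument overload pick its own task split is the simple path:

```mojo
from algorithm import parallel_memcpy
from memory import UnsafePointer

fn main():
    alias count = 1024
    var src = UnsafePointer[Float32].alloc(count)
    var dst = UnsafePointer[Float32].alloc(count)
    for i in range(count):
        src[i] = Float32(i)

    # Copy all elements from src to dst across parallel tasks.
    parallel_memcpy(dst, src, count)

    print(dst[0], dst[count - 1])  # 0.0 1023.0
    src.free()
    dst.free()
```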
--- ## parallel_memcpy `parallel_memcpy[type: DType](dest: UnsafePointer[SIMD[type, 1]], src: UnsafePointer[SIMD[type, 1]], count: Int, count_per_task: Int, num_tasks: Int)` Copies `count` elements from a memory buffer `src` to `dest` in parallel by spawning `num_tasks` tasks each copying `count_per_task` elements. **Parameters:** * ​type (`DType`): The element dtype. **Args:** * ​dest (`UnsafePointer[SIMD[type, 1]]`): The destination buffer. * ​src (`UnsafePointer[SIMD[type, 1]]`): The source buffer. * ​count (`Int`): Number of elements in the buffer. * ​count\_per\_task (`Int`): Task size. * ​num\_tasks (`Int`): Number of tasks to run in parallel. `parallel_memcpy[type: DType](dest: UnsafePointer[SIMD[type, 1]], src: UnsafePointer[SIMD[type, 1]], count: Int)` Copies `count` elements from a memory buffer `src` to `dest` in parallel. **Parameters:** * ​type (`DType`): The element type. **Args:** * ​dest (`UnsafePointer[SIMD[type, 1]]`): The destination pointer. * ​src (`UnsafePointer[SIMD[type, 1]]`): The source pointer. * ​count (`Int`): The number of elements to copy. --- ## all_true `all_true(src: NDBuffer[type, 1, origin]) -> Bool` Returns True if all the elements in a buffer are True and False otherwise. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** True if all of the elements of the buffer are True and False otherwise. --- ## any_true `any_true(src: NDBuffer[type, 1, origin]) -> Bool` Returns True if any of the elements in a buffer are True and False otherwise. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** True if any of the elements of the buffer are True and False otherwise. --- ## cumsum `cumsum(dst: NDBuffer[type, 1, origin], src: NDBuffer[type, 1, origin, shape, strides])` Computes the cumulative sum of all elements in a buffer. dst\[i] = src\[i] + src\[i-1] + ... + src\[0]. **Args:** * ​dst (`NDBuffer[type, 1, origin]`): The buffer that stores the result of cumulative sum operation. * ​src (`NDBuffer[type, 1, origin, shape, strides]`): The buffer of elements for which the cumulative sum is computed. --- ## reduction Implements SIMD reductions. You can import these APIs from the `algorithm` package. For example:

```mojo
from algorithm import map_reduce
```

## Functions * [​`all_true`](/mojo/stdlib/algorithm/reduction/all_true): Returns True if all the elements in a buffer are True and False otherwise. * [​`any_true`](/mojo/stdlib/algorithm/reduction/any_true): Returns True if any of the elements in a buffer are True and False otherwise. * [​`cumsum`](/mojo/stdlib/algorithm/reduction/cumsum): Computes the cumulative sum of all elements in a buffer. dst\[i] = src\[i] + src\[i-1] + ... + src\[0]. * [​`map_reduce`](/mojo/stdlib/algorithm/reduction/map_reduce): Stores the result of calling input\_gen\_fn in dst and simultaneously reduces the result using a custom reduction function. * [​`max`](/mojo/stdlib/algorithm/reduction/max): Computes the max element in a buffer. * [​`mean`](/mojo/stdlib/algorithm/reduction/mean): Computes the mean value of the elements in a buffer. * [​`min`](/mojo/stdlib/algorithm/reduction/min): Computes the min element in a buffer. * [​`none_true`](/mojo/stdlib/algorithm/reduction/none_true): Returns True if none of the elements in a buffer are True and False otherwise. * [​`product`](/mojo/stdlib/algorithm/reduction/product): Computes the product of the buffer elements. * [​`reduce`](/mojo/stdlib/algorithm/reduction/reduce): Computes a custom reduction of buffer elements. 
* [​`reduce_boolean`](/mojo/stdlib/algorithm/reduction/reduce_boolean): Computes a bool reduction of buffer elements. The reduction will early exit if the `continue_fn` returns False. * [​`sum`](/mojo/stdlib/algorithm/reduction/sum): Computes the sum of buffer elements. * [​`variance`](/mojo/stdlib/algorithm/reduction/variance): Given a mean, computes the variance of elements in a buffer. --- ## map_reduce `map_reduce[simd_width: Int, size: Dim, type: DType, acc_type: DType, origins_gen: origin.set, input_gen_fn: fn[DType, Int](Int) capturing -> SIMD[$0, $1], origins_vec: origin.set, reduce_vec_to_vec_fn: fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2], reduce_vec_to_scalar_fn: fn[DType, Int](SIMD[$0, $1]) -> SIMD[$0, 1]](dst: NDBuffer[type, 1, origin, __init__[::Intable](size)], init: SIMD[acc_type, 1]) -> SIMD[acc_type, 1]` Stores the result of calling input\_gen\_fn in dst and simultaneously reduces the result using a custom reduction function. **Parameters:** * ​simd\_width (`Int`): The vector width for the computation. * ​size (`Dim`): The buffer size. * ​type (`DType`): The buffer elements dtype. * ​acc\_type (`DType`): The dtype of the reduction accumulator. * ​origins\_gen (`origin.set`): The OriginSet of captured arguments by the input\_gen\_fn. * ​input\_gen\_fn (`fn[DType, Int](Int) capturing -> SIMD[$0, $1]`): A function that generates inputs to reduce. * ​origins\_vec (`origin.set`): The OriginSet of captured arguments by the reduce\_vec\_to\_vec\_fn. * ​reduce\_vec\_to\_vec\_fn (`fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2]`): A mapping function. This function is used to combine (accumulate) two chunks of input data: e.g. we load two `8xfloat32` vectors of elements and need to reduce them into a single `8xfloat32` vector. * ​reduce\_vec\_to\_scalar\_fn (`fn[DType, Int](SIMD[$0, $1]) -> SIMD[$0, 1]`): A reduction function. This function is used to reduce a vector to a scalar. E.g. when we have an `8xfloat32` vector and want to reduce it to a `float32` scalar. **Args:** * ​dst (`NDBuffer[type, 1, origin, __init__[::Intable](size)]`): The output buffer. * ​init (`SIMD[acc_type, 1]`): The initial value to use in the accumulator. **Returns:** The computed reduction value. --- ## max `max(src: NDBuffer[type, 1, origin]) -> SIMD[type, 1]` Computes the max element in a buffer. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** The maximum of the buffer elements. `max[reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape])` Computes the max across reduce\_axis of an NDBuffer. **Parameters:** * ​reduce\_axis (`Int`): The axis to reduce across. **Args:** * ​src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * ​dst (`NDBuffer[type, rank, origin, shape]`): The output buffer. `max[: origin.set, : origin.set, //, type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, /, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[size], reduce_dim: Int, context: DeviceContextPtr = DeviceContextPtr())` Computes the max across the input and output shape. 
This performs the max computation on the domain specified by `input_shape`, loading the inputs using the `input_fn`. The results are stored using the `output_fn`. **Parameters:** * ​type (`DType`): The type of the input and output. * ​input\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): The function to load the input. * ​output\_fn (`fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None`): The function to store the output. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * ​input\_shape (`IndexList[size]`): The input shape. * ​reduce\_dim (`Int`): The axis to perform the max on. * ​context (`DeviceContextPtr`): The pointer to DeviceContext. --- ## mean `mean(src: NDBuffer[type, 1, origin]) -> SIMD[type, 1]` Computes the mean value of the elements in a buffer. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer of elements for which the mean is computed. **Returns:** The mean value of the elements in the given buffer. `mean[reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape])` Computes the mean across reduce\_axis of an NDBuffer. **Parameters:** * ​reduce\_axis (`Int`): The axis to reduce across. **Args:** * ​src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * ​dst (`NDBuffer[type, rank, origin, shape]`): The output buffer. `mean[: origin.set, : origin.set, //, type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, /, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[size], reduce_dim: Int, output_shape: IndexList[size], context: DeviceContextPtr = DeviceContextPtr())` Computes the mean across the input and output shape. This performs the mean computation on the domain specified by `input_shape`, loading the inputs using the `input_fn`. The results' domain is `output_shape` which are stored using the `output_fn`. **Parameters:** * ​type (`DType`): The type of the input and output. * ​input\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): The function to load the input. * ​output\_fn (`fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None`): The function to store the output. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * ​input\_shape (`IndexList[size]`): The input shape. * ​reduce\_dim (`Int`): The axis to perform the mean on. * ​output\_shape (`IndexList[size]`): The output shape. * ​context (`DeviceContextPtr`): The pointer to DeviceContext. --- ## min `min(src: NDBuffer[type, 1, origin]) -> SIMD[type, 1]` Computes the min element in a buffer. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** The minimum of the buffer elements. `min[reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape])` Computes the min across reduce\_axis of an NDBuffer. 
**Parameters:** * ​reduce\_axis (`Int`): The axis to reduce across. **Args:** * ​src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * ​dst (`NDBuffer[type, rank, origin, shape]`): The output buffer. `min[: origin.set, : origin.set, //, type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, /, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[size], reduce_dim: Int, context: DeviceContextPtr = DeviceContextPtr())` Computes the min across the input and output shape. This performs the min computation on the domain specified by `input_shape`, loading the inputs using the `input_fn`. The results are stored using the `output_fn`. **Parameters:** * ​type (`DType`): The type of the input and output. * ​input\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): The function to load the input. * ​output\_fn (`fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None`): The function to store the output. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * ​input\_shape (`IndexList[size]`): The input shape. * ​reduce\_dim (`Int`): The axis to perform the min on. * ​context (`DeviceContextPtr`): The pointer to DeviceContext. --- ## none_true `none_true(src: NDBuffer[type, 1, origin]) -> Bool` Returns True if none of the elements in a buffer are True and False otherwise. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** True if none of the elements of the buffer are True and False otherwise. --- ## product `product(src: NDBuffer[type, 1, origin]) -> SIMD[type, 1]` Computes the product of the buffer elements. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** The product of the buffer elements. `product[reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape])` Computes the product across reduce\_axis of an NDBuffer. **Parameters:** * ​reduce\_axis (`Int`): The axis to reduce across. **Args:** * ​src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * ​dst (`NDBuffer[type, rank, origin, shape]`): The output buffer. `product[: origin.set, : origin.set, //, type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, /, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[size], reduce_dim: Int, context: DeviceContextPtr = DeviceContextPtr())` Computes the product across the input and output shape. This performs the product computation on the domain specified by `input_shape`, loading the inputs using the `input_fn`. The results are stored using the `output_fn`. **Parameters:** * ​type (`DType`): The type of the input and output. * ​input\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): The function to load the input. 
* output\_fn (`fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None`): The function to store the output. * single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * input\_shape (`IndexList[size]`): The input shape. * reduce\_dim (`Int`): The axis to perform the product on. * context (`DeviceContextPtr`): The pointer to DeviceContext. --- ## reduce `reduce[: origin.set, //, reduce_fn: fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2]](src: NDBuffer[type, 1, origin], init: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Computes a custom reduction of buffer elements. **Parameters:** * reduce\_fn (`fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2]`): The lambda implementing the reduction. **Args:** * src (`NDBuffer[type, 1, origin]`): The input buffer. * init (`SIMD[dtype, 1]`): The initial value to use in the accumulator. **Returns:** The computed reduction value. `reduce[: origin.set, //, map_fn: fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2], reduce_fn: fn[DType, Int](SIMD[$0, $1]) -> SIMD[$0, 1], reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], init: SIMD[type, 1])` Performs a reduction across reduce\_axis of an NDBuffer (src) and stores the result in an NDBuffer (dst). First src is reshaped into a 3D tensor. Without loss of generality, the three axes will be referred to as \[H,W,C], where the axis to reduce across is W, the axes before the reduce axis are packed into H, and the axes after the reduce axis are packed into C. I.e., a tensor with dims \[D1, D2, ..., Di, ..., Dn] reducing across axis i gets packed into a 3D tensor with dims \[H, W, C], where H=prod(D1,...,Di-1), W = Di, and C = prod(Di+1,...,Dn). **Parameters:** * map\_fn (`fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2]`): A mapping function. This function is used to combine (accumulate) two chunks of input data: e.g., we load two 8xfloat32 vectors of elements and need to reduce them to a single 8xfloat32 vector. * reduce\_fn (`fn[DType, Int](SIMD[$0, $1]) -> SIMD[$0, 1]`): A reduction function. This function is used to reduce a vector to a scalar, e.g., reducing an 8xfloat32 vector to a 1xfloat32 scalar. * reduce\_axis (`Int`): The axis to reduce across. **Args:** * src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * dst (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The output buffer. * init (`SIMD[type, 1]`): The initial value to use in the accumulator. --- ## reduce_boolean `reduce_boolean[: origin.set, : origin.set, //, reduce_fn: fn[DType, Int](SIMD[$0, $1]) capturing -> Bool, continue_fn: fn(Bool) capturing -> Bool](src: NDBuffer[type, 1, origin], init: Bool) -> Bool` Computes a bool reduction of buffer elements. The reduction will early exit if the `continue_fn` returns False. **Parameters:** * reduce\_fn (`fn[DType, Int](SIMD[$0, $1]) capturing -> Bool`): A boolean reduction function. This function is used to reduce a vector to a scalar, e.g., reducing an `8xfloat32` vector to a `bool`. * continue\_fn (`fn(Bool) capturing -> Bool`): A function to indicate whether we want to continue processing the rest of the iterations. This takes the result of the reduce\_fn and returns True to continue processing and False to early exit. **Args:** * src (`NDBuffer[type, 1, origin]`): The input buffer. * init (`Bool`): The initial value to use. **Returns:** The computed reduction value. --- ## sum `sum(src: NDBuffer[type, 1, origin]) -> SIMD[type, 1]` Computes the sum of buffer elements. **Args:** * src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** The sum of the buffer elements. `sum[reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape])` Computes the sum across reduce\_axis of an NDBuffer. **Parameters:** * reduce\_axis (`Int`): The axis to reduce across. **Args:** * src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * dst (`NDBuffer[type, rank, origin, shape]`): The output buffer. `sum[: origin.set, : origin.set, //, type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, /, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[size], reduce_dim: Int, context: DeviceContextPtr = DeviceContextPtr())` Computes the sum across the input shape. This performs the sum computation on the domain specified by `input_shape`, loading the inputs using the `input_fn`. The results are stored using the `output_fn`. **Parameters:** * type (`DType`): The type of the input and output. * input\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): The function to load the input. * output\_fn (`fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None`): The function to store the output. * single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * input\_shape (`IndexList[size]`): The input shape. * reduce\_dim (`Int`): The axis to perform the sum on. * context (`DeviceContextPtr`): The pointer to DeviceContext. --- ## variance `variance(src: NDBuffer[type, 1, origin], mean_value: SIMD[type, 1], correction: Int = 1) -> SIMD[type, 1]` Given a mean, computes the variance of elements in a buffer. The mean value is used to avoid a second pass over the data:

```
variance(x) = sum((x - E(x))^2) / (size - correction)
```

**Args:** * src (`NDBuffer[type, 1, origin]`): The buffer. * mean\_value (`SIMD[type, 1]`): The mean value of the buffer. * correction (`Int`): Normalize variance by size - correction. **Returns:** The variance value of the elements in a buffer. `variance(src: NDBuffer[type, 1, origin], correction: Int = 1) -> SIMD[type, 1]` Computes the variance value of the elements in a buffer.

```
variance(x) = sum((x - E(x))^2) / (size - correction)
```

**Args:** * src (`NDBuffer[type, 1, origin]`): The buffer. * correction (`Int`): Normalize variance by size - correction (Default=1). **Returns:** The variance value of the elements in a buffer.
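As a quick illustration of these reductions, here is a minimal sketch. The import paths (`algorithm.reduction`, `buffer`) and the exact `NDBuffer` constructor form are assumptions that may vary across Mojo versions:

```mojo
from algorithm.reduction import mean, sum, variance
from buffer import NDBuffer

fn main():
    alias n = 8
    var ptr = UnsafePointer[Float32].alloc(n)
    for i in range(n):
        ptr[i] = Float32(i)  # 0.0, 1.0, ..., 7.0

    # A rank-1 view over the allocation; NDBuffer does not take ownership.
    var buf = NDBuffer[DType.float32, 1](ptr, n)

    print(sum(buf))       # 28.0
    print(mean(buf))      # 3.5
    print(variance(buf))  # 6.0 with the default correction of 1

    ptr.free()
```

With the default `correction` of 1 this is the sample variance: the squared deviations from the mean (3.5) sum to 42.0, divided by `size - 1 = 7` gives 6.0.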
--- ## b16decode `b16decode(str: StringSlice[origin]) -> String` Performs base16 decoding on the input string. **Args:** * str (`StringSlice[origin]`): A base16 encoded string. **Returns:** The decoded string. --- ## b16encode `b16encode(str: StringSlice[origin]) -> String` Performs base16 encoding on the input string slice. **Args:** * str (`StringSlice[origin]`): The input string slice. **Returns:** Base16 encoding of the input string. --- ## b64decode `b64decode[*, validate: Bool = False](str: StringSlice[origin]) -> String` Performs base64 decoding on the input string. **Parameters:** * validate (`Bool`): If true, the function will validate the input string. **Args:** * str (`StringSlice[origin]`): A base64 encoded string. **Returns:** The decoded string. --- ## b64encode `b64encode(input_bytes: Span[SIMD[uint8, 1], origin], mut result: String)` Performs base64 encoding on the input string. Notes: This method reserves the necessary capacity. `result` can be a 0 capacity string. **Args:** * input\_bytes (`Span[SIMD[uint8, 1], origin]`): The input string buffer. * result (`String`): The string in which to store the values. `b64encode(input_string: StringSlice[origin]) -> String` Performs base64 encoding on the input string. **Args:** * input\_string (`StringSlice[origin]`): The input string buffer. **Returns:** The ASCII base64 encoded string. `b64encode(input_bytes: Span[SIMD[uint8, 1], origin]) -> String` Performs base64 encoding on the input string. **Args:** * input\_bytes (`Span[SIMD[uint8, 1], origin]`): The input string buffer. **Returns:** The ASCII base64 encoded string. --- ## base64 Provides functions for base64 encoding strings. You can import these APIs from the `base64` package. For example:

```mojo
from base64 import b64encode
```

## Functions * [`b16decode`](/mojo/stdlib/base64/base64/b16decode): Performs base16 decoding on the input string. * [`b16encode`](/mojo/stdlib/base64/base64/b16encode): Performs base16 encoding on the input string slice. * [`b64decode`](/mojo/stdlib/base64/base64/b64decode): Performs base64 decoding on the input string. * [`b64encode`](/mojo/stdlib/base64/base64/b64encode): Performs base64 encoding on the input string. --- ## base64 Implements the base64 package. ## Modules * [`base64`](/mojo/stdlib/base64/base64/): Provides functions for base64 encoding strings.
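Putting the encode and decode halves together, a minimal round-trip sketch (this assumes the returned `String` converts implicitly to `StringSlice` when passed back to `b64decode`; expected output is shown in comments):

```mojo
from base64 import b64decode, b64encode

fn main():
    var encoded = b64encode("Hello, Mojo!")
    print(encoded)  # SGVsbG8sIE1vam8h
    var decoded = b64decode(encoded)
    print(decoded)  # Hello, Mojo!
```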
--- ## Bench `struct Bench` Constructs a Benchmark object, used for running multiple benchmarks and comparing the results. Example:

```mojo
from benchmark import (
    Bench,
    BenchConfig,
    Bencher,
    BenchId,
    ThroughputMeasure,
    BenchMetric,
    Format,
)
from utils import IndexList
from gpu.host import DeviceContext
from pathlib import Path

fn example_kernel():
    print("example_kernel")

var shape = IndexList[2](1024, 1024)
var bench = Bench(BenchConfig(max_iters=100))

@parameter
@always_inline
fn example(mut b: Bencher, shape: IndexList[2]) capturing raises:
    @parameter
    @always_inline
    fn kernel_launch(ctx: DeviceContext) raises:
        ctx.enqueue_function[example_kernel](
            grid_dim=shape[0], block_dim=shape[1]
        )

    var bench_ctx = DeviceContext()
    b.iter_custom[kernel_launch](bench_ctx)

bench.bench_with_input[IndexList[2], example](
    BenchId("top_k_custom", "gpu"),
    shape,
    ThroughputMeasure(
        BenchMetric.elements, shape.flattened_length()
    ),
    ThroughputMeasure(
        BenchMetric.flops, shape.flattened_length() * 3  # number of ops
    ),
)
# Add more benchmarks like above to compare results

# Pretty print in table format
print(bench)

# Dump report to csv file
bench.config.out_file = Path("out.csv")
bench.dump_report()

# Print in tabular csv format
bench.config.format = Format.tabular
print(bench)
```

You can pass arguments when running a program that makes use of `Bench`:

```sh
mojo benchmark.mojo -o out.csv -r 10
```

This will repeat the benchmarks 10 times and write the output to `out.csv` in csv format. ## Fields * config (`BenchConfig`): Constructs a Benchmark object based on specific configuration and mode. * mode (`Mode`): Benchmark mode object representing benchmark or test mode. * info\_vec (`List[BenchmarkInfo]`): A list containing the benchmark info. ## Implemented traits `AnyType`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self, config: Optional[BenchConfig] = Optional(None), mode: Mode = Mode(0))` Constructs a Benchmark object based on specific configuration and mode. **Args:** * config (`Optional[BenchConfig]`): Benchmark configuration object to control length and frequency of benchmarks. * mode (`Mode`): Benchmark mode object representing benchmark or test mode. ### `bench_with_input` `bench_with_input[: origin.set, //, T: AnyType, bench_fn: fn(mut Bencher, T) raises capturing -> None](mut self, bench_id: BenchId, input: T, measures: List[ThroughputMeasure] = List())` Benchmarks an input function with input args of type AnyType. **Parameters:** * T (`AnyType`): Benchmark function input type. * bench\_fn (`fn(mut Bencher, T) raises capturing -> None`): The function to be benchmarked. **Args:** * bench\_id (`BenchId`): The benchmark Id object used for identification. * input (`T`): Represents the target function's input arguments. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `bench_with_input[: origin.set, //, T: AnyType, bench_fn: fn(mut Bencher, T) raises capturing -> None](mut self, bench_id: BenchId, input: T, *measures: ThroughputMeasure)` Benchmarks an input function with input args of type AnyType. **Parameters:** * T (`AnyType`): Benchmark function input type. * bench\_fn (`fn(mut Bencher, T) raises capturing -> None`): The function to be benchmarked. **Args:** * bench\_id (`BenchId`): The benchmark Id object used for identification. * input (`T`): Represents the target function's input arguments. * \*measures (`ThroughputMeasure`): Variadic arg used to represent a list of ThroughputMeasure's.
`bench_with_input[: origin.set, //, T: AnyTrivialRegType, bench_fn: fn(mut Bencher, T) raises capturing -> None](mut self, bench_id: BenchId, input: T, measures: List[ThroughputMeasure] = List())` Benchmarks an input function with input args of type AnyTrivialRegType. **Parameters:** * ​T (`AnyTrivialRegType`): Benchmark function input type. * ​bench\_fn (`fn(mut Bencher, T) raises capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​input (`T`): Represents the target function's input arguments. * ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `bench_with_input[: origin.set, //, T: AnyTrivialRegType, bench_fn: fn(mut Bencher, T) raises capturing -> None](mut self, bench_id: BenchId, input: T, *measures: ThroughputMeasure)` Benchmarks an input function with input args of type AnyTrivialRegType. **Parameters:** * ​T (`AnyTrivialRegType`): Benchmark function input type. * ​bench\_fn (`fn(mut Bencher, T) raises capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​input (`T`): Represents the target function's input arguments. * ​\*measures (`ThroughputMeasure`): Variadic arg used to represent a list of ThroughputMeasure's. ### `bench_function` `bench_function[: origin.set, //, bench_fn: fn() raises capturing -> None](mut self, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmarks or Tests an input function. **Parameters:** * ​bench\_fn (`fn() raises capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `bench_function[: origin.set, //, bench_fn: fn() capturing -> None](mut self, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmarks or Tests an input function. **Parameters:** * ​bench\_fn (`fn() capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `bench_function[: origin.set, //, bench_fn: fn(mut Bencher) capturing -> None](mut self, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmarks or Tests an input function. **Parameters:** * ​bench\_fn (`fn(mut Bencher) capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `bench_function[: origin.set, //, bench_fn: fn(mut Bencher) capturing -> None](mut self, bench_id: BenchId, *measures: ThroughputMeasure)` Benchmarks or Tests an input function. **Parameters:** * ​bench\_fn (`fn(mut Bencher) capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​\*measures (`ThroughputMeasure`): Variadic arg used to represent a list of ThroughputMeasure's. `bench_function[: origin.set, //, bench_fn: fn(mut Bencher) raises capturing -> None](mut self, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmarks or Tests an input function. 
**Parameters:** * bench\_fn (`fn(mut Bencher) raises capturing -> None`): The function to be benchmarked. **Args:** * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `bench_function[: origin.set, //, bench_fn: fn(mut Bencher) raises capturing -> None](mut self, bench_id: BenchId, *measures: ThroughputMeasure)` Benchmarks or Tests an input function. **Parameters:** * bench\_fn (`fn(mut Bencher) raises capturing -> None`): The function to be benchmarked. **Args:** * bench\_id (`BenchId`): The benchmark Id object used for identification. * \*measures (`ThroughputMeasure`): Variadic arg used to represent a list of ThroughputMeasure's. ### `dump_report` `dump_report(mut self)` Prints out the report from a Benchmark execution. If `Bench.config.out_file` is set, it will also write the output in the format set in `out_file_format` to the file defined in `out_file`. ### `pad` `pad[pad_str: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")](self, width: Int, string: String) -> String` Pads a string to a given width. **Parameters:** * pad\_str (`StringSlice[StaticConstantOrigin]`): The length-1 string to use for the padding. **Args:** * width (`Int`): The width to pad the string to. * string (`String`): The string to pad. **Returns:** A string padded to the given width. ### `__str__` `__str__(self) -> String` Returns a string representation of the benchmark results. **Returns:** A string representing the benchmark results. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the benchmark results to a writer. **Parameters:** * W (`Writer`): A type conforming to the Writer trait. **Args:** * writer (`W`): The writer to write to. --- ## BenchConfig `struct BenchConfig` Defines a benchmark configuration struct to control execution times and frequency. ## Fields * out\_file (`Optional[Path]`): Output file to write results to. * min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs. * max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs. * min\_warmuptime\_secs (`SIMD[float64, 1]`): Lower bound on warmup time in secs. * max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement. * max\_iters (`Int`): Max number of iterations to run. * num\_repetitions (`Int`): Number of times the benchmark has to be repeated. * flush\_denormals (`Bool`): Whether or not the denormal values are flushed. * show\_progress (`Bool`): If True, print progress of each benchmark. * format (`Format`): The format to print results (default: "table"). * out\_file\_format (`Format`): The format to write out the file with `dump_file` (default: "csv"). * verbose\_timing (`Bool`): Whether to print verbose timing results. * verbose\_metric\_names (`Bool`): If True, print the metric name and unit; otherwise print the unit only. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `VERBOSE_TIMING_LABELS` `alias VERBOSE_TIMING_LABELS = List(__init__[__mlir_type.!kgen.string]("min (ms)"), __init__[__mlir_type.!kgen.string]("mean (ms)"), __init__[__mlir_type.!kgen.string]("max (ms)"), __init__[__mlir_type.!kgen.string]("duration (ms)"), Tuple())` Labels to print verbose timing results.
## Methods ### `__init__` `__init__(out self, out_file: Optional[Path] = Optional(None), min_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1), max_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](2), min_warmuptime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1), max_batch_size: Int = 0, max_iters: Int = 1000000000, num_repetitions: Int = 1, flush_denormals: Bool = True)` Constructs and initializes a Benchmark config object with default and input values. **Args:** * out\_file (`Optional[Path]`): Output file to write results to. * min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs (default `1`). * max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs (default `2`). * min\_warmuptime\_secs (`SIMD[float64, 1]`): Lower bound on warmup time in secs (default `1.0`). * max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement. * max\_iters (`Int`): Max number of iterations to run (default `1_000_000_000`). * num\_repetitions (`Int`): Number of times the benchmark has to be repeated. * flush\_denormals (`Bool`): Whether or not the denormal values are flushed. `__init__(out self, *, other: Self)` Explicitly construct a deep copy of the provided value. **Args:** * other (`Self`): The value to copy. --- ## BenchId `struct BenchId` Defines a benchmark Id struct to identify and represent a particular benchmark execution. ## Fields * func\_name (`String`): The target function name. * input\_id (`Optional[String]`): The target function input id phrase. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, func_name: String, input_id: String)` Constructs a Benchmark Id object from the input function name and Id phrase. **Args:** * func\_name (`String`): The target function name. * input\_id (`String`): The target function input id phrase. `@implicit` `__init__(out self, func_name: String)` Constructs a Benchmark Id object from the input function name. **Args:** * func\_name (`String`): The target function name. `@implicit` `__init__(out self, func_name: StringLiteral[value])` Constructs a Benchmark Id object from the input function name. **Args:** * func\_name (`StringLiteral[value]`): The target function name. --- ## BenchMetric `struct BenchMetric` Defines a benchmark throughput metric. ## Fields * code (`Int`): Op-code of the Metric. * name (`String`): Metric's name. * unit (`String`): Metric's throughput rate unit (count/second). ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `bytes` `alias bytes = BenchMetric(1, __init__[__mlir_type.!kgen.string]("DataMovement"), __init__[__mlir_type.!kgen.string]("GB/s"))` ### `DEFAULTS` `alias DEFAULTS = List(BenchMetric(0, __init__[__mlir_type.!kgen.string]("throughput"), __init__[__mlir_type.!kgen.string]("GElems/s")), BenchMetric(1, __init__[__mlir_type.!kgen.string]("DataMovement"), __init__[__mlir_type.!kgen.string]("GB/s")), BenchMetric(2, __init__[__mlir_type.!kgen.string]("Arithmetic"), __init__[__mlir_type.!kgen.string]("GFLOPS/s")), Tuple())` Default set of benchmark metrics.
### `elements` `alias elements = BenchMetric(0, __init__[__mlir_type.!kgen.string]("throughput"), __init__[__mlir_type.!kgen.string]("GElems/s"))` ### `flops` `alias flops = BenchMetric(2, __init__[__mlir_type.!kgen.string]("Arithmetic"), __init__[__mlir_type.!kgen.string]("GFLOPS/s"))` ### `theoretical_flops` `alias theoretical_flops = BenchMetric(3, __init__[__mlir_type.!kgen.string]("TheoreticalArithmetic"), __init__[__mlir_type.!kgen.string]("GFLOPS/s"))` ## Methods ### `__init__` `__init__(out self, *, other: Self)` Explicitly construct a deep copy of the provided value. **Args:** * other (`Self`): The value to copy. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compares two metrics for equality. **Args:** * other (`Self`): The metric to compare. **Returns:** True if the two metrics are equal. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compares two metrics for inequality. **Args:** * other (`Self`): The metric to compare. **Returns:** True if the two metrics are NOT equal. ### `__str__` `__str__(self) -> String` Gets a string representation of this metric. **Returns:** The string representation. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this BenchMetric to the provided Writer. **Parameters:** * W (`Writer`): A type conforming to the Writer trait. **Args:** * writer (`W`): The object to write to. ### `check_name` `check_name(self, alt_name: String) -> Bool` Checks whether a string contains the metric's name. **Args:** * alt\_name (`String`): Alternative name of a metric. **Returns:** True if 'alt\_name' is a valid alternative of the metric's name. ### `get_metric_from_list` `static get_metric_from_list(name: String, metric_list: List[BenchMetric]) -> Self` Gets a metric from a given list using only the metric's name. **Args:** * name (`String`): Metric's name. * metric\_list (`List[BenchMetric]`): List of metrics to search. **Returns:** The selected metric. --- ## Bencher `@register_passable` `struct Bencher` Defines a Bencher struct which facilitates the timing of a target function. ## Fields * num\_iters (`Int`): Number of iterations to run the target function. * elapsed (`Int`): The total time elapsed when running the target function. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(num_iters: Int) -> Self` Constructs a Bencher object to run and time a function. **Args:** * num\_iters (`Int`): Number of times to run the target function. ### `iter` `iter[: origin.set, //, iter_fn: fn() capturing -> None](mut self)` Measures the total elapsed time by running a target function a particular number of times. **Parameters:** * iter\_fn (`fn() capturing -> None`): The target function to benchmark. `iter[iter_fn: fn() raises capturing -> None](mut self)` Measures the total elapsed time by running a target function a particular number of times. **Parameters:** * iter\_fn (`fn() raises capturing -> None`): The target function to benchmark. ### `iter_preproc` `iter_preproc[: origin.set, : origin.set, //, iter_fn: fn() capturing -> None, preproc_fn: fn() capturing -> None](mut self)` Measures the total elapsed time by running a target function a particular number of times. **Parameters:** * iter\_fn (`fn() capturing -> None`): The target function to benchmark. * preproc\_fn (`fn() capturing -> None`): The function to preprocess the target function.
### `iter_custom` `iter_custom[: origin.set, //, iter_fn: fn(Int) capturing -> Int](mut self)` Times a target function with a custom number of iterations. **Parameters:** * iter\_fn (`fn(Int) capturing -> Int`): The target function to benchmark. `iter_custom[: origin.set, //, kernel_launch_fn: fn(DeviceContext) raises capturing -> None](mut self, ctx: DeviceContext)` Times a target GPU function with a custom number of iterations via DeviceContext ctx. **Parameters:** * kernel\_launch\_fn (`fn(DeviceContext) raises capturing -> None`): The target GPU kernel launch function to benchmark. **Args:** * ctx (`DeviceContext`): The GPU DeviceContext for launching the kernel. `iter_custom[: origin.set, //, kernel_launch_fn: fn(DeviceContext, Int) raises capturing -> None](mut self, ctx: DeviceContext)` Times a target GPU function with a custom number of iterations via DeviceContext ctx. **Parameters:** * kernel\_launch\_fn (`fn(DeviceContext, Int) raises capturing -> None`): The target GPU kernel launch function to benchmark. **Args:** * ctx (`DeviceContext`): The GPU DeviceContext for launching the kernel. `iter_custom[iter_fn: fn(Int) raises capturing -> Int](mut self)` Times a target function with a custom number of iterations. **Parameters:** * iter\_fn (`fn(Int) raises capturing -> Int`): The target function to benchmark. ### `iter_custom_multicontext` `iter_custom_multicontext[: origin.set, //, kernel_launch_fn: fn() raises capturing -> None](mut self, ctxs: List[DeviceContext])` Times a target GPU function with a custom number of iterations via the given DeviceContexts. **Parameters:** * kernel\_launch\_fn (`fn() raises capturing -> None`): The target GPU kernel launch function to benchmark. **Args:** * ctxs (`List[DeviceContext]`): The list of GPU DeviceContexts for launching the kernel. --- ## BenchmarkInfo `struct BenchmarkInfo` Defines a Benchmark Info struct to record execution statistics. ## Fields * name (`String`): The name of the benchmark. * result (`Report`): The output report after executing a benchmark. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. * verbose\_timing (`Bool`): Whether to print verbose timing results. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, name: String, result: Report, measures: List[ThroughputMeasure] = List(), verbose_timing: Bool = False)` Constructs a `BenchmarkInfo` object to return benchmark report and statistics. **Args:** * name (`String`): The name of the benchmark. * result (`Report`): The output report after executing a benchmark. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. * verbose\_timing (`Bool`): Whether to print verbose timing results. `__init__(out self, *, other: Self)` Explicitly construct a deep copy of the provided value. **Args:** * other (`Self`): The value to copy. --- ## Format `struct Format` Defines a format for the benchmark output when printing or writing to a file. ## Fields * value (`StringSlice[StaticConstantOrigin]`): The format to print results. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `csv` `alias csv = __init__[__mlir_type.!kgen.string]("csv")` Comma separated values with no alignment. ### `table` `alias table = __init__[__mlir_type.!kgen.string]("table")` Table format with dynamically aligned columns.
### `tabular` `alias tabular = __init__[__mlir_type.!kgen.string]("tabular")` Comma separated values with dynamically aligned columns. ## Methods ### `__init__` `@implicit` `__init__(out self, value: StringSlice[origin])` Constructs a Format object from a string. **Args:** * value (`StringSlice[origin]`): The format to print results. ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two Format objects are equal. **Args:** * other (`Self`): The `Format` to compare with. **Returns:** True if the two `Format` objects are equal, false otherwise. ### `__str__` `__str__(self) -> String` Returns the string representation of the format. **Returns:** The string representation of the format. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the format to a writer. **Parameters:** * W (`Writer`): A type conforming to the Writer trait. **Args:** * writer (`W`): The writer to write the `Format` to. --- ## Mode `struct Mode` Defines a Benchmark Mode to distinguish between test runs and actual benchmarks. ## Fields * value (`Int`): Represents the mode type. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `Benchmark` `alias Benchmark = Mode(0)` ### `Test` `alias Test = Mode(1)` ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks whether two modes are equal. **Args:** * other (`Self`): The mode to be compared against. **Returns:** True if the two modes are equal. --- ## ThroughputMeasure `struct ThroughputMeasure` Records a throughput metric as a `BenchMetric` and its measured value. ## Fields * metric (`BenchMetric`): Type of throughput metric. * value (`Int`): Measured count of throughput metric. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, name: String, value: Int, reference: List[BenchMetric] = List(BenchMetric(0, __init__[__mlir_type.!kgen.string]("throughput"), __init__[__mlir_type.!kgen.string]("GElems/s")), BenchMetric(1, __init__[__mlir_type.!kgen.string]("DataMovement"), __init__[__mlir_type.!kgen.string]("GB/s")), BenchMetric(2, __init__[__mlir_type.!kgen.string]("Arithmetic"), __init__[__mlir_type.!kgen.string]("GFLOPS/s")), Tuple()))` Creates a `ThroughputMeasure` based on a metric's name. Example: for the default bench metrics `BenchMetric.DEFAULTS`, the following are equivalent: `ThroughputMeasure(BenchMetric.fmas, 1024)`, `ThroughputMeasure("fmas", 1024)`, and `ThroughputMeasure("fmas", 1024, BenchMetric.DEFAULTS)`. **Args:** * name (`String`): The name of BenchMetric in its corresponding reference. * value (`Int`): The measured value to assign to this metric. * reference (`List[BenchMetric]`): List of BenchMetrics that contains this metric. `__init__(out self, *, other: Self)` Explicitly construct a deep copy of the provided value. **Args:** * other (`Self`): The value to copy. ### `__str__` `__str__(self) -> String` Gets a string representation of this `ThroughputMeasure`. **Returns:** The string representation. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this ThroughputMeasure to the provided Writer. **Parameters:** * W (`Writer`): A type conforming to the Writer trait. **Args:** * writer (`W`): The object to write to. ### `compute` `compute(self, elapsed_sec: SIMD[float64, 1]) -> SIMD[float64, 1]` Computes the throughput rate for this metric per unit of time (second). **Args:** * elapsed\_sec (`SIMD[float64, 1]`): Elapsed time measured in seconds.
**Returns:** The throughput value as a float64. --- ## bencher ## Structs * [`Bench`](/mojo/stdlib/benchmark/bencher/Bench): Constructs a Benchmark object, used for running multiple benchmarks and comparing the results. * [`BenchConfig`](/mojo/stdlib/benchmark/bencher/BenchConfig): Defines a benchmark configuration struct to control execution times and frequency. * [`Bencher`](/mojo/stdlib/benchmark/bencher/Bencher): Defines a Bencher struct which facilitates the timing of a target function. * [`BenchId`](/mojo/stdlib/benchmark/bencher/BenchId): Defines a benchmark Id struct to identify and represent a particular benchmark execution. * [`BenchmarkInfo`](/mojo/stdlib/benchmark/bencher/BenchmarkInfo): Defines a Benchmark Info struct to record execution statistics. * [`BenchMetric`](/mojo/stdlib/benchmark/bencher/BenchMetric): Defines a benchmark throughput metric. * [`Format`](/mojo/stdlib/benchmark/bencher/Format): Defines a format for the benchmark output when printing or writing to a file. * [`Mode`](/mojo/stdlib/benchmark/bencher/Mode): Defines a Benchmark Mode to distinguish between test runs and actual benchmarks. * [`ThroughputMeasure`](/mojo/stdlib/benchmark/bencher/ThroughputMeasure): Records a throughput metric as a BenchMetric and value. --- ## Batch `@register_passable(trivial)` `struct Batch` A batch of benchmarks. The `benchmark.run()` function works out how many iterations to run in each batch based on how long the previous iterations took. ## Fields * duration (`Int`): Total duration of batch stored as nanoseconds. * iterations (`Int`): Total iterations in the batch. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(*, other: Self) -> Self` Explicitly construct a deep copy of the provided value. **Args:** * other (`Self`): The value to copy. ### `mean` `mean(self, unit: String = __init__[__mlir_type.!kgen.string]("s")) -> SIMD[float64, 1]` Returns the average duration of the batch. **Args:** * unit (`String`): The time unit to display, for example: ns, ms, s (default `s`). **Returns:** The average duration of the batch. --- ## Report `struct Report` Contains the average execution time, iterations, min and max of each batch. ## Fields * warmup\_duration (`Int`): The total duration it took to warmup. * runs (`List[Batch]`): A `List` of benchmark runs. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Default initializer for the Report. Sets all values to 0. `__init__(out self, *, other: Self)` Explicitly construct a deep copy of the provided value. **Args:** * other (`Self`): The value to copy. ### `__copyinit__` `__copyinit__(out self, existing: Self)` Creates a shallow copy (it doesn't copy the data). **Args:** * existing (`Self`): The `Report` to copy. ### `iters` `iters(self) -> Int` The total benchmark iterations. **Returns:** The total benchmark iterations. ### `duration` `duration(self, unit: String = __init__[__mlir_type.!kgen.string]("s")) -> SIMD[float64, 1]` The total duration it took to run all benchmarks. **Args:** * unit (`String`): The time unit to display, for example: ns, ms, s (default `s`). **Returns:** The total duration it took to run all benchmarks. ### `mean` `mean(self, unit: String = __init__[__mlir_type.!kgen.string]("s")) -> SIMD[float64, 1]` The average duration of all benchmark runs.
**Args:** * unit (`String`): The time unit to display, for example: ns, ms, s (default `s`). **Returns:** The average duration of all benchmark runs. ### `min` `min(self, unit: String = __init__[__mlir_type.!kgen.string]("s")) -> SIMD[float64, 1]` The batch of benchmarks that was the fastest to run. **Args:** * unit (`String`): The time unit to display, for example: ns, ms, s (default `s`). **Returns:** The fastest duration out of all batches. ### `max` `max(self, unit: String = __init__[__mlir_type.!kgen.string]("s")) -> SIMD[float64, 1]` The batch of benchmarks that was the slowest to run. **Args:** * unit (`String`): The time unit to display, for example: ns, ms, s (default `s`). **Returns:** The slowest duration out of all batches. ### `print` `print(self, unit: String = __init__[__mlir_type.!kgen.string]("s"))` Prints out the shortened version of the report. **Args:** * unit (`String`): The time unit to display, for example: ns, ms, s (default `s`). ### `print_full` `print_full(self, unit: String = __init__[__mlir_type.!kgen.string]("s"))` Prints out the full version of the report with each batch of benchmark runs. **Args:** * unit (`String`): The time unit to display, for example: ns, ms, s (default `s`). --- ## Unit `struct Unit` Time Unit used by Benchmark Report. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `ms` `alias ms = "ms"` Milliseconds ### `ns` `alias ns = "ns"` Nanoseconds ### `s` `alias s = "s"` Seconds --- ## benchmark Implements the benchmark module for runtime benchmarking. You can import these APIs from the `benchmark` package. For example:

```mojo
import benchmark
from time import sleep
```

You can pass any `fn` as a parameter into `benchmark.run[...]()`; it will return a `Report` where you can get the mean, duration, max, and more:

```mojo
fn sleeper():
    sleep(.01)

var report = benchmark.run[sleeper]()
print(report.mean())
```

```output
0.012256487394957985
```

You can print a full report:

```mojo
report.print()
```

```output
---------------------
Benchmark Report (s)
---------------------
Mean: 0.012265747899159664
Total: 1.459624
Iters: 119
Warmup Total: 0.025020000000000001
Fastest Mean: 0.0121578
Slowest Mean: 0.012321428571428572
```

Or all the batch runs:

```mojo
report.print_full()
```

```output
---------------------
Benchmark Report (s)
---------------------
Mean: 0.012368649122807017
Total: 1.410026
Iters: 114
Warmup Total: 0.023341000000000001
Fastest Mean: 0.012295586956521738
Slowest Mean: 0.012508099999999999

Batch: 1
Iterations: 20
Mean: 0.012508099999999999
Duration: 0.250162

Batch: 2
Iterations: 46
Mean: 0.012295586956521738
Duration: 0.56559700000000002

Batch: 3
Iterations: 48
Mean: 0.012380562499999999
Duration: 0.59426699999999999
```

If you want to use a different time unit you can bring in the Unit and pass it in as an argument:

```mojo
from benchmark import Unit
report.print(Unit.ms)
```

```output
---------------------
Benchmark Report (ms)
---------------------
Mean: 0.012312411764705882
Total: 1.465177
Iters: 119
Warmup Total: 0.025010999999999999
Fastest Mean: 0.012015649999999999
Slowest Mean: 0.012421204081632654
```

The units are just aliases for string constants, so you can, for example:

```mojo
print(report.mean("ms"))
```

```output
12.199145299145298
```

`benchmark.run` takes four arguments to change the behavior. For example, to cap the number of iterations at 5:

```mojo
r = benchmark.run[sleeper](5)
```

```output
0.012004808080808081
```

To set a max of 1 iteration, a min total time of 2 s, a max total
time of 3 s, and a max batch size of 4:

```mojo
r = benchmark.run[sleeper](1, 2, 3, 4)
```

Note that the min total time will take precedence over max iterations. ## Structs * [`Batch`](/mojo/stdlib/benchmark/benchmark/Batch): A batch of benchmarks. The benchmark.run() function works out how many iterations to run in each batch based on how long the previous iterations took. * [`Report`](/mojo/stdlib/benchmark/benchmark/Report): Contains the average execution time, iterations, min and max of each batch. * [`Unit`](/mojo/stdlib/benchmark/benchmark/Unit): Time Unit used by Benchmark Report. ## Functions * [`run`](/mojo/stdlib/benchmark/benchmark/run): Benchmarks the function passed in as a parameter. --- ## run `run[func: fn() raises -> None](max_iters: Int = 1000000000, min_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](2), max_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](60), max_batch_size: Int = 0) -> Report` Benchmarks the function passed in as a parameter. Benchmarking continues until `min_runtime_secs` has elapsed and either `max_runtime_secs` OR `max_iters` is achieved. **Parameters:** * func (`fn() raises -> None`): The function to benchmark. **Args:** * max\_iters (`Int`): Max number of iterations to run (default `1_000_000_000`). * min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs (default `2`). * max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs (default `60`). * max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement. **Returns:** A `Report` with the benchmark statistics, including the average execution time of func. `run[func: fn() -> None](max_iters: Int = 1000000000, min_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](2), max_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](60), max_batch_size: Int = 0) -> Report` Benchmarks the function passed in as a parameter. Benchmarking continues until `min_runtime_secs` has elapsed and either `max_runtime_secs` OR `max_iters` is achieved. **Parameters:** * func (`fn() -> None`): The function to benchmark. **Args:** * max\_iters (`Int`): Max number of iterations to run (default `1_000_000_000`). * min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs (default `2`). * max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs (default `60`). * max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement. **Returns:** A `Report` with the benchmark statistics, including the average execution time of func. `run[: origin.set, //, func: fn() raises capturing -> None](max_iters: Int = 1000000000, min_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](2), max_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](60), max_batch_size: Int = 0) -> Report` Benchmarks the function passed in as a parameter. Benchmarking continues until `min_runtime_secs` has elapsed and either `max_runtime_secs` OR `max_iters` is achieved. **Parameters:** * func (`fn() raises capturing -> None`): The function to benchmark. **Args:** * max\_iters (`Int`): Max number of iterations to run (default `1_000_000_000`). * min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs (default `2`). * max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs (default `60`). * max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement. **Returns:** A `Report` with the benchmark statistics, including the average execution time of func.
`run[: origin.set, //, func: fn() capturing -> None](max_iters: Int = 1000000000, min_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](2), max_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](60), max_batch_size: Int = 0) -> Report` Benchmarks the function passed in as a parameter. Benchmarking continues until `min_runtime_secs` has elapsed and either `max_runtime_secs` OR `max_iters` is achieved. **Parameters:** * func (`fn() capturing -> None`): The function to benchmark. **Args:** * max\_iters (`Int`): Max number of iterations to run (default `1_000_000_000`). * min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs (default `2`). * max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs (default `60`). * max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement. **Returns:** A `Report` with the benchmark statistics, including the average execution time of func. --- ## compiler ## Functions * [`keep`](/mojo/stdlib/benchmark/compiler/keep): Provides a hint to the compiler to not optimize the variable use away. --- ## keep `keep(val: Bool)` Provides a hint to the compiler to not optimize the variable use away. This is useful in benchmarking to prevent the compiler from deleting the code to be benchmarked when the variable is not otherwise used in a side-effecting manner. **Args:** * val (`Bool`): The value to not optimize away. `keep(val: Int)` Provides a hint to the compiler to not optimize the variable use away. This is useful in benchmarking to prevent the compiler from deleting the code to be benchmarked when the variable is not otherwise used in a side-effecting manner. **Args:** * val (`Int`): The value to not optimize away. `keep[type: DType, simd_width: Int](val: SIMD[type, simd_width])` Provides a hint to the compiler to not optimize the variable use away. This is useful in benchmarking to prevent the compiler from deleting the code to be benchmarked when the variable is not otherwise used in a side-effecting manner. **Parameters:** * type (`DType`): The `dtype` of the input and output SIMD vector. * simd\_width (`Int`): The width of the input and output SIMD vector. **Args:** * val (`SIMD[type, simd_width]`): The value to not optimize away. `keep[type: AnyType](val: UnsafePointer[type])` Provides a hint to the compiler to not optimize the variable use away. This is useful in benchmarking to prevent the compiler from deleting the code to be benchmarked when the variable is not otherwise used in a side-effecting manner. **Parameters:** * type (`AnyType`): The type of the input. **Args:** * val (`UnsafePointer[type]`): The value to not optimize away. `keep[type: AnyTrivialRegType](mut val: type)` Provides a hint to the compiler to not optimize the variable use away. This is useful in benchmarking to prevent the compiler from deleting the code to be benchmarked when the variable is not otherwise used in a side-effecting manner. **Parameters:** * type (`AnyTrivialRegType`): The type of the input. **Args:** * val (`type`): The value to not optimize away.
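To see why the hint matters, consider timing a loop whose result is otherwise unused. This is a minimal sketch (the function name `work` is illustrative) using only the `keep` and `benchmark.run` APIs documented here:

```mojo
import benchmark
from benchmark import keep

fn work():
    var acc = 0
    for i in range(1000):
        acc += i
    # Without this hint, the compiler could see that `acc` is unused
    # and delete the loop entirely, making the timing meaningless.
    keep(acc)

fn main():
    var report = benchmark.run[work]()
    print(report.mean("ns"))
```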
--- ## benchmark Implements the benchmark package for runtime benchmarking. You can import these APIs from the `benchmark` package. For example:

```mojo
import benchmark
from time import sleep
```

You can pass any `fn` as a parameter into `benchmark.run[...]()`; it will return a `Report` where you can get the mean, duration, max, and more:

```mojo
fn sleeper():
    sleep(.01)

var report = benchmark.run[sleeper]()
print(report.mean())
```

```output
0.012256487394957985
```

You can print a full report:

```mojo
report.print()
```

```output
---------------------
Benchmark Report (s)
---------------------
Mean: 0.012265747899159664
Total: 1.459624
Iters: 119
Warmup Mean: 0.01251
Warmup Total: 0.025020000000000001
Warmup Iters: 2
Fastest Mean: 0.0121578
Slowest Mean: 0.012321428571428572
```

Or all the batch runs:

```mojo
report.print_full()
```

```output
---------------------
Benchmark Report (s)
---------------------
Mean: 0.012368649122807017
Total: 1.410026
Iters: 114
Warmup Mean: 0.0116705
Warmup Total: 0.023341000000000001
Warmup Iters: 2
Fastest Mean: 0.012295586956521738
Slowest Mean: 0.012508099999999999

Batch: 1
Iterations: 20
Mean: 0.012508099999999999
Duration: 0.250162

Batch: 2
Iterations: 46
Mean: 0.012295586956521738
Duration: 0.56559700000000002

Batch: 3
Iterations: 48
Mean: 0.012380562499999999
Duration: 0.59426699999999999
```

If you want to use a different time unit you can bring in the Unit and pass it in as an argument:

```mojo
from benchmark import Unit
report.print(Unit.ms)
```

```output
---------------------
Benchmark Report (ms)
---------------------
Mean: 0.012312411764705882
Total: 1.465177
Iters: 119
Warmup Mean: 0.012505499999999999
Warmup Total: 0.025010999999999999
Warmup Iters: 2
Fastest Mean: 0.012015649999999999
Slowest Mean: 0.012421204081632654
```

The units are just aliases for string constants, so you can, for example:

```mojo
print(report.mean("ms"))
```

```output
12.199145299145298
```

`benchmark.run` takes four arguments to change the behavior. For example, to cap the number of iterations at 5:

```mojo
r = benchmark.run[sleeper](5)
```

```output
0.012004808080808081
```

To set a max of 1 iteration, a min total time of 2 s, a max total time of 3 s, and a max batch size of 4:

```mojo
r = benchmark.run[sleeper](1, 2, 3, 4)
```

Note that the min total time will take precedence over max iterations. ## Modules * [`bencher`](/mojo/stdlib/benchmark/bencher/): * [`benchmark`](/mojo/stdlib/benchmark/benchmark/): Implements the benchmark module for runtime benchmarking. * [`compiler`](/mojo/stdlib/benchmark/compiler/): * [`memory`](/mojo/stdlib/benchmark/memory/): * [`quick_bench`](/mojo/stdlib/benchmark/quick_bench/): --- ## clobber_memory `clobber_memory()` Forces all pending memory writes to be flushed to memory. This ensures that the compiler does not optimize away memory writes if it deems them to be not necessary. In effect, this operation acts as a barrier to memory reads and writes. --- ## memory ## Functions * [`clobber_memory`](/mojo/stdlib/benchmark/memory/clobber_memory): Forces all pending memory writes to be flushed to memory. --- ## QuickBench `struct QuickBench` Defines a struct to facilitate benchmarking and avoiding `Bencher` boilerplate.
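Example, a minimal sketch (the benchmarked function `add` and the `BenchId` label are illustrative; the argument and return types of `run` are inferred from `func`):

```mojo
from benchmark import BenchId, QuickBench

fn add(x: Int, y: Int) -> Int:
    return x + y

fn main():
    var qb = QuickBench()
    # Benchmarks `add(3, 4)` without any Bencher boilerplate.
    qb.run(add, 3, 4, bench_id=BenchId("add"))
    qb.dump_report()
```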
## Fields * m (`Bench`): Bench object to collect the results. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Just initializes the Bench object. ### `dump_report` `dump_report(mut self)` Prints out the report from a Benchmark execution collected in the Bench object. ### `run` `run[T_out: AnyTrivialRegType](mut self, func: fn() -> T_out, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with no input arguments and return type `T_out`. **Parameters:** * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn() -> T_out`): The function to be benchmarked (run in benchmark iterations). * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `run[T0: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0) -> T_out, x0: T0, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 1 input argument and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1) -> T_out, x0: T0, x1: T1, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 2 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2) -> T_out, x0: T0, x1: T1, x2: T2, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 3 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1, T2) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * x2 (`T2`): The 3rd argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 4 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func.
* T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * T3 (`AnyTrivialRegType`): Type of the 4th argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1, T2, T3) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * x2 (`T2`): The 3rd argument of func. * x3 (`T3`): The 4th argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 5 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * T3 (`AnyTrivialRegType`): Type of the 4th argument of func. * T4 (`AnyTrivialRegType`): Type of the 5th argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1, T2, T3, T4) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * x2 (`T2`): The 3rd argument of func. * x3 (`T3`): The 4th argument of func. * x4 (`T4`): The 5th argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, T5: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4, T5) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, x5: T5, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 6 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * T3 (`AnyTrivialRegType`): Type of the 4th argument of func. * T4 (`AnyTrivialRegType`): Type of the 5th argument of func. * T5 (`AnyTrivialRegType`): Type of the 6th argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1, T2, T3, T4, T5) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * x2 (`T2`): The 3rd argument of func. * x3 (`T3`): The 4th argument of func. * x4 (`T4`): The 5th argument of func. * x5 (`T5`): The 6th argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's.
`run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, T5: AnyTrivialRegType, T6: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4, T5, T6) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, x5: T5, x6: T6, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 7 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * T3 (`AnyTrivialRegType`): Type of the 4th argument of func. * T4 (`AnyTrivialRegType`): Type of the 5th argument of func. * T5 (`AnyTrivialRegType`): Type of the 6th argument of func. * T6 (`AnyTrivialRegType`): Type of the 7th argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1, T2, T3, T4, T5, T6) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * x2 (`T2`): The 3rd argument of func. * x3 (`T3`): The 4th argument of func. * x4 (`T4`): The 5th argument of func. * x5 (`T5`): The 6th argument of func. * x6 (`T6`): The 7th argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional list of `ThroughputMeasure` values. `run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, T5: AnyTrivialRegType, T6: AnyTrivialRegType, T7: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4, T5, T6, T7) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, x5: T5, x6: T6, x7: T7, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 8 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * T3 (`AnyTrivialRegType`): Type of the 4th argument of func. * T4 (`AnyTrivialRegType`): Type of the 5th argument of func. * T5 (`AnyTrivialRegType`): Type of the 6th argument of func. * T6 (`AnyTrivialRegType`): Type of the 7th argument of func. * T7 (`AnyTrivialRegType`): Type of the 8th argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1, T2, T3, T4, T5, T6, T7) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * x2 (`T2`): The 3rd argument of func. * x3 (`T3`): The 4th argument of func. * x4 (`T4`): The 5th argument of func. * x5 (`T5`): The 6th argument of func. * x6 (`T6`): The 7th argument of func. * x7 (`T7`): The 8th argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional list of `ThroughputMeasure` values.
`run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, T5: AnyTrivialRegType, T6: AnyTrivialRegType, T7: AnyTrivialRegType, T8: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4, T5, T6, T7, T8) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, x5: T5, x6: T6, x7: T7, x8: T8, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 9 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * T3 (`AnyTrivialRegType`): Type of the 4th argument of func. * T4 (`AnyTrivialRegType`): Type of the 5th argument of func. * T5 (`AnyTrivialRegType`): Type of the 6th argument of func. * T6 (`AnyTrivialRegType`): Type of the 7th argument of func. * T7 (`AnyTrivialRegType`): Type of the 8th argument of func. * T8 (`AnyTrivialRegType`): Type of the 9th argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1, T2, T3, T4, T5, T6, T7, T8) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * x2 (`T2`): The 3rd argument of func. * x3 (`T3`): The 4th argument of func. * x4 (`T4`): The 5th argument of func. * x5 (`T5`): The 6th argument of func. * x6 (`T6`): The 7th argument of func. * x7 (`T7`): The 8th argument of func. * x8 (`T8`): The 9th argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional list of `ThroughputMeasure` values. `run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, T5: AnyTrivialRegType, T6: AnyTrivialRegType, T7: AnyTrivialRegType, T8: AnyTrivialRegType, T9: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4, T5, T6, T7, T8, T9) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, x5: T5, x6: T6, x7: T7, x8: T8, x9: T9, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 10 input arguments and return type `T_out`. **Parameters:** * T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * T3 (`AnyTrivialRegType`): Type of the 4th argument of func. * T4 (`AnyTrivialRegType`): Type of the 5th argument of func. * T5 (`AnyTrivialRegType`): Type of the 6th argument of func. * T6 (`AnyTrivialRegType`): Type of the 7th argument of func. * T7 (`AnyTrivialRegType`): Type of the 8th argument of func. * T8 (`AnyTrivialRegType`): Type of the 9th argument of func. * T9 (`AnyTrivialRegType`): Type of the 10th argument of func. * T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * func (`fn(T0, T1, T2, T3, T4, T5, T6, T7, T8, T9) -> T_out`): The function to be benchmarked (run in benchmark iterations). * x0 (`T0`): The 1st argument of func. * x1 (`T1`): The 2nd argument of func. * x2 (`T2`): The 3rd argument of func. * x3 (`T3`): The 4th argument of func. * x4 (`T4`): The 5th argument of func. * x5 (`T5`): The 6th argument of func. * x6 (`T6`): The 7th argument of func. * x7 (`T7`): The 8th argument of func.
* x8 (`T8`): The 9th argument of func. * x9 (`T9`): The 10th argument of func. * bench\_id (`BenchId`): The benchmark Id object used for identification. * measures (`List[ThroughputMeasure]`): Optional list of `ThroughputMeasure` values. --- ## quick_bench ## Structs * [`QuickBench`](/mojo/stdlib/benchmark/quick_bench/QuickBench): Defines a struct to facilitate benchmarking while avoiding `Bencher` boilerplate. --- ## bit_not `bit_not[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs a bitwise NOT operation on a SIMD vector of integer values. **Constraints:** The element type of the input vector must be integral. **Parameters:** * dtype (`DType`): `dtype` used for the computation. * width (`Int`): SIMD width used for the computation. **Args:** * val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` is computed as a bitwise NOT of the integer value at position `i` of the input value. --- ## bit_reverse `bit_reverse(val: Int) -> Int` Reverses the bitpattern of an integer value. **Args:** * val (`Int`): The input value. **Returns:** The input value with its bitpattern reversed. `bit_reverse[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Element-wise reverses the bitpattern of a SIMD vector of integer values. **Constraints:** The element type of the input vector must be integral. **Parameters:** * dtype (`DType`): `dtype` used for the computation. * width (`Int`): SIMD width used for the computation. **Args:** * val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` is the reversed bitpattern of the element at position `i` of the input value. --- ## bit_width `bit_width(val: Int) -> Int` Computes the minimum number of bits required to represent the integer. **Args:** * val (`Int`): The input value. **Returns:** The number of bits required to represent the integer. `bit_width[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the minimum number of bits required to represent each element of a SIMD vector of integer values. **Constraints:** The element type of the input vector must be integral. **Parameters:** * dtype (`DType`): `dtype` used for the computation. * width (`Int`): SIMD width used for the computation. **Args:** * val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` equals the number of bits required to represent the integer at position `i` of the input. --- ## byte_swap `byte_swap(val: Int) -> Int` Byte-swaps an integer value with an even number of bytes. Byte swap an integer value (8 bytes) with an even number of bytes (positive multiple of 16 bits). This returns an integer value (8 bytes) that has its bytes swapped. For example, if the input bytes are numbered 0, 1, 2, 3, 4, 5, 6, 7 then the returned integer will have its bytes in 7, 6, 5, 4, 3, 2, 1, 0 order. **Args:** * val (`Int`): The input value. **Returns:** The input value with its bytes swapped. `byte_swap[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Byte-swaps a SIMD vector of integer values with an even number of bytes. Byte swap an integer value or vector of integer values with an even number of bytes (positive multiple of 16 bits). For example, byte-swapping an `Int16` returns an `Int16` value that has the high and low byte of the input swapped.
Similarly, byte-swapping an `Int32` returns an `Int32` value that has the four bytes of the input swapped, so that if the input bytes are numbered 0, 1, 2, 3 then the returned `Int32` will have its bytes in 3, 2, 1, 0 order. `Int64` and other integer types extend this concept to additional even-byte lengths (6 bytes, 8 bytes, and so on). **Constraints:** The element type of the input vector must be an integral type. **Parameters:** * dtype (`DType`): `dtype` used for the computation. * width (`Int`): SIMD width used for the computation. **Args:** * val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` is the value of the element at position `i` of the input value with its bytes swapped. --- ## count_leading_zeros `count_leading_zeros(val: Int) -> Int` Counts the number of leading zeros of an integer. **Args:** * val (`Int`): The input value. **Returns:** The number of leading zeros of the input. `count_leading_zeros[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Counts the per-element number of leading zeros in a SIMD vector. **Constraints:** The element type of the input vector must be integral. **Parameters:** * dtype (`DType`): `DType` used for the computation. * width (`Int`): SIMD width used for the computation. **Args:** * val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` contains the number of leading zeros at position `i` of the input value. --- ## count_trailing_zeros `count_trailing_zeros(val: Int) -> Int` Counts the number of trailing zeros for an integer. **Args:** * val (`Int`): The input value. **Returns:** The number of trailing zeros of the input. `count_trailing_zeros[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Counts the per-element number of trailing zeros in a SIMD vector. **Constraints:** The element type of the input vector must be integral. **Parameters:** * dtype (`DType`): `dtype` used for the computation. * width (`Int`): SIMD width used for the computation. **Args:** * val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` contains the number of trailing zeros at position `i` of the input value. --- ## bit Provides functions for bit manipulation. You can import these APIs from the `bit` package. For example:

```mojo
from bit import count_leading_zeros
```

## Functions * [`bit_not`](/mojo/stdlib/bit/bit/bit_not): Performs a bitwise NOT operation on a SIMD vector of integer values. * [`bit_reverse`](/mojo/stdlib/bit/bit/bit_reverse): Reverses the bitpattern of an integer value. * [`bit_width`](/mojo/stdlib/bit/bit/bit_width): Computes the minimum number of bits required to represent the integer. * [`byte_swap`](/mojo/stdlib/bit/bit/byte_swap): Byte-swaps an integer value with an even number of bytes. * [`count_leading_zeros`](/mojo/stdlib/bit/bit/count_leading_zeros): Counts the number of leading zeros of an integer. * [`count_trailing_zeros`](/mojo/stdlib/bit/bit/count_trailing_zeros): Counts the number of trailing zeros for an integer. * [`log2_floor`](/mojo/stdlib/bit/bit/log2_floor): Returns the floor of the base-2 logarithm of an integer value. * [`next_power_of_two`](/mojo/stdlib/bit/bit/next_power_of_two): Computes the smallest power of 2 that is greater than or equal to the input value. Any integral value less than or equal to 1 will be ceiled to 1.
* [​`pop_count`](/mojo/stdlib/bit/bit/pop_count): Counts the number of bits set in an integer value. * [​`prev_power_of_two`](/mojo/stdlib/bit/bit/prev_power_of_two): Computes the largest power of 2 that is less than or equal to the input value. Any integral value less than or equal to 0 will be floored to 0. * [​`rotate_bits_left`](/mojo/stdlib/bit/bit/rotate_bits_left): Shifts the bits of an input to the left by `shift` bits (with wrap-around). * [​`rotate_bits_right`](/mojo/stdlib/bit/bit/rotate_bits_right): Shifts the bits of an input to the right by `shift` bits (with wrap-around). --- ## log2_floor `log2_floor(val: Int) -> Int` Returns the floor of the base-2 logarithm of an integer value. **Args:** * ​val (`Int`): The input value. **Returns:** The floor of the base-2 logarithm of the input value, which is equal to the position of the highest set bit. Returns -1 if val is 0. --- ## next_power_of_two `next_power_of_two(val: Int) -> Int` Computes the smallest power of 2 that is greater than or equal to the input value. Any integral value less than or equal to 1 will be ceiled to 1. Notes: This operation is called `bit_ceil()` in C++. **Args:** * ​val (`Int`): The input value. **Returns:** The smallest power of 2 that is greater than or equal to the input value. `next_power_of_two(val: UInt) -> UInt` Computes the smallest power of 2 that is greater than or equal to the input value. Any integral value less than or equal to 1 will be ceiled to 1. Notes: This operation is called `bit_ceil()` in C++. **Args:** * ​val (`UInt`): The input value. **Returns:** The smallest power of 2 that is greater than or equal to the input value. `next_power_of_two[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the smallest power of 2 that is greater than or equal to the input value for each element of a SIMD vector. Any integral value less than or equal to 1 will be ceiled to 1. This operation is called `bit_ceil()` in C++. **Constraints:** The element type of the input vector must be integral. **Parameters:** * ​dtype (`DType`): `dtype` used for the computation. * ​width (`Int`): SIMD width used for the computation. **Args:** * ​val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` is the smallest power of 2 that is greater than or equal to the integer at position `i` of the input value. --- ## pop_count `pop_count(val: Int) -> Int` Counts the number of bits set in an integer value. **Args:** * ​val (`Int`): The input value. **Returns:** The number of bits set in the input value. `pop_count[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Counts the number of bits set in a SIMD vector of integer values. **Constraints:** The element type of the input vector must be integral. **Parameters:** * ​dtype (`DType`): `dtype` used for the computation. * ​width (`Int`): SIMD width used for the computation. **Args:** * ​val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` contains the number of bits set in the element at position `i` of the input value. --- ## prev_power_of_two `prev_power_of_two(val: Int) -> Int` Computes the largest power of 2 that is less than or equal to the input value. Any integral value less than or equal to 0 will be floored to 0. This operation is called `bit_floor()` in C++. **Args:** * ​val (`Int`): The input value. **Returns:** The largest power of 2 that is less than or equal to the input value. 
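For example, the scalar overloads above behave as follows (a minimal sketch):

```mojo
from bit import next_power_of_two, prev_power_of_two

def main():
    print(prev_power_of_two(20))  # 16: the largest power of 2 <= 20
    print(prev_power_of_two(0))   # 0: values <= 0 are floored to 0
    print(next_power_of_two(20))  # 32: the smallest power of 2 >= 20
```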
`prev_power_of_two[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the largest power of 2 that is less than or equal to the input value for each element of a SIMD vector. Any integral value less than or equal to 0 will be floored to 0. This operation is called `bit_floor()` in C++. **Constraints:** The element type of the input vector must be integral. **Parameters:** * dtype (`DType`): `dtype` used for the computation. * width (`Int`): SIMD width used for the computation. **Args:** * val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` is the largest power of 2 that is less than or equal to the integer at position `i` of the input value. --- ## rotate_bits_left `rotate_bits_left[shift: Int](x: Int) -> Int` Shifts the bits of an input to the left by `shift` bits (with wrap-around). **Constraints:** `-size <= shift < size`. **Parameters:** * shift (`Int`): The number of bit positions by which to rotate the bits of the integer to the left (with wrap-around). **Args:** * x (`Int`): The input value. **Returns:** The input rotated to the left by `shift` bits (with wrap-around). `rotate_bits_left[dtype: DType, width: Int, //, shift: Int](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Shifts bits to the left by `shift` positions (with wrap-around) for each element of a SIMD vector. **Constraints:** `0 <= shift < size`. **Parameters:** * dtype (`DType`): The `dtype` of the input and output SIMD vector. Must be integral and unsigned. * width (`Int`): The width of the SIMD vector. * shift (`Int`): The number of positions to rotate left. **Args:** * x (`SIMD[dtype, width]`): SIMD vector input. **Returns:** SIMD vector with each element rotated left by `shift` bits. --- ## rotate_bits_right `rotate_bits_right[shift: Int](x: Int) -> Int` Shifts the bits of an input to the right by `shift` bits (with wrap-around). **Constraints:** `-size <= shift < size`. **Parameters:** * shift (`Int`): The number of bit positions by which to rotate the bits of the integer to the right (with wrap-around). **Args:** * x (`Int`): The input value. **Returns:** The input rotated to the right by `shift` bits (with wrap-around). `rotate_bits_right[dtype: DType, width: Int, //, shift: Int](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Shifts bits to the right by `shift` positions (with wrap-around) for each element of a SIMD vector. **Constraints:** `0 <= shift < size`. **Parameters:** * dtype (`DType`): The `dtype` of the input and output SIMD vector. Must be integral and unsigned. * width (`Int`): The width of the SIMD vector. * shift (`Int`): The number of positions to rotate right. **Args:** * x (`SIMD[dtype, width]`): SIMD vector input. **Returns:** SIMD vector with each element rotated right by `shift` bits. --- ## bit Implements the bit package. ## Modules * [`bit`](/mojo/stdlib/bit/bit/): Provides functions for bit manipulation. --- ## NDBuffer `@register_passable(trivial)` `struct NDBuffer[mut: Bool, //, type: DType, rank: Int, origin: Origin[mut], shape: DimList = create_unknown[::Int](), strides: DimList = create_unknown[::Int](), *, alignment: Int = 1, address_space: AddressSpace = AddressSpace(0), exclusive: Bool = True]` An N-dimensional buffer. NDBuffer can be parametrized on rank, static dimensions and DType. It does not own its underlying pointer. ## Parameters * mut (`Bool`): The inferred mutability. * type (`DType`): The element type of the buffer. * rank (`Int`): The rank of the buffer. * origin (`Origin[mut]`): The origin of the memory being addressed. * shape (`DimList`): The static size (if known) of the buffer.
* strides (`DimList`): The strides (if known) of the buffer. * alignment (`Int`): The preferred address alignment of the buffer. * address\_space (`AddressSpace`): The address space of the buffer. * exclusive (`Bool`): The underlying memory allocation of the tensor is known only to be accessible through this pointer. ## Fields * data (`UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin]`): The underlying data for the buffer. The pointer is not owned by the NDBuffer. * dynamic\_shape (`IndexList[rank, element_type=uint64]`): The dynamic value of the shape. * dynamic\_stride (`IndexList[rank, element_type=uint64]`): The dynamic stride of the buffer. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Default initializer for NDBuffer. By default the fields are all initialized to 0. `@implicit` `__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> Self` Constructs an NDBuffer with statically known rank, shapes and type. **Constraints:** The rank, shapes, and type are known. **Args:** * ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the data. `@implicit` `__init__(span: Span[SIMD[type, 1], origin, address_space=address_space, alignment=alignment]) -> Self` Constructs an NDBuffer with statically known rank, shapes and type. **Constraints:** The rank, shapes, and type are known. **Args:** * span (`Span[SIMD[type, 1], origin, address_space=address_space, alignment=alignment]`): Span of the data. `@implicit` `__init__(other: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> Self` Converts NDBuffers between different variants which do not affect the underlying memory representation. E.g. this allows implicit conversion from `NDBuffer[type, rank, DimList(1, 2, 3), DimList(6, 6, 1), alignment=16]` to `NDBuffer[type, rank, DimList(1, 2, 3), DimList.create_unknown[rank](), alignment=4]`. **Args:** * other (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The other NDBuffer type. `__init__(ptr: UnsafePointer[Scalar[type], address_space=address_space, mut=mut, origin=origin], dynamic_shape: IndexList[rank, element_type=element_type]) -> Self` Constructs an NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * ptr (`UnsafePointer[Scalar[type], address_space=address_space, mut=mut, origin=origin]`): Pointer to the data. * dynamic\_shape (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing shapes. `__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin], dynamic_shape: IndexList[rank, element_type=element_type]) -> Self` Constructs an NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin]`): Pointer to the data. * dynamic\_shape (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing shapes.
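For example, wrapping externally allocated memory with a runtime shape might look like the following minimal sketch. It assumes the buffer's origin parameter can be spelled `MutableAnyOrigin` and that `Index` (from the `utils` package) produces the `IndexList` these constructors expect:

```mojo
from buffer import NDBuffer
from memory import UnsafePointer
from utils import Index

def main():
    # NDBuffer does not own this allocation; we must free it ourselves.
    var ptr = UnsafePointer[Float32].alloc(6)
    var buf = NDBuffer[DType.float32, 2, MutableAnyOrigin](ptr, Index(2, 3))
    buf.fill(1.0)
    print(buf[1, 2])  # 1.0
    ptr.free()
```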
`__init__(span: Span[SIMD[type, 1], origin, address_space=address_space], dynamic_shape: IndexList[rank, element_type=element_type]) -> Self` Constructs an NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * ​span (`Span[SIMD[type, 1], origin, address_space=address_space]`): Span of the data. * ​dynamic\_shape (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing shapes. `__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin], dynamic_shape: DimList) -> Self` Constructs an NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin]`): Pointer to the data. * ​dynamic\_shape (`DimList`): A static tuple of size 'rank' representing shapes. `__init__(span: Span[SIMD[type, 1], origin, address_space=address_space], dynamic_shape: DimList) -> Self` Constructs an NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * ​span (`Span[SIMD[type, 1], origin, address_space=address_space]`): Span of the data. * ​dynamic\_shape (`DimList`): A static tuple of size 'rank' representing shapes. `__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin], dynamic_shape: IndexList[rank, element_type=element_type], dynamic_stride: IndexList[rank, element_type=element_type]) -> Self` Constructs a strided NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin]`): Pointer to the data. * ​dynamic\_shape (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing shapes. * ​dynamic\_stride (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing strides. `__init__(span: Span[SIMD[type, 1], origin, address_space=address_space], dynamic_shape: IndexList[rank, element_type=element_type], dynamic_stride: IndexList[rank, element_type=element_type]) -> Self` Constructs a strided NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * ​span (`Span[SIMD[type, 1], origin, address_space=address_space]`): Span over the data. * ​dynamic\_shape (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing shapes. * ​dynamic\_stride (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing strides. `__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin], dynamic_shape: DimList, dynamic_stride: IndexList[rank, element_type=element_type]) -> Self` Constructs a strided NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin]`): Pointer to the data. * ​dynamic\_shape (`DimList`): A DimList of size 'rank' representing shapes. * ​dynamic\_stride (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing strides. 
`__init__(span: Span[SIMD[type, 1], origin, address_space=address_space], dynamic_shape: DimList, dynamic_stride: IndexList[rank, element_type=element_type]) -> Self` Constructs a strided NDBuffer with statically known rank, but dynamic shapes and type. **Constraints:** The rank is known. **Args:** * span (`Span[SIMD[type, 1], origin, address_space=address_space]`): Span of the data. * dynamic\_shape (`DimList`): A DimList of size 'rank' representing shapes. * dynamic\_stride (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing strides. ### `__getitem__` `__getitem__(self, *idx: Int) -> SIMD[type, 1]` Gets an element from the buffer at the specified index. **Args:** * \*idx (`Int`): Index of the element to retrieve. **Returns:** The value of the element. `__getitem__(self, idx: IndexList[rank, element_type=element_type]) -> SIMD[type, 1]` Gets an element from the buffer at the specified index. **Args:** * idx (`IndexList[rank, element_type=element_type]`): Index of the element to retrieve. **Returns:** The value of the element. ### `__setitem__` `__setitem__(self: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], idx: IndexList[rank, element_type=element_type], val: SIMD[type, 1])` Stores a single value into the buffer at the specified index. **Args:** * idx (`IndexList[rank, element_type=element_type]`): The index into the buffer. * val (`SIMD[type, 1]`): The value to store. `__setitem__(self: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], *idx: Int, *, val: SIMD[type, 1])` Stores a single value into the buffer at the specified index. **Args:** * \*idx (`Int`): Index of the element to set. * val (`SIMD[type, 1]`): The value to store. ### `origin_cast` `origin_cast[mut: Bool = mut, origin: Origin[mut] = origin](self) -> NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]` Changes the origin or mutability of a pointer. **Parameters:** * mut (`Bool`): Whether the origin is mutable. * origin (`Origin[mut]`): Origin of the destination pointer. **Returns:** A new `NDBuffer` object with the same type and address as the original `NDBuffer`, and the new specified mutability and origin. ### `get_rank` `get_rank(self) -> Int` Returns the rank of the buffer. **Returns:** The rank of NDBuffer. ### `get_shape` `get_shape(self) -> IndexList[rank]` Returns the shapes of the buffer. **Returns:** A static tuple of size 'rank' representing shapes of the NDBuffer. ### `get_strides` `get_strides(self) -> IndexList[rank]` Returns the strides of the buffer. **Returns:** A static tuple of size 'rank' representing strides of the NDBuffer. ### `get_nd_index` `get_nd_index(self, idx: Int) -> IndexList[rank]` Computes the NDBuffer's ND-index based on the flat index. **Args:** * idx (`Int`): The flat index. **Returns:** The index positions. ### `__len__` `__len__(self) -> Int` Computes the NDBuffer's number of elements. **Returns:** The total number of elements in the NDBuffer. ### `num_elements` `num_elements(self) -> Int` Computes the NDBuffer's number of elements. **Returns:** The total number of elements in the NDBuffer. ### `size` `size(self) -> Int` Computes the NDBuffer's number of elements. **Returns:** The total number of elements in the NDBuffer. ### `__str__` `__str__(self) -> String` Gets the buffer as a string.
**Returns:** A compact string of the buffer. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this buffer to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__repr__` `__repr__(self) -> String` Gets the buffer as a string. **Returns:** A compact string representation of the buffer. ### `tile` `tile[*tile_sizes: Dim](self, tile_coords: IndexList[rank, element_type=element_type]) -> NDBuffer[type, rank, origin, DimList(VariadicList(tile_sizes)), address_space=address_space]` Returns an n-d tile "slice" of the buffer of size tile\_sizes at coords. **Parameters:** * ​\*tile\_sizes (`Dim`): The size of the tiles. **Args:** * ​tile\_coords (`IndexList[rank, element_type=element_type]`): The tile index. **Returns:** The tiled buffer at tile\_coords. ### `load` `load[*, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self, *idx: Int) -> SIMD[type, width]` Loads a simd value from the buffer at the specified index. **Constraints:** The buffer must be contiguous or width must be 1. **Parameters:** * ​width (`Int`): The simd\_width of the load. * ​alignment (`Int`): The alignment value. **Args:** * ​\*idx (`Int`): The index into the NDBuffer. **Returns:** The simd value starting at the `idx` position and ending at `idx+width`. `load[*, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self, idx: VariadicList[Int]) -> SIMD[type, width]` Loads a simd value from the buffer at the specified index. **Constraints:** The buffer must be contiguous or width must be 1. **Parameters:** * ​width (`Int`): The simd\_width of the load. * ​alignment (`Int`): The alignment value. **Args:** * ​idx (`VariadicList[Int]`): The index into the NDBuffer. **Returns:** The simd value starting at the `idx` position and ending at `idx+width`. `load[*, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self, idx: IndexList[size, element_type=element_type]) -> SIMD[type, width]` Loads a simd value from the buffer at the specified index. **Constraints:** The buffer must be contiguous or width must be 1. **Parameters:** * ​width (`Int`): The simd\_width of the load. * ​alignment (`Int`): The alignment value. **Args:** * ​idx (`IndexList[size, element_type=element_type]`): The index into the NDBuffer. **Returns:** The simd value starting at the `idx` position and ending at `idx+width`. `load[*, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self, idx: StaticTuple[Int, rank]) -> SIMD[type, width]` Loads a simd value from the buffer at the specified index. **Constraints:** The buffer must be contiguous or width must be 1. **Parameters:** * ​width (`Int`): The simd\_width of the load. * ​alignment (`Int`): The alignment value. **Args:** * ​idx (`StaticTuple[Int, rank]`): The index into the NDBuffer. **Returns:** The simd value starting at the `idx` position and ending at `idx+width`. ### `store` `store[_alignment: Int, //, *, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self: NDBuffer[type, rank, origin, shape, strides, alignment=_alignment, address_space=address_space, exclusive=exclusive], idx: IndexList[rank, element_type=element_type], val: SIMD[type, width])` Stores a simd value into the buffer at the specified index. **Constraints:** The buffer must be contiguous or width must be 1. **Parameters:** * ​\_alignment (`Int`): The inferred alignment of self. * ​width (`Int`): The width of the simd vector. * ​alignment (`Int`): The alignment value. 
**Args:** * idx (`IndexList[rank, element_type=element_type]`): The index into the buffer. * val (`SIMD[type, width]`): The value to store. `store[_alignment: Int, //, *, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self: NDBuffer[type, rank, origin, shape, strides, alignment=_alignment, address_space=address_space, exclusive=exclusive], idx: StaticTuple[Int, rank], val: SIMD[type, width])` Stores a simd value into the buffer at the specified index. **Constraints:** The buffer must be contiguous or width must be 1. **Parameters:** * \_alignment (`Int`): The inferred alignment of self. * width (`Int`): The width of the simd vector. * alignment (`Int`): The alignment value. **Args:** * idx (`StaticTuple[Int, rank]`): The index into the buffer. * val (`SIMD[type, width]`): The value to store. ### `dim` `dim[index: Int](self) -> Int` Gets the buffer dimension at the given index. **Parameters:** * index (`Int`): The index of the dimension to get. **Returns:** The buffer size at the given dimension. `dim(self, index: Int) -> Int` Gets the buffer dimension at the given index. **Args:** * index (`Int`): The index of the dimension to get. **Returns:** The buffer size at the given dimension. ### `stride` `stride[index: Int](self) -> Int` Gets the buffer stride at the given index. **Parameters:** * index (`Int`): The index of the dimension to get the stride for. **Returns:** The stride at the given dimension. `stride(self, index: Int) -> Int` Gets the buffer stride at the given index. **Args:** * index (`Int`): The index of the dimension to get the stride for. **Returns:** The stride at the given dimension. ### `is_contiguous` `is_contiguous(self) -> Bool` Checks if the buffer is contiguous in memory. **Returns:** True if the buffer is contiguous in memory and False otherwise. ### `flatten` `flatten(self) -> NDBuffer[type, 1, origin, __init__[::Intable](shape.product()), address_space=address_space]` Constructs a flattened buffer counterpart for this NDBuffer. **Constraints:** The buffer must be contiguous. **Returns:** Constructed buffer object. ### `make_dims_unknown` `make_dims_unknown(self) -> NDBuffer[type, rank, origin, address_space=address_space]` Rebinds the NDBuffer to one with unknown shape. **Returns:** The rebound NDBuffer with unknown shape. ### `bytecount` `bytecount(self) -> Int` Returns the size of the NDBuffer in bytes. **Returns:** The size of the NDBuffer in bytes. ### `zero` `zero(self: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive])` Sets all bytes of the NDBuffer to 0. **Constraints:** The buffer must be contiguous. ### `tofile` `tofile(self, path: Path)` Writes values to a file. **Args:** * path (`Path`): Path to the output file. ### `fill` `fill(self: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], val: SIMD[type, 1])` Assigns val to all elements in the buffer. The fill is performed in chunks of size N, where N is the native SIMD width of type on the system. **Args:** * val (`SIMD[type, 1]`): The value to store. ### `stack_allocation` `static stack_allocation[*, alignment: Int = alignof[::DType,__mlir_type.!kgen.target]()]() -> Self` Constructs an NDBuffer instance backed by stack allocated memory space. **Parameters:** * alignment (`Int`): Address alignment requirement for the allocation. **Returns:** Constructed NDBuffer with the allocated space.
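For example, a small, fully static buffer can be placed on the stack with this method. A minimal sketch, assuming `MutableAnyOrigin` is an acceptable origin argument and that `DimList` is importable from the `buffer` package:

```mojo
from buffer import NDBuffer, DimList

def main():
    # A 2x3 float32 buffer backed by stack memory; the shape is fully static.
    var buf = NDBuffer[
        DType.float32, 2, MutableAnyOrigin, DimList(2, 3)
    ].stack_allocation()
    buf.zero()
    buf[0, 1] = 42.0
    print(buf[0, 1])  # 42.0
```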
### `prefetch` `prefetch[params: PrefetchOptions](self, *idx: Int)` Prefetches the data at the given index. **Parameters:** * params (`PrefetchOptions`): The prefetch configuration. **Args:** * \*idx (`Int`): The N-D index of the prefetched location. `prefetch[params: PrefetchOptions](self, indices: IndexList[rank])` Prefetches the data at the given index. **Parameters:** * params (`PrefetchOptions`): The prefetch configuration. **Args:** * indices (`IndexList[rank]`): The N-D index of the prefetched location. --- ## buffer Implements the NDBuffer struct. You can import these APIs from the `buffer` package. For example:

```mojo
from buffer import NDBuffer
```

## Structs * [`NDBuffer`](/mojo/stdlib/buffer/buffer/NDBuffer): An N-dimensional buffer. ## Functions * [`partial_simd_load`](/mojo/stdlib/buffer/buffer/partial_simd_load): Loads a vector with dynamic bound. * [`partial_simd_store`](/mojo/stdlib/buffer/buffer/partial_simd_store): Stores a vector with dynamic bound. * [`prod_dims`](/mojo/stdlib/buffer/buffer/prod_dims): Computes the product of a slice of the given buffer's dimensions. --- ## partial_simd_load `partial_simd_load[type: DType, //, width: Int](storage: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], lbound: Int, rbound: Int, pad_value: SIMD[type, 1]) -> SIMD[type, width]` Loads a vector with dynamic bound. Out of bound data will be filled with the pad value. Data is valid if `lbound <= idx < rbound`. **Parameters:** * type (`DType`): The DType of storage. * width (`Int`): The system simd vector size. **Args:** * storage (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the address to perform load. * lbound (`Int`): Lower bound of valid index within simd (inclusive). * rbound (`Int`): Upper bound of valid index within simd (non-inclusive). * pad\_value (`SIMD[type, 1]`): Value to fill for out of bound indices. **Returns:** The loaded SIMD vector, with out-of-bound lanes filled with `pad_value`. --- ## partial_simd_store `partial_simd_store[type: DType, //, width: Int](storage: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], lbound: Int, rbound: Int, data: SIMD[type, width])` Stores a vector with dynamic bound. Out of bound data will be ignored. Data is valid if `lbound <= idx < rbound`. **Parameters:** * type (`DType`): The DType of storage. * width (`Int`): The system simd vector size. **Args:** * storage (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the address to perform the store. * lbound (`Int`): Lower bound of valid index within simd (inclusive). * rbound (`Int`): Upper bound of valid index within simd (non-inclusive). * data (`SIMD[type, width]`): The vector value to store. --- ## prod_dims `prod_dims[start_dim: Int, end_dim: Int](x: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> Int` Computes the product of a slice of the given buffer's dimensions. **Parameters:** * start\_dim (`Int`): The index at which to begin computing the product. * end\_dim (`Int`): The index at which to stop computing the product. **Args:** * x (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The NDBuffer whose dimensions will be multiplied. **Returns:** The product of the specified slice of the buffer's dimensions.
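For example, a bounds-guarded edge load might look like the following minimal sketch; the lane semantics (lane `i` reads `storage[i]` only when `lbound <= i < rbound`, and other lanes receive `pad_value`) are assumed from the descriptions above:

```mojo
from buffer import partial_simd_load
from memory import UnsafePointer

def main():
    var data = UnsafePointer[Float32].alloc(4)
    for i in range(4):
        data[i] = Float32(i)
    # Lanes 0..3 map to data[0..3]; only lanes 1 and 2 are valid here,
    # so lanes 0 and 3 are filled with the pad value -1.0.
    var v = partial_simd_load[4](data, 1, 3, -1.0)
    print(v)  # [-1.0, 1.0, 2.0, -1.0]
    data.free()
```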
--- ## Dim `@register_passable(trivial)` `struct Dim` A static or dynamic dimension modeled with an optional integer. This class is meant to represent an optional static dimension. When a value is present, the dimension has that static value. When a value is not present, the dimension is dynamic. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `EqualityComparable`, `ImplicitlyBoolable`, `Indexer`, `Intable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `@implicit` `__init__[I: Intable](value: I) -> Self` Creates a statically-known dimension. **Parameters:** * ​I (`Intable`): The Intable type. **Args:** * ​value (`I`): The static dimension value. `@implicit` `__init__[I: Indexer](value: I) -> Self` Creates a statically-known dimension. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​value (`I`): The static dimension value. `@implicit` `__init__(value: index) -> Self` Creates a statically-known dimension. **Args:** * ​value (`index`): The static dimension value. `@implicit` `__init__(value: Int) -> Self` Creates a statically-known dimension. **Args:** * ​value (`Int`): The static dimension value. `__init__() -> Self` Creates a dynamic dimension with no static value. ### `__bool__` `__bool__(self) -> Bool` Returns True if the dimension has a static value. **Returns:** Whether the dimension has a static value. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compares two dimensions for equality. **Args:** * ​rhs (`Self`): The other dimension. **Returns:** True if the dimensions are the same. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Compare two dimensions for inequality. **Args:** * ​rhs (`Self`): The dimension to compare. **Returns:** True if they are not equal. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Multiplies two dimensions. If either are unknown, the result is unknown as well. **Args:** * ​rhs (`Self`): The other dimension. **Returns:** The product of the two dimensions. ### `__floordiv__` `__floordiv__(self, rhs: Self) -> Self` Divide by the given dimension and round towards negative infinity. If either are unknown, the result is unknown as well. **Args:** * ​rhs (`Self`): The divisor dimension. **Returns:** The floor division of the two dimensions. ### `__rfloordiv__` `__rfloordiv__(self, rhs: Self) -> Self` Divide the given argument by self and round towards negative infinity. If either are unknown, the result is unknown as well. **Args:** * ​rhs (`Self`): The dimension to divide by this Dim. **Returns:** The floor of the argument divided by self. ### `__imul__` `__imul__(mut self, rhs: Self)` Inplace multiplies two dimensions. If either are unknown, the result is unknown as well. **Args:** * ​rhs (`Self`): The other dimension. ### `__as_bool__` `__as_bool__(self) -> Bool` Returns True if the dimension has a static value. **Returns:** Whether the dimension has a static value. ### `has_value` `has_value(self) -> Bool` Returns True if the dimension has a static value. **Returns:** Whether the dimension has a static value. ### `is_dynamic` `is_dynamic(self) -> Bool` Returns True if the dimension has a dynamic value. **Returns:** Whether the dimension is dynamic. ### `get` `get(self) -> Int` Gets the static dimension value. **Returns:** The static dimension value. ### `is_multiple` `is_multiple[alignment: Int](self) -> Bool` Checks if the dimension is aligned. **Parameters:** * ​alignment (`Int`): The alignment requirement. **Returns:** Whether the dimension is aligned. 
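For example, a minimal sketch of how static and dynamic dimensions behave:

```mojo
from buffer import Dim

def main():
    var static_dim = Dim(42)  # statically-known dimension
    var dynamic_dim = Dim()   # dynamic dimension: no static value
    print(static_dim.has_value())    # True
    print(dynamic_dim.is_dynamic())  # True
    # Arithmetic propagates "unknown": static * dynamic is dynamic.
    print((static_dim * dynamic_dim).has_value())  # False
```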
### `__index__` `__index__(self) -> index` Convert to index. **Returns:** The corresponding \_\_mlir\_type.index value. ### `__int__` `__int__(self) -> Int` Gets the static dimension value. **Returns:** The static dimension value. ### `__str__` `__str__(self) -> String` Converts the Dim to a String. If the value is unknown, then the string "?" is returned. **Returns:** The string representation of the type. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this Dim to the provided Writer. **Parameters:** * W (`Writer`): A type conforming to the Writable trait. **Args:** * writer (`W`): The object to write to. ### `or_else` `or_else(self, default: Int) -> Int` Return the underlying value contained in the Optional or a default value if the Optional's underlying value is not present. **Args:** * default (`Int`): The new value to use if no value was present. **Returns:** The underlying value contained in the Optional or a default value. --- ## DimList `@register_passable(trivial)` `struct DimList` This type represents a list of dimensions. Each dimension may have a static value or not have a value, which represents a dynamic dimension. ## Fields * value (`VariadicList[Dim]`): The underlying storage for the list of dimensions. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Representable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `@implicit` `__init__[Intable: Intable](value: Intable) -> Self` Creates a dimension list from the given value. **Parameters:** * Intable (`Intable`): A type able to be converted to an `Int`. **Args:** * value (`Intable`): The initial dim value. `@implicit` `__init__[I: Indexer](values: Tuple[I]) -> Self` Creates a dimension list from the given list of values. **Parameters:** * I (`Indexer`): A type that can be used as an index. **Args:** * values (`Tuple[I]`): The initial dim values list. `@implicit` `__init__[I0: Indexer, I1: Indexer](values: Tuple[I0, I1]) -> Self` Creates a dimension list from the given list of values. **Parameters:** * I0 (`Indexer`): A type that can be used as an Index. * I1 (`Indexer`): A type that can be used as an Index. **Args:** * values (`Tuple[I0, I1]`): The initial dim values list. `@implicit` `__init__[I0: Indexer, I1: Indexer, I2: Indexer](values: Tuple[I0, I1, I2]) -> Self` Creates a dimension list from the given list of values. **Parameters:** * I0 (`Indexer`): A type that can be used as an Index. * I1 (`Indexer`): A type that can be used as an Index. * I2 (`Indexer`): A type that can be used as an Index. **Args:** * values (`Tuple[I0, I1, I2]`): The initial dim values list. `__init__[I0: Indexer, I1: Indexer](val0: I0, val1: I1) -> Self` Creates a dimension list from the given list of values. **Parameters:** * I0 (`Indexer`): A type that can be used as an Index. * I1 (`Indexer`): A type that can be used as an Index. **Args:** * val0 (`I0`): The initial dim value. * val1 (`I1`): The initial dim value. `__init__[I0: Indexer, I1: Indexer, I2: Indexer](val0: I0, val1: I1, val2: I2) -> Self` Creates a dimension list from the given list of values. **Parameters:** * I0 (`Indexer`): A type that can be used as an Index. * I1 (`Indexer`): A type that can be used as an Index. * I2 (`Indexer`): A type that can be used as an Index. **Args:** * val0 (`I0`): The initial dim value. * val1 (`I1`): The initial dim value. * val2 (`I2`): The initial dim value.
`__init__[I0: Indexer, I1: Indexer, I2: Indexer, I3: Indexer](val0: I0, val1: I1, val2: I2, val3: I3) -> Self` Creates a dimension list from the given values. **Parameters:** * I0 (`Indexer`): A type that can be used as an Index. * I1 (`Indexer`): A type that can be used as an Index. * I2 (`Indexer`): A type that can be used as an Index. * I3 (`Indexer`): A type that can be used as an Index. **Args:** * val0 (`I0`): The initial dim value. * val1 (`I1`): The initial dim value. * val2 (`I2`): The initial dim value. * val3 (`I3`): The initial dim value. `@implicit` `__init__(values: VariadicList[Dim]) -> Self` Creates a dimension list from the given list of values. **Args:** * values (`VariadicList[Dim]`): The initial dim values list. `@implicit` `__init__(*values: Dim) -> Self` Creates a dimension list from the given Dim values. **Args:** * \*values (`Dim`): The initial dim values. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compares two DimLists for equality. DimLists are considered equal if all non-dynamic Dims have the same values and all dynamic Dims in self are also dynamic in rhs. **Args:** * rhs (`Self`): The other DimList. **Returns:** True if the DimLists are the same. ### `__len__` `__len__(self) -> Int` Gets the size of the DimList. **Returns:** The number of elements in the DimList. ### `get` `get[i: Int](self) -> Int` Gets the static dimension value at a specified index. **Parameters:** * i (`Int`): The dimension index. **Returns:** The static dimension value at the specified index. ### `at` `at[i: Int](self) -> Dim` Gets the dimension at a specified index. **Parameters:** * i (`Int`): The dimension index. **Returns:** The dimension at the specified index. ### `has_value` `has_value[i: Int](self) -> Bool` Returns True if the dimension at the given index has a static value. **Parameters:** * i (`Int`): The dimension index. **Returns:** Whether the specified dimension has a static value. ### `product` `product[length: Int](self) -> Dim` Computes the product of the first `length` dimensions in the list. If any are dynamic, the result is a dynamic dimension value. **Parameters:** * length (`Int`): The number of elements in the list. **Returns:** The product of the first `length` dimensions. `product[start: Int, end: Int](self) -> Dim` Computes the product of a range of the dimensions in the list. If any in the range are dynamic, the result is a dynamic dimension value. **Parameters:** * start (`Int`): The starting index. * end (`Int`): The end index. **Returns:** The product of the dimensions in the range \[start, end). `product(self) -> Dim` Computes the product of all the dimensions in the list. If any are dynamic, the result is a dynamic dimension value. **Returns:** The product of all the dimensions. ### `contains` `contains[length: Int](self, value: Dim) -> Bool` Determines whether the dimension list contains a specified dimension value. **Parameters:** * length (`Int`): The number of elements in the list. **Args:** * value (`Dim`): The value to find. **Returns:** True if the list contains a dimension of the specified value. ### `all_known` `all_known[length: Int](self) -> Bool` Determines whether all dimensions are statically known. **Parameters:** * length (`Int`): The number of elements in the list. **Returns:** True if all dimensions have a static value. `all_known[start: Int, end: Int](self) -> Bool` Determines whether all dimensions within \[start, end) are statically known. **Parameters:** * start (`Int`): The first queried dimension. * end (`Int`): The last queried dimension.
**Returns:** True if all queried dimensions have a static value. ### `into_index_list` `into_index_list[rank: Int](self) -> IndexList[rank]` Copy the DimList values into an `IndexList`, providing the rank.

```mojo
from buffer import DimList

var dim_list = DimList(2, 4)
var index_list = dim_list.into_index_list[rank=2]()
```

**Parameters:** * rank (`Int`): The rank of the output IndexList. **Returns:** An IndexList with the same dimensions as the DimList. ### `create_unknown` `static create_unknown[length: Int]() -> Self` Creates a dimension list of all dynamic dimension values. **Parameters:** * length (`Int`): The number of elements in the list. **Returns:** A list of all dynamic dimension values. ### `__str__` `__str__(self) -> String` Converts the DimList to a String. The String is a comma separated list of the string representation of Dim. **Returns:** The string representation of the type. ### `__repr__` `__repr__(self) -> String` Converts the DimList to a readable String representation. **Returns:** The string representation of the type. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this DimList to the provided Writer. **Parameters:** * W (`Writer`): A type conforming to the Writable trait. **Args:** * writer (`W`): The object to write to. --- ## dimlist Provides utilities for working with static and variadic lists. You can import these APIs from the `buffer` package. For example:

```mojo
from buffer import Dim
```

## Structs * [`Dim`](/mojo/stdlib/buffer/dimlist/Dim): A static or dynamic dimension modeled with an optional integer. * [`DimList`](/mojo/stdlib/buffer/dimlist/DimList): This type represents a list of dimensions. Each dimension may have a static value or not have a value, which represents a dynamic dimension. --- ## buffer Implements the buffer package. ## Modules * [`buffer`](/mojo/stdlib/buffer/buffer/): Implements the NDBuffer struct. * [`dimlist`](/mojo/stdlib/buffer/dimlist/): Provides utilities for working with static and variadic lists. --- ## AnyType A trait for types that require lifetime management through destructors. The `AnyType` trait is fundamental to Mojo's memory management system. It indicates that a type has a destructor that needs to be called when instances go out of scope. This is essential for types that own resources like memory, file handles, or other system resources that need proper cleanup. Key aspects: * Any type with a destructor must implement this trait * The destructor (`__del__`) is called automatically when an instance's lifetime ends * Composition of types with destructors automatically gets a destructor * All Mojo structs and traits inherit from `AnyType` by default unless they specify `@explicit_destroy` Example:

```mojo
struct ResourceOwner(AnyType):
    var ptr: UnsafePointer[Int]

    fn __init__(out self, size: Int):
        self.ptr = UnsafePointer[Int].alloc(size)

    fn __del__(owned self):
        # Clean up owned resources
        self.ptr.free()
```

Best practices: * Implement this trait when your type owns resources that need cleanup * Ensure the destructor properly frees all owned resources * Consider using `@explicit_destroy` for types that should never have destructors * Use composition to automatically handle nested resource cleanup ## Implemented traits `UnknownDestructibility` ## Methods ### `__del__` `__del__(owned self: _Self, /)` Destroys the instance and cleans up any owned resources. This method is called automatically when an instance's lifetime ends.
It receives an owned value and should perform all necessary cleanup operations like: * Freeing allocated memory * Closing file handles * Releasing system resources * Cleaning up any other owned resources The instance is considered dead after this method completes, regardless of whether any explicit cleanup was performed. --- ## UnknownDestructibility The most basic trait that all Mojo types extend by default. This trait indicates that a type has no destructor and therefore no lifetime management. It is the default for all types unless they explicitly implement `AnyType` or `ImplicitlyDestructible`. Types with this trait: * Have no `__del__` method * Do not perform any cleanup when they go out of scope * Are suitable for simple value types that don't own resources For types that need cleanup when they are destroyed, use `ImplicitlyDestructible` or `AnyType` instead. --- ## anytype Defines the core traits for object lifetime management in Mojo. This module provides the foundational traits that define how objects are created, managed and destroyed in Mojo: * `UnknownDestructibility`: The most basic trait that all types extend by default. Types with this trait have no destructor and no lifetime management. * `AnyType`: The base trait for types that require lifetime management through destructors. Any type that needs cleanup when it goes out of scope should implement this trait. * `ImplicitlyDestructible`: An alias for `AnyType` to help with the transition to linear types. Use this when you want to be explicit about a type having a destructor. These traits are built into Mojo and do not need to be imported. ## Aliases ### `ImplicitlyDestructible` `alias ImplicitlyDestructible = AnyType` ## Traits * [​`AnyType`](/mojo/stdlib/builtin/anytype/AnyType): A trait for types that require lifetime management through destructors. * [​`UnknownDestructibility`](/mojo/stdlib/builtin/anytype/UnknownDestructibility): The most basic trait that all Mojo types extend by default. --- ## Bool `@register_passable(trivial)` `struct Bool` The primitive Bool scalar value used in Mojo. ## Fields * ​value (`i1`): The underlying storage of the boolean value. ## Implemented traits `AnyType`, `Boolable`, `ConvertibleFromPython`, `Copyable`, `Defaultable`, `EqualityComparable`, `ExplicitlyCopyable`, `Floatable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `ImplicitlyBoolable`, `ImplicitlyIntable`, `Indexer`, `Intable`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `PythonConvertible`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Aliases ### `MAX` `alias MAX = __init__[::Boolable](True)` The maximum value of a Bool. ### `MIN` `alias MIN = __init__[::Boolable](False)` The minimum value of a Bool. ## Methods ### `__init__` `__init__() -> Self` Construct a default, `False` Bool. `@implicit` `__init__[T: ImplicitlyBoolable, //](value: T) -> Self` Convert an ImplicitlyBoolable value to a Bool. **Parameters:** * ​T (`ImplicitlyBoolable`): The ImplicitlyBoolable type. **Args:** * ​value (`T`): The boolable value. `__init__[T: Boolable, //](value: T) -> Self` Set the bool representation of the object. **Parameters:** * ​T (`Boolable`): The type of the object. **Args:** * ​value (`T`): The object to get the bool representation of. `__init__(value: None) -> Self` Set the bool representation of the `None` type to `False`. **Args:** * ​value (`None`): The object to get the bool representation of. 
`@implicit` `__init__(value: SIMD[bool, 1]) -> Self` Convert a scalar SIMD value to a Bool. **Args:** * ​value (`SIMD[bool, 1]`): The scalar value. ### `__bool__` `__bool__(self) -> Self` Convert to Bool. **Returns:** This value. ### `__neg__` `__neg__(self) -> Int` Defines the unary `-` operation. **Returns:** 0 for False and -1 for True. ### `__invert__` `__invert__(self) -> Self` Inverts the Bool value. **Returns:** True if the object is false and False otherwise. ### `__lt__` `__lt__(self, rhs: Self) -> Self` Compare this Bool to RHS using less-than comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** True if self is False and rhs is True. ### `__le__` `__le__(self, rhs: Self) -> Self` Compare this Bool to RHS using less-than-or-equal comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** True if self is False or rhs is True. ### `__eq__` `__eq__(self, rhs: Self) -> Self` Compare this Bool to RHS. Performs an equality comparison between the Bool value and the argument. This method gets invoked when a user uses the `==` infix operator. **Args:** * ​rhs (`Self`): The rhs value of the equality statement. **Returns:** True if the two values match and False otherwise. ### `__ne__` `__ne__(self, rhs: Self) -> Self` Compare this Bool to RHS. Performs a non-equality comparison between the Bool value and the argument. This method gets invoked when a user uses the `!=` infix operator. **Args:** * ​rhs (`Self`): The rhs value of the non-equality statement. **Returns:** False if the two values do match and True otherwise. ### `__gt__` `__gt__(self, rhs: Self) -> Self` Compare this Bool to RHS using greater-than comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** True if self is True and rhs is False. ### `__ge__` `__ge__(self, rhs: Self) -> Self` Compare this Bool to RHS using greater-than-or-equal comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** True if self is True or rhs is False. ### `__and__` `__and__(self, rhs: Self) -> Self` Returns `self & rhs`. Bitwise and's the Bool value with the argument. This method gets invoked when a user uses the `and` infix operator. **Args:** * ​rhs (`Self`): The right hand side of the `and` statement. **Returns:** `self & rhs`. ### `__or__` `__or__(self, rhs: Self) -> Self` Returns `self | rhs`. Bitwise or's the Bool value with the argument. This method gets invoked when a user uses the `or` infix operator. **Args:** * ​rhs (`Self`): The right hand side of the `or` statement. **Returns:** `self | rhs`. ### `__xor__` `__xor__(self, rhs: Self) -> Self` Returns `self ^ rhs`. Bitwise Xor's the Bool value with the argument. This method gets invoked when a user uses the `^` infix operator. **Args:** * ​rhs (`Self`): The right hand side of the `xor` statement. **Returns:** `self ^ rhs`. ### `__rand__` `__rand__(self, lhs: Self) -> Self` Returns `lhs & self`. **Args:** * ​lhs (`Self`): The left hand side of the `and` statement. **Returns:** `lhs & self`. ### `__ror__` `__ror__(self, lhs: Self) -> Self` Returns `lhs | self`. **Args:** * ​lhs (`Self`): The left hand side of the `or` statement. **Returns:** `lhs | self`. ### `__rxor__` `__rxor__(self, lhs: Self) -> Self` Returns `lhs ^ self`. **Args:** * ​lhs (`Self`): The left hand side of the `xor` statement. **Returns:** `lhs ^ self`. ### `__iand__` `__iand__(mut self, rhs: Self)` Computes `self & rhs` and stores the result in `self`.
**Args:** * ​rhs (`Self`): The right hand side of the `and` statement. ### `__ixor__` `__ixor__(mut self, rhs: Self)` Computes `self ^ rhs` and stores the result in `self`. **Args:** * ​rhs (`Self`): The right hand side of the `xor` statement. ### `__ior__` `__ior__(mut self, rhs: Self)` Computes `self | rhs` and stores the result in `self`. **Args:** * ​rhs (`Self`): The right hand side of the `or` statement. ### `copy` `copy(self) -> Self` Explicitly construct a deep copy of the provided value. **Returns:** A copy of the value. ### `__as_bool__` `__as_bool__(self) -> Self` Convert to Bool. **Returns:** This value. ### `__str__` `__str__(self) -> String` Get the bool as a string. Returns `"True"` or `"False"`. **Returns:** A string representation. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this boolean to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__repr__` `__repr__(self) -> String` Get the bool as a string. Returns `"True"` or `"False"`. **Returns:** A string representation. ### `__int__` `__int__(self) -> Int` Convert this Bool to an integer. **Returns:** 1 if the Bool is True, 0 otherwise. ### `__as_int__` `__as_int__(self) -> Int` Implicitly convert to an integral representation of the value, wherever an `Int` is expected. **Returns:** The integral representation of the value. ### `__index__` `__index__(self) -> index` Convert to index. **Returns:** 1 if the Bool is True, 0 otherwise. ### `__float__` `__float__(self) -> SIMD[float64, 1]` Convert this Bool to a float. **Returns:** 1.0 if True, 0.0 otherwise. ### `__hash__` `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with the underlying bytes. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `to_python_object` `to_python_object(owned self) -> PythonObject` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. --- ## Boolable The `Boolable` trait describes a type that can be explicitly converted to a `Bool` or evaluated as a boolean expression in `if` or `while` conditions. This trait requires the type to implement the `__bool__()` method. For example: ```mojo struct Foo(Boolable): var val: Bool fn __bool__(self) -> Bool: return self.val ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__bool__` `__bool__(self: _Self) -> Bool` Get the boolean representation of the value. **Returns:** The boolean representation of the value. --- ## ImplicitlyBoolable The `ImplicitlyBoolable` trait describes a type that can be implicitly converted to a `Bool`. Types conforming to this trait can be passed to a function that expects a `Bool` without explicitly converting to it. Accordingly, most types should conform to `Boolable` instead, since implicit conversions to `Bool` can have unintuitive consequences. This trait requires the type to implement the `__as_bool__()` method. For example: ```mojo struct Foo(ImplicitlyBoolable): var val: Bool fn __as_bool__(self) -> Bool: return self.val fn __bool__(self) -> Bool: return self.__as_bool__() ``` ## Implemented traits `AnyType`, `Boolable`, `UnknownDestructibility` ## Methods ### `__bool__` `__bool__(self: _Self) -> Bool` Get the boolean representation of the value. **Returns:** The boolean representation of the value. ### `__as_bool__` `__as_bool__(self: _Self) -> Bool` Get the boolean representation of the value.
**Returns:** The boolean representation of the value. --- ## all `all[T: Boolable & Copyable & Movable, //](list: List[T, hint_trivial_type]) -> Bool` Checks if **all** elements in the list are truthy. **Parameters:** * ​T (`Boolable & Copyable & Movable`): The type of elements to check. **Args:** * ​list (`List[T, hint_trivial_type]`): The list to check. **Returns:** `True` if **all** elements in the list are truthy, `False` otherwise. `all[T: Boolable & Copyable & Movable & Hashable & EqualityComparable, //](set: Set[T]) -> Bool` Checks if **all** elements in the set are truthy. **Parameters:** * ​T (`Boolable & Copyable & Movable & Hashable & EqualityComparable`): The type of elements to check. **Args:** * ​set (`Set[T]`): The set to check. **Returns:** `True` if **all** elements in the set are truthy, `False` otherwise. `all(value: SIMD[dtype, size]) -> Bool` Checks if **all** elements in the simd vector are truthy. **Args:** * ​value (`SIMD[dtype, size]`): The simd vector to check. **Returns:** `True` if **all** elements in the simd vector are truthy, `False` otherwise. --- ## any `any[T: Boolable & Copyable & Movable, //](list: List[T, hint_trivial_type]) -> Bool` Checks if **any** element in the list is truthy. **Parameters:** * ​T (`Boolable & Copyable & Movable`): The type of elements to check. **Args:** * ​list (`List[T, hint_trivial_type]`): The list to check. **Returns:** `True` if **any** element in the list is truthy, `False` otherwise. `any[T: Boolable & Copyable & Movable & Hashable & EqualityComparable, //](set: Set[T]) -> Bool` Checks if **any** element in the set is truthy. **Parameters:** * ​T (`Boolable & Copyable & Movable & Hashable & EqualityComparable`): The type of elements to check. **Args:** * ​set (`Set[T]`): The set to check. **Returns:** `True` if **any** element in the set is truthy, `False` otherwise. `any(value: SIMD[dtype, size]) -> Bool` Checks if **any** element in the simd vector is truthy. **Args:** * ​value (`SIMD[dtype, size]`): The simd vector to check. **Returns:** `True` if **any** element in the simd vector is truthy, `False` otherwise. --- ## bool Implements the Bool class. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`Bool`](/mojo/stdlib/builtin/bool/Bool): The primitive Bool scalar value used in Mojo. ## Traits * [​`Boolable`](/mojo/stdlib/builtin/bool/Boolable): The `Boolable` trait describes a type that can be explicitly converted to a `Bool` or evaluated as a boolean expression in `if` or `while` conditions. * [​`ImplicitlyBoolable`](/mojo/stdlib/builtin/bool/ImplicitlyBoolable): The `ImplicitlyBoolable` trait describes a type that can be implicitly converted to a `Bool`. ## Functions * [​`all`](/mojo/stdlib/builtin/bool/all): Checks if **all** elements in the list are truthy. * [​`any`](/mojo/stdlib/builtin/bool/any): Checks if **any** element in the list is truthy. --- ## breakpoint `breakpoint()` Cause an execution trap with the intention of requesting the attention of a debugger. --- ## breakpoint This module includes the builtin breakpoint function. ## Functions * [​`breakpoint`](/mojo/stdlib/builtin/breakpoint/breakpoint): Cause an execution trap with the intention of requesting the attention of a debugger. --- ## Slice `struct Slice` Represents a slice expression. Objects of this type are generated when slice syntax is used within square brackets, e.g.: ```mojo var msg: String = "Hello Mojo" # Both are equivalent and print "Mojo". 
print(msg[6:]) print(msg.__getitem__(Slice(6, len(msg)))) ``` ## Fields * ​start (`Optional[Int]`): The starting index of the slice. * ​end (`Optional[Int]`): The end index of the slice. * ​step (`Optional[Int]`): The step increment value of the slice. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Movable`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self, start: Int, end: Int)` Construct slice given the start and end values. **Args:** * ​start (`Int`): The start value. * ​end (`Int`): The end value. `__init__(out self, start: Optional[Int], end: Optional[Int], step: Optional[Int])` Construct slice given the start, end, and step values. **Args:** * ​start (`Optional[Int]`): The start value. * ​end (`Optional[Int]`): The end value. * ​step (`Optional[Int]`): The step value. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compare this slice to the other. **Args:** * ​other (`Self`): The slice to compare to. **Returns:** True if start, end, and step values of this slice match the corresponding values of the other slice and False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compare this slice to the other. **Args:** * ​other (`Self`): The slice to compare to. **Returns:** False if start, end, and step values of this slice match the corresponding values of the other slice and True otherwise. ### `copy` `copy(self) -> Self` Creates a deep copy of the Slice. **Returns:** A copy of the value. ### `__str__` `__str__(self) -> String` Gets the string representation of the slice. **Returns:** The string representation of the slice. ### `__repr__` `__repr__(self) -> String` Gets the string representation of the slice. **Returns:** The string representation of the slice. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Write Slice string representation to a `Writer`. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `indices` `indices(self, length: Int) -> Tuple[Int, Int, Int]` Returns a tuple of 3 integers representing the start, end, and step of the slice if applied to a container of the given length. Uses the target container length to normalize negative, out of bounds, or None indices. Negative indices are wrapped using the length of the container. ```mojo s = slice(0, -1, 1) i = s.indices(5) # returns (0, 4, 1) ``` None indices are defaulted to the start or the end of the container based on whether `step` is positive or negative. ```mojo s = slice(None, None, 1) i = s.indices(5) # returns (0, 5, 1) ``` Out of bounds indices are clamped using the size of the container. ```mojo s = slice(20) i = s.indices(5) # returns (0, 5, 1) ``` **Args:** * ​length (`Int`): The length of the target container. **Returns:** A tuple containing three integers for start, end, and step. --- ## builtin_slice Implements slice. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`Slice`](/mojo/stdlib/builtin/builtin_slice/Slice): Represents a slice expression. ## Functions * [​`slice`](/mojo/stdlib/builtin/builtin_slice/slice-function): Construct slice given the end value. --- ## slice `slice(end: Int) -> Slice` Construct slice given the end value. **Args:** * ​end (`Int`): The end value. **Returns:** The constructed slice. `slice(start: Int, end: Int) -> Slice` Construct slice given the start and end values. **Args:** * ​start (`Int`): The start value. * ​end (`Int`): The end value.
**Returns:** The constructed slice. `slice(start: Optional[Int], end: Optional[Int], step: Optional[Int]) -> Slice` Construct a Slice given the start, end, and step values. **Args:** * ​start (`Optional[Int]`): The start value. * ​end (`Optional[Int]`): The end value. * ​step (`Optional[Int]`): The step value. **Returns:** The constructed slice. --- ## GreaterThanComparable A type which supports greater-than comparison with other instances of itself. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__gt__` `__gt__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is greater than `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is greater than `rhs`. --- ## GreaterThanOrEqualComparable A type which supports greater-than-or-equal comparison with other instances of itself. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__ge__` `__ge__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is greater than or equal to `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is greater than or equal to `rhs`. --- ## LessThanComparable A type which supports less-than comparison with other instances of itself. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__lt__` `__lt__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is less than `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is less than `rhs`. --- ## LessThanOrEqualComparable A type which supports less-than-or-equal comparison with other instances of itself. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__le__` `__le__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is less than or equal to `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is less than or equal to `rhs`. --- ## comparable ## Aliases ### `Comparable` `alias Comparable = EqualityComparable & LessThanComparable & GreaterThanComparable & LessThanOrEqualComparable & GreaterThanOrEqualComparable` A type which can be compared with other instances of itself. ## Traits * [​`GreaterThanComparable`](/mojo/stdlib/builtin/comparable/GreaterThanComparable): A type which supports greater-than comparison with other instances of itself. * [​`GreaterThanOrEqualComparable`](/mojo/stdlib/builtin/comparable/GreaterThanOrEqualComparable): A type which supports greater-than-or-equal comparison with other instances of itself. * [​`LessThanComparable`](/mojo/stdlib/builtin/comparable/LessThanComparable): A type which supports less-than comparison with other instances of itself. * [​`LessThanOrEqualComparable`](/mojo/stdlib/builtin/comparable/LessThanOrEqualComparable): A type which supports less-than-or-equal comparison with other instances of itself. --- ## constrained `constrained[cond: Bool, msg: StringSlice[StaticConstantOrigin], *extra: StringSlice[StaticConstantOrigin]]()` Asserts that the condition must be true at compile time. The `constrained()` function introduces a compile-time constraint on the enclosing function. If the condition is true at compile time, the constraint has no effect. If the condition is false, compilation fails and the message is displayed. This is similar to `static_assert` in C++. It differs from [`debug_assert()`](/mojo/stdlib/builtin/debug_assert/debug_assert), which is a run-time assertion.
Example: ```mojo fn half[dtype: DType](a: Scalar[dtype]) -> Scalar[dtype]: constrained[ dtype.is_numeric(), "dtype must be numeric." ]() return a / 2 def main(): print(half(UInt8(5))) # prints 2 print(half(Scalar[DType.bool](True))) # constraint failed: # dtype must be numeric. ``` **Parameters:** * ​cond (`Bool`): The bool value to assert. * ​msg (`StringSlice[StaticConstantOrigin]`): The message to display on failure. * ​\*extra (`StringSlice[StaticConstantOrigin]`): Additional messages to concatenate to msg. `constrained[cond: Bool]()` Asserts that the condition must be true at compile time. The `constrained()` function introduces a compile-time constraint on the enclosing function. If the condition is true at compile time, the constraint has no effect. If the condition is false, compilation fails and a generic message is displayed. This is similar to `static_assert` in C++. It differs from [`debug_assert()`](/mojo/stdlib/builtin/debug_assert/debug_assert), which is a run-time assertion. For an example, see the [first overload](/mojo/stdlib/builtin/constrained/constrained). **Parameters:** * ​cond (`Bool`): The bool value to assert. --- ## constrained Implements compile-time constraints. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`constrained`](/mojo/stdlib/builtin/constrained/constrained): Asserts that the condition must be true at compile time. --- ## Coroutine `@register_passable` `struct Coroutine[type: AnyType, origins: origin.set]` Represents a coroutine. Coroutines can pause execution, saving the state of the program (including values of local variables and the location of the next instruction to be executed). When the coroutine is resumed, execution continues from where it left off, with the saved state restored. ## Parameters * ​type (`AnyType`): Type of value returned upon completion of the coroutine. * ​origins (`origin.set`): The origin of the coroutine's captures. ## Implemented traits `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(handle: !co.routine) -> Self` Construct a coroutine object from a handle. **Args:** * ​handle (`!co.routine`): The init handle. ### `__await__` `__await__(owned self, out result: type)` Suspends the current coroutine until the coroutine is complete. **Returns:** The coroutine promise. ### `force_destroy` `force_destroy(owned self)` Destroy the coroutine object. --- ## RaisingCoroutine `@register_passable` `struct RaisingCoroutine[type: AnyType, origins: origin.set]` Represents a coroutine that can raise. Coroutines can pause execution, saving the state of the program (including values of local variables and the location of the next instruction to be executed). When the coroutine is resumed, execution continues from where it left off, with the saved state restored. ## Parameters * ​type (`AnyType`): Type of value returned upon completion of the coroutine. * ​origins (`origin.set`): The origin set of the coroutine's captures. ## Implemented traits `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(handle: !co.routine) -> Self` Construct a coroutine object from a handle. **Args:** * ​handle (`!co.routine`): The init handle. ### `__await__` `__await__(owned self, out result: type)` Suspends the current coroutine until the coroutine is complete. **Returns:** The coroutine promise. ### `force_destroy` `force_destroy(owned self)` Destroy the coroutine object. --- ## coroutine Implements classes and methods for coroutines.
These are Mojo built-ins, so you don't need to import them. ## Aliases ### `AnyCoroutine` `alias AnyCoroutine = !co.routine` ## Structs * [​`Coroutine`](/mojo/stdlib/builtin/coroutine/Coroutine): Represents a coroutine. * [​`RaisingCoroutine`](/mojo/stdlib/builtin/coroutine/RaisingCoroutine): Represents a coroutine that can raise. --- ## debug_assert `debug_assert[: origin.set, //, cond: fn() capturing -> Bool, assert_mode: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("none"), *Ts: Writable = *?, *, cpu_only: Bool = False](*messages: *Ts)` Asserts that the condition is true at run time. If the condition is false, the assertion displays the given message and causes the program to exit. You can pass in multiple arguments to generate a formatted message. No string allocation occurs unless the assertion is triggered. ```mojo x = 0 debug_assert(x > 0, "expected x to be more than 0 but got: ", x) ``` Normal assertions are off by default—they only run when the program is compiled with all assertions enabled. You can set the `assert_mode` to `safe` to create an assertion that's on by default: ```mojo debug_assert[assert_mode="safe"]( x > 0, "expected x to be more than 0 but got: ", x ) ``` Use the `ASSERT` variable to turn assertions on or off when building or running a Mojo program: ```sh mojo -D ASSERT=all main.mojo ``` The `ASSERT` variable takes the following values: * all: Turn on all assertions. * safe: Turn on "safe" assertions only. This is the default. * none: Turn off all assertions, for performance at the cost of safety. * warn: Turn on all assertions, but print any errors instead of exiting. To ensure that you have no run-time penalty from your assertions even when they're disabled, make sure there are no side effects in your message and condition expressions. For example: ```mojo person = "name: john, age: 50" name = "john" debug_assert(String("name: ") + name == person, "unexpected name") ``` This will have a run-time penalty due to allocating a `String` in the condition expression, even when assertions are disabled. To avoid this, put the condition inside a closure so it runs only when the assertion is turned on: ```mojo fn check_name() capturing -> Bool: return String("name: ") + name == person debug_assert[check_name]("unexpected name") ``` If you need to allocate, and so don't want the assert to ever run on GPU, you can set it to CPU only: ```mojo debug_assert[check_name, cpu_only=True]("unexpected name") ``` For compile-time assertions, see [`constrained()`](/mojo/stdlib/builtin/constrained/constrained). **Parameters:** * ​cond (`fn() capturing -> Bool`): The function to invoke to check if the assertion holds. * ​assert\_mode (`StringSlice[StaticConstantOrigin]`): Determines when the assert is turned on. * default ("none"): Turned on when compiled with `-D ASSERT=all`. * "safe": Turned on by default. * ​\*Ts (`Writable`): The element types for the message arguments. * ​cpu\_only (`Bool`): If true, only run the assert on CPU. **Args:** * ​\*messages (`*Ts`): A set of [`Writable`](/mojo/stdlib/utils/write/Writable/) arguments to convert to a `String` message. `debug_assert[assert_mode: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("none"), *Ts: Writable = *?, *, cpu_only: Bool = False](cond: Bool, *messages: *Ts)` Asserts that the condition is true at run time. If the condition is false, the assertion displays the given message and causes the program to exit. You can pass in multiple arguments to generate a formatted message. 
No string allocation occurs unless the assertion is triggered. ```mojo x = 0 debug_assert(x > 0, "expected x to be more than 0 but got: ", x) ``` Normal assertions are off by default—they only run when the program is compiled with all assertions enabled. You can set the `assert_mode` to `safe` to create an assertion that's on by default: ```mojo debug_assert[assert_mode="safe"]( x > 0, "expected x to be more than 0 but got: ", x ) ``` Use the `ASSERT` variable to turn assertions on or off when building or running a Mojo program: ```sh mojo -D ASSERT=all main.mojo ``` The `ASSERT` variable takes the following values: * all: Turn on all assertions. * safe: Turn on "safe" assertions only. This is the default. * none: Turn off all assertions, for performance at the cost of safety. * warn: Turn on all assertions, but print any errors instead of exiting. To ensure that you have no run-time penalty from your assertions even when they're disabled, make sure there are no side effects in your message and condition expressions. For example: ```mojo person = "name: john, age: 50" name = "john" debug_assert(String("name: ") + name == person, "unexpected name") ``` This will have a run-time penalty due to allocating a `String` in the condition expression, even when assertions are disabled. To avoid this, put the condition inside a closure so it runs only when the assertion is turned on: ```mojo fn check_name() capturing -> Bool: return String("name: ") + name == person debug_assert[check_name]("unexpected name") ``` If you need to allocate, and so don't want the assert to ever run on GPU, you can set it to CPU only: ```mojo debug_assert[check_name, cpu_only=True]("unexpected name") ``` For compile-time assertions, see [`constrained()`](/mojo/stdlib/builtin/constrained/constrained). **Parameters:** * ​assert\_mode (`StringSlice[StaticConstantOrigin]`): Determines when the assert is turned on. * default ("none"): Turned on when compiled with `-D ASSERT=all`. * "safe": Turned on by default. * ​\*Ts (`Writable`): The element types for the message arguments. * ​cpu\_only (`Bool`): If true, only run the assert on CPU. **Args:** * ​cond (`Bool`): The bool value to assert. * ​\*messages (`*Ts`): A set of [`Writable`](/mojo/stdlib/utils/write/Writable/) arguments to convert to a `String` message. `debug_assert[assert_mode: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("none"), cpu_only: Bool = False](cond: Bool, message: StringLiteral[value])` Asserts that the condition is true at run time. If the condition is false, the assertion displays the given message and causes the program to exit. You can pass in multiple arguments to generate a formatted message. No string allocation occurs unless the assertion is triggered. ```mojo x = 0 debug_assert(x > 0, "expected x to be more than 0 but got: ", x) ``` Normal assertions are off by default—they only run when the program is compiled with all assertions enabled. You can set the `assert_mode` to `safe` to create an assertion that's on by default: ```mojo debug_assert[assert_mode="safe"]( x > 0, "expected x to be more than 0 but got: ", x ) ``` Use the `ASSERT` variable to turn assertions on or off when building or running a Mojo program: ```sh mojo -D ASSERT=all main.mojo ``` The `ASSERT` variable takes the following values: * all: Turn on all assertions. * safe: Turn on "safe" assertions only. This is the default. * none: Turn off all assertions, for performance at the cost of safety. 
* warn: Turn on all assertions, but print any errors instead of exiting. To ensure that you have no run-time penalty from your assertions even when they're disabled, make sure there are no side effects in your message and condition expressions. For example: ```mojo person = "name: john, age: 50" name = "john" debug_assert(String("name: ") + name == person, "unexpected name") ``` This will have a run-time penalty due to allocating a `String` in the condition expression, even when assertions are disabled. To avoid this, put the condition inside a closure so it runs only when the assertion is turned on: ```mojo fn check_name() capturing -> Bool: return String("name: ") + name == person debug_assert[check_name]("unexpected name") ``` If you need to allocate, and so don't want the assert to ever run on GPU, you can set it to CPU only: ```mojo debug_assert[check_name, cpu_only=True]("unexpected name") ``` For compile-time assertions, see [`constrained()`](/mojo/stdlib/builtin/constrained/constrained). **Parameters:** * ​assert\_mode (`StringSlice[StaticConstantOrigin]`): Determines when the assert is turned on. * default ("none"): Turned on when compiled with `-D ASSERT=all`. * "safe": Turned on by default. * ​cpu\_only (`Bool`): If true, only run the assert on CPU. **Args:** * ​cond (`Bool`): The bool value to assert. * ​message (`StringLiteral[value]`): A static string message. --- ## debug_assert Implements run-time assertions. These are Mojo built-ins, so you don't need to import them. ## Aliases ### `ASSERT_MODE` `alias ASSERT_MODE = env_get_string[::StringSlice[::Bool()` ## Functions * [​`debug_assert`](/mojo/stdlib/builtin/debug_assert/debug_assert): Asserts that the condition is true at run time. --- ## DevicePassable This trait marks types as passable to accelerator devices. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `device_type` `alias device_type` Indicate the type being used on accelerator devices. ## Methods ### `get_type_name` `static get_type_name() -> String` Gets the name of the host type (the one implementing this trait). For example, Int would return "Int", DeviceBuffer\[DType.float32] would return "DeviceBuffer\[DType.float32]". This is used for error messages when passing types to the device. TODO: This method will be retired soon when better kernel call error messages arrive. **Returns:** The host type's name. ### `get_device_type_name` `static get_device_type_name() -> String` Gets device\_type's name. For example, because DeviceBuffer's device\_type is UnsafePointer, DeviceBuffer\[DType.float32]'s get\_device\_type\_name() should return something like "UnsafePointer\[Scalar\[DType.float32]]". This is used for error messages when passing types to the device. TODO: This method will be retired soon when better kernel call error messages arrive. **Returns:** The device type's name. --- ## device_passable ## Traits * [​`DevicePassable`](/mojo/stdlib/builtin/device_passable/DevicePassable): This trait marks types as passable to accelerator devices. --- ## DType `@register_passable(trivial)` `struct DType` Represents DType and provides methods for working with it. ## Fields * ​value (`dtype`): The underlying storage for the DType value. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Hashable`, `Movable`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Aliases ### `bfloat16` `alias bfloat16` Represents a brain floating point value whose bitwidth is 16. 
### `bool` `alias bool` Represents a boolean data type. ### `float16` `alias float16` Represents an IEEE754-2008 `binary16` floating point value. ### `float32` `alias float32` Represents an IEEE754-2008 `binary32` floating point value. ### `float64` `alias float64` Represents an IEEE754-2008 `binary64` floating point value. ### `float8_e3m4` `alias float8_e3m4` Represents an 8-bit e3m4 floating point format, encoded as `seeemmmm`: - (s)ign: 1 bit - (e)xponent: 3 bits - (m)antissa: 4 bits - exponent bias: 3 - nan: 00111111, 11111111 - -0: 10000000 - fn: finite (no inf or -inf encodings) ### `float8_e4m3fn` `alias float8_e4m3fn` Represents the E4M3 floating point format defined in the [OFP8 standard](https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1). This type is named differently across libraries and vendors, for example: * Mojo, PyTorch, JAX, and LLVM refer to it as `e4m3fn`. * OCP, NVIDIA CUDA, and AMD ROCm refer to it as `e4m3`. In these contexts, they are all referring to the same finite type specified in the OFP8 standard above, encoded as `seeeemmm`: * (s)ign: 1 bit * (e)xponent: 4 bits * (m)antissa: 3 bits * exponent bias: 7 * nan: 01111111, 11111111 * -0: 10000000 * fn: finite (no inf or -inf encodings) ### `float8_e4m3fnuz` `alias float8_e4m3fnuz` Represents an 8-bit e4m3fnuz floating point format, encoded as `seeeemmm`: - (s)ign: 1 bit - (e)xponent: 4 bits - (m)antissa: 3 bits - exponent bias: 8 - nan: 10000000 - fn: finite (no inf or -inf encodings) - uz: unsigned zero (no -0 encoding) ### `float8_e5m2` `alias float8_e5m2` Represents the 8-bit E5M2 floating point format from the [OFP8 standard](https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1), encoded as `seeeeemm`: - (s)ign: 1 bit - (e)xponent: 5 bits - (m)antissa: 2 bits - exponent bias: 15 - nan: {0,1}11111{01,10,11} - inf: 01111100 - -inf: 11111100 - -0: 10000000 ### `float8_e5m2fnuz` `alias float8_e5m2fnuz` Represents an 8-bit e5m2fnuz floating point format, encoded as `seeeeemm`: - (s)ign: 1 bit - (e)xponent: 5 bits - (m)antissa: 2 bits - exponent bias: 16 - nan: 10000000 - fn: finite (no inf or -inf encodings) - uz: unsigned zero (no -0 encoding) ### `index` `alias index` Represents an integral type whose bitwidth is the maximum integral bitwidth on the system. ### `int128` `alias int128 = si128` Represents a signed integer type whose bitwidth is 128. ### `int16` `alias int16` Represents a signed integer type whose bitwidth is 16. ### `int256` `alias int256 = si256` Represents a signed integer type whose bitwidth is 256. ### `int32` `alias int32` Represents a signed integer type whose bitwidth is 32. ### `int64` `alias int64` Represents a signed integer type whose bitwidth is 64. ### `int8` `alias int8` Represents a signed integer type whose bitwidth is 8. ### `invalid` `alias invalid` Represents an invalid or unknown data type. ### `tensor_float32` `alias tensor_float32` Represents a special floating point format supported by NVIDIA Tensor Cores, with the same range as float32 and reduced precision (>=10 bits). Note that this dtype is only available on NVIDIA GPUs. ### `type` `alias type = dtype` ### `uint128` `alias uint128 = ui128` Represents an unsigned integer type whose bitwidth is 128. ### `uint16` `alias uint16` Represents an unsigned integer type whose bitwidth is 16. ### `uint256` `alias uint256 = ui256` Represents an unsigned integer type whose bitwidth is 256.
### `uint32` `alias uint32` Represents an unsigned integer type whose bitwidth is 32. ### `uint64` `alias uint64` Represents an unsigned integer type whose bitwidth is 64. ### `uint8` `alias uint8` Represents an unsigned integer type whose bitwidth is 8. ## Methods ### `__init__` `@implicit` `__init__(value: dtype) -> Self` Construct a DType from MLIR dtype. **Args:** * ​value (`dtype`): The MLIR dtype. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compares one DType to another for equality. **Args:** * ​rhs (`Self`): The DType to compare against. **Returns:** True if the DTypes are the same and False otherwise. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Compares one DType to another for inequality. **Args:** * ​rhs (`Self`): The DType to compare against. **Returns:** False if the DTypes are the same and True otherwise. ### `__is__` `__is__(self, rhs: Self) -> Bool` Compares one DType to another for equality. **Args:** * ​rhs (`Self`): The DType to compare against. **Returns:** True if the DTypes are the same and False otherwise. ### `__isnot__` `__isnot__(self, rhs: Self) -> Bool` Compares one DType to another for inequality. **Args:** * ​rhs (`Self`): The DType to compare against. **Returns:** True if the DTypes are different and False otherwise. ### `copy` `copy(self) -> Self` Copy this DType. **Returns:** A copy of the value. ### `__str__` `__str__(self) -> String` Gets the name of the DType. **Returns:** The name of the dtype. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this dtype to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__repr__` `__repr__(self) -> String` Gets the representation of the DType e.g. `"DType.float32"`. **Returns:** The representation of the dtype. ### `get_value` `get_value(self) -> dtype` Gets the associated internal kgen.dtype value. **Returns:** The kgen.dtype value. ### `__hash__` `__hash__(self) -> UInt` Return a 64-bit hash for this `DType` value. **Returns:** A 64-bit integer hash of this `DType` value. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with this `DType` value. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `is_unsigned` `is_unsigned(self) -> Bool` Returns True if the type parameter is unsigned and False otherwise. **Returns:** Returns True if the input type parameter is unsigned. ### `is_signed` `is_signed(self) -> Bool` Returns True if the type parameter is signed and False otherwise. **Returns:** Returns True if the input type parameter is signed. ### `is_integral` `is_integral(self) -> Bool` Returns True if the type parameter is an integer and False otherwise. **Returns:** Returns True if the input type parameter is an integer. ### `is_floating_point` `is_floating_point(self) -> Bool` Returns True if the type parameter is a floating-point and False otherwise. **Returns:** Returns True if the input type parameter is a floating-point. ### `is_float8` `is_float8(self) -> Bool` Returns True if the dtype is an 8-bit-precision floating point type, e.g. float8\_e5m2, float8\_e5m2fnuz, float8\_e4m3fn, and float8\_e4m3fnuz. **Returns:** True if the dtype is an 8-bit-precision float, false otherwise. ### `is_half_float` `is_half_float(self) -> Bool` Returns True if the dtype is a half-precision floating point type, e.g. either fp16 or bf16. **Returns:** True if the dtype is a half-precision float, false otherwise.
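To make the query methods above concrete, here is a minimal illustrative sketch (not from the upstream reference; the dtypes chosen are arbitrary examples):

```mojo
def main():
    var dt = DType.bfloat16
    print(dt.is_floating_point())         # True
    print(dt.is_half_float())             # True: bf16 is a half-precision type
    print(dt.is_integral())               # False
    print(DType.uint8.is_unsigned())      # True
    print(DType.float8_e5m2.is_float8())  # True
```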
### `is_numeric` `is_numeric(self) -> Bool` Returns True if the type parameter is numeric (i.e. you can perform arithmetic operations on it). **Returns:** Returns True if the input type parameter is either integral or floating-point. ### `sizeof` `sizeof(self) -> Int` Returns the size in bytes of the current DType. **Returns:** Returns the size in bytes of the current DType. ### `bitwidth` `bitwidth(self) -> Int` Returns the size in bits of the current DType. **Returns:** Returns the size in bits of the current DType. ### `dispatch_integral` `dispatch_integral[: origin.set, //, func: fn[DType]() capturing -> None](self)` Dispatches an integral function corresponding to the current DType. **Constraints:** DType must be integral. **Parameters:** * ​func (`fn[DType]() capturing -> None`): A function parametrized on dtype to dispatch. ### `dispatch_floating` `dispatch_floating[: origin.set, //, func: fn[DType]() capturing -> None](self)` Dispatches a floating-point function corresponding to the current DType. **Constraints:** DType must be floating-point or integral. **Parameters:** * ​func (`fn[DType]() capturing -> None`): A function parametrized on dtype to dispatch. ### `dispatch_arithmetic` `dispatch_arithmetic[: origin.set, //, func: fn[DType]() capturing -> None](self)` Dispatches a function corresponding to the current DType. **Parameters:** * ​func (`fn[DType]() capturing -> None`): A function parametrized on dtype to dispatch. ### `__mlir_type` `__mlir_type(self) -> !kgen.deferred` Returns the MLIR type of the current DType as an MLIR type. **Returns:** The MLIR type of the current DType. ### `get_dtype` `static get_dtype[T: AnyType, size: Int = 1]() -> Self` Get the `DType` if the given Type is a `SIMD[_, size]` of a `DType`. **Parameters:** * ​T (`AnyType`): AnyType. * ​size (`Int`): The SIMD size to compare against. **Returns:** The `DType` if matched, otherwise `DType.invalid`. ### `is_scalar` `static is_scalar[T: AnyType]() -> Bool` Whether the given Type is a Scalar of a DType. **Parameters:** * ​T (`AnyType`): AnyType. **Returns:** The result. --- ## dtype Implements the DType class. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`DType`](/mojo/stdlib/builtin/dtype/DType): Represents DType and provides methods for working with it. --- ## EqualityComparable A type which can be compared for equality with other instances of itself. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__eq__` `__eq__(self: _Self, other: _Self) -> Bool` Define whether two instances of the object are equal to each other. **Args:** * ​other (`_Self`): Another instance of the same type. **Returns:** True if the instances are equal according to the type's definition of equality, False otherwise. ### `__ne__` `__ne__(self: _Self, other: _Self) -> Bool` Define whether two instances of the object are not equal to each other. **Args:** * ​other (`_Self`): Another instance of the same type. **Returns:** True if the instances are not equal according to the type's definition of equality, False otherwise. --- ## equality_comparable ## Traits * [​`EqualityComparable`](/mojo/stdlib/builtin/equality_comparable/EqualityComparable): A type which can be compared for equality with other instances of itself. --- ## Error `@register_passable` `struct Error` This type represents an Error. ## Fields * ​data (`UnsafePointer[SIMD[uint8, 1]]`): A pointer to the beginning of the string data being referenced. * ​loaded\_length (`Int`): The length of the string being referenced.
Error instances conditionally own their error message. To reduce the size of the error instance, we use the sign bit of the length field to store the ownership value. When loaded\_length is negative, it indicates ownership and a free is executed in the destructor. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Default constructor. `@implicit` `__init__(value: StringLiteral[value]) -> Self` Construct an Error object with a given string literal. **Args:** * ​value (`StringLiteral[value]`): The error message. `@implicit` `__init__(src: String) -> Self` Construct an Error object with a given string. **Args:** * ​src (`String`): The error message. `@implicit` `__init__(src: StringSlice[origin]) -> Self` Construct an Error object with a given string ref. **Args:** * ​src (`StringSlice[origin]`): The error message. `__init__[*Ts: Writable](*args: *Ts, *, sep: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), end: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")) -> Self` Construct an Error by concatenating a sequence of Writable arguments. **Parameters:** * ​\*Ts (`Writable`): The types of the arguments to format. Each type must satisfy `Writable`. **Args:** * ​\*args (`*Ts`): A sequence of Writable arguments. * ​sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements. * ​end (`StringSlice[StaticConstantOrigin]`): The String to write after printing the elements. ### `__copyinit__` `__copyinit__(existing: Self) -> Self` Creates a deep copy of an existing error. **Args:** * ​existing (`Self`): The error to copy from. ### `__del__` `__del__(owned self)` Releases memory if allocated. ### `__bool__` `__bool__(self) -> Bool` Returns True if the error is set and false otherwise. **Returns:** True if the error object contains a value and False otherwise. ### `copy` `copy(self) -> Self` Copy the object. **Returns:** A copy of the value. ### `__str__` `__str__(self) -> String` Converts the Error to string representation. **Returns:** A String of the error message. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this error to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__repr__` `__repr__(self) -> String` Converts the Error to printable representation. **Returns:** A printable representation of the error message. ### `byte_length` `byte_length(self) -> Int` Get the length of the Error string in bytes. Notes: This does not include the trailing null terminator in the count. **Returns:** The length of the Error string in bytes. ### `unsafe_cstr_ptr` `unsafe_cstr_ptr(self) -> UnsafePointer[SIMD[int8, 1]]` Retrieves a C-string-compatible pointer to the underlying memory. The returned pointer is guaranteed to be NUL terminated, and not null. **Returns:** The pointer to the underlying memory. ### `as_string_slice` `as_string_slice(self) -> StringSlice[ImmutableAnyOrigin]` Returns a string slice of the data, which may be owned by the Error. Notes: Since the data is not guaranteed to be owned by the Error, the resulting StringSlice is given an ImmutableAnyOrigin. **Returns:** A string slice pointing to the data, which may be owned by the Error. --- ## error Implements the Error class. These are Mojo built-ins, so you don't need to import them.
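For example, a raised `Error` can be caught and inspected like this (a minimal illustrative sketch, not from the upstream reference):

```mojo
def main():
    try:
        raise Error("something went wrong")
    except e:
        print("caught:", e)  # prints: caught: something went wrong
```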
## Structs * [​`Error`](/mojo/stdlib/builtin/error/Error): This type represents an Error. --- ## FileHandle `struct FileHandle` File handle to an opened file. ## Fields * ​handle (`UnsafePointer[NoneType]`): The underlying pointer to the file handle. ## Implemented traits `AnyType`, `Defaultable`, `Movable`, `UnknownDestructibility`, `Writer` ## Methods ### `__init__` `__init__(out self)` Default constructor. `__init__(out self, path: StringSlice[origin], mode: StringSlice[origin])` Construct the FileHandle using the file path and mode. **Args:** * ​path (`StringSlice[origin]`): The file path. * ​mode (`StringSlice[origin]`): The mode to open the file in (the mode can be "r" or "w" or "rw"). ### `__moveinit__` `__moveinit__(out self, owned existing: Self)` Move constructor for the file handle. **Args:** * ​existing (`Self`): The existing file handle. ### `__del__` `__del__(owned self)` Closes the file handle. ### `close` `close(mut self)` Closes the file handle. ### `read` `read(self, size: Int = -1) -> String` Reads data from a file and sets the file handle seek position. If size is left as the default of -1, it will read to the end of the file. Setting size to a number larger than what's in the file will set the String length to the total number of bytes, and read all the data. Examples: Read the entire file into a String: ```mojo var file = open("/tmp/example.txt", "r") var string = file.read() print(string) ``` Read the first 8 bytes, skip 2 bytes, and then read the next 8 bytes: ```mojo import os var file = open("/tmp/example.txt", "r") var word1 = file.read(8) print(word1) _ = file.seek(2, os.SEEK_CUR) var word2 = file.read(8) print(word2) ``` Read the last 8 bytes in the file, then the first 8 bytes: ```mojo _ = file.seek(-8, os.SEEK_END) var last_word = file.read(8) print(last_word) _ = file.seek(8, os.SEEK_SET) # os.SEEK_SET is the default start of file var first_word = file.read(8) print(first_word) ``` **Args:** * ​size (`Int`): Requested number of bytes to read (Default: -1 = EOF). **Returns:** The contents of the file. **Raises:** An error if this file handle is invalid, or if the file read returned a failure. `read[dtype: DType, origin: MutableOrigin](self, buffer: Span[SIMD[dtype, 1], origin]) -> Int` Read data from the file into the Span. This will read n bytes from the file into the input Span, where `0 <= n <= len(buffer)`. **Parameters:** * ​dtype (`DType`): The type that the data will be represented as. * ​origin (`MutableOrigin`): The origin of the passed in Span. **Args:** * ​buffer (`Span[SIMD[dtype, 1], origin]`): The mutable Span to read data into. **Returns:** The total amount of data that was read in bytes. **Raises:** An error if this file handle is invalid, or if the file read returned a failure. ### `read_bytes` `read_bytes(self, size: Int = -1) -> List[SIMD[uint8, 1]]` Reads data from a file and sets the file handle seek position. If size is left as the default of -1, it will read to the end of the file. Setting size to a number larger than what's in the file will set the List length to the total number of bytes in the file.
Examples: Reading the entire file into a List\[UInt8]: ```mojo var file = open("/tmp/example.txt", "r") var data = file.read_bytes() ``` Reading the first 8 bytes, skipping 2 bytes, and then reading the next 8 bytes: ```mojo import os var file = open("/tmp/example.txt", "r") var list1 = file.read_bytes(8) _ = file.seek(2, os.SEEK_CUR) var list2 = file.read_bytes(8) ``` Reading the last 8 bytes in the file, then the first 8 bytes: ```mojo import os var file = open("/tmp/example.txt", "r") _ = file.seek(-8, os.SEEK_END) var last_data = file.read_bytes(8) _ = file.seek(8, os.SEEK_SET) # os.SEEK_SET is the default start of file var first_data = file.read_bytes(8) ``` **Args:** * ​size (`Int`): Requested number of bytes to read (Default: -1 = EOF). **Returns:** The contents of the file. **Raises:** An error if this file handle is invalid, or if the file read returned a failure. ### `seek` `seek(self, offset: SIMD[uint64, 1], whence: SIMD[uint8, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> SIMD[uint64, 1]` Seeks to the given offset in the file. Examples: Skip 32 bytes from the current read position: ```mojo import os var f = open("/tmp/example.txt", "r") _ = f.seek(32, os.SEEK_CUR) ``` Start from 32 bytes from the end of the file: ```mojo import os var f = open("/tmp/example.txt", "r") _ = f.seek(-32, os.SEEK_END) ``` **Args:** * ​offset (`SIMD[uint64, 1]`): The byte offset to seek to. * ​whence (`SIMD[uint8, 1]`): The reference point for the offset: os.SEEK\_SET = 0: start of file (Default). os.SEEK\_CUR = 1: current position. os.SEEK\_END = 2: end of file. **Returns:** The resulting byte offset from the start of the file. **Raises:** An error if this file handle is invalid, or if file seek returned a failure. ### `write_bytes` `write_bytes(mut self, bytes: Span[SIMD[uint8, 1], origin])` Write a span of bytes to the file. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The byte span to write to this file. ### `write` `write[*Ts: Writable](mut self, *args: *Ts)` Write a sequence of Writable arguments to the provided Writer. **Parameters:** * ​\*Ts (`Writable`): Types of the provided argument sequence. **Args:** * ​\*args (`*Ts`): Sequence of arguments to write to this Writer. ### `__enter__` `__enter__(owned self) -> Self` The function to call when entering the context. **Returns:** The file handle. --- ## file Provides APIs to read and write files. These are Mojo built-ins, so you don't need to import them. For example, here's how to read a file: ```mojo var f = open("my_file.txt", "r") print(f.read()) f.close() ``` Or use a `with` statement to close the file automatically: ```mojo with open("my_file.txt", "r") as f: print(f.read()) ``` ## Structs * [​`FileHandle`](/mojo/stdlib/builtin/file/FileHandle): File handle to an opened file. ## Functions * [​`open`](/mojo/stdlib/builtin/file/open): Opens the file specified by path using the mode provided, returning a FileHandle. --- ## open `open[PathLike: PathLike](path: PathLike, mode: StringSlice[origin]) -> FileHandle` Opens the file specified by path using the mode provided, returning a FileHandle. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the file to open. * ​mode (`StringSlice[origin]`): The mode to open the file in (the mode can be "r" or "w"). **Returns:** A file handle. --- ## FileDescriptor `@register_passable(trivial)` `struct FileDescriptor` File descriptor of a file. ## Fields * ​value (`Int`): The underlying value of the file descriptor.
## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`, `Writer` ## Methods ### `__init__` `__init__(value: Int = 1) -> Self` Constructs the file descriptor from an integer. **Args:** * ​value (`Int`): The file identifier (Default 1 = stdout). `@implicit` `__init__(f: FileHandle) -> Self` Constructs the file descriptor from a file handle. **Args:** * ​f (`FileHandle`): The file handle. ### `__write_bytes_cpu` `__write_bytes_cpu(mut self, bytes: Span[SIMD[uint8, 1], origin])` Write a span of bytes to the file. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The byte span to write to this file. ### `write_bytes` `write_bytes(mut self, bytes: Span[SIMD[uint8, 1], origin])` Write a span of bytes to the file. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The byte span to write to this file. ### `read_bytes` `read_bytes(mut self, buffer: Span[SIMD[uint8, 1], origin]) -> UInt` Read a number of bytes from the file into a buffer. Notes: [Reference](https://pubs.opengroup.org/onlinepubs/9799919799/functions/read.html). **Args:** * ​buffer (`Span[SIMD[uint8, 1], origin]`): A `Span[Byte]` to read bytes into. Read up to `len(buffer)` number of bytes. **Returns:** Actual number of bytes read. ### `write` `write[*Ts: Writable](mut self, *args: *Ts)` Write a sequence of Writable arguments to the provided Writer. **Parameters:** * ​\*Ts (`Writable`): Types of the provided argument sequence. **Args:** * ​\*args (`*Ts`): Sequence of arguments to write to this Writer. --- ## file_descriptor Higher-level abstraction for file streams. These are Mojo built-ins, so you don't need to import them. For example, here's how to print to a file: ```mojo var f = open("my_file.txt", "w") print("hello", file=f^) ``` ## Structs * [​`FileDescriptor`](/mojo/stdlib/builtin/file_descriptor/FileDescriptor): File descriptor of a file. --- ## FloatLiteral `@register_passable(trivial)` `struct FloatLiteral[value: !pop.float_literal]` Mojo floating point literal type. ## Parameters * ​value (`!pop.float_literal`): The underlying infinite precision floating point value. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `Floatable`, `ImplicitlyBoolable`, `Intable`, `Movable`, `Stringable`, `UnknownDestructibility` ## Aliases ### `infinity` `alias infinity = inf` ### `nan` `alias nan` ### `negative_infinity` `alias negative_infinity = -inf` ### `negative_zero` `alias negative_zero = -0.0` ## Methods ### `__init__` `__init__() -> Self` Create a FloatLiteral for any parameter value. `@implicit` `__init__(value: IntLiteral[value]) -> FloatLiteral[#pop.int_to_float_literal]` Convert an IntLiteral to a FloatLiteral value. **Args:** * ​value (`IntLiteral[value]`): The IntLiteral value. ### `__bool__` `__bool__(self) -> Bool` A FloatLiteral value is true if it is non-zero. **Returns:** True if non-zero. ### `__neg__` `__neg__(self) -> FloatLiteral[#pop.float_literal_bin>]` Return the negation of the FloatLiteral value. **Returns:** The negated FloatLiteral value. ### `__lt__` `__lt__(self, rhs: FloatLiteral[value]) -> Bool` Less than comparison. **Args:** * ​rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if this value is less than `rhs`. ### `__le__` `__le__(self, rhs: FloatLiteral[value]) -> Bool` Less than or equal to comparison. **Args:** * ​rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if this value is less than or equal to `rhs`. ### `__eq__` `__eq__(self, rhs: FloatLiteral[value]) -> Bool` Compare for equality.
**Args:** * ​rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if they are equal. ### `__ne__` `__ne__(self, rhs: FloatLiteral[value]) -> Bool` Compare for inequality. **Args:** * ​rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if they are not equal. ### `__gt__` `__gt__(self, rhs: FloatLiteral[value]) -> Bool` Greater than comparison. **Args:** * ​rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if this value is greater than `rhs`. ### `__ge__` `__ge__(self, rhs: FloatLiteral[value]) -> Bool` Greater than or equal to comparison. **Args:** * ​rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if this value is greater than or equal to `rhs`. ### `__add__` `__add__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Add two FloatLiterals. **Args:** * ​rhs (`FloatLiteral[value]`): The value to add. **Returns:** The sum of the two values. ### `__sub__` `__sub__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Subtract two FloatLiterals. **Args:** * ​rhs (`FloatLiteral[value]`): The value to subtract. **Returns:** The difference of the two values. ### `__mul__` `__mul__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Multiply two FloatLiterals. **Args:** * ​rhs (`FloatLiteral[value]`): The value to multiply. **Returns:** The product of the two values. ### `__truediv__` `__truediv__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Divide two FloatLiterals. **Args:** * ​rhs (`FloatLiteral[value]`): The value to divide. **Returns:** The quotient of the two values. ### `__floordiv__` `__floordiv__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Returns self divided by rhs, rounded down to the nearest integer. **Args:** * ​rhs (`FloatLiteral[value]`): The divisor value. **Returns:** `floor(self / rhs)` value. ### `__mod__` `__mod__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin, value>>]` Return the remainder of self divided by rhs. **Args:** * ​rhs (`FloatLiteral[value]`): The value to divide on. **Returns:** The remainder of dividing self by rhs. ### `__radd__` `__radd__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Reversed addition operator. **Args:** * ​rhs (`FloatLiteral[value]`): The value to add. **Returns:** The sum of this and the given value. ### `__rsub__` `__rsub__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Reversed subtraction operator. **Args:** * ​rhs (`FloatLiteral[value]`): The value to subtract from. **Returns:** The result of subtracting this from the given value. ### `__rmul__` `__rmul__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Reversed multiplication operator. **Args:** * ​rhs (`FloatLiteral[value]`): The value to multiply. **Returns:** The product of the given number and this. ### `__rtruediv__` `__rtruediv__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Reversed division. **Args:** * ​rhs (`FloatLiteral[value]`): The value to be divided by this. **Returns:** The result of dividing the given value by this. ### `__rfloordiv__` `__rfloordiv__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Returns rhs divided by self, rounded down to the nearest integer. **Args:** * ​rhs (`FloatLiteral[value]`): The value to be divided by self. **Returns:** `floor(rhs / self)` value. 
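As a quick illustration of the arithmetic methods above (a minimal sketch added for illustration; this is plain literal arithmetic, folded at compile time):

```mojo
def main():
    alias q = 7.5 // 2.0  # __floordiv__: floor(7.5 / 2.0) == 3.0
    alias r = 7.5 % 2.0   # __mod__: 7.5 - 3.0 * 2.0 == 1.5
    print(q, r)           # prints: 3.0 1.5
```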
### `__rmod__` `__rmod__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin, value>>]` Return the remainder of rhs divided by self. **Args:** * ​rhs (`FloatLiteral[value]`): The value to divide on. **Returns:** The remainder of dividing rhs by self. ### `is_nan` `is_nan(self) -> Bool` Return whether the FloatLiteral is nan. Since `nan == nan` is False, this provides a way to check for nan-ness. **Returns:** True, if the value is nan, False otherwise. ### `is_neg_zero` `is_neg_zero(self) -> Bool` Return whether the FloatLiteral is negative zero. Since `FloatLiteral.negative_zero == 0.0` is True, this provides a way to check if the FloatLiteral is negative zero. **Returns:** True, if the value is negative zero, False otherwise. ### `__str__` `__str__(self) -> String` Get the float as a string. **Returns:** A string representation. ### `__int_literal__` `__int_literal__(self) -> IntLiteral[#pop.float_to_int_literal]` Casts the floating point value to an IntLiteral. If there is a fractional component, then the value is truncated towards zero. E.g. `(4.5).__int_literal__()` returns `4`, and `(-3.7).__int_literal__()` returns `-3`. **Returns:** The value as an integer. ### `__int__` `__int__(self) -> Int` Converts the FloatLiteral value to an Int. If there is a fractional component, then the value is truncated towards zero. E.g. `(4.5).__int__()` returns `4`, and `(-3.7).__int__()` returns `-3`. **Returns:** The value as an integer. ### `__float__` `__float__(self) -> SIMD[float64, 1]` Converts the FloatLiteral to a concrete Float64. **Returns:** The Float value. ### `__as_bool__` `__as_bool__(self) -> Bool` A FloatLiteral value is true if it is non-zero. **Returns:** True if non-zero. ### `__ceildiv__` `__ceildiv__(self, denominator: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin>>, #pop.float_literal>]` Return the rounded-up result of dividing self by denominator. **Args:** * ​denominator (`FloatLiteral[value]`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. --- ## float_literal Implements the FloatLiteral class. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`FloatLiteral`](/mojo/stdlib/builtin/float_literal/FloatLiteral): Mojo floating point literal type. --- ## Floatable The `Floatable` trait describes a type that can be converted to a Float64. This trait requires the type to implement the `__float__()` method. For example: ```mojo struct Foo(Floatable): var i: Float64 fn __float__(self) -> Float64: return self.i ``` A `Foo` can now be converted to a `Float64`: ```mojo var f = Float64(Foo(5.5)) ``` **Note:** If the `__float__()` method can raise an error, use the [`FloatableRaising`](/mojo/stdlib/builtin/floatable/floatableraising) trait instead. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__float__` `__float__(self: _Self) -> SIMD[float64, 1]` Get the floating point representation of the value. **Returns:** The floating point representation of the value. --- ## FloatableRaising The `FloatableRaising` trait describes a type that can be converted to a Float64, but the conversion might raise an error (e.g.: a string). This trait requires the type to implement the `__float__()` method, which can raise an error.
For example: ```mojo from utils import Variant struct MaybeFloat(FloatableRaising): var value: Variant[Float64, NoneType] fn __float__(self) raises -> Float64: if self.value.isa[NoneType](): raise "Float expected" return self.value[Float64] ``` A `MaybeFloat` can now be converted to `Float64`: ```mojo try: print(Float64(MaybeFloat(4.6))) except: print("error occurred") ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__float__` `__float__(self: _Self) -> SIMD[float64, 1]` Get the floating point representation of the value. **Returns:** The floating point representation of the value. **Raises:** If the type does not have a floating point representation. --- ## floatable Implements the `Floatable` and `FloatableRaising` traits. These are Mojo built-ins, so you don't need to import them. ## Traits * [​`Floatable`](/mojo/stdlib/builtin/floatable/Floatable): The `Floatable` trait describes a type that can be converted to a Float64. * [​`FloatableRaising`](/mojo/stdlib/builtin/floatable/FloatableRaising): The `FloatableRaising` trait describes a type that can be converted to a Float64, but the conversion might raise an error (e.g.: a string). --- ## bin `bin(num: SIMD[dtype, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0b")) -> String` Return the binary string representation of an integral value. ```mojo print(bin(123)) print(bin(-123)) ``` ```plaintext '0b1111011' '-0b1111011' ``` **Args:** * ​num (`SIMD[dtype, 1]`): An integral scalar value. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** The binary string representation of num. `bin(b: SIMD[bool, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0b")) -> String` Returns the binary representation of a scalar bool. **Args:** * ​b (`SIMD[bool, 1]`): A scalar bool value. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** The binary string representation of b. `bin[T: Intable, //](num: T, /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0b")) -> String` Returns the binary representation of an indexer type. **Parameters:** * ​T (`Intable`): The Intable type. **Args:** * ​num (`T`): An indexer value. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** The binary string representation of num. --- ## hex `hex(value: SIMD[dtype, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0x")) -> String` Returns the hex string representation of the given integer. The hexadecimal representation is a base-16 encoding of the integer value. The returned string will be prefixed with "0x" to indicate that the subsequent digits are hex. **Args:** * ​value (`SIMD[dtype, 1]`): The integer value to format. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the hex representation of the given integer. `hex[T: Intable, //](value: T, /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0x")) -> String` Returns the hex string representation of the given integer. The hexadecimal representation is a base-16 encoding of the integer value. The returned string will be prefixed with "0x" to indicate that the subsequent digits are hex. **Parameters:** * ​T (`Intable`): The indexer type to represent in hexadecimal. **Args:** * ​value (`T`): The integer value to format.
* ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the hex representation of the given integer. `hex(value: SIMD[bool, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0x")) -> String` Returns the hex string representation of the given scalar bool. The hexadecimal representation is a base-16 encoding of the bool. The returned string will be prefixed with "0x" to indicate that the subsequent digits are hex. **Args:** * ​value (`SIMD[bool, 1]`): The bool value to format. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the hex representation of the given bool. --- ## format_int Provides the `hex` and `bin` functions. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`bin`](/mojo/stdlib/builtin/format_int/bin): Return the binary string representation of an integral value. * [​`hex`](/mojo/stdlib/builtin/format_int/hex): Returns the hex string representation of the given integer. * [​`oct`](/mojo/stdlib/builtin/format_int/oct): Returns the octal string representation of the given integer. --- ## oct `oct(value: SIMD[dtype, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0o")) -> String` Returns the octal string representation of the given integer. The octal representation is a base-8 encoding of the integer value. The returned string will be prefixed with "0o" to indicate that the subsequent digits are octal. **Args:** * ​value (`SIMD[dtype, 1]`): The integer value to format. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the octal representation of the given integer. `oct[T: Intable, //](value: T, /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0o")) -> String` Returns the octal string representation of the given integer. The octal representation is a base-8 encoding of the integer value. The returned string will be prefixed with "0o" to indicate that the subsequent digits are octal. **Parameters:** * ​T (`Intable`): The intable type to represent in octal. **Args:** * ​value (`T`): The integer value to format. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the octal representation of the given integer. `oct(value: SIMD[bool, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0o")) -> String` Returns the octal string representation of the given scalar bool. The octal representation is a base-8 encoding of the bool. The returned string will be prefixed with "0o" to indicate that the subsequent digits are octal. **Args:** * ​value (`SIMD[bool, 1]`): The bool value to format. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the octal representation of the given bool. --- ## Identifiable The Identifiable trait denotes a type with an identity which can be compared with other instances of itself. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__is__` `__is__(self: _Self, rhs: _Self) -> Bool` Define whether `self` has the same identity as `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is `rhs`. ### `__isnot__` `__isnot__(self: _Self, rhs: _Self) -> Bool` Define whether `self` has a different identity than `rhs`.
**Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is not `rhs`. --- ## identifiable ## Traits * [​`Identifiable`](/mojo/stdlib/builtin/identifiable/Identifiable): The Identifiable trait denotes a type with an identity which can be compared with other instances of itself. --- ## builtin Implements the builtin package. ## Modules * [​`anytype`](/mojo/stdlib/builtin/anytype/): Defines the core traits for object lifetime management in Mojo. * [​`bool`](/mojo/stdlib/builtin/bool/): Implements the Bool class. * [​`breakpoint`](/mojo/stdlib/builtin/breakpoint/): This module includes the builtin breakpoint function. * [​`builtin_slice`](/mojo/stdlib/builtin/builtin_slice/): Implements slice. * [​`comparable`](/mojo/stdlib/builtin/comparable/): * [​`constrained`](/mojo/stdlib/builtin/constrained/): Implements compile-time constraints. * [​`coroutine`](/mojo/stdlib/builtin/coroutine/): Implements classes and methods for coroutines. * [​`debug_assert`](/mojo/stdlib/builtin/debug_assert/): Implements run-time assertions. * [​`device_passable`](/mojo/stdlib/builtin/device_passable/): * [​`dtype`](/mojo/stdlib/builtin/dtype/): Implements the DType class. * [​`equality_comparable`](/mojo/stdlib/builtin/equality_comparable/): * [​`error`](/mojo/stdlib/builtin/error/): Implements the Error class. * [​`file`](/mojo/stdlib/builtin/file/): Provides APIs to read and write files. * [​`file_descriptor`](/mojo/stdlib/builtin/file_descriptor/): Higher level abstraction for file stream. * [​`float_literal`](/mojo/stdlib/builtin/float_literal/): Implements the FloatLiteral class. * [​`floatable`](/mojo/stdlib/builtin/floatable/): Implements the `Floatable` and `FloatableRaising` traits. * [​`format_int`](/mojo/stdlib/builtin/format_int/): Provides the `hex` and `bin` functions. * [​`identifiable`](/mojo/stdlib/builtin/identifiable/): * [​`int`](/mojo/stdlib/builtin/int/): Implements the Int class. * [​`int_literal`](/mojo/stdlib/builtin/int_literal/): Implements the IntLiteral class. * [​`io`](/mojo/stdlib/builtin/io/): Provides utilities for working with input/output. * [​`len`](/mojo/stdlib/builtin/len/): Provides the `len()` function and its associated traits. * [​`math`](/mojo/stdlib/builtin/math/): Defines basic math functions for use in the open source parts of the standard library since the `math` package is currently closed source and cannot be depended on in the open source parts of the standard library. * [​`none`](/mojo/stdlib/builtin/none/): Defines the builtin `NoneType`. * [​`range`](/mojo/stdlib/builtin/range/): Implements a 'range' call. * [​`rebind`](/mojo/stdlib/builtin/rebind/): Implements type rebind. * [​`repr`](/mojo/stdlib/builtin/repr/): Provide the `repr` function. * [​`reversed`](/mojo/stdlib/builtin/reversed/): Provides the `reversed` function for reverse iteration over collections. * [​`simd`](/mojo/stdlib/builtin/simd/): Implements SIMD primitives and abstractions. * [​`sort`](/mojo/stdlib/builtin/sort/): Implements the built-in `sort` function. * [​`str`](/mojo/stdlib/builtin/str/): Provides the `str` function. * [​`string_literal`](/mojo/stdlib/builtin/string_literal/): Implements the StringLiteral struct. * [​`swap`](/mojo/stdlib/builtin/swap/): Implements the built-in `swap` function. * [​`tuple`](/mojo/stdlib/builtin/tuple/): Implements the Tuple type. * [​`type_aliases`](/mojo/stdlib/builtin/type_aliases/): Defines some type aliases. * [​`uint`](/mojo/stdlib/builtin/uint/): Implements the UInt class. 
* [​`value`](/mojo/stdlib/builtin/value/): Defines core value traits. * [​`variadics`](/mojo/stdlib/builtin/variadics/): Implements the VariadicList and VariadicPack types. --- ## ImplicitlyIntable The `ImplicitlyIntable` trait describes a type that can be converted to an Int implicitly. This trait requires the type to implement the `__as_int__()` method. For example: ```mojo struct Foo(ImplicitlyIntable): var i: Int fn __int__(self) -> Int: return self.i fn __as_int__(self) -> Int: return self.__int__() ``` Now you can use `Foo` anywhere that an `Int` is expected, e.g. equality checks: ```mojo foo = Foo(42) assert_equal(Int(42), foo) ``` ## Implemented traits `AnyType`, `Copyable`, `Intable`, `Movable`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `__as_int__` `__as_int__(self: _Self) -> Int` Implicitly convert to an integral representation of the value, wherever an `Int` is expected. **Returns:** The integral representation of the value. ### `__int__` `__int__(self: _Self) -> Int` Get the integral representation of the value. **Returns:** The integral representation of the value. --- ## Indexer The `Indexer` trait is used for types that can index into a collection or pointer. The type returned is the underlying \_\_mlir\_type.index, enabling types like `UInt` to not have to be converted to an `Int` first. This type is implicitly convertible to an `Int`, so can be used anywhere an `Int` can e.g. for comparisons. ## Implemented traits `AnyType`, `Copyable`, `Intable`, `Movable`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `__index__` `__index__(self: _Self) -> index` Convert to index. **Returns:** The corresponding \_\_mlir\_type.index value. ### `__int__` `__int__(self: _Self) -> Int` Get the integral representation of the value. **Returns:** The integral representation of the value. --- ## Int `@register_passable(trivial)` `struct Int` This type represents an integer value. ## Fields * ​value (`index`): The underlying storage for the integer value. ## Implemented traits `Absable`, `AnyType`, `Boolable`, `CeilDivable`, `Ceilable`, `ConvertibleFromPython`, `Copyable`, `Defaultable`, `DevicePassable`, `EqualityComparable`, `ExplicitlyCopyable`, `Floorable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `Hashable`, `ImplicitlyBoolable`, `Indexer`, `Intable`, `IntervalElement`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `Powable`, `PythonConvertible`, `Representable`, `Roundable`, `Stringable`, `Truncable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Aliases ### `BITWIDTH` `alias BITWIDTH = __init__[::Intable](bitwidthof[::DType,__mlir_type.!kgen.target]())` The bit width of the integer type. ### `device_type` `alias device_type = Int` Int is remapped to the same type when passed to accelerator devices. 
### `MAX` `alias MAX = __init__[::Intable](SIMD(max_or_inf[::DType]()))` Returns the maximum integer value. ### `MIN` `alias MIN = __init__[::Intable](SIMD(min_or_neg_inf[::DType]()))` Returns the minimum value of type. ## Methods ### `__init__` `__init__() -> Self` Default constructor that produces zero. `@implicit` `__init__(value: IntLiteral[value]) -> Self` Construct Int from the given IntLiteral value. **Args:** * ​value (`IntLiteral[value]`): The init value. `@implicit` `__init__(value: UInt) -> Self` Construct Int from the given UInt value. **Args:** * ​value (`UInt`): The init value. `__init__[T: Intable](value: T) -> Self` Get the Int representation of the value. **Parameters:** * ​T (`Intable`): The Intable type. **Args:** * ​value (`T`): The object to get the integral representation of. `__init__[T: IntableRaising](out self, value: T)` Get the Int representation of the value. **Parameters:** * ​T (`IntableRaising`): The Intable type. **Args:** * ​value (`T`): The object to get the integral representation of. **Raises:** If the type does not have an integral representation. `@implicit` `__init__[I: ImplicitlyIntable](value: I) -> Self` Construct Int from implicitly convertible type. **Parameters:** * ​I (`ImplicitlyIntable`): The type that is implicitly convertible to an `Int`. **Args:** * ​value (`I`): The init value. `__init__(out self, value: StringSlice[origin], base: UInt = UInt(10))` Parses and returns the given string as an integer in the given base. If base is set to 0, the string is parsed as an Integer literal, with the following considerations: * '0b' or '0B' prefix indicates binary (base 2) * '0o' or '0O' prefix indicates octal (base 8) * '0x' or '0X' prefix indicates hexadecimal (base 16) * Without a prefix, it's treated as decimal (base 10) Examples: ```mojo print(Int("32")) print(Int("FF", 16)) print(Int("0xFF", 0)) print(Int("0b1010", 0)) ``` ```plaintext 32 255 255 10 ``` Notes: This follows [Python's integer literals](https://docs.python.org/3/reference/lexical_analysis.html#integers). **Args:** * ​value (`StringSlice[origin]`): A string to be parsed as an integer in the given base. * ​base (`UInt`): Base used for conversion, value must be between 2 and 36, or 0. **Raises:** If the given string cannot be parsed as an integer value or if an incorrect base is provided. ### `__bool__` `__bool__(self) -> Bool` Convert this Int to Bool. **Returns:** False Bool value if the value is equal to 0 and True otherwise. ### `__neg__` `__neg__(self) -> Self` Return -self. **Returns:** The -self value. ### `__pos__` `__pos__(self) -> Self` Return +self. **Returns:** The +self value. ### `__invert__` `__invert__(self) -> Self` Return \~self. **Returns:** The \~self value. ### `__lt__` `__lt__(self, rhs: Self) -> Bool` Compare this Int to the RHS using LT comparison. **Args:** * ​rhs (`Self`): The other Int to compare against. **Returns:** True if this Int is less-than the RHS Int and False otherwise. ### `__le__` `__le__(self, rhs: Self) -> Bool` Compare this Int to the RHS using LE comparison. **Args:** * ​rhs (`Self`): The other Int to compare against. **Returns:** True if this Int is less-or-equal than the RHS Int and False otherwise. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compare this Int to the RHS using EQ comparison. **Args:** * ​rhs (`Self`): The other Int to compare against. **Returns:** True if this Int is equal to the RHS Int and False otherwise. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Compare this Int to the RHS using NE comparison.
**Args:** * ​rhs (`Self`): The other Int to compare against. **Returns:** True if this Int is non-equal to the RHS Int and False otherwise. ### `__gt__` `__gt__(self, rhs: Self) -> Bool` Compare this Int to the RHS using GT comparison. **Args:** * ​rhs (`Self`): The other Int to compare against. **Returns:** True if this Int is greater than the RHS Int and False otherwise. ### `__ge__` `__ge__(self, rhs: Self) -> Bool` Compare this Int to the RHS using GE comparison. **Args:** * ​rhs (`Self`): The other Int to compare against. **Returns:** True if this Int is greater-or-equal than the RHS Int and False otherwise. ### `__add__` `__add__(self, rhs: Self) -> Self` Return `self + rhs`. **Args:** * ​rhs (`Self`): The value to add. **Returns:** `self + rhs` value. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Return `self - rhs`. **Args:** * ​rhs (`Self`): The value to subtract. **Returns:** `self - rhs` value. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Return `self * rhs`. **Args:** * ​rhs (`Self`): The value to multiply with. **Returns:** `self * rhs` value. ### `__truediv__` `__truediv__(self, rhs: Self) -> SIMD[float64, 1]` Return the floating point division of `self` and `rhs`. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** `Float64(self)/Float64(rhs)` value. ### `__floordiv__` `__floordiv__(self, rhs: Self) -> Self` Return the division of `self` and `rhs` rounded down to the nearest integer. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** `floor(self/rhs)` value. ### `__mod__` `__mod__(self, rhs: Self) -> Self` Return the remainder of self divided by rhs. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** The remainder of dividing self by rhs. ### `__pow__` `__pow__(self, exp: Self) -> Self` Return the value raised to the power of the given exponent. Computes the power of an integer using the Russian Peasant Method. **Args:** * ​exp (`Self`): The exponent value. **Returns:** The value of `self` raised to the power of `exp`. ### `__lshift__` `__lshift__(self, rhs: Self) -> Self` Return `self << rhs`. **Args:** * ​rhs (`Self`): The value to shift with. **Returns:** `self << rhs`. ### `__rshift__` `__rshift__(self, rhs: Self) -> Self` Return `self >> rhs`. **Args:** * ​rhs (`Self`): The value to shift with. **Returns:** `self >> rhs`. ### `__and__` `__and__(self, rhs: Self) -> Self` Return `self & rhs`. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self & rhs`. ### `__or__` `__or__(self, rhs: Self) -> Self` Return `self | rhs`. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self | rhs`. ### `__xor__` `__xor__(self, rhs: Self) -> Self` Return `self ^ rhs`. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self ^ rhs`. ### `__radd__` `__radd__(self, value: Self) -> Self` Return `value + self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value + self`. ### `__rsub__` `__rsub__(self, value: Self) -> Self` Return `value - self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value - self`. ### `__rmul__` `__rmul__(self, value: Self) -> Self` Return `value * self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value * self`. ### `__rfloordiv__` `__rfloordiv__(self, value: Self) -> Self` Return `value // self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value // self`. ### `__rmod__` `__rmod__(self, value: Self) -> Self` Return `value % self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value % self`.
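To make the mapping between these dunder methods and ordinary operator syntax concrete, here is a small sketch (the values are arbitrary):

```mojo
var x = 7
print(x // 2, x % 2)   # __floordiv__ and __mod__: 3 1
print(x << 1, x >> 1)  # __lshift__ and __rshift__: 14 3
print(x & 3, x | 8)    # __and__ and __or__: 3 15
print(x ** 2)          # __pow__ (Russian Peasant Method): 49
```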
### `__rpow__` `__rpow__(self, value: Self) -> Self` Return `pow(value,self)`. **Args:** * ​value (`Self`): The other value. **Returns:** `pow(value,self)`. ### `__rlshift__` `__rlshift__(self, value: Self) -> Self` Return `value << self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value << self`. ### `__rrshift__` `__rrshift__(self, value: Self) -> Self` Return `value >> self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value >> self`. ### `__rand__` `__rand__(self, value: Self) -> Self` Return `value & self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value & self`. ### `__ror__` `__ror__(self, value: Self) -> Self` Return `value | self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value | self`. ### `__rxor__` `__rxor__(self, value: Self) -> Self` Return `value ^ self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value ^ self`. ### `__iadd__` `__iadd__(mut self, rhs: Self)` Compute `self + rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__isub__` `__isub__(mut self, rhs: Self)` Compute `self - rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__imul__` `__imul__(mut self, rhs: Self)` Compute `self * rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__itruediv__` `__itruediv__(mut self, rhs: Self)` Compute `self / rhs`, convert to int, and save the result in self. Since `floor(self / rhs)` is equivalent to `self // rhs`, this yields the same as `__ifloordiv__`. **Args:** * ​rhs (`Self`): The RHS value. ### `__ifloordiv__` `__ifloordiv__(mut self, rhs: Self)` Compute `self // rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__imod__` `__imod__(mut self, rhs: Self)` Compute `self % rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__ipow__` `__ipow__(mut self, rhs: Self)` Compute `pow(self, rhs)` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__ilshift__` `__ilshift__(mut self, rhs: Self)` Compute `self << rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__irshift__` `__irshift__(mut self, rhs: Self)` Compute `self >> rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__iand__` `__iand__(mut self, rhs: Self)` Compute `self & rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__ixor__` `__ixor__(mut self, rhs: Self)` Compute `self ^ rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__ior__` `__ior__(mut self, rhs: Self)` Compute `self | rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `get_type_name` `static get_type_name() -> String` Gets this type's name, for use in error messages when handing arguments to kernels. TODO: This will go away soon, when we get better error messages for kernel calls. **Returns:** This type's name. ### `get_device_type_name` `static get_device_type_name() -> String` Gets device\_type's name, for use in error messages when handing arguments to kernels. TODO: This will go away soon, when we get better error messages for kernel calls. **Returns:** This type's name. ### `__divmod__` `__divmod__(self, rhs: Self) -> Tuple[Int, Int]` Computes both the quotient and remainder using integer division. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** The quotient and remainder as a tuple `(self // rhs, self % rhs)`. ### `__as_bool__` `__as_bool__(self) -> Bool` Convert this Int to Bool.
**Returns:** False Bool value if the value is equal to 0 and True otherwise. ### `__index__` `__index__(self) -> index` Convert to index. **Returns:** The corresponding \_\_mlir\_type.index value. ### `__int__` `__int__(self) -> Self` Gets the integral value (this is an identity function for Int). **Returns:** The value as an integer. ### `__abs__` `__abs__(self) -> Self` Return the absolute value of the Int value. **Returns:** The absolute value. ### `__ceil__` `__ceil__(self) -> Self` Return the ceiling of the Int value, which is itself. **Returns:** The Int value itself. ### `__floor__` `__floor__(self) -> Self` Return the floor of the Int value, which is itself. **Returns:** The Int value itself. ### `__round__` `__round__(self) -> Self` Return the rounded value of the Int value, which is itself. **Returns:** The Int value itself. `__round__(self, ndigits: Self) -> Self` Return the rounded value of the Int value, which is itself. **Args:** * ​ndigits (`Self`): The number of digits to round to. **Returns:** The Int value itself if ndigits >= 0 else the rounded value. ### `__trunc__` `__trunc__(self) -> Self` Return the truncated Int value, which is itself. **Returns:** The Int value itself. ### `__ceildiv__` `__ceildiv__(self, denominator: Self) -> Self` Return the rounded-up result of dividing self by denominator. **Args:** * ​denominator (`Self`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. ### `is_power_of_two` `is_power_of_two(self) -> Bool` Check if the integer is a (non-zero) power of two. **Returns:** True if the integer is a power of two, False otherwise. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this integer to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `write_padded` `write_padded[W: Writer](self, mut writer: W, width: Self)` Write the int right-aligned to a set padding. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. * ​width (`Self`): The amount to pad to the left. ### `__str__` `__str__(self) -> String` Get the integer as a string. **Returns:** A string representation. ### `__repr__` `__repr__(self) -> String` Get the integer as a string. Returns the same `String` as `__str__`. **Returns:** A string representation. ### `__hash__` `__hash__(self) -> UInt` Hash the int using builtin hash. **Returns:** A 64-bit hash value. This value is *not* suitable for cryptographic uses. Its intended usage is for data structures. See the `hash` builtin documentation for more details. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with this int value. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `to_python_object` `to_python_object(owned self) -> PythonObject` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. --- ## Intable The `Intable` trait describes a type that can be converted to an Int. Any type that conforms to `Intable` or [`IntableRaising`](/mojo/stdlib/builtin/int/IntableRaising) can construct an `Int`. This trait requires the type to implement the `__int__()` method. 
For example: ```mojo struct Foo(Intable): var i: Int fn __int__(self) -> Int: return self.i ``` Now you can construct an `Int`: ```mojo foo = Foo(42) assert_equal(Int(foo), 42) ``` **Note:** If the `__int__()` method can raise an error, use the [`IntableRaising`](/mojo/stdlib/builtin/int/intableraising) trait instead. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `__int__` `__int__(self: _Self) -> Int` Get the integral representation of the value. **Returns:** The integral representation of the value. --- ## IntableRaising The `IntableRaising` trait describes a type that can be converted to an Int, but the conversion might raise an error. Any type that conforms to [`Intable`](/mojo/stdlib/builtin/int/Intable) or `IntableRaising` can construct an `Int`. This trait requires the type to implement the `__int__()` method, which can raise an error. For example: ```mojo struct Foo(IntableRaising): var i: Int fn __int__(self) raises -> Int: return self.i ``` Now you can construct an `Int`: ```mojo foo = Foo(42) assert_equal(Int(foo), 42) ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__int__` `__int__(self: _Self) -> Int` Get the integral representation of the value. **Returns:** The integral representation of the value. **Raises:** If the type does not have an integral representation. --- ## index `index[T: Indexer](idx: T, /) -> index` Returns the value of `__index__` for the given value. **Parameters:** * ​T (`Indexer`): A type conforming to the `Indexer` trait. **Args:** * ​idx (`T`): The value. **Returns:** An `__mlir_type` representing the index value. --- ## int Implements the Int class. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`Int`](/mojo/stdlib/builtin/int/Int): This type represents an integer value. ## Traits * [​`ImplicitlyIntable`](/mojo/stdlib/builtin/int/ImplicitlyIntable): The `ImplicitlyIntable` trait describes a type that can be converted to an Int implicitly. * [​`Indexer`](/mojo/stdlib/builtin/int/Indexer): The `Indexer` trait is used for types that can index into a collection or pointer. The type returned is the underlying \_\_mlir\_type.index, enabling types like `UInt` to not have to be converted to an `Int` first. This type is implicitly convertible to an `Int`, so can be used anywhere an `Int` can e.g. for comparisons. * [​`Intable`](/mojo/stdlib/builtin/int/Intable): The `Intable` trait describes a type that can be converted to an Int. * [​`IntableRaising`](/mojo/stdlib/builtin/int/IntableRaising): The `IntableRaising` trait describes a type that can be converted to an Int, but the conversion might raise an error. ## Functions * [​`index`](/mojo/stdlib/builtin/int/index-function): Returns the value of `__index__` for the given value. --- ## IntLiteral `@register_passable(trivial)` `struct IntLiteral[value: !pop.int_literal]` This type represents a static integer literal value with infinite precision. This type is a compile-time construct which stores its value as a parameter. It is typically materialized into other types (like `Int`) for use at runtime.
This compile-time representation allows for arbitrary precision constants that would overflow on Int and other fixed precision integer types. ## Parameters * ​value (`!pop.int_literal`): The underlying integer value. ## Implemented traits `AnyType`, `Boolable`, `Ceilable`, `Copyable`, `Defaultable`, `Floorable`, `ImplicitlyBoolable`, `ImplicitlyIntable`, `Indexer`, `Intable`, `Movable`, `Stringable`, `Truncable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Constructor for any value. ### `__bool__` `__bool__(self) -> Bool` Convert this IntLiteral to Bool. **Returns:** False Bool value if the value is equal to 0 and True otherwise. ### `__neg__` `__neg__(self) -> IntLiteral[(0 - value)]` Return -self. **Returns:** The -self value. ### `__pos__` `__pos__(self) -> Self` Return +self. **Returns:** The +self value. ### `__invert__` `__invert__(self) -> IntLiteral[(value ^ -1)]` Return \~self. **Returns:** The \~self value. ### `__lt__` `__lt__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using LT comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is less-than the RHS IntLiteral and False otherwise. ### `__le__` `__le__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using LE comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is less-or-equal than the RHS IntLiteral and False otherwise. ### `__eq__` `__eq__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using EQ comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is equal to the RHS IntLiteral and False otherwise. ### `__ne__` `__ne__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using NE comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is non-equal to the RHS IntLiteral and False otherwise. ### `__gt__` `__gt__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using GT comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is greater-than the RHS IntLiteral and False otherwise. ### `__ge__` `__ge__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using GE comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is greater-or-equal than the RHS IntLiteral and False otherwise. ### `__add__` `__add__(self, rhs: IntLiteral[value]) -> IntLiteral[(value + value)]` Return `self + rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The value to add. **Returns:** `self + rhs` value. ### `__sub__` `__sub__(self, rhs: IntLiteral[value]) -> IntLiteral[(value - value)]` Return `self - rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The value to subtract. **Returns:** `self - rhs` value. ### `__mul__` `__mul__(self, rhs: IntLiteral[value]) -> IntLiteral[(value * value)]` Return `self * rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The value to multiply with. **Returns:** `self * rhs` value. ### `__floordiv__` `__floordiv__(self, rhs: IntLiteral[value]) -> IntLiteral[(value // value)]` Return `self // rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The value to divide with. **Returns:** `self // rhs` value. 
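Because the value is stored as a parameter, arithmetic on literals never overflows; only the final materialized result must fit in the runtime type. A minimal sketch (the alias names are illustrative, not part of the API):

```mojo
alias big = 1_000_000_000_000_000_000 * 1_000_000_000_000_000_000  # 10^36, exact at compile time
alias small = big // 1_000_000_000_000_000_000  # back to 10^18, which fits in an Int

print(small)  # materializes the result to Int: 1000000000000000000
```

Materializing `big` itself into an `Int` should instead produce a compile-time error, since it exceeds the representable range.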
### `__mod__` `__mod__(self, rhs: IntLiteral[value]) -> IntLiteral[(value % value)]` Return the remainder of self divided by rhs. **Args:** * ​rhs (`IntLiteral[value]`): The value to divide on. **Returns:** The remainder of dividing self by rhs. ### `__lshift__` `__lshift__(self, rhs: IntLiteral[value]) -> IntLiteral[(value << value)]` Return `self << rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The value to shift with. **Returns:** `self << rhs`. ### `__rshift__` `__rshift__(self, rhs: IntLiteral[value]) -> IntLiteral[(value >> value)]` Return `self >> rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The value to shift with. **Returns:** `self >> rhs`. ### `__and__` `__and__(self, rhs: IntLiteral[value]) -> IntLiteral[(value & value)]` Return `self & rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The RHS value. **Returns:** `self & rhs`. ### `__or__` `__or__(self, rhs: IntLiteral[value]) -> IntLiteral[(value | value)]` Return `self | rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The RHS value. **Returns:** `self | rhs`. ### `__xor__` `__xor__(self, rhs: IntLiteral[value]) -> IntLiteral[(value ^ value)]` Return `self ^ rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The RHS value. **Returns:** `self ^ rhs`. ### `__as_bool__` `__as_bool__(self) -> Bool` Convert this IntLiteral to Bool. **Returns:** False Bool value if the value is equal to 0 and True otherwise. ### `__int__` `__int__(self) -> Int` Convert from IntLiteral to Int. **Returns:** The value as an integer of platform-specific width. ### `__as_int__` `__as_int__(self) -> Int` Implicitly convert to an Int. **Returns:** An integral value that represents this object. ### `__uint__` `__uint__(self) -> UInt` Convert from IntLiteral to UInt. **Returns:** The value as an unsigned integer of platform-specific width. ### `__ceil__` `__ceil__(self) -> Self` Return the ceiling of the IntLiteral value, which is itself. **Returns:** The IntLiteral value itself. ### `__floor__` `__floor__(self) -> Self` Return the floor of the IntLiteral value, which is itself. **Returns:** The IntLiteral value itself. ### `__trunc__` `__trunc__(self) -> Self` Return the truncated IntLiteral value, which is itself. **Returns:** The IntLiteral value itself. ### `__str__` `__str__(self) -> String` Convert from IntLiteral to String. **Returns:** The value as a string. ### `__ceildiv__` `__ceildiv__(self, denominator: IntLiteral[value]) -> IntLiteral[(0 - (value // (0 - value)))]` Return the rounded-up result of dividing self by denominator. **Args:** * ​denominator (`IntLiteral[value]`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. ### `__index__` `__index__(self) -> index` Convert from IntLiteral to index. **Returns:** The corresponding \_\_mlir\_type.index value, interpreting as signed. --- ## int_literal Implements the IntLiteral class. ## Structs * [​`IntLiteral`](/mojo/stdlib/builtin/int_literal/IntLiteral): This type represents a static integer literal value with infinite precision. This type is a compile-time construct which stores its value as a parameter. It is typically materialized into other types (like `Int`) for use at runtime. This compile-time representation allows for arbitrary precision constants that would overflow on Int and other fixed precision integer types. --- ## io Provides utilities for working with input/output. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`input`](/mojo/stdlib/builtin/io/input): Reads a line of input from the user. * [​`print`](/mojo/stdlib/builtin/io/print): Prints elements to the text stream.
Each element is separated by `sep` and followed by `end`. --- ## input `input(prompt: String = __init__[__mlir_type.!kgen.string]("")) -> String` Reads a line of input from the user. Reads a line from standard input, converts it to a string, and returns that string. If the prompt argument is present, it is written to standard output without a trailing newline. Examples: ```mojo name = input("Enter your name: ") print("Hello", name) ``` If the user enters "Mojo" it prints "Hello Mojo". **Args:** * ​prompt (`String`): An optional string to be printed before reading input. **Returns:** A string containing the line read from the user input. --- ## print `print[*Ts: Writable](*values: *Ts, *, sep: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" "), end: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("\n"), flush: Bool = False, owned file: FileDescriptor = FileDescriptor(1))` Prints elements to the text stream. Each element is separated by `sep` and followed by `end`. **Parameters:** * ​\*Ts (`Writable`): The element types. **Args:** * ​\*values (`*Ts`): The elements to print. * ​sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements. * ​end (`StringSlice[StaticConstantOrigin]`): The String to write after printing the elements. * ​flush (`Bool`): If set to true, then the stream is forcibly flushed. * ​file (`FileDescriptor`): The output stream. --- ## Sized The `Sized` trait describes a type that has an integer length (such as a string or array). Any type that conforms to `Sized` or [`SizedRaising`](/mojo/stdlib/builtin/len/SizedRaising) works with the built-in [`len()`](/mojo/stdlib/builtin/len/len) function. The `Sized` trait requires a type to implement the `__len__()` method. For example: ```mojo struct Foo(Sized): var length: Int fn __len__(self) -> Int: return self.length ``` You can pass an instance of `Foo` to the `len()` function to get its length: ```mojo var foo = Foo(42) print(len(foo) == 42) ``` ```plaintext True ``` **Note:** If the `__len__()` method can raise an error, use the [`SizedRaising`](/mojo/stdlib/builtin/len/SizedRaising) trait instead. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__len__` `__len__(self: _Self) -> Int` Get the length of the type. **Returns:** The length of the type. --- ## SizedRaising The `SizedRaising` trait describes a type that has an integer length, which might raise an error if the length can't be determined. Any type that conforms to [`Sized`](/mojo/stdlib/builtin/len/Sized) or `SizedRaising` works with the built-in [`len()`](/mojo/stdlib/builtin/len/len) function. The `SizedRaising` trait requires a type to implement the `__len__()` method, which can raise an error. For example: ```mojo struct Foo(SizedRaising): var length: Int fn __len__(self) raises -> Int: if self.length < 0: raise Error("Length is negative") return self.length ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__len__` `__len__(self: _Self) -> Int` Get the length of the type. **Returns:** The length of the type. **Raises:** If the length cannot be computed. --- ## UIntSized The `Sized` trait describes a type that has an integer length (such as a string or array). Any type that conforms to `Sized` or [`SizedRaising`](/mojo/stdlib/builtin/len/SizedRaising) works with the built-in [`len()`](/mojo/stdlib/builtin/len/len) function. The `Sized` trait requires a type to implement the `__len__()` method.
For example: ```mojo struct Foo(Sized): var length: Int fn __len__(self) -> Int: return self.length ``` You can pass an instance of `Foo` to the `len()` function to get its length: ```mojo var foo = Foo(42) print(len(foo) == 42) ``` ```plaintext True ``` **Note:** If the `__len__()` method can raise an error, use the [`SizedRaising`](/mojo/stdlib/builtin/len/SizedRaising) trait instead. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__len__` `__len__(self: _Self) -> UInt` Get the length of the type. **Returns:** The length of the type. --- ## len Provides the `len()` function and its associated traits. These are Mojo built-ins, so you don't need to import them. ## Traits * [​`Sized`](/mojo/stdlib/builtin/len/Sized): The `Sized` trait describes a type that has an integer length (such as a string or array). * [​`SizedRaising`](/mojo/stdlib/builtin/len/SizedRaising): The `SizedRaising` trait describes a type that has an integer length, which might raise an error if the length can't be determined. * [​`UIntSized`](/mojo/stdlib/builtin/len/UIntSized): The `Sized` trait describes a type that has an integer length (such as a string or array). ## Functions * [​`len`](/mojo/stdlib/builtin/len/len): Get the length of a value. --- ## len `len[T: Sized](value: T) -> Int` Get the length of a value. **Parameters:** * ​T (`Sized`): The Sized type. **Args:** * ​value (`T`): The object to get the length of. **Returns:** The length of the object. `len[T: SizedRaising](value: T) -> Int` Get the length of a value. **Parameters:** * ​T (`SizedRaising`): The Sized type. **Args:** * ​value (`T`): The object to get the length of. **Returns:** The length of the object. **Raises:** If the length cannot be computed. --- ## Absable The `Absable` trait describes a type that defines an absolute value operation. Types that conform to `Absable` will work with the builtin `abs` function. The absolute value operation always returns the same type as the input. For example: ```mojo @fieldwise_init struct Point(Absable): var x: Float64 var y: Float64 fn __abs__(self) -> Self: return Self(abs(self.x), abs(self.y)) ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__abs__` `__abs__(self: _Self) -> _Self` Get the absolute value of this instance. **Returns:** The absolute value of the instance. --- ## Powable The `Powable` trait describes a type that defines a power operation (i.e. exponentiation) with the same base and exponent types. Types that conform to `Powable` will work with the builtin `pow` function, which will return the same type as the inputs. For example: ```mojo struct Rational(Powable): var numerator: Float64 var denominator: Float64 fn __init__(out self, numerator: Float64, denominator: Float64): self.numerator = numerator self.denominator = denominator fn __pow__(self, exp: Self) -> Self: var exp_value = exp.numerator / exp.denominator return Self(pow(self.numerator, exp_value), pow(self.denominator, exp_value)) ``` You can now use the \*\* operator to exponentiate objects inside generic functions: ```mojo fn exponentiate[T: Powable](base: T, exp: T) -> T: return base ** exp var base = Rational(Float64(3.0), 5.0) var exp = Rational(Float64(1.0), 2.0) var res = exponentiate(base, exp) ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__pow__` `__pow__(self: _Self, exp: _Self) -> _Self` Return the value raised to the power of the given exponent. **Args:** * ​exp (`_Self`): The exponent value.
**Returns:** The value of `self` raised to the power of `exp`. --- ## Roundable The `Roundable` trait describes a type that defines a rounding operation. Types that conform to `Roundable` will work with the builtin `round` function. The round operation always returns the same type as the input. For example: ```mojo @fieldwise_init struct Complex(Roundable): var re: Float64 var im: Float64 fn __round__(self) -> Self: return Self(round(self.re), round(self.im)) fn __round__(self, ndigits: Int) -> Self: return Self(round(self.re, ndigits), round(self.im, ndigits)) ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__round__` `__round__(self: _Self) -> _Self` Get a rounded value for the type. **Returns:** The rounded value. `__round__(self: _Self, ndigits: Int) -> _Self` Get a rounded value for the type. **Args:** * ​ndigits (`Int`): Number of digits after the decimal point. **Returns:** The rounded value. --- ## abs `abs[T: Absable](value: T) -> T` Get the absolute value of the given object. **Parameters:** * ​T (`Absable`): The type conforming to Absable. **Args:** * ​value (`T`): The object to get the absolute value of. **Returns:** The absolute value of the object. --- ## divmod `divmod(numerator: Int, denominator: Int) -> Tuple[Int, Int]` Performs integer division and returns the quotient and the remainder. Currently supported only for integers. Support for more standard library types like Int8, Int16... is planned. This method calls `a.__divmod__(b)`, thus, the actual implementation of divmod should go in the `__divmod__` method of the struct of `a`. **Args:** * ​numerator (`Int`): The dividend. * ​denominator (`Int`): The divisor. **Returns:** A `Tuple` containing the quotient and the remainder. `divmod(numerator: UInt, denominator: UInt) -> Tuple[UInt, UInt]` Performs integer division and returns the quotient and the remainder. Currently supported only for integers. Support for more standard library types like Int8, Int16... is planned. This method calls `a.__divmod__(b)`, thus, the actual implementation of divmod should go in the `__divmod__` method of the struct of `a`. **Args:** * ​numerator (`UInt`): The dividend. * ​denominator (`UInt`): The divisor. **Returns:** A `Tuple` containing the quotient and the remainder. --- ## math Defines basic math functions for use in the open source parts of the standard library since the `math` package is currently closed source and cannot be depended on in the open source parts of the standard library. These are Mojo built-ins, so you don't need to import them. ## Traits * [​`Absable`](/mojo/stdlib/builtin/math/Absable): The `Absable` trait describes a type that defines an absolute value operation. * [​`Powable`](/mojo/stdlib/builtin/math/Powable): The `Powable` trait describes a type that defines a power operation (i.e. exponentiation) with the same base and exponent types. * [​`Roundable`](/mojo/stdlib/builtin/math/Roundable): The `Roundable` trait describes a type that defines a rounding operation. ## Functions * [​`abs`](/mojo/stdlib/builtin/math/abs): Get the absolute value of the given object. * [​`divmod`](/mojo/stdlib/builtin/math/divmod): Performs integer division and returns the quotient and the remainder. * [​`max`](/mojo/stdlib/builtin/math/max): Gets the maximum of two integers. * [​`min`](/mojo/stdlib/builtin/math/min): Gets the minimum of two integers. * [​`pow`](/mojo/stdlib/builtin/math/pow): Computes the `base` raised to the power of the `exp`. 
* [​`round`](/mojo/stdlib/builtin/math/round): Get the rounded value of the given object. --- ## max `max(x: Int, y: Int, /) -> Int` Gets the maximum of two integers. **Args:** * ​x (`Int`): Integer input to max. * ​y (`Int`): Integer input to max. **Returns:** Maximum of x and y. `max(x: UInt, y: UInt, /) -> UInt` Gets the maximum of two integers. **Args:** * ​x (`UInt`): Integer input to max. * ​y (`UInt`): Integer input to max. **Returns:** Maximum of x and y. `max[dtype: DType, //](x: SIMD[dtype, size], y: SIMD[dtype, size], /) -> SIMD[dtype, size]` Performs elementwise maximum of x and y. An element of the result SIMD vector will be the maximum of the corresponding elements in x and y. **Constraints:** The type of the inputs must be numeric or boolean. **Parameters:** * ​dtype (`DType`): The data type of the SIMD vector. **Args:** * ​x (`SIMD[dtype, size]`): First SIMD vector. * ​y (`SIMD[dtype, size]`): Second SIMD vector. **Returns:** A SIMD vector containing the elementwise maximum of x and y. `max[T: Copyable & GreaterThanComparable](x: T, *ys: T) -> T` Gets the maximum value from a sequence of values. **Parameters:** * ​T (`Copyable & GreaterThanComparable`): A type that is both copyable and comparable with greater than. **Args:** * ​x (`T`): The first value to compare. * ​\*ys (`T`): Zero or more additional values to compare. **Returns:** The maximum value from the input sequence. --- ## min `min(x: Int, y: Int, /) -> Int` Gets the minimum of two integers. **Args:** * ​x (`Int`): Integer input to min. * ​y (`Int`): Integer input to min. **Returns:** Minimum of x and y. `min(x: UInt, y: UInt, /) -> UInt` Gets the minimum of two integers. **Args:** * ​x (`UInt`): Integer input to min. * ​y (`UInt`): Integer input to min. **Returns:** Minimum of x and y. `min[dtype: DType, //](x: SIMD[dtype, size], y: SIMD[dtype, size], /) -> SIMD[dtype, size]` Gets the elementwise minimum of x and y. An element of the result SIMD vector will be the minimum of the corresponding elements in x and y. **Constraints:** The type of the inputs must be numeric or boolean. **Parameters:** * ​dtype (`DType`): The data type of the SIMD vector. **Args:** * ​x (`SIMD[dtype, size]`): First SIMD vector. * ​y (`SIMD[dtype, size]`): Second SIMD vector. **Returns:** A SIMD vector containing the elementwise minimum of x and y. `min[T: Copyable & LessThanComparable](x: T, *ys: T) -> T` Gets the minimum value from a sequence of values. **Parameters:** * ​T (`Copyable & LessThanComparable`): A type that is both copyable and comparable with less than. **Args:** * ​x (`T`): The first value to compare. * ​\*ys (`T`): Zero or more additional values to compare. **Returns:** The minimum value from the input sequence. --- ## pow `pow[T: Powable](base: T, exp: T) -> T` Computes the `base` raised to the power of the `exp`. **Parameters:** * ​T (`Powable`): A type conforming to the `Powable` trait. **Args:** * ​base (`T`): The base of the power operation. * ​exp (`T`): The exponent of the power operation. **Returns:** The `base` raised to the power of the `exp`. `pow(base: SIMD[dtype, size], exp: Int) -> SIMD[dtype, size]` Computes elementwise value of a SIMD vector raised to the power of the given integer. **Args:** * ​base (`SIMD[dtype, size]`): The first input argument. * ​exp (`Int`): The second input argument. **Returns:** The `base` elementwise raised to the power of `exp`. --- ## round `round[T: Roundable, //](number: T) -> T` Get the rounded value of the given object.
**Parameters:** * ​T (`Roundable`): The type conforming to Roundable. **Args:** * ​number (`T`): The object to get the rounded value of. **Returns:** The rounded value of the object. `round[T: Roundable, //](number: T, ndigits: Int) -> T` Get the value of this object, rounded to a specified number of digits after the decimal point. **Parameters:** * ​T (`Roundable`): The type conforming to Roundable. **Args:** * ​number (`T`): The object to get the rounded value of. * ​ndigits (`Int`): The number of digits to round to. **Returns:** The rounded value of the object. --- ## NoneType `@register_passable(trivial)` `struct NoneType` Represents the absence of a value. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Construct an instance of the `None` type. `@implicit` `__init__(value: None) -> Self` Construct an instance of the `None` type. **Args:** * ​value (`None`): The MLIR none type to construct from. ### `copy` `copy(self) -> Self` Explicit copy constructor. **Returns:** A copy of the value. ### `__str__` `__str__(self) -> String` Returns the string representation of `None`. **Returns:** `"None"`. ### `__repr__` `__repr__(self) -> String` Returns the string representation of `None`. **Returns:** `"None"`. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Write `None` to a writer stream. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. --- ## none Defines the builtin `NoneType`. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`NoneType`](/mojo/stdlib/builtin/none/NoneType): Represents the absence of a value. --- ## range Implements a 'range' call. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`range`](/mojo/stdlib/builtin/range/range): Constructs a \[0; end) Range. --- ## range `range[T: Indexer, //](end: T) -> _ZeroStartingRange` Constructs a \[0; end) Range. **Parameters:** * ​T (`Indexer`): The type of the end value. **Args:** * ​end (`T`): The end of the range. **Returns:** The constructed range. `range[T: IntableRaising, //](end: T) -> _ZeroStartingRange` Constructs a \[0; end) Range. **Parameters:** * ​T (`IntableRaising`): The type of the end value. **Args:** * ​end (`T`): The end of the range. **Returns:** The constructed range. **Raises:** An error if the conversion to an `Int` failed. `range(end: PythonObject) -> _ZeroStartingRange` Constructs a \[0; end) Range from a Python `int`. **Args:** * ​end (`PythonObject`): The end of the range as a Python `int`. **Returns:** The constructed range. **Raises:** An error if converting `end` to an `Int` failed. `range[T0: Indexer, T1: Indexer, //](start: T0, end: T1) -> _SequentialRange` Constructs a \[start; end) Range. **Parameters:** * ​T0 (`Indexer`): The type of the start value. * ​T1 (`Indexer`): The type of the end value. **Args:** * ​start (`T0`): The start of the range. * ​end (`T1`): The end of the range. **Returns:** The constructed range. `range[T0: IntableRaising, T1: IntableRaising](start: T0, end: T1) -> _SequentialRange` Constructs a \[start; end) Range. **Parameters:** * ​T0 (`IntableRaising`): The type of the start value. * ​T1 (`IntableRaising`): The type of the end value. **Args:** * ​start (`T0`): The start of the range. * ​end (`T1`): The end of the range. **Returns:** The constructed range. 
**Raises:** An error if converting `start` or `end` to an `Int` failed. `range(start: PythonObject, end: PythonObject) -> _SequentialRange` Constructs a \[start; end) Range from Python `int` objects. **Args:** * ​start (`PythonObject`): The start of the range as a Python `int`. * ​end (`PythonObject`): The end of the range as a Python `int`. **Returns:** The constructed range. **Raises:** An error if converting `start` or `end` to an `Int` failed. `range[T0: Indexer, T1: Indexer, T2: Indexer, //](start: T0, end: T1, step: T2) -> _StridedRange` Constructs a \[start; end) Range with a given step. **Parameters:** * ​T0 (`Indexer`): The type of the start value. * ​T1 (`Indexer`): The type of the end value. * ​T2 (`Indexer`): The type of the step value. **Args:** * ​start (`T0`): The start of the range. * ​end (`T1`): The end of the range. * ​step (`T2`): The step for the range. **Returns:** The constructed range. `range[T0: IntableRaising, T1: IntableRaising, T2: IntableRaising, //](start: T0, end: T1, step: T2) -> _StridedRange` Constructs a \[start; end) Range with a given step. **Parameters:** * ​T0 (`IntableRaising`): The type of the start value. * ​T1 (`IntableRaising`): The type of the end value. * ​T2 (`IntableRaising`): The type of the step value. **Args:** * ​start (`T0`): The start of the range. * ​end (`T1`): The end of the range. * ​step (`T2`): The step for the range. **Returns:** The constructed range. **Raises:** An error if converting `start`, `end`, or `step` to an `Int` failed. `range(start: PythonObject, end: PythonObject, step: PythonObject) -> _StridedRange` Constructs a \[start; end) Range from Python `int` objects with a given step. **Args:** * ​start (`PythonObject`): The start of the range as a Python `int`. * ​end (`PythonObject`): The end of the range as a Python `int`. * ​step (`PythonObject`): The step for the range as a Python `int`. **Returns:** The constructed range. **Raises:** An error if converting `start`, `end`, or `step` to an `Int` failed. `range(end: UInt) -> _UIntZeroStartingRange` Constructs a \[0; end) Range. **Args:** * ​end (`UInt`): The end of the range. **Returns:** The constructed range. `range(start: UInt, end: UInt, step: UInt = UInt(1)) -> _UIntStridedRange` Constructs a \[start; end) Range with a given step. **Args:** * ​start (`UInt`): The start of the range. * ​end (`UInt`): The end of the range. * ​step (`UInt`): The step for the range. Defaults to 1. **Returns:** The constructed range. `range[dtype: DType, //](end: SIMD[dtype, 1]) -> _ZeroStartingScalarRange[dtype]` Constructs a \[0; end) Range. **Parameters:** * ​dtype (`DType`): The range dtype. **Args:** * ​end (`SIMD[dtype, 1]`): The end of the range. **Returns:** The constructed range. `range[dtype: DType, //](start: SIMD[dtype, 1], end: SIMD[dtype, 1]) -> _SequentialScalarRange[dtype]` Constructs a \[start; end) Range. **Parameters:** * ​dtype (`DType`): The range dtype. **Args:** * ​start (`SIMD[dtype, 1]`): The start of the range. * ​end (`SIMD[dtype, 1]`): The end of the range. **Returns:** The constructed range. `range[dtype: DType, //](start: SIMD[dtype, 1], end: SIMD[dtype, 1], step: SIMD[dtype, 1]) -> _StridedScalarRange[dtype]` Constructs a \[start; end) Range with a given step. **Parameters:** * ​dtype (`DType`): The range dtype. **Args:** * ​start (`SIMD[dtype, 1]`): The start of the range. * ​end (`SIMD[dtype, 1]`): The end of the range. * ​step (`SIMD[dtype, 1]`): The step for the range.
**Returns:** The constructed range. --- ## rebind Implements type rebind. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`rebind`](/mojo/stdlib/builtin/rebind/rebind): Statically assert that a parameter input type `src_type` resolves to the same type as a parameter result type `dest_type` after function instantiation and "rebind" the input to the result type. --- ## rebind `rebind[src_type: AnyTrivialRegType, //, dest_type: AnyTrivialRegType](src: src_type) -> dest_type` Statically assert that a parameter input type `src_type` resolves to the same type as a parameter result type `dest_type` after function instantiation and "rebind" the input to the result type. This function is meant to be used in uncommon cases where a parametric type depends on the value of a constrained parameter in order to manually refine the type with the constrained parameter value. **Parameters:** * ​src\_type (`AnyTrivialRegType`): The original type. * ​dest\_type (`AnyTrivialRegType`): The type to rebind to. **Args:** * ​src (`src_type`): The value to rebind. **Returns:** The rebound value of `dest_type`. `rebind[src_type: AnyType, //, dest_type: AnyType](ref src: src_type) -> ref [src] dest_type` Statically assert that a parameter input type `src_type` resolves to the same type as a parameter result type `dest_type` after function instantiation and "rebind" the input to the result type, returning a reference to the input value with an adjusted type. This function is meant to be used in uncommon cases where a parametric type depends on the value of a constrained parameter in order to manually refine the type with the constrained parameter value. **Parameters:** * ​src\_type (`AnyType`): The original type. * ​dest\_type (`AnyType`): The type to rebind to. **Args:** * ​src (`src_type`): The value to rebind. **Returns:** A reference to the value rebound as `dest_type`. --- ## Representable A trait that describes a type that has a String representation. Any type that conforms to the `Representable` trait can be used with the `repr` function. Any conforming type must also implement the `__repr__` method. Here is an example:

```mojo
struct Dog(Representable):
    var name: String
    var age: Int

    fn __init__(out self, name: String, age: Int):
        self.name = name
        self.age = age

    fn __repr__(self) -> String:
        return "Dog(name=" + repr(self.name) + ", age=" + repr(self.age) + ")"

var dog = Dog("Rex", 5)
print(repr(dog))  # Dog(name='Rex', age=5)
```

The method `__repr__` should compute the "official" string representation of a type. If at all possible, this should look like a valid Mojo expression that could be used to recreate a struct instance with the same value (given an appropriate environment). So a returned String of the form `module_name.SomeStruct(arg1=value1, arg2=value2)` is advised. If this is not possible, a string of the form `<...some useful description...>` should be returned. The return value must be a `String` instance. This is typically used for debugging, so it is important that the representation is information-rich and unambiguous. Note that when computing the string representation of a collection (`Dict`, `List`, `Set`, etc...), the `repr` function is called on each element, not the `String()` function. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__repr__` `__repr__(self: _Self) -> String` Get the string representation of the type instance, if possible, compatible with Mojo syntax. **Returns:** The string representation of the instance. --- ## repr Provide the `repr` function.
The functions and traits provided here are built-ins, so you don't need to import them. ## Traits * [​`Representable`](/mojo/stdlib/builtin/repr/Representable): A trait that describes a type that has a String representation. ## Functions * [​`repr`](/mojo/stdlib/builtin/repr/repr): Returns the string representation of the given value. --- ## repr `repr[T: Representable](value: T) -> String` Returns the string representation of the given value. **Parameters:** * ​T (`Representable`): The type of `value`. Must implement the `Representable` trait. **Args:** * ​value (`T`): The value to get the string representation of. **Returns:** The string representation of the given value. `repr(value: None) -> String` Returns the string representation of `None`. **Args:** * ​value (`None`): A `None` value. **Returns:** The string representation of `None`. --- ## ReversibleRange The `ReversibleRange` trait describes a range that can be reversed. Any type that conforms to `ReversibleRange` works with the built-in [`reversed()`](/mojo/stdlib/builtin/reversed) function. The `ReversibleRange` trait requires the type to define the `__reversed__()` method. **Note**: iterators are currently non-raising. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__reversed__` `__reversed__(self: _Self) -> _StridedRange` Get a reversed iterator for the type. **Note**: iterators are currently non-raising. **Returns:** The reversed iterator of the type. --- ## reversed Provides the `reversed` function for reverse iteration over collections. These are Mojo built-ins, so you don't need to import them. ## Traits * [​`ReversibleRange`](/mojo/stdlib/builtin/reversed/ReversibleRange): The `ReversibleRange` trait describes a range that can be reversed. ## Functions * [​`reversed`](/mojo/stdlib/builtin/reversed/reversed): Get a reversed iterator of the input range. --- ## reversed `reversed[T: ReversibleRange](value: T) -> _StridedRange` Get a reversed iterator of the input range. **Note**: iterators are currently non-raising. **Parameters:** * ​T (`ReversibleRange`): The type conforming to ReversibleRange. **Args:** * ​value (`T`): The range to get the reversed iterator of. **Returns:** The reversed iterator of the range. `reversed[T: Copyable & Movable](ref value: List[T, hint_trivial_type]) -> _ListIter[T, hint_trivial_type, value_is_origin, False]` Get a reversed iterator of the input list. **Note**: iterators are currently non-raising. **Parameters:** * ​T (`Copyable & Movable`): The type of the elements in the list. **Args:** * ​value (`List[T, hint_trivial_type]`): The list to get the reversed iterator of. **Returns:** The reversed iterator of the list. `reversed[T: Copyable & Movable](ref value: Deque[T]) -> _DequeIter[T, value_is_origin, False]` Get a reversed iterator of the deque. **Note**: iterators are currently non-raising. **Parameters:** * ​T (`Copyable & Movable`): The type of the elements in the deque. **Args:** * ​value (`Deque[T]`): The deque to get the reversed iterator of. **Returns:** The reversed iterator of the deque. `reversed[K: Copyable & Movable & Hashable & EqualityComparable, V: Copyable & Movable](ref value: Dict[K, V]) -> _DictKeyIter[K, V, value_is_origin, False]` Get a reversed iterator of the input dict. **Note**: iterators are currently non-raising. **Parameters:** * ​K (`Copyable & Movable & Hashable & EqualityComparable`): The type of the keys in the dict. * ​V (`Copyable & Movable`): The type of the values in the dict.
**Args:** * ​value (`Dict[K, V]`): The dict to get the reversed iterator of. **Returns:** The reversed iterator of the dict keys. `reversed[K: Copyable & Movable & Hashable & EqualityComparable, V: Copyable & Movable, dict_mutability: Bool, dict_origin: Origin[dict_mutability]](ref value: _DictValueIter[K, V, dict_origin]) -> _DictValueIter[K, V, dict_origin, False]` Get a reversed iterator of the input dict values. **Note**: iterators are currently non-raising. **Parameters:** * ​K (`Copyable & Movable & Hashable & EqualityComparable`): The type of the keys in the dict. * ​V (`Copyable & Movable`): The type of the values in the dict. * ​dict\_mutability (`Bool`): Whether the reference to the dict values is mutable. * ​dict\_origin (`Origin[dict_mutability]`): The origin of the dict values. **Args:** * ​value (`_DictValueIter[K, V, dict_origin]`): The dict values to get the reversed iterator of. **Returns:** The reversed iterator of the dict values. `reversed[K: Copyable & Movable & Hashable & EqualityComparable, V: Copyable & Movable, dict_mutability: Bool, dict_origin: Origin[dict_mutability]](ref value: _DictEntryIter[K, V, dict_origin]) -> _DictEntryIter[K, V, dict_origin, False]` Get a reversed iterator of the input dict items. **Note**: iterators are currently non-raising. **Parameters:** * ​K (`Copyable & Movable & Hashable & EqualityComparable`): The type of the keys in the dict. * ​V (`Copyable & Movable`): The type of the values in the dict. * ​dict\_mutability (`Bool`): Whether the reference to the dict items is mutable. * ​dict\_origin (`Origin[dict_mutability]`): The origin of the dict items. **Args:** * ​value (`_DictEntryIter[K, V, dict_origin]`): The dict items to get the reversed iterator of. **Returns:** The reversed iterator of the dict items. `reversed[T: Copyable & Movable](value: Span[T, origin]) -> _SpanIter[T, origin, False]` Get a reversed iterator of the input Span. **Note**: iterators are currently non-raising. **Parameters:** * ​T (`Copyable & Movable`): The type of the elements in the Span. **Args:** * ​value (`Span[T, origin]`): The Span to get the reversed iterator of. **Returns:** The reversed iterator of the Span. --- ## SIMD `@register_passable(trivial)` `struct SIMD[dtype: DType, size: Int]` Represents a small vector that is backed by a hardware vector element. SIMD allows a single instruction to be executed across the multiple data elements of the vector. **Constraints:** The size of the SIMD vector must be positive and a power of 2. ## Parameters * ​dtype (`DType`): The data type of SIMD vector elements. * ​size (`Int`): The size of the SIMD vector. ## Fields * ​value: The underlying MLIR `simd` storage for the vector. ## Implemented traits `Absable`, `AnyType`, `Boolable`, `CeilDivable`, `Ceilable`, `Copyable`, `Defaultable`, `DevicePassable`, `ExplicitlyCopyable`, `Floatable`, `Floorable`, `Hashable`, `Indexer`, `Intable`, `Movable`, `Powable`, `PythonConvertible`, `Representable`, `Roundable`, `Sized`, `Stringable`, `Truncable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Aliases ### `device_type` `alias device_type = SIMD[dtype, size]` SIMD types are remapped to the same type when passed to accelerator devices. ### `element_type` `alias element_type = dtype` ### `MAX` `alias MAX = SIMD(max_or_inf[::DType]())` Gets the maximum value for the SIMD value, potentially +inf. ### `MAX_FINITE` `alias MAX_FINITE = SIMD(max_finite[::DType]())` Returns the maximum finite value of SIMD value.
### `MIN` `alias MIN = SIMD(min_or_neg_inf[::DType]())` Gets the minimum value for the SIMD value, potentially -inf. ### `MIN_FINITE` `alias MIN_FINITE = SIMD(min_finite[::DType]())` Returns the minimum (lowest) finite value of SIMD value. ## Methods ### `__init__` `__init__() -> Self` Default initializer of the SIMD vector. By default the SIMD vectors are initialized to all zeros. `__init__[other_dtype: DType, //](value: SIMD[other_dtype, size], /) -> Self` Initialize from another SIMD of the same size. If the value passed is a scalar, you can initialize a SIMD vector with more elements. Example:

```mojo
print(UInt64(UInt8(42)))  # 42
print(SIMD[DType.uint64, 4](UInt8(42)))  # [42, 42, 42, 42]
```

Casting behavior:

```mojo
# Basic casting preserves value within range
Int8(UInt8(127)) == Int8(127)

# Numbers above signed max wrap to negative using two's complement
Int8(UInt8(128)) == Int8(-128)
Int8(UInt8(129)) == Int8(-127)
Int8(UInt8(256)) == Int8(0)

# Negative signed cast to unsigned using two's complement
UInt8(Int8(-128)) == UInt8(128)
UInt8(Int8(-127)) == UInt8(129)
UInt8(Int8(-1)) == UInt8(255)

# Truncate precision after downcast and upcast
Float64(Float32(Float64(123456789.123456789))) == Float64(123456792.0)

# Rightmost bits of significand become 0's on upcast
Float64(Float32(0.3)) == Float64(0.30000001192092896)

# Numbers equal after truncation of float literal and cast truncation
Float32(Float64(123456789.123456789)) == Float32(123456789.123456789)

# Float to int/uint floors
Int64(Float64(42.2)) == Int64(42)
```

**Parameters:** * ​other\_dtype (`DType`): The type of the value that is being cast from. **Args:** * ​value (`SIMD[other_dtype, size]`): The value to cast from. `@implicit` `__init__(value: UInt, /) -> Self` Initializes the SIMD vector with an unsigned integer. The unsigned integer value is splatted across all the elements of the SIMD vector. **Args:** * ​value (`UInt`): The input value. `@implicit` `__init__(value: Int, /) -> Self` Initializes the SIMD vector with a signed integer. The signed integer value is splatted across all the elements of the SIMD vector. **Args:** * ​value (`Int`): The input value. `__init__[T: Floatable, //](value: T, /) -> SIMD[float64, 1]` Initialize a Float64 from a type conforming to Floatable. **Parameters:** * ​T (`Floatable`): The Floatable type. **Args:** * ​value (`T`): The object to get the floating-point representation of. `__init__[T: FloatableRaising, //](out self: SIMD[float64, 1], value: T, /)` Initialize a Float64 from a type conforming to FloatableRaising. **Parameters:** * ​T (`FloatableRaising`): The FloatableRaising type. **Args:** * ​value (`T`): The object to get the floating-point representation of. **Raises:** If the type does not have a floating-point representation. `__init__[*, _: Int = 0](out self: SIMD[float64, 1], value: PythonObject, /)` Initialize a Float64 from a PythonObject. **Parameters:** * ​\_ (`Int`): A dummy parameter to ensure this overload has lower priority than the others. Its value is ignored. **Args:** * ​value (`PythonObject`): The PythonObject to convert. **Raises:** If the conversion to double fails. `@implicit` `__init__(value: IntLiteral[value], /) -> Self` Initializes the SIMD vector with an integer. The integer value is splatted across all the elements of the SIMD vector. **Args:** * ​value (`IntLiteral[value]`): The input value. `@implicit` `__init__(value: Bool, /) -> SIMD[bool, size]` Initializes the SIMD vector with a bool value. The bool value is splatted across all elements of the SIMD vector.
**Args:** * ​value (`Bool`): The bool value. `@implicit` `__init__(value, /) -> Self` Initializes the SIMD vector with the underlying MLIR `simd` value. **Args:** * ​value: The input MLIR `simd` value. `@implicit` `__init__(value: SIMD[dtype, 1], /) -> Self` Constructs a SIMD vector by splatting a scalar value. The input value is splatted across all elements of the SIMD vector. **Args:** * ​value (`SIMD[dtype, 1]`): The value to splat to the elements of the vector. `__init__(*elems: SIMD[dtype, 1], *, __list_literal__: Tuple[] = Tuple()) -> Self` Constructs a SIMD vector via a variadic list of elements. The input values are assigned to the corresponding elements of the SIMD vector. **Constraints:** The number of input values is equal to size of the SIMD vector. **Args:** * ​\*elems (`SIMD[dtype, 1]`): The variadic list of elements from which the SIMD vector is constructed. * ​\_\_list\_literal\_\_ (`Tuple[]`): Tell Mojo to use this method for list literals. `@implicit` `__init__(value: FloatLiteral[value], /) -> Self` Initializes the SIMD vector with a float. The value is splatted across all the elements of the SIMD vector. **Args:** * ​value (`FloatLiteral[value]`): The input value. ### `__bool__` `__bool__(self) -> Bool` Converts the SIMD scalar into a boolean value. **Constraints:** The size of the SIMD vector must be 1. **Returns:** True if the SIMD scalar is non-zero and False otherwise. ### `__getitem__` `__getitem__(self, idx: Int) -> SIMD[dtype, 1]` Gets an element from the vector. **Args:** * ​idx (`Int`): The element index. **Returns:** The value at position `idx`. ### `__setitem__` `__setitem__(mut self, idx: Int, val: SIMD[dtype, 1])` Sets an element in the vector. **Args:** * ​idx (`Int`): The index to set. * ​val (`SIMD[dtype, 1]`): The value to set. ### `__neg__` `__neg__(self) -> Self` Defines the unary `-` operation. **Returns:** The negation of this SIMD vector. ### `__pos__` `__pos__(self) -> Self` Defines the unary `+` operation. **Returns:** This SIMD vector. ### `__invert__` `__invert__(self) -> Self` Returns `~self`. **Constraints:** The element type of the SIMD vector must be boolean or integral. **Returns:** The `~self` value. ### `__lt__` `__lt__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using less-than comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] < rhs[i]`. ### `__le__` `__le__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using less-than-or-equal comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] <= rhs[i]`. ### `__eq__` `__eq__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using equal-to comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] == rhs[i]`. ### `__ne__` `__ne__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using not-equal comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] != rhs[i]`.
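For example, a minimal sketch of the comparison operators above (illustrative values; the output comments reflect the documented `SIMD[bool, size]` result format):

```mojo
fn main():
    var a = SIMD[DType.int32, 4](1, 5, 3, 7)
    var b = SIMD[DType.int32, 4](2, 5, 1, 7)
    # Each comparison is applied lane by lane and yields a boolean mask,
    # not a single Bool.
    print(a == b)  # [False, True, False, True]
    print(a != b)  # [True, False, True, False]
    print(a < b)   # [True, False, False, False]
```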
### `__gt__` `__gt__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using greater-than comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] > rhs[i]`. ### `__ge__` `__ge__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using greater-than-or-equal comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] >= rhs[i]`. ### `__contains__` `__contains__(self, value: SIMD[dtype, 1]) -> Bool` Whether the vector contains the value. **Args:** * ​value (`SIMD[dtype, 1]`): The value. **Returns:** Whether the vector contains the value. ### `__add__` `__add__(self, rhs: Self) -> Self` Computes `self + rhs`. **Args:** * ​rhs (`Self`): The rhs value. **Returns:** A new vector whose element at position `i` is computed as `self[i] + rhs[i]`. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Computes `self - rhs`. **Args:** * ​rhs (`Self`): The rhs value. **Returns:** A new vector whose element at position `i` is computed as `self[i] - rhs[i]`. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Computes `self * rhs`. **Args:** * ​rhs (`Self`): The rhs value. **Returns:** A new vector whose element at position `i` is computed as `self[i] * rhs[i]`. ### `__truediv__` `__truediv__(self, rhs: Self) -> Self` Computes `self / rhs`. **Args:** * ​rhs (`Self`): The rhs value. **Returns:** A new vector whose element at position `i` is computed as `self[i] / rhs[i]`. ### `__floordiv__` `__floordiv__(self, rhs: Self) -> Self` Returns the division of self and rhs rounded down to the nearest integer. **Constraints:** The element type of the SIMD vector must be numeric. **Args:** * ​rhs (`Self`): The value to divide with. **Returns:** `floor(self / rhs)` value. ### `__mod__` `__mod__(self, rhs: Self) -> Self` Returns the remainder of self divided by rhs. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** The remainder of dividing self by rhs. ### `__pow__` `__pow__(self, exp: Int) -> Self` Computes the vector raised to the power of the input integer value. **Args:** * ​exp (`Int`): The exponent value. **Returns:** A SIMD vector where each element is raised to the power of the specified exponent value. `__pow__(self, exp: Self) -> Self` Computes the vector raised elementwise to the right hand side power. **Args:** * ​exp (`Self`): The exponent value. **Returns:** A SIMD vector where each element is raised to the power of the specified exponent value. ### `__lshift__` `__lshift__(self, rhs: Self) -> Self` Returns `self << rhs`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self << rhs`. ### `__rshift__` `__rshift__(self, rhs: Self) -> Self` Returns `self >> rhs`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self >> rhs`. ### `__and__` `__and__(self, rhs: Self) -> Self` Returns `self & rhs`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self & rhs`. ### `__or__` `__or__(self, rhs: Self) -> Self` Returns `self | rhs`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self | rhs`. ### `__xor__` `__xor__(self, rhs: Self) -> Self` Returns `self ^ rhs`.
**Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self ^ rhs`. ### `__radd__` `__radd__(self, value: Self) -> Self` Returns `value + self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value + self`. ### `__rsub__` `__rsub__(self, value: Self) -> Self` Returns `value - self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value - self`. ### `__rmul__` `__rmul__(self, value: Self) -> Self` Returns `value * self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value * self`. ### `__rtruediv__` `__rtruediv__(self, value: Self) -> Self` Returns `value / self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value / self`. ### `__rfloordiv__` `__rfloordiv__(self, rhs: Self) -> Self` Returns the division of rhs and self rounded down to the nearest integer. **Constraints:** The element type of the SIMD vector must be numeric. **Args:** * ​rhs (`Self`): The value to divide by self. **Returns:** `floor(rhs / self)` value. ### `__rmod__` `__rmod__(self, value: Self) -> Self` Returns `value mod self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value mod self`. ### `__rpow__` `__rpow__(self, base: Self) -> Self` Returns `base ** self`. **Args:** * ​base (`Self`): The base value. **Returns:** `base ** self`. ### `__rlshift__` `__rlshift__(self, value: Self) -> Self` Returns `value << self`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​value (`Self`): The other value. **Returns:** `value << self`. ### `__rrshift__` `__rrshift__(self, value: Self) -> Self` Returns `value >> self`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​value (`Self`): The other value. **Returns:** `value >> self`. ### `__rand__` `__rand__(self, value: Self) -> Self` Returns `value & self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​value (`Self`): The other value. **Returns:** `value & self`. ### `__ror__` `__ror__(self, value: Self) -> Self` Returns `value | self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​value (`Self`): The other value. **Returns:** `value | self`. ### `__rxor__` `__rxor__(self, value: Self) -> Self` Returns `value ^ self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​value (`Self`): The other value. **Returns:** `value ^ self`. ### `__iadd__` `__iadd__(mut self, rhs: Self)` Performs in-place addition. The vector is mutated where each element at position `i` is computed as `self[i] + rhs[i]`. **Args:** * ​rhs (`Self`): The rhs of the addition operation. ### `__isub__` `__isub__(mut self, rhs: Self)` Performs in-place subtraction. The vector is mutated where each element at position `i` is computed as `self[i] - rhs[i]`. **Args:** * ​rhs (`Self`): The rhs of the operation. ### `__imul__` `__imul__(mut self, rhs: Self)` Performs in-place multiplication. The vector is mutated where each element at position `i` is computed as `self[i] * rhs[i]`. **Args:** * ​rhs (`Self`): The rhs of the operation. ### `__itruediv__` `__itruediv__(mut self, rhs: Self)` In-place true divide operator. The vector is mutated where each element at position `i` is computed as `self[i] / rhs[i]`. **Args:** * ​rhs (`Self`): The rhs of the operation. ### `__ifloordiv__` `__ifloordiv__(mut self, rhs: Self)` In-place floor div operator. The vector is mutated where each element at position `i` is computed as `self[i] // rhs[i]`.
**Args:** * ​rhs (`Self`): The rhs of the operation. ### `__imod__` `__imod__(mut self, rhs: Self)` In-place mod operator. The vector is mutated where each element at position `i` is computed as `self[i] % rhs[i]`. **Args:** * ​rhs (`Self`): The rhs of the operation. ### `__ipow__` `__ipow__(mut self, rhs: Int)` In-place pow operator. The vector is mutated where each element at position `i` is computed as `pow(self[i], rhs)`. **Args:** * ​rhs (`Int`): The rhs of the operation. ### `__ilshift__` `__ilshift__(mut self, rhs: Self)` Computes `self << rhs` and saves the result in `self`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​rhs (`Self`): The RHS value. ### `__irshift__` `__irshift__(mut self, rhs: Self)` Computes `self >> rhs` and saves the result in `self`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​rhs (`Self`): The RHS value. ### `__iand__` `__iand__(mut self, rhs: Self)` Computes `self & rhs` and saves the result in `self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. ### `__ixor__` `__ixor__(mut self, rhs: Self)` Computes `self ^ rhs` and saves the result in `self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. ### `__ior__` `__ior__(mut self, rhs: Self)` Computes `self | rhs` and saves the result in `self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. ### `get_type_name` `static get_type_name() -> String` Gets this type's name, for use in error messages when handing arguments to kernels. TODO: This will go away soon, when we get better error messages for kernel calls. **Returns:** This type's name. ### `get_device_type_name` `static get_device_type_name() -> String` Gets device\_type's name, for use in error messages when handing arguments to kernels. TODO: This will go away soon, when we get better error messages for kernel calls. **Returns:** This type's name. ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value. ### `from_bits` `static from_bits[int_dtype: DType, //](value: SIMD[int_dtype, size]) -> Self` Initializes the SIMD vector from the bits of an integral SIMD vector. **Parameters:** * ​int\_dtype (`DType`): The integral type of the input SIMD vector. **Args:** * ​value (`SIMD[int_dtype, size]`): The SIMD vector to copy the bits from. **Returns:** The bitcast SIMD vector. ### `to_python_object` `to_python_object(owned self) -> PythonObject` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. ### `__len__` `__len__(self) -> Int` Gets the length of the SIMD vector. **Returns:** The length of the SIMD vector. ### `__int__` `__int__(self) -> Int` Casts the value to an Int. If there is a fractional component, then the fractional part is truncated. **Constraints:** The size of the SIMD vector must be 1. **Returns:** The value as an integer. ### `__index__` `__index__(self) -> index` Convert to index. **Returns:** The corresponding \_\_mlir\_type.index value. ### `__float__` `__float__(self) -> SIMD[float64, 1]` Casts the value to a float. **Constraints:** The size of the SIMD vector must be 1. **Returns:** The value as a float. ### `__str__` `__str__(self) -> String` Get the SIMD as a string. **Returns:** A string representation. ### `__repr__` `__repr__(self) -> String` Get the representation of the SIMD value e.g. "SIMD\[DType.int8, 2]\(1, 2)". **Returns:** The representation of the SIMD value.
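A short sketch of the conversion and formatting methods above (the `repr` output follows the format documented for `__repr__`):

```mojo
fn main():
    var v = SIMD[DType.int8, 2](1, 2)
    print(v)        # [1, 2]
    print(repr(v))  # SIMD[DType.int8, 2](1, 2)
    # __int__ requires a size-1 vector and truncates any fractional part.
    var x = Float64(42.9)
    print(Int(x))   # 42
```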
### `__floor__` `__floor__(self) -> Self` Performs elementwise floor on the elements of a SIMD vector. **Returns:** The elementwise floor of this SIMD vector. ### `__ceil__` `__ceil__(self) -> Self` Performs elementwise ceiling on the elements of a SIMD vector. **Returns:** The elementwise ceiling of this SIMD vector. ### `__trunc__` `__trunc__(self) -> Self` Performs elementwise truncation on the elements of a SIMD vector. **Returns:** The elementwise truncated values of this SIMD vector. ### `__abs__` `__abs__(self) -> Self` Defines the absolute value operation. **Returns:** The absolute value of this SIMD vector. ### `__round__` `__round__(self) -> Self` Performs elementwise rounding on the elements of a SIMD vector. This rounding goes to the nearest integer with ties away from zero. **Returns:** The elementwise rounded value of this SIMD vector. `__round__(self, ndigits: Int) -> Self` Performs elementwise rounding on the elements of a SIMD vector. This rounding goes to the nearest integer with ties away from zero. **Args:** * ​ndigits (`Int`): The number of digits to round to. **Returns:** The elementwise rounded value of this SIMD vector. ### `__hash__` `__hash__(self) -> UInt` Hash the value using builtin hash. **Returns:** A 64-bit hash value. This value is *not* suitable for cryptographic uses. Its intended usage is for data structures. See the `hash` builtin documentation for more details. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with this SIMD value. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `__ceildiv__` `__ceildiv__(self, denominator: Self) -> Self` Return the rounded-up result of dividing self by denominator. **Args:** * ​denominator (`Self`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. ### `cast` `cast[target: DType](self) -> SIMD[target, size]` Casts the elements of the SIMD vector to the target element type. Casting behavior:

```mojo
# Basic casting preserves value within range
Int8(UInt8(127)) == Int8(127)

# Numbers above signed max wrap to negative using two's complement
Int8(UInt8(128)) == Int8(-128)
Int8(UInt8(129)) == Int8(-127)
Int8(UInt8(256)) == Int8(0)

# Negative signed cast to unsigned using two's complement
UInt8(Int8(-128)) == UInt8(128)
UInt8(Int8(-127)) == UInt8(129)
UInt8(Int8(-1)) == UInt8(255)

# Truncate precision after downcast and upcast
Float64(Float32(Float64(123456789.123456789))) == Float64(123456792.0)

# Rightmost bits of significand become 0's on upcast
Float64(Float32(0.3)) == Float64(0.30000001192092896)

# Numbers equal after truncation of float literal and cast truncation
Float32(Float64(123456789.123456789)) == Float32(123456789.123456789)

# Float to int/uint floors
Int64(Float64(42.2)) == Int64(42)
```

**Parameters:** * ​target (`DType`): The target DType. **Returns:** A new SIMD vector whose elements have been cast to the target element type. ### `is_power_of_two` `is_power_of_two(self) -> SIMD[bool, size]` Checks if the input value is a power of 2 for each element of a SIMD vector. **Constraints:** The element type of the input vector must be integral. **Returns:** A SIMD value where the element at position `i` is True if the integer at position `i` of the input value is a power of 2, False otherwise. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this SIMD value to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait.
**Args:** * ​writer (`W`): The object to write to. ### `to_bits` `to_bits[int_dtype: DType = _integral_type_of[::DType]()](self) -> SIMD[int_dtype, size]` Bitcasts the SIMD vector to an integer SIMD vector. **Parameters:** * ​int\_dtype (`DType`): The integer type to cast to. **Returns:** An integer representation of the floating-point value. ### `from_bytes` `static from_bytes[big_endian: Bool = is_big_endian[__mlir_type.!kgen.target]()](bytes: InlineArray[SIMD[uint8, 1], dtype.sizeof()]) -> SIMD[dtype, 1]` Converts a byte array to a scalar integer. **Parameters:** * ​big\_endian (`Bool`): Whether the byte array is big-endian. **Args:** * ​bytes (`InlineArray[SIMD[uint8, 1], dtype.sizeof()]`): The byte array to convert. **Returns:** The integer value. ### `as_bytes` `as_bytes[big_endian: Bool = is_big_endian[__mlir_type.!kgen.target]()](self) -> InlineArray[SIMD[uint8, 1], dtype.sizeof()]` Convert the scalar integer to a byte array. **Parameters:** * ​big\_endian (`Bool`): Whether the byte array should be big-endian. **Returns:** The byte array. ### `clamp` `clamp(self, lower_bound: Self, upper_bound: Self) -> Self` Clamps the values in a SIMD vector to be in a certain range. Clamp cuts values in the input SIMD vector off at the upper bound and lower bound values. For example, SIMD vector `[0, 1, 2, 3]` clamped to a lower bound of 1 and an upper bound of 2 would return `[1, 1, 2, 2]`. **Args:** * ​lower\_bound (`Self`): Minimum of the range to clamp to. * ​upper\_bound (`Self`): Maximum of the range to clamp to. **Returns:** A new SIMD vector containing x clamped to be within lower\_bound and upper\_bound. ### `fma` `fma(self, multiplier: Self, accumulator: Self) -> Self` Performs a fused multiply-add operation, i.e. `self*multiplier + accumulator`. **Args:** * ​multiplier (`Self`): The value to multiply. * ​accumulator (`Self`): The value to accumulate. **Returns:** A new vector whose element at position `i` is computed as `self[i]*multiplier[i] + accumulator[i]`. ### `shuffle` `shuffle[*mask: Int](self) -> Self` Shuffles (also called blend) the values of the current vector with the `other` value using the specified mask (permutation). The mask values must be within `2 * len(self)`. **Parameters:** * ​\*mask (`Int`): The permutation to use in the shuffle. **Returns:** A new vector with the same length as the mask where the value at position `i` is `(self)[permutation[i]]`. `shuffle[*mask: Int](self, other: Self) -> Self` Shuffles (also called blend) the values of the current vector with the `other` value using the specified mask (permutation). The mask values must be within `2 * len(self)`. **Parameters:** * ​\*mask (`Int`): The permutation to use in the shuffle. **Args:** * ​other (`Self`): The other vector to shuffle with. **Returns:** A new vector with the same length as the mask where the value at position `i` is `(self + other)[permutation[i]]`. `shuffle[: DType, //, mask: IndexList[size, element_type=$0]](self) -> Self` Shuffles (also called blend) the values of the current vector with the `other` value using the specified mask (permutation). The mask values must be within `2 * len(self)`. **Parameters:** * ​mask (`IndexList[size, element_type=$0]`): The permutation to use in the shuffle. **Returns:** A new vector with the same length as the mask where the value at position `i` is `(self)[permutation[i]]`.
`shuffle[: DType, //, mask: IndexList[size, element_type=$0]](self, other: Self) -> Self` Shuffles (also called blend) the values of the current vector with the `other` value using the specified mask (permutation). The mask values must be within `2 * len(self)`. **Parameters:** * ​mask (`IndexList[size, element_type=$0]`): The permutation to use in the shuffle. **Args:** * ​other (`Self`): The other vector to shuffle with. **Returns:** A new vector with the same length as the mask where the value at position `i` is `(self + other)[permutation[i]]`. ### `slice` `slice[output_width: Int, /, *, offset: Int = 0](self) -> SIMD[dtype, output_width]` Returns a slice of the vector of the specified width with the given offset. **Constraints:** `output_width + offset` must not exceed the size of this SIMD vector. **Parameters:** * ​output\_width (`Int`): The output SIMD vector size. * ​offset (`Int`): The given offset for the slice. **Returns:** A new vector whose elements map to `self[offset:offset+output_width]`. ### `insert` `insert[*, offset: Int = 0](self, value: SIMD[dtype, size]) -> Self` Returns a new vector where the elements between `offset` and `offset + input_width` have been replaced with the elements in `value`. **Parameters:** * ​offset (`Int`): The offset to insert at. **Args:** * ​value (`SIMD[dtype, size]`): The value to be inserted. **Returns:** A new vector whose elements at `self[offset:offset+input_width]` contain the values of `value`. ### `join` `join(self, other: Self) -> SIMD[dtype, (size * 2)]` Concatenates the two vectors together. **Args:** * ​other (`Self`): The other SIMD vector. **Returns:** A new vector `self_0, self_1, ..., self_n, other_0, ..., other_n`. ### `interleave` `interleave(self, other: Self) -> SIMD[dtype, (size * 2)]` Constructs a vector by interleaving two input vectors. **Args:** * ​other (`Self`): The other SIMD vector. **Returns:** A new vector `self_0, other_0, ..., self_n, other_n`. ### `split` `split(self) -> Tuple[SIMD[dtype, size // 2], SIMD[dtype, size // 2]]` Splits the SIMD vector into 2 subvectors. **Returns:** A new vector `self_0:N/2, self_N/2:N`. ### `deinterleave` `deinterleave(self) -> Tuple[SIMD[dtype, size // 2], SIMD[dtype, size // 2]]` Constructs two vectors by deinterleaving the even and odd lanes of the vector. **Constraints:** The vector size must be greater than 1. **Returns:** Two vectors, the first of the form `self_0, self_2, ..., self_{n-2}` and the other being `self_1, self_3, ..., self_{n-1}`. ### `reduce` `reduce[func: fn[Int](SIMD[dtype, $0], SIMD[dtype, $0]) -> SIMD[dtype, $0], size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using a provided reduce operator. **Constraints:** `size_out` must not exceed width of the vector. **Parameters:** * ​func (`fn[Int](SIMD[dtype, $0], SIMD[dtype, $0]) -> SIMD[dtype, $0]`): The reduce function to apply to elements in this SIMD. * ​size\_out (`Int`): The width of the reduction. **Returns:** A new scalar which is the reduction of all vector elements.
`reduce[func: fn[Int](SIMD[dtype, $0], SIMD[dtype, $0]) capturing -> SIMD[dtype, $0], size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using a provided reduce operator. **Constraints:** `size_out` must not exceed width of the vector. **Parameters:** * ​func (`fn[Int](SIMD[dtype, $0], SIMD[dtype, $0]) capturing -> SIMD[dtype, $0]`): The reduce function to apply to elements in this SIMD. * ​size\_out (`Int`): The width of the reduction. **Returns:** A new scalar which is the reduction of all vector elements. ### `reduce_max` `reduce_max[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the `max` operator. **Constraints:** `size_out` must not exceed width of the vector. The element type of the vector must be integer or FP. **Parameters:** * ​size\_out (`Int`): The width of the reduction. **Returns:** The maximum element of the vector. ### `reduce_min` `reduce_min[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the `min` operator. **Constraints:** `size_out` must not exceed width of the vector. The element type of the vector must be integer or FP. **Parameters:** * ​size\_out (`Int`): The width of the reduction. **Returns:** The minimum element of the vector. ### `reduce_add` `reduce_add[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the `add` operator. **Constraints:** `size_out` must not exceed width of the vector. **Parameters:** * ​size\_out (`Int`): The width of the reduction. **Returns:** The sum of all vector elements. ### `reduce_mul` `reduce_mul[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the `mul` operator. **Constraints:** `size_out` must not exceed width of the vector. The element type of the vector must be integer or FP. **Parameters:** * ​size\_out (`Int`): The width of the reduction. **Returns:** The product of all vector elements. ### `reduce_and` `reduce_and[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the bitwise `&` operator. **Constraints:** `size_out` must not exceed width of the vector. The element type of the vector must be integer or boolean. **Parameters:** * ​size\_out (`Int`): The width of the reduction. **Returns:** The reduced vector. ### `reduce_or` `reduce_or[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the bitwise `|` operator. **Constraints:** `size_out` must not exceed width of the vector. The element type of the vector must be integer or boolean. **Parameters:** * ​size\_out (`Int`): The width of the reduction. **Returns:** The reduced vector. ### `reduce_bit_count` `reduce_bit_count(self) -> Int` Returns the total number of bits set in the SIMD vector. **Constraints:** Must be either an integral or a boolean type. **Returns:** Count of set bits across all elements of the vector. ### `select` `select[dtype: DType](self, true_case: SIMD[dtype, size], false_case: SIMD[dtype, size]) -> SIMD[dtype, size]` Selects the values of the `true_case` or the `false_case` based on the current boolean values of the SIMD vector. **Constraints:** The element type of the vector must be boolean. **Parameters:** * ​dtype (`DType`): The element type of the input and output SIMD vectors. **Args:** * ​true\_case (`SIMD[dtype, size]`): The values selected if the positional value is True. * ​false\_case (`SIMD[dtype, size]`): The values selected if the positional value is False. 
**Returns:** A new vector of the form `[true_case[i] if elem else false_case[i] for i, elem in enumerate(self)]`. ### `rotate_left` `rotate_left[shift: Int](self) -> Self` Shifts the elements of a SIMD vector to the left by `shift` elements (with wrap-around). **Constraints:** `-size <= shift < size` **Parameters:** * ​shift (`Int`): The number of positions by which to rotate the elements of SIMD vector to the left (with wrap-around). **Returns:** The SIMD vector rotated to the left by `shift` elements (with wrap-around). ### `rotate_right` `rotate_right[shift: Int](self) -> Self` Shifts the elements of a SIMD vector to the right by `shift` elements (with wrap-around). **Constraints:** `-size < shift <= size` **Parameters:** * ​shift (`Int`): The number of positions by which to rotate the elements of SIMD vector to the right (with wrap-around). **Returns:** The SIMD vector rotated to the right by `shift` elements (with wrap-around). ### `shift_left` `shift_left[shift: Int](self) -> Self` Shifts the elements of a SIMD vector to the left by `shift` elements (no wrap-around, fill with zero). **Constraints:** `0 <= shift <= size` **Parameters:** * ​shift (`Int`): The number of positions by which to rotate the elements of SIMD vector to the left (no wrap-around, fill with zero). **Returns:** The SIMD vector rotated to the left by `shift` elements (no wrap-around, fill with zero). ### `shift_right` `shift_right[shift: Int](self) -> Self` Shifts the elements of a SIMD vector to the right by `shift` elements (no wrap-around, fill with zero). **Constraints:** `0 <= shift <= size` **Parameters:** * ​shift (`Int`): The number of positions by which to rotate the elements of SIMD vector to the right (no wrap-around, fill with zero). **Returns:** The SIMD vector rotated to the right by `shift` elements (no wrap-around, fill with zero). ### `reversed` `reversed(self) -> Self` Reverses the SIMD vector by indexes. Examples:

```mojo
print(SIMD[DType.uint8, 4](1, 2, 3, 4).reversed())  # [4, 3, 2, 1]
```

**Returns:** The reversed vector. --- ## simd Implements SIMD primitives and abstractions. Provides high-performance SIMD primitives and abstractions for vectorized computation in Mojo. It enables efficient data-parallel operations by leveraging hardware vector processing units across different architectures. Key Features:

1. Architecture-agnostic SIMD abstractions with automatic hardware detection
2. Optimized vector operations for common numerical computations
3. Explicit control over vectorization strategies and memory layouts
4. Zero-cost abstractions that compile to efficient machine code
5. Support for different vector widths and element types

Primary Components:

* Vector types: Strongly-typed vector containers with element-wise operations
* SIMD intrinsics: Low-level access to hardware SIMD instructions
* Vectorized algorithms: Common algorithms optimized for SIMD execution
* Memory utilities: Aligned memory allocation and vector load/store operations

Performance Considerations:

* Vector width selection should match target hardware capabilities
* Memory alignment affects load/store performance
* Data layout transformations may be necessary for optimal vectorization

Integration: This module is designed to work seamlessly with other Mojo numerical computing components, including tensor operations, linear algebra routines, and domain-specific libraries for machine learning and scientific computing. ## Aliases ### `BFloat16` `alias BFloat16 = SIMD[bfloat16, 1]` Represents a 16-bit brain floating point value. ### `Byte` `alias Byte = SIMD[uint8, 1]` Represents a byte (backed by an 8-bit unsigned integer).
### `Float16` `alias Float16 = SIMD[float16, 1]` Represents a 16-bit floating point value. ### `Float32` `alias Float32 = SIMD[float32, 1]` Represents a 32-bit floating point value. ### `Float64` `alias Float64 = SIMD[float64, 1]` Represents a 64-bit floating point value. ### `Float8_e4m3fn` `alias Float8_e4m3fn = SIMD[float8_e4m3fn, 1]` Represents the E4M3 floating point format defined in the [OFP8 standard](https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1). This type is named differently across libraries and vendors, for example: * Mojo, PyTorch, JAX, and LLVM refer to it as `e4m3fn`. * OCP, NVIDIA CUDA, and AMD ROCm refer to it as `e4m3`. In these contexts, they are all referring to the same finite type specified in the OFP8 standard above, encoded as `seeeemmm`: * (s)ign: 1 bit * (e)xponent: 4 bits * (m)antissa: 3 bits * exponent bias: 7 * nan: 01111111, 11111111 * -0: 10000000 * fn: finite (no inf or -inf encodings) ### `Float8_e4m3fnuz` `alias Float8_e4m3fnuz = SIMD[float8_e4m3fnuz, 1]` Represents an 8-bit e4m3fnuz floating point format, encoded as `seeeemmm`: - (s)ign: 1 bit - (e)xponent: 4 bits - (m)antissa: 3 bits - exponent bias: 8 - nan: 10000000 - fn: finite (no inf or -inf encodings) - uz: unsigned zero (no -0 encoding) ### `Float8_e5m2` `alias Float8_e5m2 = SIMD[float8_e5m2, 1]` Represents the 8-bit E5M2 floating point format from the [OFP8 standard](https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1), encoded as `seeeeemm`: - (s)ign: 1 bit - (e)xponent: 5 bits - (m)antissa: 2 bits - exponent bias: 15 - nan: {0,1}11111{01,10,11} - inf: 01111100 - -inf: 11111100 - -0: 10000000 ### `Float8_e5m2fnuz` `alias Float8_e5m2fnuz = SIMD[float8_e5m2fnuz, 1]` Represents an 8-bit floating point format, encoded as `seeeeemm`: - (s)ign: 1 bit - (e)xponent: 5 bits - (m)antissa: 2 bits - exponent bias: 16 - nan: 10000000 - fn: finite (no inf or -inf encodings) - uz: unsigned zero (no -0 encoding) ### `Int128` `alias Int128 = SIMD[si128, 1]` Represents a 128-bit signed scalar integer. ### `Int16` `alias Int16 = SIMD[int16, 1]` Represents a 16-bit signed scalar integer. ### `Int256` `alias Int256 = SIMD[si256, 1]` Represents a 256-bit signed scalar integer. ### `Int32` `alias Int32 = SIMD[int32, 1]` Represents a 32-bit signed scalar integer. ### `Int64` `alias Int64 = SIMD[int64, 1]` Represents a 64-bit signed scalar integer. ### `Int8` `alias Int8 = SIMD[int8, 1]` Represents an 8-bit signed scalar integer. ### `Scalar` `alias Scalar = SIMD[?, 1]` Represents a scalar dtype. ### `U8x16` `alias U8x16 = SIMD[uint8, 16]` ### `UInt128` `alias UInt128 = SIMD[ui128, 1]` Represents a 128-bit unsigned scalar integer. ### `UInt16` `alias UInt16 = SIMD[uint16, 1]` Represents a 16-bit unsigned scalar integer. ### `UInt256` `alias UInt256 = SIMD[ui256, 1]` Represents a 256-bit unsigned scalar integer. ### `UInt32` `alias UInt32 = SIMD[uint32, 1]` Represents a 32-bit unsigned scalar integer. ### `UInt64` `alias UInt64 = SIMD[uint64, 1]` Represents a 64-bit unsigned scalar integer. ### `UInt8` `alias UInt8 = SIMD[uint8, 1]` Represents an 8-bit unsigned scalar integer. ## Structs * [​`SIMD`](/mojo/stdlib/builtin/simd/SIMD): Represents a small vector that is backed by a hardware vector element. --- ## sort Implements the built-in `sort` function. These are Mojo built-ins, so you don't need to import them. 
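For example, a minimal sketch of an in-place sort (this assumes the usual implicit conversion from `List` to a mutable `Span`, which is how these overloads are typically invoked):

```mojo
fn main():
    var numbers = List[Int](9, 3, 7, 1)
    # sort mutates the underlying storage in place and returns nothing.
    sort(numbers)
    print(numbers[0], numbers[1], numbers[2], numbers[3])  # 1 3 7 9
```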
## Aliases ### `insertion_sort_threshold` `alias insertion_sort_threshold = 32` ## Functions * [​`partition`](/mojo/stdlib/builtin/sort/partition): Partition the input buffer in place such that the first k elements are the largest (or smallest if cmp\_fn is `<` operator). --- ## partition `partition[: origin.set, T: Copyable & Movable, origin: MutableOrigin, //, cmp_fn: fn(T, T) capturing -> Bool](span: Span[T, origin], k: Int)` Partition the input buffer in place such that the first k elements are the largest (or smallest if cmp\_fn is `<` operator). **Parameters:** * ​T (`Copyable & Movable`): Type of the underlying data. * ​origin (`MutableOrigin`): Origin of span. * ​cmp\_fn (`fn(T, T) capturing -> Bool`): Comparison functor of (T, T) capturing \[\_] -> Bool type. **Args:** * ​span (`Span[T, origin]`): Input buffer. * ​k (`Int`): Index of the partition element. --- ## sort `sort[: origin.set, T: Copyable & Movable, origin: MutableOrigin, //, cmp_fn: fn(T, T) capturing -> Bool, *, stable: Bool = False](span: Span[T, origin])` Sort the list in place. The function doesn't return anything; the list is updated in place. **Parameters:** * ​T (`Copyable & Movable`): Copyable & Movable type of the underlying data. * ​origin (`MutableOrigin`): Origin of span. * ​cmp\_fn (`fn(T, T) capturing -> Bool`): The comparison function. * ​stable (`Bool`): Whether the sort should be stable. **Args:** * ​span (`Span[T, origin]`): The span to be sorted. `sort[: origin.set, origin: MutableOrigin, //, cmp_fn: fn(Int, Int) capturing -> Bool, *, stable: Bool = False](span: Span[Int, origin])` Sort the list in place. The function doesn't return anything; the list is updated in place. **Parameters:** * ​origin (`MutableOrigin`): Origin of span. * ​cmp\_fn (`fn(Int, Int) capturing -> Bool`): The comparison function. * ​stable (`Bool`): Whether the sort should be stable. **Args:** * ​span (`Span[Int, origin]`): The span to be sorted. `sort[origin: MutableOrigin, //, *, stable: Bool = False](span: Span[Int, origin])` Sort the list in place. The function doesn't return anything; the list is updated in place. **Parameters:** * ​origin (`MutableOrigin`): Origin of span. * ​stable (`Bool`): Whether the sort should be stable. **Args:** * ​span (`Span[Int, origin]`): The span to be sorted. `sort[dtype: DType, origin: MutableOrigin, //, *, stable: Bool = False](span: Span[SIMD[dtype, 1], origin])` Sort the list in place. The function doesn't return anything; the list is updated in place. **Parameters:** * ​dtype (`DType`): The `DType` of the underlying data. * ​origin (`MutableOrigin`): Origin of span. * ​stable (`Bool`): Whether the sort should be stable. **Args:** * ​span (`Span[SIMD[dtype, 1], origin]`): The span to be sorted. `sort[T: Copyable & Movable & EqualityComparable & LessThanComparable & GreaterThanComparable & LessThanOrEqualComparable & GreaterThanOrEqualComparable, origin: MutableOrigin, //, *, stable: Bool = False](span: Span[T, origin])` Sort the list of order-comparable elements in place. **Parameters:** * ​T (`Copyable & Movable & EqualityComparable & LessThanComparable & GreaterThanComparable & LessThanOrEqualComparable & GreaterThanOrEqualComparable`): The order comparable collection element type. * ​origin (`MutableOrigin`): Origin of span. * ​stable (`Bool`): Whether the sort should be stable. **Args:** * ​span (`Span[T, origin]`): The span to be sorted. --- ## Stringable The `Stringable` trait describes a type that can be converted to a [`String`](/mojo/stdlib/collections/string/String).
Any type that conforms to `Stringable` or [`StringableRaising`](/mojo/stdlib/builtin/str/StringableRaising) works with the built-in [`print()`](/mojo/stdlib/builtin/io/print) and [`String()`](/mojo/stdlib/builtin/str/str) functions. The `Stringable` trait requires the type to define the `__str__()` method. For example:

```mojo
struct Foo(Stringable):
    var s: String

    fn __init__(out self, s: String):
        self.s = s

    fn __str__(self) -> String:
        return self.s
```

Now you can pass an instance of `Foo` to the `String()` function to get back a `String`:

```mojo
var foo = Foo("test")
print(String(foo) == "test")
```

```plaintext
True
```

**Note:** If the `__str__()` method might raise an error, use the [`StringableRaising`](/mojo/stdlib/builtin/str/StringableRaising) trait instead. About the difference between `__repr__()` and `__str__()`: The method `__repr__` computes the "official" string representation of an object while `__str__` computes the "informal" or nicely printable string representation of an object. This method differs from `__repr__()` in that there is no expectation that `__str__()` return a valid Mojo expression: a more convenient or concise representation can be used. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__str__` `__str__(self: _Self) -> String` Get the string representation of the type. **Returns:** The string representation of the type. --- ## StringableRaising The StringableRaising trait describes a type that can be converted to a [`String`](/mojo/stdlib/collections/string/String). Any type that conforms to [`Stringable`](/mojo/stdlib/builtin/str/Stringable) or `StringableRaising` works with the built-in [`print()`](/mojo/stdlib/builtin/io/print) and [`String()`](/mojo/stdlib/builtin/str/str) functions. The `StringableRaising` trait requires the type to define the `__str__()` method, which can raise an error. For example:

```mojo
struct Foo(StringableRaising):
    var s: String

    fn __init__(out self, s: String):
        self.s = s

    fn __str__(self) raises -> String:
        if self.s == "":
            raise Error("Empty String")
        return self.s
```

Now you can pass an instance of `Foo` to the `String()` function to get back a `String`:

```mojo
fn main() raises:
    var foo = Foo("test")
    print(String(foo) == "test")
```

```plaintext
True
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__str__` `__str__(self: _Self) -> String` Get the string representation of the type. **Returns:** The string representation of the type. **Raises:** If there is an error when computing the string representation of the type. --- ## str Provides the `str` function. These are Mojo built-ins, so you don't need to import them. ## Traits * [​`Stringable`](/mojo/stdlib/builtin/str/Stringable): The `Stringable` trait describes a type that can be converted to a [`String`](/mojo/stdlib/collections/string/String). * [​`StringableRaising`](/mojo/stdlib/builtin/str/StringableRaising): The StringableRaising trait describes a type that can be converted to a [`String`](/mojo/stdlib/collections/string/String). --- ## StringLiteral `@register_passable(trivial)` `struct StringLiteral[value: string]` This type represents a string literal. String literals are all null-terminated for compatibility with C APIs, but this is subject to change. String literals store their length as an integer, and this does not include the null terminator. ## Parameters * ​value (`string`): The underlying string value.
## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `FloatableRaising`, `IntableRaising`, `Movable`, `PathLike`, `PythonConvertible`, `Representable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Default constructor. ### `__bool__` `__bool__(self) -> Bool` Convert the string to a bool value. **Returns:** True if the string is not empty. ### `__getitem__` `__getitem__[IndexerType: Indexer](self, idx: IndexerType) -> String` Gets the character at the specified position. **Parameters:** * ​IndexerType (`Indexer`): The inferred type of an indexer argument. **Args:** * ​idx (`IndexerType`): The index value. **Returns:** A new string containing the character at the specified position. ### `__lt__` `__lt__(self, rhs: StringSlice[origin]) -> Bool` Compare this value to the RHS using lesser than (LT) comparison. **Args:** * ​rhs (`StringSlice[origin]`): The other value to compare against. **Returns:** True if this is strictly less than the RHS and False otherwise. ### `__le__` `__le__(self, rhs: StringSlice[origin]) -> Bool` Compare this value to the RHS using lesser than or equal to (LE) comparison. **Args:** * ​rhs (`StringSlice[origin]`): The other value to compare against. **Returns:** True if this is less than or equal to the RHS and False otherwise. ### `__eq__` `__eq__(self, rhs: StringSlice[origin]) -> Bool` Compare two string literals for equality. **Args:** * ​rhs (`StringSlice[origin]`): The string to compare. **Returns:** True if they are equal. ### `__ne__` `__ne__(self, rhs: StringSlice[origin]) -> Bool` Compare two string literals for inequality. **Args:** * ​rhs (`StringSlice[origin]`): The string to compare. **Returns:** True if they are not equal. ### `__gt__` `__gt__(self, rhs: StringSlice[origin]) -> Bool` Compare this value to the RHS using greater than (GT) comparison. **Args:** * ​rhs (`StringSlice[origin]`): The other value to compare against. **Returns:** True if this is strictly greater than the RHS and False otherwise. ### `__ge__` `__ge__(self, rhs: StringSlice[origin]) -> Bool` Compare this value to the RHS using greater than or equal to (GE) comparison. **Args:** * ​rhs (`StringSlice[origin]`): The other value to compare against. **Returns:** True if this is greater than or equal to the RHS and False otherwise. ### `__add__` `__add__(self, rhs: StringLiteral[value]) -> StringLiteral[#pop.string_concat]` Concatenate two string literals. **Args:** * ​rhs (`StringLiteral[value]`): The string to concatenate. **Returns:** The concatenated string. ### `__mul__` `__mul__(self, n: Int) -> String` Concatenates the string `n` times. **Args:** * ​n (`Int`): The number of times to concatenate the string. **Returns:** The string concatenated `n` times. ### `copy` `copy(self) -> Self` Copy constructor. **Returns:** A copy of the value. ### `to_python_object` `to_python_object(owned self) -> PythonObject` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. ### `__len__` `__len__(self) -> Int` Get the string length. **Returns:** The length of this value. ### `__int__` `__int__(self) -> Int` Parses the given string as a base-10 integer and returns that value. If the string cannot be parsed as an int, an error is raised. **Returns:** An integer value that represents the string, or otherwise raises. ### `__float__` `__float__(self) -> SIMD[float64, 1]` Parses the string as a floating-point number and returns that value.
If the string cannot be parsed as a float, an error is raised. **Returns:** A float value that represents the string, or otherwise raises. ### `__str__` `__str__(self) -> String` Convert the string literal to a string. **Returns:** A new string. ### `__repr__` `__repr__(self) -> String` Return a representation of this value. You don't need to call this method directly; use `repr("...")` instead. **Returns:** A new representation of the string. ### `__fspath__` `__fspath__(self) -> String` Return the file system path representation of the object. **Returns:** The file system path representation as a string. ### `__iter__` `__iter__(self) -> CodepointSliceIter[StaticConstantOrigin]` Return an iterator over the string literal. **Returns:** An iterator over the string. ### `__reversed__` `__reversed__(self) -> CodepointSliceIter[StaticConstantOrigin, False]` Iterate backwards over the string, returning immutable references. **Returns:** A reversed iterator over the string. ### `__merge_with__` `__merge_with__[: string, //, other_type: AnyStruct[StringLiteral[$0]]](self) -> StringSlice[StaticConstantOrigin]` Returns a StaticString after merging with another string literal. **Parameters:** * ​other\_type (`AnyStruct[StringLiteral[$0]]`): The type of the string literal to merge with. **Returns:** A StaticString after merging with the specified `other_type`. ### `byte_length` `byte_length(self) -> Int` Get the string length in bytes. Notes: This does not include the trailing null terminator in the count. **Returns:** The length of this string in bytes. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[SIMD[uint8, 1], mut=False, origin=StaticConstantOrigin]` Get raw pointer to the underlying data. **Returns:** The raw pointer to the data. ### `unsafe_cstr_ptr` `unsafe_cstr_ptr(self) -> UnsafePointer[SIMD[int8, 1], mut=False, origin=StaticConstantOrigin]` Retrieves a C-string-compatible pointer to the underlying memory. The returned pointer is guaranteed to be NUL terminated, and not null. **Returns:** The pointer to the underlying memory. ### `as_string_slice` `as_string_slice(self) -> StringSlice[StaticConstantOrigin]` Returns a string slice of this static string literal. **Returns:** A string slice pointing to this static string literal. ### `as_bytes` `as_bytes(self) -> Span[SIMD[uint8, 1], StaticConstantOrigin]` Returns a contiguous Span of the bytes owned by this string. **Returns:** A contiguous slice pointing to the bytes owned by this string. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this string literal to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `find` `find(self, substr: StringSlice[StaticConstantOrigin], start: Int = 0) -> Int` Finds the offset of the first occurrence of `substr` starting at `start`. If not found, returns -1. **Args:** * ​substr (`StringSlice[StaticConstantOrigin]`): The substring to find. * ​start (`Int`): The offset from which to find. **Returns:** The offset of `substr` relative to the beginning of the string. ### `rfind` `rfind(self, substr: StringSlice[StaticConstantOrigin], start: Int = 0) -> Int` Finds the offset of the last occurrence of `substr` starting at `start`. If not found, returns -1. **Args:** * ​substr (`StringSlice[StaticConstantOrigin]`): The substring to find. * ​start (`Int`): The offset from which to find. **Returns:** The offset of `substr` relative to the beginning of the string.
### `count` `count(self, substr: StringSlice[origin]) -> Int` Return the number of non-overlapping occurrences of substring `substr` in the string literal. If `substr` is empty, returns the number of empty strings between characters, which is the length of the string plus one. **Args:** * ​substr (`StringSlice[origin]`): The substring to count. **Returns:** The number of occurrences of `substr`. ### `lower` `lower(self) -> String` Returns a copy of the string literal with all cased characters converted to lowercase. **Returns:** A new string where cased letters have been converted to lowercase. ### `upper` `upper(self) -> String` Returns a copy of the string literal with all cased characters converted to uppercase. **Returns:** A new string where cased letters have been converted to uppercase. ### `rjust` `rjust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = " ") -> String` Returns the string right-justified in a string literal of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The right-justified string, or `self` if `width` is not greater than the string length. ### `ljust` `ljust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = " ") -> String` Returns the string left-justified in a string literal of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The left-justified string, or `self` if `width` is not greater than the string length. ### `center` `center(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = " ") -> String` Returns the string center-justified in a string literal of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The center-justified string, or `self` if `width` is not greater than the string length. ### `startswith` `startswith(self, prefix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool` Checks if the string literal starts with the specified prefix between start and end positions. Returns True if found and False otherwise. **Args:** * ​prefix (`StringSlice[origin]`): The prefix to check. * ​start (`Int`): The start offset from which to check. * ​end (`Int`): The end offset from which to check. **Returns:** True if the `self[start:end]` is prefixed by the input prefix. ### `endswith` `endswith(self, suffix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool` Checks if the string literal ends with the specified suffix between start and end positions. Returns True if found and False otherwise. **Args:** * ​suffix (`StringSlice[origin]`): The suffix to check. * ​start (`Int`): The start offset from which to check. * ​end (`Int`): The end offset from which to check. **Returns:** True if the `self[start:end]` is suffixed by the input suffix.
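A small usage sketch of these query methods; it assumes literal arguments coerce to `StringSlice` as usual:

```mojo
fn main():
    alias s = "hello world"
    print(s.upper())             # HELLO WORLD
    print(s.find("world"))       # 6 (byte offset; -1 if absent)
    print(s.count("l"))          # 3
    print(s.startswith("hello")) # True
```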
### `isdigit` `isdigit(self) -> Bool` Returns True if all characters in the string literal are digits. Note that this currently only works with ASCII strings. **Returns:** True if all characters are digits else False. ### `isupper` `isupper(self) -> Bool` Returns True if all cased characters in the string literal are uppercase and there is at least one cased character. Note that this currently only works with ASCII strings. **Returns:** True if all cased characters in the string literal are uppercase and there is at least one cased character, False otherwise. ### `islower` `islower(self) -> Bool` Returns True if all cased characters in the string literal are lowercase and there is at least one cased character. Note that this currently only works with ASCII strings. **Returns:** True if all cased characters in the string literal are lowercase and there is at least one cased character, False otherwise. ### `strip` `strip(self) -> String` Return a copy of the string literal with leading and trailing whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. **Returns:** A string with no leading or trailing whitespaces. `strip(self, chars: StringSlice[origin]) -> String` Return a copy of the string literal with leading and trailing characters removed. **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. **Returns:** A string with no leading or trailing characters. ### `rstrip` `rstrip(self, chars: StringSlice[origin]) -> String` Return a copy of the string literal with trailing characters removed. **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. **Returns:** A string with no trailing characters. `rstrip(self) -> String` Return a copy of the string with trailing whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. **Returns:** A copy of the string with no trailing whitespaces. ### `lstrip` `lstrip(self, chars: StringSlice[origin]) -> String` Return a copy of the string with leading characters removed. **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. **Returns:** A copy of the string with no leading characters. `lstrip(self) -> String` Return a copy of the string with leading whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. **Returns:** A copy of the string with no leading whitespaces. --- ## string_literal Implements the StringLiteral struct. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`StringLiteral`](/mojo/stdlib/builtin/string_literal/StringLiteral): This type represents a string literal. --- ## swap Implements the built-in `swap` function. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`swap`](/mojo/stdlib/builtin/swap/swap): Swaps the two given arguments. --- ## swap `swap[T: Movable](mut lhs: T, mut rhs: T)` Swaps the two given arguments. **Parameters:** * ​T (`Movable`): Constrained to Movable types. **Args:** * ​lhs (`T`): Argument value swapped with rhs. * ​rhs (`T`): Argument value swapped with lhs. --- ## Tuple `struct Tuple[*element_types: Copyable & Movable]` The type of a literal tuple expression. A tuple consists of zero or more values, separated by commas. ## Parameters * ​\*element\_types (`Copyable & Movable`): The elements type. ## Fields * ​storage (`!kgen.pack> element_types>`): The underlying storage for the tuple. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self: Tuple[])` Construct an empty tuple. `__init__(out self, owned *args: *element_types)` Construct the tuple. **Args:** * ​\*args (`*element_types`): Initial values. `__init__(out self, *, owned storage: VariadicPack[is_owned, origin, Copyable & Movable, element_types])` Construct the tuple from a low-level internal representation. **Args:** * ​storage (`VariadicPack[is_owned, origin, Copyable & Movable, element_types]`): The variadic pack storage to construct from.
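A brief construction sketch; note that the element index must be a compile-time parameter, so `t[0]` is resolved statically:

```mojo
fn main():
    # Element types are inferred: Tuple[Int, String, Float64].
    var t = Tuple(1, String("two"), 3.5)
    print(t[0])    # 1
    print(len(t))  # 3
```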
### `__copyinit__` `__copyinit__(out self, existing: Self)` Copy construct the tuple. **Args:** * ​existing (`Self`): The value to copy from. ### `__moveinit__` `__moveinit__(out self, owned existing: Self)` Move construct the tuple. **Args:** * ​existing (`Self`): The value to move from. ### `__del__` `__del__(owned self)` Destructor that destroys all of the elements. ### `__getitem__` `__getitem__[idx: Int](ref self) -> ref [self] element_types[idx.value]` Get a reference to an element in the tuple. **Parameters:** * ​idx (`Int`): The element to return. **Returns:** A reference to the specified element. ### `__contains__` `__contains__[T: EqualityComparable & Copyable & Movable](self, value: T) -> Bool` Return whether the tuple contains the specified value. For example:

```mojo
var t = Tuple(True, 1, 2.5)
if 1 in t:
    print("t contains 1")
```

**Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the value. **Args:** * ​value (`T`): The value to search for. **Returns:** True if the value is in the tuple, False otherwise. ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value. ### `__len__` `static __len__() -> Int` Return the number of elements in the tuple. **Returns:** The tuple length. `__len__(self) -> Int` Get the number of elements in the tuple. **Returns:** The tuple length. --- ## tuple Implements the Tuple type. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`Tuple`](/mojo/stdlib/builtin/tuple/Tuple): The type of a literal tuple expression. --- ## Origin `@register_passable(trivial)` `struct Origin[mut: Bool]` This represents an origin reference for a memory value. ## Parameters * ​mut (`Bool`): Whether the origin is mutable. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `cast_from` `alias cast_from = _lit_mut_cast[mut, ?]` Cast an existing Origin to be of the specified mutability. This is a low-level way to coerce Origin mutability. This should be used rarely, typically when building low-level fundamental abstractions. Strongly consider alternatives before reaching for this "escape hatch". Safety: This is an UNSAFE operation if used to cast an immutable origin to a mutable origin. Examples: Cast a mutable origin to be immutable:

```mojo
struct Container[mut: Bool, //, origin: Origin[mut]]:
    var data: Int

    fn imm_borrow(self) -> Container[ImmutableOrigin.cast_from[origin].result]:
        # ...
```

### `empty` `alias empty = {}` An empty `__origin_of()` of the given mutability. The empty origin is guaranteed not to alias any existing origins. --- ## type_aliases Defines some type aliases. These are Mojo built-ins, so you don't need to import them. ## Aliases ### `AnyTrivialRegType` `alias AnyTrivialRegType = AnyTrivialRegType` Represents any register passable Mojo data type. ### `ImmutableAnyOrigin` `alias ImmutableAnyOrigin = ImmutableAnyOrigin` The immutable origin that might access any memory value. ### `ImmutableOrigin` `alias ImmutableOrigin = ImmutableOrigin` Immutable origin reference type. ### `MutableAnyOrigin` `alias MutableAnyOrigin = MutableAnyOrigin` The mutable origin that might access any memory value.
### `MutableOrigin` `alias MutableOrigin = MutableOrigin` Mutable origin reference type. ### `OriginSet` `alias OriginSet = origin.set` A set of origin parameters. ### `StaticConstantOrigin` `alias StaticConstantOrigin = StaticConstantOrigin` An origin for strings and other always-immutable static constants. ## Structs * [​`Origin`](/mojo/stdlib/builtin/type_aliases/Origin): This represents an origin reference for a memory value. --- ## UInt `@register_passable(trivial)` `struct UInt` This type represents an unsigned integer. The size of this unsigned integer is platform-dependent. If you wish to use a fixed size unsigned integer, consider using `UInt8`, `UInt16`, `UInt32`, or `UInt64`. ## Fields * ​value (`index`): The underlying storage for the integer value. Note that it is the same type as the `Int.value` field. MLIR doesn't differentiate between signed and unsigned integers when it comes to storing them with the index dialect. The difference is in the operations that are performed on them, which have signed and unsigned variants. ## Implemented traits `Absable`, `AnyType`, `Boolable`, `CeilDivable`, `Copyable`, `Defaultable`, `EqualityComparable`, `ExplicitlyCopyable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `Hashable`, `Indexer`, `Intable`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Aliases ### `BITWIDTH` `alias BITWIDTH` The bit width of the integer type (platform-dependent). ### `MAX` `alias MAX` Returns the maximum integer value for the platform-dependent bit width (all bits set). ### `MIN` `alias MIN = UInt(0)` Returns the minimum value of the type. ## Methods ### `__init__` `__init__() -> Self` Default constructor that produces zero. `@implicit` `__init__(value: IntLiteral[value]) -> Self` Construct UInt from the given IntLiteral value. **Args:** * ​value (`IntLiteral[value]`): The init value. `@implicit` `__init__(value: Int) -> Self` Construct UInt from the given Int value. **Args:** * ​value (`Int`): The init value. `__init__[T: Indexer](value: T) -> Self` Construct UInt from the given Indexable value. **Parameters:** * ​T (`Indexer`): The type that can index into a collection or pointer. **Args:** * ​value (`T`): The init value. ### `__bool__` `__bool__(self) -> Bool` Convert this UInt to Bool. **Returns:** False Bool value if the value is equal to 0 and True otherwise. ### `__pos__` `__pos__(self) -> Self` Return +self. **Returns:** The +self value. ### `__invert__` `__invert__(self) -> Self` Return \~self. **Returns:** The \~self value. ### `__lt__` `__lt__(self, rhs: Self) -> Bool` Return whether this UInt is strictly less than another. **Args:** * ​rhs (`Self`): The other UInt to compare against. **Returns:** True if this UInt is less than the other UInt and False otherwise. ### `__le__` `__le__(self, rhs: Self) -> Bool` Compare this UInt to the RHS using LE comparison. **Args:** * ​rhs (`Self`): The other UInt to compare against. **Returns:** True if this UInt is less than or equal to the RHS UInt and False otherwise. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compare this UInt to the RHS using EQ comparison. **Args:** * ​rhs (`Self`): The other UInt to compare against. **Returns:** True if this UInt is equal to the RHS UInt and False otherwise. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Compare this UInt to the RHS using NE comparison.
**Args:** * ​rhs (`Self`): The other UInt to compare against. **Returns:** True if this UInt is non-equal to the RHS UInt and False otherwise. ### `__gt__` `__gt__(self, rhs: Self) -> Bool` Return whether this UInt is strictly greater than another. **Args:** * ​rhs (`Self`): The other UInt to compare against. **Returns:** True if this UInt is greater than the other UInt and False otherwise. ### `__ge__` `__ge__(self, rhs: Self) -> Bool` Return whether this UInt is greater than or equal to another. **Args:** * ​rhs (`Self`): The other UInt to compare against. **Returns:** True if this UInt is greater than or equal to the other UInt and False otherwise. ### `__add__` `__add__(self, rhs: Self) -> Self` Return `self + rhs`. **Args:** * ​rhs (`Self`): The value to add. **Returns:** `self + rhs` value. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Return `self - rhs`. **Args:** * ​rhs (`Self`): The value to subtract. **Returns:** `self - rhs` value. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Return `self * rhs`. **Args:** * ​rhs (`Self`): The value to multiply with. **Returns:** `self * rhs` value. ### `__truediv__` `__truediv__(self, rhs: Self) -> SIMD[float64, 1]` Return the floating point division of `self` and `rhs`. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** `Float64(self)/Float64(rhs)` value. ### `__floordiv__` `__floordiv__(self, rhs: Self) -> Self` Return the division of `self` and `rhs` rounded down to the nearest integer. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** `floor(self/rhs)` value. ### `__mod__` `__mod__(self, rhs: Self) -> Self` Return the remainder of self divided by rhs. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** The remainder of dividing self by rhs. ### `__pow__` `__pow__(self, exp: Self) -> Self` Return the value raised to the power of the given exponent. Computes the power of an integer using the Russian Peasant Method. **Args:** * ​exp (`Self`): The exponent value. **Returns:** The value of `self` raised to the power of `exp`. ### `__lshift__` `__lshift__(self, rhs: Self) -> Self` Return `self << rhs`. **Args:** * ​rhs (`Self`): The value to shift with. **Returns:** `self << rhs`. ### `__rshift__` `__rshift__(self, rhs: Self) -> Self` Return `self >> rhs`. **Args:** * ​rhs (`Self`): The value to shift with. **Returns:** `self >> rhs`. ### `__and__` `__and__(self, rhs: Self) -> Self` Return `self & rhs`. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self & rhs`. ### `__or__` `__or__(self, rhs: Self) -> Self` Return `self | rhs`. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self | rhs`. ### `__xor__` `__xor__(self, rhs: Self) -> Self` Return `self ^ rhs`. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self ^ rhs`. ### `__radd__` `__radd__(self, value: Self) -> Self` Return `value + self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value + self`. ### `__rsub__` `__rsub__(self, value: Self) -> Self` Return `value - self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value - self`. ### `__rmul__` `__rmul__(self, value: Self) -> Self` Return `value * self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value * self`. ### `__rfloordiv__` `__rfloordiv__(self, value: Self) -> Self` Return `value // self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value // self`. ### `__rmod__` `__rmod__(self, value: Self) -> Self` Return `value % self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value % self`.
### `__rpow__` `__rpow__(self, value: Self) -> Self` Return `pow(value,self)`. **Args:** * ​value (`Self`): The other value. **Returns:** `pow(value,self)`. ### `__rlshift__` `__rlshift__(self, value: Self) -> Self` Return `value << self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value << self`. ### `__rrshift__` `__rrshift__(self, value: Self) -> Self` Return `value >> self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value >> self`. ### `__rand__` `__rand__(self, value: Self) -> Self` Return `value & self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value & self`. ### `__ror__` `__ror__(self, value: Self) -> Self` Return `value | self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value | self`. ### `__rxor__` `__rxor__(self, value: Self) -> Self` Return `value ^ self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value ^ self`. ### `__iadd__` `__iadd__(mut self, rhs: Self)` Compute `self + rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__isub__` `__isub__(mut self, rhs: Self)` Compute `self - rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__imul__` `__imul__(mut self, rhs: Self)` Compute `self * rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__itruediv__` `__itruediv__(mut self, rhs: Self)` Compute `self / rhs`, convert to int, and save the result in self. Since `floor(self / rhs)` is equivalent to `self // rhs`, this yields the same as `__ifloordiv__`. **Args:** * ​rhs (`Self`): The RHS value. ### `__ifloordiv__` `__ifloordiv__(mut self, rhs: Self)` Compute `self // rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__imod__` `__imod__(mut self, rhs: Self)` Compute `self % rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__ipow__` `__ipow__(mut self, rhs: Self)` Compute `pow(self, rhs)` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__ilshift__` `__ilshift__(mut self, rhs: Self)` Compute `self << rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__irshift__` `__irshift__(mut self, rhs: Self)` Compute `self >> rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__iand__` `__iand__(mut self, rhs: Self)` Compute `self & rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__ixor__` `__ixor__(mut self, rhs: Self)` Compute `self ^ rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__ior__` `__ior__(mut self, rhs: Self)` Compute `self | rhs` and save the result in self. **Args:** * ​rhs (`Self`): The RHS value. ### `__divmod__` `__divmod__(self, rhs: Self) -> Tuple[UInt, UInt]` Computes both the quotient and remainder using integer division. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** The quotient and remainder as a `Tuple(self // rhs, self % rhs)`. ### `__index__` `__index__(self) -> index` Convert to index. **Returns:** The corresponding \_\_mlir\_type.index value. ### `__int__` `__int__(self) -> Int` Gets the integral value, wrapping to a negative number on overflow. **Returns:** The value as an integer. ### `__abs__` `__abs__(self) -> Self` Return the absolute value of the UInt value. **Returns:** The absolute value. ### `__ceil__` `__ceil__(self) -> Self` Return the ceiling of the UInt value, which is itself. **Returns:** The UInt value itself. ### `__floor__` `__floor__(self) -> Self` Return the floor of the UInt value, which is itself.
**Returns:** The UInt value itself. ### `__round__` `__round__(self) -> Self` Return the rounded value of the UInt value, which is itself. **Returns:** The UInt value itself. `__round__(self, ndigits: Self) -> Self` Return the rounded value of the UInt value, which is itself. **Args:** * ​ndigits (`Self`): The number of digits to round to. **Returns:** The UInt value itself if ndigits >= 0 else the rounded value. ### `__trunc__` `__trunc__(self) -> Self` Return the truncated UInt value, which is itself. **Returns:** The UInt value itself. ### `__ceildiv__` `__ceildiv__(self, denominator: Self) -> Self` Return the rounded-up result of dividing self by denominator. **Args:** * ​denominator (`Self`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. ### `is_power_of_two` `is_power_of_two(self) -> Bool` Check if the integer is a (non-zero) power of two. **Returns:** True if the integer is a power of two, False otherwise. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this integer to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__str__` `__str__(self) -> String` Convert this UInt to a string. A small example.

```mojo
x = UInt(50)
assert_equal(String(x), "50")
```

**Returns:** The string representation of this UInt. ### `__repr__` `__repr__(self) -> String` Convert this UInt to a string. A small example.

```mojo
x = UInt(50)
assert_equal(repr(x), "UInt(50)")
```

**Returns:** The string representation of this UInt. ### `__hash__` `__hash__(self) -> Self` Hash the UInt using builtin hash. **Returns:** A 64-bit hash value. This value is *not* suitable for cryptographic uses. Its intended usage is for data structures. See the `hash` builtin documentation for more details. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with this uint value. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance.
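A compact sketch exercising a few of the operators above; `Int` arguments convert to `UInt` through the implicit constructor:

```mojo
fn main():
    var x = UInt(12)
    var y = UInt(5)
    print(x + y)      # 17
    print(x // y)     # 2
    print(x % y)      # 2
    print(x << 1)     # 24
    print(x & y)      # 4
    print(UInt(16).is_power_of_two())  # True
```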
--- ## uint Implements the UInt class. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`UInt`](/mojo/stdlib/builtin/uint/UInt): This type represents an unsigned integer. --- ## Copyable The Copyable trait denotes a type whose value can be copied. Example implementing the `Copyable` trait on `Foo` which requires the `__copyinit__` method:

```mojo
struct Foo(Copyable):
    var s: String

    @implicit
    fn __init__(out self, s: String):
        self.s = s

    fn __copyinit__(out self, other: Self):
        print("copying value")
        self.s = other.s
```

You can now copy objects inside a generic function:

```mojo
fn copy_return[T: Copyable](foo: T) -> T:
    var copy = foo
    return copy

var foo = Foo("test")
var res = copy_return(foo)
```

```plaintext
copying value
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. --- ## Defaultable The `Defaultable` trait describes a type with a default constructor. Implementing the `Defaultable` trait requires the type to define an `__init__` method with no arguments:

```mojo
struct Foo(Defaultable):
    var s: String

    fn __init__(out self):
        self.s = "default"
```

You can now construct a generic `Defaultable` type:

```mojo
fn default_init[T: Defaultable]() -> T:
    return T()

var foo = default_init[Foo]()
print(foo.s)
```

```plaintext
default
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self: _Self)` Create a default instance of the value. --- ## ExplicitlyCopyable The ExplicitlyCopyable trait denotes a type whose value can be copied explicitly. Unlike `Copyable`, which denotes types that are *implicitly* copyable, an explicitly copyable type can only be copied when the explicit copy initializer is called intentionally by the programmer. An explicit copy initializer is just a normal `__init__` method that takes a `read-only` argument of `Self`. Example implementing the `ExplicitlyCopyable` trait on `Foo`, which requires a `copy(self) -> Self` method:

```mojo
struct Foo(ExplicitlyCopyable):
    var s: String

    @implicit
    fn __init__(out self, s: String):
        self.s = s

    fn copy(self) -> Self:
        print("explicitly copying value")
        return Foo(self.s)
```

You can now copy objects inside a generic function:

```mojo
fn copy_return[T: ExplicitlyCopyable](foo: T) -> T:
    var copy = foo.copy()
    return copy

var foo = Foo("test")
var res = copy_return(foo)
```

```plaintext
explicitly copying value
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `copy` `copy(self: _Self) -> _Self` Explicitly construct a copy of self. **Returns:** A copy of this value. --- ## Movable The Movable trait denotes a type whose value can be moved. Implement the `Movable` trait on `Foo` which requires the `__moveinit__` method:

```mojo
struct Foo(Movable):
    fn __init__(out self):
        pass

    fn __moveinit__(out self, owned existing: Self):
        print("moving")
```

You can now use the ^ suffix to move the object instead of copying it inside generic functions:

```mojo
fn return_foo[T: Movable](owned foo: T) -> T:
    return foo^

var foo = Foo()
var res = return_foo(foo^)
```

```plaintext
moving
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. --- ## value Defines core value traits. These are Mojo built-ins, so you don't need to import them. ## Traits * [​`Copyable`](/mojo/stdlib/builtin/value/Copyable): The Copyable trait denotes a type whose value can be copied. * [​`Defaultable`](/mojo/stdlib/builtin/value/Defaultable): The `Defaultable` trait describes a type with a default constructor. * [​`ExplicitlyCopyable`](/mojo/stdlib/builtin/value/ExplicitlyCopyable): The ExplicitlyCopyable trait denotes a type whose value can be copied explicitly. * [​`Movable`](/mojo/stdlib/builtin/value/Movable): The Movable trait denotes a type whose value can be moved. --- ## VariadicList `@register_passable(trivial)` `struct VariadicList[type: AnyTrivialRegType]` A utility class to access variadic function arguments. Provides a "list" view of the function argument so that the size of the argument list and each individual argument can be accessed. ## Parameters * ​type (`AnyTrivialRegType`): The type of the elements in the list. ## Fields * ​value (`Variadic[type]`): The underlying storage for the variadic list.
## Implemented traits `AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Aliases ### `IterType` `alias IterType = _VariadicListIter[type]` ## Methods ### `__init__` `@implicit` `__init__(*value: type) -> Self` Constructs a VariadicList from a variadic list of arguments. **Args:** * ​\*value (`type`): The variadic argument list to construct the variadic list with. ### `__getitem__` `__getitem__[I: Indexer](self, idx: I) -> type` Gets a single element on the variadic list. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The index of the element to access on the list. **Returns:** The element on the list corresponding to the given index. ### `__len__` `__len__(self) -> Int` Gets the size of the list. **Returns:** The number of elements on the variadic list. ### `__iter__` `__iter__(self) -> _VariadicListIter[type]` Iterate over the list. **Returns:** An iterator to the start of the list. --- ## VariadicListMem `struct VariadicListMem[elt_is_mutable: Bool, //, element_type: AnyType, origin: Origin[elt_is_mutable], is_owned: Bool]` A utility class to access variadic function arguments of memory-only types that may have ownership. It exposes references to the elements in a way that can be enumerated. Each element may be accessed with `elt[]`. ## Parameters * ​elt\_is\_mutable (`Bool`): True if the elements of the list are mutable for a `mut` or `owned` argument. * ​element\_type (`AnyType`): The type of the elements in the list. * ​origin (`Origin[elt_is_mutable]`): The origin of the underlying elements. * ​is\_owned (`Bool`): Whether the elements are owned by the list. ## Fields * ​value (`Variadic[ref [origin._mlir_origin] element_type]`): The underlying storage, a variadic list of references to elements of the given type. ## Implemented traits `AnyType`, `Sized`, `UnknownDestructibility` ## Aliases ### `reference_type` `alias reference_type = Pointer[element_type, origin]` ## Methods ### `__moveinit__` `__moveinit__(out self, owned existing: Self)` Move constructor. **Args:** * ​existing (`Self`): The existing VariadicListMem. ### `__del__` `__del__(owned self)` Destructor that releases elements if owned. ### `__getitem__` `__getitem__(self, idx: Int) -> ref [origin, *[0,0]] element_type` Gets a single element on the variadic list. **Args:** * ​idx (`Int`): The index of the element to access on the list. **Returns:** A reference to the element on the list corresponding to the given index. ### `__len__` `__len__(self) -> Int` Gets the size of the list. **Returns:** The number of elements on the variadic list. ### `__iter__` `__iter__(self, out result: _VariadicListMemIter[element_type, origin, self, is_owned])` Iterate over the list. **Returns:** An iterator to the start of the list. --- ## VariadicPack `@register_passable` `struct VariadicPack[elt_is_mutable: Bool, //, is_owned: Bool, origin: Origin[elt_is_mutable], element_trait: AnyTrait[AnyType], *element_types: element_trait]` A utility class to access variadic pack arguments and provide an API for working with them. ## Parameters * ​elt\_is\_mutable (`Bool`): True if the elements of the list are mutable for a `mut` or `owned` argument pack. * ​is\_owned (`Bool`): Whether the elements are owned by the pack. If so, the pack will release the elements when it is destroyed. * ​origin (`Origin[elt_is_mutable]`): The origin of the underlying elements. * ​element\_trait (`AnyTrait[AnyType]`): The trait that each element of the pack conforms to.
* ​\*element\_types (`element_trait`): The list of types held by the argument pack. ## Implemented traits `AnyType`, `Sized`, `UnknownDestructibility` ## Methods ### `__del__` `__del__(owned self)` Destructor that releases elements if owned. ### `__getitem__` `__getitem__[index: Int](self) -> ref [origin] element_types[index.value]` Return a reference to an element of the pack. **Parameters:** * ​index (`Int`): The element of the pack to return. **Returns:** A reference to the element. The Pointer's mutability follows the mutability of the pack argument convention. ### `__len__` `static __len__() -> Int` Return the VariadicPack length. **Returns:** The number of elements in the variadic pack. `__len__(self) -> Int` Return the VariadicPack length. **Returns:** The number of elements in the variadic pack. --- ## variadics Implements the VariadicList and VariadicPack types. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`VariadicList`](/mojo/stdlib/builtin/variadics/VariadicList): A utility class to access variadic function arguments. Provides a "list" view of the function argument so that the size of the argument list and each individual argument can be accessed. * [​`VariadicListMem`](/mojo/stdlib/builtin/variadics/VariadicListMem): A utility class to access variadic function arguments of memory-only types that may have ownership. It exposes references to the elements in a way that can be enumerated. Each element may be accessed with `elt[]`. * [​`VariadicPack`](/mojo/stdlib/builtin/variadics/VariadicPack): A utility class to access variadic pack arguments and provide an API for working with them.
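A variadic `Int` argument is received as a `VariadicList[Int]` inside the function body; here is a minimal sketch of the standard pattern (the helper name `sum_all` is ours):

```mojo
fn sum_all(*values: Int) -> Int:
    # `values` is a VariadicList[Int]: it is Sized and iterable.
    var total = 0
    for value in values:
        total += value
    return total

fn main():
    print(sum_all(1, 2, 3))  # 6
```

Memory-only element types arrive as `VariadicListMem` (elements accessed through references), and generic packs of mixed types arrive as `VariadicPack`, as documented above.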
--- ## BitSet `struct BitSet[size: UInt]` A grow-only set storing non-negative integers efficiently using bits. Each integer element is represented by a single bit within an array of 64-bit words (`UInt64`). This structure is optimized for: * **Compactness:** Uses 64 times less memory than `List[Bool]`. * **Speed:** Offers O(1) time complexity for `set`, `clear`, `test`, and `toggle` operations (single word load/store). Ideal for applications like data-flow analysis, graph algorithms, or any task requiring dense sets of small integers where memory and lookup speed are critical. ## Parameters * ​size (`UInt`): The maximum number of bits the bitset can store. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self)` Initializes an empty BitSet with zero capacity and size. `__init__(out self: BitSet[UInt(size)], init: SIMD[bool, size])` Initializes a BitSet with the given SIMD vector of booleans. **Args:** * ​init (`SIMD[bool, size]`): A SIMD vector of booleans to initialize the bitset with. ### `__bool__` `__bool__(self) -> Bool` Checks if the bitset is non-empty (contains at least one set bit). Equivalent to `len(self) != 0` or `not self.is_empty()`. **Returns:** True if at least one bit is set, False otherwise. ### `__len__` `__len__(self) -> Int` Counts the total number of bits that are set to 1 in the bitset. Uses the efficient `pop_count` intrinsic for each underlying word. The complexity is proportional to the number of words used by the bitset's capacity (`_words_size`), not the logical size (`len`). **Returns:** The total count of set bits (population count). ### `is_empty` `is_empty(self) -> Bool` Checks if the bitset contains any set bits. Equivalent to `len(self) == 0`. Note that this checks the logical size, not the allocated capacity. **Returns:** True if no bits are set (logical size is 0), False otherwise. ### `set` `set(mut self, idx: UInt)` Sets the bit at the specified index `idx` to 1. If `idx` is greater than or equal to the current logical size, the logical size is updated. Aborts if `idx` is greater than or equal to the compile-time `size`. **Args:** * ​idx (`UInt`): The non-negative index of the bit to set (must be `< size`). ### `clear` `clear(mut self, idx: UInt)` Clears the bit at the specified index `idx` (sets it to 0). Aborts if `idx` is greater than or equal to the compile-time `size`. Does not change the logical size. **Args:** * ​idx (`UInt`): The non-negative index of the bit to clear (must be `< size`). ### `toggle` `toggle(mut self, idx: UInt)` Toggles (inverts) the bit at the specified index `idx`. If the bit becomes 1 and `idx` is greater than or equal to the current logical size, the logical size is updated. Aborts if `idx` is greater than or equal to the compile-time `size`. **Args:** * ​idx (`UInt`): The non-negative index of the bit to toggle (must be `< size`). ### `test` `test(self, idx: UInt) -> Bool` Tests if the bit at the specified index `idx` is set (is 1). Aborts if `idx` is greater than or equal to the compile-time `size`. **Args:** * ​idx (`UInt`): The non-negative index of the bit to test (must be `< size`). **Returns:** True if the bit at `idx` is set, False otherwise. ### `clear_all` `clear_all(mut self)` Clears all bits in the set, resetting the logical size to 0. The allocated storage capacity remains unchanged. Equivalent to re-initializing the set with `Self()`. ### `union` `union(self, other: Self) -> Self` Returns a new bitset that is the union of `self` and `other`. **Args:** * ​other (`Self`): The bitset to union with. **Returns:** A new bitset containing all elements from both sets. ### `intersection` `intersection(self, other: Self) -> Self` Returns a new bitset that is the intersection of `self` and `other`. **Args:** * ​other (`Self`): The bitset to intersect with. **Returns:** A new bitset containing only the elements present in both sets. ### `difference` `difference(self, other: Self) -> Self` Returns a new bitset that is the difference of `self` and `other`. **Args:** * ​other (`Self`): The bitset to subtract from `self`. **Returns:** A new bitset containing elements from `self` that are not in `other`. ### `write_to` `write_to[W: Writer, //](self, mut writer: W)` Writes a string representation of the set bits to the given writer. Outputs the indices of the set bits in ascending order, enclosed in curly braces and separated by commas (e.g., "{1, 5, 42}"). Uses efficient bitwise operations to find set bits without iterating through every possible bit. **Parameters:** * ​W (`Writer`): The type of the writer, conforming to the `Writer` trait. **Args:** * ​writer (`W`): The writer instance to output the representation to. ### `__repr__` `__repr__(self) -> String` Returns a developer-friendly string representation of the bitset. Currently equivalent to `__str__`. **Returns:** A string showing the set bits (e.g., "{1, 5, 42}"). ### `__str__` `__str__(self) -> String` Returns a user-friendly string representation of the bitset. Formats the set bits as a comma-separated list within curly braces, like "{1, 5, 42}". Uses the `write_to` method internally. **Returns:** A string showing the set bits.
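A small sketch of the set-algebra methods; the `from collections.bitset import BitSet` path mirrors this module's location and is an assumption:

```mojo
from collections.bitset import BitSet  # assumed import path

fn main():
    var a = BitSet[64]()
    a.set(1)
    a.set(2)
    var b = BitSet[64]()
    b.set(2)
    b.set(3)
    print(a.union(b))         # {1, 2, 3}
    print(a.intersection(b))  # {2}
    print(a.difference(b))    # {1}
```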
--- ## bitset Provides a compact, grow-only set of non-negative integers. Optimized for space (1 bit per element) and speed (O(1) operations). Offers set/clear/test/toggle and fast population count. The underlying storage grows automatically but does not shrink unless `shrink_to_fit` is called (not implemented yet). Example:

```mojo
var bs = BitSet[128]()  # 128-bit set, all clear
bs.set(42)              # Mark value 42 as present.
if bs.test(42):         # Check membership.
    print("hit")        # Prints "hit".
bs.clear(42)            # Remove 42.
print(len(bs))          # Prints 0 (the population count).
```

## Structs * [​`BitSet`](/mojo/stdlib/collections/bitset/BitSet): A grow-only set storing non-negative integers efficiently using bits. --- ## CountTuple `struct CountTuple[V: Copyable & Movable & Hashable & EqualityComparable]` A tuple representing a value and its count in a Counter. ## Parameters * ​V (`Copyable & Movable & Hashable & EqualityComparable`): The value in the Counter. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, value: V, count: UInt)` Create a new CountTuple. **Args:** * ​value (`V`): The value in the Counter. * ​count (`UInt`): The count of the value in the Counter. ### `__getitem__` `__getitem__(self, idx: Int) -> Variant[V, Int]` Get an element in the tuple. **Args:** * ​idx (`Int`): The element to return. **Returns:** The value if idx is 0 and the count if idx is 1. ### `__lt__` `__lt__(self, other: Self) -> Bool` Compare two CountTuples by count, then by value. **Args:** * ​other (`Self`): The other CountTuple to compare to. **Returns:** True if this CountTuple is less than the other, False otherwise. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compare two CountTuples for equality. **Args:** * ​other (`Self`): The other CountTuple to compare to. **Returns:** True if the two CountTuples are equal, False otherwise. ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value. --- ## Counter `struct Counter[V: Copyable & Movable & Hashable & EqualityComparable]` A container for counting hashable items. The value type must be specified statically, unlike a Python Counter, which can accept arbitrary value types. The value type must implement the `KeyElement` trait, as its values are stored in the dictionary as keys. Usage:

```mojo
from collections import Counter

var c = Counter[String]("a", "a", "a", "b", "b", "c", "d", "c", "c")
print(c["a"])  # prints 3
print(c["b"])  # prints 2
```

## Parameters * ​V (`Copyable & Movable & Hashable & EqualityComparable`): The value type to be counted. Currently must be KeyElement. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Create a new, empty Counter object. `__init__(out self, owned *values: V)` Create a new Counter from a list of values. Usage:

```mojo
from collections import Counter

var c = Counter[String]("a", "a", "a", "b", "b", "c", "d", "c", "c")
print(c["a"])  # prints 3
print(c["b"])  # prints 2
```

**Args:** * ​\*values (`V`): A list of values to count. `@implicit` `__init__(out self, items: List[V, hint_trivial_type])` Create a new Counter from an input iterable. Usage:

```mojo
from collections import Counter

var c = Counter[String](["a", "a", "a", "b", "b", "c", "d", "c", "c"])
print(c["a"])  # prints 3
print(c["b"])  # prints 2
```

**Args:** * ​items (`List[V, hint_trivial_type]`): A list of items to count. ### `__bool__` `__bool__(self) -> Bool` Check if the Counter is empty or not. **Returns:** `False` if the Counter is empty, `True` otherwise.
### `__getitem__` `__getitem__(self, key: V) -> Int` Get the count of a key. **Args:** * ​key (`V`): The key to get the count of. **Returns:** The count of the key. ### `__setitem__` `__setitem__(mut self, value: V, count: Int)` Set the count for a key in the Counter. **Args:** * ​value (`V`): The value to associate with the specified count. * ​count (`Int`): The count to store in the Counter. ### `__neg__` `__neg__(self) -> Self` Subtract from an empty Counter. Strips positive and zero counts, and flips the sign on negative counts. **Returns:** A new Counter with stripped counts and negative counts. ### `__pos__` `__pos__(self) -> Self` Return a shallow copy of the Counter, stripping non-positive counts. **Returns:** A shallow copy of the Counter. ### `__lt__` `__lt__(self, other: Self) -> Bool` Check if all counts are less than in the other Counter. **Args:** * ​other (`Self`): The other Counter to compare to. **Returns:** True if all counts are less than in the other Counter, False otherwise. ### `__le__` `__le__(self, other: Self) -> Bool` Check if all counts are less than or equal to the other Counter. **Args:** * ​other (`Self`): The other Counter to compare to. **Returns:** True if all counts are less than or equal to the other Counter, False otherwise. ### `__eq__` `__eq__(self, other: Self) -> Bool` Check if all counts agree. Missing counts are treated as zero. **Args:** * ​other (`Self`): The other Counter to compare to. **Returns:** True if the two Counters are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Check if all counts disagree. Missing counts are treated as zero. **Args:** * ​other (`Self`): The other Counter to compare to. **Returns:** True if the two Counters are not equal, False otherwise. ### `__gt__` `__gt__(self, other: Self) -> Bool` Check if all counts are greater than in the other Counter. **Args:** * ​other (`Self`): The other Counter to compare to. **Returns:** True if all counts are greater than in the other Counter, False otherwise. ### `__ge__` `__ge__(self, other: Self) -> Bool` Check if all counts are greater than or equal to the other Counter. **Args:** * ​other (`Self`): The other Counter to compare to. **Returns:** True if all counts are greater than or equal to the other Counter, False otherwise. ### `__contains__` `__contains__(self, key: V) -> Bool` Check if a given key is in the Counter or not. **Args:** * ​key (`V`): The key to check. **Returns:** True if the key exists in the Counter, False otherwise. ### `__add__` `__add__(self, other: Self) -> Self` Add counts from two Counters. **Args:** * ​other (`Self`): The other Counter to add to this Counter. **Returns:** A new Counter with the counts from both Counters added together. ### `__sub__` `__sub__(self, other: Self) -> Self` Subtract counts, but keep only results with positive counts. **Args:** * ​other (`Self`): The other Counter to subtract from this Counter. **Returns:** A new Counter with the counts from the other Counter subtracted from this Counter. ### `__and__` `__and__(self, other: Self) -> Self` Intersection: keep common elements with the minimum count. **Args:** * ​other (`Self`): The other Counter to intersect with. **Returns:** A new Counter with the common elements and the minimum count of the two Counters. ### `__or__` `__or__(self, other: Self) -> Self` Union: keep all elements with the maximum count. **Args:** * ​other (`Self`): The other Counter to union with.
**Returns:** A new Counter with all elements and the maximum count of the two Counters. ### `__iadd__` `__iadd__(mut self, other: Self)` Add counts from another Counter to this Counter. **Args:** * ​other (`Self`): The other Counter to add to this Counter. ### `__isub__` `__isub__(mut self, other: Self)` Subtract counts from another Counter from this Counter. **Args:** * ​other (`Self`): The other Counter to subtract from this Counter. ### `__iand__` `__iand__(mut self, other: Self)` Intersection: keep common elements with the minimum count. **Args:** * ​other (`Self`): The other Counter to intersect with. ### `__ior__` `__ior__(mut self, other: Self)` Union: keep all elements with the maximum count. **Args:** * ​other (`Self`): The other Counter to union with. ### `copy` `copy(self) -> Self` Create a new Counter by copying another Counter. **Returns:** A copy of the value. ### `fromkeys` `static fromkeys(keys: List[V, hint_trivial_type], value: Int) -> Self` Create a new Counter from a list of keys and a default value. **Args:** * ​keys (`List[V, hint_trivial_type]`): The keys to create the Counter from. * ​value (`Int`): The default value to associate with each key. **Returns:** A new Counter with the keys and default value. ### `__iter__` `__iter__(self) -> _DictKeyIter[V, Int, self._data]` Iterate over the Counter's keys as immutable references. **Returns:** An iterator of immutable references to the Counter values. ### `__len__` `__len__(self) -> Int` Returns the number of elements currently stored in the Counter. **Returns:** The number of elements in the Counter. ### `get` `get(self, value: V) -> Optional[Int]` Get a value from the counter. **Args:** * ​value (`V`): The value to search for in the Counter. **Returns:** An optional value containing a copy of the value if it was present, otherwise an empty Optional. `get(self, value: V, default: Int) -> Int` Get a value from the Counter. **Args:** * ​value (`V`): The value to search for in the counter. * ​default (`Int`): Default count to return. **Returns:** A copy of the value if it was present, otherwise default. ### `pop` `pop(mut self, value: V) -> Int` Remove a value from the Counter by value. **Args:** * ​value (`V`): The value to remove from the Counter. **Returns:** The value associated with the key, if it was in the Counter. **Raises:** "KeyError" if the key was not present in the Counter. `pop(mut self, value: V, owned default: Int) -> Int` Remove a value from the Counter by value. **Args:** * ​value (`V`): The value to remove from the Counter. * ​default (`Int`): Optionally provide a default value to return if the value was not found instead of raising. **Returns:** The value associated with the key, if it was in the Counter. If it wasn't, return the provided default value instead. **Raises:** "KeyError" if the key was not present in the Counter and no default value was provided. ### `keys` `keys(ref self) -> _DictKeyIter[V, Int, self_is_origin._data]` Iterate over the Counter's keys as immutable references. **Returns:** An iterator of immutable references to the Counter keys. ### `values` `values(ref self) -> _DictValueIter[V, Int, self_is_origin._data]` Iterate over the Counter's values as references. **Returns:** An iterator of references to the Counter values. ### `items` `items(self) -> _DictEntryIter[V, Int, self._data]` Iterate over the Counter's entries as immutable references. **Returns:** An iterator of immutable references to the Counter entries.
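Putting the lookup and removal methods together, a small hedged sketch (it assumes string literals convert to `String` implicitly, as in the usage examples above):

```mojo
from collections import Counter

fn main() raises:
    var c = Counter[String]("a", "a", "b")
    print(c["a"])         # 2
    print(c.get("z", 0))  # 0 -- default when the key is absent
    print(c.pop("b"))     # 1 -- removes "b" and returns its count; raises if absent
    print(len(c))         # 1
```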
### `clear` `clear(mut self)` Remove all elements from the Counter. ### `popitem` `popitem(mut self) -> CountTuple[V]` Remove and return an arbitrary (key, value) pair from the Counter. **Returns:** A CountTuple containing the key and value of the removed item. **Raises:** "KeyError" if the Counter is empty. ### `total` `total(self) -> UInt` Return the total of all counts in the Counter. **Returns:** The total of all counts in the Counter. ### `most_common` `most_common(self, n: UInt) -> List[CountTuple[V]]` Return a list of the `n` most common elements and their counts from the most common to the least. **Args:** * ​n (`UInt`): The number of most common elements to return. **Returns:** A list of the n most common elements and their counts. ### `elements` `elements(self) -> List[V]` Return an iterator over elements repeating each as many times as its count. **Returns:** An iterator over the elements in the Counter. ### `update` `update(mut self, other: Self)` Update the Counter, like `dict.update()` but add counts instead of replacing them. **Args:** * ​other (`Self`): The Counter to update this Counter with. ### `subtract` `subtract(mut self, other: Self)` Subtract counts. Both inputs and outputs may be zero or negative. **Args:** * ​other (`Self`): The Counter to subtract from this Counter. --- ## counter Defines the `Counter` type. You can import these APIs from the `collections` package. For example:

```mojo
from collections import Counter
```

## Structs * [​`Counter`](/mojo/stdlib/collections/counter/Counter): A container for counting hashable items. * [​`CountTuple`](/mojo/stdlib/collections/counter/CountTuple): A tuple representing a value and its count in a Counter. --- ## Deque `struct Deque[ElementType: Copyable & Movable]` Implements a double-ended queue. It supports pushing and popping from both ends in O(1) time, resizing the underlying storage as needed. ## Parameters * ​ElementType (`Copyable & Movable`): The type of the elements in the deque. Must implement the traits `Copyable` and `Movable`. ## Implemented traits `AnyType`, `Boolable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Aliases ### `default_capacity` `alias default_capacity = 64` The default capacity of the deque: must be a power of 2. ## Methods ### `__init__` `__init__(out self, *, owned elements: Optional[List[ElementType]] = Optional(None), capacity: Int = 64, min_capacity: Int = 64, maxlen: Int = -1, shrink: Bool = True)` Constructs a deque. **Args:** * ​elements (`Optional[List[ElementType]]`): The optional list of initial deque elements. * ​capacity (`Int`): The initial capacity of the deque. * ​min\_capacity (`Int`): The minimum allowed capacity of the deque when shrinking. * ​maxlen (`Int`): The maximum allowed capacity of the deque when growing. * ​shrink (`Bool`): Whether storage should be deallocated when not needed. `__init__(out self, owned *values: ElementType, *, __list_literal__: Tuple[] = Tuple())` Constructs a deque from the given values. **Args:** * ​\*values (`ElementType`): The values to populate the deque with. * ​**list\_literal** (`Tuple[]`): Tell Mojo to use this method for list literals. `__init__(out self, *, owned elements: VariadicListMem[ElementType, origin, is_owned])` Constructs a deque from the given values. **Args:** * ​elements (`VariadicListMem[ElementType, origin, is_owned]`): The values to populate the deque with.
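A minimal construction-and-access sketch using only the methods documented in this section:

```mojo
from collections import Deque

fn main():
    var dq = Deque[Int](2, 3)
    dq.append(4)      # push right: [2, 3, 4]
    dq.appendleft(1)  # push left:  [1, 2, 3, 4]
    print(len(dq))    # 4
    print(dq[0])      # 1
```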
**Args:** * ​existing (`Self`): The existing deque. ### `__del__` `__del__(owned self)` Destroys all elements in the deque and frees its memory. ### `__bool__` `__bool__(self) -> Bool` Checks whether the deque has any elements or not. **Returns:** `False` if the deque is empty, `True` if there is at least one element. ### `__getitem__` `__getitem__(ref self, idx: Int) -> ref [self] ElementType` Gets the deque element at the given index. **Args:** * ​idx (`Int`): The index of the element. **Returns:** A reference to the element at the given index. ### `__eq__` `__eq__[T: EqualityComparable & Copyable & Movable, //](self: Deque[T], other: Deque[T]) -> Bool` Checks if two deques are equal. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `EqualityComparable`. **Args:** * ​other (`Deque[T]`): The deque to compare with. **Returns:** `True` if the deques are equal, `False` otherwise. ### `__ne__` `__ne__[T: EqualityComparable & Copyable & Movable, //](self: Deque[T], other: Deque[T]) -> Bool` Checks if two deques are not equal. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `EqualityComparable`. **Args:** * ​other (`Deque[T]`): The deque to compare with. **Returns:** `True` if the deques are not equal, `False` otherwise. ### `__contains__` `__contains__[T: EqualityComparable & Copyable & Movable, //](self: Deque[T], value: T) -> Bool` Verify if a given value is present in the deque. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `EqualityComparable`. **Args:** * ​value (`T`): The value to find. **Returns:** True if the value is contained in the deque, False otherwise. ### `__add__` `__add__(self, other: Self) -> Self` Concatenates self with other and returns the result as a new deque. **Args:** * ​other (`Self`): Deque whose elements will be appended to the elements of self. **Returns:** The newly created deque with the properties of `self`. ### `__mul__` `__mul__(self, n: Int) -> Self` Concatenates `n` deques of `self` and returns a new deque. **Args:** * ​n (`Int`): The multiplier number. **Returns:** The new deque. ### `__iadd__` `__iadd__(mut self, other: Self)` Appends the elements of the other deque into self. **Args:** * ​other (`Self`): Deque whose elements will be appended to self. ### `__imul__` `__imul__(mut self, n: Int)` Concatenates self `n` times in place. **Args:** * ​n (`Int`): The multiplier number. ### `copy` `copy(self) -> Self` Creates a deep copy of the given deque. **Returns:** A copy of the value. ### `__iter__` `__iter__(ref self) -> _DequeIter[ElementType, self_is_origin]` Iterates over elements of the deque, returning the references. **Returns:** An iterator of the references to the deque elements. ### `__reversed__` `__reversed__(ref self) -> _DequeIter[ElementType, self_is_origin, False]` Iterate backwards over the deque, returning the references. **Returns:** A reversed iterator of the references to the deque elements. ### `__len__` `__len__(self) -> Int` Gets the number of elements in the deque. **Returns:** The number of elements in the deque. ### `write_to` `write_to[T: Representable & Copyable & Movable, WriterType: Writer](self: Deque[T], mut writer: WriterType)` Writes `my_deque.__str__()` to a `Writer`. **Parameters:** * ​T (`Representable & Copyable & Movable`): The type of the Deque elements.
Must implement the trait `Representable`. * ​WriterType (`Writer`): A type conforming to the `Writer` trait. **Args:** * ​writer (`WriterType`): The object to write to. ### `__str__` `__str__[T: Representable & Copyable & Movable, //](self: Deque[T]) -> String` Returns a string representation of a `Deque`. Note that since we can't condition methods on a trait yet, the way to call this method is a bit special. Here is an example below: ```mojo my_deque = Deque[Int](1, 2, 3) print(my_deque.__str__()) ``` When the compiler supports conditional methods, then a simple `String(my_deque)` will be enough. **Parameters:** * ​T (`Representable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `Representable`. **Returns:** A string representation of the deque. ### `__repr__` `__repr__[T: Representable & Copyable & Movable, //](self: Deque[T]) -> String` Returns a string representation of a `Deque`. Note that since we can't condition methods on a trait yet, the way to call this method is a bit special. Here is an example below: ```mojo my_deque = Deque[Int](1, 2, 3) print(my_deque.__repr__()) ``` When the compiler supports conditional methods, then a simple `repr(my_deque)` will be enough. **Parameters:** * ​T (`Representable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `Representable`. **Returns:** A string representation of the deque. ### `append` `append(mut self, owned value: ElementType)` Appends a value to the right side of the deque. **Args:** * ​value (`ElementType`): The value to append. ### `appendleft` `appendleft(mut self, owned value: ElementType)` Appends a value to the left side of the deque. **Args:** * ​value (`ElementType`): The value to append. ### `clear` `clear(mut self)` Removes all elements from the deque, leaving it with length 0. Resets the underlying storage capacity to `_min_capacity`. ### `count` `count[T: EqualityComparable & Copyable & Movable, //](self: Deque[T], value: T) -> Int` Counts the number of occurrences of a `value` in the deque. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `EqualityComparable`. **Args:** * ​value (`T`): The value to count. **Returns:** The number of occurrences of the value in the deque. ### `extend` `extend(mut self, owned values: List[ElementType])` Extends the right side of the deque by consuming elements of the list argument. **Args:** * ​values (`List[ElementType]`): List whose elements will be added at the right side of the deque. ### `extendleft` `extendleft(mut self, owned values: List[ElementType])` Extends the left side of the deque by consuming elements from the list argument. Acts as a series of left appends, resulting in reversed order of the elements in the list argument. **Args:** * ​values (`List[ElementType]`): List whose elements will be added at the left side of the deque. ### `index` `index[T: EqualityComparable & Copyable & Movable, //](self: Deque[T], value: T, start: Int = 0, stop: Optional[Int] = Optional(None)) -> Int` Returns the index of the first occurrence of a `value` in a deque, restricted to the range given by the `start` and `stop` bounds. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the `EqualityComparable` trait. **Args:** * ​value (`T`): The value to search for. * ​start (`Int`): The starting index of the search, treated as a slice index (defaults to 0).
* ​stop (`Optional[Int]`): The ending index of the search, treated as a slice index (defaults to None, which means the end of the deque). **Returns:** The index of the first occurrence of the value in the deque. **Raises:** ValueError: If the value is not found in the deque. ### `insert` `insert(mut self, idx: Int, owned value: ElementType)` Inserts the `value` into the deque at position `idx`. **Args:** * ​idx (`Int`): The position to insert the value into. * ​value (`ElementType`): The value to insert. **Raises:** IndexError: If the deque is already at its maximum size. ### `remove` `remove[T: EqualityComparable & Copyable & Movable, //](mut self: Deque[T], value: T)` Removes the first occurrence of the `value`. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the `EqualityComparable` trait. **Args:** * ​value (`T`): The value to remove. **Raises:** ValueError: If the value is not found in the deque. ### `peek` `peek(self) -> ElementType` Inspect the last (rightmost) element of the deque without removing it. **Returns:** The last (rightmost) element of the deque. **Raises:** IndexError: If the deque is empty. ### `peekleft` `peekleft(self) -> ElementType` Inspect the first (leftmost) element of the deque without removing it. **Returns:** The first (leftmost) element of the deque. **Raises:** IndexError: If the deque is empty. ### `pop` `pop(mut self) -> ElementType` Removes and returns the element from the right side of the deque. **Returns:** The popped value. **Raises:** IndexError: If the deque is empty. ### `popleft` `popleft(mut self) -> ElementType` Removes and returns the element from the left side of the deque. **Returns:** The popped value. **Raises:** IndexError: If the deque is empty. ### `reverse` `reverse(mut self)` Reverses the elements of the deque in-place. ### `rotate` `rotate(mut self, n: Int = 1)` Rotates the deque by `n` steps. If `n` is positive, rotates to the right. If `n` is negative, rotates to the left. **Args:** * ​n (`Int`): Number of steps to rotate the deque (defaults to 1). --- ## deque Defines the Deque type. You can import these APIs from the `collections` package. Examples: ```mojo from collections import Deque ``` ## Structs * [​`Deque`](/mojo/stdlib/collections/deque/Deque): Implements a double-ended queue. --- ## Dict `struct Dict[K: Copyable & Movable & Hashable & EqualityComparable, V: Copyable & Movable]` A container that stores key-value pairs. The key type and value type must be specified statically, unlike a Python dictionary, which can accept arbitrary key and value types. The key type must implement the `KeyElement` trait, which encompasses `Movable`, `Hashable`, and `EqualityComparable`. It also includes `Copyable` until we have references. The value type must implement the `Copyable` and `Movable` traits. Examples: ```mojo var d = Dict[String, Int]() d["a"] = 1 d["b"] = 2 print(len(d)) # prints 2 print(d["a"]) # prints 1 print(d.pop("b")) # prints 2 print(len(d)) # prints 1 ``` For more information on the Mojo `Dict` type, see the [Mojo `Dict` manual](/mojo/manual/types/#dict). To learn more about using Python dictionaries from Mojo, see [Python types in Mojo](/mojo/manual/python/types/#python-types-in-mojo). ## Parameters * ​K (`Copyable & Movable & Hashable & EqualityComparable`): The type of the dictionary key. Must be `Hashable` and `EqualityComparable` so we can find the key in the map. * ​V (`Copyable & Movable`): The value type of the dictionary.
Currently must be Copyable & Movable. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Aliases ### `EMPTY` `alias EMPTY = -1` ### `REMOVED` `alias REMOVED = -2` ## Methods ### `__init__` `__init__(out self)` Initialize an empty dictionary. `__init__(out self, *, power_of_two_initial_capacity: Int)` Initialize an empty dictionary with a pre-reserved initial capacity. Examples: ```mojo var x = Dict[Int, Int](power_of_two_initial_capacity = 1024) # Insert (2/3 of 1024) entries without reallocation. ``` **Args:** * ​power\_of\_two\_initial\_capacity (`Int`): At least 8, has to be a power of two. `__init__(out self, owned keys: List[K], owned values: List[V], __dict_literal__: Tuple[])` Constructs a dictionary from the given keys and values. **Args:** * ​keys (`List[K]`): The list of keys to build the dictionary with. * ​values (`List[V]`): The corresponding values to pair with the keys. * ​**dict\_literal** (`Tuple[]`): Tell Mojo to use this method for dict literals. ### `__copyinit__` `__copyinit__(out self, existing: Self)` Copy an existing dictionary. **Args:** * ​existing (`Self`): The existing dict. ### `__bool__` `__bool__(self) -> Bool` Check if the dictionary is empty or not. **Returns:** `False` if the dictionary is empty, `True` if there is at least one element. ### `__getitem__` `__getitem__(self, key: K) -> ref [*[0,0]._entries._value.value] V` Retrieve a value out of the dictionary. **Args:** * ​key (`K`): The key to retrieve. **Returns:** The value associated with the key, if it's present. **Raises:** "KeyError" if the key isn't present. ### `__setitem__` `__setitem__(mut self, owned key: K, owned value: V)` Set a value in the dictionary by key. **Args:** * ​key (`K`): The key to associate with the specified value. * ​value (`V`): The data to store in the dictionary. ### `__contains__` `__contains__(self, key: K) -> Bool` Check if a given key is in the dictionary or not. **Args:** * ​key (`K`): The key to check. **Returns:** True if the key exists in the dictionary, False otherwise. ### `__or__` `__or__(self, other: Self) -> Self` Merge self with other and return the result as a new dict. **Args:** * ​other (`Self`): The dictionary to merge with. **Returns:** The result of the merge. ### `__ior__` `__ior__(mut self, other: Self)` Merge self with other in place. **Args:** * ​other (`Self`): The dictionary to merge with. ### `copy` `copy(self) -> Self` Copy an existing dictionary. **Returns:** A copy of the value. ### `fromkeys` `static fromkeys(keys: List[K, hint_trivial_type], value: V) -> Self` Create a new dictionary with keys from list and values set to value. **Args:** * ​keys (`List[K, hint_trivial_type]`): The keys to set. * ​value (`V`): The value to set. **Returns:** The new dictionary. `static fromkeys(keys: List[K, hint_trivial_type], value: Optional[V] = Optional(None)) -> Dict[K, Optional[V]]` Create a new dictionary with keys from list and values set to value. **Args:** * ​keys (`List[K, hint_trivial_type]`): The keys to set. * ​value (`Optional[V]`): The value to set. **Returns:** The new dictionary. ### `__iter__` `__iter__(ref self) -> _DictKeyIter[K, V, self_is_origin]` Iterate over the dict's keys as immutable references. **Returns:** An iterator of immutable references to the dictionary keys. ### `__reversed__` `__reversed__(ref self) -> _DictKeyIter[K, V, self_is_origin, False]` Iterate backwards over the dict keys, returning immutable references.
**Returns:** A reversed iterator of immutable references to the dict keys. ### `__len__` `__len__(self) -> Int` The number of elements currently stored in the dictionary. **Returns:** The number of elements currently stored in the dictionary. ### `__str__` `__str__[T: Copyable & Movable & Hashable & EqualityComparable & Representable, U: Copyable & Movable & Representable, //](self: Dict[T, U]) -> String` Returns a string representation of a `Dict`. Notes: Since we can't condition methods on a trait yet, the way to call this method is a bit special. Here is an example below: ```mojo var my_dict = Dict[Int, Float64]() my_dict[1] = 1.1 my_dict[2] = 2.2 dict_as_string = my_dict.__str__() print(dict_as_string) # prints "{1: 1.1, 2: 2.2}" ``` When the compiler supports conditional methods, then a simple `String(my_dict)` will be enough. **Parameters:** * ​T (`Copyable & Movable & Hashable & EqualityComparable & Representable`): The type of the keys in the Dict. Must implement the traits `Representable` and `KeyElement`. * ​U (`Copyable & Movable & Representable`): The type of the values in the Dict. Must implement the traits `Representable`, `Copyable` and `Movable`. **Returns:** A string representation of the Dict. ### `find` `find(self, key: K) -> Optional[V]` Find a value in the dictionary by key. **Args:** * ​key (`K`): The key to search for in the dictionary. **Returns:** An optional value containing a copy of the value if it was present, otherwise an empty Optional. ### `get` `get(self, key: K) -> Optional[V]` Get a value from the dictionary by key. **Args:** * ​key (`K`): The key to search for in the dictionary. **Returns:** An optional value containing a copy of the value if it was present, otherwise an empty Optional. `get(self, key: K, default: V) -> V` Get a value from the dictionary by key. **Args:** * ​key (`K`): The key to search for in the dictionary. * ​default (`V`): Default value to return. **Returns:** A copy of the value if it was present, otherwise default. ### `pop` `pop(mut self, key: K, owned default: V) -> V` Remove a value from the dictionary by key. **Args:** * ​key (`K`): The key to remove from the dictionary. * ​default (`V`): A default value to return if the key was not found instead of raising. **Returns:** The value associated with the key, if it was in the dictionary. If it wasn't, return the provided default value instead. `pop(mut self, key: K) -> V` Remove a value from the dictionary by key. **Args:** * ​key (`K`): The key to remove from the dictionary. **Returns:** The value associated with the key, if it was in the dictionary. Raises otherwise. **Raises:** "KeyError" if the key was not present in the dictionary. ### `popitem` `popitem(mut self) -> DictEntry[K, V]` Remove and return a (key, value) pair from the dictionary. Notes: Pairs are returned in LIFO order. popitem() is useful to destructively iterate over a dictionary, as often used in set algorithms. If the dictionary is empty, calling popitem() raises a KeyError. **Returns:** Last dictionary item **Raises:** "KeyError" if the dictionary is empty. ### `keys` `keys(ref self) -> _DictKeyIter[K, V, self_is_origin]` Iterate over the dict's keys as immutable references. **Returns:** An iterator of immutable references to the dictionary keys. ### `values` `values(ref self) -> _DictValueIter[K, V, self_is_origin]` Iterate over the dict's values as references. **Returns:** An iterator of references to the dictionary values. 
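To make the lookup variants above concrete, here is a short sketch of `find` versus the raising `__getitem__`, plus LIFO `popitem` (assuming `Dict` is importable from the `collections` package, consistent with the examples in this section):

```mojo
from collections import Dict

def main():
    var d = Dict[String, Int]()
    d["a"] = 1
    d["b"] = 2
    # `find` returns an Optional copy instead of raising on a missing key.
    print(d.find("c").or_else(0))  # 0
    # `popitem` removes entries in LIFO order (most recently inserted first).
    var last = d.popitem()
    print(last.key, last.value)  # b 2
```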
### `items` `items(ref self) -> _DictEntryIter[K, V, self_is_origin]` Iterate over the dict's entries as immutable references. Examples: ```mojo var my_dict = Dict[String, Int]() my_dict["a"] = 1 my_dict["b"] = 2 for e in my_dict.items(): print(e.key, e.value) ``` Notes: These can't yet be unpacked like Python dict items, but you can access the key and value as attributes. **Returns:** An iterator of immutable references to the dictionary entries. ### `update` `update(mut self, other: Self, /)` Update the dictionary with the key/value pairs from other, overwriting existing keys. Notes: The argument must be positional only. **Args:** * ​other (`Self`): The dictionary to update from. ### `clear` `clear(mut self)` Remove all elements from the dictionary. ### `setdefault` `setdefault(mut self, key: K, owned default: V) -> ref [*[0,0]._entries._value.value] V` Get a value from the dictionary by key, or set it to a default if it doesn't exist. **Args:** * ​key (`K`): The key to search for in the dictionary. * ​default (`V`): The default value to set if the key is not present. **Returns:** The value associated with the key, or the default value if it wasn't present. --- ## DictEntry `struct DictEntry[K: Copyable & Movable & Hashable & EqualityComparable, V: Copyable & Movable]` Store a key-value pair entry inside a dictionary. ## Parameters * ​K (`Copyable & Movable & Hashable & EqualityComparable`): The key type of the dict. Must be Hashable+EqualityComparable. * ​V (`Copyable & Movable`): The value type of the dict. ## Fields * ​hash (`SIMD[uint64, 1]`): `key.__hash__()`, stored so hashing isn't re-computed during dict lookup. * ​key (`K`): The unique key for the entry. * ​value (`V`): The value associated with the key. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, owned key: K, owned value: V)` Create an entry from a key and value, computing the hash. **Args:** * ​key (`K`): The key of the entry. * ​value (`V`): The value of the entry. ### `copy` `copy(self) -> Self` Copy an existing entry. **Returns:** A copy of the value. ### `reap_value` `reap_value(owned self, out result: V)` Take the value from an owned entry. **Returns:** The value of the entry. --- ## OwnedKwargsDict `struct OwnedKwargsDict[V: Copyable & Movable]` Container used to pass owned variadic keyword arguments to functions. This type mimics the interface of a dictionary with `String` keys, and should be usable more-or-less like a dictionary. Notably, however, this type should not be instantiated directly by users. ## Parameters * ​V (`Copyable & Movable`): The value type of the dictionary. Currently must be Copyable & Movable. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Aliases ### `key_type` `alias key_type = String` ## Methods ### `__init__` `__init__(out self)` Initialize an empty keyword dictionary. ### `__copyinit__` `__copyinit__(out self, existing: Self)` Copy an existing keyword dictionary. **Args:** * ​existing (`Self`): The existing keyword dictionary. ### `__moveinit__` `__moveinit__(out self, owned existing: Self)` Move data of an existing keyword dictionary into a new one. **Args:** * ​existing (`Self`): The existing keyword dictionary. ### `__getitem__` `__getitem__(self, key: String) -> V` Retrieve a value out of the keyword dictionary. **Args:** * ​key (`String`): The key to retrieve. 
**Returns:** The value associated with the key, if it's present. **Raises:** "KeyError" if the key isn't present. ### `__setitem__` `__setitem__(mut self, key: String, value: V)` Set a value in the keyword dictionary by key. **Args:** * ​key (`String`): The key to associate with the specified value. * ​value (`V`): The data to store in the dictionary. ### `__contains__` `__contains__(self, key: String) -> Bool` Check if a given key is in the keyword dictionary or not. **Args:** * ​key (`String`): The key to check. **Returns:** True if the key exists in the keyword dictionary, False otherwise. ### `copy` `copy(self) -> Self` Copy an existing keyword dictionary. **Returns:** A copy of the value. ### `__len__` `__len__(self) -> Int` The number of elements currently stored in the keyword dictionary. **Returns:** The number of elements currently stored in the keyword dictionary. ### `find` `find(self, key: String) -> Optional[V]` Find a value in the keyword dictionary by key. **Args:** * ​key (`String`): The key to search for in the dictionary. **Returns:** An optional value containing a copy of the value if it was present, otherwise an empty Optional. ### `pop` `pop(mut self, key: String, owned default: V) -> V` Remove a value from the dictionary by key. **Args:** * ​key (`String`): The key to remove from the dictionary. * ​default (`V`): A default value to return if the key was not found instead of raising. **Returns:** The value associated with the key, if it was in the dictionary. If it wasn't, return the provided default value instead. `pop(mut self, key: String) -> V` Remove a value from the dictionary by key. **Args:** * ​key (`String`): The key to remove from the dictionary. **Returns:** The value associated with the key, if it was in the dictionary. Raises otherwise. **Raises:** "KeyError" if the key was not present in the dictionary. ### `__iter__` `__iter__(ref self) -> _DictKeyIter[String, V, self_is_origin._dict]` Iterate over the keyword dict's keys as immutable references. **Returns:** An iterator of immutable references to the dictionary keys. ### `keys` `keys(ref self) -> _DictKeyIter[String, V, self_is_origin._dict]` Iterate over the keyword dict's keys as immutable references. **Returns:** An iterator of immutable references to the dictionary keys. ### `values` `values(ref self) -> _DictValueIter[String, V, self_is_origin._dict]` Iterate over the keyword dict's values as references. **Returns:** An iterator of references to the dictionary values. ### `items` `items(ref self) -> _DictEntryIter[String, V, self_is_origin._dict]` Iterate over the keyword dictionary's entries as immutable references. Examples: ```mojo var my_dict = Dict[String, Int]() my_dict["a"] = 1 my_dict["b"] = 2 for e in my_dict.items(): print(e.key, e.value) ``` Notes: These can't yet be unpacked like Python dict items, but you can access the key and value as attributes. **Returns:** An iterator of immutable references to the dictionary entries. --- ## dict Defines `Dict`, a collection that stores key-value pairs. Dict provides an efficient, O(1) amortized average-time complexity for insert, lookup, and removal of dictionary elements. Its implementation closely mirrors Python's `dict` implementation: * Performance and size are heavily optimized for small dictionaries, but can scale to large dictionaries. * Insertion order is implicitly preserved. Iteration over keys, values, and items has a deterministic order based on insertion.
* For more information on the Mojo `Dict` type, see the [Mojo `Dict` manual](/mojo/manual/types/#dict). To learn more about using Python dictionaries from Mojo, see [Python types in Mojo](/mojo/manual/python/types/#python-types-in-mojo). Key elements must implement the `KeyElement` trait, which encompasses Movable, Hashable, and EqualityComparable. It also includes Copyable and Movable until we push references through the standard library types. Value elements must be CollectionElements for a similar reason. Both key and value types must always be Movable so we can resize the dictionary as it grows. See the `Dict` docs for more details. ## Aliases ### `KeyElement` `alias KeyElement = Copyable & Movable & Hashable & EqualityComparable` A trait composition for types which implement all requirements of dictionary keys. Dict keys must minimally be Copyable, Movable, Hashable, and EqualityComparable for a hash map. Until we have references they must also be copyable. ## Structs * [​`Dict`](/mojo/stdlib/collections/dict/Dict): A container that stores key-value pairs. * [​`DictEntry`](/mojo/stdlib/collections/dict/DictEntry): Store a key-value pair entry inside a dictionary. * [​`OwnedKwargsDict`](/mojo/stdlib/collections/dict/OwnedKwargsDict): Container used to pass owned variadic keyword arguments to functions. --- ## collections Implements the collections package. ## Packages * [​`string`](/mojo/stdlib/collections/string/): The string package provides comprehensive Unicode string handling functionality for Mojo. ## Modules * [​`bitset`](/mojo/stdlib/collections/bitset/): Provides a compact, grow-only set of non-negative integers. * [​`counter`](/mojo/stdlib/collections/counter/): Defines the `Counter` type. * [​`deque`](/mojo/stdlib/collections/deque/): Defines the Deque type. * [​`dict`](/mojo/stdlib/collections/dict/): Defines `Dict`, a collection that stores key-value pairs. * [​`inline_array`](/mojo/stdlib/collections/inline_array/): Provides a fixed-size array implementation with compile-time size checking. * [​`interval`](/mojo/stdlib/collections/interval/): A self-balancing interval tree is a specialized binary search tree designed to efficiently store and query intervals. * [​`linked_list`](/mojo/stdlib/collections/linked_list/): * [​`list`](/mojo/stdlib/collections/list/): Defines the List type. * [​`optional`](/mojo/stdlib/collections/optional/): Defines Optional, a type modeling a value which may or may not be present. * [​`set`](/mojo/stdlib/collections/set/): Implements the Set datatype. --- ## InlineArray `struct InlineArray[ElementType: Copyable & Movable, size: Int, *, run_destructors: Bool = False]` A fixed-size sequence of homogeneous elements where size is a constant expression. InlineArray provides a fixed-size array implementation with compile-time size checking. The array size is determined at compile time and cannot be changed. Elements must implement the `Copyable` and `Movable` traits. Examples: ```mojo # Create array of 3 integers var arr = InlineArray[Int, 3](1, 2, 3) # Create array filled with value var filled = InlineArray[Int, 5](fill=42) # Access elements print(arr[0]) # Prints 1 ``` ## Parameters * ​ElementType (`Copyable & Movable`): The type of the elements in the array. Must implement `Copyable` and `Movable`. * ​size (`Int`): The size of the array. Must be a positive integer constant. * ​run\_destructors (`Bool`): Whether to run destructors on the elements. Defaults to `False` for backwards compatibility. Will default to `True` in the future. 
## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Aliases ### `type` `alias type = array<size, :trait ElementType>` ## Methods ### `__init__` `__init__(out self)` This constructor will always cause a compile time error if used. It is used to steer users away from uninitialized memory. `__init__(out self, *, uninitialized: Bool)` Create an InlineArray with uninitialized memory. Examples: ```mojo var uninitialized_array = InlineArray[Int, 10](uninitialized=True) ``` Notes: This constructor is unsafe and should be used with caution. The array elements will be uninitialized and accessing them before initialization is undefined behavior. **Args:** * ​uninitialized (`Bool`): A boolean to indicate if the array should be initialized. Always set to `True` (it's not actually used inside the constructor). `__init__(out self, *, owned unsafe_assume_initialized: InlineArray[UnsafeMaybeUninitialized[ElementType], size])` Constructs an `InlineArray` from an `InlineArray` of `UnsafeMaybeUninitialized`. Warning: This is an unsafe constructor. Only use it if you are certain all elements are properly initialized. Notes: This constructor assumes all elements in the input array are initialized. Using uninitialized elements results in undefined behavior, even for types that are valid for any bit pattern (e.g. `Int` or `Float`). **Args:** * ​unsafe\_assume\_initialized (`InlineArray[UnsafeMaybeUninitialized[ElementType], size]`): The array of `UnsafeMaybeUninitialized` elements. All elements must be initialized. `@implicit` `__init__[batch_size: Int = 64](out self, fill: ElementType)` Constructs an array where each element is initialized to the supplied value. Examples: ```mojo var filled = InlineArray[Int, 5](fill=42) # [42, 42, 42, 42, 42] # For large arrays, consider adjusting batch_size to balance # compile time and runtime performance: var large = InlineArray[Int, 10000].__init__[batch_size=32](fill=0) ``` Notes: * Full unrolling with large arrays (>2k elements) can cause significant compiler slowdowns. * Using batch\_size=64 balances AVX512 efficiency and instruction cache usage. * For very large arrays, using smaller batch sizes (e.g., 32 or 16) can further improve compilation speed while still maintaining good runtime performance. **Parameters:** * ​batch\_size (`Int`): The number of elements to unroll for filling the array. Default is 64, which optimizes for AVX512 operations on modern CPUs. For large arrays (>2k elements), this batched approach significantly improves compile times compared to full unrolling while maintaining good runtime performance. **Args:** * ​fill (`ElementType`): The element value to fill each index with. `@implicit` `__init__(out self, owned *elems: ElementType, *, __list_literal__: Tuple[] = Tuple())` Constructs an array from a variadic list of elements. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) # [1, 2, 3] ``` **Args:** * ​\*elems (`ElementType`): The elements to initialize the array with. Must match the array size. * ​**list\_literal** (`Tuple[]`): Specifies that this constructor can be used for list literals. `__init__(out self, *, owned storage: VariadicListMem[ElementType, origin, is_owned])` Construct an array from a low-level internal representation. **Args:** * ​storage (`VariadicListMem[ElementType, origin, is_owned]`): The variadic list storage to construct from. Must match array size.
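As a worked example of the `uninitialized=True` constructor above, the sketch below writes every slot before any read. This is only safe as written for a trivial element type like `Int`; for non-trivial types, prefer the `UnsafeMaybeUninitialized` constructor:

```mojo
def main():
    # Allocate storage without initializing it: reading any element
    # before writing it is undefined behavior.
    var arr = InlineArray[Int, 4](uninitialized=True)
    for i in range(4):
        arr[i] = i * i  # initialize every slot before first use
    print(arr[3])  # 9
```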
### `__copyinit__` `__copyinit__(out self, other: Self)` Copy constructs the array from another array. Notes: Creates a deep copy by copying each element individually. **Args:** * ​other (`Self`): The array to copy from. ### `__del__` `__del__(owned self)` Deallocates the array and destroys its elements. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) # arr's destructor is called automatically when it goes out of scope ``` Notes: This destructor is called automatically when the array goes out of scope. If the array's `run_destructors` parameter is `True`, it will call the destructor on each element in the array before deallocating the array's memory. ### `__getitem__` `__getitem__[I: Indexer](ref self, idx: I) -> ref [self] ElementType` Gets a reference to the element at the given index. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) print(arr[0]) # Prints 1 - first element print(arr[1]) # Prints 2 - second element print(arr[-1]) # Prints 3 - last element print(arr[-2]) # Prints 2 - second to last element ``` Notes: This method provides array-style indexing access to elements in the InlineArray. It supports both positive indices starting from 0 and negative indices counting backwards from the end of the array. The index is bounds-checked at runtime. **Parameters:** * ​I (`Indexer`): The type parameter representing the index type, must implement Indexer trait. **Args:** * ​idx (`I`): The index to access. Can be positive (0 to len-1) or negative (-len to -1). **Returns:** A reference to the element at the specified index. `__getitem__[I: Indexer, //, idx: I](ref self) -> ref [self] ElementType` Gets a reference to the element at the given index with compile-time bounds checking. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) print(arr[0]) # Prints 1 - first element print(arr[-1]) # Prints 3 - last element ``` Notes: This overload provides array-style indexing with compile-time bounds checking. The index must be a compile-time constant value. It supports both positive indices starting from 0 and negative indices counting backwards from the end of the array. **Parameters:** * ​I (`Indexer`): The type parameter representing the index type, must implement Indexer trait. * ​idx (`I`): The compile-time constant index to access. Can be positive (0 to len-1) or negative (-len to -1). **Returns:** A reference to the element at the specified index. ### `__contains__` `__contains__[T: EqualityComparable & Copyable & Movable, //](self: InlineArray[T, size], value: T) -> Bool` Tests if a value is present in the array using the `in` operator. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) print(3 in arr) # Prints True - value exists print(4 in arr) # Prints False - value not found ``` Notes: This method enables using the `in` operator to check if a value exists in the array. It performs a linear search comparing each element for equality with the given value. The element type must implement the `EqualityComparable`, `Copyable` and `Movable` traits to support equality comparison. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The element type; must implement `EqualityComparable`, `Copyable`, and `Movable`. **Args:** * ​value (`T`): The value to search for. **Returns:** True if the value is found in any position in the array, False otherwise. ### `copy` `copy(self) -> Self` Creates a deep copy of the array.
Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) var copy = arr.copy() # Creates new array [1, 2, 3] ``` **Returns:** A new array containing copies of all elements. ### `__len__` `__len__(self) -> Int` Returns the length of the array. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) print(len(arr)) # Prints 3 ``` Notes: The length is a compile-time constant value determined by the size parameter used when creating the array. **Returns:** The size of the array as an Int. ### `unsafe_get` `unsafe_get[I: Indexer](ref self, idx: I) -> ref [self] ElementType` Gets a reference to an element without bounds checking. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) print(arr.unsafe_get(0)) # Prints 1 ``` Warning: This is an unsafe method. No bounds checking is performed. Using an invalid index will cause undefined behavior. Negative indices are not supported. Notes: This is an unsafe method that skips bounds checking for performance. Users should prefer `__getitem__` instead for safety. **Parameters:** * ​I (`Indexer`): A type parameter representing the index type, must implement Indexer trait. **Args:** * ​idx (`I`): The index of the element to get. Must be non-negative and in bounds. Using an invalid index will cause undefined behavior. **Returns:** A reference to the element at the given index. ### `unsafe_ptr` `unsafe_ptr(ref self) -> UnsafePointer[ElementType, mut=self_is_mut, origin=self_is_origin]` Gets an unsafe pointer to the underlying array storage. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) var ptr = arr.unsafe_ptr() print(ptr[0]) # Prints 1 ``` Warning: This is an unsafe method. The returned pointer: * Becomes invalid if the array is moved * Must not be used to access memory outside array bounds * Must be refreshed after any operation that could move the array Notes: Returns a raw pointer to the array's memory that can be used for direct memory access. The pointer inherits mutability from the array reference. **Returns:** An `UnsafePointer` to the underlying array storage. The pointer's mutability matches that of the array reference. --- ## inline_array Provides a fixed-size array implementation with compile-time size checking. The `InlineArray` type represents a fixed-size sequence of homogeneous elements where the size is determined at compile time. It provides efficient memory layout and bounds checking while maintaining type safety. The `InlineArray` type is part of the `prelude` module and therefore does not need to be imported in order to use it. Examples: ```mojo # Create an array of 3 integers var arr = InlineArray[Int, 3](1, 2, 3) # Access elements print(arr[0]) # Prints 1 # Fill with a value var filled = InlineArray[Int, 5](fill=42) ``` Notes: * For historical reasons, destructors are not run by default on the elements of an `InlineArray`. This can be controlled with the `run_destructors` parameter. In the future, this will default to `True` and the `run_destructors` parameter will be removed. ## Structs * [​`InlineArray`](/mojo/stdlib/collections/inline_array/InlineArray): A fixed-size sequence of homogeneous elements where size is a constant expression. --- ## Interval `struct Interval[T: IntervalElement]` A half-open interval \[start, end) that represents a range of values. The interval includes the start value but excludes the end value. ## Parameters * ​T (`IntervalElement`): The type of the interval bounds. ## Fields * ​start (`T`): The inclusive start of the interval. * ​end (`T`): The exclusive end of the interval. 
## Implemented traits `AnyType`, `Boolable`, `Copyable`, `EqualityComparable`, `Movable`, `Representable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self, start: T, end: T)` Initialize an interval with start and end values. **Args:** * ​start (`T`): The starting value of the interval. * ​end (`T`): The ending value of the interval. Must be greater than or equal to start. `__init__(out self, interval: Tuple[T, T], /)` Initialize an interval with a tuple of start and end values. **Args:** * ​interval (`Tuple[T, T]`): A tuple containing the start and end values. ### `__copyinit__` `__copyinit__(out self, existing: Self, /)` Create a new instance of the interval by copying the values from an existing one. **Args:** * ​existing (`Self`): The interval to copy values from. ### `__moveinit__` `__moveinit__(out self, owned existing: Self, /)` Create a new instance of the interval by moving the values from an existing one. **Args:** * ​existing (`Self`): The interval to move values from. ### `__bool__` `__bool__(self) -> Bool` Returns whether this interval is non-empty. **Returns:** True if the interval is not empty (start < end), False otherwise. ### `__lt__` `__lt__(self, other: Self) -> Bool` Returns whether this interval is less than another interval. **Args:** * ​other (`Self`): The interval to compare with. **Returns:** True if this interval's start is less than the other interval's start. ### `__le__` `__le__(self, other: Self) -> Bool` Returns whether this interval is less than or equal to another interval. **Args:** * ​other (`Self`): The interval to compare with. **Returns:** True if this interval's start is less than or equal to the other interval's start. ### `__eq__` `__eq__(self, other: Self) -> Bool` Returns whether this interval equals another interval. **Args:** * ​other (`Self`): The interval to compare with. **Returns:** True if both intervals have the same start and end values. ### `__ne__` `__ne__(self, other: Self) -> Bool` Returns whether this interval is not equal to another interval. **Args:** * ​other (`Self`): The interval to compare with. **Returns:** True if the intervals are not equal, False if they are equal. ### `__gt__` `__gt__(self, other: Self) -> Bool` Returns whether this interval is greater than another interval. **Args:** * ​other (`Self`): The interval to compare with. **Returns:** True if this interval's end is greater than the other interval's end. ### `__ge__` `__ge__(self, other: Self) -> Bool` Returns whether this interval is greater than or equal to another interval. **Args:** * ​other (`Self`): The interval to compare with. **Returns:** True if this interval's end is greater than or equal to the other interval's end. ### `__contains__` `__contains__(self, other: T) -> Bool` Returns whether a value is contained within this interval. **Args:** * ​other (`T`): The value to check. **Returns:** True if the value is within the interval bounds, False otherwise. `__contains__(self, other: Self) -> Bool` Returns whether another interval is fully contained within this interval. **Args:** * ​other (`Self`): The interval to check. **Returns:** True if the other interval is fully contained within this interval, False otherwise. ### `overlaps` `overlaps(self, other: Self) -> Bool` Returns whether this interval overlaps with another interval. **Args:** * ​other (`Self`): The interval to check for overlap with. **Returns:** True if the intervals overlap, False otherwise.
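A short sketch of the half-open semantics and the `__contains__`/`overlaps` methods above (it assumes `Interval` is importable from the `collections.interval` module; `Int` satisfies `IntervalElement`):

```mojo
from collections.interval import Interval

def main():
    # Half-open: the start is included, the end is excluded.
    var a = Interval[Int](0, 10)
    var b = Interval[Int](5, 15)
    print(9 in a)         # True
    print(10 in a)        # False: the end is exclusive
    print(a.overlaps(b))  # True
```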
### `union` `union(self, other: Self) -> Self` Returns the union of this interval and another interval. **Args:** * ​other (`Self`): The interval to union with. **Returns:** The union of this interval and the other interval. ### `intersection` `intersection(self, other: Self) -> Self` Returns the intersection of this interval and another interval. **Args:** * ​other (`Self`): The interval to intersect with. **Returns:** The intersection of this interval and the other interval. ### `__len__` `__len__(self) -> Int` Returns the length of this interval. **Returns:** The difference between end and start values as an integer. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes this interval to a writer in the format '(start, end)'. **Parameters:** * ​W (`Writer`): The writer type that implements the Writer trait. **Args:** * ​writer (`W`): The writer to write the interval to. ### `__str__` `__str__(self) -> String` Returns a string representation of this interval. **Returns:** A string in the format '(start, end)' representing this interval. ### `__repr__` `__repr__(self) -> String` Returns a string representation of this interval suitable for debugging. **Returns:** A string in the format '(start, end)' representing this interval. --- ## IntervalElement The trait denotes a trait composition of the `Copyable`, `Movable`, `Writable`, `Intable`, and `Comparable` traits, and additionally requires subtraction. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `Intable`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `UnknownDestructibility`, `Writable` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `__lt__` `__lt__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is less than `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is less than `rhs`. ### `__le__` `__le__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is less than or equal to `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is less than or equal to `rhs`. ### `__eq__` `__eq__(self: _Self, other: _Self) -> Bool` Define whether two instances of the object are equal to each other. **Args:** * ​other (`_Self`): Another instance of the same type. **Returns:** True if the instances are equal according to the type's definition of equality, False otherwise. ### `__ne__` `__ne__(self: _Self, other: _Self) -> Bool` Define whether two instances of the object are not equal to each other. **Args:** * ​other (`_Self`): Another instance of the same type. **Returns:** True if the instances are not equal according to the type's definition of equality, False otherwise. ### `__gt__` `__gt__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is greater than `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is greater than `rhs`. ### `__ge__` `__ge__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is greater than or equal to `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison.
**Returns:** True if `self` is greater than or equal to `rhs`. ### `__sub__` `__sub__(self: _Self, rhs: _Self) -> _Self` Subtracts rhs from self, must be implemented in concrete types. **Args:** * ​rhs (`_Self`): The value to subtract from self. **Returns:** The result of subtracting rhs from self. ### `__int__` `__int__(self: _Self) -> Int` Get the integral representation of the value. **Returns:** The integral representation of the value. ### `write_to` `write_to[W: Writer](self: _Self, mut writer: W)` Formats the string representation of this type to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The type conforming to `Writable`. --- ## IntervalTree `struct IntervalTree[T: IntervalElement, U: Copyable & Movable & Stringable & EqualityComparable & LessThanComparable & GreaterThanComparable & LessThanOrEqualComparable & GreaterThanOrEqualComparable]` An interval tree data structure for efficient range queries. ## Parameters * ​T (`IntervalElement`): The type of the interval bounds, must support subtraction, integer conversion, string conversion, comparison and collection operations. * ​U (`Copyable & Movable & Stringable & EqualityComparable & LessThanComparable & GreaterThanComparable & LessThanOrEqualComparable & GreaterThanOrEqualComparable`): The type of the associated data, must support string conversion and collection operations. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self)` Initializes an empty IntervalTree. ### `insert` `insert(mut self, interval: Tuple[T, T], data: U)` Insert a new interval into the tree using a tuple representation. **Args:** * ​interval (`Tuple[T, T]`): A tuple containing the start and end values of the interval. * ​data (`U`): The data value to associate with this interval. `insert(mut self, interval: Interval[T], data: U)` Insert a new interval into the tree. This method inserts a new interval and its associated data into the interval tree. It maintains the binary search tree property based on interval start times and updates the tree structure to preserve red-black tree properties. **Args:** * ​interval (`Interval[T]`): The interval to insert into the tree. * ​data (`U`): The data value to associate with this interval. ### `__str__` `__str__(self) -> String` Returns a string representation of the interval tree. **Returns:** A string representation of the interval tree. ### `__repr__` `__repr__(self) -> String` Returns a string representation of the interval tree suitable for debugging. **Returns:** A string representation of the interval tree. ### `write_to` `write_to[w: Writer](self, mut writer: w)` Writes the interval tree to a writer. **Parameters:** * ​w (`Writer`): The writer type that implements the Writer trait. **Args:** * ​writer (`w`): The writer to write the interval tree to. ### `depth` `depth(self) -> Int` Returns the depth of the interval tree. **Returns:** The depth of the interval tree. ### `transplant` `transplant(mut self, mut u: UnsafePointer[_IntervalNode[T, U]], mut v: UnsafePointer[_IntervalNode[T, U]])` Transplants the subtree rooted at node u with the subtree rooted at node v. **Args:** * ​u (`UnsafePointer[_IntervalNode[T, U]]`): The node to transplant. * ​v (`UnsafePointer[_IntervalNode[T, U]]`): The node to transplant to. ### `search` `search(self, interval: Tuple[T, T]) -> List[U]` Searches for intervals overlapping with the given tuple. 
**Args:** * ​interval (`Tuple[T, T]`): The interval tuple (start, end). **Returns:** A list of data associated with overlapping intervals. `search(self, interval: Interval[T]) -> List[U]` Searches for intervals overlapping with the given interval. **Args:** * ​interval (`Interval[T]`): The interval to search. **Returns:** A list of data associated with overlapping intervals. --- ## interval A self-balancing interval tree is a specialized binary search tree designed to efficiently store and query intervals. It maintains intervals sorted by their low endpoints and augments each node with a `max_high` attribute, representing the maximum high endpoint in its subtree. This `max_high` value enables efficient overlap searching by pruning the search space. Self-balancing mechanisms, such as Red-Black or AVL trees, ensure logarithmic time complexity for operations. Key Features: * Stores intervals (low, high). * Nodes ordered by `low` endpoints. * `max_high` attribute at each node for efficient overlap search. * Self-balancing (e.g., using Red-Black tree logic) for O(log n) operations. Operations: * Insertion: O(log n) - Adds a new interval, maintaining balance and updating `max_high`. * Overlap Search: O(log n) - Finds intervals overlapping a query interval using `max_high` for pruning. * Deletion: O(log n) - Removes an interval, maintaining balance and updating `max_high`. Space Complexity: O(n), where n is the number of intervals. Use Cases: * Calendar scheduling * Computational geometry * Genomics * Database indexing * Resource allocation In essence, this data structure provides a fast and efficient way to manage and query interval data, particularly for finding overlaps. ## Structs * [​`Interval`](/mojo/stdlib/collections/interval/Interval): A half-open interval \[start, end) that represents a range of values. * [​`IntervalTree`](/mojo/stdlib/collections/interval/IntervalTree): An interval tree data structure for efficient range queries. ## Traits * [​`IntervalElement`](/mojo/stdlib/collections/interval/IntervalElement): The trait denotes a trait composition of the `Copyable`, `Movable`, `Writable`, `Intable`, and `Comparable` traits. Which is also subtractable. --- ## LinkedList `struct LinkedList[ElementType: Copyable & Movable]` A doubly-linked list implementation. A doubly-linked list is a data structure where each element points to both the next and previous elements, allowing for efficient insertion and deletion at any position. ## Parameters * ​ElementType (`Copyable & Movable`): The type of elements stored in the list. Must implement the `Copyable` and `Movable` traits. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initialize an empty linked list. Notes: Time Complexity: O(1). `__init__(out self, owned *elements: ElementType, *, __list_literal__: Tuple[] = Tuple())` Initialize a linked list with the given elements. Notes: Time Complexity: O(n) in len(elements). **Args:** * ​\*elements (`ElementType`): Variable number of elements to initialize the list with. * ​**list\_literal** (`Tuple[]`): Tell Mojo to use this method for list literals. `__init__(out self, *, owned elements: VariadicListMem[ElementType, origin, is_owned])` Construct a list from a `VariadicListMem`. Notes: Time Complexity: O(n) in len(elements). **Args:** * ​elements (`VariadicListMem[ElementType, origin, is_owned]`): The elements to add to the list. 
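The constructors above combine with the O(1) end operations documented below (`append`, `prepend`); here is a minimal sketch, assuming `LinkedList` is re-exported from the `collections` package (otherwise import it from `collections.linked_list`):

```mojo
from collections import LinkedList

def main():
    var xs = LinkedList[Int](2, 3)
    xs.prepend(1)  # O(1) insertion at the head
    xs.append(4)   # O(1) insertion at the tail
    print(len(xs))       # 4
    print(xs[0], xs[3])  # 1 4
```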
### `__copyinit__` `__copyinit__(out self, other: Self)` Initialize this list as a copy of another list. Notes: Time Complexity: O(n) in len(elements). **Args:** * ​other (`Self`): The list to copy from. ### `__moveinit__` `__moveinit__(out self, owned other: Self)` Initialize this list by moving elements from another list. Notes: Time Complexity: O(1). **Args:** * ​other (`Self`): The list to move elements from. ### `__del__` `__del__(owned self)` Clean up the list by freeing all nodes. Notes: Time Complexity: O(n) in len(self). ### `__bool__` `__bool__(self) -> Bool` Check if the list is non-empty. Notes: Time Complexity: O(1). **Returns:** True if the list has elements, False otherwise. ### `__getitem__` `__getitem__[I: Indexer](ref self, index: I) -> ref [self] ElementType` Get the element at the specified index. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​I (`Indexer`): The type of index to use. **Args:** * ​index (`I`): The index of the element to get. **Returns:** The element at the specified index. ### `__setitem__` `__setitem__[I: Indexer](mut self, index: I, owned value: ElementType)` Set the element at the specified index. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​I (`Indexer`): The type of index to use. **Args:** * ​index (`I`): The index of the element to set. * ​value (`ElementType`): The new value to set. ### `__eq__` `__eq__[ElementType: EqualityComparable & Copyable & Movable, //](self: LinkedList[ElementType], other: LinkedList[ElementType]) -> Bool` Checks if the two lists are equal. Notes: Time Complexity: O(n) in min(len(self), len(other)) compares. **Parameters:** * ​ElementType (`EqualityComparable & Copyable & Movable`): The list element type, used to conditionally enable the function. **Args:** * ​other (`LinkedList[ElementType]`): The list to compare to. **Returns:** Whether the lists are equal. ### `__ne__` `__ne__[ElementType: EqualityComparable & Copyable & Movable, //](self: LinkedList[ElementType], other: LinkedList[ElementType]) -> Bool` Checks if the two lists are not equal. Notes: Time Complexity: O(n) in min(len(self), len(other)) compares. **Parameters:** * ​ElementType (`EqualityComparable & Copyable & Movable`): The list element type, used to conditionally enable the function. **Args:** * ​other (`LinkedList[ElementType]`): The list to compare to. **Returns:** Whether the lists are not equal. ### `__contains__` `__contains__[ElementType: EqualityComparable & Copyable & Movable, //](self: LinkedList[ElementType], value: ElementType) -> Bool` Checks if the list contains `value`. Notes: Time Complexity: O(n) in len(self) compares. **Parameters:** * ​ElementType (`EqualityComparable & Copyable & Movable`): The list element type, used to conditionally enable the function. **Args:** * ​value (`ElementType`): The value to search for in the list. **Returns:** Whether the list contains `value`. ### `append` `append(mut self, owned value: ElementType)` Add an element to the end of the list. Notes: Time Complexity: O(1). **Args:** * ​value (`ElementType`): The value to append. ### `prepend` `prepend(mut self, owned value: ElementType)` Add an element to the beginning of the list. Notes: Time Complexity: O(1). **Args:** * ​value (`ElementType`): The value to prepend. ### `reverse` `reverse(mut self)` Reverse the order of elements in the list. Notes: Time Complexity: O(n) in len(self). ### `pop` `pop(mut self) -> ElementType` Remove and return the last element of the list. Notes: Time Complexity: O(1). 
**Returns:** The last element in the list. `pop[I: Indexer](mut self, owned i: I) -> ElementType` Remove the ith element of the list, counting from the tail if given a negative index. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​I (`Indexer`): The type of index to use. **Args:** * ​i (`I`): The index of the element to get. **Returns:** Ownership of the indicated element. ### `maybe_pop` `maybe_pop(mut self) -> Optional[ElementType]` Removes the tail of the list and returns it, if it exists. Notes: Time Complexity: O(1). **Returns:** The tail of the list, if it was present. `maybe_pop[I: Indexer](mut self, owned i: I) -> Optional[ElementType]` Remove the ith element of the list, counting from the tail if given a negative index. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​I (`Indexer`): The type of index to use. **Args:** * ​i (`I`): The index of the element to get. **Returns:** The element, if it was found. ### `clear` `clear(mut self)` Removes all elements from the list. Notes: Time Complexity: O(n) in len(self). ### `copy` `copy(self) -> Self` Create a deep copy of the list. Notes: Time Complexity: O(n) in len(self). **Returns:** A new list containing copies of all elements. ### `insert` `insert[I: Indexer](mut self, idx: I, owned elem: ElementType)` Insert an element `elem` into the list at index `idx`. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​I (`Indexer`): The type of index to use. **Args:** * ​idx (`I`): The index to insert `elem` at: `-len(self) <= idx <= len(self)`. * ​elem (`ElementType`): The item to insert into the list. **Raises:** When given an out of bounds index. ### `extend` `extend(mut self, owned other: Self)` Extends the list with another. Notes: Time Complexity: O(1). **Args:** * ​other (`Self`): The list to append to this one. ### `count` `count[ElementType: EqualityComparable & Copyable & Movable, //](self: LinkedList[ElementType], elem: ElementType) -> UInt` Count the occurrences of `elem` in the list. Notes: Time Complexity: O(n) in len(self) compares. **Parameters:** * ​ElementType (`EqualityComparable & Copyable & Movable`): The list element type, used to conditionally enable the function. **Args:** * ​elem (`ElementType`): The element to search for. **Returns:** The number of occurrences of `elem` in the list. ### `__len__` `__len__(self) -> Int` Get the number of elements in the list. Notes: Time Complexity: O(1). **Returns:** The number of elements in the list. ### `__iter__` `__iter__(self) -> _LinkedListIter[ElementType, self]` Iterate over elements of the list, returning immutable references. Notes: Time Complexity: * O(1) for iterator construction. * O(n) in len(self) for a complete iteration of the list. **Returns:** An iterator of immutable references to the list elements. ### `__reversed__` `__reversed__(self) -> _LinkedListIter[ElementType, self, False]` Iterate backwards over the list, returning immutable references. Notes: Time Complexity: * O(1) for iterator construction. * O(n) in len(self) for a complete iteration of the list. **Returns:** A reversed iterator of immutable references to the list elements. ### `__str__` `__str__[ElementType: Copyable & Movable & Writable](self: LinkedList[ElementType]) -> String` Convert the list to its string representation. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​ElementType (`Copyable & Movable & Writable`): Used to conditionally enable this function when `ElementType` is `Writable`. **Returns:** String representation of the list.
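The difference between the raising `pop` and the `Optional`-returning `maybe_pop` above shows up in a short sketch (same import assumption as the previous example):

```mojo
from collections import LinkedList

def main():
    var xs = LinkedList[Int](1, 2, 3)
    # `maybe_pop` returns an empty Optional instead of raising when empty.
    var tail = xs.maybe_pop()
    if tail:
        print(tail.value())  # 3
    xs.clear()
    print(len(xs))  # 0
```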
### `__repr__` `__repr__[ElementType: Copyable & Movable & Writable](self: LinkedList[ElementType]) -> String` Convert the list to its string representation. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​ElementType (`Copyable & Movable & Writable`): Used to conditionally enable this function when `ElementType` is `Writable`. **Returns:** String representation of the list. ### `write_to` `write_to[W: Writer, ElementType: Copyable & Movable & Writable](self: LinkedList[ElementType], mut writer: W)` Write the list to the given writer. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​W (`Writer`): The type of writer to write the list to. * ​ElementType (`Copyable & Movable & Writable`): Used to conditionally enable this function when `ElementType` is `Writable`. **Args:** * ​writer (`W`): The writer to write the list to. --- ## Node `struct Node[ElementType: Copyable & Movable]` A node in a linked list data structure. ## Parameters * ​ElementType (`Copyable & Movable`): The type of element stored in the node. ## Fields * ​value (`ElementType`): The value stored in this node. * ​prev (`UnsafePointer[Node[ElementType]]`): The previous node in the list. * ​next (`UnsafePointer[Node[ElementType]]`): The next node in the list. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, owned value: ElementType, prev: Optional[UnsafePointer[Node[ElementType]]], next: Optional[UnsafePointer[Node[ElementType]]])` Initialize a new Node with the given value and optional prev/next pointers. **Args:** * ​value (`ElementType`): The value to store in this node. * ​prev (`Optional[UnsafePointer[Node[ElementType]]]`): Optional pointer to the previous node. * ​next (`Optional[UnsafePointer[Node[ElementType]]]`): Optional pointer to the next node. ### `__str__` `__str__[ElementType: Copyable & Movable & Writable](self: Node[ElementType]) -> String` Convert this node's value to a string representation. **Parameters:** * ​ElementType (`Copyable & Movable & Writable`): Used to conditionally enable this function if `ElementType` is `Writable`. **Returns:** String representation of the node's value. ### `write_to` `write_to[ElementType: Copyable & Movable & Writable, W: Writer](self: Node[ElementType], mut writer: W)` Write this node's value to the given writer. **Parameters:** * ​ElementType (`Copyable & Movable & Writable`): Used to conditionally enable this function if `ElementType` is `Writable`. * ​W (`Writer`): The type of writer to write the value to. **Args:** * ​writer (`W`): The writer to write the value to. --- ## linked_list ## Structs * [​`LinkedList`](/mojo/stdlib/collections/linked_list/LinkedList): A doubly-linked list implementation. * [​`Node`](/mojo/stdlib/collections/linked_list/Node): A node in a linked list data structure. --- ## List `struct List[T: Copyable & Movable, hint_trivial_type: Bool = False]` The `List` type is a dynamically-allocated list. Notes: It supports pushing and popping from the back, resizing the underlying storage as needed. When it is deallocated, it frees its memory. ## Parameters * ​T (`Copyable & Movable`): The type of the elements. * ​hint\_trivial\_type (`Bool`): A hint to the compiler that the type T is trivial. It's not mandatory, but if set, it allows some optimizations. ## Fields * ​data (`UnsafePointer[T]`): The underlying storage for the list. * ​capacity (`Int`): The amount of elements that can fit in the list without resizing it.
## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Constructs an empty list. `__init__(out self, *, capacity: Int)` Constructs a list with the given capacity. **Args:** * ​capacity (`Int`): The requested capacity of the list. `__init__(out self, *, length: UInt, fill: T)` Constructs a list of the given length, filled with the given value. **Args:** * ​length (`UInt`): The requested length of the list. * ​fill (`T`): The value used to fill each element of the list. `__init__(out self, owned *values: T, *, __list_literal__: Tuple[] = Tuple())` Constructs a list from the given values. **Args:** * ​\*values (`T`): The values to populate the list with. * ​\_\_list\_literal\_\_ (`Tuple[]`): Tell Mojo to use this method for list literals. `__init__(out self, *, owned elements: VariadicListMem[T, origin, is_owned])` Constructs a list from the given values. **Args:** * ​elements (`VariadicListMem[T, origin, is_owned]`): The values to populate the list with. `__init__(out self, span: Span[T, origin])` Constructs a list from a Span of values. **Args:** * ​span (`Span[T, origin]`): The span of values to populate the list with. `__init__(out self, *, unsafe_uninit_length: Int)` Construct a list with the specified length, with uninitialized memory. This is unsafe, as it relies on the caller initializing the elements with unsafe operations, not assigning over the uninitialized data. **Args:** * ​unsafe\_uninit\_length (`Int`): The number of elements to allocate. ### `__copyinit__` `__copyinit__(out self, existing: Self)` Creates a deepcopy of the given list. **Args:** * ​existing (`Self`): The list to copy. ### `__del__` `__del__(owned self)` Destroy all elements in the list and free its memory. ### `__bool__` `__bool__(self) -> Bool` Checks whether the list has any elements or not. **Returns:** `False` if the list is empty, `True` if there is at least one element. ### `__getitem__` `__getitem__(self, slice: Slice) -> Self` Gets the sequence of elements at the specified positions. **Args:** * ​slice (`Slice`): A slice that specifies positions of the new list. **Returns:** A new list containing the elements at the specified slice. `__getitem__[I: Indexer](ref self, idx: I) -> ref [self] T` Gets the list element at the given index. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The index of the element. **Returns:** A reference to the element at the given index. ### `__eq__` `__eq__[U: EqualityComparable & Copyable & Movable, //](self: List[U, hint_trivial_type], other: List[U, hint_trivial_type]) -> Bool` Checks if two lists are equal. Examples: ```mojo var x = [1, 2, 3] var y = [1, 2, 3] print("x and y are equal" if x == y else "x and y are not equal") ``` **Parameters:** * ​U (`EqualityComparable & Copyable & Movable`): The type of the elements in the list. Must implement the trait `EqualityComparable`. **Args:** * ​other (`List[U, hint_trivial_type]`): The list to compare with. **Returns:** True if the lists are equal, False otherwise. ### `__ne__` `__ne__[U: EqualityComparable & Copyable & Movable, //](self: List[U, hint_trivial_type], other: List[U, hint_trivial_type]) -> Bool` Checks if two lists are not equal. Examples: ```mojo var x = [1, 2, 3] var y = [1, 2, 4] print("x and y are not equal" if x != y else "x and y are equal") ``` **Parameters:** * ​U (`EqualityComparable & Copyable & Movable`): The type of the elements in the list.
Must implement the trait `EqualityComparable`. **Args:** * ​other (`List[U, hint_trivial_type]`): The list to compare with. **Returns:** True if the lists are not equal, False otherwise. ### `__contains__` `__contains__[U: EqualityComparable & Copyable & Movable, //](self: List[U, hint_trivial_type], value: U) -> Bool` Verify if a given value is present in the list. Examples: ```mojo var x = [1, 2, 3] print("x contains 3" if 3 in x else "x does not contain 3") ``` **Parameters:** * ​U (`EqualityComparable & Copyable & Movable`): The type of the elements in the list. Must implement the trait `EqualityComparable`. **Args:** * ​value (`U`): The value to find. **Returns:** True if the value is contained in the list, False otherwise. ### `__add__` `__add__(self, owned other: Self) -> Self` Concatenates self with other and returns the result as a new list. **Args:** * ​other (`Self`): List whose elements will be combined with the elements of self. **Returns:** The newly created list. ### `__mul__` `__mul__(self, x: Int) -> Self` Multiplies the list by x and returns a new list. **Args:** * ​x (`Int`): The multiplier number. **Returns:** The new list. ### `__iadd__` `__iadd__(mut self, owned other: Self)` Appends the elements of other into self. **Args:** * ​other (`Self`): List whose elements will be appended to self. ### `__imul__` `__imul__(mut self, x: Int)` Appends the original elements of this list x-1 times, or clears it if x is less than or equal to 0. **Args:** * ​x (`Int`): The multiplier number. ### `copy` `copy(self) -> Self` Creates a deep copy of the given list. **Returns:** A copy of the value. ### `__iter__` `__iter__(ref self) -> _ListIter[T, hint_trivial_type, self_is_origin]` Iterate over elements of the list, returning immutable references. **Returns:** An iterator of immutable references to the list elements. ### `__reversed__` `__reversed__(ref self) -> _ListIter[T, hint_trivial_type, self_is_origin, False]` Iterate backwards over the list, returning immutable references. **Returns:** A reversed iterator of immutable references to the list elements. ### `__len__` `__len__(self) -> Int` Gets the number of elements in the list. **Returns:** The number of elements in the list. ### `__str__` `__str__[U: Representable & Copyable & Movable, //](self: List[U, hint_trivial_type]) -> String` Returns a string representation of a `List`. Notes: Note that since we can't condition methods on a trait yet, the way to call this method is a bit special. Here is an example below: ```mojo var my_list = [1, 2, 3] print(my_list.__str__()) ``` When the compiler supports conditional methods, then a simple `String(my_list)` will be enough. **Parameters:** * ​U (`Representable & Copyable & Movable`): The type of the elements in the list. Must implement the trait `Representable`. **Returns:** A string representation of the list. ### `write_to` `write_to[W: Writer, U: Representable & Copyable & Movable, //](self: List[U, hint_trivial_type], mut writer: W)` Write `my_list.__str__()` to a `Writer`. **Parameters:** * ​W (`Writer`): A type conforming to the Writer trait. * ​U (`Representable & Copyable & Movable`): The type of the List elements. Must have the trait `Representable`. **Args:** * ​writer (`W`): The object to write to. ### `__repr__` `__repr__[U: Representable & Copyable & Movable, //](self: List[U, hint_trivial_type]) -> String` Returns a string representation of a `List`. Notes: Note that since we can't condition methods on a trait yet, the way to call this method is a bit special.
Here is an example below: ```mojo var my_list = [1, 2, 3] print(my_list.__repr__()) ``` When the compiler supports conditional methods, then a simple `repr(my_list)` will be enough. **Parameters:** * ​U (`Representable & Copyable & Movable`): The type of the elements in the list. Must implement the trait `Representable`. **Returns:** A string representation of the list. ### `byte_length` `byte_length(self) -> Int` Gets the byte length of the List (`len(self) * sizeof[T]()`). **Returns:** The byte length of the List (`len(self) * sizeof[T]()`). ### `append` `append(mut self, owned value: T)` Appends a value to this list. Notes: If there is no capacity left, resizes to twice the current capacity. Except for 0 capacity, where it sets the capacity to 1. **Args:** * ​value (`T`): The value to append. `append(mut self, elements: Span[T, origin])` Appends elements to this list. **Args:** * ​elements (`Span[T, origin]`): The elements to append. ### `insert` `insert(mut self, i: Int, owned value: T)` Inserts a value to the list at the given index. `a.insert(len(a), value)` is equivalent to `a.append(value)`. **Args:** * ​i (`Int`): The index for the value. * ​value (`T`): The value to insert. ### `extend` `extend(mut self, owned other: List[T, hint_trivial_type])` Extends this list by consuming the elements of `other`. **Args:** * ​other (`List[T, hint_trivial_type]`): List whose elements will be added in order at the end of this list. `extend[D: DType, //](mut self: List[SIMD[D, 1], hint_trivial_type], value: SIMD[D, size])` Extends this list with the elements of a vector. Notes: If there is no capacity left, resizes to `len(self) + value.size`. **Parameters:** * ​D (`DType`): The DType. **Args:** * ​value (`SIMD[D, size]`): The value to append. `extend[D: DType, //](mut self: List[SIMD[D, 1], hint_trivial_type], value: SIMD[D, size], *, count: Int)` Extends this list with `count` number of elements from a vector. Notes: If there is no capacity left, resizes to `len(self) + count`. **Parameters:** * ​D (`DType`): The DType. **Args:** * ​value (`SIMD[D, size]`): The value to append. * ​count (`Int`): The amount of items to append. Must be less than or equal to `value.size`. `extend[D: DType, //](mut self: List[SIMD[D, 1], hint_trivial_type], value: Span[SIMD[D, 1], origin])` Extends this list with the elements of a `Span`. Notes: If there is no capacity left, resizes to `len(self) + len(value)`. **Parameters:** * ​D (`DType`): The DType. **Args:** * ​value (`Span[SIMD[D, 1], origin]`): The value to append. ### `pop` `pop(mut self, i: Int = -1) -> T` Pops a value from the list at the given index. **Args:** * ​i (`Int`): The index of the value to pop. **Returns:** The popped value. ### `reserve` `reserve(mut self, new_capacity: Int)` Reserves the requested capacity. Notes: If the current capacity is greater or equal, this is a no-op. Otherwise, the storage is reallocated and the data is moved. **Args:** * ​new\_capacity (`Int`): The new capacity. ### `resize` `resize(mut self, new_size: Int, value: T)` Resizes the list to the given new size. Notes: If the new size is smaller than the current one, elements at the end are discarded. If the new size is larger than the current one, the list is extended with copies of `value` up to the requested size. **Args:** * ​new\_size (`Int`): The new size. * ​value (`T`): The value to use to populate new elements. `resize(mut self, *, unsafe_uninit_length: Int)` Resizes the list to the given new size, leaving any new elements uninitialized.
If the new size is smaller than the current one, elements at the end are discarded. If the new size is larger than the current one, the list is extended and the new elements are left uninitialized. **Args:** * ​unsafe\_uninit\_length (`Int`): The new size. ### `shrink` `shrink(mut self, new_size: Int)` Resizes to the given new size, which must be less than or equal to the current size. **Args:** * ​new\_size (`Int`): The new size. ### `reverse` `reverse(mut self)` Reverses the elements of the list. ### `index` `index[C: EqualityComparable & Copyable & Movable, //](ref self: List[C, hint_trivial_type], value: C, start: Int = 0, stop: Optional[Int] = Optional(None)) -> Int` Returns the index of the first occurrence of a value in a list restricted by the range given the start and stop bounds. Examples: ```mojo var my_list = [1, 2, 3] print(my_list.index(2)) # prints `1` ``` **Parameters:** * ​C (`EqualityComparable & Copyable & Movable`): The type of the elements in the list. Must implement the `EqualityComparable` trait. **Args:** * ​value (`C`): The value to search for. * ​start (`Int`): The starting index of the search, treated as a slice index (defaults to 0). * ​stop (`Optional[Int]`): The ending index of the search, treated as a slice index (defaults to None, which means the end of the list). **Returns:** The index of the first occurrence of the value in the list. **Raises:** ValueError: If the value is not found in the list. ### `clear` `clear(mut self)` Clears the elements in the list. ### `steal_data` `steal_data(mut self) -> UnsafePointer[T]` Take ownership of the underlying pointer from the list. **Returns:** The underlying data. ### `unsafe_get` `unsafe_get(ref self, idx: Int) -> ref [self] T` Get a reference to an element of self without checking index bounds. Notes: Users should consider using `__getitem__` instead of this method as it is unsafe. If an index is out of bounds, this method will not abort, it will be considered undefined behavior. Note that there is no wraparound for negative indices, caution is advised. Using negative indices is considered undefined behavior. Never use `my_list.unsafe_get(-1)` to get the last element of the list. Instead, do `my_list.unsafe_get(len(my_list) - 1)`. **Args:** * ​idx (`Int`): The index of the element to get. **Returns:** A reference to the element at the given index. ### `unsafe_set` `unsafe_set(mut self, idx: Int, owned value: T)` Write a value to a given location without checking index bounds. Notes: Users should consider using `my_list[idx] = value` instead of this method as it is unsafe. If an index is out of bounds, this method will not abort, it will be considered undefined behavior. Note that there is no wraparound for negative indices, caution is advised. Using negative indices is considered undefined behavior. Never use `my_list.unsafe_set(-1, value)` to set the last element of the list. Instead, do `my_list.unsafe_set(len(my_list) - 1, value)`. **Args:** * ​idx (`Int`): The index of the element to set. * ​value (`T`): The value to set. ### `count` `count[T: EqualityComparable & Copyable & Movable, //](self: List[T, hint_trivial_type], value: T) -> Int` Counts the number of occurrences of a value in the list. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the list. Must implement the trait `EqualityComparable`. **Args:** * ​value (`T`): The value to count. **Returns:** The number of occurrences of the value in the list.
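A short sketch of `count`, in the style of the examples above:

```mojo
var my_list = [1, 2, 2, 3, 2]
print(my_list.count(2))  # prints `3`
```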
### `swap_elements` `swap_elements(mut self, elt_idx_1: Int, elt_idx_2: Int)` Swaps elements at the specified indexes if they are different. Examples: ```mojo var my_list = [1, 2, 3] my_list.swap_elements(0, 2) print(my_list.__str__()) # 3, 2, 1 ``` Notes: This is useful because `swap(my_list[i], my_list[j])` cannot be supported by Mojo, because a mutable alias may be formed. **Args:** * ​elt\_idx\_1 (`Int`): The index of one element. * ​elt\_idx\_2 (`Int`): The index of the other element. ### `unsafe_ptr` `unsafe_ptr(ref self) -> UnsafePointer[T, mut=self_is_mut, origin=self_is_origin]` Retrieves a pointer to the underlying memory. **Returns:** The pointer to the underlying memory. --- ## list Defines the List type. These APIs are imported automatically, just like builtins. ## Structs * [​`List`](/mojo/stdlib/collections/list/List): The `List` type is a dynamically-allocated list. --- ## Optional `struct Optional[T: Copyable & Movable]` A type modeling a value which may or may not be present. Optional values can be thought of as a type-safe nullable pattern. Your value can take on a value or `None`, and you need to check and explicitly extract the value to get it out. Currently T is required to be `Copyable & Movable` so we can implement copy/move for `Optional` and allow it to be used in collections itself. Examples: ```mojo var a = Optional(1) var b = Optional[Int](None) if a: print(a.value()) # prints 1 if b: # Bool(b) is False, so no print print(b.value()) var c = a.or_else(2) var d = b.or_else(2) print(c) # prints 1 print(d) # prints 2 ``` ## Parameters * ​T (`Copyable & Movable`): The type of value stored in the `Optional`. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Construct an empty `Optional`. `@implicit` `__init__(out self, owned value: T)` Construct an `Optional` containing a value. **Args:** * ​value (`T`): The value to store in the `Optional`. `@implicit` `__init__(out self, value: NoneType)` Construct an empty `Optional`. **Args:** * ​value (`NoneType`): Must be exactly `None`. ### `__bool__` `__bool__(self) -> Bool` Return true if the Optional has a value. **Returns:** True if the `Optional` has a value and False otherwise. ### `__getitem__` `__getitem__(ref self) -> ref [$1._value] T` Retrieve a reference to the value inside the `Optional`. **Returns:** A reference to the value inside the `Optional`. **Raises:** On empty `Optional`. ### `__invert__` `__invert__(self) -> Bool` Return False if the `Optional` has a value. **Returns:** False if the `Optional` has a value and True otherwise. ### `__eq__` `__eq__(self, rhs: NoneType) -> Bool` Return `True` if a value is not present. **Args:** * ​rhs (`NoneType`): The `None` value to compare to. **Returns:** `True` if a value is not present, `False` otherwise. `__eq__[T: EqualityComparable & Copyable & Movable](self: Optional[T], rhs: Optional[T]) -> Bool` Return `True` if this is the same as another `Optional` value, meaning both are absent, or both are present and have the same underlying value. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the value stored in the `Optional`. Must implement the traits `Copyable`, `Movable` and `EqualityComparable`. **Args:** * ​rhs (`Optional[T]`): The value to compare to. **Returns:** True if the values are the same. ### `__ne__` `__ne__(self, rhs: NoneType) -> Bool` Return `True` if a value is present.
**Args:** * ​rhs (`NoneType`): The `None` value to compare to. **Returns:** `False` if a value is not present, `True` otherwise. `__ne__[T: EqualityComparable & Copyable & Movable, //](self: Optional[T], rhs: Optional[T]) -> Bool` Return `False` if this is the same as another `Optional` value, meaning both are absent, or both are present and have the same underlying value. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the value stored in the `Optional`. Must implement the traits `Copyable`, `Movable` and `EqualityComparable`. **Args:** * ​rhs (`Optional[T]`): The value to compare to. **Returns:** False if the values are the same. ### `__is__` `__is__(self, other: NoneType) -> Bool` Return `True` if the Optional has no value. Notes: It allows you to use the following syntax: `if my_optional is None:`. **Args:** * ​other (`NoneType`): The value to compare to (None). **Returns:** True if the Optional has no value and False otherwise. ### `__isnot__` `__isnot__(self, other: NoneType) -> Bool` Return `True` if the Optional has a value. Notes: It allows you to use the following syntax: `if my_optional is not None:`. **Args:** * ​other (`NoneType`): The value to compare to (None). **Returns:** True if the Optional has a value and False otherwise. ### `copy` `copy(self) -> Self` Copy construct an `Optional`. **Returns:** A copy of the value. ### `__str__` `__str__[U: Copyable & Movable & Representable, //](self: Optional[U]) -> String` Return the string representation of the value of the `Optional`. **Parameters:** * ​U (`Copyable & Movable & Representable`): The type of the value stored in the `Optional`. Must implement the traits `Representable`, `Copyable` and `Movable`. **Returns:** A string representation of the `Optional`. ### `__repr__` `__repr__[U: Representable & Copyable & Movable, //](self: Optional[U]) -> String` Returns the verbose string representation of the `Optional`. **Parameters:** * ​U (`Representable & Copyable & Movable`): The type of the value stored in the `Optional`. Must implement the traits `Representable`, `Copyable` and `Movable`. **Returns:** A verbose string representation of the `Optional`. ### `write_to` `write_to[W: Writer, U: Representable & Copyable & Movable, //](self: Optional[U], mut writer: W)` Write `Optional` string representation to a `Writer`. **Parameters:** * ​W (`Writer`): A type conforming to the Writer trait. * ​U (`Representable & Copyable & Movable`): The type of the value stored in the `Optional`. Must implement the traits `Representable`, `Copyable` and `Movable`. **Args:** * ​writer (`W`): The object to write to. ### `value` `value(ref self) -> ref [$1._value] T` Retrieve a reference to the value of the `Optional`. Notes: This will abort on empty `Optional`. **Returns:** A reference to the contained data of the `Optional` as a reference. ### `unsafe_value` `unsafe_value(ref self) -> ref [$1._value] T` Unsafely retrieve a reference to the value of the `Optional`. Notes: This will **not** abort on empty `Optional`. **Returns:** A reference to the contained data of the `Optional` as a reference. ### `take` `take(mut self) -> T` Move the value out of the `Optional`. Notes: This will abort on empty `Optional`. **Returns:** The contained data of the `Optional` as an owned T value. ### `unsafe_take` `unsafe_take(mut self) -> T` Unsafely move the value out of the `Optional`. Notes: This will **not** abort on empty `Optional`. **Returns:** The contained data of the `Optional` as an owned T value.
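To illustrate the difference between the borrowing and moving accessors above, a minimal sketch:

```mojo
var opt = Optional(String("mojo"))
print(opt.value())  # borrows the contained String; aborts if the Optional is empty
var s = opt.take()  # moves the String out of the Optional
print(s)            # mojo
```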
### `or_else` `or_else(self, default: T) -> T` Return the underlying value contained in the `Optional` or a default value if the `Optional`'s underlying value is not present. **Args:** * ​default (`T`): The new value to use if no value was present. **Returns:** The underlying value contained in the `Optional` or a default value. ### `copied` `copied[mut: Bool, origin: Origin[mut], //, T: Copyable & Movable](self: Optional[Pointer[T, origin]]) -> Optional[T]` Converts an `Optional` containing a Pointer to an `Optional` of an owned value by copying. Examples: Copy the value of an `Optional[Pointer[_]]` ```mojo var data = String("foo") var opt = Optional(Pointer(to=data)) var opt_owned: Optional[String] = opt.copied() ``` Notes: If `self` is an empty `Optional`, the returned `Optional` will be empty as well. **Parameters:** * ​mut (`Bool`): Mutability of the pointee origin. * ​origin (`Origin[mut]`): Origin of the contained `Pointer`. * ​T (`Copyable & Movable`): Type of the owned result value. **Returns:** An `Optional` containing an owned copy of the pointee value. --- ## OptionalReg `@register_passable(trivial)` `struct OptionalReg[T: AnyTrivialRegType]` A register-passable optional type. This struct optionally contains a value. It only works with trivial register passable types at the moment. ## Parameters * ​T (`AnyTrivialRegType`): The type of value stored in the Optional. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Create an optional with a value of None. `@implicit` `__init__(value: T) -> Self` Create an optional with a value. **Args:** * ​value (`T`): The value. `@implicit` `__init__(value: NoneType) -> Self` Create an optional without a value from a None literal. **Args:** * ​value (`NoneType`): The None value. ### `__bool__` `__bool__(self) -> Bool` Return true if the optional has a value. **Returns:** True if the optional has a value and False otherwise. ### `__is__` `__is__(self, other: NoneType) -> Bool` Return `True` if the Optional has no value. It allows you to use the following syntax: `if my_optional is None:` **Args:** * ​other (`NoneType`): The value to compare to (None). **Returns:** True if the Optional has no value and False otherwise. ### `__isnot__` `__isnot__(self, other: NoneType) -> Bool` Return `True` if the Optional has a value. It allows you to use the following syntax: `if my_optional is not None:` **Args:** * ​other (`NoneType`): The value to compare to (None). **Returns:** True if the Optional has a value and False otherwise. ### `value` `value(self) -> T` Get the optional value. **Returns:** The contained value. ### `or_else` `or_else(owned self, owned default: T) -> T` Return the underlying value contained in the Optional or a default value if the Optional's underlying value is not present. **Args:** * ​default (`T`): The new value to use if no value was present. **Returns:** The underlying value contained in the Optional or a default value. --- ## optional Defines Optional, a type modeling a value which may or may not be present. Optional values can be thought of as a type-safe nullable pattern. Your value can take on a value or `None`, and you need to check and explicitly extract the value to get it out. 
Examples: ```mojo var a = Optional(1) var b = Optional[Int](None) if a: print(a.value()) # prints 1 if b: # Bool(b) is False, so no print print(b.value()) var c = a.or_else(2) var d = b.or_else(2) print(c) # prints 1 print(d) # prints 2 ``` ## Structs * [​`Optional`](/mojo/stdlib/collections/optional/Optional): A type modeling a value which may or may not be present. * [​`OptionalReg`](/mojo/stdlib/collections/optional/OptionalReg): A register-passable optional type. --- ## Set `struct Set[T: Copyable & Movable & Hashable & EqualityComparable]` A set data type. O(1) average-case amortized add, remove, and membership check. ```mojo from collections import Set var set = { 1, 2, 3 } print(len(set)) # 3 set.add(4) for element in set: print(element) set -= Set[Int](3, 4, 5) print(set == Set[Int](1, 2)) # True print(set | Set[Int](0, 1) == Set[Int](0, 1, 2)) # True var element = set.pop() print(len(set)) # 1 ``` ## Parameters * ​T (`Copyable & Movable & Hashable & EqualityComparable`): The element type of the set. Must implement KeyElement. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `EqualityComparable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `Hashable`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, *ts: T, *, __set_literal__: Tuple[] = Tuple())` Construct a set from initial elements. **Args:** * ​\*ts (`T`): Variadic of elements to add to the set. * ​\_\_set\_literal\_\_ (`Tuple[]`): Tell Mojo to use this method for set literals. `@implicit` `__init__(out self, elements: List[T, hint_trivial_type])` Construct a set from a List of elements. **Args:** * ​elements (`List[T, hint_trivial_type]`): A vector of elements to add to the set. ### `__copyinit__` `__copyinit__(out self, other: Self)` Copy constructor. **Args:** * ​other (`Self`): The existing Set instance to copy from. ### `__bool__` `__bool__(self) -> Bool` Whether the set is non-empty or not. **Returns:** True if the set is non-empty, False if it is empty. ### `__lt__` `__lt__(self, other: Self) -> Bool` Overloads the < operator for strict subset comparison of sets. **Args:** * ​other (`Self`): The set to compare against for the strict subset relationship. **Returns:** True if the set is a strict subset of the `other` set, False otherwise. ### `__le__` `__le__(self, other: Self) -> Bool` Overloads the <= operator for sets. Works like the `issubset` method. **Args:** * ​other (`Self`): Another Set instance to check against. **Returns:** True if this set is a subset of the `other` set, False otherwise. ### `__eq__` `__eq__(self, other: Self) -> Bool` Set equality. **Args:** * ​other (`Self`): Another Set instance to check equality against. **Returns:** True if the sets contain the same elements and False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Set inequality. **Args:** * ​other (`Self`): Another Set instance to check equality against. **Returns:** True if the sets are different and False otherwise. ### `__gt__` `__gt__(self, other: Self) -> Bool` Overloads the > operator for strict superset comparison of sets. **Args:** * ​other (`Self`): The set to compare against for the strict superset relationship. **Returns:** True if the set is a strict superset of the `other` set, False otherwise. ### `__ge__` `__ge__(self, other: Self) -> Bool` Overloads the >= operator for sets. Works like the `issuperset` method. **Args:** * ​other (`Self`): Another Set instance to check against. **Returns:** True if this set is a superset of the `other` set, False otherwise.
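The comparison operators mirror the named subset/superset methods documented further below; for example:

```mojo
from collections import Set

var small = Set[Int](1, 2)
var big = Set[Int](1, 2, 3)
print(small < big)     # True: strict subset
print(small <= small)  # True, although small < small is False
print(big >= small)    # True: superset
```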
### `__contains__` `__contains__(self, t: T) -> Bool` Whether or not the set contains an element. **Args:** * ​t (`T`): The element to check membership in the set. **Returns:** Whether or not the set contains the element. ### `__sub__` `__sub__(self, other: Self) -> Self` Set subtraction. **Args:** * ​other (`Self`): Another Set instance to subtract from this one. **Returns:** A new set containing elements of this set, but not containing any elements which were in the `other` set. ### `__and__` `__and__(self, other: Self) -> Self` The set intersection operator. **Args:** * ​other (`Self`): Another Set instance to intersect with this one. **Returns:** A new set containing only the elements which appear in both this set and the `other` set. ### `__or__` `__or__(self, other: Self) -> Self` The set union operator. **Args:** * ​other (`Self`): Another Set instance to union with this one. **Returns:** A new set containing any elements which appear in either this set or the `other` set. ### `__xor__` `__xor__(self, other: Self) -> Self` Overloads the ^ operator for sets. Works like the `symmetric_difference` method. **Args:** * ​other (`Self`): The set to find the symmetric difference with. **Returns:** A new set containing the symmetric difference of the two sets. ### `__isub__` `__isub__(mut self, other: Self)` In-place set subtraction. Updates the set to remove any elements from the `other` set. **Args:** * ​other (`Self`): Another Set instance to subtract from this one. ### `__iand__` `__iand__(mut self, other: Self)` In-place set intersection. Updates the set to contain only the elements which are already in the set and are also contained in the `other` set. **Args:** * ​other (`Self`): Another Set instance to intersect with this one. ### `__ixor__` `__ixor__(mut self, other: Self)` Overloads the ^= operator. Works like the `symmetric_difference_update` method. Updates the set with the symmetric difference of itself and another set. **Args:** * ​other (`Self`): The set to find the symmetric difference with. ### `__ior__` `__ior__(mut self, other: Self)` In-place set union. Updates the set to contain all elements in the `other` set as well as keeping all elements it already contained. **Args:** * ​other (`Self`): Another Set instance to union with this one. ### `__len__` `__len__(self) -> Int` The size of the set. **Returns:** The number of elements in the set. ### `__hash__` `__hash__(self) -> UInt` A hash value of the elements in the set. The hash value is order independent, so s1 == s2 -> hash(s1) == hash(s2). **Returns:** A hash value of the set suitable for non-cryptographic purposes. ### `__str__` `__str__[U: Copyable & Movable & Hashable & EqualityComparable & Representable, //](self: Set[U]) -> String` Returns the string representation of the set. **Parameters:** * ​U (`Copyable & Movable & Hashable & EqualityComparable & Representable`): The type of the Set elements. Must implement the `Representable` and `KeyElement` traits. **Returns:** The string representation of the set. ### `__repr__` `__repr__[U: Copyable & Movable & Hashable & EqualityComparable & Representable, //](self: Set[U]) -> String` Returns the string representation of the set. **Parameters:** * ​U (`Copyable & Movable & Hashable & EqualityComparable & Representable`): The type of the Set elements. Must implement the `Representable` and `KeyElement` traits. **Returns:** The string representation of the set.
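To illustrate the binary operators documented above (note that, as in the struct-level example, comparison binds less tightly than the set operators):

```mojo
from collections import Set

var a = Set[Int](1, 2, 3)
var b = Set[Int](3, 4)
print(a | b == Set[Int](1, 2, 3, 4))  # True: union
print(a & b == Set[Int](3))           # True: intersection
print(a - b == Set[Int](1, 2))        # True: difference
print(a ^ b == Set[Int](1, 2, 4))     # True: symmetric difference
```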
### `write_to` `write_to[W: Writer, U: Copyable & Movable & Hashable & EqualityComparable & Representable, //](self: Set[U], mut writer: W)` Write Set string representation to a `Writer`. **Parameters:** * ​W (`Writer`): A type conforming to the Writer trait. * ​U (`Copyable & Movable & Hashable & EqualityComparable & Representable`): The type of the Set elements. Must implement the `Representable` and `KeyElement` traits. **Args:** * ​writer (`W`): The object to write to. ### `__iter__` `__iter__(ref self) -> _DictKeyIter[T, NoneType, self_is_origin._data]` Iterate over elements of the set, returning immutable references. **Returns:** An iterator of immutable references to the set elements. ### `add` `add(mut self, t: T)` Add an element to the set. **Args:** * ​t (`T`): The element to add to the set. ### `remove` `remove(mut self, t: T)` Remove an element from the set. **Args:** * ​t (`T`): The element to remove from the set. **Raises:** If the element isn't in the set to remove. ### `pop` `pop(mut self) -> T` Remove any one item from the set, and return it. As an implementation detail this will remove the first item according to insertion order. This is practically useful for breadth-first search implementations. **Returns:** The element which was removed from the set. **Raises:** If the set is empty. ### `union` `union(self, other: Self) -> Self` Set union. **Args:** * ​other (`Self`): Another Set instance to union with this one. **Returns:** A new set containing any elements which appear in either this set or the `other` set. ### `intersection` `intersection(self, other: Self) -> Self` Set intersection. **Args:** * ​other (`Self`): Another Set instance to intersect with this one. **Returns:** A new set containing only the elements which appear in both this set and the `other` set. ### `difference` `difference(self, other: Self) -> Self` Set difference. **Args:** * ​other (`Self`): Another Set instance to find the difference with this one. **Returns:** A new set containing elements that are in this set but not in the `other` set. ### `update` `update(mut self, other: Self)` In-place set update. Updates the set to contain all elements in the `other` set as well as keeping all elements it already contained. **Args:** * ​other (`Self`): Another Set instance to union with this one. ### `intersection_update` `intersection_update(mut self, other: Self)` In-place set intersection update. Updates the set by retaining only elements found in both this set and the `other` set, removing all other elements. The result is the intersection of this set with `other`. **Args:** * ​other (`Self`): Another Set instance to intersect with this one. ### `difference_update` `difference_update(mut self, other: Self)` In-place set subtraction. Updates the set by removing all elements found in the `other` set, effectively keeping only elements that are unique to this set. **Args:** * ​other (`Self`): Another Set instance to subtract from this one. ### `issubset` `issubset(self, other: Self) -> Bool` Check if this set is a subset of another set. **Args:** * ​other (`Self`): Another Set instance to check against. **Returns:** True if this set is a subset of the `other` set, False otherwise. ### `isdisjoint` `isdisjoint(self, other: Self) -> Bool` Check if this set is disjoint with another set. **Args:** * ​other (`Self`): Another Set instance to check against. **Returns:** True if this set is disjoint with the `other` set, False otherwise.
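A brief sketch of the named methods above:

```mojo
from collections import Set

var evens = Set[Int](2, 4)
var odds = Set[Int](1, 3)
print(evens.isdisjoint(odds))          # True: no common elements
evens.update(Set[Int](6))              # evens is now {2, 4, 6}
print(Set[Int](2, 4).issubset(evens))  # True
```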
### `issuperset` `issuperset(self, other: Self) -> Bool` Check if this set is a superset of another set. **Args:** * ​other (`Self`): Another Set instance to check against. **Returns:** True if this set is a superset of the `other` set, False otherwise. ### `symmetric_difference` `symmetric_difference(self, other: Self) -> Self` Returns the symmetric difference of two sets. **Args:** * ​other (`Self`): The set to find the symmetric difference with. **Returns:** A new set containing the symmetric difference of the two sets. ### `symmetric_difference_update` `symmetric_difference_update(mut self, other: Self)` Updates the set with the symmetric difference of itself and another set. **Args:** * ​other (`Self`): The set to find the symmetric difference with. ### `discard` `discard(mut self, value: T)` Remove a value from the set if it exists. Pass otherwise. **Args:** * ​value (`T`): The element to remove from the set. ### `clear` `clear(mut self)` Removes all elements from the set. This method modifies the set in-place, removing all of its elements. After calling this method, the set will be empty. --- ## set Implements the Set datatype. ## Structs * [​`Set`](/mojo/stdlib/collections/set/Set): A set data type. --- ## Codepoint `struct Codepoint` A Unicode codepoint, typically a single user-recognizable character; restricted to valid Unicode scalar values. This type is restricted to store a single Unicode [*scalar value*][1], typically encoding a single user-recognizable character. All valid Unicode scalar values are in the range(s) 0 to 0xD7FF and 0xE000 to 0x10FFFF, inclusive. This type guarantees that the stored integer value falls in these ranges. [1]: https://www.unicode.org/glossary/#unicode_scalar_value **Codepoints versus Scalar Values** Formally, Unicode defines a codespace of values in the range 0 to 0x10FFFF inclusive, and a [Unicode codepoint](https://www.unicode.org/glossary/#code_point) is any integer falling within that range. However, due to historical reasons, it became necessary to "carve out" a subset of the codespace, excluding codepoints in the range 0xD800–0xDFFF. The subset of codepoints excluding that range is known as [Unicode scalar values][1]. The codepoints in the range 0xD800–0xDFFF are known as "surrogate" codepoints. The surrogate codepoints will never be assigned a semantic meaning, and can only validly appear in UTF-16 encoded text. The difference between codepoints and scalar values is a technical distinction related to the backwards-compatible workaround chosen to enable UTF-16 to encode the full range of the Unicode codespace. For simplicity's sake, and to avoid a confusing clash with the Mojo `Scalar` type, this type is pragmatically named `Codepoint`, even though it is restricted to valid scalar values. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Intable`, `Movable`, `Stringable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, *, unsafe_unchecked_codepoint: SIMD[uint32, 1])` Construct a `Codepoint` from a code point value without checking that it falls in the valid range. Safety: The provided codepoint value MUST be a valid Unicode scalar value. Providing a value outside of the valid range could lead to undefined behavior in algorithms that depend on the validity guarantees of this type. **Args:** * ​unsafe\_unchecked\_codepoint (`SIMD[uint32, 1]`): A valid Unicode scalar value code point. `__init__(out self, codepoint: SIMD[uint8, 1])` Construct a `Codepoint` from a single byte value.
This constructor cannot fail because non-negative 8-bit integers are valid Unicode scalar values. **Args:** * ​codepoint (`SIMD[uint8, 1]`): The 8-bit codepoint value to convert to a `Codepoint`. ### `__eq__` `__eq__(self, other: Self) -> Bool` Return True if this character has the same codepoint value as `other`. **Args:** * ​other (`Self`): The codepoint value to compare against. **Returns:** True if this character and `other` have the same codepoint value; False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Return True if this character has a different codepoint value from `other`. **Args:** * ​other (`Self`): The codepoint value to compare against. **Returns:** True if this character and `other` have different codepoint values; False otherwise. ### `from_u32` `static from_u32(codepoint: SIMD[uint32, 1]) -> Optional[Codepoint]` Construct a `Codepoint` from a code point value. Returns None if the provided `codepoint` is not in the valid range. **Args:** * ​codepoint (`SIMD[uint32, 1]`): An integer representing a Unicode scalar value. **Returns:** A `Codepoint` if `codepoint` falls in the valid range for Unicode scalar values, otherwise None. ### `ord` `static ord(string: StringSlice[origin]) -> Self` Returns the `Codepoint` that represents the given single-character string. Given a string containing one character, return a `Codepoint` representing the codepoint of that character. For example, `Codepoint.ord("a")` returns the codepoint `97`. This is the inverse of the `chr()` function. This function is similar to the `ord()` free function, except that it returns a `Codepoint` instead of an `Int`. **Args:** * ​string (`StringSlice[origin]`): The input string, which must contain only a single character. **Returns:** A `Codepoint` representing the codepoint of the given character. ### `unsafe_decode_utf8_codepoint` `static unsafe_decode_utf8_codepoint(s: Span[SIMD[uint8, 1], origin]) -> Tuple[Codepoint, Int]` Decodes a single `Codepoint` and number of bytes read from a given UTF-8 string pointer. Safety: `s` MUST point to the first byte in a **known-valid** UTF-8 character sequence. This function MUST NOT be used on unvalidated input. **Args:** * ​s (`Span[SIMD[uint8, 1], origin]`): Span to UTF-8 encoded data containing at least one valid encoded codepoint. **Returns:** The decoded codepoint `Codepoint`, as well as the number of bytes read. ### `__int__` `__int__(self) -> Int` Returns the numeric value of this scalar value as an integer. **Returns:** The numeric value of this scalar value as an integer. ### `__str__` `__str__(self) -> String` Formats this `Codepoint` as a single-character string. **Returns:** A string containing this single character. ### `is_ascii` `is_ascii(self) -> Bool` Returns True if this `Codepoint` is an ASCII character. All ASCII characters are less than or equal to codepoint value 127, and take exactly 1 byte to encode in UTF-8. **Returns:** A boolean indicating if this `Codepoint` is an ASCII character. ### `is_ascii_digit` `is_ascii_digit(self) -> Bool` Determines whether the given character is a digit \[0-9]. **Returns:** True if the character is a digit. ### `is_ascii_upper` `is_ascii_upper(self) -> Bool` Determines whether the given character is an uppercase character. This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "ABCDEFGHIJKLMNOPQRSTUVWXYZ". **Returns:** True if the character is uppercase.
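A small sketch of construction and the ASCII predicates, using the import path shown in the `codepoint` module docs below:

```mojo
from collections.string import Codepoint

var c = Codepoint.ord("A")
print(c.is_ascii())        # True
print(c.is_ascii_upper())  # True
print(Int(c))              # 65

# from_u32 validates its input: surrogates are not scalar values.
print(Codepoint.from_u32(0xD800) is None)  # True
```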
### `is_ascii_lower` `is_ascii_lower(self) -> Bool` Determines whether the given character is a lowercase character. This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "abcdefghijklmnopqrstuvwxyz". **Returns:** True if the character is lowercase. ### `is_ascii_printable` `is_ascii_printable(self) -> Bool` Determines whether the given character is a printable character. **Returns:** True if the character is a printable character, otherwise False. ### `is_python_space` `is_python_space(self) -> Bool` Determines whether this character is a Python whitespace string. This corresponds to Python's [universal separators](https://docs.python.org/3/library/stdtypes.html#str.splitlines): `" \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`. # Examples Check if a string contains only whitespace: ```mojo from testing import assert_true, assert_false # ASCII space characters assert_true(Codepoint.ord(" ").is_python_space()) assert_true(Codepoint.ord("\t").is_python_space()) # Unicode paragraph separator: assert_true(Codepoint.from_u32(0x2029).value().is_python_space()) # Letters are not space characters assert_false(Codepoint.ord("a").is_python_space()) ``` . **Returns:** True if this character is one of the whitespace characters listed above, otherwise False. ### `is_posix_space` `is_posix_space(self) -> Bool` Returns True if this `Codepoint` is a **space** character according to the [POSIX locale][1]. The POSIX locale is also known as the C locale. [1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_03_01 This only respects the default "C" locale, i.e. returns True only if the character specified is one of " \t\n\v\f\r". For semantics similar to Python, use `String.isspace()`. **Returns:** True iff the character is one of the whitespace characters listed above. ### `to_u32` `to_u32(self) -> SIMD[uint32, 1]` Returns the numeric value of this scalar value as an unsigned 32-bit integer. **Returns:** The numeric value of this scalar value as an unsigned 32-bit integer. ### `unsafe_write_utf8` `unsafe_write_utf8[optimize_ascii: Bool = True, branchless: Bool = False](self, ptr: UnsafePointer[SIMD[uint8, 1], address_space=address_space, alignment=alignment, origin=origin]) -> UInt` Encode this Unicode codepoint into its UTF-8 representation. Safety: `ptr` MUST point to at least `self.utf8_byte_length()` allocated bytes or else an out-of-bounds write will occur, which is undefined behavior. ### Unicode (represented as UInt32 BE) to UTF-8 conversion: * 1: 00000000 00000000 00000000 0aaaaaaa -> 0aaaaaaa * a * 2: 00000000 00000000 00000aaa aabbbbbb -> 110aaaaa 10bbbbbb * (a >> 6) | 0b11000000, b | 0b10000000 * 3: 00000000 00000000 aaaabbbb bbcccccc -> 1110aaaa 10bbbbbb 10cccccc * (a >> 12) | 0b11100000, (b >> 6) | 0b10000000, c | 0b10000000 * 4: 00000000 000aaabb bbbbcccc ccdddddd -> 11110aaa 10bbbbbb 10cccccc 10dddddd * (a >> 18) | 0b11110000, (b >> 12) | 0b10000000, (c >> 6) | 0b10000000, d | 0b10000000 . **Parameters:** * ​optimize\_ascii (`Bool`): Optimize for languages with mostly ASCII characters. * ​branchless (`Bool`): Use a branchless algorithm. **Args:** * ​ptr (`UnsafePointer[SIMD[uint8, 1], address_space=address_space, alignment=alignment, origin=origin]`): Pointer value to write the encoded UTF-8 bytes. Must validly point to a sufficient number of bytes (1-4) to hold the encoded data. **Returns:** Returns the number of bytes written.
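For illustration, a minimal sketch of encoding one codepoint into a caller-managed buffer. The buffer size comes from `utf8_byte_length()` (documented next); the `UnsafePointer` allocation is an assumption made only to keep the example self-contained:

```mojo
from collections.string import Codepoint
from memory import UnsafePointer

var c = Codepoint.ord("é")                    # U+00E9: 2 bytes in UTF-8
var n = c.utf8_byte_length()
var buf = UnsafePointer[UInt8].alloc(Int(n))  # at least utf8_byte_length() bytes
var written = c.unsafe_write_utf8(buf)
print(written)  # 2
buf.free()
```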
### `utf8_byte_length` `utf8_byte_length(self) -> UInt` Returns the number of UTF-8 bytes required to encode this character. Notes: The returned value is always between 1 and 4 bytes. **Returns:** Byte count of UTF-8 bytes required to encode this character. --- ## codepoint Unicode codepoint handling. This module provides the `Codepoint` type for representing single Unicode scalar values. A codepoint represents a single Unicode character, restricted to valid Unicode scalar values in the ranges 0 to 0xD7FF and 0xE000 to 0x10FFFF inclusive. The `Codepoint` type provides functionality for: * Converting between codepoints and UTF-8 encoded bytes. * Testing character properties like ASCII, digits, whitespace etc. * Converting between codepoints and strings. * Safe construction from integers with validation. Example: ```mojo from collections.string import Codepoint from testing import assert_true # Create a codepoint from a character var c = Codepoint.ord('A') # Check properties assert_true(c.is_ascii()) assert_true(c.is_ascii_upper()) # Convert to string var s = String(c) # "A" ``` ## Structs * [​`Codepoint`](/mojo/stdlib/collections/string/codepoint/Codepoint): A Unicode codepoint, typically a single user-recognizable character; restricted to valid Unicode scalar values. --- ## format String formatting utilities for Mojo. This module provides string formatting functionality similar to Python's `str.format()` method. The `format()` method (available on the [`String`](/mojo/stdlib/collections/string/string/String#format) and [`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice#format) types) takes the current string as a template (or "format string"), which can contain literal text and/or replacement fields delimited by curly braces (`{}`). The replacement fields are replaced with the values of the arguments. Replacement fields can be mapped to the arguments in one of two ways: * Automatic indexing by argument position: ```mojo var s = String("{} is {}").format("Mojo", "🔥") ``` * Manual indexing by argument position: ```mojo var s = String("{1} is {0}").format("hot", "🔥") ``` The replacement fields can also contain the `!r` or `!s` conversion flags, to indicate whether the argument should be formatted using `repr()` or `String()`, respectively: ```mojo var s = String("{!r}").format(myComplicatedObject) ``` Note that the following features from Python's `str.format()` are **not yet supported**: * Named arguments (for example `"{name} is {adjective}"`). * Accessing the attributes of an argument value (for example, `"{0.name}"`). * Accessing an indexed value from the argument (for example, `"{1[0]}"`). * Format specifiers for controlling output format (width, precision, and so on). Examples: ```mojo # Basic formatting var s1 = String("Hello {0}!").format("World") # Hello World! # Multiple arguments var s2 = String("{0} plus {1} equals {2}").format(1, 2, 3) # 1 plus 2 equals 3 # Conversion flags var s4 = String("{!r}").format("test") # "'test'" ``` This module has no public API; its functionality is available through the [`String.format()`](/mojo/stdlib/collections/string/string/String#format) and [`StringSlice.format()`](/mojo/stdlib/collections/string/string_slice/StringSlice#format) methods. --- ## string The string package provides comprehensive Unicode string handling functionality for Mojo. This package implements Unicode-aware string types and operations, with UTF-8 support.
It includes efficient implementations for string manipulation, formatting, and Unicode operations while maintaining memory safety and performance. Key Components: * `String`: The main string type supporting UTF-8 encoded text * `StringSlice`: Memory-efficient string view type for zero-copy operations * `Codepoint`: Unicode code point handling and operations * Format: String formatting and interpolation utilities Core Features: * Unicode support with UTF-8 encoding * Efficient string slicing and views * String formatting and interpolation * Memory-safe string operations * Unicode case conversion * Unicode property lookups and validation Example: ```mojo # Basic string creation and manipulation var s = String("Hello, 世界") var slice = s[0:5] # "Hello" # Unicode-aware operations for c in s.codepoints(): print(c.to_uppercase()) # String formatting var name = "Mojo" var formatted = String("Hello, {0}!").format(name) ``` Note: String stores data using UTF-8, and all operations (unless clearly noted) are intended to be fully Unicode compliant and maintain correct UTF-8 encoded data. A handful of operations are known to not be Unicode / UTF-8 compliant yet, but will be fixed as time permits. ## Modules * [​`codepoint`](/mojo/stdlib/collections/string/codepoint/): Unicode codepoint handling. * [​`format`](/mojo/stdlib/collections/string/format/): String formatting utilities for Mojo. * [​`string`](/mojo/stdlib/collections/string/string/): The core `String` type implementation for Mojo. * [​`string_slice`](/mojo/stdlib/collections/string/string_slice/): The `StringSlice` type implementation for efficient string operations. --- ## String `struct String` Represents a mutable string. See the [`string` module](/mojo/stdlib/collections/string/string/) for more information and examples. ## Implemented traits `AnyType`, `Boolable`, `ConvertibleFromPython`, `Copyable`, `Defaultable`, `EqualityComparable`, `ExplicitlyCopyable`, `FloatableRaising`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `Hashable`, `IntableRaising`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `PathLike`, `PythonConvertible`, `Representable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable`, `Writer`, `_HashableWithHasher` ## Aliases ### `ASCII_LETTERS` `alias ASCII_LETTERS = "abcdefghijklmnopqrstuvwxyz".__add__[__mlir_type.!kgen.string]("ABCDEFGHIJKLMNOPQRSTUVWXYZ")` ### `ASCII_LOWERCASE` `alias ASCII_LOWERCASE = "abcdefghijklmnopqrstuvwxyz"` ### `ASCII_UPPERCASE` `alias ASCII_UPPERCASE = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"` ### `DIGITS` `alias DIGITS = "0123456789"` ### `HEX_DIGITS` `alias HEX_DIGITS = "0123456789".__add__[__mlir_type.!kgen.string]("abcdef").__add__[__mlir_type.!kgen.string]("ABCDEF")` ### `OCT_DIGITS` `alias OCT_DIGITS = "01234567"` ### `PRINTABLE` ``alias PRINTABLE = "0123456789".__add__[__mlir_type.!kgen.string]("abcdefghijklmnopqrstuvwxyz".__add__[__mlir_type.!kgen.string]("ABCDEFGHIJKLMNOPQRSTUVWXYZ")).__add__[__mlir_type.!kgen.string]("!\22#$%&'()*+,-./:;?@[\\]^_`{|}\~").__add__[__mlir_type.!kgen.string](" \t\n\r\v\f")`` ### `PUNCTUATION` ``alias PUNCTUATION = "!\22#$%&'()*+,-./:;?@[\\]^_`{|}\~"`` ## Methods ### `__init__` `__init__(out self)` Construct an empty string. `__init__(out self, *, capacity: Int)` Construct an empty string with a given capacity. **Args:** * ​capacity (`Int`): The capacity of the string to allocate. `@implicit` `__init__(out self, data: StringSlice[StaticConstantOrigin])` Construct a string from a static constant string without allocating.
**Args:** * ​data (`StringSlice[StaticConstantOrigin]`): The static constant string to refer to. `@implicit` `__init__(out self, data: StringLiteral[value])` Construct a string from a string literal without allocating. **Args:** * ​data (`StringLiteral[value]`): The static constant string to refer to. `__init__(out self, *, bytes: Span[SIMD[uint8, 1], origin])` Construct a string by copying the data. This constructor is explicit because it can involve memory allocation. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The bytes to copy. `__init__[T: Stringable](out self, value: T)` Initialize from a type conforming to `Stringable`. **Parameters:** * ​T (`Stringable`): The type conforming to Stringable. **Args:** * ​value (`T`): The object to get the string representation of. `__init__[T: StringableRaising](out self, value: T)` Initialize from a type conforming to `StringableRaising`. **Parameters:** * ​T (`StringableRaising`): The type conforming to StringableRaising. **Args:** * ​value (`T`): The object to get the string representation of. **Raises:** If there is an error when computing the string representation of the type. `__init__[*Ts: Writable](out self, *args: *Ts, *, sep: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), end: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""))` Construct a string by concatenating a sequence of Writable arguments. Examples: Construct a String from several `Writable` arguments: ```mojo var string = String(1, 2.0, "three", sep=", ") print(string) # "1, 2.0, three" ``` **Parameters:** * ​\*Ts (`Writable`): The types of the arguments to format. Each type must satisfy `Writable`. **Args:** * ​\*args (`*Ts`): A sequence of Writable arguments. * ​sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements. * ​end (`StringSlice[StaticConstantOrigin]`): The String to write after printing the elements. `__init__[*Ts: Writable](out self, args: VariadicPack[is_owned, origin, Writable, Ts], sep: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), end: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""))` Construct a string by passing a variadic pack. Examples: ```mojo fn variadic_pack_to_string[ *Ts: Writable, ](*args: *Ts) -> String: return String(args) string = variadic_pack_to_string(1, ", ", 2.0, ", ", "three") ``` . **Parameters:** * ​\*Ts (`Writable`): The types of the arguments to format. Each type must satisfy `Writable`. **Args:** * ​args (`VariadicPack[is_owned, origin, Writable, Ts]`): A VariadicPack of Writable arguments. * ​sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements. * ​end (`StringSlice[StaticConstantOrigin]`): The String to write after printing the elements. `__init__(out self, *, unsafe_uninit_length: UInt)` Construct a String with the specified length, with uninitialized memory. This is unsafe, as it relies on the caller initializing the elements with unsafe operations, not assigning over the uninitialized data. **Args:** * ​unsafe\_uninit\_length (`UInt`): The number of bytes to allocate. `__init__(out self, *, unsafe_from_utf8_ptr: UnsafePointer[SIMD[int8, 1], mut=mut, origin=origin])` Creates a string from a UTF-8 encoded nul-terminated pointer. Safety: * `unsafe_from_utf8_ptr` MUST be valid UTF-8 encoded data. * `unsafe_from_utf8_ptr` MUST be null terminated.
**Args:** * ​unsafe\_from\_utf8\_ptr (`UnsafePointer[SIMD[int8, 1], mut=mut, origin=origin]`): An `UnsafePointer[Byte]` of null-terminated bytes encoded in UTF-8. `__init__(out self, *, unsafe_from_utf8_ptr: UnsafePointer[SIMD[uint8, 1], mut=mut, origin=origin])` Creates a string from a UTF-8 encoded nul-terminated pointer. Safety: * `unsafe_from_utf8_ptr` MUST be valid UTF-8 encoded data. * `unsafe_from_utf8_ptr` MUST be null terminated. **Args:** * ​unsafe\_from\_utf8\_ptr (`UnsafePointer[SIMD[uint8, 1], mut=mut, origin=origin]`): An `UnsafePointer[Byte]` of null-terminated bytes encoded in UTF-8. `__init__(out self, obj: PythonObject)` Construct a `String` from a PythonObject. **Args:** * ​obj (`PythonObject`): The PythonObject to convert from. **Raises:** An error if the conversion failed. ### `__copyinit__` `__copyinit__(out self, other: Self)` Copy initialize the string from another string. **Args:** * ​other (`Self`): The string to copy. ### `__moveinit__` `__moveinit__(out self, owned other: Self)` Move initialize the string from another string. **Args:** * ​other (`Self`): The string to move. ### `__del__` `__del__(owned self)` Destroy the string data. ### `__bool__` `__bool__(self) -> Bool` Checks if the string is not empty. **Returns:** True if the string length is greater than zero, and False otherwise. ### `__getitem__` `__getitem__[I: Indexer](self, idx: I) -> Self` Gets the character at the specified position. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The index value. **Returns:** A new string containing the character at the specified position. `__getitem__(self, span: Slice) -> Self` Gets the sequence of characters at the specified positions. **Args:** * ​span (`Slice`): A slice that specifies positions of the new substring. **Returns:** A new string containing the string at the specified positions. ### `__lt__` `__lt__(self, rhs: Self) -> Bool` Compare this String to the RHS using LT comparison. **Args:** * ​rhs (`Self`): The other String to compare against. **Returns:** True if this String is strictly less than the RHS String and False otherwise. ### `__le__` `__le__(self, rhs: Self) -> Bool` Compare this String to the RHS using LE comparison. **Args:** * ​rhs (`Self`): The other String to compare against. **Returns:** True iff this String is less than or equal to the RHS String. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compares two Strings if they have the same values. **Args:** * ​other (`Self`): The rhs of the operation. **Returns:** True if the Strings are equal and False otherwise. `__eq__(self, other: StringSlice[origin]) -> Bool` Compares two Strings if they have the same values. **Args:** * ​other (`StringSlice[origin]`): The rhs of the operation. **Returns:** True if the Strings are equal and False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compares two Strings if they do not have the same values. **Args:** * ​other (`Self`): The rhs of the operation. **Returns:** True if the Strings are not equal and False otherwise. `__ne__(self, other: StringSlice[origin]) -> Bool` Compares two Strings if they do not have the same values. **Args:** * ​other (`StringSlice[origin]`): The rhs of the operation. **Returns:** True if the Strings are not equal and False otherwise. ### `__gt__` `__gt__(self, rhs: Self) -> Bool` Compare this String to the RHS using GT comparison. **Args:** * ​rhs (`Self`): The other String to compare against. **Returns:** True iff this String is strictly greater than the RHS String.
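A short illustrative sketch (not part of the original reference) exercising the comparison operators above, plus `__ge__` below; comparisons are lexicographic over the underlying UTF-8 bytes:

```mojo
from testing import assert_true, assert_false

var a = String("apple")
var b = String("banana")
assert_true(a < b)    # lexicographic byte comparison
assert_true(a <= b)
assert_true(b > a)
assert_true(b >= a)
assert_true(a != b)
assert_false(a == b)
```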
### `__ge__` `__ge__(self, rhs: Self) -> Bool` Compare this String to the RHS using GE comparison. **Args:** * ​rhs (`Self`): The other String to compare against. **Returns:** True iff this String is greater than or equal to the RHS String. ### `__contains__` `__contains__(self, substr: StringSlice[origin]) -> Bool` Returns True if the substring is contained within the current string. **Args:** * ​substr (`StringSlice[origin]`): The substring to check. **Returns:** True if the string contains the substring. ### `__add__` `__add__(self, other: StringSlice[origin]) -> Self` Creates a string by appending a string slice at the end. **Args:** * ​other (`StringSlice[origin]`): The string slice to append. **Returns:** The new constructed string. ### `__mul__` `__mul__(self, n: Int) -> Self` Concatenates the string `n` times. **Args:** * ​n (`Int`): The number of times to concatenate the string. **Returns:** The string concatenated `n` times. ### `__radd__` `__radd__(self, other: StringSlice[origin]) -> Self` Creates a string by prepending another string slice to the start. **Args:** * ​other (`StringSlice[origin]`): The string to prepend. **Returns:** The new constructed string. ### `__iadd__` `__iadd__(mut self, other: StringSlice[origin])` Appends another string slice to this string. **Args:** * ​other (`StringSlice[origin]`): The string to append. ### `copy` `copy(self) -> Self` Explicitly copy the provided value. **Returns:** A copy of the value. ### `capacity` `capacity(self) -> UInt` Get the capacity of the string. **Returns:** The capacity of the string. ### `write_bytes` `write_bytes(mut self, bytes: Span[SIMD[uint8, 1], origin])` Write a byte span to this String. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The byte span to write to this String. Must NOT be null terminated. ### `write` `write[*Ts: Writable](mut self, *args: *Ts)` Write a sequence of Writable arguments to the provided Writer. **Parameters:** * ​\*Ts (`Writable`): Types of the provided argument sequence. **Args:** * ​\*args (`*Ts`): Sequence of arguments to write to this Writer. `static write[*Ts: Writable](*args: *Ts, *, sep: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), end: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")) -> Self` Construct a string by concatenating a sequence of Writable arguments. This is used only when reusing the `write_to` method for `__str__` in order to avoid an endless loop recalling the constructor. **Parameters:** * ​\*Ts (`Writable`): The types of the arguments to format. Each type must satisfy `Writable`. **Args:** * ​\*args (`*Ts`): A sequence of Writable arguments. * ​sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements. * ​end (`StringSlice[StaticConstantOrigin]`): The String to write after printing the elements. **Returns:** A string formed by formatting the argument sequence. ### `append_byte` `append_byte(mut self, byte: SIMD[uint8, 1])` Append a byte to the string. **Args:** * ​byte (`SIMD[uint8, 1]`): The byte to append. ### `__iter__` `__iter__(self) -> CodepointSliceIter[self]` Iterate over the string, returning immutable references. **Returns:** An iterator of references to the string elements. ### `__reversed__` `__reversed__(self) -> CodepointSliceIter[self, False]` Iterate backwards over the string, returning immutable references. **Returns:** A reversed iterator of references to the string elements. ### `__len__` `__len__(self) -> Int` Get the string length in bytes.
This function returns the number of bytes in the underlying UTF-8 representation of the string. To get the number of Unicode codepoints in a string, use `len(str.codepoints())`. # Examples Query the length of a string, in bytes and Unicode codepoints: ```mojo from testing import assert_equal var s = String("ನಮಸ್ಕಾರ") assert_equal(len(s), 21) assert_equal(len(s.codepoints()), 7) ``` Strings containing only ASCII characters have the same byte and Unicode codepoint length: ```mojo from testing import assert_equal var s = String("abc") assert_equal(len(s), 3) assert_equal(len(s.codepoints()), 3) ``` . **Returns:** The string length in bytes. ### `__str__` `__str__(self) -> Self` Gets the string itself. This method ensures that you can pass a `String` to a method that takes a `Stringable` value. **Returns:** The string itself. ### `__repr__` `__repr__(self) -> Self` Return a Mojo-compatible representation of the `String` instance. **Returns:** A new representation of the string. ### `__fspath__` `__fspath__(self) -> Self` Return the file system path representation (just the string itself). **Returns:** The file system path representation as a string. ### `to_python_object` `to_python_object(owned self) -> PythonObject` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this string to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `join` `join[*Ts: Writable](self, *elems: *Ts) -> Self` Joins string elements using the current string as a delimiter. **Parameters:** * ​\*Ts (`Writable`): The types of the elements. **Args:** * ​\*elems (`*Ts`): The input values. **Returns:** The joined string. `join[T: Copyable & Movable & Writable, //, buffer_size: Int = 4096](self, elems: List[T, hint_trivial_type]) -> Self` Joins string elements using the current string as a delimiter. Defaults to writing to the stack if total bytes of `elems` is less than `buffer_size`, otherwise will allocate once to the heap and write directly into that. The `buffer_size` defaults to 4096 bytes to match the default page size on arm64 and x86-64, but you can increase this if you're joining a very large `List` of elements to write into the stack instead of the heap. **Parameters:** * ​T (`Copyable & Movable & Writable`): The type of the elements. Must implement the `Copyable`, `Movable` and `Writable` traits. * ​buffer\_size (`Int`): The max size of the stack buffer. **Args:** * ​elems (`List[T, hint_trivial_type]`): The input values. **Returns:** The joined string. ### `codepoints` `codepoints(self) -> CodepointsIter[self]` Returns an iterator over the `Codepoint`s encoded in this string slice. # Examples Print the characters in a string: ```mojo from testing import assert_equal var s = String("abc") var iter = s.codepoints() assert_equal(iter.__next__(), Codepoint.ord("a")) assert_equal(iter.__next__(), Codepoint.ord("b")) assert_equal(iter.__next__(), Codepoint.ord("c")) assert_equal(iter.__has_next__(), False) ``` `codepoints()` iterates over Unicode codepoints, and supports multibyte codepoints: ```mojo from testing import assert_equal # A visual character composed of a combining sequence of 2 codepoints. 
var s = String("á") assert_equal(s.byte_length(), 3) var iter = s.codepoints() assert_equal(iter.__next__(), Codepoint.ord("a")) # U+0301 Combining Acute Accent assert_equal(iter.__next__().to_u32(), 0x0301) assert_equal(iter.__has_next__(), False) ``` . **Returns:** An iterator type that returns successive `Codepoint` values stored in this string slice. ### `codepoint_slices` `codepoint_slices(self) -> CodepointSliceIter[self]` Returns an iterator over single-character slices of this string. Each returned slice points to a single Unicode codepoint encoded in the underlying UTF-8 representation of this string. # Examples Iterate over the character slices in a string: ```mojo from testing import assert_equal, assert_true var s = String("abc") var iter = s.codepoint_slices() assert_true(iter.__next__() == "a") assert_true(iter.__next__() == "b") assert_true(iter.__next__() == "c") assert_equal(iter.__has_next__(), False) ``` . **Returns:** An iterator of references to the string elements. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[SIMD[uint8, 1], mut=False, origin=self]` Retrieves a pointer to the underlying memory. **Returns:** The pointer to the underlying memory. ### `unsafe_ptr_mut` `unsafe_ptr_mut(mut self) -> UnsafePointer[SIMD[uint8, 1], origin=self]` Retrieves a mutable pointer to the underlying memory, copying to a new buffer if this was previously pointing to a static constant. **Returns:** The pointer to the underlying memory. ### `unsafe_cstr_ptr` `unsafe_cstr_ptr(mut self) -> UnsafePointer[SIMD[int8, 1], origin=self]` Retrieves a C-string-compatible pointer to the underlying memory. The returned pointer is guaranteed to be NUL terminated. **Returns:** The pointer to the underlying memory. ### `as_bytes` `as_bytes(self) -> Span[SIMD[uint8, 1], self]` Returns a contiguous slice of the bytes owned by this string. **Returns:** A contiguous slice pointing to the bytes owned by this string. ### `as_bytes_mut` `as_bytes_mut(mut self) -> Span[SIMD[uint8, 1], self]` Returns a mutable contiguous slice of the bytes owned by this string. This name has a \_mut suffix so the as\_bytes() method doesn't have to guarantee mutability. **Returns:** A contiguous slice pointing to the bytes owned by this string. ### `as_string_slice` `as_string_slice(self) -> StringSlice[self]` Returns a string slice of the data owned by this string. **Returns:** A string slice pointing to the data owned by this string. ### `as_string_slice_mut` `as_string_slice_mut(mut self) -> StringSlice[self]` Returns a mutable string slice of the data owned by this string. **Returns:** A string slice pointing to the data owned by this string. ### `byte_length` `byte_length(self) -> Int` Get the string length in bytes. **Returns:** The length of this string in bytes. ### `count` `count(self, substr: StringSlice[origin]) -> Int` Return the number of non-overlapping occurrences of substring `substr` in the string. If `substr` is empty, returns the number of empty strings between characters, which is the length of the string plus one. **Args:** * ​substr (`StringSlice[origin]`): The substring to count. **Returns:** The number of occurrences of `substr`. ### `find` `find(self, substr: StringSlice[origin], start: Int = 0) -> Int` Finds the offset of the first occurrence of `substr` starting at `start`. If not found, returns -1. **Args:** * ​substr (`StringSlice[origin]`): The substring to find. * ​start (`Int`): The offset from which to find. **Returns:** The offset of `substr` relative to the beginning of the string.
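A brief illustrative sketch (not part of the original reference) of `find()` and `count()`; offsets are byte offsets from the start of the string, and `-1` signals no match:

```mojo
from testing import assert_equal

var s = String("hello world")
assert_equal(s.find("world"), 6)   # first match starts at byte offset 6
assert_equal(s.find("o", 5), 7)    # searching from offset 5 skips the first "o"
assert_equal(s.find("xyz"), -1)    # not found
assert_equal(s.count("o"), 2)      # non-overlapping occurrences
```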
### `rfind` `rfind(self, substr: StringSlice[origin], start: Int = 0) -> Int` Finds the offset of the last occurrence of `substr` starting at `start`. If not found, returns -1. **Args:** * ​substr (`StringSlice[origin]`): The substring to find. * ​start (`Int`): The offset from which to find. **Returns:** The offset of `substr` relative to the beginning of the string. ### `isspace` `isspace(self) -> Bool` Determines whether every character in the given String is a Python whitespace character. This corresponds to Python's [universal separators](https://docs.python.org/3/library/stdtypes.html#str.splitlines) `" \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`. **Returns:** True if the whole String is made up of whitespace characters listed above, otherwise False. ### `split` `split(self, sep: StringSlice[origin], maxsplit: Int = -1) -> List[String]` Split the string by a separator. Examples: ```mojo # Splitting a space _ = String("hello world").split(" ") # ["hello", "world"] # Splitting adjacent separators _ = String("hello,,world").split(",") # ["hello", "", "world"] # Splitting with maxsplit _ = String("1,2,3").split(",", 1) # ['1', '2,3'] # Splitting with an empty separator _ = String("123").split("") # ["", "1", "2", "3", ""] ``` **Args:** * ​sep (`StringSlice[origin]`): The string to split on. * ​maxsplit (`Int`): The maximum amount of items to split from String. Defaults to unlimited. **Returns:** A List of Strings containing the input split by the separator. `split(self, sep: NoneType = NoneType(None), maxsplit: Int = -1) -> List[String]` Split the string by every Whitespace separator. Examples: ```mojo # Splitting an empty string or filled with whitespaces _ = String(" ").split() # [] _ = String("").split() # [] # Splitting a string with leading, trailing, and middle whitespaces _ = String(" hello world ").split() # ["hello", "world"] # Splitting adjacent universal newlines: _ = String( "hello \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029world" ).split() # ["hello", "world"] ``` . **Args:** * ​sep (`NoneType`): None. * ​maxsplit (`Int`): The maximum amount of items to split from String. Defaults to unlimited. **Returns:** A List of Strings containing the input split by the separator. ### `splitlines` `splitlines(self, keepends: Bool = False) -> List[String]` Split the string at line boundaries. This corresponds to Python's [universal newlines:](https://docs.python.org/3/library/stdtypes.html#str.splitlines) `"\r\n"` and `"\t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`. **Args:** * ​keepends (`Bool`): If True, line breaks are kept in the resulting strings. **Returns:** A List of Strings containing the input split by line boundaries. ### `replace` `replace(self, old: StringSlice[origin], new: StringSlice[origin]) -> Self` Return a copy of the string with all occurrences of substring `old` replaced by `new`. **Args:** * ​old (`StringSlice[origin]`): The substring to replace. * ​new (`StringSlice[origin]`): The substring to replace with. **Returns:** The string where all occurrences of `old` are replaced with `new`. ### `strip` `strip(self, chars: StringSlice[origin]) -> StringSlice[self]` Return a copy of the string with leading and trailing characters removed. **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A copy of the string with no leading or trailing characters. `strip(self) -> StringSlice[self]` Return a copy of the string with leading and trailing whitespaces removed.
This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. **Returns:** A copy of the string with no leading or trailing whitespaces. ### `rstrip` `rstrip(self, chars: StringSlice[origin]) -> StringSlice[self]` Return a copy of the string with trailing characters removed. **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A copy of the string with no trailing characters. `rstrip(self) -> StringSlice[self]` Return a copy of the string with trailing whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. **Returns:** A copy of the string with no trailing whitespaces. ### `lstrip` `lstrip(self, chars: StringSlice[origin]) -> StringSlice[self]` Return a copy of the string with leading characters removed. **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A copy of the string with no leading characters. `lstrip(self) -> StringSlice[self]` Return a copy of the string with leading whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. **Returns:** A copy of the string with no leading whitespaces. ### `__hash__` `__hash__(self) -> UInt` Hash the underlying buffer using builtin hash. **Returns:** A 64-bit hash value. This value is *not* suitable for cryptographic uses. Its intended usage is for data structures. See the `hash` builtin documentation for more details. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with the underlying bytes. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `lower` `lower(self) -> Self` Returns a copy of the string with all cased characters converted to lowercase. **Returns:** A new string where cased letters have been converted to lowercase. ### `upper` `upper(self) -> Self` Returns a copy of the string with all cased characters converted to uppercase. **Returns:** A new string where cased letters have been converted to uppercase. ### `startswith` `startswith(self, prefix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool` Checks if the string starts with the specified prefix between start and end positions. Returns True if found and False otherwise. **Args:** * ​prefix (`StringSlice[origin]`): The prefix to check. * ​start (`Int`): The start offset from which to check. * ​end (`Int`): The end offset from which to check. **Returns:** True if the `self[start:end]` is prefixed by the input prefix. ### `endswith` `endswith(self, suffix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool` Checks if the string ends with the specified suffix between start and end positions. Returns True if found and False otherwise. **Args:** * ​suffix (`StringSlice[origin]`): The suffix to check. * ​start (`Int`): The start offset from which to check. * ​end (`Int`): The end offset from which to check. **Returns:** True if the `self[start:end]` is suffixed by the input suffix. ### `removeprefix` `removeprefix(self, prefix: StringSlice[origin], /) -> StringSlice[self]` Returns a new string with the prefix removed if it was present. Examples: ```mojo print(String('TestHook').removeprefix('Test')) # 'Hook' print(String('BaseTestCase').removeprefix('Test')) # 'BaseTestCase' ``` **Args:** * ​prefix (`StringSlice[origin]`): The prefix to remove from the string. **Returns:** `string[len(prefix):]` if the string starts with the prefix string, or a copy of the original string otherwise.
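Tying together the prefix/suffix checks above, a small illustrative sketch (not part of the original reference) of `startswith()`/`endswith()` with the optional `start`/`end` offsets:

```mojo
from testing import assert_true

var path = String("archive.tar.gz")
assert_true(path.startswith("archive"))
assert_true(path.endswith(".gz"))
# Restrict the check to self[start:end]: "archive.tar" ends with ".tar".
assert_true(path.endswith(".tar", 0, 11))
```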
### `removesuffix` `removesuffix(self, suffix: StringSlice[origin], /) -> StringSlice[self]` Returns a new string with the suffix removed if it was present. Examples: ```mojo print(String('TestHook').removesuffix('Hook')) # 'Test' print(String('BaseTestCase').removesuffix('Test')) # 'BaseTestCase' ``` **Args:** * ​suffix (`StringSlice[origin]`): The suffix to remove from the string. **Returns:** `string[:-len(suffix)]` if the string ends with the suffix string, or a copy of the original string otherwise. ### `__int__` `__int__(self) -> Int` Parses the given string as a base-10 integer and returns that value. If the string cannot be parsed as an int, an error is raised. **Returns:** An integer value that represents the string, or otherwise raises. ### `__float__` `__float__(self) -> SIMD[float64, 1]` Parses the string as a floating point number and returns that value. If the string cannot be parsed as a float, an error is raised. **Returns:** A float value that represents the string, or otherwise raises. ### `format` `format[*Ts: Stringable & Representable](self, *args: *Ts) -> Self` Produce a formatted string using the current string as a template. The template, or "format string", can contain literal text and/or replacement fields delimited with curly braces (`{}`). Returns a copy of the format string with the replacement fields replaced with string representations of the `args` arguments. For more information, see the discussion in the [`format` module](/mojo/stdlib/collections/string/format/). Example: ```mojo # Manual indexing: print(String("{0} {1} {0}").format("Mojo", 1.125)) # Mojo 1.125 Mojo # Automatic indexing: print(String("{} {}").format(True, "hello world")) # True hello world ``` **Parameters:** * ​\*Ts (`Stringable & Representable`): The types of substitution values that implement `Representable` and `Stringable` (to be changed and made more flexible). **Args:** * ​\*args (`*Ts`): The substitution values. **Returns:** The template with the given values substituted. ### `isdigit` `isdigit(self) -> Bool` A string is a digit string if all characters in the string are digits and there is at least one character in the string. Note that this currently only works with ASCII strings. **Returns:** True if all characters are digits and it's not empty else False. ### `isupper` `isupper(self) -> Bool` Returns True if all cased characters in the string are uppercase and there is at least one cased character. **Returns:** True if all cased characters in the string are uppercase and there is at least one cased character, False otherwise. ### `islower` `islower(self) -> Bool` Returns True if all cased characters in the string are lowercase and there is at least one cased character. **Returns:** True if all cased characters in the string are lowercase and there is at least one cased character, False otherwise. ### `isprintable` `isprintable(self) -> Bool` Returns True if all characters in the string are ASCII printable. Note that this currently only works with ASCII strings. **Returns:** True if all characters are printable else False. ### `rjust` `rjust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> Self` Returns the string right justified in a string of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** Returns right justified string, or self if width is not bigger than self length.
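An illustrative padding sketch (not part of the original reference) using `rjust()`; `ljust()` and `center()`, documented next, behave analogously. It assumes a string literal coerces to the `fillchar` parameter:

```mojo
var n = String("42")
print(n.rjust(5))        # "   42"
print(n.rjust(5, "0"))   # "00042"
print(n.rjust(1))        # "42" (width not bigger than the string returns self)
```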
### `ljust` `ljust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> Self` Returns the string left justified in a string of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** Returns left justified string, or self if width is not bigger than self length. ### `center` `center(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> Self` Returns the string center justified in a string of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** Returns center justified string, or self if width is not bigger than self length. ### `resize` `resize(mut self, length: Int, fill_byte: SIMD[uint8, 1] = __init__[__mlir_type.!pop.int_literal](0))` Resize the string to a new length. Notes: If the new length is greater than the current length, the string is extended by the difference, and the new bytes are initialized to `fill_byte`. **Args:** * ​length (`Int`): The new length of the string. * ​fill\_byte (`SIMD[uint8, 1]`): The byte to fill any new space with. `resize(mut self, *, unsafe_uninit_length: Int)` Resizes the string to the given new size leaving any new data uninitialized. If the new size is smaller than the current one, elements at the end are discarded. If the new size is larger than the current one, the string is extended and the new data is left uninitialized. **Args:** * ​unsafe\_uninit\_length (`Int`): The new size. ### `reserve` `reserve(mut self, new_capacity: UInt)` Reserves the requested capacity. Notes: If the current capacity is greater or equal, this is a no-op. Otherwise, the storage is reallocated and the data is moved. **Args:** * ​new\_capacity (`UInt`): The new capacity in stored bytes. --- ## ascii `ascii(value: StringSlice[origin]) -> String` Get the ASCII representation of the object. **Args:** * ​value (`StringSlice[origin]`): The object to get the ASCII representation of. **Returns:** A string containing the ASCII representation of the object. --- ## atof `atof(str_slice: StringSlice[origin]) -> SIMD[float64, 1]` Parses the given string as a floating point number and returns that value. For example, `atof("2.25")` returns `2.25`. This function is in the prelude, so you don't need to import it. **Args:** * ​str\_slice (`StringSlice[origin]`): A string to be parsed as a floating point number. **Returns:** A floating point value that represents the string, or otherwise raises. **Raises:** If the given string cannot be parsed as a floating point value, for example in `atof("hi")`. --- ## atol `atol(str_slice: StringSlice[origin], base: Int = 10) -> Int` Parses and returns the given string as an integer in the given base. If base is set to 0, the string is parsed as an integer literal, with the following considerations: * '0b' or '0B' prefix indicates binary (base 2) * '0o' or '0O' prefix indicates octal (base 8) * '0x' or '0X' prefix indicates hexadecimal (base 16) * Without a prefix, it's treated as decimal (base 10) This follows [Python's integer literals format](https://docs.python.org/3/reference/lexical_analysis.html#integers). This function is in the prelude, so you don't need to import it.
Examples: ```text >>> atol("32") 32 >>> atol("FF", 16) 255 >>> atol("0xFF", 0) 255 >>> atol("0b1010", 0) 10 ``` **Args:** * ​str\_slice (`StringSlice[origin]`): A string to be parsed as an integer in the given base. * ​base (`Int`): Base used for conversion, value must be between 2 and 36, or 0. **Returns:** An integer value that represents the string. **Raises:** If the given string cannot be parsed as an integer value or if an incorrect base is provided. --- ## chr `chr(c: Int) -> String` Returns a String based on the given Unicode code point. This is the inverse of the `ord()` function. This function is in the prelude, so you don't need to import it. Example: ```mojo print(chr(97), chr(8364)) # "a €" ``` **Args:** * ​c (`Int`): An integer that represents a code point. **Returns:** A string containing a single character based on the given code point. --- ## string The core `String` type implementation for Mojo. This module provides the primary `String` type and its fundamental operations. The `String` type is a mutable string, and is designed to handle UTF-8 encoded text efficiently while providing a safe and ergonomic interface for string manipulation. Related types: * [`StringSlice`](/mojo/stdlib/collections/string/string_slice/). A non-owning view of string data, which can be either mutable or immutable. * [`StaticString`](/mojo/stdlib/collections/string/string_slice/#aliases). An alias for an immutable constant `StringSlice`. * [`StringLiteral`](/mojo/stdlib/builtin/string_literal/StringLiteral/). A string literal. String literals are compile-time values. For use at runtime, you usually want to wrap a `StringLiteral` in a `String` (for a mutable string) or `StaticString` (for an immutable constant string). Key Features: * Short string optimization (SSO) and lazy copying of constant string data. * O(1) copy operation. * Memory-safe string operations. * Efficient string concatenation and slicing. * String-to-number conversions ( [`atof()`](/mojo/stdlib/collections/string/string/atof), [`atol()`](/mojo/stdlib/collections/string/string/atol)). * Character code conversions ( [`chr()`](/mojo/stdlib/collections/string/string/chr), [`ord()`](/mojo/stdlib/collections/string/string/ord)). * String formatting with [`format()`](/mojo/stdlib/collections/string/string/String/#format). The `String` type has Unicode support through UTF-8 encoding. A handful of operations are known to not be Unicode / UTF-8 compliant yet, but will be fixed as time permits. This type is in the prelude, so it is automatically imported into every Mojo program. Example: ```mojo # String creation and basic operations var s1 = String("Hello") var s2 = String("World") var combined = s1 + " " + s2 # "Hello World" # String-to-number conversion var num = atof("3.14") var int_val = atol("42") # Character operations var char = chr(65) # "A" var code = ord("A") # 65 # String formatting print(String("Codepoint {} is {}").format(code, char)) # Codepoint 65 is A # ASCII utilities var ascii_str = ascii("Hello") # ASCII-only string ``` ## Structs * [​`String`](/mojo/stdlib/collections/string/string/String): Represents a mutable string. ## Functions * [​`ascii`](/mojo/stdlib/collections/string/string/ascii): Get the ASCII representation of the object. * [​`atof`](/mojo/stdlib/collections/string/string/atof): Parses the given string as a floating point number and returns that value. * [​`atol`](/mojo/stdlib/collections/string/string/atol): Parses and returns the given string as an integer in the given base.
* [​`chr`](/mojo/stdlib/collections/string/string/chr): Returns a String based on the given Unicode code point. This is the inverse of the `ord()` function. * [​`ord`](/mojo/stdlib/collections/string/string/ord): Returns an integer that represents the codepoint of a single-character string. --- ## ord `ord(s: StringSlice[origin]) -> Int` Returns an integer that represents the codepoint of a single-character string. Given a string containing a single character `Codepoint`, return an integer representing the codepoint of that character. For example, `ord("a")` returns the integer `97`. This is the inverse of the `chr()` function. This function is in the prelude, so you don't need to import it. **Args:** * ​s (`StringSlice[origin]`): The input string, which must contain only a single character. **Returns:** An integer representing the code point of the given character. --- ## CodepointSliceIter `struct CodepointSliceIter[mut: Bool, //, origin: Origin[mut], forward: Bool = True]` Iterator for `StringSlice` over substring slices containing a single Unicode codepoint. The `forward` parameter only controls the behavior of the `__next__()` method used for normal iteration. Calls to `next()` will always take an element from the front of the iterator, and calls to `next_back()` will always take an element from the end. ## Parameters * ​mut (`Bool`): Whether the slice is mutable. * ​origin (`Origin[mut]`): The origin of the underlying string data. * ​forward (`Bool`): The iteration direction. `False` is backwards. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__next__` `__next__(mut self) -> StringSlice[origin]` Get the next codepoint in the underlying string slice. This returns the next single-codepoint substring slice encoded in the underlying string, and advances the iterator state. If `forward` is set to `False`, this will return the next codepoint from the end of the string. This function will abort if this iterator has been exhausted. **Returns:** The next character in the string. ### `__has_next__` `__has_next__(self) -> Bool` Returns True if there are still elements in this iterator. **Returns:** A boolean indicating if there are still elements in this iterator. ### `__len__` `__len__(self) -> Int` Returns the remaining length of this iterator in `Codepoint`s. The value returned from this method indicates the number of subsequent calls to `next()` that will return a value. **Returns:** Number of codepoints remaining in this iterator. ### `peek_next` `peek_next(self) -> Optional[StringSlice[origin]]` Check what the next single-codepoint slice in this iterator is, without advancing the iterator state. Repeated calls to this method will return the same value. # Examples `peek_next()` does not advance the iterator, so repeated calls will return the same value: ```mojo from collections.string import Codepoint from testing import assert_equal var input = StringSlice("123") var iter = input.codepoint_slices() assert_equal(iter.peek_next().value(), "1") assert_equal(iter.peek_next().value(), "1") assert_equal(iter.peek_next().value(), "1") # A call to `next()` returns the same value as `peek_next()` had, # but also advances the iterator. assert_equal(iter.next().value(), "1") # Later `peek_next()` calls will return the _new_ next character: assert_equal(iter.peek_next().value(), "2") ``` . **Returns:** The next codepoint slice in the underlying string, or None if the string is empty.
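A minimal sketch (not part of the original reference) of manual iteration with `next()` (documented below); it assumes the returned `Optional` evaluates to False once the iterator is exhausted:

```mojo
var s = StringSlice("abc")
var iter = s.codepoint_slices()
while True:
    var item = iter.next()
    if not item:
        break  # next() returned None: iterator exhausted
    print(item.value())
```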
### `peek_back` `peek_back(mut self) -> Optional[StringSlice[origin]]` Check what the last single-codepoint slice in this iterator is, without advancing the iterator state. Repeated calls to this method will return the same value. # Examples `peek_back()` does not advance the iterator, so repeated calls will return the same value: ```mojo from collections.string import Codepoint from testing import assert_equal var input = StringSlice("123") var iter = input.codepoint_slices() # Repeated calls to `peek_back()` return the same value. assert_equal(iter.peek_back().value(), "3") assert_equal(iter.peek_back().value(), "3") assert_equal(iter.peek_back().value(), "3") # A call to `next_back()` returns the same value as `peek_back()` had, # but also advances the iterator. assert_equal(iter.next_back().value(), "3") # Later `peek_back()` calls will return the _new_ next character: assert_equal(iter.peek_back().value(), "2") ``` . **Returns:** The last codepoint slice in the underlying string, or None if the string is empty. ### `next` `next(mut self) -> Optional[StringSlice[origin]]` Get the next codepoint slice in the underlying string slice, or None if the iterator is empty. This returns the next single-codepoint substring encoded in the underlying string, and advances the iterator state. **Returns:** A character if the string is not empty, otherwise None. ### `next_back` `next_back(mut self) -> Optional[StringSlice[origin]]` Get the last single-codepoint slice in this iterator, or None if the iterator is empty. This returns the last codepoint slice in this iterator, and advances the iterator state. **Returns:** The last codepoint slice in the underlying string, or None if the string is empty. --- ## CodepointsIter `struct CodepointsIter[mut: Bool, //, origin: Origin[mut]]` Iterator over the `Codepoint`s in a string slice, constructed by `StringSlice.codepoints()`. ## Parameters * ​mut (`Bool`): Mutability of the underlying string data. * ​origin (`Origin[mut]`): Origin of the underlying string data. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__next__` `__next__(mut self) -> Codepoint` Get the next codepoint in the underlying string slice. This returns the next `Codepoint` encoded in the underlying string, and advances the iterator state. This function will abort if this iterator has been exhausted. **Returns:** The next character in the string. ### `__has_next__` `__has_next__(self) -> Bool` Returns True if there are still elements in this iterator. **Returns:** A boolean indicating if there are still elements in this iterator. ### `__len__` `__len__(self) -> Int` Returns the remaining length of this iterator in `Codepoint`s. The value returned from this method indicates the number of subsequent calls to `next()` that will return a value. **Returns:** Number of codepoints remaining in this iterator. ### `peek_next` `peek_next(self) -> Optional[Codepoint]` Check what the next codepoint in this iterator is, without advancing the iterator state. Repeated calls to this method will return the same value.
# Examples `peek_next()` does not advance the iterator, so repeated calls will return the same value: ```mojo from collections.string import Codepoint from testing import assert_equal var input = StringSlice("123") var iter = input.codepoints() assert_equal(iter.peek_next().value(), Codepoint.ord("1")) assert_equal(iter.peek_next().value(), Codepoint.ord("1")) assert_equal(iter.peek_next().value(), Codepoint.ord("1")) # A call to `next()` returns the same value as `peek_next()` had, # but also advances the iterator. assert_equal(iter.next().value(), Codepoint.ord("1")) # Later `peek_next()` calls will return the _new_ next character: assert_equal(iter.peek_next().value(), Codepoint.ord("2")) ``` . **Returns:** The next character in the underlying string, or None if the string is empty. ### `next` `next(mut self) -> Optional[Codepoint]` Get the next codepoint in the underlying string slice, or None if the iterator is empty. This returns the next `Codepoint` encoded in the underlying string, and advances the iterator state. **Returns:** A character if the string is not empty, otherwise None. --- ## StringSlice `@register_passable(trivial)` `struct StringSlice[mut: Bool, //, origin: Origin[mut]]` A non-owning view to encoded string data. This type is guaranteed to have the same ABI (size, alignment, and field layout) as the `llvm::StringRef` type. See the [`string_slice` module](/mojo/stdlib/collections/string/string_slice/) for more information and examples. Notes: The underlying string data is guaranteed to be encoded using UTF-8. ## Parameters * ​mut (`Bool`): Whether the slice is mutable. * ​origin (`Origin[mut]`): The origin of the underlying string data. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `EqualityComparable`, `ExplicitlyCopyable`, `FloatableRaising`, `Hashable`, `IntableRaising`, `Movable`, `PathLike`, `PythonConvertible`, `Representable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `Immutable` `alias Immutable = StringSlice[(muttoimm origin._mlir_origin)]` The immutable version of the `StringSlice`. ### `Mutable` `alias Mutable = StringSlice[(mutcast origin._mlir_origin)]` The mutable version of the `StringSlice`. ## Methods ### `__init__` `__init__() -> Self` Create an empty / zero-length slice. `@implicit` `__init__(lit: StringLiteral[value]) -> StringSlice[StaticConstantOrigin]` Construct a new `StringSlice` from a `StringLiteral`. **Args:** * ​lit (`StringLiteral[value]`): The literal to construct this `StringSlice` from. `__init__(*, unsafe_from_utf8: Span[SIMD[uint8, 1], origin, address_space=address_space, alignment=alignment]) -> Self` Construct a new `StringSlice` from a sequence of UTF-8 encoded bytes. Safety: `unsafe_from_utf8` MUST be valid UTF-8 encoded data. **Args:** * ​unsafe\_from\_utf8 (`Span[SIMD[uint8, 1], origin, address_space=address_space, alignment=alignment]`): A `Span[Byte]` encoded in UTF-8. `__init__(*, unsafe_from_utf8_ptr: UnsafePointer[SIMD[uint8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> Self` Construct a new StringSlice from a `UnsafePointer[Byte]` pointing to null-terminated UTF-8 encoded bytes. Safety: * `unsafe_from_utf8_ptr` MUST point to data that is valid for `origin`. * `unsafe_from_utf8_ptr` MUST be valid UTF-8 encoded data. * `unsafe_from_utf8_ptr` MUST be null terminated.
**Args:** * ​unsafe\_from\_utf8\_ptr (`UnsafePointer[SIMD[uint8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): An `UnsafePointer[Byte]` of null-terminated bytes encoded in UTF-8. `__init__(*, unsafe_from_utf8_ptr: UnsafePointer[SIMD[int8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> Self` Construct a new StringSlice from a `UnsafePointer[c_char]` pointing to null-terminated UTF-8 encoded bytes. Safety: * `unsafe_from_utf8_ptr` MUST be valid UTF-8 encoded data. * `unsafe_from_utf8_ptr` MUST be null terminated. **Args:** * ​unsafe\_from\_utf8\_ptr (`UnsafePointer[SIMD[int8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): An `UnsafePointer[c_char]` of null-terminated bytes encoded in UTF-8. `__init__(*, ptr: UnsafePointer[SIMD[uint8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], length: UInt) -> Self` Construct a `StringSlice` from a pointer to a sequence of UTF-8 encoded bytes and a length. Safety: * `ptr` MUST point to at least `length` bytes of valid UTF-8 encoded data. * `ptr` must point to data that is live for the duration of `origin`. **Args:** * ​ptr (`UnsafePointer[SIMD[uint8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): A pointer to a sequence of bytes encoded in UTF-8. * ​length (`UInt`): The number of bytes of encoded data. `@implicit` `__init__[origin: ImmutableOrigin, //](ref [origin] value: String) -> StringSlice[origin]` Construct an immutable StringSlice. **Parameters:** * ​origin (`ImmutableOrigin`): The immutable origin. **Args:** * ​value (`String`): The string value. `__init__[origin: MutableOrigin, //](ref [origin] value: String) -> StringSlice[origin]` Construct a mutable StringSlice. **Parameters:** * ​origin (`MutableOrigin`): The mutable origin. **Args:** * ​value (`String`): The string value. ### `__bool__` `__bool__(self) -> Bool` Check if a string slice is non-empty. **Returns:** True if a string slice is non-empty, False otherwise. ### `__getitem__` `__getitem__(self, span: Slice) -> Self` Gets the sequence of characters at the specified positions. Raises: This function will raise if the specified slice start or end positions are outside the bounds of the string, or if they do not both fall on codepoint boundaries. **Args:** * ​span (`Slice`): A slice that specifies positions of the new substring. **Returns:** A new StringSlice containing the substring at the specified positions. `__getitem__[I: Indexer](self, idx: I) -> String` Gets the character at the specified position. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The index value. **Returns:** A new string containing the character at the specified position. ### `__lt__` `__lt__(self, rhs: StringSlice[origin]) -> Bool` Verify if the `StringSlice` bytes are strictly less than the input in overlapping content. **Args:** * ​rhs (`StringSlice[origin]`): The other `StringSlice` to compare against. **Returns:** If the `StringSlice` bytes are strictly less than the input in overlapping content. ### `__eq__` `__eq__(self, rhs_same: Self) -> Bool` Verify if a `StringSlice` is equal to another `StringSlice` with the same origin. **Args:** * ​rhs\_same (`Self`): The `StringSlice` to compare against. **Returns:** If the `StringSlice` is equal to the input in length and contents. `__eq__(self, rhs: StringSlice[origin]) -> Bool` Verify if a `StringSlice` is equal to another `StringSlice`.
**Args:** * ​rhs (`StringSlice[origin]`): The `StringSlice` to compare against. **Returns:** If the `StringSlice` is equal to the input in length and contents. ### `__ne__` `__ne__(self, rhs_same: Self) -> Bool` Verify if a `StringSlice` is not equal to another `StringSlice` with the same origin. **Args:** * ​rhs\_same (`Self`): The `StringSlice` to compare against. **Returns:** If the `StringSlice` is not equal to the input in length and contents. `__ne__(self, rhs: StringSlice[origin]) -> Bool` Verify if a `StringSlice` is not equal to another `StringSlice`. **Args:** * ​rhs (`StringSlice[origin]`): The `StringSlice` to compare against. **Returns:** If the `StringSlice` is not equal to the input in length and contents. ### `__contains__` `__contains__(self, substr: StringSlice[origin]) -> Bool` Returns True if the substring is contained within the current string. **Args:** * ​substr (`StringSlice[origin]`): The substring to check. **Returns:** True if the string contains the substring. ### `__add__` `__add__(self, rhs: StringSlice[origin]) -> String` Returns a string with this value prefixed on another string. **Args:** * ​rhs (`StringSlice[origin]`): The right side of the result. **Returns:** The result string. ### `__mul__` `__mul__(self, n: Int) -> String` Concatenates the string `n` times. **Args:** * ​n (`Int`): The number of times to concatenate the string. **Returns:** The string concatenated `n` times. ### `__radd__` `__radd__(self, lhs: StringSlice[origin]) -> String` Returns a string with this value appended to another string. **Args:** * ​lhs (`StringSlice[origin]`): The left side of the result. **Returns:** The result string. ### `copy` `copy(self) -> Self` Explicitly construct a deep copy of the provided `StringSlice`. **Returns:** A copy of the value. ### `from_utf8` `static from_utf8(from_utf8: Span[SIMD[uint8, 1], origin]) -> Self` Construct a new `StringSlice` from a buffer containing UTF-8 encoded data. **Args:** * ​from\_utf8 (`Span[SIMD[uint8, 1], origin]`): A span of bytes containing UTF-8 encoded data. **Returns:** A new validated `StringSlice` pointing to the provided buffer. **Raises:** An exception is raised if the provided buffer byte values do not form valid UTF-8 encoded codepoints. ### `__str__` `__str__(self) -> String` Convert this StringSlice to a String. Notes: This will allocate a new string that copies the string contents from the provided string slice. **Returns:** A new String. ### `__repr__` `__repr__(self) -> String` Return a Mojo-compatible representation of this string slice. **Returns:** Representation of this string slice in Mojo string literal input form. ### `__len__` `__len__(self) -> Int` Get the string length in bytes. This function returns the number of bytes in the underlying UTF-8 representation of the string. To get the number of Unicode codepoints in a string, use `len(str.codepoints())`. # Examples Query the length of a string, in bytes and Unicode codepoints: ```mojo from testing import assert_equal var s = StringSlice("ನಮಸ್ಕಾರ") assert_equal(len(s), 21) assert_equal(len(s.codepoints()), 7) ``` Strings containing only ASCII characters have the same byte and Unicode codepoint length: ```mojo from testing import assert_equal var s = StringSlice("abc") assert_equal(len(s), 3) assert_equal(len(s.codepoints()), 3) ``` . **Returns:** The string length in bytes. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this string slice to the provided `Writer`.
**Parameters:** * ​W (`Writer`): A type conforming to the `Writable` trait. **Args:** * ​writer (`W`): The object to write to. ### `__hash__` `__hash__(self) -> UInt` Hash the underlying buffer using builtin hash. **Returns:** A 64-bit hash value. This value is *not* suitable for cryptographic uses. Its intended usage is for data structures. See the `hash` builtin documentation for more details. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with the underlying bytes. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `__fspath__` `__fspath__(self) -> String` Return the file system path representation of this string. **Returns:** The file system path representation as a string. ### `to_python_object` `to_python_object(owned self) -> PythonObject` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. ### `__iter__` `__iter__(self) -> CodepointSliceIter[origin]` Iterate over the string, returning immutable references. **Returns:** An iterator of references to the string elements. ### `__reversed__` `__reversed__(self) -> CodepointSliceIter[origin, False]` Iterate backwards over the string, returning immutable references. **Returns:** A reversed iterator of references to the string elements. ### `__int__` `__int__(self) -> Int` Parses the given string as a base-10 integer and returns that value. If the string cannot be parsed as an int, an error is raised. **Returns:** An integer value that represents the string, or otherwise raises. ### `__float__` `__float__(self) -> SIMD[float64, 1]` Parses the string as a floating point number and returns that value. If the string cannot be parsed as a float, an error is raised. **Returns:** A float value that represents the string, or otherwise raises. ### `__merge_with__` `__merge_with__[: Bool, : Origin[$0], //, other_type: AnyStruct[StringSlice[$1]]](self) -> StringSlice[origin]` Returns a string slice with merged origins. **Parameters:** * ​other\_type (`AnyStruct[StringSlice[$1]]`): The type of the origin to merge with. **Returns:** A StringSlice merged with the other origin. ### `get_immutable` `get_immutable(self) -> StringSlice[(muttoimm origin._mlir_origin)]` Return an immutable version of this `StringSlice`. **Returns:** An immutable version of the same `StringSlice`. ### `replace` `replace(self, old: StringSlice[origin], new: StringSlice[origin]) -> String` Return a copy of the string with all occurrences of substring `old` replaced by `new`. **Args:** * ​old (`StringSlice[origin]`): The substring to replace. * ​new (`StringSlice[origin]`): The substring to replace with. **Returns:** The string where all occurrences of `old` are replaced with `new`. ### `strip` `strip(self, chars: StringSlice[origin]) -> Self` Return a copy of the string with leading and trailing characters removed. Example: ```mojo print("himojohi".strip("hi")) # "mojo" ``` **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A copy of the string with no leading or trailing characters. `strip(self) -> Self` Return a copy of the string with leading and trailing whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. Example: ```mojo print(" mojo ".strip()) # "mojo" ``` **Returns:** A copy of the string with no leading or trailing whitespaces. ### `rstrip` `rstrip(self, chars: StringSlice[origin]) -> Self` Return a copy of the string with trailing characters removed.
Example: ```mojo print("mojohi".rstrip("hi")) # "mojo" ``` **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A copy of the string with no trailing characters. `rstrip(self) -> Self` Return a copy of the string with trailing whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. Example: ```mojo print("mojo ".rstrip()) # "mojo" ``` **Returns:** A copy of the string with no trailing whitespaces. ### `lstrip` `lstrip(self, chars: StringSlice[origin]) -> Self` Return a copy of the string with leading characters removed. Example: ```mojo print("himojo".lstrip("hi")) # "mojo" ``` **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A copy of the string with no leading characters. `lstrip(self) -> Self` Return a copy of the string with leading whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. Example: ```mojo print(" mojo".lstrip()) # "mojo" ``` **Returns:** A copy of the string with no leading whitespaces. ### `codepoints` `codepoints(self) -> CodepointsIter[origin]` Returns an iterator over the `Codepoint`s encoded in this string slice. # Examples Print the characters in a string: ```mojo from testing import assert_equal var s = StringSlice("abc") var iter = s.codepoints() assert_equal(iter.__next__(), Codepoint.ord("a")) assert_equal(iter.__next__(), Codepoint.ord("b")) assert_equal(iter.__next__(), Codepoint.ord("c")) assert_equal(iter.__has_next__(), False) ``` `codepoints()` iterates over Unicode codepoints, and supports multibyte codepoints: ```mojo from testing import assert_equal # A visual character composed of a combining sequence of 2 codepoints. var s = StringSlice("á") assert_equal(s.byte_length(), 3) var iter = s.codepoints() assert_equal(iter.__next__(), Codepoint.ord("a")) # U+0301 Combining Acute Accent assert_equal(iter.__next__().to_u32(), 0x0301) assert_equal(iter.__has_next__(), False) ``` . **Returns:** An iterator type that returns successive `Codepoint` values stored in this string slice. ### `codepoint_slices` `codepoint_slices(self) -> CodepointSliceIter[origin]` Iterate over the string, returning immutable references. **Returns:** An iterator of references to the string elements. ### `as_bytes` `as_bytes(self) -> Span[SIMD[uint8, 1], origin]` Get the sequence of encoded bytes of the underlying string. **Returns:** A slice containing the underlying sequence of encoded bytes. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[SIMD[uint8, 1], mut=mut, origin=origin]` Gets a pointer to the first element of this string slice. **Returns:** A pointer pointing at the first element of this string slice. ### `byte_length` `byte_length(self) -> Int` Get the length of this string slice in bytes. **Returns:** The length of this string slice in bytes. ### `char_length` `char_length(self) -> UInt` Returns the length in Unicode codepoints. This returns the number of `Codepoint` codepoint values encoded in the UTF-8 representation of this string. Note: To get the length in bytes, use `StringSlice.byte_length()`.
# Examples Query the length of a string, in bytes and Unicode codepoints: ```mojo from testing import assert_equal var s = StringSlice("ನಮಸ್ಕಾರ") assert_equal(s.char_length(), 7) assert_equal(len(s), 21) ``` Strings containing only ASCII characters have the same byte and Unicode codepoint length: ```mojo from testing import assert_equal var s = StringSlice("abc") assert_equal(s.char_length(), 3) assert_equal(len(s), 3) ``` The character length of a string with visual combining characters is the length in Unicode codepoints, not grapheme clusters: ```mojo from testing import assert_equal var s = StringSlice("á") assert_equal(s.char_length(), 2) assert_equal(s.byte_length(), 3) ``` . **Returns:** The length in Unicode codepoints. ### `is_codepoint_boundary` `is_codepoint_boundary(self, index: UInt) -> Bool` Returns True if `index` is the position of the first byte in a UTF-8 codepoint sequence, or is at the end of the string. A byte position is considered a codepoint boundary if a valid subslice of the string would end (noninclusive) at `index`. Positions `0` and `len(self)` are considered to be codepoint boundaries. Positions beyond the length of the string slice will return False. Examples: Check if particular byte positions are codepoint boundaries: ```mojo from testing import assert_equal, assert_true, assert_false var abc = StringSlice("abc") assert_equal(len(abc), 3) assert_true(abc.is_codepoint_boundary(0)) assert_true(abc.is_codepoint_boundary(1)) assert_true(abc.is_codepoint_boundary(2)) assert_true(abc.is_codepoint_boundary(3)) ``` Only the index of the first byte in a multi-byte codepoint sequence is considered a codepoint boundary: ```mojo var thumb = StringSlice("👍") assert_equal(len(thumb), 4) assert_true(thumb.is_codepoint_boundary(0)) assert_false(thumb.is_codepoint_boundary(1)) assert_false(thumb.is_codepoint_boundary(2)) assert_false(thumb.is_codepoint_boundary(3)) ``` Visualization showing which bytes are considered codepoint boundaries, within a piece of text that includes codepoints whose UTF-8 representation requires, respectively, 1, 2, 3, and 4-bytes. The codepoint boundary byte indices are indicated by a vertical arrow (↑). For example, this diagram shows that a slice of bytes formed by the half-open range starting at byte 3 and extending up to but not including byte 6 (`[3, 6)`) is a valid UTF-8 sequence. ```text ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ a©➇𝄞 ┃ String ┣━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┫ ┃97┃ 169 ┃ 10119 ┃ 119070 ┃ Unicode Codepoints ┣━━╋━━━┳━━━╋━━━┳━━━┳━━━╋━━━┳━━━┳━━━┳━━━┫ ┃97┃194┃169┃226┃158┃135┃240┃157┃132┃158┃ UTF-8 Bytes ┗━━┻━━━┻━━━┻━━━┻━━━┻━━━┻━━━┻━━━┻━━━┻━━━┛ 0 1 2 3 4 5 6 7 8 9 10 ↑ ↑ ↑ ↑ ↑ ``` The following program verifies the above diagram: ```mojo from testing import assert_true, assert_false var text = StringSlice("a©➇𝄞") assert_true(text.is_codepoint_boundary(0)) assert_true(text.is_codepoint_boundary(1)) assert_false(text.is_codepoint_boundary(2)) assert_true(text.is_codepoint_boundary(3)) assert_false(text.is_codepoint_boundary(4)) assert_false(text.is_codepoint_boundary(5)) assert_true(text.is_codepoint_boundary(6)) assert_false(text.is_codepoint_boundary(7)) assert_false(text.is_codepoint_boundary(8)) assert_false(text.is_codepoint_boundary(9)) assert_true(text.is_codepoint_boundary(10)) ``` **Args:** * ​index (`UInt`): An index into the underlying byte representation of the string. **Returns:** A boolean indicating if `index` gives the position of the first byte in a UTF-8 codepoint sequence, or is at the end of the string. 
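A small illustrative sketch (not part of the original reference) that uses `is_codepoint_boundary()` to clamp a byte offset down to a valid boundary before slicing, reusing the boundaries from the diagram above (0, 1, 3, 6, 10):

```mojo
var text = StringSlice("a©➇𝄞")
# Clamp a byte offset down to the nearest codepoint boundary at or below it.
var cut: UInt = 5  # falls inside the 3-byte codepoint "➇" (bytes 3..6)
while not text.is_codepoint_boundary(cut):
    cut -= 1
print(cut)  # 3 — the largest boundary not exceeding 5
```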
### `startswith` `startswith(self, prefix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool` Verify if the `StringSlice` starts with the specified prefix between start and end positions. The `start` and `end` positions must be offsets given in bytes, and must be codepoint boundaries. **Args:** * ​prefix (`StringSlice[origin]`): The prefix to check. * ​start (`Int`): The start offset in bytes from which to check. * ​end (`Int`): The end offset in bytes at which to stop checking. **Returns:** True if `self[start:end]` is prefixed by the input prefix. ### `endswith` `endswith(self, suffix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool` Verify if the `StringSlice` ends with the specified suffix between start and end positions. The `start` and `end` positions must be offsets given in bytes, and must be codepoint boundaries. **Args:** * ​suffix (`StringSlice[origin]`): The suffix to check. * ​start (`Int`): The start offset in bytes from which to check. * ​end (`Int`): The end offset in bytes at which to stop checking. **Returns:** True if `self[start:end]` is suffixed by the input suffix. ### `removeprefix` `removeprefix(self, prefix: StringSlice[origin], /) -> Self` Returns a new string with the prefix removed if it was present. Examples: ```mojo print(StringSlice('TestHook').removeprefix('Test')) # 'Hook' print(StringSlice('BaseTestCase').removeprefix('Test')) # 'BaseTestCase' ``` **Args:** * ​prefix (`StringSlice[origin]`): The prefix to remove from the string. **Returns:** `string[len(prefix):]` if the string starts with the prefix string, or a copy of the original string otherwise. ### `removesuffix` `removesuffix(self, suffix: StringSlice[origin], /) -> Self` Returns a new string with the suffix removed if it was present. Examples: ```mojo print(StringSlice('TestHook').removesuffix('Hook')) # 'Test' print(StringSlice('BaseTestCase').removesuffix('Test')) # 'BaseTestCase' ``` **Args:** * ​suffix (`StringSlice[origin]`): The suffix to remove from the string. **Returns:** `string[:-len(suffix)]` if the string ends with the suffix string, or a copy of the original string otherwise. ### `format` `format[*Ts: Stringable & Representable](self, *args: *Ts) -> String` Produce a formatted string using the current string as a template. The template, or "format string", can contain literal text and/or replacement fields delimited with curly braces (`{}`). Returns a copy of the format string with the replacement fields replaced with string representations of the `args` arguments. For more information, see the discussion in the [`format` module](/mojo/stdlib/collections/string/format/). Examples: ```mojo # Manual indexing: print(StringSlice("{0} {1} {0}").format("Mojo", 1.125)) # Mojo 1.125 Mojo # Automatic indexing: print(StringSlice("{} {}").format(True, "hello world")) # True hello world ``` **Parameters:** * ​\*Ts (`Stringable & Representable`): The types of substitution values that implement `Representable` and `Stringable` (to be changed and made more flexible). **Args:** * ​\*args (`*Ts`): The substitution values. **Returns:** The template with the given values substituted. ### `find` `find(self, substr: StringSlice[origin], start: Int = 0) -> Int` Finds the offset in bytes of the first occurrence of `substr` starting at `start`. If not found, returns `-1`. **Args:** * ​substr (`StringSlice[origin]`): The substring to find. * ​start (`Int`): The offset in bytes from which to start the search. Must be a codepoint boundary.
**Returns:** The offset in bytes of `substr` relative to the beginning of the string. ### `rfind` `rfind(self, substr: StringSlice[origin], start: Int = 0) -> Int` Finds the offset in bytes of the last occurrence of `substr` starting at `start`. If not found, returns `-1`. **Args:** * ​substr (`StringSlice[origin]`): The substring to find. * ​start (`Int`): The offset in bytes from which to start the search. Must be a valid codepoint boundary. **Returns:** The offset in bytes of `substr` relative to the beginning of the string. ### `isspace` `isspace[single_character: Bool = False](self) -> Bool` Determines whether every character in the given StringSlice is a Python whitespace character. This corresponds to Python's [universal separators](https://docs.python.org/3/library/stdtypes.html#str.splitlines): `" \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`. Example: Check if a string contains only whitespace: ```mojo from testing import assert_true, assert_false # An empty string is not considered to contain only whitespace chars: assert_false(StringSlice("").isspace()) # ASCII space characters assert_true(StringSlice(" ").isspace()) assert_true(StringSlice(" ").isspace()) # Contains non-space characters assert_false(StringSlice(" abc ").isspace()) ``` **Parameters:** * ​single\_character (`Bool`): Whether to evaluate the `StringSlice` as a single Unicode character (avoids overhead when already iterating). **Returns:** True if the whole StringSlice is made up of whitespace characters listed above, otherwise False. ### `split` `split(self, sep: StringSlice[origin], maxsplit: Int = -1) -> List[StringSlice[(muttoimm origin._mlir_origin)]]` Split the string by a separator. Examples: ```mojo # Splitting on a space _ = StringSlice("hello world").split(" ") # ["hello", "world"] # Splitting adjacent separators _ = StringSlice("hello,,world").split(",") # ["hello", "", "world"] # Splitting with maxsplit _ = StringSlice("1,2,3").split(",", 1) # ['1', '2,3'] # Splitting with an empty separator _ = StringSlice("123").split("") # ["", "1", "2", "3", ""] ``` **Args:** * ​sep (`StringSlice[origin]`): The string to split on. * ​maxsplit (`Int`): The maximum number of items to split from the string. Defaults to unlimited. **Returns:** A List of Strings containing the input split by the separator. `split(self, sep: NoneType = NoneType(None), maxsplit: Int = -1) -> List[StringSlice[(muttoimm origin._mlir_origin)]]` Split the string by every whitespace separator. Examples: ```mojo # Splitting an empty string or one filled with whitespace _ = StringSlice(" ").split() # [] _ = StringSlice("").split() # [] # Splitting a string with leading, trailing, and middle whitespace _ = StringSlice(" hello world ").split() # ["hello", "world"] # Splitting adjacent universal newlines: _ = StringSlice( "hello \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029world" ).split() # ["hello", "world"] ``` **Args:** * ​sep (`NoneType`): None. * ​maxsplit (`Int`): The maximum number of items to split from the string. Defaults to unlimited. **Returns:** A List of Strings containing the input split by the separator. ### `isnewline` `isnewline[single_character: Bool = False](self) -> Bool` Determines whether every character in the given StringSlice is a Python newline character. This corresponds to Python's [universal newlines:](https://docs.python.org/3/library/stdtypes.html#str.splitlines) `"\r\n"` and `"\t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`. **Parameters:** * ​single\_character (`Bool`): Whether to evaluate the `StringSlice` as a single Unicode character (avoids overhead when already iterating).
**Returns:** True if the whole StringSlice is made up of the newline characters listed above, otherwise False. ### `splitlines` `splitlines[O: ImmutableOrigin, //](self: StringSlice[O], keepends: Bool = False) -> List[StringSlice[O]]` Split the string at line boundaries. This corresponds to Python's [universal newlines:](https://docs.python.org/3/library/stdtypes.html#str.splitlines) `"\r\n"` and `"\t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`. **Parameters:** * ​O (`ImmutableOrigin`): The immutable origin. **Args:** * ​keepends (`Bool`): If True, line breaks are kept in the resulting strings. **Returns:** A List of Strings containing the input split by line boundaries. ### `count` `count(self, substr: StringSlice[origin]) -> Int` Return the number of non-overlapping occurrences of substring `substr` in the string. If `substr` is empty, returns the number of empty strings between characters, which is the length of the string plus one. **Args:** * ​substr (`StringSlice[origin]`): The substring to count. **Returns:** The number of occurrences of `substr`. ### `is_ascii_digit` `is_ascii_digit(self) -> Bool` A string is a digit string if all characters in the string are digits and there is at least one character in the string. Note that this currently only works with ASCII strings. **Returns:** True if all characters are digits and the string is not empty, otherwise False. ### `isupper` `isupper(self) -> Bool` Returns True if all cased characters in the string are uppercase and there is at least one cased character. **Returns:** True if all cased characters in the string are uppercase and there is at least one cased character, False otherwise. ### `islower` `islower(self) -> Bool` Returns True if all cased characters in the string are lowercase and there is at least one cased character. **Returns:** True if all cased characters in the string are lowercase and there is at least one cased character, False otherwise. ### `lower` `lower(self) -> String` Returns a copy of the string with all cased characters converted to lowercase. **Returns:** A new string where cased letters have been converted to lowercase. ### `upper` `upper(self) -> String` Returns a copy of the string with all cased characters converted to uppercase. **Returns:** A new string where cased letters have been converted to uppercase. ### `is_ascii_printable` `is_ascii_printable(self) -> Bool` Returns True if all characters in the string are ASCII printable. Note that this currently only works with ASCII strings. **Returns:** True if all characters are printable, otherwise False. ### `rjust` `rjust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> String` Returns the string right-justified in a string of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The right-justified string, or self if `width` is not bigger than the string's length. ### `ljust` `ljust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> String` Returns the string left-justified in a string of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The left-justified string, or self if `width` is not bigger than the string's length.
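For instance, these padding helpers can align fixed-width output. A minimal sketch (the widths and fill character are illustrative, assuming a string literal converts to the `fillchar` argument type):

```mojo
var s = StringSlice("mojo")
print(s.rjust(8))       # "    mojo"
print(s.ljust(8, "-"))  # "mojo----"
```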
### `center` `center(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> String` Returns the string center-justified in a string of specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The center-justified string, or self if `width` is not bigger than the string's length. ### `join` `join[T: Copyable & Movable & Writable](self, elems: List[T, hint_trivial_type]) -> String` Joins string elements using the current string as a delimiter. **Parameters:** * ​T (`Copyable & Movable & Writable`): The type of the elements; must implement the `Copyable`, `Movable` and `Writable` traits. **Args:** * ​elems (`List[T, hint_trivial_type]`): The input values. **Returns:** The joined string. `join[*Ts: Writable](self: StringSlice[StaticConstantOrigin], *elems: *Ts) -> String` Joins string elements using the current string as a delimiter. **Parameters:** * ​\*Ts (`Writable`): The types of the elements. **Args:** * ​\*elems (`*Ts`): The input values. **Returns:** The joined string. --- ## get_static_string `get_static_string[string: StringSlice[StaticConstantOrigin], *extra: StringSlice[StaticConstantOrigin]]() -> StringSlice[StaticConstantOrigin]` Form a StaticString from compile-time StringSlice values. This guarantees that the returned string is a compile-time constant in static memory. It also guarantees that there is a 'nul' zero byte at the end, which is not included in the returned range. **Parameters:** * ​string (`StringSlice[StaticConstantOrigin]`): The first StringSlice value. * ​\*extra (`StringSlice[StaticConstantOrigin]`): Additional StringSlice values to concatenate. **Returns:** The string value as a StaticString. --- ## string_slice The `StringSlice` type implementation for efficient string operations. This module provides the `StringSlice` type, which is a lightweight view into string data that enables zero-copy string operations. `StringSlice` is designed for high-performance string manipulation while maintaining memory safety and UTF-8 awareness. The `StringSlice` type is particularly useful for: * High-performance string operations without copying. * Efficient string parsing and tokenization. `StaticString` is an alias for an immutable constant `StringSlice`. `StringSlice` and `StaticString` are in the prelude, so they are automatically imported into every Mojo program. Example: ```mojo # Create a string slice var text = StringSlice("Hello, 世界") # Zero-copy slicing var hello = text[0:5] # "Hello" # Unicode-aware operations var world = text[7:13] # "世界" # String comparison if text.startswith("Hello"): print("Found greeting") # String formatting var format_string = StaticString("{}: {}") print(format_string.format("bats", 6)) # bats: 6 ``` ## Aliases ### `StaticString` `alias StaticString = StringSlice[StaticConstantOrigin]` An immutable static string slice. ## Structs * [​`CodepointsIter`](/mojo/stdlib/collections/string/string_slice/CodepointsIter): Iterator over the `Codepoint`s in a string slice, constructed by `StringSlice.codepoints()`. * [​`CodepointSliceIter`](/mojo/stdlib/collections/string/string_slice/CodepointSliceIter): Iterator for `StringSlice` over substring slices containing a single Unicode codepoint. * [​`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice): A non-owning view to encoded string data.
## Functions * [​`get_static_string`](/mojo/stdlib/collections/string/string_slice/get_static_string): Form a StaticString from compile-time StringSlice values. --- ## Info `@register_passable(trivial)` `struct Info[func_type: AnyTrivialRegType, func: func_type, target: target]` Contains compilation information and results for a function. Stores assembly/IR code, function metadata, and error information from compiling a function. Attributes: populate: Function to populate captures. ## Parameters * ​func\_type (`AnyTrivialRegType`): Type of the function being compiled. * ​func (`func_type`): The function being compiled. * ​target (`target`): The target architecture to compile for. ## Fields * ​asm (`StringSlice[StaticConstantOrigin]`): Generated assembly/IR code from the compilation process. * ​function\_name (`StringSlice[StaticConstantOrigin]`): Mangled name of the compiled function, used for symbol resolution. * ​module\_name (`StringSlice[StaticConstantOrigin]`): Name of the module containing the compiled function. * ​num\_captures (`Int`): Number of variables captured by the function closure. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `populate` `alias populate = rebind[AnyTrivialRegType,AnyTrivialRegType](#kgen.compile_offload_closure : !kgen.param>)` Function pointer to populate captured variables in the function closure. ## Methods ### `__contains__` `__contains__(self, content: String) -> Bool` Checks if content exists in the assembly/IR. **Args:** * ​content (`String`): String to search for. **Returns:** True if content is found, False otherwise. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the assembly/IR to a writer. **Parameters:** * ​W (`Writer`): Type that implements the Writer interface for writing data. **Args:** * ​writer (`W`): Writer object to write the assembly to. ### `__str__` `__str__(self) -> String` Converts the assembly/IR to a string. **Returns:** The assembly/IR as a string. ### `write_text` `write_text[path_like: PathLike](self, path: path_like)` Writes the assembly/IR to a file. **Parameters:** * ​path\_like (`PathLike`): Type that implements the `PathLike` interface for file path representation. **Args:** * ​path (`path_like`): Path to write the file to. **Raises:** If file writing operations fail. --- ## compile_info `compile_info[func_type: AnyTrivialRegType, //, func: func_type, /, *, emission_kind: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("asm"), compile_options: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), target: target = _current_target()]() -> Info[func_type, func, target]` Compiles a function and returns detailed compilation information. This function takes a Mojo function and compiles it, providing access to the generated assembly code, linkage information, and other compilation artifacts. It can be used for inspection, debugging, and low-level optimization. Example: ```mojo from compile import compile_info fn my_func(x: Int) -> Int: return x info = compile_info[my_func]() print(info) # Print assembly ``` Note: The compilation is always performed, even if the function is not used. For performance-critical code, consider caching the compilation results.
**Parameters:** * ​func\_type (`AnyTrivialRegType`): Type of the function to compile. Must be a trivially-copyable register type. * ​func (`func_type`): The function to compile. Must match the specified func\_type. * ​emission\_kind (`StringSlice[StaticConstantOrigin]`): The desired output format. Valid options are: * "asm": Assembly code (default). * "llvm": Unoptimized LLVM IR. * "llvm-opt": Optimized LLVM IR. * "object": Object code. * ​compile\_options (`StringSlice[StaticConstantOrigin]`): Additional compiler flags and options as a string. * ​target (`target`): The target architecture to compile for. Defaults to the current architecture. **Returns:** An `Info` struct containing: * asm: The generated code in the requested format. * function\_name: The mangled name of the compiled function, used for symbol resolution. * module\_name: The name of the module containing the compiled function. * num\_captures: The number of variables captured by the function closure. --- ## compile Provides utilities for compiling and inspecting Mojo code. This module contains functionality for compiling Mojo functions and examining their assembly, LLVM IR, or object code output. It is particularly useful for kernel engineers who want to inspect the low-level implementation details of specific functions without dealing with entire files or manual invocation of compilation tools. Key features: * Compile individual functions to assembly, LLVM IR, or object code * Get linkage names and module information * Inspect number of captures and other function metadata * Write compilation output to files * Control compilation options and targets Example: ```mojo from compile import compile_info fn my_func(x: Int) -> Int: return x # Get assembly for the function info = compile_info[my_func]() print(info) ``` ## Structs * [​`Info`](/mojo/stdlib/compile/compile/Info): Contains compilation information and results for a function. ## Functions * [​`compile_info`](/mojo/stdlib/compile/compile/compile_info): Compiles a function and returns detailed compilation information. --- ## compile Provides utilities for compiling and inspecting Mojo code at runtime. This module exposes functionality for compiling individual Mojo functions and examining their low-level implementation details. It is particularly useful for: * Inspecting assembly, LLVM IR, or object code output * Getting linkage names and module information * Examining function metadata like captures * Writing compilation output to files * Controlling compilation options and targets Example: ```mojo from compile import compile_info fn my_func(): print("Hello") # Get assembly for the function info = compile_info[my_func]() print(info.asm) ``` ## Modules * [​`compile`](/mojo/stdlib/compile/compile/): Provides utilities for compiling and inspecting Mojo code. * [​`reflection`](/mojo/stdlib/compile/reflection/): --- ## get_linkage_name `get_linkage_name[func_type: AnyTrivialRegType, //, target: target, func: func_type]() -> StringSlice[StaticConstantOrigin]` Returns the symbol name of `func`. **Parameters:** * ​func\_type (`AnyTrivialRegType`): Type of func. * ​target (`target`): The compilation target. * ​func (`func_type`): A Mojo function. **Returns:** The symbol name. `get_linkage_name[func_type: AnyTrivialRegType, //, func: func_type]() -> StringSlice[StaticConstantOrigin]` Returns the symbol name of `func`. **Parameters:** * ​func\_type (`AnyTrivialRegType`): Type of func. * ​func (`func_type`): A Mojo function. **Returns:** The symbol name.
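As a usage sketch, the reflection helpers compose with ordinary functions. A minimal example (assuming `get_linkage_name` is importable from the `compile.reflection` module documented below; the function and its body are hypothetical):

```mojo
from compile.reflection import get_linkage_name

fn my_func(x: Int) -> Int:
    return x + 1

def main():
    # Prints the mangled symbol name used for linking.
    print(get_linkage_name[my_func]())
```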
--- ## get_type_name `get_type_name[type_type: AnyTrivialRegType, //, type: type_type]() -> StringSlice[StaticConstantOrigin]` Returns the struct name of the given type parameter. **Parameters:** * ​type\_type (`AnyTrivialRegType`): Type of type. * ​type (`type_type`): A Mojo type. **Returns:** Type name. --- ## reflection ## Functions * [​`get_linkage_name`](/mojo/stdlib/compile/reflection/get_linkage_name): Returns the symbol name of `func`. * [​`get_type_name`](/mojo/stdlib/compile/reflection/get_type_name): Returns the struct name of the given type parameter. --- ## ComplexSIMD `@register_passable(trivial)` `struct ComplexSIMD[type: DType, size: Int]` Represents a complex SIMD value. The struct provides basic methods for manipulating complex values. ## Parameters * ​type (`DType`): DType of the value. * ​size (`Int`): SIMD width of the value. ## Fields * ​re (`SIMD[type, size]`): The real part of the complex SIMD value. * ​im (`SIMD[type, size]`): The imaginary part of the complex SIMD value. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`, `_Expable` ## Aliases ### `element_type` `alias element_type = SIMD[type, size]` ## Methods ### `__init__` `__init__(re: SIMD[type, size], im: SIMD[type, size] = __init__[__mlir_type.!pop.int_literal](0)) -> Self` Initializes a complex SIMD value. **Args:** * ​re (`SIMD[type, size]`): The real part of the complex value. * ​im (`SIMD[type, size]`): The imaginary part of the complex value. ### `__neg__` `__neg__(self) -> Self` Negates the complex value. **Returns:** The negative of the complex value. ### `__add__` `__add__(self, rhs: Self) -> Self` Adds two complex values. **Args:** * ​rhs (`Self`): Complex value to add. **Returns:** A sum of this and RHS complex values. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Subtracts two complex values. **Args:** * ​rhs (`Self`): Complex value to subtract. **Returns:** A difference of this and RHS complex values. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Multiplies two complex values. **Args:** * ​rhs (`Self`): Complex value to multiply with. **Returns:** A product of this and RHS complex values. ### `__truediv__` `__truediv__(self, rhs: Self) -> Self` Divides two complex values. **Args:** * ​rhs (`Self`): Complex value to divide by. **Returns:** A quotient of this and RHS complex values. ### `__str__` `__str__(self) -> String` Get the complex value as a string. **Returns:** A string representation. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this complex value to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the `Writer` trait. **Args:** * ​writer (`W`): The object to write to. ### `__abs__` `__abs__(self) -> SIMD[type, size]` Returns the magnitude of the complex value. **Returns:** Value of `sqrt(re*re + im*im)`. ### `norm` `norm(self) -> SIMD[type, size]` Returns the magnitude of the complex value. **Returns:** Value of `sqrt(re*re + im*im)`. ### `squared_norm` `squared_norm(self) -> SIMD[type, size]` Returns the squared magnitude of the complex value. **Returns:** Value of `re*re + im*im`. ### `fma` `fma(self, b: Self, c: Self) -> Self` Computes the FMA operation. Computes the fused multiply-add with two other complex values: `result = self * b + c` **Args:** * ​b (`Self`): Multiplier complex value. * ​c (`Self`): Complex value to add. **Returns:** The computed `self * b + c` complex value. ### `squared_add` `squared_add(self, c: Self) -> Self` Computes the square-add operation: `self * self + c`.
**Args:** * ​c (`Self`): Complex value to add. **Returns:** The computed `self * self + c` complex value. ### `__exp__` `__exp__(self) -> Self` Computes the exponential of the complex value. **Returns:** The exponential of the complex value. --- ## abs `abs(x: ComplexSIMD[type, size]) -> SIMD[type, size]` Performs an elementwise absolute value (norm) on the complex value. **Args:** * ​x (`ComplexSIMD[type, size]`): The complex vector to take the absolute value of. **Returns:** The elementwise absolute value of x. --- ## complex Implements the Complex type. You can import these APIs from the `complex` package. For example: ```mojo from complex import ComplexSIMD ``` ## Aliases ### `ComplexFloat32` `alias ComplexFloat32 = ComplexSIMD[float32, 1]` ### `ComplexFloat64` `alias ComplexFloat64 = ComplexSIMD[float64, 1]` ## Structs * [​`ComplexSIMD`](/mojo/stdlib/complex/complex/ComplexSIMD): Represents a complex SIMD value. ## Functions * [​`abs`](/mojo/stdlib/complex/complex/abs): Performs an elementwise absolute value (norm) on the complex value. --- ## complex Provides types and functions for working with complex numbers. ## Modules * [​`complex`](/mojo/stdlib/complex/complex/): Implements the Complex type. --- ## doc_private `doc_private()` Indicate that the decorated declaration is private from the viewpoint of documentation generation. This decorator allows for hiding the documentation for a declaration during generation. This is often used to hide `__init__` and other special methods that are not intended to be part of a library's documentation. For example: ```mojo struct Foo: @doc_private fn __init__(out self): "This should not be called directly, use `Foo.create` instead." return @staticmethod fn create() -> Self: return Self() ``` --- ## documentation Provides decorators and utilities for interacting with Mojo documentation generation and validation. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`doc_private`](/mojo/stdlib/documentation/documentation/doc_private): Indicate that the decorated declaration is private from the viewpoint of documentation generation. --- ## documentation Implements the documentation package. ## Modules * [​`documentation`](/mojo/stdlib/documentation/documentation/): Provides decorators and utilities for interacting with Mojo documentation generation and validation. --- ## broadcast `broadcast[type: DType, width: Int, //, *, block_size: Int](val: SIMD[type, width], src_thread: UInt = UInt(0)) -> SIMD[type, width]` Broadcasts a value from a source thread to all threads in a block. This function takes a SIMD value from the specified source thread and copies it to all other threads in the block, effectively broadcasting the value across the entire block. **Parameters:** * ​type (`DType`): The data type of the SIMD elements. * ​width (`Int`): The number of elements in each SIMD vector. * ​block\_size (`Int`): The total number of threads in the block. **Args:** * ​val (`SIMD[type, width]`): The SIMD value to broadcast from the source thread. * ​src\_thread (`UInt`): The thread ID of the source thread (default: 0). **Returns:** A SIMD value where all threads contain a copy of the input value from the source thread. --- ## block GPU block-level operations and utilities.
This module provides block-level operations for NVIDIA and AMD GPUs, including: * Block-wide reductions: * sum: Compute sum across block * max: Find maximum value across block * min: Find minimum value across block * broadcast: Broadcast value to all threads The module builds on warp-level operations from the warp module, extending them to work across a full thread block (potentially multiple warps). It handles both NVIDIA and AMD GPU architectures and supports various data types with SIMD vectorization. ## Functions * [​`broadcast`](/mojo/stdlib/gpu/block/broadcast): Broadcasts a value from a source thread to all threads in a block. * [​`max`](/mojo/stdlib/gpu/block/max): Computes the maximum value across all threads in a block. * [​`min`](/mojo/stdlib/gpu/block/min): Computes the minimum value across all threads in a block. * [​`prefix_sum`](/mojo/stdlib/gpu/block/prefix_sum): Performs a prefix sum (scan) operation across all threads in a block. * [​`sum`](/mojo/stdlib/gpu/block/sum): Computes the sum of values across all threads in a block. --- ## max `max[type: DType, width: Int, //, *, block_size: Int, broadcast: Bool = True](val: SIMD[type, width]) -> SIMD[type, width]` Computes the maximum value across all threads in a block. Performs a parallel reduction using warp-level operations and shared memory to find the global maximum across all threads in the block. **Parameters:** * ​type (`DType`): The data type of the SIMD elements. * ​width (`Int`): The number of elements in each SIMD vector. * ​block\_size (`Int`): The total number of threads in the block. * ​broadcast (`Bool`): If True, the final reduced value is broadcast to all threads in the block. If False, only the first thread will have the complete result. **Args:** * ​val (`SIMD[type, width]`): The SIMD value to reduce. Each thread contributes its value to find the maximum. **Returns:** If broadcast is True, each thread in the block will receive the maximum value across the entire block. Otherwise, only the first thread will have the complete result. --- ## min `min[type: DType, width: Int, //, *, block_size: Int, broadcast: Bool = True](val: SIMD[type, width]) -> SIMD[type, width]` Computes the minimum value across all threads in a block. Performs a parallel reduction using warp-level operations and shared memory to find the global minimum across all threads in the block. **Parameters:** * ​type (`DType`): The data type of the SIMD elements. * ​width (`Int`): The number of elements in each SIMD vector. * ​block\_size (`Int`): The total number of threads in the block. * ​broadcast (`Bool`): If True, the final minimum is broadcast to all threads in the block. If False, only the first thread will have the complete result. **Args:** * ​val (`SIMD[type, width]`): The SIMD value to reduce. Each thread contributes its value to find the minimum. **Returns:** If broadcast is True, each thread in the block will receive the minimum value across the entire block. Otherwise, only the first thread will have the complete result. --- ## prefix_sum `prefix_sum[type: DType, //, *, block_size: Int, exclusive: Bool = False](val: SIMD[type, 1]) -> SIMD[type, 1]` Performs a prefix sum (scan) operation across all threads in a block. This function implements a block-level inclusive or exclusive scan, efficiently computing the cumulative sum for each thread based on thread indices. **Parameters:** * ​type (`DType`): The data type of the Scalar elements. * ​block\_size (`Int`): The total number of threads in the block.
* ​exclusive (`Bool`): If True, performs an exclusive scan instead of an inclusive scan. **Args:** * ​val (`SIMD[type, 1]`): The Scalar value from each thread to include in the scan. **Returns:** A Scalar value containing the result of the scan operation for each thread. --- ## sum `sum[type: DType, width: Int, //, *, block_size: Int, broadcast: Bool = True](val: SIMD[type, width]) -> SIMD[type, width]` Computes the sum of values across all threads in a block. Performs a parallel reduction using warp-level operations and shared memory to find the global sum across all threads in the block. **Parameters:** * ​type (`DType`): The data type of the SIMD elements. * ​width (`Int`): The number of elements in each SIMD vector. * ​block\_size (`Int`): The total number of threads in the block. * ​broadcast (`Bool`): If True, the final sum is broadcast to all threads in the block. If False, only the first thread will have the complete sum. **Args:** * ​val (`SIMD[type, width]`): The SIMD value to reduce. Each thread contributes its value to the sum. **Returns:** If broadcast is True, each thread in the block will receive the final sum. Otherwise, only the first thread will have the complete sum. --- ## block_rank_in_cluster `block_rank_in_cluster() -> SIMD[uint32, 1]` Returns the unique identifier (rank) for the current thread block within its cluster. Note: * Only supported on NVIDIA SM90+ GPUs. * Maps directly to the `%cluster_ctarank` special register in CUDA PTX. **Returns:** A unique identifier in the range \[0, cluster\_size-1] where `cluster_size` is the total number of thread blocks in the cluster. --- ## cluster_arrive `cluster_arrive()` Signals arrival at a cluster synchronization point with memory ordering guarantees. This function ensures all prior memory operations from this thread block are visible to other thread blocks in the cluster before proceeding. Only supported on NVIDIA SM90+ GPUs. --- ## cluster_arrive_relaxed `cluster_arrive_relaxed()` Signals arrival at a cluster synchronization point with relaxed memory ordering. This is a relaxed version of cluster\_arrive() that does not enforce memory ordering guarantees. It should be used when memory ordering is not required between thread blocks in the cluster. Only supported on NVIDIA SM90+ GPUs. --- ## cluster_sync `cluster_sync()` Performs a full cluster synchronization with memory ordering guarantees. This is a convenience function that combines cluster\_arrive() and cluster\_wait() to provide a full barrier synchronization across all thread blocks in the cluster. Ensures memory ordering between thread blocks. Only supported on NVIDIA SM90+ GPUs. --- ## cluster_sync_acquire `cluster_sync_acquire()` Acquires the cluster sync proxy. Only supported on NVIDIA SM90+ GPUs. --- ## cluster_sync_relaxed `cluster_sync_relaxed()` Performs a full cluster synchronization with relaxed memory ordering. This is a convenience function that combines cluster\_arrive\_relaxed() and cluster\_wait() to provide a barrier synchronization across all thread blocks in the cluster without memory ordering guarantees. Only supported on NVIDIA SM90+ GPUs. --- ## cluster_sync_release `cluster_sync_release()` Releases the cluster sync proxy. Only supported on NVIDIA SM90+ GPUs. --- ## cluster_wait `cluster_wait()` Waits for all thread blocks in the cluster to arrive at the synchronization point. This function blocks until all thread blocks in the cluster have called cluster\_arrive() or cluster\_arrive\_relaxed(). Only supported on NVIDIA SM90+ GPUs.
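Taken together, the arrive/wait pair brackets work whose results must be visible cluster-wide before any block proceeds. A minimal device-side sketch (assuming these functions are importable from the `gpu.cluster` module and the kernel is launched with a cluster configuration on an SM90+ GPU; the kernel name and elided bodies are illustrative):

```mojo
from gpu.cluster import cluster_arrive, cluster_wait

fn cluster_kernel():
    # ... write data that peer blocks in the cluster will read ...

    # Make this block's prior writes visible across the cluster.
    cluster_arrive()

    # Block until every thread block in the cluster has arrived.
    cluster_wait()

    # ... now safe to read data produced by peer blocks ...
```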
--- ## clusterlaunchcontrol_query_cancel_get_first_ctaid `clusterlaunchcontrol_query_cancel_get_first_ctaid[id: String](result: UnsafePointer[SIMD[uint64, 1], address_space=AddressSpace(3)]) -> SIMD[uint32, 1]` Decodes the cancellation request. Only supported on NVIDIA SM100+ GPUs. **Parameters:** * ​id (`String`): The dimension to decode. Must be one of `x`, `y`, `z`. **Args:** * ​result (`UnsafePointer[SIMD[uint64, 1], address_space=AddressSpace(3)]`): A pointer to 2 `UInt64`s that make up the cancellation request result to decode. **Returns:** The coordinate of the first CTAID in the canceled cluster. --- ## clusterlaunchcontrol_query_cancel_get_first_ctaid_v4 `clusterlaunchcontrol_query_cancel_get_first_ctaid_v4(block_dim: UnsafePointer[SIMD[uint32, 1]], result: UnsafePointer[SIMD[uint64, 1], address_space=AddressSpace(3)])` Decodes the cancellation request. Only supported on NVIDIA SM100+ GPUs. **Args:** * ​block\_dim (`UnsafePointer[SIMD[uint32, 1]]`): A pointer to 4 `UInt32`s that will store the coordinates of the first CTAID in the canceled cluster. * ​result (`UnsafePointer[SIMD[uint64, 1], address_space=AddressSpace(3)]`): A pointer to 2 `UInt64`s that make up the cancellation request result to decode. --- ## clusterlaunchcontrol_query_cancel_is_canceled `clusterlaunchcontrol_query_cancel_is_canceled(result: UnsafePointer[SIMD[uint64, 1], address_space=AddressSpace(3)]) -> SIMD[uint32, 1]` Decodes the cancellation request. Only supported on NVIDIA SM100+ GPUs. **Args:** * ​result (`UnsafePointer[SIMD[uint64, 1], address_space=AddressSpace(3)]`): A pointer to 2 `UInt64`s that make up the cancellation request result to decode. **Returns:** True if the cancellation request is canceled, False otherwise. --- ## clusterlaunchcontrol_try_cancel `clusterlaunchcontrol_try_cancel[multicast: Bool = False](result: UnsafePointer[SIMD[uint64, 1], address_space=AddressSpace(3)], mbar: UnsafePointer[SIMD[int64, 1], address_space=AddressSpace(3)])` Requests to atomically cancel the cluster launch if it has not started running yet. Only supported on NVIDIA SM100+ GPUs. **Args:** * ​result (`UnsafePointer[SIMD[uint64, 1], address_space=AddressSpace(3)]`): A pointer to 2 `UInt64`s (16B aligned) that will store the result of the cancellation request. * ​mbar (`UnsafePointer[SIMD[int64, 1], address_space=AddressSpace(3)]`): A pointer to an `Int64` (8B aligned) memory barrier state. --- ## elect_one_sync `elect_one_sync() -> Bool` Elects a single thread within a warp to perform an operation. Note: * Only supported on NVIDIA SM90+ GPUs. * Maps directly to the `elect.sync` instruction in CUDA PTX. * Useful for having a single thread perform an operation while maintaining warp synchronization. **Returns:** True for the elected thread, False for all other threads in the warp. --- ## elect_one_sync_with_mask `elect_one_sync_with_mask(mask: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](4294967295)) -> Bool` Elects a single thread within a warp to perform an operation. Note: * Only supported on NVIDIA SM90+ GPUs. * Maps directly to the `elect.sync` instruction in CUDA PTX. * Useful for having a single thread perform an operation while maintaining warp synchronization. **Args:** * ​mask (`SIMD[uint32, 1]`): The mask to use for the election. Defaults to 0xFFFFFFFF. **Returns:** True for the elected thread, False for all other threads in the warp. --- ## cluster This module provides low-level NVIDIA GPU cluster synchronization primitives for SM90+ architectures. 
The module implements thread block cluster operations that enable efficient communication and synchronization between thread blocks (CTAs) within a cluster on NVIDIA Hopper architecture and newer GPUs. All functions are constrained to NVIDIA SM90+ GPUs and will raise an error if used on unsupported hardware. Note: These are low-level primitives that correspond directly to PTX/NVVM instructions and should be used with careful consideration of the underlying hardware synchronization mechanisms. ## Functions * [​`block_rank_in_cluster`](/mojo/stdlib/gpu/cluster/block_rank_in_cluster): Returns the unique identifier (rank) for the current thread block within its cluster. * [​`cluster_arrive`](/mojo/stdlib/gpu/cluster/cluster_arrive): Signals arrival at a cluster synchronization point with memory ordering guarantees. * [​`cluster_arrive_relaxed`](/mojo/stdlib/gpu/cluster/cluster_arrive_relaxed): Signals arrival at a cluster synchronization point with relaxed memory ordering. * [​`cluster_sync`](/mojo/stdlib/gpu/cluster/cluster_sync): Performs a full cluster synchronization with memory ordering guarantees. * [​`cluster_sync_acquire`](/mojo/stdlib/gpu/cluster/cluster_sync_acquire): Acquires the cluster sync proxy. * [​`cluster_sync_relaxed`](/mojo/stdlib/gpu/cluster/cluster_sync_relaxed): Performs a full cluster synchronization with relaxed memory ordering. * [​`cluster_sync_release`](/mojo/stdlib/gpu/cluster/cluster_sync_release): Releases the cluster sync proxy. * [​`cluster_wait`](/mojo/stdlib/gpu/cluster/cluster_wait): Waits for all thread blocks in the cluster to arrive at the synchronization point. * [​`clusterlaunchcontrol_query_cancel_get_first_ctaid`](/mojo/stdlib/gpu/cluster/clusterlaunchcontrol_query_cancel_get_first_ctaid): Decodes the cancellation request. * [​`clusterlaunchcontrol_query_cancel_get_first_ctaid_v4`](/mojo/stdlib/gpu/cluster/clusterlaunchcontrol_query_cancel_get_first_ctaid_v4): Decodes the cancellation request. * [​`clusterlaunchcontrol_query_cancel_is_canceled`](/mojo/stdlib/gpu/cluster/clusterlaunchcontrol_query_cancel_is_canceled): Decodes the cancellation request. * [​`clusterlaunchcontrol_try_cancel`](/mojo/stdlib/gpu/cluster/clusterlaunchcontrol_try_cancel): Requests to atomically cancel the cluster launch if it has not started running yet. * [​`elect_one_sync`](/mojo/stdlib/gpu/cluster/elect_one_sync): Elects a single thread within a warp to perform an operation. * [​`elect_one_sync_with_mask`](/mojo/stdlib/gpu/cluster/elect_one_sync_with_mask): Elects a single thread within a warp to perform an operation. --- ## allgather `allgather[type: DType, rank: Int, ngpus: Int, //](input_buffers: InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus], output_buffers: InlineArray[NDBuffer[type, rank, MutableAnyOrigin], (ngpus * ngpus)], ctxs: List[DeviceContext])` Performs all-gather across GPUs with variadic output. Each device receives individual copies of all input buffers. **Parameters:** * ​type (`DType`): The data type of tensor elements. * ​rank (`Int`): Number of dimensions in input tensors. * ​ngpus (`Int`): Number of GPUs participating in the all-gather. **Args:** * ​input\_buffers (`InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus]`): Input buffers from each GPU. * ​output\_buffers (`InlineArray[NDBuffer[type, rank, MutableAnyOrigin], (ngpus * ngpus)]`): Flat array of ngpus \* ngpus output buffers. Layout: output\_buffers\[device\_idx \* ngpus + input\_idx] contains device\_idx's copy of input\_idx's data.
* ​ctxs (`List[DeviceContext]`): List of device contexts for participating GPUs. --- ## allgather Multi-GPU allgather implementation that gathers values from multiple GPUs into an output buffer. ## Functions * [​`allgather`](/mojo/stdlib/gpu/comm/allgather/allgather): Performs all-gather across GPUs with variadic output. --- ## Signal `@register_passable(trivial)` `struct Signal` A synchronization primitive for coordinating GPU thread blocks across multiple devices. This struct provides counter-based synchronization between thread blocks on different GPUs. It maintains two sets of counters: 1. self\_counter: Used by blocks on the current GPU to signal their progress 2. peer\_counter: Used to track progress of blocks on other GPUs Note: The counters use unsigned integers that may overflow, but this is safe since unsigned integer overflow has well-defined behavior. ## Fields * ​self\_counter (`StaticTuple[StaticTuple[SIMD[uint32, 1], 8], 512]`): A 2D array of counters with shape (MAX\_NUM\_BLOCKS\_UPPER\_BOUND, MAX\_GPUS). Each counter tracks the progress of a specific thread block on the current GPU. Thread blocks increment their corresponding counter to signal completion of a phase, allowing other GPUs to detect when synchronization points are reached. The counters use atomic operations to ensure proper synchronization across devices. * ​peer\_counter (`StaticTuple[StaticTuple[StaticTuple[SIMD[uint32, 1], 8], 512], 2]`): A 3D array of counters with shape (2, MAX\_NUM\_BLOCKS\_UPPER\_BOUND, MAX\_GPUS). Contains two sets of counters to handle two synchronization points safely. The dual counter design prevents race conditions where a peer block arrives at the second sync point before the current block passes the first sync point. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` --- ## allreduce `allreduce[type: DType, rank: Int, ngpus: Int, outputs_lambda: fn[Int, DType, Int, Int, Int](IndexList[$2], SIMD[$1, $3]) capturing -> None, pdl_level: PDLLevel = PDLLevel()](input_buffers: InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus], output_buffers: InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus], rank_sigs: InlineArray[UnsafePointer[Signal], 8], ctxs: List[DeviceContext], _max_num_blocks: Optional[Int] = Optional(None))` Performs an allreduce operation across multiple GPUs. This function serves as the main entry point for performing allreduce operations across multiple GPUs. It automatically selects between two implementations: * A peer-to-peer (P2P) based implementation when P2P access is possible between GPUs * A naive implementation as fallback when P2P access is not available The allreduce operation combines values from all GPUs using element-wise addition and distributes the result back to all GPUs. Note: * Input and output buffers must have identical shapes across all GPUs. * The number of elements must be identical across all input/output buffers. * Performance is typically better with P2P access enabled between GPUs. **Parameters:** * ​type (`DType`): The data type of the tensor elements (e.g. DType.float32). * ​rank (`Int`): The number of dimensions in the input/output tensors. * ​ngpus (`Int`): The number of GPUs participating in the allreduce. * ​outputs\_lambda (`fn[Int, DType, Int, Int, Int](IndexList[$2], SIMD[$1, $3]) capturing -> None`): An output elementwise lambda. * ​pdl\_level (`PDLLevel`): Control PDL behavior for the kernel. 
**Args:** * ​input\_buffers (`InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus]`): Array of input tensors from each GPU, one per GPU. * ​output\_buffers (`InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus]`): Array of output tensors for each GPU to store results. * ​rank\_sigs (`InlineArray[UnsafePointer[Signal], 8]`): Array of Signal pointers used for cross-GPU synchronization. * ​ctxs (`List[DeviceContext]`): List of device contexts for each participating GPU. * ​\_max\_num\_blocks (`Optional[Int]`): Optional maximum number of blocks used to compute grid configuration. If not passed, a dispatch table sets the grid configuration. --- ## can_enable_p2p `can_enable_p2p(ctxs: List[DeviceContext]) -> Bool` If peer-to-peer access is supported, enables it between all GPU pairs. **Args:** * ​ctxs (`List[DeviceContext]`): List of device contexts representing different GPUs. **Returns:** True if P2P access is possible between all GPU pairs, False otherwise. --- ## allreduce Multi-GPU allreduce implementation for efficient tensor reduction across GPUs. This module provides an optimized implementation of allreduce operations across multiple GPUs, supporting both peer-to-peer (P2P) and non-P2P communication patterns. The implementation automatically selects between two approaches based on hardware capabilities: 1. P2P-based implementation (when P2P access is available): * Uses direct GPU-to-GPU memory access for better performance * Implements both single-stage and two-stage algorithms: * Single-stage for latency-bound transfers (small tensors) * Two-stage (reduce-scatter + all-gather) for bandwidth-bound transfers (large tensors) * Optimized for NVLink bandwidth utilization * Uses vectorized memory access and higher precision accumulation 2. Non-P2P fallback implementation: * Copies data through host memory when direct GPU access isn't possible * Simple but functional approach for systems without P2P support The implementation is tuned for common GPU architectures (A100, H100) and includes parameters that can be adjusted for different hardware configurations. Limitations: * Number of elements must be a multiple of SIMD width * Maximum of 8 GPUs supported * All input/output buffers must have identical shapes ## Aliases ### `elementwise_epilogue_type` `alias elementwise_epilogue_type = fn[Int, DType, Int, Int, Int](IndexList[$2], SIMD[$1, $3]) capturing -> None` ### `MAX_GPUS` `alias MAX_GPUS = 8` Maximum number of GPUs supported in the allreduce implementation. This constant sets the upper bound for the number of GPUs supported in this algorithm. ### `MAX_NUM_BLOCKS_UPPER_BOUND` `alias MAX_NUM_BLOCKS_UPPER_BOUND = 512` Maximum number of thread blocks to use for reduction kernels. This value has been empirically optimized through grid search across different GPU architectures. While this value is optimal for A100 GPUs, H100 GPUs may benefit from more blocks to fully saturate NVLink bandwidth. ## Structs * [​`Signal`](/mojo/stdlib/gpu/comm/allreduce/Signal): A synchronization primitive for coordinating GPU thread blocks across multiple devices. ## Functions * [​`allreduce`](/mojo/stdlib/gpu/comm/allreduce/allreduce): Performs an allreduce operation across multiple GPUs. * [​`can_enable_p2p`](/mojo/stdlib/gpu/comm/allreduce/can_enable_p2p): If peer-to-peer access is supported, enables it between all GPU pairs. --- ## comm The `gpu.comm` package provides communication primitives for GPUs.
This package includes functions for sending and receiving data between GPUs, as well as for synchronizing threads across GPUs. ## Modules * [​`allgather`](/mojo/stdlib/gpu/comm/allgather/): Multi-GPU allgather implementation that gathers values from multiple GPUs into an output buffer. * [​`allreduce`](/mojo/stdlib/gpu/comm/allreduce/): Multi-GPU allreduce implementation for efficient tensor reduction across GPUs. --- ## globals This module provides GPU-specific global constants and configuration values. The module defines hardware-specific constants like warp size and thread block limits that are used throughout the GPU programming interface. It handles both NVIDIA and AMD GPU architectures, automatically detecting and configuring the appropriate values based on the available hardware. The constants are resolved at compile time based on the target GPU architecture and are used to optimize code generation and ensure hardware compatibility. ## Aliases ### `MAX_THREADS_PER_BLOCK_METADATA` `alias MAX_THREADS_PER_BLOCK_METADATA = _resolve_max_threads_per_block_metadata()` This is a metadata tag used in conjunction with \_\_llvm\_metadata to give the compiler a hint about the maximum number of threads per block in use. ### `WARP_SIZE` `alias WARP_SIZE = _resolve_warp_size()` The number of threads that execute in lockstep within a warp on the GPU. This constant represents the hardware warp size, which is the number of threads that execute instructions synchronously as a unit. The value is architecture-dependent: * 32 threads per warp on NVIDIA GPUs * 64 threads per warp on AMD GPUs * 0 if no GPU is detected The warp size is a fundamental parameter that affects: * Thread scheduling and execution * Memory access coalescing * Synchronization primitives * Overall performance optimization --- ## PDL `struct PDL` Programmatic Dependency Launch (PDL) control structure. This struct provides a way to manage programmatic stream serialization on NVIDIA GPUs. It includes functions for launching dependent grids and waiting for them to complete. Note: * Only supported on NVIDIA SM90+ (Hopper architecture and newer) GPUs. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initialize the PDL control structure. ### `__enter__` `__enter__(self)` Launch dependent grids that were previously configured to depend on the current grid. ### `__exit__` `__exit__(self)` Wait for all dependent grids launched by this grid to complete execution. --- ## PDLLevel `@register_passable(trivial)` `struct PDLLevel` Programmatic Dependency Launch (PDL) level. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Aliases ### `NO_WAIT_OVERLAP_AT_END` `alias NO_WAIT_OVERLAP_AT_END = PDLLevel(3)` ### `OFF` `alias OFF = PDLLevel(0)` ### `OVERLAP_AT_BEGINNING` `alias OVERLAP_AT_BEGINNING = PDLLevel(2)` ### `OVERLAP_AT_END` `alias OVERLAP_AT_END = PDLLevel(1)` ## Methods ### `__init__` `__init__() -> Self` Initialize the PDL level to OFF. `__init__(level: Int) -> Self` Initialize the PDL level. **Args:** * ​level (`Int`): The PDL level to initialize. ### `__eq__` `__eq__(self, other: Self) -> Bool` Check if the PDL level is equal to another PDL level. **Args:** * ​other (`Self`): The other PDL level to compare against. **Returns:** True if the PDL level is equal to the other PDL level, False otherwise. `__eq__(self, other: Int) -> Bool` Check if the PDL level is equal to another PDL level.
**Args:** * ​other (`Int`): The other PDL level to compare against. **Returns:** True if the PDL level is equal to the other PDL level, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Check if the PDL level is not equal to another PDL level. **Args:** * ​other (`Self`): The other PDL level to compare against. **Returns:** True if the PDL level is not equal to the other PDL level, False otherwise. ### `__gt__` `__gt__(self, other: Self) -> Bool` Check if the PDL level is greater than another PDL level. **Args:** * ​other (`Self`): The other PDL level to compare against. **Returns:** True if the PDL level is greater than the other PDL level, False otherwise. ### `__ge__` `__ge__(self, other: Self) -> Bool` Check if the PDL level is greater than or equal to another PDL level. **Args:** * ​other (`Self`): The other PDL level to compare against. **Returns:** True if the PDL level is greater than or equal to the other PDL level, False otherwise. --- ## grid_controls Grid Dependent Control primitives for NVIDIA Hopper (SM90+) GPUs. This module provides low-level primitives for managing grid dependencies on NVIDIA Hopper architecture and newer GPUs. It enables efficient orchestration of multi-grid workloads by allowing grids to launch dependent grids and synchronize with them. The module includes functions that map directly to CUDA grid dependency control instructions, providing fine-grained control over grid execution order: * `launch_dependent_grids()`: Triggers execution of grids that depend on the current grid * `wait_on_dependent_grids()`: Blocks until all dependent grids complete execution These primitives are essential for implementing complex GPU execution pipelines where multiple kernels need to execute in a specific order with minimal overhead. They eliminate the need for host-side synchronization when orchestrating dependent GPU work. ## Structs * [​`PDL`](/mojo/stdlib/gpu/grid_controls/PDL): Programmatic Dependency Launch (PDL) control structure. * [​`PDLLevel`](/mojo/stdlib/gpu/grid_controls/PDLLevel): Programmatic Dependency Launch (PDL) level. ## Functions * [​`launch_dependent_grids`](/mojo/stdlib/gpu/grid_controls/launch_dependent_grids): Launches dependent grids that were previously configured to depend on the current grid. * [​`wait_on_dependent_grids`](/mojo/stdlib/gpu/grid_controls/wait_on_dependent_grids): Waits for all dependent grids launched by this grid to complete execution. --- ## launch_dependent_grids `launch_dependent_grids()` Launches dependent grids that were previously configured to depend on the current grid. This function triggers the execution of dependent grids that have been configured with a dependency on the current grid. It maps directly to the CUDA grid dependency control instruction for launching dependent grids. Note: * Only supported on NVIDIA SM90+ (Hopper architecture and newer) GPUs. * Must be called by all threads in a thread block to avoid undefined behavior. * Typically used in multi-grid pipeline scenarios where one grid's completion should trigger the execution of other grids. --- ## wait_on_dependent_grids `wait_on_dependent_grids()` Waits for all dependent grids launched by this grid to complete execution. This function blocks the calling grid until all dependent grids that were launched by this grid have completed their execution. It provides a synchronization point between parent and child grids in a multi-grid dependency chain. Note: * Only supported on NVIDIA SM90+ (Hopper architecture and newer) GPUs.
* Must be called by all threads in a thread block to avoid undefined behavior. * Can be used to ensure dependent grid work is complete before proceeding with subsequent operations in the parent grid. --- ## ConstantMemoryMapping `@register_passable(trivial)` `struct ConstantMemoryMapping` Represents a mapping of constant memory between host and device. This struct encapsulates the information needed to manage constant memory that can be accessed by GPU kernels. Constant memory provides a fast, read-only cache accessible by all threads on the GPU device. Attributes: name: A string identifier for the constant memory mapping. ptr: Pointer to the memory location. byte\_count: Size of the memory mapping in bytes. ## Fields * ​name (`StringSlice[StaticConstantOrigin]`): A string identifier for the constant memory mapping. This name is used to uniquely identify the constant memory region in the GPU programming model, allowing the runtime to properly associate the memory with kernel references to constant memory symbols. * ​ptr (`UnsafePointer[NoneType]`): Pointer to the host memory location that will be mapped to device constant memory. This raw pointer represents the starting address of the memory region that will be accessible as constant memory on the GPU. The memory should remain valid for the lifetime of any kernels that access it. * ​byte\_count (`Int`): Size of the memory mapping in bytes. Specifies the total size of the constant memory region. This value is used by the runtime to determine how much data to transfer between host and device. The size must be sufficient to hold all data needed by GPU kernels. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` --- ## constant_memory_mapping This module provides functionality for mapping constant memory between host and device. The module includes the `ConstantMemoryMapping` struct which represents a mapping of constant memory that can be used for efficient data transfer between host and GPU device. ## Structs * [​`ConstantMemoryMapping`](/mojo/stdlib/gpu/host/constant_memory_mapping/ConstantMemoryMapping): Represents a mapping of constant memory between host and device. --- ## DeviceAttribute `@register_passable(trivial)` `struct DeviceAttribute` Represents CUDA device attributes that can be queried from a GPU device. This struct encapsulates the various device properties and capabilities that can be queried through the CUDA driver API. Each attribute is represented as a constant with a corresponding integer value that maps to the CUDA driver's attribute enum. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `CLOCK_RATE` `alias CLOCK_RATE = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](13))` Typical clock frequency in kilohertz ### `COMPUTE_CAPABILITY_MAJOR` `alias COMPUTE_CAPABILITY_MAJOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](75))` Major compute capability version number ### `COMPUTE_CAPABILITY_MINOR` `alias COMPUTE_CAPABILITY_MINOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](76))` Minor compute capability version number ### `MAX_ACCESS_POLICY_WINDOW_SIZE` `alias MAX_ACCESS_POLICY_WINDOW_SIZE = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](109))` CUDA-only: Maximum value of CUaccessPolicyWindow::num\_bytes. 
### `MAX_BLOCK_DIM_X` `alias MAX_BLOCK_DIM_X = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](2))` Maximum block dimension X ### `MAX_BLOCK_DIM_Y` `alias MAX_BLOCK_DIM_Y = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](3))` Maximum block dimension Y ### `MAX_BLOCK_DIM_Z` `alias MAX_BLOCK_DIM_Z = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](4))` Maximum block dimension Z ### `MAX_BLOCKS_PER_MULTIPROCESSOR` `alias MAX_BLOCKS_PER_MULTIPROCESSOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](106))` Maximum resident blocks per multiprocessor ### `MAX_GRID_DIM_X` `alias MAX_GRID_DIM_X = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](5))` Maximum grid dimension X ### `MAX_GRID_DIM_Y` `alias MAX_GRID_DIM_Y = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](6))` Maximum grid dimension Y ### `MAX_GRID_DIM_Z` `alias MAX_GRID_DIM_Z = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](7))` Maximum grid dimension Z ### `MAX_REGISTERS_PER_BLOCK` `alias MAX_REGISTERS_PER_BLOCK = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](12))` Maximum number of 32-bit registers available per block ### `MAX_REGISTERS_PER_MULTIPROCESSOR` `alias MAX_REGISTERS_PER_MULTIPROCESSOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](82))` Maximum number of 32-bit registers available per multiprocessor ### `MAX_SHARED_MEMORY_PER_BLOCK` `alias MAX_SHARED_MEMORY_PER_BLOCK = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](8))` Maximum shared memory available per block in bytes ### `MAX_SHARED_MEMORY_PER_MULTIPROCESSOR` `alias MAX_SHARED_MEMORY_PER_MULTIPROCESSOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](81))` Maximum shared memory available per multiprocessor in bytes ### `MAX_THREADS_PER_BLOCK` `alias MAX_THREADS_PER_BLOCK = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](1))` Maximum number of threads per block ### `MAX_THREADS_PER_MULTIPROCESSOR` `alias MAX_THREADS_PER_MULTIPROCESSOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](39))` Maximum resident threads per multiprocessor ### `MULTIPROCESSOR_COUNT` `alias MULTIPROCESSOR_COUNT = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](16))` Number of multiprocessors on device ### `WARP_SIZE` `alias WARP_SIZE = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](10))` Warp size in threads --- ## device_attribute This module defines GPU device attributes that can be queried from CUDA-compatible devices. The module provides the `DeviceAttribute` struct which encapsulates the various device properties and capabilities that can be queried through the CUDA driver API. Each attribute is represented as a constant with a corresponding integer value that maps to the CUDA driver's attribute enumeration. These attributes allow applications to query specific hardware capabilities and limitations of GPU devices, such as maximum thread counts, memory sizes, compute capabilities, and supported features. ## Structs * [​`DeviceAttribute`](/mojo/stdlib/gpu/host/device_attribute/DeviceAttribute): Represents CUDA device attributes that can be queried from a GPU device. --- ## DeviceBuffer `struct DeviceBuffer[type: DType]` Represents a block of device-resident storage. For GPU devices, a device buffer is allocated in the device's global memory. To allocate a `DeviceBuffer`, use one of the methods provided by `DeviceContext`, such as [`enqueue_create_buffer()`](/mojo/stdlib/gpu/host/device_context/DeviceContext#enqueue_create_buffer). 
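For example, this minimal sketch (the buffer length here is arbitrary) allocates a buffer and fills it with zeros; because `enqueue_fill()` returns the buffer, the two calls can be chained:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # Allocate a 1024-element float32 buffer in the device's global
    # memory and zero it on the context's stream.
    var buf = ctx.enqueue_create_buffer[DType.float32](1024).enqueue_fill(0)
    ctx.synchronize()
```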
## Parameters * ​type (`DType`): Data type to be stored in the buffer. ## Implemented traits `AnyType`, `Copyable`, `DevicePassable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `device_type` `alias device_type = UnsafePointer[SIMD[type, 1]]` DeviceBuffer types are remapped to UnsafePointer when passed to accelerator devices. ## Methods ### `__copyinit__` `__copyinit__(out self, existing: Self)` Creates a copy of an existing device buffer by incrementing its reference count. This copy constructor creates a new reference to the same underlying device buffer by incrementing the reference count of the native buffer object. Both the original and the copy will refer to the same memory on the device. **Args:** * ​existing (`Self`): The device buffer to copy. ### `__moveinit__` `__moveinit__(out self, owned existing: Self)` Initializes this buffer by taking ownership of an existing buffer. This move constructor transfers ownership of the device buffer from the existing instance to the new instance without incrementing the reference count. **Args:** * ​existing (`Self`): The buffer to move from, which will no longer be valid after this call. ### `__del__` `__del__(owned self)` Releases resources associated with this device buffer. This function schedules an owned buffer free using the stream in the device context. The actual deallocation may occur asynchronously after all operations using this buffer have completed. ### `get_type_name` `static get_type_name() -> String` Gets this type's name, for use in error messages when handing arguments to kernels. TODO: This will go away soon, when we get better error messages for kernel calls. **Returns:** This type's name. ### `get_device_type_name` `static get_device_type_name() -> String` Gets device\_type's name, for use in error messages when handing arguments to kernels. TODO: This will go away soon, when we get better error messages for kernel calls. **Returns:** This type's name. ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value. ### `__len__` `__len__(self) -> Int` Returns the number of elements in this buffer. This method calculates the number of elements by dividing the total byte size of the buffer by the size of each element. **Returns:** The number of elements in the buffer. ### `create_sub_buffer` `create_sub_buffer[view_type: DType](self, offset: Int, size: Int) -> DeviceBuffer[view_type]` Creates a sub-buffer view of this buffer with a different element type. This method creates a new buffer that references a subset of the memory in this buffer, potentially with a different element type. The sub-buffer shares the underlying memory with the original buffer. **Parameters:** * ​view\_type (`DType`): The data type for elements in the new sub-buffer. **Args:** * ​offset (`Int`): The starting offset in elements from the beginning of this buffer. * ​size (`Int`): The number of elements in the new sub-buffer. **Returns:** A new DeviceBuffer referencing the specified region with the specified element type. ### `enqueue_copy_to` `enqueue_copy_to(self, dst: Self)` Enqueues an asynchronous copy from this buffer to another device buffer. This method schedules a memory copy operation from this buffer to the destination buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​dst (`Self`): The destination device buffer to copy data to. 
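For example, a device-to-device copy might look like this minimal sketch (buffer sizes are arbitrary):

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # Create a source buffer filled with ones and an uninitialized
    # destination buffer of the same size.
    var src = ctx.enqueue_create_buffer[DType.float32](256).enqueue_fill(1)
    var dst = ctx.enqueue_create_buffer[DType.float32](256)
    # Schedule the copy on the context's stream, then wait for it.
    src.enqueue_copy_to(dst)
    ctx.synchronize()
```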
`enqueue_copy_to(self, dst: HostBuffer[type])` Enqueues an asynchronous copy from this buffer to a host buffer. This method schedules a memory copy operation from this buffer to the destination buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​dst (`HostBuffer[type]`): The destination host buffer to copy data to. `enqueue_copy_to(self, dst_ptr: UnsafePointer[SIMD[type, 1]])` Enqueues an asynchronous copy from this buffer to host memory. This method schedules a memory copy operation from this device buffer to the specified host memory location. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​dst\_ptr (`UnsafePointer[SIMD[type, 1]]`): Pointer to the destination host memory location. ### `enqueue_copy_from` `enqueue_copy_from(self, src: Self)` Enqueues an asynchronous copy to this buffer from another device buffer. This method schedules a memory copy operation to this buffer from the source buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​src (`Self`): The source device buffer to copy data from. `enqueue_copy_from(self, src: HostBuffer[type])` Enqueues an asynchronous copy to this buffer from a host buffer. This method schedules a memory copy operation to this buffer from the source buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​src (`HostBuffer[type]`): The source host buffer to copy data from. `enqueue_copy_from(self, src_ptr: UnsafePointer[SIMD[type, 1]])` Enqueues an asynchronous copy to this buffer from host memory. This method schedules a memory copy operation to this device buffer from the specified host memory location. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​src\_ptr (`UnsafePointer[SIMD[type, 1]]`): Pointer to the source host memory location. ### `enqueue_fill` `enqueue_fill(self, val: SIMD[type, 1]) -> Self` Enqueues an operation to fill this buffer with a specified value. This method schedules a memory set operation that fills the entire buffer with the specified value. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​val (`SIMD[type, 1]`): The value to fill the buffer with. **Returns:** Self reference for method chaining. ### `reassign_ownership_to` `reassign_ownership_to(self, ctx: DeviceContext)` Transfers ownership of this buffer to another device context. This method changes the device context that owns this buffer. This can be useful when sharing buffers between different contexts or when migrating workloads between devices. **Args:** * ​ctx (`DeviceContext`): The new device context to take ownership of this buffer. ### `take_ptr` `take_ptr(owned self) -> UnsafePointer[SIMD[type, 1]]` Takes ownership of the device pointer from this buffer. This method releases the device pointer from the buffer's control and returns it to the caller. After this call, the buffer no longer owns the pointer, and the caller is responsible for managing its lifecycle. **Returns:** The raw device pointer that was owned by this buffer. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[SIMD[type, 1]]` Returns the raw device pointer without transferring ownership. This method provides direct access to the underlying device pointer for advanced use cases. 
The buffer retains ownership of the pointer. **Returns:** The raw device pointer owned by this buffer. ### `context` `context(self) -> DeviceContext` Returns the device context associated with this buffer. This method retrieves the device context that owns this buffer and is responsible for managing its lifecycle and operations. **Returns:** The device context associated with this buffer. ### `map_to_host` `map_to_host(self, out mapped_buffer: _HostMappedBuffer[type])` Maps this device buffer to host memory for CPU access. This method creates a host-accessible view of the device buffer's contents. The mapping operation may involve copying data from device to host memory. Notes: Values modified inside the `with` statement are updated on the device when the `with` statement exits. Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
var length = 1024
var in_dev = ctx.enqueue_create_buffer[DType.float32](length)
var out_dev = ctx.enqueue_create_buffer[DType.float32](length)

# Initialize the input and output with known values.
with in_dev.map_to_host() as in_host, out_dev.map_to_host() as out_host:
    for i in range(length):
        in_host[i] = i
        out_host[i] = 255
```

**Returns:** A host-mapped buffer that provides CPU access to the device buffer's contents inside a with-statement. **Raises:** If there's an error during buffer creation or data transfer. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes a string representation of this buffer to the provided writer. This method formats the buffer's contents as a string and writes it to the specified writer. For large buffers, a compact representation is used. **Parameters:** * ​W (`Writer`): The writer type. **Args:** * ​writer (`W`): The writer to output the formatted string to. ### `__str__` `__str__(self) -> String` Returns a string representation of the `DeviceBuffer`. This method creates a human-readable string representation of the buffer's contents by mapping the device memory to host memory and formatting the elements. **Returns:** A string containing the formatted buffer contents. --- ## DeviceContext `@register_passable` `struct DeviceContext` Represents a single stream of execution on a particular accelerator (GPU). A `DeviceContext` serves as the low-level interface to the accelerator inside a MAX [custom operation](/max/custom-ops/) and provides methods for allocating buffers on the device, copying data between host and device, and for compiling and running functions (also known as kernels) on the device. The device context can be used as a [context manager](/mojo/manual/errors#use-a-context-manager).
For example:

```mojo
from gpu.host import DeviceContext
from gpu import thread_idx

fn kernel():
    print("hello from thread:", thread_idx.x, thread_idx.y, thread_idx.z)

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=(2, 2, 2))
    ctx.synchronize()
```

A custom operation receives an opaque `DeviceContextPtr`, which provides a `get_device_context()` method to retrieve the device context:

```mojo
from runtime.asyncrt import DeviceContextPtr

# Assumes the `kernel` function defined in the previous example and the
# MAX custom-op `register` decorator.
@register("custom_op")
struct CustomOp:
    @staticmethod
    fn execute(ctx_ptr: DeviceContextPtr) raises:
        var ctx = ctx_ptr.get_device_context()
        ctx.enqueue_function[kernel](grid_dim=1, block_dim=(2, 2, 2))
        ctx.synchronize()
```

## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `device_api` `alias device_api = from_name[::StringSlice[::Bool().api` Device API for the default accelerator (for example, "cuda" or "hip"). ### `device_info` `alias device_info = from_name[::StringSlice[::Bool()` `gpu.info.Info` object for the default accelerator. ## Methods ### `__init__` `__init__(out self, device_id: Int = 0, *, owned api: String = String(from_name[::StringSlice[::Bool()))` Constructs a `DeviceContext` for the specified device. This initializer creates a new device context for the specified accelerator device. The device context provides an interface for interacting with the GPU, including memory allocation, data transfer, and kernel execution. Example:

```mojo
from gpu.host import DeviceContext

# Create a context for the default GPU
var ctx = DeviceContext()

# Create a context for a specific GPU (device 1)
var ctx2 = DeviceContext(1)
```

**Args:** * ​device\_id (`Int`): ID of the accelerator device. If not specified, uses the default accelerator (device 0). * ​api (`String`): Requested device API (for example, "cuda" or "hip"). Defaults to the device API specified by the DeviceContext class. **Raises:** If device initialization fails or the specified device is not available. ### `__copyinit__` `__copyinit__(existing: Self) -> Self` Creates a copy of an existing device context by incrementing its reference count. This copy constructor creates a new reference to the same underlying device context by incrementing the reference count of the native context object. Both the original and the copy will refer to the same device context. **Args:** * ​existing (`Self`): The device context to copy. ### `__del__` `__del__(owned self)` Releases resources associated with this device context. This destructor decrements the reference count of the native device context. When the reference count reaches zero, the underlying resources are released, including any cached memory buffers and compiled device functions. ### `copy` `copy(self) -> Self` Explicitly constructs a copy of this device context. This method creates a new reference to the same underlying device context by incrementing the reference count of the native context object. **Returns:** A copy of this device context that refers to the same underlying context. ### `__enter__` `__enter__(owned self) -> Self` Enables the use of DeviceContext in a 'with' statement context manager. This method allows DeviceContext to be used with Python-style context managers, which ensures proper resource management and cleanup when the context exits.
Example:

```mojo
from gpu.host import DeviceContext

# Using DeviceContext as a context manager
with DeviceContext() as ctx:
    # Perform GPU operations here; resources are automatically
    # released when exiting the block.
    pass
```

**Returns:** The DeviceContext instance to be used within the context manager block. ### `name` `name(self) -> String` Returns the device name, an ASCII string identifying this device, defined by the native device API. This method queries the underlying GPU device for its name, which typically includes the model and other identifying information. This can be useful for logging, debugging, or making runtime decisions based on the specific GPU hardware. Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
print("Running on device:", ctx.name())
```

**Returns:** A string containing the device name. ### `api` `api(self) -> String` Returns the name of the API used to program the device. This method queries the underlying device context to determine which GPU programming API is being used for the current device. This information is useful for writing code that can adapt to different GPU architectures and programming models. Possible values are: * "cpu": Generic host device (CPU). * "cuda": NVIDIA GPUs. * "hip": AMD GPUs. Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
var api_name = ctx.api()
print("Using device API:", api_name)

# Conditionally execute code based on the API
if api_name == "cuda":
    print("Running on NVIDIA GPU")
elif api_name == "hip":
    print("Running on AMD GPU")
```

**Returns:** A string identifying the device API. ### `enqueue_create_buffer` `enqueue_create_buffer[type: DType](self, size: Int) -> DeviceBuffer[type]` Enqueues a buffer creation using the `DeviceBuffer` constructor. For GPU devices, the space is allocated in the device's global memory. **Parameters:** * ​type (`DType`): The data type to be stored in the allocated memory. **Args:** * ​size (`Int`): The number of elements of `type` to allocate memory for. **Returns:** The allocated buffer. ### `create_buffer_sync` `create_buffer_sync[type: DType](self, size: Int) -> DeviceBuffer[type]` Creates a buffer synchronously using the `DeviceBuffer` constructor. **Parameters:** * ​type (`DType`): The data type to be stored in the allocated memory. **Args:** * ​size (`Int`): The number of elements of `type` to allocate memory for. **Returns:** The allocated buffer. ### `enqueue_create_host_buffer` `enqueue_create_host_buffer[type: DType](self, size: Int) -> HostBuffer[type]` Enqueues the creation of a HostBuffer. This function allocates memory on the host that is accessible by the device. The memory is page-locked (pinned) for efficient data transfer between host and device. Pinned memory is guaranteed to remain resident in the host's RAM, not be paged/swapped out to disk. Memory allocated normally (for example, using [`UnsafePointer.alloc()`](/mojo/stdlib/memory/unsafe_ptr/UnsafePointer#alloc)) is pageable—individual pages of memory can be moved to secondary storage (disk/SSD) when main memory fills up. Using pinned memory allows devices to make fast transfers between host memory and device memory, because they can use direct memory access (DMA) to transfer data without relying on the CPU. Allocating too much pinned memory can cause performance issues, since it reduces the amount of memory available for other processes.
Example:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # Allocate host memory accessible by the device
    var host_buffer = ctx.enqueue_create_host_buffer[DType.float32](1024)
    # Use the host buffer for device operations
    # ...
```

**Parameters:** * ​type (`DType`): The data type to be stored in the allocated memory. **Args:** * ​size (`Int`): The number of elements of `type` to allocate memory for. **Returns:** A `HostBuffer` object that wraps the allocated host memory. **Raises:** If memory allocation fails or if the device context is invalid. ### `compile_function` `compile_function[func_type: AnyTrivialRegType, //, func: func_type, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(None), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])` Compiles the provided function for execution on this device. **Parameters:** * ​func\_type (`AnyTrivialRegType`): Type of the function. * ​func (`func_type`): The function to compile. * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). * ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`. **Args:** * ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size). **Returns:** The compiled function.
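As a sketch of typical usage (the same pattern shown in the `enqueue_function` examples below), compile a kernel once and reuse it across launches:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    # Compile once, then launch the compiled kernel repeatedly without
    # paying the per-enqueue compilation overhead.
    var compiled = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled, grid_dim=1, block_dim=1)
    ctx.synchronize()
```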
### `compile_function_unchecked` `compile_function_unchecked[func_type: AnyTrivialRegType, //, func: func_type, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(None), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])` Compiles the provided function for execution on this device. **Parameters:** * ​func\_type (`AnyTrivialRegType`): Type of the function. * ​func (`func_type`): The function to compile. * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). * ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`. **Args:** * ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size). **Returns:** The compiled function. ### `compile_function_checked` `compile_function_checked[func_type: AnyTrivialRegType, declared_arg_types: Variadic[AnyType], //, func: func_type, signature_func: fn(*args: *declared_arg_types) -> None, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(declared_arg_types), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])` Compiles the provided function for execution on this device. **Parameters:** * ​func\_type (`AnyTrivialRegType`): Type of the function. * ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function. * ​func (`func_type`): The function to compile. * ​signature\_func (`fn(*args: *declared_arg_types) -> None`): The function to compile, passed in again. Used for checking argument types later. Note: This will disappear in future versions.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). * ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`. **Args:** * ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size). **Returns:** The compiled function. `compile_function_checked[func_type: AnyTrivialRegType, declared_arg_types: Variadic[AnyType], //, func: func_type, signature_func: fn(*args: *declared_arg_types) capturing -> None, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(declared_arg_types), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])` Compiles the provided function for execution on this device. **Parameters:** * ​func\_type (`AnyTrivialRegType`): Type of the function. * ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function. * ​func (`func_type`): The function to compile. * ​signature\_func (`fn(*args: *declared_arg_types) capturing -> None`): The function to compile, passed in again. Used for checking argument types later. Note: This will disappear in future versions. * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). * ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`.
**Args:** * ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size). **Returns:** The compiled function. ### `compile_function_experimental` `compile_function_experimental[declared_arg_types: Variadic[AnyType], //, func: fn(*args: *declared_arg_types) -> None, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(declared_arg_types), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])` Compiles the provided function for execution on this device. **Parameters:** * ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function. * ​func (`fn(*args: *declared_arg_types) -> None`): The function to compile. * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). * ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`. **Args:** * ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size). **Returns:** The compiled function. `compile_function_experimental[declared_arg_types: Variadic[AnyType], //, func: fn(*args: *declared_arg_types) capturing -> None, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(declared_arg_types), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])` Compiles the provided function for execution on this device. **Parameters:** * ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function. * ​func (`fn(*args: *declared_arg_types) capturing -> None`): The function to compile.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). * ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`. **Args:** * ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size). **Returns:** The compiled function. ### `load_function` `load_function[func_type: AnyTrivialRegType, //, func: func_type](self, *, function_name: StringSlice[origin], asm: StringSlice[origin], func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceExternalFunction)` Loads a pre-compiled device function from assembly code. This method loads an external GPU function from provided assembly code (PTX/SASS) rather than compiling it from Mojo source. This is useful for integrating with existing CUDA/HIP code or for using specialized assembly optimizations. Example:

```mojo
from gpu.host import DeviceContext
from gpu.host.device_context import DeviceExternalFunction

fn func_signature(
    # Arguments being passed to the assembly code
    # e.g. two pointers and a length
    input: UnsafePointer[Float32],
    output: UnsafePointer[Float32],
    len: Int,
):
    # No body because that is passed as assembly code below.
    pass

var ctx = DeviceContext()
var ptx_code = "..."  # PTX assembly code
var ext_func = ctx.load_function[func_signature](
    function_name="my_kernel",
    asm=ptx_code,
)
```

**Parameters:** * ​func\_type (`AnyTrivialRegType`): The type of the function to load. * ​func (`func_type`): The function reference. **Args:** * ​function\_name (`StringSlice[origin]`): The name of the function in the assembly code. * ​asm (`StringSlice[origin]`): The assembly code (PTX/SASS) containing the function. * ​func\_attribute (`OptionalReg[FuncAttribute]`): Optional attribute to apply to the function (such as maximum shared memory size). **Returns:** The loaded function is stored in the `result` parameter. **Raises:** If loading the function fails or the assembly code is invalid.
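A loaded function is launched with `enqueue_function()` (see below). The following hypothetical continuation assumes the `ctx` and `ext_func` variables from the example above, and passes arguments matching `func_signature`:

```mojo
# Allocate device buffers for the kernel's pointer arguments.
var length = 1024
var input = ctx.enqueue_create_buffer[DType.float32](length)
var output = ctx.enqueue_create_buffer[DType.float32](length)

# Launch the loaded PTX kernel; the argument list must match the
# signature declared by `func_signature`.
ctx.enqueue_function(
    ext_func,
    input.unsafe_ptr(),
    output.unsafe_ptr(),
    length,
    grid_dim=4,
    block_dim=256,
)
ctx.synchronize()
```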
### `enqueue_function` `enqueue_function[func_type: AnyTrivialRegType, //, func: func_type, *Ts: AnyType, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))` Compiles and enqueues a kernel for execution on this device. You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile it first to remove the overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:** * ​func\_type (`AnyTrivialRegType`): The type of the function to launch. * ​func (`func_type`): The function to launch. * ​\*Ts (`AnyType`): The types of the arguments being passed to the function. * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). **Args:** * ​\*args (`*Ts`): Variadic arguments which are passed to the `func`. * ​grid\_dim (`Dim`): The grid dimensions. * ​block\_dim (`Dim`): The block dimensions. * ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions. * ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block. * ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes. * ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings. * ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum.
`enqueue_function[*Ts: AnyType](self, f: DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()))` Enqueues a compiled function for execution on this device. You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile the function first to remove the overhead:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:** * ​\*Ts (`AnyType`): Argument types. **Args:** * ​f (`DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose]`): The compiled function to execute. * ​\*args (`*Ts`): Arguments to pass to the function. * ​grid\_dim (`Dim`): Dimensions of the compute grid, made up of thread blocks. * ​block\_dim (`Dim`): Dimensions of each thread block in the grid. * ​cluster\_dim (`OptionalReg[Dim]`): Dimensions of clusters (if the thread blocks are grouped into clusters). * ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block. * ​attributes (`List[LaunchAttribute]`): Launch attributes. * ​constant\_memory (`List[ConstantMemoryMapping]`): Constant memory mapping. `enqueue_function[*Ts: AnyType](self, f: DeviceExternalFunction, *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()))` Enqueues an external device function for asynchronous execution on the GPU. This method schedules an external device function to be executed on the GPU with the specified execution configuration. The function and its arguments are passed to the underlying GPU runtime, which will execute them when resources are available. Example:

```mojo
from gpu.host import DeviceContext, Dim
from gpu.host.device_context import DeviceExternalFunction

# Create a device context and load an external function
with DeviceContext() as ctx:
    var ext_func = DeviceExternalFunction("my_kernel")

    # Enqueue the external function with execution configuration
    ctx.enqueue_function(
        ext_func,
        grid_dim=Dim(16),
        block_dim=Dim(256)
    )

    # Wait for completion
    ctx.synchronize()
```

**Parameters:** * ​\*Ts (`AnyType`): The types of the arguments to be passed to the device function. **Args:** * ​f (`DeviceExternalFunction`): The external device function to execute. * ​\*args (`*Ts`): The arguments to pass to the device function. * ​grid\_dim (`Dim`): The dimensions of the grid (number of thread blocks). * ​block\_dim (`Dim`): The dimensions of each thread block (number of threads per block).
* ​cluster\_dim (`OptionalReg[Dim]`): Optional dimensions for thread block clusters (for newer GPU architectures). * ​shared\_mem\_bytes (`OptionalReg[Int]`): Optional amount of dynamic shared memory to allocate per block. * ​attributes (`List[LaunchAttribute]`): Optional list of launch attributes for fine-grained control. * ​constant\_memory (`List[ConstantMemoryMapping]`): Optional list of constant memory mappings to use during execution. **Raises:** If there's an error enqueuing the function or if the function execution fails. ### `enqueue_function_unchecked` `enqueue_function_unchecked[func_type: AnyTrivialRegType, //, func: func_type, *Ts: AnyType, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))` Compiles and enqueues a kernel for execution on this device. You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile it first to remove the overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:** * ​func\_type (`AnyTrivialRegType`): The type of the function to launch. * ​func (`func_type`): The function to launch. * ​\*Ts (`AnyType`): The types of the arguments being passed to the function. * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). **Args:** * ​\*args (`*Ts`): Variadic arguments which are passed to the `func`. * ​grid\_dim (`Dim`): The grid dimensions. * ​block\_dim (`Dim`): The block dimensions. * ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions.
* ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block. * ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes. * ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings. * ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum. `enqueue_function_unchecked[*Ts: AnyType](self, f: DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()))` Enqueues a compiled function for execution on this device. You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile the function first to remove the overhead:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:** * ​\*Ts (`AnyType`): Argument types. **Args:** * ​f (`DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose]`): The compiled function to execute. * ​\*args (`*Ts`): Arguments to pass to the function. * ​grid\_dim (`Dim`): Dimensions of the compute grid, made up of thread blocks. * ​block\_dim (`Dim`): Dimensions of each thread block in the grid. * ​cluster\_dim (`OptionalReg[Dim]`): Dimensions of clusters (if the thread blocks are grouped into clusters). * ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block. * ​attributes (`List[LaunchAttribute]`): Launch attributes. * ​constant\_memory (`List[ConstantMemoryMapping]`): Constant memory mapping. ### `enqueue_function_checked` `enqueue_function_checked[*Ts: DevicePassable](self, f: DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()))` Enqueues a compiled function for execution on this device.
You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile the function first to remove the overhead:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:** * ​\*Ts (`DevicePassable`): Argument types. **Args:** * ​f (`DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose]`): The compiled function to execute. * ​\*args (`*Ts`): Arguments to pass to the function. * ​grid\_dim (`Dim`): Dimensions of the compute grid, made up of thread blocks. * ​block\_dim (`Dim`): Dimensions of each thread block in the grid. * ​cluster\_dim (`OptionalReg[Dim]`): Dimensions of clusters (if the thread blocks are grouped into clusters). * ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block. * ​attributes (`List[LaunchAttribute]`): Launch attributes. * ​constant\_memory (`List[ConstantMemoryMapping]`): Constant memory mapping. `enqueue_function_checked[func_type: AnyTrivialRegType, declared_arg_types: Variadic[AnyType], //, func: func_type, signature_func: fn(*args: *declared_arg_types) -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))` Compiles and enqueues a kernel for execution on this device. You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile it first to remove the overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:** * ​func\_type (`AnyTrivialRegType`): The type of the function to launch. * ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function. * ​func (`func_type`): The function to compile and launch.
* ​signature\_func (`fn(*args: *declared_arg_types) -> None`): The function to compile and launch, passed in again. Used for checking argument types later. Note: This will disappear in future versions. * ​\*actual\_arg\_types (`DevicePassable`): The types of the arguments being passed to the function. * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). **Args:** * ​\*args (`*actual_arg_types`): Variadic arguments which are passed to the `func`. * ​grid\_dim (`Dim`): The grid dimensions. * ​block\_dim (`Dim`): The block dimensions. * ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions. * ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block. * ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes. * ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings. * ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum. `enqueue_function_checked[func_type: AnyTrivialRegType, declared_arg_types: Variadic[AnyType], //, func: func_type, signature_func: fn(*args: *declared_arg_types) capturing -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))` Compiles and enqueues a kernel for execution on this device. This overload takes in a function that's `capturing`.
You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile it first to remove the overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:** * ​func\_type (`AnyTrivialRegType`): The type of the function to launch. * ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function. * ​func (`func_type`): The function to compile and launch. * ​signature\_func (`fn(*args: *declared_arg_types) capturing -> None`): The function to compile and launch, passed in again. Used for checking argument types later. Note: This will disappear in future versions. * ​\*actual\_arg\_types (`DevicePassable`): The types of the arguments being passed to the function. * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path. * ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`). **Args:** * ​\*args (`*actual_arg_types`): Variadic arguments which are passed to the `func`. * ​grid\_dim (`Dim`): The grid dimensions. * ​block\_dim (`Dim`): The block dimensions. * ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions. * ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block. * ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes. * ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings. * ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum.
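Given the `func` and `signature_func` parameters documented above, a minimal checked launch might look like the following sketch, which passes the kernel twice (once as the function to compile, once as the signature used for argument checking):

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    # The kernel is passed twice: as `func` and as `signature_func`.
    # Launch arguments are type-checked against the declared signature.
    ctx.enqueue_function_checked[kernel, kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```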
### `enqueue_function_experimental`

`enqueue_function_experimental[declared_arg_types: Variadic[AnyType], //, func: fn(*args: *declared_arg_types) -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))`

Compiles and enqueues a kernel for execution on this device.

You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

Each enqueue of an uncompiled function incurs 50-500 nanoseconds of compilation overhead, so if you are reusing the same function and parameters multiple times, compile it once first to remove that overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:**

* ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function.
* ​func (`fn(*args: *declared_arg_types) -> None`): The function to compile and launch.
* ​\*actual\_arg\_types (`DevicePassable`): The types of the arguments being passed to the function.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`).

**Args:**

* ​\*args (`*actual_arg_types`): Variadic arguments which are passed to the `func`.
* ​grid\_dim (`Dim`): The grid dimensions.
* ​block\_dim (`Dim`): The block dimensions.
* ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions.
* ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory to allocate per thread block, in bytes.
* ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes.
* ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings.
* ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum.
`enqueue_function_experimental[declared_arg_types: Variadic[AnyType], //, func: fn(*args: *declared_arg_types) capturing -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(, Tuple()), owned constant_memory: List[ConstantMemoryMapping] = List(, Tuple()), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))`

Compiles and enqueues a kernel for execution on this device. This overload takes in a function that's `capturing`.

You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

Each enqueue of an uncompiled function incurs 50-500 nanoseconds of compilation overhead, so if you are reusing the same function and parameters multiple times, compile it once first to remove that overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:**

* ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function.
* ​func (`fn(*args: *declared_arg_types) capturing -> None`): The function to compile and launch.
* ​\*actual\_arg\_types (`DevicePassable`): The types of the arguments being passed to the function.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`).

**Args:**

* ​\*args (`*actual_arg_types`): Variadic arguments which are passed to the `func`.
* ​grid\_dim (`Dim`): The grid dimensions.
* ​block\_dim (`Dim`): The block dimensions.
* ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions.
* ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory to allocate per thread block, in bytes.
* ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes.
* ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings.
* ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum.
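The experimental overloads infer the declared argument types directly from `func`, so the kernel is passed only once. A minimal sketch under the same assumptions as the checked example above (hypothetical `fill` kernel, assumed `enqueue_create_buffer()` call, and assumed buffer-to-pointer lowering):

```mojo
from gpu.host import DeviceContext
from gpu.id import thread_idx

fn fill(data: UnsafePointer[Float32], value: Float32):
    # Each thread writes one element.
    data[thread_idx.x] = value

with DeviceContext() as ctx:
    # Assumed allocation API from this reference.
    var buf = ctx.enqueue_create_buffer[DType.float32](64)
    # The kernel appears once; argument types are checked against its signature.
    ctx.enqueue_function_experimental[fill](
        buf, Float32(1.0), grid_dim=1, block_dim=64
    )
    ctx.synchronize()
```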
### `execution_time`

`execution_time[: origin.set, //, func: fn(DeviceContext) raises capturing -> None](self, num_iters: Int) -> Int`

Measures the execution time of a function that takes a DeviceContext parameter. This method times the execution of a provided function that requires the DeviceContext as a parameter. It runs the function for the specified number of iterations and returns the total elapsed time in nanoseconds.

Example:

```mojo
from gpu.host import DeviceContext

fn gpu_operation(ctx: DeviceContext) raises:
    # Perform some GPU operation using ctx
    pass

with DeviceContext() as ctx:
    # Measure execution time of a function that uses the context
    var time_ns = ctx.execution_time[gpu_operation](10)
    print("Execution time for 10 iterations:", time_ns, "ns")
```

**Parameters:**

* ​func (`fn(DeviceContext) raises capturing -> None`): A function that takes a DeviceContext parameter to execute and time.

**Args:**

* ​num\_iters (`Int`): The number of iterations to run the function.

**Returns:** The total elapsed time in nanoseconds for all iterations.

**Raises:** If the timer operations fail or if the function raises an exception.

`execution_time[: origin.set, //, func: fn() raises capturing -> None](self, num_iters: Int) -> Int`

Measures the execution time of a function over multiple iterations. This method times the execution of a provided function that doesn't require the DeviceContext as a parameter. It runs the function for the specified number of iterations and returns the total elapsed time in nanoseconds.

Example:

```mojo
from gpu.host import DeviceContext

fn some_gpu_operation() raises:
    # Perform some GPU operation
    pass

with DeviceContext() as ctx:
    # Measure execution time of a function
    var time_ns = ctx.execution_time[some_gpu_operation](10)
    print("Execution time:", time_ns, "ns")
```

**Parameters:**

* ​func (`fn() raises capturing -> None`): A function with no parameters to execute and time.

**Args:**

* ​num\_iters (`Int`): The number of iterations to run the function.

**Returns:** The total elapsed time in nanoseconds for all iterations.

**Raises:** If the timer operations fail or if the function raises an exception.

### `execution_time_iter`

`execution_time_iter[: origin.set, //, func: fn(DeviceContext, Int) raises capturing -> None](self, num_iters: Int) -> Int`

Measures the execution time of a function that takes iteration index as input. This method times the execution of a provided function that requires both the DeviceContext and the current iteration index as parameters. It runs the function for the specified number of iterations, passing the iteration index to each call, and returns the total elapsed time in nanoseconds.

Example:

```mojo
from gpu.host import DeviceContext, DeviceFunction, Dim

var my_kernel = DeviceFunction(...)

fn benchmark_kernel(ctx: DeviceContext, i: Int) raises:
    # Run kernel with different parameters based on iteration
    ctx.enqueue_function(my_kernel, grid_dim=Dim(i), block_dim=Dim(256))

with DeviceContext() as ctx:
    # Measure execution time with iteration awareness
    var time_ns = ctx.execution_time_iter[benchmark_kernel](10)
    print("Total execution time:", time_ns, "ns")
```

**Parameters:**

* ​func (`fn(DeviceContext, Int) raises capturing -> None`): A function that takes the DeviceContext and an iteration index.

**Args:**

* ​num\_iters (`Int`): The number of iterations to run the function.

**Returns:** The total elapsed time in nanoseconds for all iterations.
**Raises:** If the timer operations fail or if the function raises an exception.

### `enqueue_copy`

`enqueue_copy[type: DType](self, dst_buf: DeviceBuffer[type], src_ptr: UnsafePointer[SIMD[type, 1]])`

Enqueues an async copy from the host to the provided device buffer. The number of bytes copied is determined by the size of the device buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_buf (`DeviceBuffer[type]`): Device buffer to copy to.
* ​src\_ptr (`UnsafePointer[SIMD[type, 1]]`): Host pointer to copy from.

`enqueue_copy[type: DType](self, dst_buf: HostBuffer[type], src_ptr: UnsafePointer[SIMD[type, 1]])`

Enqueues an async copy from host memory to the provided host buffer. The number of bytes copied is determined by the size of the host buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_buf (`HostBuffer[type]`): Host buffer to copy to.
* ​src\_ptr (`UnsafePointer[SIMD[type, 1]]`): Host pointer to copy from.

`enqueue_copy[type: DType](self, dst_ptr: UnsafePointer[SIMD[type, 1]], src_buf: DeviceBuffer[type])`

Enqueues an async copy from the device to the host. The number of bytes copied is determined by the size of the device buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_ptr (`UnsafePointer[SIMD[type, 1]]`): Host pointer to copy to.
* ​src\_buf (`DeviceBuffer[type]`): Device buffer to copy from.

`enqueue_copy[type: DType](self, dst_ptr: UnsafePointer[SIMD[type, 1]], src_buf: HostBuffer[type])`

Enqueues an async copy from a host buffer to the given host pointer. The number of bytes copied is determined by the size of the host buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_ptr (`UnsafePointer[SIMD[type, 1]]`): Host pointer to copy to.
* ​src\_buf (`HostBuffer[type]`): Host buffer to copy from.

`enqueue_copy[type: DType](self, dst_ptr: UnsafePointer[SIMD[type, 1]], src_ptr: UnsafePointer[SIMD[type, 1]], size: Int)`

Enqueues an async copy of `size` elements from a device pointer to another device pointer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_ptr (`UnsafePointer[SIMD[type, 1]]`): Device pointer to copy to.
* ​src\_ptr (`UnsafePointer[SIMD[type, 1]]`): Device pointer to copy from.
* ​size (`Int`): Number of elements (of the specified `DType`) to copy.

`enqueue_copy[type: DType](self, dst_buf: DeviceBuffer[type], src_buf: DeviceBuffer[type])`

Enqueues an async copy from one device buffer to another. The amount of data transferred is determined by the size of the destination buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_buf (`DeviceBuffer[type]`): Device buffer to copy to.
* ​src\_buf (`DeviceBuffer[type]`): Device buffer to copy from. Must be at least as large as `dst_buf`.

`enqueue_copy[type: DType](self, dst_buf: DeviceBuffer[type], src_buf: HostBuffer[type])`

Enqueues an async copy from a host buffer to a device buffer. The amount of data transferred is determined by the size of the destination buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_buf (`DeviceBuffer[type]`): Device buffer to copy to.
* ​src\_buf (`HostBuffer[type]`): Host buffer to copy from. Must be at least as large as `dst_buf`.

`enqueue_copy[type: DType](self, dst_buf: HostBuffer[type], src_buf: DeviceBuffer[type])`

Enqueues an async copy from a device buffer to a host buffer. The amount of data transferred is determined by the size of the destination buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_buf (`HostBuffer[type]`): Host buffer to copy to.
* ​src\_buf (`DeviceBuffer[type]`): Device buffer to copy from. Must be at least as large as `dst_buf`.

`enqueue_copy[type: DType](self, dst_buf: HostBuffer[type], src_buf: HostBuffer[type])`

Enqueues an async copy from one host buffer to another. The amount of data transferred is determined by the size of the destination buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_buf (`HostBuffer[type]`): Host buffer to copy to.
* ​src\_buf (`HostBuffer[type]`): Host buffer to copy from. Must be at least as large as `dst_buf`.
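Taken together, these overloads cover the common host/device round trip. The following is a minimal sketch; the two buffer-creation calls (`enqueue_create_host_buffer()` and `enqueue_create_buffer()`) are assumed from the `DeviceContext` buffer APIs referenced elsewhere in this document, not shown verbatim here.

```mojo
from gpu.host import DeviceContext

alias length = 1024

with DeviceContext() as ctx:
    # Assumed allocation APIs from this reference.
    var host_buf = ctx.enqueue_create_host_buffer[DType.float32](length)
    var dev_buf = ctx.enqueue_create_buffer[DType.float32](length)
    ctx.synchronize()

    # Populate the host buffer.
    for i in range(length):
        host_buf[i] = Float32(i)

    # Host buffer -> device buffer; the copy size comes from the destination.
    ctx.enqueue_copy(dev_buf, host_buf)
    # Device buffer -> host buffer.
    ctx.enqueue_copy(host_buf, dev_buf)
    ctx.synchronize()
```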
### `enqueue_memset`

`enqueue_memset[type: DType](self, dst: DeviceBuffer[type], val: SIMD[type, 1])`

Enqueues an async memset operation, setting all of the elements in the destination device buffer to the specified value.

**Parameters:**

* ​type (`DType`): Type of the data stored in the buffer.

**Args:**

* ​dst (`DeviceBuffer[type]`): Destination buffer.
* ​val (`SIMD[type, 1]`): Value to set all elements of `dst` to.

`enqueue_memset[type: DType](self, dst: HostBuffer[type], val: SIMD[type, 1])`

Enqueues an async memset operation, setting all of the elements in the destination host buffer to the specified value.

**Parameters:**

* ​type (`DType`): Type of the data stored in the buffer.

**Args:**

* ​dst (`HostBuffer[type]`): Destination buffer.
* ​val (`SIMD[type, 1]`): Value to set all elements of `dst` to.

### `synchronize`

`synchronize(self)`

Blocks until all asynchronous calls on the stream associated with this device context have completed. This should never be necessary when writing a custom operation.

### `enqueue_wait_for`

`enqueue_wait_for(self, other: Self)`

Enqueues a wait operation for another device context to complete its work. This method creates a dependency between two device contexts, ensuring that operations in the current context will not begin execution until all previously enqueued operations in the other context have completed. This is useful for synchronizing work across multiple devices or streams.

Example:

```mojo
from gpu.host import DeviceContext

# Create two device contexts
var ctx1 = DeviceContext(0)  # First GPU
var ctx2 = DeviceContext(1)  # Second GPU

# Enqueue operations on ctx1
# ...

# Make ctx2 wait for ctx1 to complete before proceeding
ctx2.enqueue_wait_for(ctx1)

# Enqueue operations on ctx2 that depend on ctx1's completion
# ...
```

**Args:**

* ​other (`Self`): The device context whose operations must complete before operations in this context can proceed.

**Raises:** If there's an error enqueuing the wait operation or if the operation is not supported by the underlying device API.
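For instance, `enqueue_memset()` is a convenient way to zero-fill a freshly allocated device buffer before use. A short sketch (the `enqueue_create_buffer()` allocation call is assumed, as above):

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # Assumed allocation API from this reference.
    var dev_buf = ctx.enqueue_create_buffer[DType.float32](256)
    # Set every element of the buffer to zero on the device.
    ctx.enqueue_memset(dev_buf, 0)
    ctx.synchronize()
```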
### `get_api_version`

`get_api_version(self) -> Int`

Returns the API version associated with this device. This method retrieves the version number of the GPU driver currently installed on the system for the device associated with this context. The version is returned as an integer that can be used to check compatibility with specific features or to troubleshoot driver-related issues.

Example:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # Get the API version
    var api_version = ctx.get_api_version()
    print("GPU API version:", api_version)
```

**Returns:** An integer representing the driver version.

**Raises:** If the driver version cannot be retrieved or if the device context is invalid.

### `get_attribute`

`get_attribute(self, attr: DeviceAttribute) -> Int`

Returns the specified attribute for this device. Use the aliases defined by [DeviceAttribute](/mojo/stdlib/gpu/host/device_attribute/DeviceAttribute) to specify attributes. For example:

```mojo
from gpu.host import DeviceAttribute, DeviceContext

def main():
    var ctx = DeviceContext()
    var attr = DeviceAttribute.MAX_BLOCKS_PER_MULTIPROCESSOR
    var max_blocks = ctx.get_attribute(attr)
    print(max_blocks)
```

**Args:**

* ​attr (`DeviceAttribute`): The device attribute to query.

**Returns:** The value for `attr` on this device.

### `is_compatible`

`is_compatible(self) -> Bool`

Returns True if this device is compatible with MAX. This method checks whether the current device is compatible with the Modular Accelerated Execution (MAX) runtime. It's useful for validating that the device can execute the compiled code before attempting operations.

Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
print("Device is compatible with MAX:", ctx.is_compatible())
```

**Returns:** True if the device is compatible with MAX, False otherwise.

### `id`

`id(self) -> SIMD[int64, 1]`

Returns the ID associated with this device. This method retrieves the unique identifier for the current device. Device IDs are used to distinguish between multiple devices in a system and are often needed for multi-GPU programming.

Example:

```mojo
var ctx = DeviceContext()
try:
    var device_id = ctx.id()
    print("Using device with ID:", device_id)
except:
    print("Failed to get device ID")
```

**Returns:** The unique device ID as an Int64.

**Raises:** If there's an error retrieving the device ID.

### `get_memory_info`

`get_memory_info(self) -> Tuple[UInt, UInt]`

Returns the free and total memory size for this device. This method queries the current state of device memory, providing information about how much memory is available and the total memory capacity of the device. This is useful for memory management and determining if there's enough space for planned operations.

Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
try:
    (free, total) = ctx.get_memory_info()
    print("Free memory:", free / (1024*1024), "MB")
    print("Total memory:", total / (1024*1024), "MB")
except:
    print("Failed to get memory information")
```

**Returns:** A tuple of (free memory, total memory) in bytes.

**Raises:** If there's an error retrieving the memory information.

### `can_access`

`can_access(self, peer: Self) -> Bool`

Returns True if this device can access the identified peer device. This method checks whether the current device can directly access memory on the specified peer device. Peer-to-peer access allows for direct memory transfers between devices without going through host memory, which can significantly improve performance in multi-GPU scenarios.

Example:

```mojo
from gpu.host import DeviceContext

var ctx1 = DeviceContext(0)  # First GPU
var ctx2 = DeviceContext(1)  # Second GPU
try:
    if ctx1.can_access(ctx2):
        print("Direct peer access is possible")
        ctx1.enable_peer_access(ctx2)
    else:
        print("Direct peer access is not supported")
except:
    print("Failed to check peer access capability")
```

**Args:**

* ​peer (`Self`): The peer device to check for accessibility.

**Returns:** True if the current device can access the peer device, False otherwise.

**Raises:** If there's an error checking peer access capability.
### `enable_peer_access`

`enable_peer_access(self, peer: Self)`

Enables direct memory access to the peer device. This method establishes peer-to-peer access from the current device to the specified peer device. Once enabled, the current device can directly read from and write to memory allocated on the peer device without going through host memory, which can significantly improve performance for multi-GPU operations.

Notes:

* It's recommended to call `can_access()` first to check if peer access is possible.
* Peer access is not always symmetric; you may need to enable access in both directions.

Example:

```mojo
from gpu.host import DeviceContext

var ctx1 = DeviceContext(0)  # First GPU
var ctx2 = DeviceContext(1)  # Second GPU
try:
    if ctx1.can_access(ctx2):
        ctx1.enable_peer_access(ctx2)
        print("Peer access enabled from device 0 to device 1")

        # For bidirectional access
        if ctx2.can_access(ctx1):
            ctx2.enable_peer_access(ctx1)
            print("Peer access enabled from device 1 to device 0")
    else:
        print("Peer access not supported between these devices")
except:
    print("Failed to enable peer access")
```

**Args:**

* ​peer (`Self`): The peer device to enable access to.

**Raises:** If there's an error enabling peer access or if peer access is not supported between the devices.

### `supports_multicast`

`supports_multicast(self) -> Bool`

Returns True if this device supports multicast memory mappings.

**Returns:** True if the current device supports multicast memory, False otherwise.

**Raises:** If there's an error checking multicast support.

### `number_of_devices`

`static number_of_devices(*, api: String = String(from_name[::StringSlice[::Bool())) -> Int`

Returns the number of devices available that support the specified API. This function queries the system for available devices that support the requested API (such as CUDA or HIP). It's useful for determining how many accelerators are available before allocating resources or distributing work.

Example:

```mojo
from gpu.host import DeviceContext

# Get number of CUDA devices
var num_cuda_devices = DeviceContext.number_of_devices(api="cuda")

# Get number of devices for the default API
var num_devices = DeviceContext.number_of_devices()
```

**Args:**

* ​api (`String`): Requested device API (for example, "cuda" or "hip"). Defaults to the device API specified by the DeviceContext class.

**Returns:** The number of available devices supporting the specified API.

---

## DeviceExternalFunction

`struct DeviceExternalFunction`

Represents an external device function loaded from PTX/SASS assembly. This struct provides functionality to load and execute pre-compiled GPU functions from assembly code rather than compiling them from Mojo source. This is useful for integrating with existing CUDA/HIP code or for using specialized assembly optimizations.

The `DeviceExternalFunction` handles reference counting of the underlying device function handle and provides methods for launching the function on a GPU with specified execution configuration.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__copyinit__`

`__copyinit__(out self, existing: Self)`

Creates a copy of an existing device function by incrementing its reference count.

**Args:**

* ​existing (`Self`): The device function to copy.

### `__moveinit__`

`__moveinit__(out self, owned existing: Self)`

Moves an existing device function into this one.

**Args:**

* ​existing (`Self`): The device function to move from.
### `__del__`

`__del__(owned self)`

Releases resources associated with this device function.

### `get_attribute`

`get_attribute(self, attr: Attribute) -> Int`

Retrieves a specific attribute of this device function.

**Args:**

* ​attr (`Attribute`): The attribute to query.

**Returns:** The value of the requested attribute.

**Raises:** If the attribute query fails.

---

## DeviceFunction

`struct DeviceFunction[func_type: AnyTrivialRegType, //, func: func_type, declared_arg_types: Optional[Variadic[AnyType]], *, target: target = _get_gpu_target[::StringSlice[::Bool(), _ptxas_info_verbose: Bool = False]`

Represents a compiled device function for GPU execution. This struct encapsulates a compiled GPU function that can be launched on a device. It handles the compilation, loading, and resource management of device functions.

Example:

```mojo
from gpu.host import DeviceContext, DeviceFunction

fn my_kernel(x: Int, y: Int):
    # Kernel implementation
    pass

var ctx = DeviceContext()
var kernel = ctx.compile_function[my_kernel]()
ctx.enqueue_function(kernel, 2, 3, grid_dim=(1,1,1), block_dim=(32,1,1))
```

## Parameters

* ​func\_type (`AnyTrivialRegType`): The type of the function to compile.
* ​func (`func_type`): The function to compile for GPU execution.
* ​declared\_arg\_types (`Optional[Variadic[AnyType]]`): An optional containing a variadic of the declared types of the kernel signature.
* ​target (`target`): The target architecture for compilation. Defaults to the current GPU target.
* ​\_ptxas\_info\_verbose (`Bool`): Whether to enable verbose PTX assembly output. Defaults to False.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__copyinit__`

`__copyinit__(out self, existing: Self)`

Creates a copy of an existing DeviceFunction. This increases the reference count of the underlying device function handle.

**Args:**

* ​existing (`Self`): The DeviceFunction to copy from.

### `__moveinit__`

`__moveinit__(out self, owned existing: Self)`

Moves an existing DeviceFunction into this one.

**Args:**

* ​existing (`Self`): The DeviceFunction to move from.

### `__del__`

`__del__(owned self)`

Releases resources associated with this DeviceFunction. This decrements the reference count of the underlying device function handle.

### `dump_rep`

`dump_rep[dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False)](self)`

Dumps various representations of the compiled device function. This method dumps the assembly, LLVM IR, and/or SASS code for the compiled device function based on the provided parameters. The output can be directed to stdout or written to files.

Notes: When a path contains '%', it will be replaced with the module name to help disambiguate multiple kernel dumps.

**Parameters:**

* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Controls dumping of assembly code. Can be a boolean, a file path, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Controls dumping of LLVM IR. Can be a boolean, a file path, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Controls dumping of SASS code (internal use). Can be a boolean, a file path, or a function returning a file path.

**Raises:** If any file operations fail during the dumping process.

### `get_attribute`

`get_attribute(self, attr: Attribute) -> Int`

Retrieves a specific attribute value from the compiled device function. This method queries the device function for information about its resource requirements, execution capabilities, or other properties defined by the specified attribute.

Example:

```mojo
from gpu.host import Attribute, DeviceFunction

var device_function = DeviceFunction(...)

# Get the maximum number of threads per block for this function
var max_threads = device_function.get_attribute(Attribute.MAX_THREADS_PER_BLOCK)
```

**Args:**

* ​attr (`Attribute`): The attribute to query, defined in the Attribute enum.

**Returns:** The integer value of the requested attribute.

**Raises:** If the attribute query fails or the attribute is not supported.

---

## DeviceMulticastBuffer

`struct DeviceMulticastBuffer[type: DType]`

Represents a multicast memory object that enables special memory operations to be broadcast across a group of devices.

## Parameters

* ​type (`DType`): Data type to be stored in the associated memory regions.

## Implemented traits

`AnyType`, `UnknownDestructibility`

---

## DeviceStream

`struct DeviceStream`

Represents a CUDA/HIP stream for asynchronous GPU operations. A DeviceStream provides a queue for GPU operations that can execute concurrently with operations in other streams. Operations within a single stream execute in the order they are issued, but operations in different streams may execute in any relative order or concurrently. This abstraction allows for better utilization of GPU resources by enabling overlapping of computation and data transfers.

Example:

```mojo
from gpu.host import DeviceContext, DeviceStream

var ctx = DeviceContext(0)  # Select first GPU
var stream = DeviceStream(ctx)

# Launch operations on the stream
# ...

# Wait for all operations in the stream to complete
stream.synchronize()
```

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `synchronize`

`synchronize(self)`

Blocks the calling CPU thread until all operations in this stream complete. This function waits until all previously issued commands in this stream have completed execution. It provides a synchronization point between host and device code.

Example:

```mojo
# Launch kernel or memory operations on the stream
# ...

# Wait for completion
stream.synchronize()

# Now it's safe to use results on the host
```

**Raises:** If synchronization fails.

---

## HostBuffer

`struct HostBuffer[type: DType]`

Represents a block of host-resident storage. For GPU devices, a host buffer is allocated in the host's global memory. To allocate a `HostBuffer`, use one of the methods provided by `DeviceContext`, such as [`enqueue_create_host_buffer()`](/mojo/stdlib/gpu/host/device_context/DeviceContext#enqueue_create_host_buffer).

## Parameters

* ​type (`DType`): Data type to be stored in the buffer.

## Implemented traits

`AnyType`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `__copyinit__`

`__copyinit__(out self, existing: Self)`

Creates a copy of an existing host buffer by incrementing its reference count. This copy constructor creates a new reference to the same underlying host buffer by incrementing the reference count of the native buffer object.
Both the original and the copy refer to the same underlying memory.

**Args:**

* ​existing (`Self`): The host buffer to copy.

### `__moveinit__`

`__moveinit__(out self, owned existing: Self)`

Initializes this buffer by taking ownership of an existing buffer. This move constructor transfers ownership of the underlying buffer from the existing instance to the new instance without incrementing the reference count.

**Args:**

* ​existing (`Self`): The buffer to move from, which will no longer be valid after this call.

### `__del__`

`__del__(owned self)`

Releases resources associated with this host buffer. This function schedules an owned buffer free using the stream in the device context. The actual deallocation may occur asynchronously after all operations using this buffer have completed.

### `__getitem__`

`__getitem__(self, idx: Int) -> SIMD[type, 1]`

Retrieves the element at the specified index from the host buffer. This operator allows direct access to individual elements in the host buffer using array indexing syntax.

**Args:**

* ​idx (`Int`): The index of the element to retrieve.

**Returns:** The scalar value at the specified index.

### `__setitem__`

`__setitem__(self, idx: Int, val: SIMD[type, 1])`

Sets the element at the specified index in the host buffer. This operator allows direct modification of individual elements in the host buffer using array indexing syntax.

**Args:**

* ​idx (`Int`): The index of the element to modify.
* ​val (`SIMD[type, 1]`): The new value to store at the specified index.

### `copy`

`copy(self) -> Self`

Explicitly construct a copy of self.

**Returns:** A copy of this value.

### `__len__`

`__len__(self) -> Int`

Returns the number of elements in this buffer. This method calculates the number of elements by dividing the total byte size of the buffer by the size of each element.

**Returns:** The number of elements in the buffer.

### `create_sub_buffer`

`create_sub_buffer[view_type: DType](self, offset: Int, size: Int) -> HostBuffer[view_type]`

Creates a sub-buffer view of this buffer with a different element type. This method creates a new buffer that references a subset of the memory in this buffer, potentially with a different element type. The sub-buffer shares the underlying memory with the original buffer.

**Parameters:**

* ​view\_type (`DType`): The data type for elements in the new sub-buffer.

**Args:**

* ​offset (`Int`): The starting offset in elements from the beginning of this buffer.
* ​size (`Int`): The number of elements in the new sub-buffer.

**Returns:** A new HostBuffer referencing the specified region with the specified element type.

### `enqueue_copy_to`

`enqueue_copy_to(self, dst: Self)`

Enqueues an asynchronous copy from this buffer to another host buffer. This method schedules a memory copy operation from this buffer to the destination buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context.

**Args:**

* ​dst (`Self`): The destination host buffer to copy data to.

`enqueue_copy_to(self, dst: DeviceBuffer[type])`

Enqueues an asynchronous copy from this buffer to a device buffer. This method schedules a memory copy operation from this buffer to the destination buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context.

**Args:**

* ​dst (`DeviceBuffer[type]`): The destination device buffer to copy data to.

`enqueue_copy_to(self, dst_ptr: UnsafePointer[SIMD[type, 1]])`

Enqueues an asynchronous copy from this buffer to host memory.
This method schedules a memory copy operation from this host buffer to the specified host memory location. The operation is asynchronous and will be executed in the stream associated with this buffer's context.

**Args:**

* ​dst\_ptr (`UnsafePointer[SIMD[type, 1]]`): Pointer to the destination host memory location.

### `enqueue_copy_from`

`enqueue_copy_from(self, src: Self)`

Enqueues an asynchronous copy to this buffer from another host buffer. This method schedules a memory copy operation to this buffer from the source buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context.

**Args:**

* ​src (`Self`): The source host buffer to copy data from.

`enqueue_copy_from(self, src: DeviceBuffer[type])`

Enqueues an asynchronous copy to this buffer from a device buffer. This method schedules a memory copy operation to this buffer from the source buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context.

**Args:**

* ​src (`DeviceBuffer[type]`): The source device buffer to copy data from.

`enqueue_copy_from(self, src_ptr: UnsafePointer[SIMD[type, 1]])`

Enqueues an asynchronous copy to this buffer from host memory. This method schedules a memory copy operation to this host buffer from the specified host memory location. The operation is asynchronous and will be executed in the stream associated with this buffer's context.

**Args:**

* ​src\_ptr (`UnsafePointer[SIMD[type, 1]]`): Pointer to the source host memory location.

### `enqueue_fill`

`enqueue_fill(self, val: SIMD[type, 1]) -> Self`

Enqueues an operation to fill this buffer with a specified value. This method schedules a memory set operation that fills the entire buffer with the specified value. The operation is asynchronous and will be executed in the stream associated with this buffer's context.

**Args:**

* ​val (`SIMD[type, 1]`): The value to fill the buffer with.

**Returns:** Self reference for method chaining.

### `reassign_ownership_to`

`reassign_ownership_to(self, ctx: DeviceContext)`

Transfers ownership of this buffer to another device context. This method changes the device context that owns this buffer. This can be useful when sharing buffers between different contexts or when migrating workloads between devices.

**Args:**

* ​ctx (`DeviceContext`): The new device context to take ownership of this buffer.

### `take_ptr`

`take_ptr(owned self) -> UnsafePointer[SIMD[type, 1]]`

Takes ownership of the underlying pointer from this buffer. This method releases the pointer from the buffer's control and returns it to the caller. After this call, the buffer no longer owns the pointer, and the caller is responsible for managing its lifecycle.

**Returns:** The raw pointer that was owned by this buffer.

### `unsafe_ptr`

`unsafe_ptr(self) -> UnsafePointer[SIMD[type, 1]]`

Returns the raw pointer without transferring ownership. This method provides direct access to the underlying pointer for advanced use cases. The buffer retains ownership of the pointer.

**Returns:** The raw pointer owned by this buffer.

### `context`

`context(self) -> DeviceContext`

Returns the device context associated with this buffer. This method retrieves the device context that owns this buffer and is responsible for managing its lifecycle and operations.

**Returns:** The device context associated with this buffer.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Writes a string representation of this buffer to the provided writer.
This method formats the buffer's contents as a string and writes it to the specified writer. For large buffers, a compact representation is used.

**Parameters:**

* ​W (`Writer`): The writer type.

**Args:**

* ​writer (`W`): The writer to output the formatted string to.

### `__str__`

`__str__(self) -> String`

Returns a string representation of the `HostBuffer`. This method creates a human-readable string representation of the buffer's contents by formatting its elements.

**Returns:** A string containing the formatted buffer contents.

### `as_span`

`as_span(ref self) -> Span[SIMD[type, 1], self_is_origin]`

Returns a `Span` pointing to the underlying memory of the `HostBuffer`.

**Returns:** A `Span` pointing to the underlying memory of the `HostBuffer`.

---

## device_context

This module provides functionality for interacting with accelerators. In particular the [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext) struct, which represents a single stream of execution on a given accelerator. You can use this struct to allocate accelerator memory, copy data to and from the accelerator, and compile and execute functions on the accelerator.

## Structs

* [​`DeviceBuffer`](/mojo/stdlib/gpu/host/device_context/DeviceBuffer): Represents a block of device-resident storage. For GPU devices, a device buffer is allocated in the device's global memory.
* [​`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext): Represents a single stream of execution on a particular accelerator (GPU).
* [​`DeviceExternalFunction`](/mojo/stdlib/gpu/host/device_context/DeviceExternalFunction): Represents an external device function loaded from PTX/SASS assembly.
* [​`DeviceFunction`](/mojo/stdlib/gpu/host/device_context/DeviceFunction): Represents a compiled device function for GPU execution.
* [​`DeviceMulticastBuffer`](/mojo/stdlib/gpu/host/device_context/DeviceMulticastBuffer): Represents a multicast memory object that enables special memory operations to be broadcast across a group of devices.
* [​`DeviceStream`](/mojo/stdlib/gpu/host/device_context/DeviceStream): Represents a CUDA/HIP stream for asynchronous GPU operations.
* [​`HostBuffer`](/mojo/stdlib/gpu/host/device_context/HostBuffer): Represents a block of host-resident storage. For GPU devices, a host buffer is allocated in the host's global memory.

---

## Dim

`@register_passable(trivial)`

`struct Dim`

Represents a dimension with up to three components (x, y, z). This struct is commonly used to represent grid and block dimensions for kernel launches.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `__init__`

`@implicit`

`__init__[I: Indexer](x: I) -> Self`

Initializes Dim with a single indexable value for x. y and z dimensions are set to 1.

**Parameters:**

* ​I (`Indexer`): The type of the indexable value.

**Args:**

* ​x (`I`): The value for the x dimension.

`__init__[I0: Indexer, I1: Indexer](x: I0, y: I1) -> Self`

Initializes Dim with indexable values for x and y. z dimension is set to 1.

**Parameters:**

* ​I0 (`Indexer`): The type of the first indexable value.
* ​I1 (`Indexer`): The type of the second indexable value.

**Args:**

* ​x (`I0`): The value for the x dimension.
* ​y (`I1`): The value for the y dimension.

`__init__[I0: Indexer, I1: Indexer, I2: Indexer](x: I0, y: I1, z: I2) -> Self`

Initializes Dim with indexable values for x, y, and z.

**Parameters:**

* ​I0 (`Indexer`): The type of the first indexable value.
* ​I1 (`Indexer`): The type of the second indexable value. * ​I2 (`Indexer`): The type of the third indexable value. **Args:** * ​x (`I0`): The value for the x dimension. * ​y (`I1`): The value for the y dimension. * ​z (`I2`): The value for the z dimension. `@implicit` `__init__[I: Indexer](dims: Tuple[I]) -> Self` Initializes Dim with a tuple containing a single indexable value. y and z dimensions are set to 1. **Parameters:** * ​I (`Indexer`): The type of the indexable value in the tuple. **Args:** * ​dims (`Tuple[I]`): A tuple with one element for x dimension. `@implicit` `__init__[I0: Indexer, I1: Indexer](dims: Tuple[I0, I1]) -> Self` Initializes Dim with a tuple of two indexable values. The z dimension is set to 1. **Parameters:** * ​I0 (`Indexer`): The type of the first indexable value in the tuple. * ​I1 (`Indexer`): The type of the second indexable value in the tuple. **Args:** * ​dims (`Tuple[I0, I1]`): A tuple with two elements: x and y dimensions. `@implicit` `__init__[I0: Indexer, I1: Indexer, I2: Indexer](dims: Tuple[I0, I1, I2]) -> Self` Initializes Dim with a tuple of three indexable values. **Parameters:** * ​I0 (`Indexer`): The type of the first indexable value in the tuple. * ​I1 (`Indexer`): The type of the second indexable value in the tuple. * ​I2 (`Indexer`): The type of the third indexable value in the tuple. **Args:** * ​dims (`Tuple[I0, I1, I2]`): Tuple with three elements: x, y, and z dimensions. ### `__getitem__` `__getitem__(self, idx: Int) -> Int` Gets the dimension value at the specified index. **Args:** * ​idx (`Int`): The index (0 for x, 1 for y, 2 for z). **Returns:** The value of the dimension at the given index. ### `__str__` `__str__(self) -> String` Returns a string representation of the Dim. **Returns:** String representation of this Dim object. ### `__repr__` `__repr__(self) -> String` Returns a string representation of the Dim. **Returns:** String representation of this Dim object. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes a formatted string representation of the Dim. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer trait. **Args:** * ​writer (`W`): The Writer to write to. ### `z` `z(self) -> Int` Returns the z dimension. **Returns:** The value of the z dimension. ### `y` `y(self) -> Int` Returns the y dimension. **Returns:** The value of the y dimension. ### `x` `x(self) -> Int` Returns the x dimension. **Returns:** The value of the x dimension. --- ## dim This module implements the dim type. ## Structs * [​`Dim`](/mojo/stdlib/gpu/host/dim/Dim): Represents a dimension with up to three components (x, y, z). --- ## Attribute `@register_passable(trivial)` `struct Attribute` Represents GPU kernel function attributes. This struct defines constants for various function attributes that can be queried or set for GPU kernels. These attributes provide information about resource requirements and execution constraints of kernel functions. ## Fields * ​code (`SIMD[int32, 1]`): The numeric code representing the attribute type. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`, `Writable` ## Aliases ### `BINARY_VERSION` `alias BINARY_VERSION = Attribute(__init__[__mlir_type.!pop.int_literal](6))` The binary architecture version for which the function was compiled. This value is the major binary version \* 10 + the minor binary version, so a binary version 1.3 function would return the value 13. 
Note that this will return a value of 10 for legacy cubins that do not have a properly-encoded binary architecture version.

### `CACHE_MODE_CA`

`alias CACHE_MODE_CA = Attribute(__init__[__mlir_type.!pop.int_literal](7))`

The attribute to indicate whether the function has been compiled with the user-specified option `-Xptxas --dlcm=ca` set.

### `CLUSTER_SCHEDULING_POLICY_PREFERENCE`

`alias CLUSTER_SCHEDULING_POLICY_PREFERENCE = Attribute(__init__[__mlir_type.!pop.int_literal](15))`

The block scheduling policy of a function. The value type is CUclusterSchedulingPolicy / cudaClusterSchedulingPolicy.

### `CLUSTER_SIZE_MUST_BE_SET`

`alias CLUSTER_SIZE_MUST_BE_SET = Attribute(__init__[__mlir_type.!pop.int_literal](10))`

If this attribute is set, the kernel must launch with a valid cluster size specified.

### `CONST_SIZE_BYTES`

`alias CONST_SIZE_BYTES = Attribute(__init__[__mlir_type.!pop.int_literal](2))`

The size in bytes of user-allocated constant memory required by this function.

### `LOCAL_SIZE_BYTES`

`alias LOCAL_SIZE_BYTES = Attribute(__init__[__mlir_type.!pop.int_literal](3))`

The size in bytes of local memory used by each thread of this function.

### `MAX_DYNAMIC_SHARED_SIZE_BYTES`

`alias MAX_DYNAMIC_SHARED_SIZE_BYTES = Attribute(__init__[__mlir_type.!pop.int_literal](8))`

The maximum size in bytes of dynamically-allocated shared memory that can be used by this function. If the user-specified dynamic shared memory size is larger than this value, the launch will fail.

### `MAX_THREADS_PER_BLOCK`

`alias MAX_THREADS_PER_BLOCK = Attribute(__init__[__mlir_type.!pop.int_literal](0))`

The maximum number of threads per block, beyond which a launch of the function would fail. This number depends on both the function and the device on which the function is currently loaded.

### `NON_PORTABLE_CLUSTER_SIZE_ALLOWED`

`alias NON_PORTABLE_CLUSTER_SIZE_ALLOWED = Attribute(__init__[__mlir_type.!pop.int_literal](14))`

Whether the function can be launched with non-portable cluster size. 1 is allowed, 0 is disallowed. A non-portable cluster size may only function on the specific SKUs the program is tested on. The launch might fail if the program is run on a different hardware platform. The CUDA API provides `cudaOccupancyMaxActiveClusters` to assist with checking whether the desired size can be launched on the current device. A portable cluster size is guaranteed to be functional on all compute capabilities higher than the target compute capability. The portable cluster size for sm\_90 is 8 blocks per cluster.

### `NUM_REGS`

`alias NUM_REGS = Attribute(__init__[__mlir_type.!pop.int_literal](4))`

The number of registers used by each thread of this function.

### `PREFERRED_SHARED_MEMORY_CARVEOUT`

`alias PREFERRED_SHARED_MEMORY_CARVEOUT = Attribute(__init__[__mlir_type.!pop.int_literal](9))`

On devices where the L1 cache and shared memory use the same hardware resources, this sets the shared memory carveout preference, in percent of the total shared memory.

### `PTX_VERSION`

`alias PTX_VERSION = Attribute(__init__[__mlir_type.!pop.int_literal](5))`

The PTX virtual architecture version for which the function was compiled. This value is the major PTX version \* 10 + the minor PTX version, so a PTX version 1.3 function would return the value 13. Note that this may return the undefined value of 0 for cubins compiled prior to CUDA 3.0.

### `REQUIRED_CLUSTER_DEPTH`

`alias REQUIRED_CLUSTER_DEPTH = Attribute(__init__[__mlir_type.!pop.int_literal](13))`

The required cluster depth in blocks.
The values must either all be 0 or all be positive. The validity of the cluster dimensions is otherwise checked at launch time.

### `REQUIRED_CLUSTER_HEIGHT`

`alias REQUIRED_CLUSTER_HEIGHT = Attribute(__init__[__mlir_type.!pop.int_literal](12))`

The required cluster height in blocks. The values must either all be 0 or all be positive. The validity of the cluster dimensions is otherwise checked at launch time.

### `REQUIRED_CLUSTER_WIDTH`

`alias REQUIRED_CLUSTER_WIDTH = Attribute(__init__[__mlir_type.!pop.int_literal](11))`

The required cluster width in blocks. The values must either all be 0 or all be positive. The validity of the cluster dimensions is otherwise checked at launch time.

### `SHARED_SIZE_BYTES`

`alias SHARED_SIZE_BYTES = Attribute(__init__[__mlir_type.!pop.int_literal](1))`

The size in bytes of statically-allocated shared memory required by this function. This does not include dynamically-allocated shared memory requested by the user at runtime.

## Methods

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Checks if two Attribute instances are equal.

**Args:**

* ​other (`Self`): The Attribute to compare with.

**Returns:** True if both attributes have the same code, False otherwise.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

Checks if two Attribute instances are not equal.

**Args:**

* ​other (`Self`): The Attribute to compare with.

**Returns:** True if the attributes have different codes, False otherwise.

### `__is__`

`__is__(self, other: Self) -> Bool`

Identity comparison operator for Attribute instances.

**Args:**

* ​other (`Self`): The Attribute to compare with.

**Returns:** True if both attributes are identical (have the same code), False otherwise.

### `__isnot__`

`__isnot__(self, other: Self) -> Bool`

Negative identity comparison operator for Attribute instances.

**Args:**

* ​other (`Self`): The Attribute to compare with.

**Returns:** True if the attributes are not identical, False otherwise.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Writes a string representation of the `Attribute` to the provided writer. This method converts the `Attribute` enum value to its corresponding string name and writes it to the provided writer object.

**Parameters:**

* ​W (`Writer`): The type of writer to use for output. Must implement the Writer trait.

**Args:**

* ​writer (`W`): A Writer object that will receive the string representation.

---

## FuncAttribute

`@register_passable(trivial)`

`struct FuncAttribute`

Implements CUDA's CUfunction\_attribute enum for GPU kernel function attributes. This struct represents function attributes that can be set or queried for GPU kernels, following NVIDIA's CUDA driver API conventions. Each attribute consists of a type (represented by the Attribute enum) and an associated value. The struct provides factory methods for creating common attribute configurations, such as cache mode settings and shared memory allocations.

## Fields

* ​attribute (`Attribute`): The type of function attribute.
* ​value (`SIMD[int32, 1]`): The value associated with this attribute.

## Implemented traits

`AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility`

## Aliases

### `NULL`

`alias NULL = FuncAttribute(Attribute(__init__[__mlir_type.!pop.int_literal](-1)), __init__[__mlir_type.!pop.int_literal](-1))`

A null/invalid function attribute constant.

## Methods

### `__init__`

`__init__(*, other: Self) -> Self`

Explicitly construct a deep copy of the provided value.

**Args:**

* ​other (`Self`): The value to copy.
### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two `FuncAttribute` instances are equal. **Args:** * ​other (`Self`): The FuncAttribute to compare with. **Returns:** True if both the attribute type and value are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if two `FuncAttribute` instances are not equal. **Args:** * ​other (`Self`): The `FuncAttribute` to compare with. **Returns:** True if either the attribute type or value differs, False otherwise. ### `CACHE_MODE_CA` `static CACHE_MODE_CA(val: Bool) -> Self` Creates a CACHE\_MODE\_CA function attribute. Indicates whether the function has been compiled with user specified option `CacheMode.L1_CACHE_DISABLED` set. **Args:** * ​val (`Bool`): Boolean value indicating if L1 cache is disabled. **Returns:** A `FuncAttribute` instance with CACHE\_MODE\_CA attribute type. ### `MAX_DYNAMIC_SHARED_SIZE_BYTES` `static MAX_DYNAMIC_SHARED_SIZE_BYTES(val: SIMD[uint32, 1]) -> Self` Creates a MAX\_DYNAMIC\_SHARED\_SIZE\_BYTES function attribute. The maximum size in bytes of dynamically-allocated shared memory that can be used by this function. If the user-specified dynamic shared memory size is larger than this value, the launch will fail. **Args:** * ​val (`SIMD[uint32, 1]`): Maximum dynamic shared memory size in bytes. **Returns:** A `FuncAttribute` instance with `MAX_DYNAMIC_SHARED_SIZE_BYTES` attribute type. ### `PREFERRED_SHARED_MEMORY_CARVEOUT` `static PREFERRED_SHARED_MEMORY_CARVEOUT(val: SIMD[int32, 1]) -> Self` Creates a PREFERRED\_SHARED\_MEMORY\_CARVEOUT function attribute. On devices where the L1 cache and shared memory use the same hardware resources, this sets the shared memory carveout preference, in percent of the total shared memory. **Args:** * ​val (`SIMD[int32, 1]`): Shared memory carveout preference as a percentage (0-100). **Returns:** A FuncAttribute instance with `PREFERRED_SHARED_MEMORY_CARVEOUT` attribute type. --- ## func_attribute GPU Kernel Function Attributes Module This module provides structures for defining and managing GPU kernel function attributes. It implements functionality similar to CUDA's CUfunction\_attribute enum, allowing for querying and setting various attributes that control kernel execution behavior and resource allocation. The module includes: * `Attribute`: A value type representing different GPU kernel function attribute types * `FuncAttribute`: A structure that pairs an attribute type with its value These structures enable fine-grained control over GPU kernel execution parameters such as shared memory allocation, cache behavior, and cluster configuration. ## Structs * [​`Attribute`](/mojo/stdlib/gpu/host/func_attribute/Attribute): Represents GPU kernel function attributes. * [​`FuncAttribute`](/mojo/stdlib/gpu/host/func_attribute/FuncAttribute): Implements CUDA's CUfunction\_attribute enum for GPU kernel function attributes. --- ## host Implements the gpu host package. ## Modules * [​`constant_memory_mapping`](/mojo/stdlib/gpu/host/constant_memory_mapping/): This module provides functionality for mapping constant memory between host and device. * [​`device_attribute`](/mojo/stdlib/gpu/host/device_attribute/): This module defines GPU device attributes that can be queried from CUDA-compatible devices. * [​`device_context`](/mojo/stdlib/gpu/host/device_context/): This module provides functionality for interacting with accelerators. 
In particular the [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext) struct, which represents a single stream of execution on a given accelerator. You can use this struct to allocate accelerator memory, copy data to and from the accelerator, and compile and execute functions on the accelerator. * [​`dim`](/mojo/stdlib/gpu/host/dim/): This module implements the dim type. * [​`func_attribute`](/mojo/stdlib/gpu/host/func_attribute/): GPU Kernel Function Attributes Module * [​`info`](/mojo/stdlib/gpu/host/info/): Contains information about GPU architectures and their capabilities. * [​`launch_attribute`](/mojo/stdlib/gpu/host/launch_attribute/): GPU Launch Attributes for Kernel Execution Control --- ## Info `@register_passable` `struct Info` Comprehensive information about a GPU architecture. This struct contains detailed specifications about GPU capabilities, including compute units, memory, thread organization, and performance characteristics. ## Fields * ​name (`StringSlice[StaticConstantOrigin]`): The model name of the GPU. * ​vendor (`Vendor`): The vendor/manufacturer of the GPU (e.g., NVIDIA, AMD). * ​api (`StringSlice[StaticConstantOrigin]`): The graphics/compute API supported by the GPU (e.g., CUDA, ROCm). * ​arch\_name (`StringSlice[StaticConstantOrigin]`): The architecture name of the GPU (e.g., sm\_80, gfx942). * ​compile\_options (`StringSlice[StaticConstantOrigin]`): Compiler options specific to this GPU architecture. * ​compute (`SIMD[float32, 1]`): Compute capability version number for NVIDIA GPUs. * ​version (`StringSlice[StaticConstantOrigin]`): Version string of the GPU architecture. * ​sm\_count (`Int`): Number of streaming multiprocessors (SMs) on the GPU. * ​warp\_size (`Int`): Number of threads in a warp/wavefront. * ​threads\_per\_sm (`Int`): Maximum number of threads per streaming multiprocessor. * ​threads\_per\_warp (`Int`): Number of threads that execute together in a warp/wavefront. * ​warps\_per\_multiprocessor (`Int`): Maximum number of warps that can be active on a multiprocessor. * ​threads\_per\_multiprocessor (`Int`): Maximum number of threads that can be active on a multiprocessor. * ​thread\_blocks\_per\_multiprocessor (`Int`): Maximum number of thread blocks that can be active on a multiprocessor. * ​shared\_memory\_per\_multiprocessor (`Int`): Size of shared memory available per multiprocessor in bytes. * ​register\_file\_size (`Int`): Total size of the register file per multiprocessor in bytes. * ​register\_allocation\_unit\_size (`Int`): Minimum allocation size for registers in bytes. * ​allocation\_granularity (`StringSlice[StaticConstantOrigin]`): Description of how resources are allocated on the GPU. * ​max\_registers\_per\_thread (`Int`): Maximum number of registers that can be allocated to a single thread. * ​max\_registers\_per\_block (`Int`): Maximum number of registers that can be allocated to a thread block. * ​max\_blocks\_per\_multiprocessor (`Int`): Maximum number of blocks that can be scheduled on a multiprocessor. * ​shared\_memory\_allocation\_unit\_size (`Int`): Minimum allocation size for shared memory in bytes. * ​warp\_allocation\_granularity (`Int`): Granularity at which warps are allocated resources. * ​max\_thread\_block\_size (`Int`): Maximum number of threads allowed in a thread block. ## Implemented traits `AnyType`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__lt__` `__lt__(self, other: Self) -> Bool` Compares if this GPU has lower compute capability than another. 
**Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if this GPU has lower compute capability, False otherwise. ### `__le__` `__le__(self, other: Self) -> Bool` Compares if this GPU has lower or equal compute capability. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if this GPU has lower or equal compute capability. ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two GPU Info instances represent the same GPU model. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if both instances represent the same GPU model. ### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if two GPU Info instances represent different GPU models. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if instances represent different GPU models. ### `__gt__` `__gt__(self, other: Self) -> Bool` Compares if this GPU has higher compute capability than another. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if this GPU has higher compute capability, False otherwise. ### `__ge__` `__ge__(self, other: Self) -> Bool` Compares if this GPU has higher or equal compute capability. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if this GPU has higher or equal compute capability. ### `__is__` `__is__(self, other: Self) -> Bool` Identity comparison operator for GPU Info instances. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if both instances represent the same GPU model. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Negative identity comparison operator for GPU Info instances. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if instances represent different GPU models. ### `target` `target(self) -> target` Gets the MLIR target configuration for this GPU. **Returns:** MLIR target configuration for the GPU. ### `from_target` `static from_target[target: target]() -> Self` Creates an Info instance from an MLIR target. **Parameters:** * ​target (`target`): MLIR target configuration. **Returns:** GPU info corresponding to the target. ### `from_name` `static from_name[name: StringSlice[StaticConstantOrigin]]() -> Self` Creates an Info instance from a GPU architecture name. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): GPU architecture name (e.g., "sm\_80", "gfx942"). **Returns:** GPU info corresponding to the architecture name. ### `occupancy` `occupancy(self, *, threads_per_block: Int, registers_per_thread: Int) -> SIMD[float64, 1]` Calculates the theoretical occupancy for a given thread and register configuration. Occupancy represents the ratio of active warps to the maximum possible warps on a streaming multiprocessor. Note: TODO (KERN-795): Add occupancy calculation based on shared memory usage and thread block size, and use the minimum value. **Args:** * ​threads\_per\_block (`Int`): Number of threads in each block. * ​registers\_per\_thread (`Int`): Number of registers used by each thread. **Returns:** Occupancy as a ratio between 0.0 and 1.0. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes GPU information to a writer. Outputs all GPU specifications and capabilities to the provided writer in a human-readable format. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer trait.
**Args:** * ​writer (`W`): A Writer instance to output the GPU information. ### `__str__` `__str__(self) -> String` Returns a string representation of the GPU information. Converts all GPU specifications and capabilities to a human-readable string format. **Returns:** String containing all GPU information. --- ## Vendor `@register_passable` `struct Vendor` Represents GPU vendors. This struct provides identifiers for different GPU vendors and utility methods for comparison and string representation. The Vendor struct defines constants for common GPU vendors (NVIDIA, AMD) and includes a NO\_GPU option for systems without GPU support. It provides comparison operators and string conversion methods for vendor identification. ## Implemented traits `AnyType`, `UnknownDestructibility`, `Writable` ## Aliases ### `AMD_GPU` `alias AMD_GPU = Vendor(__init__[__mlir_type.!pop.int_literal](1))` Represents AMD GPU vendor. ### `NO_GPU` `alias NO_GPU = Vendor(__init__[__mlir_type.!pop.int_literal](0))` Represents no GPU or CPU-only execution. ### `NVIDIA_GPU` `alias NVIDIA_GPU = Vendor(__init__[__mlir_type.!pop.int_literal](2))` Represents NVIDIA GPU vendor. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two `Vendor` instances are equal. **Args:** * ​other (`Self`): The `Vendor` to compare with. **Returns:** True if vendors are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if two `Vendor` instances are not equal. **Args:** * ​other (`Self`): The `Vendor` to compare with. **Returns:** True if vendors are not equal, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Identity comparison for vendors. **Args:** * ​other (`Self`): The `Vendor` to compare with. **Returns:** True if vendors are identical, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Negative identity comparison for vendors. **Args:** * ​other (`Self`): The Vendor to compare with. **Returns:** True if vendors are not identical, False otherwise. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes vendor information to a writer. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer trait. **Args:** * ​writer (`W`): The writer to output vendor information to. ### `__str__` `__str__(self) -> String` Returns a string representation of the vendor. **Returns:** String representation of the vendor. --- ## info Contains information about GPU architectures and their capabilities. This module provides detailed specifications for various GPU models including NVIDIA and AMD GPUs. It includes information about compute capabilities, memory specifications, thread organization, and performance characteristics. 
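For orientation, here is a minimal, hedged sketch (using the `A100` and `H100` aliases listed below) that compares two architectures by compute capability and estimates theoretical occupancy for one hypothetical launch configuration:

```mojo
from gpu.host.info import A100, H100

def main():
    # Info instances compare by compute capability (8.0 vs. 9.0 here).
    print(A100 < H100)  # True
    # Theoretical occupancy of 256-thread blocks at 32 registers per thread.
    print(A100.occupancy(threads_per_block=256, registers_per_thread=32))
```

The occupancy result is a ratio between 0.0 and 1.0, as described under `Info.occupancy` above.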
## Aliases ### `A10` `alias A10 = Info(__init__[__mlir_type.!kgen.string]("A10"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ampere"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8.5999999999999996), __init__[__mlir_type.!kgen.string]("sm_86"), 72, 32, 1536, 32, 64, 2048, 32, 102400, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 16, 128, 4, 1024)` ### `A100` `alias A100 = Info(__init__[__mlir_type.!kgen.string]("A100"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ampere"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8), __init__[__mlir_type.!kgen.string]("sm_80"), 108, 32, 2048, 32, 64, 2048, 32, 167936, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)` ### `B100` `alias B100 = Info(__init__[__mlir_type.!kgen.string]("B100"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("blackwell"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](10), __init__[__mlir_type.!kgen.string]("sm_100a"), 132, 32, -1, 32, 64, 1536, 32, 262144, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)` ### `B200` `alias B200 = Info(__init__[__mlir_type.!kgen.string]("B200"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("blackwell"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](10), __init__[__mlir_type.!kgen.string]("sm_100a"), 148, 32, -1, 32, 64, 2048, 32, 233472, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)` ### `DEFAULT_GPU` `alias DEFAULT_GPU = from_name[::StringSlice[::Bool()` ### `DEFAULT_GPU_ARCH` `alias DEFAULT_GPU_ARCH = _accelerator_arch()` ### `DEFAULT_GPU_TARGET` `alias DEFAULT_GPU_TARGET = from_name[::StringSlice[::Bool().target()` ### `H100` `alias H100 = Info(__init__[__mlir_type.!kgen.string]("H100"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("hopper"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](9), __init__[__mlir_type.!kgen.string]("sm_90a"), 132, 32, 2048, 32, 64, 2048, 32, 233472, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)` ### `L4` `alias L4 = Info(__init__[__mlir_type.!kgen.string]("L4"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ada"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8.9000000000000004), __init__[__mlir_type.!kgen.string]("sm_89"), 58, 32, 1536, 32, 64, 2048, 32, 102400, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 24, 128, 4, 1024)` ### `MI300X` `alias MI300X = Info(__init__[__mlir_type.!kgen.string]("MI300X"), Vendor(__init__[__mlir_type.!pop.int_literal](1)), __init__[__mlir_type.!kgen.string]("hip"), __init__[__mlir_type.!kgen.string]("gfx942"), __init__[__mlir_type.!kgen.string](""), 
__init__[__mlir_type.!pop.float_literal](9.4000000000000003), __init__[__mlir_type.!kgen.string]("CDNA3"), 304, 64, 2048, 64, 32, 2048, 2, 65536, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 2, 128, 4, 1024)` ### `NoGPU` `alias NoGPU = Info(__init__[__mlir_type.!kgen.string]("NoGPU"), Vendor(__init__[__mlir_type.!pop.int_literal](0)), __init__[__mlir_type.!kgen.string]("none"), __init__[__mlir_type.!kgen.string]("no_gpu"), __init__[__mlir_type.!kgen.string](""), __init__[__mlir_type.!pop.int_literal](0), __init__[__mlir_type.!kgen.string](""), 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, __init__[__mlir_type.!kgen.string]("none"), 0, 0, 0, 0, 0, 0)` ### `OrinNano` `alias OrinNano = Info(__init__[__mlir_type.!kgen.string]("Orin Nano"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ampere"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8.6999999999999993), __init__[__mlir_type.!kgen.string]("sm_87"), 8, 32, 1536, 32, 64, 2048, 32, 167936, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 16, 128, 4, 1024)` ### `Radeon7600` `alias Radeon7600 = Info(__init__[__mlir_type.!kgen.string]("Radeon 7600"), Vendor(__init__[__mlir_type.!pop.int_literal](1)), __init__[__mlir_type.!kgen.string]("hip"), __init__[__mlir_type.!kgen.string]("gfx1102"), __init__[__mlir_type.!kgen.string](""), __init__[__mlir_type.!pop.float_literal](11), __init__[__mlir_type.!kgen.string]("RDNA3"), 32, 32, 1024, 32, 32, 1024, 2, 32768, 32768, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 32768, 2, 128, 4, 1024)` ### `Radeon7800` `alias Radeon7800 = Info(__init__[__mlir_type.!kgen.string]("Radeon 7800/7700"), Vendor(__init__[__mlir_type.!pop.int_literal](1)), __init__[__mlir_type.!kgen.string]("hip"), __init__[__mlir_type.!kgen.string]("gfx1101"), __init__[__mlir_type.!kgen.string](""), __init__[__mlir_type.!pop.float_literal](11), __init__[__mlir_type.!kgen.string]("RDNA3"), 60, 32, 1024, 32, 32, 1024, 2, 32768, 32768, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 32768, 2, 128, 4, 1024)` ### `Radeon780m` `alias Radeon780m = Info(__init__[__mlir_type.!kgen.string]("Radeon 780M"), Vendor(__init__[__mlir_type.!pop.int_literal](1)), __init__[__mlir_type.!kgen.string]("hip"), __init__[__mlir_type.!kgen.string]("gfx1103"), __init__[__mlir_type.!kgen.string](""), __init__[__mlir_type.!pop.float_literal](11), __init__[__mlir_type.!kgen.string]("RDNA3"), 12, 32, 1024, 32, 32, 1024, 2, 32768, 32768, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 32768, 2, 128, 4, 1024)` ### `Radeon7900` `alias Radeon7900 = Info(__init__[__mlir_type.!kgen.string]("Radeon 7900"), Vendor(__init__[__mlir_type.!pop.int_literal](1)), __init__[__mlir_type.!kgen.string]("hip"), __init__[__mlir_type.!kgen.string]("gfx1100"), __init__[__mlir_type.!kgen.string](""), __init__[__mlir_type.!pop.float_literal](11), __init__[__mlir_type.!kgen.string]("RDNA3"), 96, 32, 1024, 32, 32, 1024, 2, 32768, 32768, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 32768, 2, 128, 4, 1024)` ### `Radeon9060` `alias Radeon9060 = Info(__init__[__mlir_type.!kgen.string]("Radeon 9060"), Vendor(__init__[__mlir_type.!pop.int_literal](1)), __init__[__mlir_type.!kgen.string]("hip"), __init__[__mlir_type.!kgen.string]("gfx1200"), __init__[__mlir_type.!kgen.string](""), __init__[__mlir_type.!pop.float_literal](12), __init__[__mlir_type.!kgen.string]("RDNA4"), 32, 32, 1024, 32, 32, 1024, 2, 32768, 
32768, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 32768, 2, 128, 4, 1024)` ### `Radeon9070` `alias Radeon9070 = Info(__init__[__mlir_type.!kgen.string]("Radeon 9070"), Vendor(__init__[__mlir_type.!pop.int_literal](1)), __init__[__mlir_type.!kgen.string]("hip"), __init__[__mlir_type.!kgen.string]("gfx1201"), __init__[__mlir_type.!kgen.string](""), __init__[__mlir_type.!pop.float_literal](12), __init__[__mlir_type.!kgen.string]("RDNA4"), 64, 32, 1024, 32, 32, 1024, 2, 32768, 32768, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 32768, 2, 128, 4, 1024)` ### `RTX2060` `alias RTX2060 = Info(__init__[__mlir_type.!kgen.string]("RTX2060"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("turing"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](7.5), __init__[__mlir_type.!kgen.string]("sm_75"), 30, 32, 2048, 32, 64, 2048, 16, 65536, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 32768, 16, 32, 4, 1024)` ### `RTX4090` `alias RTX4090 = Info(__init__[__mlir_type.!kgen.string]("RTX4090"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ada lovelace"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8.9000000000000004), __init__[__mlir_type.!kgen.string]("sm_89"), 128, 32, -1, 32, 64, 1536, 24, 102400, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 24, 128, 4, 1024)` ### `RTX4090m` `alias RTX4090m = Info(__init__[__mlir_type.!kgen.string]("RTX4090m"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ada lovelace"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8.9000000000000004), __init__[__mlir_type.!kgen.string]("sm_89"), 76, 32, -1, 32, 64, 1536, 24, 102400, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 24, 128, 4, 1024)` ### `RTX5090` `alias RTX5090 = Info(__init__[__mlir_type.!kgen.string]("RTX5090"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("blackwell"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](12), __init__[__mlir_type.!kgen.string]("sm_120a"), 170, 32, -1, 32, 64, 1536, 32, 59392, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)` ## Structs * [​`Info`](/mojo/stdlib/gpu/host/info/Info): Comprehensive information about a GPU architecture. * [​`Vendor`](/mojo/stdlib/gpu/host/info/Vendor): Represents GPU vendors. ## Functions * [​`is_cpu`](/mojo/stdlib/gpu/host/info/is_cpu): Checks if the target is a CPU (compile-time version). * [​`is_gpu`](/mojo/stdlib/gpu/host/info/is_gpu): Checks if the target is a GPU (compile-time version). * [​`is_valid_target`](/mojo/stdlib/gpu/host/info/is_valid_target): Checks if the target is valid (compile-time version). --- ## is_cpu `is_cpu[: Bool, : Origin[$0], //, target: StringSlice[$1]]() -> Bool` Checks if the target is a CPU (compile-time version). **Parameters:** * ​target (`StringSlice[$1]`): Target string to check. **Returns:** True if the target is a CPU, False otherwise. `is_cpu(target: StringSlice[origin]) -> Bool` Checks if the target is a CPU (runtime version). 
**Args:** * ​target (`StringSlice[origin]`): Target string to check. **Returns:** True if the target is a CPU, False otherwise. --- ## is_gpu `is_gpu[: Bool, : Origin[$0], //, target: StringSlice[$1]]() -> Bool` Checks if the target is a GPU (compile-time version). **Parameters:** * ​target (`StringSlice[$1]`): Target string to check. **Returns:** True if the target is a GPU, False otherwise. `is_gpu(target: StringSlice[origin]) -> Bool` Checks if the target is a GPU (runtime version). **Args:** * ​target (`StringSlice[origin]`): Target string to check. **Returns:** True if the target is a GPU, False otherwise. --- ## is_valid_target `is_valid_target[: Bool, : Origin[$0], //, target: StringSlice[$1]]() -> Bool` Checks if the target is valid (compile-time version). **Parameters:** * ​target (`StringSlice[$1]`): Target string to check. **Returns:** True if the target is valid (CPU or GPU), False otherwise. `is_valid_target(target: StringSlice[origin]) -> Bool` Checks if the target is valid (runtime version). **Args:** * ​target (`StringSlice[origin]`): Target string to check. **Returns:** True if the target is valid (CPU or GPU), False otherwise. --- ## AccessPolicyWindow `@register_passable(trivial)` `struct AccessPolicyWindow` Specifies an access policy for a window of memory. This struct defines a contiguous extent of memory beginning at base\_ptr and ending at base\_ptr + num\_bytes, with associated access policies. It allows fine-grained control over how memory is accessed and cached, which can significantly impact performance for memory-bound workloads. The window is partitioned into segments with different access properties based on the hit\_ratio. Accesses to "hit segments" use the hit\_prop policy, while accesses to "miss segments" use the miss\_prop policy. Note: The `num_bytes` value is limited by `CU_DEVICE_ATTRIBUTE_MAX_ACCESS_POLICY_WINDOW_SIZE`. The CUDA driver may align the `base_ptr` and restrict the maximum size. ## Fields * ​base\_ptr (`UnsafePointer[NoneType]`): Starting address of the access policy window. Driver may align it. * ​num\_bytes (`Int`): Size in bytes of the window policy. CUDA driver may restrict the maximum size and alignment. * ​hit\_ratio (`SIMD[float32, 1]`): Specifies percentage of lines assigned hit\_prop, rest are assigned miss\_prop. Value should be between 0.0 and 1.0. * ​hit\_prop (`AccessProperty`): AccessProperty applied to hit segments within the window. * ​miss\_prop (`AccessProperty`): AccessProperty applied to miss segments within the window. Must be either NORMAL or STREAMING. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Initializes a new AccessPolicyWindow with default values. `__init__[T: AnyType](*, base_ptr: UnsafePointer[T, address_space=address_space, alignment=alignment, mut=mut, origin=origin], count: Int, hit_ratio: SIMD[float32, 1], hit_prop: AccessProperty = AccessProperty(__init__[__mlir_type.!pop.int_literal](0)), miss_prop: AccessProperty = AccessProperty(__init__[__mlir_type.!pop.int_literal](0))) -> Self` Initializes an `AccessPolicyWindow` for a typed memory region. **Parameters:** * ​T (`AnyType`): The type of data in the memory region. **Args:** * ​base\_ptr (`UnsafePointer[T, address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the start of the memory region. * ​count (`Int`): Number of elements of type T in the memory region. 
* ​hit\_ratio (`SIMD[float32, 1]`): Fraction of the window that should use hit\_prop (0.0 to 1.0). * ​hit\_prop (`AccessProperty`): Access property for hit segments (default: NORMAL). * ​miss\_prop (`AccessProperty`): Access property for miss segments (default: NORMAL). ### `__str__` `__str__(self) -> String` Returns a string representation of the `AccessPolicyWindow`. **Returns:** A string representation of the `AccessPolicyWindow`. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes a string representation of the `AccessPolicyWindow` to a writer. This method formats all the fields of the AccessPolicyWindow into a human-readable string representation and writes it to the provided writer. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer trait. **Args:** * ​writer (`W`): The writer instance to write the formatted string to. --- ## AccessProperty `@register_passable(trivial)` `struct AccessProperty` Specifies a performance hint with `AccessPolicyWindow` for hit\_prop and miss\_prop fields. This struct defines cache persistence properties that can be used with `AccessPolicyWindow` to control how data is cached during GPU memory accesses. It provides hints to the memory subsystem about the expected access patterns, which can improve performance for specific workloads. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`, `Writable` ## Aliases ### `NORMAL` `alias NORMAL = AccessProperty(__init__[__mlir_type.!pop.int_literal](0))` Normal cache persistence with default caching behavior. ### `PERSISTING` `alias PERSISTING = AccessProperty(__init__[__mlir_type.!pop.int_literal](2))` Persisting access is more likely to persist in cache, optimized for reused data. ### `STREAMING` `alias STREAMING = AccessProperty(__init__[__mlir_type.!pop.int_literal](1))` Streaming access is less likely to persist in cache, optimized for single-use data. ## Methods ### `__init__` `__init__(*, other: Self) -> Self` Explicitly construct a deep copy of the provided value. **Args:** * ​other (`Self`): The value to copy. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compares two `AccessProperty` instances for equality. **Args:** * ​other (`Self`): The `AccessProperty` to compare with. **Returns:** True if the instances have the same value, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compares two `AccessProperty` instances for inequality. **Args:** * ​other (`Self`): The `AccessProperty` to compare with. **Returns:** True if the instances have different values, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Checks if two `AccessProperty` instances have the same value. **Args:** * ​other (`Self`): The `AccessProperty` to compare with. **Returns:** True if the instances have the same value, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Checks if two `AccessProperty` instances have different values. **Args:** * ​other (`Self`): The `AccessProperty` to compare with. **Returns:** True if the instances have different values, False otherwise. ### `__str__` `__str__(self) -> String` Returns a string representation of the `AccessProperty`. **Returns:** A string representation of the `AccessProperty`. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes a string representation of the `AccessProperty` to a writer. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer trait. **Args:** * ​writer (`W`): The writer instance to write the formatted string to.
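To make the relationship between these two types concrete, here is a minimal, hedged sketch that builds an `AccessPolicyWindow` over a freshly allocated buffer; the buffer size and hit ratio are arbitrary:

```mojo
from gpu.host.launch_attribute import AccessPolicyWindow, AccessProperty

def main():
    var data = UnsafePointer[Scalar[DType.float32]].alloc(1024)
    # Accesses in hit segments are more likely to persist in cache;
    # miss segments are treated as streaming (single-use) data.
    var window = AccessPolicyWindow(
        base_ptr=data,
        count=1024,
        hit_ratio=0.6,
        hit_prop=AccessProperty.PERSISTING,
        miss_prop=AccessProperty.STREAMING,
    )
    print(window)  # AccessPolicyWindow is Writable.
    data.free()
```

Because `LaunchAttribute` (below) has an `@implicit` initializer from `AccessPolicyWindow`, such a window can be passed wherever a launch attribute is expected.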
--- ## LaunchAttribute `@register_passable(trivial)` `struct LaunchAttribute` Represents a complete launch attribute with ID and value. This struct combines a `LaunchAttributeID` and `LaunchAttributeValue` to form a complete attribute that can be passed to GPU kernel launches. It provides a way to specify various execution parameters that control kernel behavior. ## Fields * ​id (`LaunchAttributeID`): The identifier specifying the type of this launch attribute. * ​\_\_pad (`StaticTuple[SIMD[uint8, 1], ((sizeof[::AnyType,__mlir_type.!kgen.target]() * -1) + 8)]`): Padding to ensure proper alignment of the structure. * ​value (`LaunchAttributeValue`): The value associated with this launch attribute. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Initializes a new LaunchAttribute with IGNORE ID and zeroed value. `__init__(id: LaunchAttributeID, value: LaunchAttributeValue) -> Self` Initializes a `LaunchAttribute` with a specific ID and value. **Args:** * ​id (`LaunchAttributeID`): The `LaunchAttributeID` to set. * ​value (`LaunchAttributeValue`): The `LaunchAttributeValue` to set. `@implicit` `__init__(policy: AccessPolicyWindow) -> Self` Initializes a `LaunchAttribute` from an `AccessPolicyWindow`. Creates a launch attribute with `ACCESS_POLICY_WINDOW` ID and the provided policy. **Args:** * ​policy (`AccessPolicyWindow`): The `AccessPolicyWindow` to use for this attribute. ### `from_cluster_dim` `static from_cluster_dim(dim: Dim) -> Self` Creates a `LaunchAttribute` for cluster dimensions. Creates a launch attribute with `CLUSTER_DIMENSION` ID and the provided dimensions. **Args:** * ​dim (`Dim`): The dimensions to use for this attribute. **Returns:** A new `LaunchAttribute` configured with the specified cluster dimensions. --- ## LaunchAttributeID `@register_passable(trivial)` `struct LaunchAttributeID` Identifies the type of launch attribute for GPU kernel execution. This struct represents the various types of launch attributes that can be specified when launching GPU kernels or configuring streams and graph nodes. Each attribute controls different aspects of kernel execution behavior such as memory access policies, synchronization, scheduling, and resource allocation. The attributes are compatible with CUDA's launch attribute system and provide fine-grained control over kernel execution characteristics. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`, `Writable` ## Aliases ### `ACCESS_POLICY_WINDOW` `alias ACCESS_POLICY_WINDOW = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](1))` Valid for streams, graph nodes, launches. ### `CLUSTER_DIMENSION` `alias CLUSTER_DIMENSION = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](4))` Valid for graph nodes, launches. ### `CLUSTER_SCHEDULING_POLICY_PREFERENCE` `alias CLUSTER_SCHEDULING_POLICY_PREFERENCE = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](5))` Valid for graph nodes, launches. ### `COOPERATIVE` `alias COOPERATIVE = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](2))` Valid for graph nodes, launches. ### `DEVICE_UPDATABLE_KERNEL_NODE` `alias DEVICE_UPDATABLE_KERNEL_NODE = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](13))` Valid for graph nodes, launches. This attribute is graphs-only, and passing it to a launch in a non-capturing stream will result in an error.
CUlaunchAttributeValue::deviceUpdatableKernelNode::deviceUpdatable can only be set to 0 or 1. Setting the field to 1 indicates that the corresponding kernel node should be device-updatable. On success, a handle will be returned via CUlaunchAttributeValue::deviceUpdatableKernelNode::devNode which can be passed to the various device-side update functions to update the node's kernel parameters from within another kernel. For more information on the types of device updates that can be made, as well as the relevant limitations thereof, see cudaGraphKernelNodeUpdatesApply. Nodes which are device-updatable have additional restrictions compared to regular kernel nodes. Firstly, device-updatable nodes cannot be removed from their graph via cuGraphDestroyNode. Additionally, once opted-in to this functionality, a node cannot opt out, and any attempt to set the deviceUpdatable attribute to 0 will result in an error. Device-updatable kernel nodes also cannot have their attributes copied to/from another kernel node via cuGraphKernelNodeCopyAttributes. Graphs containing one or more device-updatable nodes also do not allow multiple instantiation, and neither the graph nor its instantiated version can be passed to cuGraphExecUpdate. If a graph contains device-updatable nodes and updates those nodes from the device from within the graph, the graph must be uploaded with cuGraphUpload before it is launched. For such a graph, if host-side executable graph updates are made to the device-updatable nodes, the graph must be uploaded before it is launched again. ### `IGNORE` `alias IGNORE = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](0))` Ignored entry, for convenient composition. ### `LAUNCH_COMPLETION_EVENT` `alias LAUNCH_COMPLETION_EVENT = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](12))` Valid for launches. Set CUlaunchAttributeValue::launchCompletionEvent to record the event. Nominally, the event is triggered once all blocks of the kernel have begun execution. Currently this is a best effort. If a kernel B has a launch completion dependency on a kernel A, B may wait until A is complete. Alternatively, blocks of B may begin before all blocks of A have begun, for example if B can claim execution resources unavailable to A (e.g. they run on different GPUs) or if B is a higher priority than A. Exercise caution if such an ordering inversion could lead to deadlock. A launch completion event is nominally similar to a programmatic event with triggerAtBlockStart set except that it is not visible to cudaGridDependencySynchronize() and can be used with compute capability less than 9.0. The event supplied must not be an interprocess or interop event. The event must disable timing (i.e. must be created with the CU\_EVENT\_DISABLE\_TIMING flag set). ### `MEM_SYNC_DOMAIN` `alias MEM_SYNC_DOMAIN = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](10))` Valid for streams, graph nodes, launches. ### `MEM_SYNC_DOMAIN_MAP` `alias MEM_SYNC_DOMAIN_MAP = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](9))` Valid for streams, graph nodes, launches. ### `PREFERRED_SHARED_MEMORY_CARVEOUT` `alias PREFERRED_SHARED_MEMORY_CARVEOUT = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](14))` Valid for launches. 
On devices where the L1 cache and shared memory use the same hardware resources, setting CUlaunchAttributeValue::sharedMemCarveout to a percentage between 0 and 100 signals the CUDA driver to set the shared memory carveout preference, in percent of the total shared memory for that kernel launch. This attribute takes precedence over CU\_FUNC\_ATTRIBUTE\_PREFERRED\_SHARED\_MEMORY\_CARVEOUT. This is only a hint, and the CUDA driver can choose a different configuration if required for the launch. ### `PRIORITY` `alias PRIORITY = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](8))` Valid for streams, graph nodes, launches. ### `PROGRAMMATIC_EVENT` `alias PROGRAMMATIC_EVENT = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](7))` Valid for launches. Set CUlaunchAttributeValue::programmaticEvent to record the event. An event recorded through this launch attribute is guaranteed to trigger only after all blocks in the associated kernel trigger the event. A block can trigger the event through PTX launchdep.release or the CUDA builtin function cudaTriggerProgrammaticLaunchCompletion(). A trigger can also be inserted at the beginning of each block's execution if triggerAtBlockStart is set to non-0. The dependent launches can choose to wait on the dependency using the programmatic sync (cudaGridDependencySynchronize() or equivalent PTX instructions). Note that dependents (including the CPU thread calling cuEventSynchronize()) are not guaranteed to observe the release precisely when it is released. For example, cuEventSynchronize() may only observe the event trigger long after the associated kernel has completed. This recording type is primarily meant for establishing programmatic dependency between device tasks. Note also this type of dependency allows, but does not guarantee, concurrent execution of tasks. The event supplied must not be an interprocess or interop event. The event must disable timing (i.e. must be created with the CU\_EVENT\_DISABLE\_TIMING flag set). ### `PROGRAMMATIC_STREAM_SERIALIZATION` `alias PROGRAMMATIC_STREAM_SERIALIZATION = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](6))` Valid for launches. Setting CUlaunchAttributeValue::programmaticStreamSerializationAllowed to non-0 signals that the kernel will use programmatic means to resolve its stream dependency, so that the CUDA runtime should opportunistically allow the grid's execution to overlap with the previous kernel in the stream, if that kernel requests the overlap. The dependent launches can choose to wait on the dependency using the programmatic sync. ### `SYNCHRONIZATION_POLICY` `alias SYNCHRONIZATION_POLICY = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](3))` Valid for streams. ## Methods ### `__init__` `__init__(*, other: Self) -> Self` Explicitly construct a deep copy of the provided value. **Args:** * ​other (`Self`): The value to copy. ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two `LaunchAttributeID` instances are equal. Compares the underlying integer values of the attributes. **Args:** * ​other (`Self`): The other `LaunchAttributeID` instance to compare with. **Returns:** True if the attributes are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if two `LaunchAttributeID` instances are not equal. **Args:** * ​other (`Self`): The other `LaunchAttributeID` instance to compare with. **Returns:** True if the attributes are not equal, False otherwise.
### `__is__` `__is__(self, other: Self) -> Bool` Checks if two `LaunchAttributeID` instances have the same value. This is an identity comparison that delegates to equality comparison. **Args:** * ​other (`Self`): The other `LaunchAttributeID` instance to compare with. **Returns:** True if the attributes have the same value, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Checks if two `LaunchAttributeID` instances have different values. **Args:** * ​other (`Self`): The other `LaunchAttributeID` instance to compare with. **Returns:** True if the attributes have different values, False otherwise. ### `__str__` `__str__(self) -> String` Returns a string representation of the `LaunchAttributeID`. **Returns:** A string representation of the attribute. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the string representation of the attribute to a writer. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer interface. **Args:** * ​writer (`W`): The writer to write to. --- ## LaunchAttributeValue `@register_passable(trivial)` `struct LaunchAttributeValue` Represents a value for a CUDA launch attribute. This struct emulates a C union to store different types of launch attribute values. It provides fixed-size storage that can be initialized with different attribute types such as AccessPolicyWindow or dimension specifications. Note: This implementation uses a fixed-size byte array to emulate the union behavior defined in the CUDA Driver API's CUlaunchAttributeValue. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Initializes a new `LaunchAttributeValue` with zeroed storage. `@implicit` `__init__(policy: AccessPolicyWindow) -> Self` Initializes a `LaunchAttributeValue` from an `AccessPolicyWindow`. **Args:** * ​policy (`AccessPolicyWindow`): The `AccessPolicyWindow` to store in this attribute value. `@implicit` `__init__(dim: Dim) -> Self` Initializes a `LaunchAttributeValue` from a `Dim` (dimension) object. **Args:** * ​dim (`Dim`): The dimension specification to store in this attribute value. `@implicit` `__init__(value: Bool) -> Self` Initializes a `LaunchAttributeValue` from a boolean value. **Args:** * ​value (`Bool`): The boolean value to store in this attribute value. --- ## launch_attribute GPU Launch Attributes for Kernel Execution Control This module provides structures for configuring GPU kernel execution through launch attributes. It implements a Mojo interface to CUDA's launch attribute system, allowing fine-grained control over kernel execution characteristics such as memory access policies, synchronization behavior, cluster dimensions, and resource allocation. The main components include: * `LaunchAttributeID`: Identifies different types of launch attributes * `LaunchAttributeValue`: Stores the value for a specific attribute type * `LaunchAttribute`: Combines an ID and value to form a complete attribute * `AccessPolicyWindow`: Configures memory access patterns and caching behavior * `AccessProperty`: Defines cache persistence properties for memory access These structures enable optimizing GPU kernel performance by controlling execution parameters at a granular level, similar to CUDA's native launch attribute system. ## Structs * [​`AccessPolicyWindow`](/mojo/stdlib/gpu/host/launch_attribute/AccessPolicyWindow): Specifies an access policy for a window of memory.
* [​`AccessProperty`](/mojo/stdlib/gpu/host/launch_attribute/AccessProperty): Specifies a performance hint with `AccessPolicyWindow` for hit\_prop and miss\_prop fields. * [​`LaunchAttribute`](/mojo/stdlib/gpu/host/launch_attribute/LaunchAttribute): Represents a complete launch attribute with ID and value. * [​`LaunchAttributeID`](/mojo/stdlib/gpu/host/launch_attribute/LaunchAttributeID): Identifies the type of launch attribute for GPU kernel execution. * [​`LaunchAttributeValue`](/mojo/stdlib/gpu/host/launch_attribute/LaunchAttributeValue): Represents a value for a CUDA launch attribute. --- ## id This module provides GPU thread and block indexing functionality. It defines aliases and functions for accessing GPU grid, block, thread and cluster dimensions and indices. These are essential primitives for GPU programming that allow code to determine its position and dimensions within the GPU execution hierarchy. Most functionality is architecture-agnostic, with some NVIDIA-specific features clearly marked. The module is designed to work seamlessly across different GPU architectures while providing optimal performance through hardware-specific optimizations where applicable. ## Aliases ### `block_dim` `alias block_dim = _BlockDim()` Contains the dimensions of the block as `x`, `y`, and `z` values (for example, `block_dim.y`). ### `block_id_in_cluster` `alias block_id_in_cluster = _Cluster_BlockIdx()` Contains the block id of the threadblock within a cluster, as `x`, `y`, and `z` values. ### `block_idx` `alias block_idx = _BlockIdx()` Contains the block index in the grid, as `x`, `y`, and `z` values. ### `cluster_dim` `alias cluster_dim = _ClusterDim()` Contains the dimensions of the cluster, as `x`, `y`, and `z` values. ### `cluster_idx` `alias cluster_idx = _ClusterIdx()` Contains the cluster index in the grid, as `x`, `y`, and `z` values. ### `global_idx` `alias global_idx = _GridIdx()` Contains the global offset of the kernel launch, as `x`, `y`, and `z` values. ### `grid_dim` `alias grid_dim = _GridDim()` Provides accessors for getting the `x`, `y`, and `z` dimensions of a grid. ### `thread_idx` `alias thread_idx = _ThreadIdx()` Contains the thread index in the block, as `x`, `y`, and `z` values. ## Functions * [​`lane_id`](/mojo/stdlib/gpu/id/lane_id): Returns the lane ID of the current thread within its warp. * [​`sm_id`](/mojo/stdlib/gpu/id/sm_id): Returns the Streaming Multiprocessor (SM) ID of the current thread. * [​`warp_id`](/mojo/stdlib/gpu/id/warp_id): Returns the warp ID of the current thread within its block. --- ## lane_id `lane_id() -> UInt` Returns the lane ID of the current thread within its warp. The lane ID is a unique identifier for each thread within a warp, ranging from 0 to WARP\_SIZE-1. This ID is commonly used for warp-level programming and thread synchronization within a warp. **Returns:** The lane ID (0 to WARP\_SIZE-1) of the current thread. --- ## sm_id `sm_id() -> UInt` Returns the Streaming Multiprocessor (SM) ID of the current thread. The SM ID uniquely identifies which physical streaming multiprocessor the thread is executing on. This is useful for SM-level optimizations and understanding hardware utilization. If called on non-NVIDIA GPUs, this function aborts as this functionality is only supported on NVIDIA hardware. **Returns:** The SM ID of the current thread.
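To show how the indexing aliases and warp-level ID functions in this module fit together, here is a minimal, hedged kernel sketch (the function name and its arguments are hypothetical):

```mojo
from gpu.id import block_dim, block_idx, thread_idx, lane_id

fn fill_lanes(out: UnsafePointer[Scalar[DType.float32]], size: UInt):
    # Flat global index computed from the block/thread hierarchy.
    var i = block_idx.x * block_dim.x + thread_idx.x
    if i < size:
        # Record this thread's position within its warp (0 to WARP_SIZE-1).
        out[i] = Scalar[DType.float32](lane_id())
```

A kernel like this is compiled and launched on an accelerator through [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext).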
--- ## warp_id `warp_id() -> UInt` Returns the warp ID of the current thread within its block. The warp ID is a unique identifier for each warp within a block, ranging from 0 to BLOCK\_SIZE/WARP\_SIZE-1. This ID is commonly used for warp-level programming and synchronization within a block. **Returns:** The warp ID (0 to BLOCK\_SIZE/WARP\_SIZE-1) of the current thread. --- ## gpu Provides low-level programming constructs for working with GPUs. These low-level constructs allow you to write code that runs on the GPU in a traditional programming style, partitioning work across threads that are mapped onto 1-, 2-, or 3-dimensional blocks. The thread blocks can subsequently be grouped into a grid of thread blocks. A *kernel* is a function that runs on the GPU in parallel across many threads. Currently, the [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext) struct provides the interface for compiling and launching GPU kernels inside MAX [custom operations](/max/custom-ops/). The [`gpu.host`](/mojo/stdlib/gpu/host/) package includes APIs to manage interaction between the *host* (that is, the CPU) and *device* (that is, the GPU or accelerator). See the [`gpu.id`](/mojo/stdlib/gpu/id#aliases) module for a list of aliases you can use to access information about the grid and the current thread, including block dimensions, block index in the grid, and thread index. The [`sync`](/mojo/stdlib/gpu/sync/) module provides functions for synchronizing threads. For an example of launching a GPU kernel from a MAX custom operation, see the [vector addition example](https://github.com/modular/modular/blob/main/examples/custom_ops/kernels/vector_addition.mojo) in the MAX repo. ## Packages * [​`comm`](/mojo/stdlib/gpu/comm/): The `gpu.comm` package provides communication primitives for GPUs. * [​`host`](/mojo/stdlib/gpu/host/): Implements the gpu host package. ## Modules * [​`block`](/mojo/stdlib/gpu/block/): GPU block-level operations and utilities. * [​`cluster`](/mojo/stdlib/gpu/cluster/): This module provides low-level NVIDIA GPU cluster synchronization primitives for SM90+ architectures. * [​`globals`](/mojo/stdlib/gpu/globals/): This module provides GPU-specific global constants and configuration values. * [​`grid_controls`](/mojo/stdlib/gpu/grid_controls/): Grid Dependent Control primitives for NVIDIA Hopper (SM90+) GPUs. * [​`id`](/mojo/stdlib/gpu/id/): This module provides GPU thread and block indexing functionality. * [​`intrinsics`](/mojo/stdlib/gpu/intrinsics/): Provides low-level GPU intrinsic operations and memory access primitives. * [​`memory`](/mojo/stdlib/gpu/memory/): This module provides GPU memory operations and utilities. * [​`mma`](/mojo/stdlib/gpu/mma/): This module includes utilities for working with the warp-matrix-matrix-multiplication (wmma) instructions. * [​`mma_operand_descriptor`](/mojo/stdlib/gpu/mma_operand_descriptor/): * [​`mma_sm100`](/mojo/stdlib/gpu/mma_sm100/): This module includes utilities for working with the SM100 MMA instructions. * [​`mma_util`](/mojo/stdlib/gpu/mma_util/): Matrix multiply accumulate (MMA) utilities for GPU tensor cores. * [​`profiler`](/mojo/stdlib/gpu/profiler/): This module provides GPU profiling functionality. * [​`random`](/mojo/stdlib/gpu/random/): Random number generation for GPU kernels. * [​`semaphore`](/mojo/stdlib/gpu/semaphore/): This module provides a device-wide semaphore implementation for NVIDIA GPUs. * [​`sync`](/mojo/stdlib/gpu/sync/): This module provides GPU synchronization primitives and barriers.
* [​`tcgen05`](/mojo/stdlib/gpu/tcgen05/): This module includes utilities for working with the tensorcore 5th generation (tcgen05) instructions. * [​`tensor_ops`](/mojo/stdlib/gpu/tensor_ops/): This module provides tensor core operations and utilities for GPU computation. * [​`warp`](/mojo/stdlib/gpu/warp/): GPU warp-level operations and utilities. --- ## Scope `struct Scope` Represents memory synchronization scope levels for GPU memory operations. Defines different scopes of memory visibility and synchronization, from thread-local to system-wide. Each scope level determines how memory operations are ordered and visible across different execution units. The scope levels form a hierarchy, with each higher level providing stronger ordering guarantees but potentially higher synchronization costs. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility`, `Writable` ## Aliases ### `BLOCK` `alias BLOCK = Scope(3)` Block-level scope. Memory operations ordered within a thread block/CTA. ### `CLUSTER` `alias CLUSTER = Scope(4)` Cluster-level scope. Memory operations ordered within a thread block cluster. ### `GPU` `alias GPU = Scope(5)` GPU-level scope. Memory operations are ordered across all threads on the GPU. ### `NONE` `alias NONE = Scope(0)` No memory ordering guarantees. Operations may be reordered freely. ### `SYSTEM` `alias SYSTEM = Scope(6)` System-wide scope. Memory operations ordered across the entire system. ### `THREAD` `alias THREAD = Scope(1)` Thread-level scope. Memory operations are ordered within a single thread. ### `WARP` `alias WARP = Scope(2)` Warp-level scope. Memory operations are ordered within a warp of threads. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two `Scope` instances are equal. Uses pointer comparison for efficiency. **Args:** * ​other (`Self`): The other `Scope` instance to compare with. **Returns:** True if the instances are the same, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if two `Scope` instances are not equal. **Args:** * ​other (`Self`): The other `Scope` instance to compare with. **Returns:** True if the instances are different, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Checks if two `Scope` instances have the same value. Compares the underlying integer values. **Args:** * ​other (`Self`): The other `Scope` instance to compare with. **Returns:** True if the values are the same, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Checks if two `Scope` instances have different values. **Args:** * ​other (`Self`): The other `Scope` instance to compare with. **Returns:** True if the values are different, False otherwise. ### `write_to` `write_to[W: Writer](self, mut w: W)` Writes the string representation of the scope to a writer. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer interface. **Args:** * ​w (`W`): The writer to write to. ### `__str__` `__str__(self) -> String` Returns the string representation of the memory scope. **Returns:** A string representation of the memory scope. ### `__repr__` `__repr__(self) -> String` Returns the string representation of the memory scope. **Returns:** A string representation of the memory scope. ### `mnemonic` `mnemonic(self) -> StringSlice[StaticConstantOrigin]` Returns the mnemonic string representation of the memory scope. Converts the memory scope level into a string mnemonic used by LLVM/NVVM intrinsics for memory operations. **Returns:** A string literal containing the mnemonic.
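As a hedged sketch of how a `Scope` is threaded through the acquire/release intrinsics documented later in this module (`load_acquire` and `store_release`), assuming a flag pointer that was allocated in global memory elsewhere:

```mojo
from gpu.intrinsics import Scope, load_acquire, store_release

fn publish(flag: UnsafePointer[Scalar[DType.int32]]):
    # Make this thread's prior writes visible GPU-wide, then raise the flag.
    store_release[scope=Scope.GPU](flag, 1)

fn consume(flag: UnsafePointer[Scalar[DType.int32]]) -> Scalar[DType.int32]:
    # No subsequent memory operation executes until this load completes.
    return load_acquire[scope=Scope.GPU](flag)
```

Choosing a narrower scope (for example, `Scope.BLOCK`) can reduce synchronization cost when visibility is only needed within a thread block, per the hierarchy described above.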
--- ## buffer_load `buffer_load[type: DType, width: Int](src_resource: SIMD[uint32, 4], gds_offset: SIMD[int32, 1]) -> SIMD[type, width]` Loads data from global memory into a SIMD register. This function provides a hardware-accelerated global memory load operation that maps directly to the AMDGPU buffer\_load instruction. It efficiently transfers data from global memory to registers. Note: * Only supported on AMD GPUs. * Uses non-glc loads by default (can hit L1 cache and persist across wavefronts). * Supports widths that map to 1, 2, 4, 8, or 16 byte loads. * Maps directly to llvm.amdgcn.raw.buffer.load intrinsics. **Parameters:** * ​type (`DType`): The data type to load. * ​width (`Int`): The SIMD vector width for vectorized loads. **Args:** * ​src\_resource (`SIMD[uint32, 4]`): Buffer resource descriptor created by make\_buffer\_resource(). * ​gds\_offset (`SIMD[int32, 1]`): Offset in elements (not bytes) from the base address in the resource. **Returns:** SIMD vector containing the loaded data. --- ## buffer_load_store_lds `buffer_load_store_lds[type: DType](src_resource: SIMD[uint32, 4], gds_offset: SIMD[int32, 1], lds_ptr_base: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(3)], lds_offset: SIMD[int32, 1])` Loads four bytes from global memory and writes them to shared memory. Copies from global memory to shared memory (also known as LDS), bypassing registers. **Parameters:** * ​type (`DType`): The type of the data to be loaded. **Args:** * ​src\_resource (`SIMD[uint32, 4]`): Buffer resource descriptor from make\_buffer\_resource. * ​gds\_offset (`SIMD[int32, 1]`): Global memory offset. * ​lds\_ptr\_base (`UnsafePointer[SIMD[type, 1], address_space=AddressSpace(3)]`): LDS base address. * ​lds\_offset (`SIMD[int32, 1]`): LDS offset. --- ## buffer_store `buffer_store[type: DType, width: Int](src_resource: SIMD[uint32, 4], gds_offset: SIMD[int32, 1], val: SIMD[type, width])` Stores a register variable to global memory. Writes to global memory from a register. **Parameters:** * ​type (`DType`): The data type. * ​width (`Int`): The SIMD vector width. **Args:** * ​src\_resource (`SIMD[uint32, 4]`): Buffer resource descriptor. * ​gds\_offset (`SIMD[int32, 1]`): Global memory offset. * ​val (`SIMD[type, width]`): Value to write. --- ## byte_permute `byte_permute(a: SIMD[uint32, 1], b: SIMD[uint32, 1], c: SIMD[uint32, 1]) -> SIMD[uint32, 1]` Permutes bytes from two 32-bit integers based on a control mask. Selects and rearranges bytes from two source integers based on a control mask to create a new 32-bit value. Note: Byte selection behavior depends on the GPU architecture: * On NVIDIA: Maps to PRMT instruction * On AMD: Maps to PERM instruction. **Args:** * ​a (`SIMD[uint32, 1]`): First source integer containing bytes to select from. * ​b (`SIMD[uint32, 1]`): Second source integer containing bytes to select from. * ​c (`SIMD[uint32, 1]`): Control mask that specifies which bytes to select and their positions. Each byte in the mask controls selection/placement of one output byte. **Returns:** A new 32-bit integer containing the selected and rearranged bytes. --- ## intrinsics Provides low-level GPU intrinsic operations and memory access primitives. Implements hardware-specific intrinsics that map directly to GPU assembly instructions, focusing on NVIDIA GPU architectures.
Includes: * Global memory load/store operations with cache control * Warp-level primitives and synchronization * Memory fence and barrier operations * Atomic operations and memory ordering primitives These low-level primitives should be used carefully as they correspond directly to hardware instructions and require understanding of the underlying GPU architecture. ## Structs * [​`Scope`](/mojo/stdlib/gpu/intrinsics/Scope): Represents memory synchronization scope levels for GPU memory operations. ## Functions * [​`buffer_load`](/mojo/stdlib/gpu/intrinsics/buffer_load): Loads data from global memory into a SIMD register. * [​`buffer_load_store_lds`](/mojo/stdlib/gpu/intrinsics/buffer_load_store_lds): Loads four bytes from global memory and writes them to shared memory. * [​`buffer_store`](/mojo/stdlib/gpu/intrinsics/buffer_store): Stores a register variable to global memory. * [​`byte_permute`](/mojo/stdlib/gpu/intrinsics/byte_permute): Permutes bytes from two 32-bit integers based on a control mask. * [​`ldg`](/mojo/stdlib/gpu/intrinsics/ldg): Load data from global memory through the non-coherent cache. * [​`load_acquire`](/mojo/stdlib/gpu/intrinsics/load_acquire): Performs an atomic load operation with acquire memory ordering semantics. * [​`load_volatile`](/mojo/stdlib/gpu/intrinsics/load_volatile): Performs a volatile load operation that cannot be optimized away. * [​`lop`](/mojo/stdlib/gpu/intrinsics/lop): Performs an arbitrary logical operation on 3 inputs using a lookup table. * [​`make_buffer_resource`](/mojo/stdlib/gpu/intrinsics/make_buffer_resource): Creates a 128-bit buffer resource descriptor for AMD GPU buffer operations. * [​`mulhi`](/mojo/stdlib/gpu/intrinsics/mulhi): Calculates the most significant 32 bits of the product of two 16-bit unsigned integers. * [​`mulwide`](/mojo/stdlib/gpu/intrinsics/mulwide): Performs a wide multiplication of two 32-bit unsigned integers. * [​`store_release`](/mojo/stdlib/gpu/intrinsics/store_release): Performs an atomic store with release memory ordering semantics. * [​`store_volatile`](/mojo/stdlib/gpu/intrinsics/store_volatile): Performs a volatile store operation that cannot be optimized away. * [​`threadfence`](/mojo/stdlib/gpu/intrinsics/threadfence): Enforces ordering of memory operations across threads. * [​`warpgroup_reg_alloc`](/mojo/stdlib/gpu/intrinsics/warpgroup_reg_alloc): Allocates additional registers for the executing warp group. * [​`warpgroup_reg_dealloc`](/mojo/stdlib/gpu/intrinsics/warpgroup_reg_dealloc): Deallocates additional registers for the executing warp group. --- ## ldg `ldg[type: DType, //, width: Int = 1, *, alignment: Int = alignof[::AnyType,__mlir_type.!kgen.target]()](x: UnsafePointer[SIMD[type, 1]]) -> SIMD[type, width]` Load data from global memory through the non-coherent cache. This function provides a hardware-accelerated global memory load operation that uses the GPU's non-coherent cache (equivalent to CUDA's `__ldg` instruction). It optimizes for read-only data access patterns. Note: * Uses invariant loads which indicate the memory won't change during kernel execution. * Particularly beneficial for read-only texture-like access patterns. * May improve performance on memory-bound kernels. **Parameters:** * ​type (`DType`): The data type to load (must be numeric). * ​width (`Int`): The SIMD vector width for vectorized loads. * ​alignment (`Int`): Memory alignment in bytes. Defaults to natural alignment of the SIMD vector type. 
**Args:** * ​x (`UnsafePointer[SIMD[type, 1]]`): Pointer to global memory location to load from. **Returns:** SIMD vector containing the loaded data. --- ## load_acquire `load_acquire[type: DType, //, *, scope: Scope = Scope(6), memory: Bool = True](ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> SIMD[type, 1]` Performs an atomic load operation with acquire memory ordering semantics. This function provides a memory barrier that ensures no subsequent memory operations from the calling thread are executed until after this load completes. Note: * Only supported on GPUs. * Maps directly to PTX ld.acquire instruction on NVIDIA, LLVM atomic load on AMDGPU. * Ensures subsequent memory operations don't execute until after load. * Critical for implementing synchronization primitives. **Parameters:** * ​type (`DType`): The data type to load. * ​scope (`Scope`): Memory scope for the operation (default: Scope.SYSTEM). * ​memory (`Bool`): Whether to include memory side effects in constraints (default: True). **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to load from. **Returns:** The loaded value. --- ## load_volatile `load_volatile[type: DType, //, memory: Bool = True](ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> SIMD[type, 1]` Performs a volatile load operation that cannot be optimized away. This function guarantees that the load operation will be performed exactly as specified, without being reordered or optimized away by the compiler. Note: * Only supported on NVIDIA GPUs. * Maps directly to PTX ld.volatile instruction. * Prevents compiler optimization of the load operation. * Useful for memory-mapped I/O or synchronization primitives. * May have performance implications compared to regular loads. **Parameters:** * ​type (`DType`): The data type to load. * ​memory (`Bool`): Whether to include memory side effects in constraints (default: True). **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to load from. **Returns:** The loaded value. --- ## lop `lop[lut: SIMD[int32, 1]](a: SIMD[int32, 1], b: SIMD[int32, 1], c: SIMD[int32, 1]) -> SIMD[int32, 1]` Performs an arbitrary logical operation on 3 inputs using a lookup table. Implements a 3-input lookup table (LUT) operation. The result is determined by bits in the lookup table value for each input combination. Note: * Only supported on NVIDIA GPUs. * Maps to the LOP3.B32 PTX instruction. * Lookup table value determines output for each possible input combo. **Parameters:** * ​lut (`SIMD[int32, 1]`): 32-bit lookup table value that defines the logical operation. **Args:** * ​a (`SIMD[int32, 1]`): First input value. * ​b (`SIMD[int32, 1]`): Second input value. * ​c (`SIMD[int32, 1]`): Third input value. **Returns:** Result of applying the lookup table operation to the inputs. --- ## make_buffer_resource `make_buffer_resource[type: DType](gds_ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], num_records: Int = __init__[::Intable](SIMD(max_or_inf[::DType]()))) -> SIMD[uint32, 4]` Creates a 128-bit buffer resource descriptor for AMD GPU buffer operations. 
This function constructs a 128-bit buffer resource descriptor used by AMD GPUs for buffer load/store operations. The descriptor contains information about the memory location, size, and access properties needed by the hardware to perform memory operations. Notes: * Only supported on AMD GPUs. * The descriptor follows AMD's hardware-specific format: * Bits 0-63: Base address * Bits 64-95: Number of records (size) * Bits 96-127: Flags controlling access properties * Used with buffer\_load and buffer\_store operations. * Performance-critical for optimized memory access patterns on AMD GPUs. Example: ```mojo from gpu.intrinsics import make_buffer_resource var ptr = UnsafePointer[Scalar[DType.float32]].alloc(1024) var resource = make_buffer_resource[DType.float32](ptr, 1024) # Use resource with buffer_load/buffer_store operations ``` . **Parameters:** * ​type (`DType`): The data type of elements in the buffer. **Args:** * ​gds\_ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Global memory base address pointer to the start of the buffer. * ​num\_records (`Int`): Maximum number of records that can be accessed through this resource descriptor. Reads with offsets beyond this value return 0. Defaults to UInt32.MAX for maximum possible range. **Returns:** A 128-bit buffer resource descriptor as a SIMD\[DType.uint32, 4]. --- ## mulhi `mulhi(a: SIMD[uint16, 1], b: SIMD[uint16, 1]) -> SIMD[uint32, 1]` Calculates the most significant 32 bits of the product of two 16-bit unsigned integers. Multiplies two 16-bit unsigned integers and returns the high 32 bits of their product. Useful for fixed-point arithmetic and overflow detection. Note: On NVIDIA GPUs, this maps directly to the MULHI.U16 PTX instruction. On others, it performs multiplication using 32-bit arithmetic. **Args:** * ​a (`SIMD[uint16, 1]`): First 16-bit unsigned integer operand. * ​b (`SIMD[uint16, 1]`): Second 16-bit unsigned integer operand. **Returns:** The high 32 bits of the product a \* b. `mulhi(a: SIMD[int16, 1], b: SIMD[int16, 1]) -> SIMD[int32, 1]` Calculates the most significant 32 bits of the product of two 16-bit signed integers. Multiplies two 16-bit signed integers and returns the high 32 bits of their product. Useful for fixed-point arithmetic and overflow detection. Note: On NVIDIA GPUs, this maps directly to the MULHI.S16 PTX instruction. On others, it performs multiplication using 32-bit arithmetic. **Args:** * ​a (`SIMD[int16, 1]`): First 16-bit signed integer operand. * ​b (`SIMD[int16, 1]`): Second 16-bit signed integer operand. **Returns:** The high 32 bits of the product a \* b. `mulhi(a: SIMD[uint32, 1], b: SIMD[uint32, 1]) -> SIMD[uint32, 1]` Calculates the most significant 32 bits of the product of two 32-bit unsigned integers. Multiplies two 32-bit unsigned integers and returns the high 32 bits of their product. Useful for fixed-point arithmetic and overflow detection. Note: On NVIDIA GPUs, this maps directly to the MULHI.U32 PTX instruction. On others, it performs multiplication using 64-bit arithmetic. **Args:** * ​a (`SIMD[uint32, 1]`): First 32-bit unsigned integer operand. * ​b (`SIMD[uint32, 1]`): Second 32-bit unsigned integer operand. **Returns:** The high 32 bits of the product a \* b. `mulhi(a: SIMD[int32, 1], b: SIMD[int32, 1]) -> SIMD[int32, 1]` Calculates the most significant 32 bits of the product of two 32-bit signed integers. Multiplies two 32-bit signed integers and returns the high 32 bits of their product.
Useful for fixed-point arithmetic and overflow detection. Note: On NVIDIA GPUs, this maps directly to the MULHI.S32 PTX instruction. On others, it performs multiplication using 64-bit arithmetic. **Args:** * ​a (`SIMD[int32, 1]`): First 32-bit signed integer operand. * ​b (`SIMD[int32, 1]`): Second 32-bit signed integer operand. **Returns:** The high 32 bits of the product a \* b. --- ## mulwide `mulwide(a: SIMD[uint32, 1], b: SIMD[uint32, 1]) -> SIMD[uint64, 1]` Performs a wide multiplication of two 32-bit unsigned integers. Multiplies two 32-bit unsigned integers and returns the full 64-bit result. Useful when the product may exceed 32 bits. Note: On NVIDIA GPUs, this maps directly to the MUL.WIDE.U32 PTX instruction. On others, it performs multiplication using 64-bit casts. **Args:** * ​a (`SIMD[uint32, 1]`): First 32-bit unsigned integer operand. * ​b (`SIMD[uint32, 1]`): Second 32-bit unsigned integer operand. **Returns:** The full 64-bit product of a \* b. `mulwide(a: SIMD[int32, 1], b: SIMD[int32, 1]) -> SIMD[int64, 1]` Performs a wide multiplication of two 32-bit signed integers. Multiplies two 32-bit signed integers and returns the full 64-bit result. Useful when the product may exceed 32 bits or be negative. Note: On NVIDIA GPUs, this maps directly to the MUL.WIDE.S32 PTX instruction. On others, it performs multiplication using 64-bit casts. **Args:** * ​a (`SIMD[int32, 1]`): First 32-bit signed integer operand. * ​b (`SIMD[int32, 1]`): Second 32-bit signed integer operand. **Returns:** The full 64-bit signed product of a \* b. --- ## store_release `store_release[type: DType, //, scope: Scope = Scope(6), memory: Bool = True](ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], value: SIMD[type, 1])` Performs an atomic store with release memory ordering semantics. This function provides a memory barrier that ensures all previous memory operations from the calling thread are visible to other threads before this store is performed. Note: * Only supported on GPUs. * Maps directly to PTX st.release instruction on NVIDIA, LLVM atomic store on AMDGPU. * Ensures all previous memory operations complete before this store. * Critical for implementing synchronization primitives. **Parameters:** * ​type (`DType`): The data type to store. * ​scope (`Scope`): Memory scope for the operation (default: Scope.SYSTEM). * ​memory (`Bool`): Whether to include memory side effects in constraints (default: True). **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to store to. * ​value (`SIMD[type, 1]`): Value to store. --- ## store_volatile `store_volatile[type: DType, //, memory: Bool = True](ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], value: SIMD[type, 1])` Performs a volatile store operation that cannot be optimized away. This function guarantees that the store operation will be performed exactly as specified, without being reordered or optimized away by the compiler. Note: * Only supported on NVIDIA GPUs. * Maps directly to PTX st.volatile instruction. * Prevents compiler optimization of the store operation. * Useful for memory-mapped I/O or synchronization primitives. * May have performance implications compared to regular stores. **Parameters:** * ​type (`DType`): The data type to store. * ​memory (`Bool`): Whether to include memory side effects in constraints (default: True).
**Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to store to. * ​value (`SIMD[type, 1]`): Value to store. --- ## threadfence `threadfence[scope: Scope = Scope(5)]()` Enforces ordering of memory operations across threads. Acts as a memory fence/barrier that ensures all memory operations (both loads and stores) issued before the fence are visible to other threads within the specified scope before any memory operations after the fence. Note: * Maps directly to CUDA `__threadfence()` family of functions. * Critical for synchronizing memory access in parallel algorithms. * Performance impact increases with broader scopes. **Parameters:** * ​scope (`Scope`): Memory scope level for the fence. Defaults to GPU-wide scope. Valid values are: * Scope.BLOCK: Orders memory within a thread block/CTA. * Scope.GPU: Orders memory across all threads on the GPU (default). * Scope.SYSTEM: Orders memory across the entire system. --- ## warpgroup_reg_alloc `warpgroup_reg_alloc[count: Int]()` Allocates additional registers for the executing warp group. Hints to the system to increase per-thread registers owned by the executing warp. Requests additional registers to increase the absolute per-thread maximum register count from its current value to the specified count. Note: * Only supported on NVIDIA SM90+ GPUs. * Performance optimization hint that may be ignored by the hardware. * Pair with `warpgroup_reg_dealloc()` when extra registers are no longer needed. **Parameters:** * ​count (`Int`): The desired number of registers per thread. Must be: * A multiple of 8. * Between 24 and 256 (inclusive). --- ## warpgroup_reg_dealloc `warpgroup_reg_dealloc[count: Int]()` Deallocates additional registers for the executing warp group. Hints to the system to decrease per-thread registers owned by the executing warp. Releases extra registers to reduce the absolute per-thread maximum register count from its current value to the specified count. Note: * Only supported on NVIDIA SM90+ GPUs. * Performance optimization hint that may be ignored by the hardware. * Pair with `warpgroup_reg_alloc()` when extra registers are needed. **Parameters:** * ​count (`Int`): The desired number of registers per thread. Must be: * A multiple of 8. * Between 24 and 256 (inclusive). --- ## CacheEviction `@register_passable(trivial)` `struct CacheEviction` Represents cache eviction policies for GPU memory operations. This struct defines different cache eviction priorities that control how data is evicted from cache when space is needed. The policies affect cache utilization and performance by controlling which data gets evicted first. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `EVICT_FIRST` `alias EVICT_FIRST = CacheEviction(1)` Highest eviction priority - data will be evicted first. Data cached with this priority is marked as the first candidate for eviction when cache space is needed. This is optimal for: * Streaming data that will not be reused * Single-pass algorithms * Data with low temporal locality ### `EVICT_LAST` `alias EVICT_LAST = CacheEviction(2)` Lowest eviction priority - data will be evicted last. Data cached with this priority remains in cache until all higher priority data is evicted.
Best used for: * Frequently accessed data * Data needed across multiple kernel launches * Critical data structures that benefit from cache persistence ### `EVICT_NORMAL` `alias EVICT_NORMAL = CacheEviction(0)` Default cache eviction priority. Data cached with normal priority follows standard cache replacement policies. This is the default behavior and suitable for most general-purpose data access patterns where no special caching requirements exist. ### `EVICT_UNCHANGED` `alias EVICT_UNCHANGED = CacheEviction(3)` Preserves existing cache eviction priority. When this policy is used: * Existing cache entries maintain their current eviction priority * No changes are made to the cache replacement order * Useful for operations that should not affect caching behavior ### `NO_ALLOCATE` `alias NO_ALLOCATE = CacheEviction(4)` Prevents cache allocation for accessed data. Data is not cached when using this policy. Optimal for: * Large sequential reads/writes * Data that will only be accessed once * Preserving cache space for more critical data * Streaming operations with no data reuse ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Tests if two CacheEviction instances are equal. **Args:** * ​other (`Self`): The CacheEviction to compare against. **Returns:** True if the eviction policies are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Tests if two CacheEviction instances are not equal. **Args:** * ​other (`Self`): The CacheEviction to compare against. **Returns:** True if the eviction policies are not equal, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Tests if two CacheEviction instances are identical. **Args:** * ​other (`Self`): The CacheEviction to compare against. **Returns:** True if the eviction policies are identical, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Tests if two CacheEviction instances are not identical. **Args:** * ​other (`Self`): The CacheEviction to compare against. **Returns:** True if the eviction policies are not identical, False otherwise. ### `mnemonic` `mnemonic(self) -> StringSlice[StaticConstantOrigin]` Returns the string mnemonic for this cache eviction policy. Converts the cache eviction policy into its corresponding string representation used in GPU instructions and debugging. **Returns:** A string literal containing the mnemonic for this eviction policy. --- ## CacheOperation `@register_passable(trivial)` `struct CacheOperation` Represents different GPU cache operation policies. This struct defines various caching behaviors for GPU memory operations, controlling how data is cached and evicted at different cache levels. The policies affect performance and memory coherency. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ALWAYS` `alias ALWAYS = CacheOperation(0)` Cache at all levels. This will be accessed again. Best for data that will be frequently reused across multiple threads. Provides fastest subsequent access but uses the most cache space. ### `GLOBAL` `alias GLOBAL = CacheOperation(1)` Cache at global level. Caches data only in the L2 cache, bypassing L1. Good for data shared between different thread blocks. ### `LAST_USE` `alias LAST_USE = CacheOperation(3)` Indicates the cache line will not be used again. Hints to the cache that this data can be evicted after this access. Helps optimize cache utilization. ### `STREAMING` `alias STREAMING = CacheOperation(2)` Streaming, this is likely to be accessed once. 
Optimizes for streaming access patterns where data is only read once. May bypass certain cache levels for better throughput. ### `VOLATILE` `alias VOLATILE = CacheOperation(4)` Don't cache, and fetch again. Forces reads/writes to bypass cache and go directly to memory. Useful for memory-mapped I/O or when cache coherency is required. ### `WRITE_BACK` `alias WRITE_BACK = CacheOperation(5)` Write back at all coherent levels. Updates all cache levels and eventually writes to memory. Most efficient for multiple writes to same location. ### `WRITE_THROUGH` `alias WRITE_THROUGH = CacheOperation(6)` Write through to system memory. Immediately writes updates to memory while updating cache. Provides stronger consistency but lower performance than write-back. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Tests if two CacheOperation instances are equal. **Args:** * ​other (`Self`): The CacheOperation to compare against. **Returns:** True if the operations are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Tests if two CacheOperation instances are not equal. **Args:** * ​other (`Self`): The CacheOperation to compare against. **Returns:** True if the operations are not equal, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Tests if two CacheOperation instances are identical. **Args:** * ​other (`Self`): The CacheOperation to compare against. **Returns:** True if the operations are identical, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Tests if two CacheOperation instances are not identical. **Args:** * ​other (`Self`): The CacheOperation to compare against. **Returns:** True if the operations are not identical, False otherwise. ### `mnemonic` `mnemonic(self) -> StringSlice[StaticConstantOrigin]` Returns the PTX mnemonic string for this cache operation. Converts the cache operation into its corresponding PTX assembly mnemonic string used in GPU instructions. **Returns:** A string literal containing the PTX mnemonic for this operation. --- ## Consistency `@register_passable(trivial)` `struct Consistency` Represents memory consistency models for GPU memory operations. This struct defines different memory consistency levels that control how memory operations are ordered and synchronized between threads. The consistency model affects both performance and correctness of parallel algorithms. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ACQUIRE` `alias ACQUIRE = Consistency(2)` Acquire consistency for synchronization operations. Ensures all subsequent memory operations are ordered after this operation. Used in producer-consumer patterns. ### `RELAXED` `alias RELAXED = Consistency(1)` Relaxed consistency with basic ordering guarantees. Provides some ordering guarantees while still allowing optimizations. Suitable for operations that don't require strict ordering. ### `RELEASE` `alias RELEASE = Consistency(3)` Release consistency for synchronization operations. Ensures all previous memory operations are ordered before this operation. Paired with acquire operations for synchronization. ### `WEAK` `alias WEAK = Consistency(0)` Weakest consistency model with minimal ordering guarantees. Provides maximum flexibility for hardware/compiler optimizations but requires careful synchronization by the programmer. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Tests if two Consistency instances are equal. 
**Args:** * ​other (`Self`): The Consistency instance to compare against. **Returns:** True if the consistency levels are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Tests if two Consistency instances are not equal. **Args:** * ​other (`Self`): The Consistency instance to compare against. **Returns:** True if the consistency levels are different, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Tests if two Consistency instances are identical. **Args:** * ​other (`Self`): The Consistency instance to compare against. **Returns:** True if the consistency levels are identical, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Tests if two Consistency instances are not identical. **Args:** * ​other (`Self`): The Consistency instance to compare against. **Returns:** True if the consistency levels are not identical, False otherwise. ### `__str__` `__str__(self) -> String` Returns a string representation of the consistency level. **Returns:** A string describing the consistency level. ### `mnemonic` `mnemonic(self) -> StringSlice[StaticConstantOrigin]` Returns the mnemonic string for the consistency level. **Returns:** A string literal containing the consistency level mnemonic. --- ## Fill `@register_passable(trivial)` `struct Fill` Represents memory fill patterns for GPU memory operations. This struct defines different fill patterns that can be used when allocating or initializing GPU memory. The patterns control how memory is initialized, which can be important for debugging and performance optimization. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `NAN` `alias NAN = Fill(2)` Fill memory with NaN values. Useful for debugging floating point computations. ### `NONE` `alias NONE = Fill(0)` No fill pattern - memory is left uninitialized. ### `ZERO` `alias ZERO = Fill(1)` Fill memory with zeros. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Tests if two Fill instances have the same fill pattern. **Args:** * ​other (`Self`): The Fill instance to compare against. **Returns:** True if the fill patterns are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Tests if two Fill instances have different fill patterns. **Args:** * ​other (`Self`): The Fill instance to compare against. **Returns:** True if the fill patterns are different, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Tests if two Fill instances are identical. **Args:** * ​other (`Self`): The Fill instance to compare against. **Returns:** True if the fill patterns are identical, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Tests if two Fill instances are not identical. **Args:** * ​other (`Self`): The Fill instance to compare against. **Returns:** True if the fill patterns are not identical, False otherwise. ### `__str__` `__str__(self) -> String` Returns a string representation of the fill pattern. Converts the fill pattern into a human-readable string for debugging and display purposes. **Returns:** A string describing the fill pattern. --- ## ReduceOp `@register_passable(trivial)` `struct ReduceOp` Represents reduction operations for parallel reduction algorithms. This struct defines different reduction operations that can be performed across multiple threads in parallel. These operations are commonly used in parallel reduction algorithms on GPUs. 
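For illustration, here is a minimal, hedged sketch of working with `ReduceOp` values (the `main` wrapper is just for demonstration; `ReduceOp` is typically supplied as a compile-time parameter to intrinsics such as `multimem_ld_reduce` and `cp_async_bulk_tensor_reduce`, documented later in this module):

```mojo
from gpu.memory import ReduceOp

fn main():
    # ReduceOp values behave like enum members and compare with `is`.
    var op = ReduceOp.MAX
    if op is ReduceOp.MAX:
        # mnemonic() yields the operation's instruction mnemonic (e.g. "max").
        print(op.mnemonic())
```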
## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ADD` `alias ADD = ReduceOp(0)` Addition reduction operation. Combines values by adding them together. ### `AND` `alias AND = ReduceOp(3)` Bitwise AND reduction operation. Performs bitwise AND across all inputs. ### `MAX` `alias MAX = ReduceOp(2)` Maximum reduction operation. Finds the maximum value across all inputs. ### `MIN` `alias MIN = ReduceOp(1)` Minimum reduction operation. Finds the minimum value across all inputs. ### `OR` `alias OR = ReduceOp(4)` Bitwise OR reduction operation. Performs bitwise OR across all inputs. ### `XOR` `alias XOR = ReduceOp(5)` Bitwise XOR reduction operation. Performs bitwise XOR across all inputs. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Tests if two ReduceOp instances are equal. **Args:** * ​other (`Self`): The ReduceOp instance to compare against. **Returns:** True if the reduction operations are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Tests if two ReduceOp instances are not equal. **Args:** * ​other (`Self`): The ReduceOp instance to compare against. **Returns:** True if the reduction operations are different, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Tests if two ReduceOp instances are identical. **Args:** * ​other (`Self`): The ReduceOp instance to compare against. **Returns:** True if the reduction operations are identical, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Tests if two ReduceOp instances are not identical. **Args:** * ​other (`Self`): The ReduceOp instance to compare against. **Returns:** True if the reduction operations are not identical, False otherwise. ### `__str__` `__str__(self) -> String` Returns a string representation of the reduction operation. **Returns:** A string describing the reduction operation. ### `mnemonic` `mnemonic(self) -> StringSlice[StaticConstantOrigin]` Returns the mnemonic string for the reduction operation. **Returns:** A string literal containing the reduction operation mnemonic. --- ## async_copy `async_copy[type: DType, //, size: Int, *, fill: OptionalReg[SIMD[type, 1]] = OptionalReg[SIMD[type, 1]]({:i1 0, 1}), bypass_L1_16B: Bool = True, l2_prefetch: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), eviction_policy: CacheEviction = CacheEviction(0)](src: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)], dst: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(3)], src_size: SIMD[int32, 1] = __init__[__mlir_type.!pop.int_literal](0), predicate: Bool = False)` Asynchronously copies data from global memory to shared memory. This function provides a high-performance asynchronous memory copy operation with configurable caching behavior, prefetching, and fill values. It maps directly to the PTX cp.async instruction on NVIDIA GPUs. **Constraints:** * Fill value is only supported for certain element types and sizes. **Parameters:** * ​type (`DType`): The data type to copy (e.g. float32, int32). * ​size (`Int`): Number of bytes to copy (must be 4, 8, or 16). * ​fill (`OptionalReg[SIMD[type, 1]]`): Optional fill value for uncopied bytes when `src_size` is less than `size`. * ​bypass\_L1\_16B (`Bool`): If True, bypasses L1 cache for 16-byte copies. * ​l2\_prefetch (`OptionalReg[Int]`): Optional L2 prefetch size (64, 128, or 256 bytes). * ​eviction\_policy (`CacheEviction`): Cache eviction policy for the copy operation. **Args:** * ​src (`UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)]`): Source pointer in global memory.
* ​dst (`UnsafePointer[SIMD[type, 1], address_space=AddressSpace(3)]`): Destination pointer in shared memory. * ​src\_size (`SIMD[int32, 1]`): Actual bytes to copy from src (remaining bytes use fill value). * ​predicate (`Bool`): Optional predicate to conditionally execute the copy. --- ## async_copy_commit_group `async_copy_commit_group()` Commits all prior initiated but uncommitted cp.async instructions into a cp.async-group. This function creates a new cp.async-group containing all previously initiated but uncommitted asynchronous copy operations. The group can then be waited on using `async_copy_wait_group()`. Notes: * Only supported on NVIDIA GPUs. * Maps to the cp.async.commit.group PTX instruction. * Used for managing asynchronous memory transfers. * Should be paired with `async_copy_wait_group()` or `async_copy_wait_all()`. --- ## async_copy_wait_all `async_copy_wait_all()` Waits for completion of all committed cp.async-groups. This function blocks execution until all previously committed cp.async-groups have completed their memory transfers. It provides a barrier to ensure all asynchronous copies are finished. Notes: * Only supported on NVIDIA GPUs. * Maps to the cp.async.wait.all PTX instruction. * Ensures all outstanding asynchronous transfers are complete. * More coarse-grained than `async_copy_wait_group()`. --- ## async_copy_wait_group `async_copy_wait_group(n: SIMD[int32, 1])` Waits for the completion of `n` most recently committed cp.async-groups. This function blocks execution until the specified number of previously committed cp.async-groups have completed their memory transfers. Notes: * Only supported on NVIDIA GPUs. * Maps to the cp.async.wait.group PTX instruction. * Provides fine-grained control over asynchronous transfer synchronization. * Can be used to implement a pipeline of asynchronous transfers. **Args:** * ​n (`SIMD[int32, 1]`): The number of pending cp.async-groups to wait for. Must be > 0. --- ## cp_async_bulk_tensor_global_shared_cta `cp_async_bulk_tensor_global_shared_cta[src_type: AnyType, rank: Int, /, eviction_policy: CacheEviction = CacheEviction(0)](src_mem: UnsafePointer[src_type, address_space=AddressSpace(3)], tma_descriptor: UnsafePointer[NoneType], coords: IndexList[rank])` Initiates an asynchronous copy operation to transfer tensor data from shared CTA memory to global memory using NVIDIA's Tensor Memory Access (TMA) mechanism. This function provides an efficient way to write data back from shared memory to global memory using TMA. It supports both rank-1 and rank-2 tensors and allows control over cache eviction policy. Notes: * This operation is asynchronous - use appropriate memory barriers to ensure completion. * Only supports rank-1 and rank-2 tensors. * Requires NVIDIA GPU with TMA support. * The source memory must be properly aligned for TMA operations. * The TMA descriptor must be properly initialized before use. **Parameters:** * ​src\_type (`AnyType`): The data type of the source tensor elements. * ​rank (`Int`): The dimensionality of the tensor (must be 1 or 2). * ​eviction\_policy (`CacheEviction`): Optional cache eviction policy that controls how the data is handled in the cache hierarchy. Defaults to EVICT\_NORMAL. **Args:** * ​src\_mem (`UnsafePointer[src_type, address_space=AddressSpace(3)]`): Pointer to the source data in shared memory that will be copied to global memory. Must be properly aligned according to TMA requirements.
* ​tma\_descriptor (`UnsafePointer[NoneType]`): Pointer to the TMA descriptor containing metadata about tensor layout and memory access patterns. * ​coords (`IndexList[rank]`): Coordinates specifying which tile of the tensor to copy. For rank-1 tensors, this is a single coordinate. For rank-2 tensors, this contains both row and column coordinates. --- ## cp_async_bulk_tensor_reduce `cp_async_bulk_tensor_reduce[src_type: AnyType, rank: Int, /, *, reduction_kind: ReduceOp, eviction_policy: CacheEviction = CacheEviction(0)](src_mem: UnsafePointer[src_type, address_space=AddressSpace(3)], tma_descriptor: UnsafePointer[NoneType], coords: IndexList[rank])` Initiates an asynchronous reduction operation between shared CTA memory and global memory using NVIDIA's Tensor Memory Access (TMA) mechanism. This function performs an in-place reduction operation, combining data from shared memory with data in global memory using the specified reduction operation. The operation is performed asynchronously and uses TMA's tile mode for efficient memory access. Notes: * This operation is asynchronous - use appropriate memory barriers to ensure completion. * Only supports rank-1 and rank-2 tensors. * Requires NVIDIA GPU with TMA support. * The source memory must be properly aligned for TMA operations. * The TMA descriptor must be properly initialized before use. * The reduction operation is performed atomically to ensure correctness. **Parameters:** * ​src\_type (`AnyType`): The data type of the source tensor elements. * ​rank (`Int`): The dimensionality of the tensor (must be 1 or 2). * ​reduction\_kind (`ReduceOp`): The type of reduction operation to perform. Supported operations are: "add", "min", "max", "inc", "dec", "and", "or", "xor". * ​eviction\_policy (`CacheEviction`): Optional cache eviction policy that controls how the data is handled in the cache hierarchy. Defaults to `EVICT_NORMAL`. **Args:** * ​src\_mem (`UnsafePointer[src_type, address_space=AddressSpace(3)]`): Pointer to the source data in shared memory that will be reduced with the global memory data. Must be properly aligned according to TMA requirements. * ​tma\_descriptor (`UnsafePointer[NoneType]`): Pointer to the TMA descriptor containing metadata about tensor layout and memory access patterns. * ​coords (`IndexList[rank]`): Coordinates specifying which tile of the tensor to operate on. For rank-1 tensors, this is a single coordinate. For rank-2 tensors, this contains both row and column coordinates. --- ## cp_async_bulk_tensor_shared_cluster_global `cp_async_bulk_tensor_shared_cluster_global[dst_type: AnyType, mbr_type: AnyType, rank: Int, /, *, cta_group: Int = 1](dst_mem: UnsafePointer[dst_type, address_space=AddressSpace(3)], tma_descriptor: UnsafePointer[NoneType], mem_bar: UnsafePointer[mbr_type, address_space=AddressSpace(3)], coords: IndexList[rank])` Initiates an asynchronous bulk copy operation of tensor data from global memory to shared memory. This function performs an asynchronous copy of tensor data using NVIDIA's Tensor Memory Access (TMA) mechanism. It supports both rank-1 and rank-2 tensors and uses cluster-level synchronization for efficient data movement. Notes: * This operation is asynchronous - use appropriate memory barriers to ensure copy completion. * Only supports rank-1 and rank-2 tensors. * Requires NVIDIA GPU with TMA support. * The memory barrier should be properly initialized before use. **Parameters:** * ​dst\_type (`AnyType`): The data type of the destination memory. 
* ​mbr\_type (`AnyType`): The data type of the memory barrier. * ​rank (`Int`): The dimensionality of the tensor (1, 2, or 3). * ​cta\_group (`Int`): The CTA group to use for the copy operation. Must be 1 or 2. **Args:** * ​dst\_mem (`UnsafePointer[dst_type, address_space=AddressSpace(3)]`): Pointer to the destination in shared memory where the tensor data will be copied. Must be properly aligned according to TMA requirements. * ​tma\_descriptor (`UnsafePointer[NoneType]`): Pointer to the TMA descriptor that contains metadata about the tensor layout and memory access patterns. * ​mem\_bar (`UnsafePointer[mbr_type, address_space=AddressSpace(3)]`): Pointer to a shared memory barrier used for synchronizing the asynchronous copy operation across threads in the cluster. * ​coords (`IndexList[rank]`): Coordinates specifying which tile of the tensor to copy. For rank-1 tensors, this is a single coordinate. For rank-2 tensors, this contains both row and column coordinates. --- ## cp_async_bulk_tensor_shared_cluster_global_multicast `cp_async_bulk_tensor_shared_cluster_global_multicast[dst_type: AnyType, mbr_type: AnyType, rank: Int, /, *, cta_group: Int = 1](dst_mem: UnsafePointer[dst_type, address_space=AddressSpace(3)], tma_descriptor: UnsafePointer[NoneType], mem_bar: UnsafePointer[mbr_type, address_space=AddressSpace(3)], coords: IndexList[rank], multicast_mask: SIMD[uint16, 1])` Initiates an asynchronous multicast load operation using NVIDIA's Tensor Memory Access (TMA) to copy tensor data from global memory to shared memories of multiple CTAs in a cluster. This function performs an optimized multicast copy operation where a single global memory read can be distributed to multiple CTAs' shared memories simultaneously, reducing memory bandwidth usage. It supports both rank-1 and rank-2 tensors and uses cluster-level synchronization. Notes: * This operation is asynchronous - use appropriate memory barriers to ensure copy completion. * Only supports rank-1 and rank-2 tensors. * Requires NVIDIA GPU with TMA support. * The memory barrier should be properly initialized before use. * The multicast\_mask must be properly configured based on cluster size and desired distribution. **Parameters:** * ​dst\_type (`AnyType`): The data type of the destination tensor elements. * ​mbr\_type (`AnyType`): The data type of the memory barrier. * ​rank (`Int`): The dimensionality of the tensor (must be 1 or 2). * ​cta\_group (`Int`): The CTA group to use for the copy operation. Must be 1 or 2. **Args:** * ​dst\_mem (`UnsafePointer[dst_type, address_space=AddressSpace(3)]`): Pointer to the destination in shared memory where the tensor data will be copied. Must be properly aligned according to TMA requirements. * ​tma\_descriptor (`UnsafePointer[NoneType]`): Pointer to the TMA descriptor containing metadata about tensor layout and memory access patterns. * ​mem\_bar (`UnsafePointer[mbr_type, address_space=AddressSpace(3)]`): Pointer to a shared memory barrier used for synchronizing the asynchronous copy operation across threads in the cluster. * ​coords (`IndexList[rank]`): Coordinates specifying which tile of the tensor to copy. For rank-1 tensors, this is a single coordinate. For rank-2 tensors, this contains both row and column coordinates. * ​multicast\_mask (`SIMD[uint16, 1]`): A 16-bit bitmask where each bit corresponds to a CTA in the cluster. Set bits indicate which CTAs will receive a copy of the loaded data. This enables efficient data sharing across multiple CTAs. 
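Putting these pieces together, the following is a minimal sketch of issuing a single rank-2 TMA tile load. It is illustrative only: the helper name `load_tile_async` and the `Int64` barrier element type are assumptions, and the TMA descriptor and shared-memory barrier must already have been created and initialized by the surrounding kernel machinery.

```mojo
from gpu.memory import AddressSpace, cp_async_bulk_tensor_shared_cluster_global
from memory import UnsafePointer
from utils import IndexList

fn load_tile_async[
    dtype: DType
](
    dst: UnsafePointer[Scalar[dtype], address_space = AddressSpace.SHARED],
    tma_descriptor: UnsafePointer[NoneType],
    mem_bar: UnsafePointer[Int64, address_space = AddressSpace.SHARED],
    row: Int,
    col: Int,
):
    # Issue the asynchronous TMA copy of the (row, col) tile into shared
    # memory. Completion is signaled through `mem_bar`; the caller must
    # wait on the barrier before reading `dst`.
    cp_async_bulk_tensor_shared_cluster_global(
        dst, tma_descriptor, mem_bar, IndexList[2](row, col)
    )
```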
--- ## external_memory `external_memory[type: AnyTrivialRegType, *, address_space: AddressSpace, alignment: Int, name: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("extern_ptr_syml")]() -> UnsafePointer[type, address_space=address_space, alignment=alignment]` Gets a pointer to dynamically allocated external memory. This function returns a pointer to external memory that can be used for dynamic shared memory allocations in GPU kernels. The memory is allocated in the specified address space with the given alignment requirements. Note: * The memory is not initialized and must be explicitly written before reading. * The allocation size is determined at kernel launch time. * The pointer is only valid within the GPU kernel execution context. * Care must be taken to respect alignment requirements when accessing the memory. **Parameters:** * ​type (`AnyTrivialRegType`): The type of elements stored in the memory. Must be a trivial register type. * ​address\_space (`AddressSpace`): The memory address space to allocate in (e.g. shared, global). * ​alignment (`Int`): The minimum alignment requirement in bytes for the allocated memory. * ​name (`StringSlice[StaticConstantOrigin]`): Optional symbolic name for the external memory allocation. Defaults to "extern\_ptr\_syml". **Returns:** A properly aligned pointer to the allocated external memory in the specified address space. --- ## fence_mbarrier_init `fence_mbarrier_init()` Creates a memory fence after mbarrier initialization. This function establishes a memory barrier that ensures the proper initialization of memory barriers (mbarrier) before they are used. It guarantees that the mbarrier initialization is complete and visible to all threads before subsequent operations. Note: Should be called immediately after mbarrier initialization to ensure proper synchronization semantics. --- ## fence_proxy_tensormap_generic_sys_acquire `fence_proxy_tensormap_generic_sys_acquire[type: AnyType](ptr: UnsafePointer[type, alignment=alignment, mut=mut, origin=origin], size: SIMD[int32, 1])` Acquires a system-wide memory fence for tensor map operations. This function establishes a memory fence that ensures proper synchronization between tensor map operations and system memory. It guarantees that all previous memory operations are completed before subsequent tensor map accesses. Note: This is a low-level synchronization primitive typically used in conjunction with TMA (Tensor Memory Access) operations on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The data type of the tensor map object being synchronized. **Args:** * ​ptr (`UnsafePointer[type, alignment=alignment, mut=mut, origin=origin]`): Pointer to the tensor map object in system memory that needs to be synchronized. * ​size (`SIMD[int32, 1]`): The size in bytes of the tensor map object being synchronized. --- ## fence_proxy_tensormap_generic_sys_release `fence_proxy_tensormap_generic_sys_release()` Releases the system-wide memory fence for tensor map operations. This function releases the memory fence previously established by the acquire operation. It ensures that all tensor map operations are completed and visible to the system before proceeding. Note: Should be called after tensor map operations are complete to maintain proper memory ordering semantics. --- ## memory This module provides GPU memory operations and utilities. 
The module implements low-level memory operations for GPU programming, with a focus on: * Memory address space abstractions (global, shared, constant) * Cache control operations and policies * Memory access patterns and optimizations * Memory alignment and pointer manipulation It provides a unified interface for memory operations across different GPU architectures, with specialized implementations for NVIDIA and AMD GPUs where needed. The module is designed for performance-critical code and requires careful usage to achieve optimal memory access patterns and cache utilization. ## Aliases ### `AddressSpace` `alias AddressSpace = _GPUAddressSpace` ## Structs * [​`CacheEviction`](/mojo/stdlib/gpu/memory/CacheEviction): Represents cache eviction policies for GPU memory operations. * [​`CacheOperation`](/mojo/stdlib/gpu/memory/CacheOperation): Represents different GPU cache operation policies. * [​`Consistency`](/mojo/stdlib/gpu/memory/Consistency): Represents memory consistency models for GPU memory operations. * [​`Fill`](/mojo/stdlib/gpu/memory/Fill): Represents memory fill patterns for GPU memory operations. * [​`ReduceOp`](/mojo/stdlib/gpu/memory/ReduceOp): Represents reduction operations for parallel reduction algorithms. ## Functions * [​`async_copy`](/mojo/stdlib/gpu/memory/async_copy): Asynchronously copies data from global memory to shared memory. * [​`async_copy_commit_group`](/mojo/stdlib/gpu/memory/async_copy_commit_group): Commits all prior initiated but uncommitted cp.async instructions into a cp.async-group. * [​`async_copy_wait_all`](/mojo/stdlib/gpu/memory/async_copy_wait_all): Waits for completion of all committed cp.async-groups. * [​`async_copy_wait_group`](/mojo/stdlib/gpu/memory/async_copy_wait_group): Waits for the completion of `n` most recently committed cp.async-groups. * [​`cp_async_bulk_tensor_global_shared_cta`](/mojo/stdlib/gpu/memory/cp_async_bulk_tensor_global_shared_cta): Initiates an asynchronous copy operation to transfer tensor data from shared CTA memory to global memory using NVIDIA's Tensor Memory Access (TMA) mechanism. * [​`cp_async_bulk_tensor_reduce`](/mojo/stdlib/gpu/memory/cp_async_bulk_tensor_reduce): Initiates an asynchronous reduction operation between shared CTA memory and global memory using NVIDIA's Tensor Memory Access (TMA) mechanism. * [​`cp_async_bulk_tensor_shared_cluster_global`](/mojo/stdlib/gpu/memory/cp_async_bulk_tensor_shared_cluster_global): Initiates an asynchronous bulk copy operation of tensor data from global memory to shared memory. * [​`cp_async_bulk_tensor_shared_cluster_global_multicast`](/mojo/stdlib/gpu/memory/cp_async_bulk_tensor_shared_cluster_global_multicast): Initiates an asynchronous multicast load operation using NVIDIA's Tensor Memory Access (TMA) to copy tensor data from global memory to shared memories of multiple CTAs in a cluster. * [​`external_memory`](/mojo/stdlib/gpu/memory/external_memory): Gets a pointer to dynamically allocated external memory. * [​`fence_mbarrier_init`](/mojo/stdlib/gpu/memory/fence_mbarrier_init): Creates a memory fence after mbarrier initialization. * [​`fence_proxy_tensormap_generic_sys_acquire`](/mojo/stdlib/gpu/memory/fence_proxy_tensormap_generic_sys_acquire): Acquires a system-wide memory fence for tensor map operations. * [​`fence_proxy_tensormap_generic_sys_release`](/mojo/stdlib/gpu/memory/fence_proxy_tensormap_generic_sys_release): Releases the system-wide memory fence for tensor map operations. 
* [​`load`](/mojo/stdlib/gpu/memory/load): Loads data from global memory into a SIMD vector. * [​`multimem_ld_reduce`](/mojo/stdlib/gpu/memory/multimem_ld_reduce): Performs a vectorized load-reduce operation using NVIDIA's multimem feature. * [​`multimem_st`](/mojo/stdlib/gpu/memory/multimem_st): Stages an inline multimem.st instruction. * [​`tma_store_fence`](/mojo/stdlib/gpu/memory/tma_store_fence): Establishes a memory fence for shared memory stores in TMA operations. --- ## load `load[type: DType, //, width: Int = 1, *, read_only: Bool = False, prefetch_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), cache_policy: CacheOperation = CacheOperation(0), eviction_policy: CacheEviction = CacheEviction(0), alignment: Int = alignof[::AnyType,__mlir_type.!kgen.target]() if is_nvidia_gpu() else 1](ptr: UnsafePointer[SIMD[type, 1]]) -> SIMD[type, width]` Loads data from global memory into a SIMD vector. Provides a high-level interface for vectorized memory loads with configurable cache behavior and memory access patterns. **Parameters:** * ​type (`DType`): The data type to load. * ​width (`Int`): Vector width (number of elements to load). * ​read\_only (`Bool`): If True, marks the load as read-only for cache optimization. * ​prefetch\_size (`OptionalReg[Int]`): Optional L2 cache prefetch size (64, 128, or 256 bytes). * ​cache\_policy (`CacheOperation`): Cache operation policy for the load. * ​eviction\_policy (`CacheEviction`): Cache eviction policy. * ​alignment (`Int`): Memory alignment in bytes. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1]]`): Pointer to global memory to load from. **Returns:** SIMD vector containing the loaded data. `load[OffsetType: Indexer, type: DType, //, width: Int = 1, *, read_only: Bool = False, prefetch_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), cache_policy: CacheOperation = CacheOperation(0), eviction_policy: CacheEviction = CacheEviction(0), alignment: Int = alignof[::AnyType,__mlir_type.!kgen.target]() if is_nvidia_gpu() else 1](ptr: UnsafePointer[SIMD[type, 1]], offset: OffsetType) -> SIMD[type, width]` Loads data from global memory with an offset into a SIMD vector. Provides a high-level interface for vectorized memory loads with configurable cache behavior and memory access patterns, supporting offset-based addressing. **Parameters:** * ​OffsetType (`Indexer`): Type of the offset value. * ​type (`DType`): The data type to load. * ​width (`Int`): Vector width (number of elements to load). * ​read\_only (`Bool`): If True, marks the load as read-only for cache optimization. * ​prefetch\_size (`OptionalReg[Int]`): Optional L2 cache prefetch size (64, 128, or 256 bytes). * ​cache\_policy (`CacheOperation`): Cache operation policy for the load. * ​eviction\_policy (`CacheEviction`): Cache eviction policy. * ​alignment (`Int`): Memory alignment in bytes. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1]]`): Base pointer to global memory. * ​offset (`OffsetType`): Offset from base pointer in elements. **Returns:** SIMD vector containing the loaded data. --- ## multimem_ld_reduce `multimem_ld_reduce[type: DType, *, count: Int, reduction: ReduceOp, scope: Scope, consistency: Consistency, accum_type: DType = get_accum_type[::DType,::DType](), output_width: Int = 1](addr: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)]) -> StaticTuple[SIMD[accum_type, output_width], count]` Performs a vectorized load-reduce operation using NVIDIA's multimem feature. 
This function loads multiple values from global memory and performs a reduction operation across them in a single instruction. It utilizes NVIDIA's multimem feature available on SM90+ GPUs for improved performance. **Constraints:** * Only supported on SM90+ GPUs. * Count must be 2 or 4. * Type must be float32, float16, or bfloat16. **Parameters:** * ​type (`DType`): Data type for the operation (float32, float16, or bfloat16). * ​count (`Int`): Number of elements to load and reduce (2 or 4). * ​reduction (`ReduceOp`): Type of reduction operation to perform. * ​scope (`Scope`): Memory scope for the operation. * ​consistency (`Consistency`): Memory consistency model to use. * ​accum\_type (`DType`): Data type used for accumulation. Defaults to a wider type than input (e.g. float32 for float16 inputs) to maintain precision during reduction. * ​output\_width (`Int`): Width of each output SIMD vector (default 1). **Args:** * ​addr (`UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)]`): Pointer to global memory where data will be loaded from. **Returns:** A StaticTuple containing 'count' SIMD vectors of width 'output\_width' holding the results of the load-reduce operation. --- ## multimem_st `multimem_st[type: DType, *, count: Int, scope: Scope, consistency: Consistency, width: Int = 1](addr: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)], values: StaticTuple[SIMD[type, width], count])` Stages an inline multimem.st instruction. This operation performs a store to all memory locations pointed to by the multimem address using the specified memory consistency model and scope. Notes: * Requires SM90+ GPU architecture (PTX ISA 8.1+). * The address must be a valid multimem address. * Supported type-width combinations must total 32/64/128 bits. * Default memory semantics: weak consistency (when not specified). * Vector stores (.v2/.v4) require matching total size constraints. Example:

```mojo
from gpu.intrinsics import Scope
from gpu.memory import Consistency, multimem_st
from utils import StaticTuple

# `addr` is assumed to be a valid multimem address in global memory;
# `val1`, `val2` and `vec1`..`vec4` are placeholder SIMD values.

# Store 2 float32 values to the multimem address (block/CTA scope).
multimem_st[DType.float32, count=2, scope=Scope.BLOCK, consistency=Consistency.RELAXED](
    addr, StaticTuple[SIMD[DType.float32, 1], 2](val1, val2)
)

# Vector store of 4 float16x2 values.
multimem_st[DType.float16, count=4, scope=Scope.CLUSTER, consistency=Consistency.RELEASE, width=2](
    addr, StaticTuple[SIMD[DType.float16, 2], 4](vec1, vec2, vec3, vec4)
)
```

See Also: [PTX ISA Documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-multimem-ld-reduce-multimem-st-multimem-red). **Parameters:** * ​type (`DType`): The data type of elements to store (must be float16, bfloat16, or float32). * ​count (`Int`): Number of vector elements per store operation (2 or 4). * ​scope (`Scope`): Memory scope for visibility of the store operation (CTA/Cluster/GPU/System). * ​consistency (`Consistency`): Memory consistency semantics (weak/relaxed/release). * ​width (`Int`): Vector width modifier for packed data types (default 1). **Args:** * ​addr (`UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)]`): Multimem address in global address space pointing to multiple locations. * ​values (`StaticTuple[SIMD[type, width], count]`): Packed SIMD values to store, with count matching the template parameter. --- ## tma_store_fence `tma_store_fence()` Establishes a memory fence for shared memory stores in TMA operations. This function creates a memory barrier that ensures all previous shared memory stores are completed before subsequent TMA (Tensor Memory Access) store operations begin.
This is crucial for maintaining memory consistency in tensor operations. Note: This fence specifically targets the CTA (Cooperative Thread Array) scope and is used to synchronize async shared memory operations. --- ## WGMMADescriptor `@register_passable(trivial)` `struct WGMMADescriptor[dtype: DType]` Descriptor for shared memory operands used in warp group matrix multiply operations. This struct represents a descriptor that encodes information about shared memory layout and access patterns for warp group matrix multiply operations. The descriptor contains the following bit fields: * Start address (14 bits): Base address in shared memory. * Leading byte offset (14 bits): Leading dimension stride in bytes. * Stride byte offset (14 bits): Stride dimension offset in bytes. * Base offset (3 bits): Additional offset. * Swizzle mode (2 bits): Memory access pattern. The bit layout is:

```
+----------+--------+------------+--------+------------+--------+--------+---------+--------+
|   0-13   | 14-15  |   16-29    | 30-31  |   32-45    | 46-48  | 49-51  |  52-61  | 62-63  |
+----------+--------+------------+--------+------------+--------+--------+---------+--------+
| 14 bits  | 2 bits |  14 bits   | 2 bits |  14 bits   | 3 bits | 3 bits | 10 bits | 2 bits |
+----------+--------+------------+--------+------------+--------+--------+---------+--------+
| BaseAddr |   0    | LeadingDim |   0    |   Stride   |   0    | Offst  |    0    | Swzle  |
+----------+--------+------------+--------+------------+--------+--------+---------+--------+
```

## Parameters * ​dtype (`DType`): The data type of the shared memory operand. This affects memory alignment and access patterns for the descriptor. ## Fields * ​desc (`SIMD[int64, 1]`): The 64-bit descriptor value that encodes shared memory layout information. This field stores the complete descriptor with all bit fields packed into a single 64-bit integer: * Bits 0-13: Base address in shared memory (14 bits) * Bits 16-29: Leading dimension stride in bytes (14 bits) * Bits 32-45: Stride dimension offset in bytes (14 bits) * Bits 49-51: Base offset (3 bits) * Bits 62-63: Swizzle mode for memory access pattern (2 bits) The descriptor is used by NVIDIA Hopper architecture's warp group matrix multiply instructions to efficiently access shared memory with the appropriate layout and access patterns. ## Implemented traits `AnyType`, `Copyable`, `MMAOperandDescriptor`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(val: SIMD[int64, 1]) -> Self` Initialize descriptor with raw 64-bit value. This constructor allows creating a descriptor directly from a 64-bit integer that already contains the properly formatted bit fields for the descriptor. The implicit attribute enables automatic conversion from `Int64` to `WGMMADescriptor`. **Args:** * ​val (`SIMD[int64, 1]`): A 64-bit integer containing the complete descriptor bit layout. ### `__add__` `__add__(self, offset: Int) -> Self` Add offset to descriptor's base address. **Args:** * ​offset (`Int`): Byte offset to add to base address. **Returns:** New descriptor with updated base address. ### `__iadd__` `__iadd__(mut self, offset: Int)` Add offset to descriptor's base address in-place. **Args:** * ​offset (`Int`): Byte offset to add to base address. ### `create` `static create[stride_byte_offset: Int, leading_byte_offset: Int, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0))](smem_ptr: UnsafePointer[SIMD[dtype, 1], address_space=AddressSpace(3)]) -> Self` Create a descriptor for shared memory operand. **Parameters:** * ​stride\_byte\_offset (`Int`): Stride dimension offset in bytes.
* ​leading\_byte\_offset (`Int`): Leading dimension stride in bytes. * ​swizzle\_mode (`TensorMapSwizzle`): Memory access pattern mode. **Args:** * ​smem\_ptr (`UnsafePointer[SIMD[dtype, 1], address_space=AddressSpace(3)]`): Pointer to shared memory operand. **Returns:** Initialized descriptor for the shared memory operand. --- ## mma This module includes utilities for working with the warp-matrix-matrix-multiplication (wmma) instructions. ## Structs * [​`WGMMADescriptor`](/mojo/stdlib/gpu/mma/WGMMADescriptor): Descriptor for shared memory operands used in warp group matrix multiply operations. ## Functions * [​`ld_matrix`](/mojo/stdlib/gpu/mma/ld_matrix): Loads a matrix from shared memory into registers in a format suitable for tensor core operations. * [​`mma`](/mojo/stdlib/gpu/mma/mma): Performs warp sync Tensor Core based Matrix-multiply and accumulate (MMA) operation. * [​`st_matrix`](/mojo/stdlib/gpu/mma/st_matrix): Performs warp-synchronized copy from registers to shared memory. * [​`wgmma_async`](/mojo/stdlib/gpu/mma/wgmma_async): Performs warp group async Matrix-multiply and accumulate (WGMMA) operation. * [​`wgmma_commit_group_sync`](/mojo/stdlib/gpu/mma/wgmma_commit_group_sync): Commits pending warp group matrix multiply operations. * [​`wgmma_fence_aligned`](/mojo/stdlib/gpu/mma/wgmma_fence_aligned): Inserts a memory fence for warp group matrix multiply operations. * [​`wgmma_wait_group_sync`](/mojo/stdlib/gpu/mma/wgmma_wait_group_sync): Waits for all pending warp group matrix multiply operations to complete. --- ## ld_matrix `ld_matrix[type: DType, //, simd_width: Int, *, transpose: Bool = False](ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> SIMD[type, simd_width]` Loads a matrix from shared memory into registers in a format suitable for tensor core operations. This function performs a warp-synchronized load from shared memory to registers, formatting the data to be directly usable by tensor core Matrix Multiply-Accumulate (MMA) instructions. Note: * All threads in a warp must execute this operation together. * For transposed loads, only half precision (float16) is supported. * The register width is fixed at 4 bytes (32 bits). * Supported configurations: * x1: One 32-bit register per thread. * x2: Two 32-bit registers per thread. * x4: Four 32-bit registers per thread. Example: ```mojo from gpu.mma import ld_matrix # Load 8x8 matrix of float16 values var data = ld_matrix[DType.float16, 8](ptr) # Load transposed matrix var transposed = ld_matrix[DType.float16, 8, transpose=True](ptr) ``` . **Parameters:** * ​type (`DType`): The data type of the matrix elements (e.g. float16, float32). * ​simd\_width (`Int`): The width of the SIMD vector to load. * ​transpose (`Bool`): Whether to transpose the matrix during load (only supported for half precision). **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to shared memory containing the source matrix data. **Returns:** SIMD vector containing the loaded matrix data, properly formatted for MMA operations. --- ## mma `mma[block_size: Int = 1](mut d: SIMD[dtype, size], a: SIMD[dtype, size], b: SIMD[dtype, size], c: SIMD[dtype, size])` Performs warp sync Tensor Core based Matrix-multiply and accumulate (MMA) operation. This function executes a matrix multiply-accumulate operation using GPU Tensor Cores, synchronizing across the warp. 
It dispatches to architecture-specific implementations for NVIDIA and AMD GPUs. The operation performed is: d = (a \* b) + c Supported configurations depend on the GPU architecture: * NVIDIA: Various combinations of FP32, FP16, BF16, and FP8 formats. * AMD: Limited subset of FP32 and FP16 operations. Note: * All threads in a warp must execute this operation together. * Input matrices must be properly loaded and formatted for Tensor Core operations. * Matrix dimensions and data types must match hardware requirements. **Parameters:** * ​block\_size (`Int`): The size of the block of the MMA operation (e.g., 4x4x4\_16B). Applies to AMD GPUs only. **Args:** * ​d (`SIMD[dtype, size]`): Output SIMD vector to store the result. * ​a (`SIMD[dtype, size]`): First input matrix as SIMD vector. * ​b (`SIMD[dtype, size]`): Second input matrix as SIMD vector. * ​c (`SIMD[dtype, size]`): Accumulator matrix as SIMD vector. --- ## st_matrix `st_matrix[dtype: DType, //, simd_width: Int, *, transpose: Bool = False](ptr: UnsafePointer[SIMD[dtype, 1], address_space=AddressSpace(3)], d: SIMD[float32, simd_width])` Performs warp-synchronized copy from registers to shared memory. This function stores data from registers to shared memory in a format that can be directly used by tensor core Matrix Multiply-Accumulate (MMA) instructions. It uses the NVIDIA stmatrix instruction to perform an efficient warp-synchronized store. Note: The function performs a warp-synchronized operation - all threads in the warp must execute this instruction to avoid deadlock. **Constraints:** * Must be used with shared memory pointers. * Number of registers must be 1, 2, or 4. * Data must be properly aligned for matrix operations. * All threads in warp must participate. * Only supported on NVIDIA GPUs with tensor core capabilities. **Parameters:** * ​dtype (`DType`): Data type of elements to store. * ​simd\_width (`Int`): Width of the SIMD vector. * ​transpose (`Bool`): If True, transposes the matrix during store. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=AddressSpace(3)]`): Pointer to shared memory where data will be stored. * ​d (`SIMD[float32, simd_width]`): SIMD vector containing the data to store. --- ## wgmma_async `wgmma_async[m: Int, n: Int, k: Int, c_dtype: DType, width: Int, /, *, a_type: DType, b_type: DType, accum_type: DType = c_dtype, layout_a: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("row"), layout_b: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("col"), scale_d: Int = 1, scale_a: Int = 1, scale_b: Int = 1](mat_a_desc: WGMMADescriptor[dtype], mat_b_desc: WGMMADescriptor[dtype], c_reg: StaticTuple[SIMD[c_dtype, 1], width]) -> StaticTuple[SIMD[c_dtype, 1], width]` Performs warp group async Matrix-multiply and accumulate (WGMMA) operation. This function executes an asynchronous matrix multiplication using warp group MMA instructions. It supports various data types including tensor float32, bfloat16, float16, float8, int8, and uint8. **Constraints:** * The number of output registers must match the instruction shape: `(m * n // 128) * sizeof(accum_type) == width * sizeof(c_dtype)`. * Data type combinations must be compatible with hardware WGMMA instructions. **Parameters:** * ​m (`Int`): Number of rows in matrix A and output matrix. * ​n (`Int`): Number of columns in matrix B and output matrix. * ​k (`Int`): Number of columns in matrix A / rows in matrix B. * ​c\_dtype (`DType`): Data type of the output matrix C.
* ​width (`Int`): Width of the InlineArray register for matrix C. * ​a\_type (`DType`): Data type of matrix A. * ​b\_type (`DType`): Data type of matrix B. * ​accum\_type (`DType`): Accumulation data type (defaults to c\_dtype). * ​layout\_a (`StringSlice[StaticConstantOrigin]`): Memory layout for matrix A ("row" or "col"). * ​layout\_b (`StringSlice[StaticConstantOrigin]`): Memory layout for matrix B ("row" or "col"). * ​scale\_d (`Int`): Scale factor for matrix C. * ​scale\_a (`Int`): Scale factor for matrix A. * ​scale\_b (`Int`): Scale factor for matrix B. **Args:** * ​mat\_a\_desc (`WGMMADescriptor[dtype]`): WGMMA descriptor for matrix A. * ​mat\_b\_desc (`WGMMADescriptor[dtype]`): WGMMA descriptor for matrix B. * ​c\_reg (`StaticTuple[SIMD[c_dtype, 1], width]`): StaticTuple containing matrix C values. **Returns:** `StaticTuple` containing the result of the matrix multiplication. `wgmma_async[m: Int, n: Int, k: Int, c_dtype: DType, width: Int, /, *, a_type: DType, b_type: DType, accum_type: DType = c_dtype, layout_a: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("row"), layout_b: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("col"), scale_d: Int = 1, scale_a: Int = 1, scale_b: Int = 1](mat_a_desc: WGMMADescriptor[dtype], mat_b_desc: WGMMADescriptor[dtype], c_reg: SIMD[c_dtype, width]) -> SIMD[c_dtype, width]` Performs warp group async Matrix-multiply and accumulate (WGMMA) operation. This function executes an asynchronous matrix multiplication using warp group MMA instructions. It supports various data types including tensor float32, bfloat16, float16, float8, int8, and uint8. **Constraints:** * The number of output registers must match the instruction shape: `(m * n // 128) * sizeof(accum_type) == width * sizeof(c_dtype)`. * Data type combinations must be compatible with hardware WGMMA instructions. **Parameters:** * ​m (`Int`): Number of rows in matrix A and output matrix. * ​n (`Int`): Number of columns in matrix B and output matrix. * ​k (`Int`): Number of columns in matrix A / rows in matrix B. * ​c\_dtype (`DType`): Data type of the output matrix C. * ​width (`Int`): Width of the SIMD register for matrix C. * ​a\_type (`DType`): Data type of matrix A. * ​b\_type (`DType`): Data type of matrix B. * ​accum\_type (`DType`): Accumulation data type (defaults to c\_dtype). * ​layout\_a (`StringSlice[StaticConstantOrigin]`): Memory layout for matrix A ("row" or "col"). * ​layout\_b (`StringSlice[StaticConstantOrigin]`): Memory layout for matrix B ("row" or "col"). * ​scale\_d (`Int`): Scale factor for matrix C. * ​scale\_a (`Int`): Scale factor for matrix A. * ​scale\_b (`Int`): Scale factor for matrix B. **Args:** * ​mat\_a\_desc (`WGMMADescriptor[dtype]`): WGMMA descriptor for matrix A. * ​mat\_b\_desc (`WGMMADescriptor[dtype]`): WGMMA descriptor for matrix B. * ​c\_reg (`SIMD[c_dtype, width]`): SIMD register containing matrix C values. **Returns:** SIMD register containing the result of the matrix multiplication. 
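To show how the WGMMA primitives fit together, here is a minimal sketch of a typical issue/commit/wait sequence. The tile shape (m=64, n=64, k=16), the bfloat16 inputs, and the float32 accumulator are illustrative assumptions, not prescribed values, and the shared memory descriptors are taken as inputs since building them is covered under `WGMMADescriptor`.

```mojo
from gpu.mma import WGMMADescriptor, wgmma_async, wgmma_commit_group_sync, wgmma_fence_aligned, wgmma_wait_group_sync


fn wgmma_64x64x16_bf16(
    a_desc: WGMMADescriptor[DType.bfloat16],
    b_desc: WGMMADescriptor[DType.bfloat16],
) -> SIMD[DType.float32, 32]:
    # Accumulator width satisfies the constraint above:
    # (64 * 64 // 128) * sizeof(float32) == 32 * sizeof(float32).
    var c = SIMD[DType.float32, 32](0)
    # Make prior shared memory writes visible to the WGMMA unit.
    wgmma_fence_aligned()
    c = wgmma_async[
        64, 64, 16, DType.float32, 32,
        a_type=DType.bfloat16, b_type=DType.bfloat16,
    ](a_desc, b_desc, c)
    # Commit the issued operation into a group, then wait for all
    # pending groups before reading the accumulator.
    wgmma_commit_group_sync()
    wgmma_wait_group_sync[0]()
    return c
```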
`wgmma_async[m: Int, n: Int, k: Int, a_dtype: DType, c_dtype: DType, frag_a_width: Int, frag_c_width: Int, /, *, a_type: DType, b_type: DType, accum_type: DType = c_dtype, layout_a: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("row"), layout_b: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("col"), scale_d: Int = 1, scale_a: Int = 1, scale_b: Int = 1](mat_a_frag: SIMD[a_dtype, frag_a_width], mat_b_desc: WGMMADescriptor[dtype], c: SIMD[c_dtype, frag_c_width]) -> SIMD[c_dtype, frag_c_width]` Performs warp group async Matrix-multiply and accumulate (WGMMA) operation. Currently only supports: * m=64, k=16. * BF16 input types. * FP32 accumulation. * Row major matrix A. * Column major matrix B (or row major for BF16). **Parameters:** * ​m (`Int`): Number of rows in output matrix. * ​n (`Int`): Number of columns in output matrix. * ​k (`Int`): Inner dimension for matrix multiplication. * ​a\_dtype (`DType`): Data type of matrix A fragment. * ​c\_dtype (`DType`): Data type of output matrix C. * ​frag\_a\_width (`Int`): Width of matrix A fragment. * ​frag\_c\_width (`Int`): Width of output matrix C fragment. * ​a\_type (`DType`): Data type of matrix A. * ​b\_type (`DType`): Data type of matrix B. * ​accum\_type (`DType`): Data type used for accumulation (defaults to c\_dtype). * ​layout\_a (`StringSlice[StaticConstantOrigin]`): Layout of matrix A ("row" or "col", defaults to "row"). * ​layout\_b (`StringSlice[StaticConstantOrigin]`): Layout of matrix B ("row" or "col", defaults to "col"). * ​scale\_d (`Int`): Scale factor for output matrix C (defaults to 1). * ​scale\_a (`Int`): Scale factor for matrix A (defaults to 1). * ​scale\_b (`Int`): Scale factor for matrix B (defaults to 1). **Args:** * ​mat\_a\_frag (`SIMD[a_dtype, frag_a_width]`): Fragment containing matrix A data. * ​mat\_b\_desc (`WGMMADescriptor[dtype]`): Descriptor for matrix B data. * ​c (`SIMD[c_dtype, frag_c_width]`): Fragment containing matrix C data. **Returns:** Updated matrix C fragment after WGMMA operation. --- ## wgmma_commit_group_sync `wgmma_commit_group_sync()` Commits pending warp group matrix multiply operations. This synchronizes the warp group and ensures all WGMMA operations have been committed. Must be called after a sequence of WGMMA operations before accessing results. --- ## wgmma_fence_aligned `wgmma_fence_aligned()` Inserts a memory fence for warp group matrix multiply operations. This ensures all prior shared memory accesses are visible before subsequent WGMMA operations. Must be called before starting a new sequence of WGMMA operations. --- ## wgmma_wait_group_sync `wgmma_wait_group_sync[group: Int = 0]()` Waits for all pending warp group matrix multiply operations to complete. This synchronizes the warp group and ensures all WGMMA operations have finished executing. Must be called after commit and before accessing results. **Parameters:** * ​group (`Int`): The number of most recent WGMMA groups allowed to remain pending. When `group` is 0, waits for all prior WGMMA operations to complete. --- ## MMAOperandDescriptor ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move.
### `__add__` `__add__(self: _Self, offset: Int) -> _Self` --- ## mma_operand_descriptor ## Traits * [​`MMAOperandDescriptor`](/mojo/stdlib/gpu/mma_operand_descriptor/MMAOperandDescriptor): --- ## MMASmemDescriptor `@register_passable(trivial)` `struct MMASmemDescriptor` Descriptor for shared memory operands of tcgen05 mma instructions. This struct represents a descriptor that encodes information about shared memory layout and access patterns for warp group matrix multiply operations. The descriptor contains the following bit fields:

| Bit-field | Size | Description |
| --- | --- | --- |
| 0-13 | 14 | Base address in shared memory |
| 14-15 | 2 | Unused, 0 |
| 16-29 | 14 | LBO: leading dim byte offset |
| 30-31 | 2 | Unused, 0 |
| 32-45 | 14 | SBO: stride dim byte offset |
| 46-48 | 3 | Unused, 0 |
| 49-51 | 3 | Matrix base offset, 0 for canonical layouts |
| 52 | 1 | LBO mode, only matters for 48B K tile |
| 53-60 | 8 | Fixed, 0 |
| 61-63 | 3 | Swizzle mode |

* Start address, LBO, and SBO ignore the 4 LSBs. ## Fields * ​desc (`SIMD[uint64, 1]`): The 64-bit descriptor encodes shared memory operand information. ## Implemented traits `AnyType`, `Copyable`, `MMAOperandDescriptor`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(val: SIMD[uint64, 1]) -> Self` Initialize descriptor with raw 64-bit value. This constructor allows creating a descriptor directly from a 64-bit integer that already contains the properly formatted bit fields for the descriptor. The implicit attribute enables automatic conversion from `UInt64` to `MMASmemDescriptor`. **Args:** * ​val (`SIMD[uint64, 1]`): A 64-bit integer containing the complete descriptor bit layout. ### `__add__` `__add__(self, offset: Int) -> Self` Add offset to descriptor's base address. **Args:** * ​offset (`Int`): Byte offset to add to base address. **Returns:** New descriptor with updated base address. ### `__iadd__` `__iadd__(mut self, offset: Int)` Add offset to descriptor's base address in-place. **Args:** * ​offset (`Int`): Byte offset to add to base address. ### `create` `static create[stride_byte_offset: Int, leading_byte_offset: Int, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0))](smem_ptr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]) -> Self` Create a descriptor for shared memory operand. **Parameters:** * ​stride\_byte\_offset (`Int`): Stride dimension offset in bytes. * ​leading\_byte\_offset (`Int`): Leading dimension stride in bytes. * ​swizzle\_mode (`TensorMapSwizzle`): Memory access pattern mode. **Args:** * ​smem\_ptr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to shared memory operand. **Returns:** Initialized descriptor for the shared memory operand. --- ## UMMAInsDescriptor `@register_passable(trivial)` `struct UMMAInsDescriptor[mma_kind: UMMAKind]` Descriptor for UMMA instructions. This struct represents a descriptor that encodes information about UMMA instructions. The descriptor contains the following bit fields: * Sparsity (2 bits): The sparsity of the input matrices. Currently defaults to dense matrices. * Saturate for integer types (1 bit): Whether to saturate the result for integer types. Currently not supported. * Matrix D type (2 bits): Data type of matrix D. * Matrix A type (3 bits): Data type of matrix A. * Matrix B type (3 bits): Data type of matrix B. * Negate A matrix (1 bit): Whether to negate matrix A. Currently defaults to False.
* Negate B matrix (1 bit): Whether to negate matrix B. Currently defaults to False. * Transpose A (1 bit): Whether to transpose matrix A. * Transpose B (1 bit): Whether to transpose matrix B. * N, Dimension of Matrix B (6 bits): Number of columns in matrix B. 3 LSBs are unused. * M, Dimension of Matrix A (6 bits): Number of rows in matrix A. 3 LSBs are unused. ## Parameters * ​mma\_kind (`UMMAKind`): The kind of UMMA instruction. ## Fields * ​desc (`SIMD[uint32, 1]`): The 32-bit descriptor value that encodes UMMA instruction information. This field stores the complete descriptor with all bit fields packed into a single 32-bit integer: * Bits 0-1: Sparsity selector (2 bits) * Bit 2: Sparsity enable (1 bit) * Bit 3: Saturate for integer types (1 bit) * Bits 4-5: Matrix D type (2 bits) * Bit 6: Reserved (1 bit) * Bits 7-9: Matrix A type (3 bits) * Bits 10-12: Matrix B type (3 bits) * Bit 13: Negate A matrix (1 bit) * Bit 14: Negate B matrix (1 bit) * Bit 15: Transpose A (1 bit) * Bit 16: Transpose B (1 bit) * Bits 17-22: N, Dimension of Matrix B (6 bits) * Bit 23: Reserved (1 bit) * Bits 24-28: M, Dimension of Matrix A (5 bits) * Bit 29: Reserved (1 bit) * Bits 30-31: Maximum shift while attempting B matrix (2 bits) ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(value: SIMD[uint32, 1]) -> Self` Initialize descriptor with raw 32-bit value. This constructor allows creating a descriptor directly from a 32-bit integer that already contains the properly formatted bit fields for the descriptor. **Args:** * ​value (`SIMD[uint32, 1]`): A 32-bit integer containing the complete descriptor bit layout. ### `create` `static create[d_type: DType, a_type: DType, b_type: DType, output_shape: IndexList[2, element_type=uint32], /, *, transpose_a: Bool = False, transpose_b: Bool = True]() -> Self` Create a descriptor for UMMA instructions. This function creates a descriptor for UMMA instructions based on the provided parameters. **Parameters:** * ​d\_type (`DType`): The data type of matrix D. * ​a\_type (`DType`): The data type of matrix A. * ​b\_type (`DType`): The data type of matrix B. * ​output\_shape (`IndexList[2, element_type=uint32]`): The shape of the output matrix. * ​transpose\_a (`Bool`): Whether to transpose matrix A. * ​transpose\_b (`Bool`): Whether to transpose matrix B. **Returns:** A 32-bit integer containing the complete descriptor bit layout. --- ## UMMAKind `@register_passable(trivial)` `struct UMMAKind` Struct for UMMA instruction types. This struct defines the different types of UMMA instructions that are supported by Blackwell. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `KIND_F16` `alias KIND_F16 = UMMAKind(__init__[__mlir_type.!pop.int_literal](2))` f16 type ### `KIND_F8F6F4` `alias KIND_F8F6F4 = UMMAKind(__init__[__mlir_type.!pop.int_literal](3))` f8f6f4 type ### `KIND_I8` `alias KIND_I8 = UMMAKind(__init__[__mlir_type.!pop.int_literal](4))` i8 type ### `KIND_TF32` `alias KIND_TF32 = UMMAKind(__init__[__mlir_type.!pop.int_literal](0))` tf32 type ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Check if two UMMA kinds are equal. **Args:** * ​other (`Self`): The other UMMA kind to compare with. **Returns:** True if the UMMA kinds are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Check if two UMMA kinds are not equal. **Args:** * ​other (`Self`): The other UMMA kind to compare with.
**Returns:** True if the UMMA kinds are not equal, False otherwise. ### `__int__` `__int__(self) -> Int` Convert UMMA kind to an integer value. **Returns:** The integer value representing the UMMA instruction type. ### `__str__` `__str__(self) -> String` Convert UMMA kind to a string, which can be used as the instruction qualifier. **Returns:** The PTX qualifier representation of the UMMA kind. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Write the UMMA kind to a writer. **Parameters:** * ​W (`Writer`): The writer type that will receive the formatted output. **Args:** * ​writer (`W`): The writer to write the UMMA kind to. --- ## mma_sm100 This module includes utilities for working with the SM100 MMA instructions. ## Structs * [​`MMASmemDescriptor`](/mojo/stdlib/gpu/mma_sm100/MMASmemDescriptor): Descriptor for shared memory operands of tcgen05 mma instructions. * [​`UMMAInsDescriptor`](/mojo/stdlib/gpu/mma_sm100/UMMAInsDescriptor): Descriptor for UMMA instructions. * [​`UMMAKind`](/mojo/stdlib/gpu/mma_sm100/UMMAKind): Struct for UMMA instruction types. ## Functions * [​`mma`](/mojo/stdlib/gpu/mma_sm100/mma): Perform a matrix multiply-accumulate operation using the tcgen05.mma instruction. * [​`mma_arrive`](/mojo/stdlib/gpu/mma_sm100/mma_arrive): Arrive at the mbar pointer for the MMA instruction. * [​`mma_arrive_multicast`](/mojo/stdlib/gpu/mma_sm100/mma_arrive_multicast): Arrive at the mbar pointer for the MMA instruction for multiple ctas. --- ## mma `mma[kind: UMMAKind, //, cta_group: Int = 1, /, *, c_scale: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](1)](a_desc: MMASmemDescriptor, b_desc: MMASmemDescriptor, c_tmem: SIMD[uint32, 1], inst_desc: UMMAInsDescriptor[kind])` Perform a matrix multiply-accumulate operation using the tcgen05.mma instruction. **Parameters:** * ​kind (`UMMAKind`): Data type of the matrices. * ​cta\_group (`Int`): Number of ctas used by MMA. * ​c\_scale (`SIMD[uint32, 1]`): Scale factor for the C matrix, 0 or 1. **Args:** * ​a\_desc (`MMASmemDescriptor`): The descriptor for the A matrix. * ​b\_desc (`MMASmemDescriptor`): The descriptor for the B matrix. * ​c\_tmem (`SIMD[uint32, 1]`): The address of the C matrix in the tensor memory. * ​inst\_desc (`UMMAInsDescriptor[kind]`): The descriptor for the MMA instruction. `mma[kind: UMMAKind, //, cta_group: Int = 1, /](a_desc: MMASmemDescriptor, b_desc: MMASmemDescriptor, c_tmem: SIMD[uint32, 1], inst_desc: UMMAInsDescriptor[kind], c_scale: SIMD[uint32, 1])` Perform a matrix multiply-accumulate operation using the tcgen05.mma instruction. **Parameters:** * ​kind (`UMMAKind`): Data type of the matrices. * ​cta\_group (`Int`): Number of ctas used by MMA. **Args:** * ​a\_desc (`MMASmemDescriptor`): The descriptor for the A matrix. * ​b\_desc (`MMASmemDescriptor`): The descriptor for the B matrix. * ​c\_tmem (`SIMD[uint32, 1]`): The address of the C matrix in the tensor memory. * ​inst\_desc (`UMMAInsDescriptor[kind]`): The descriptor for the MMA instruction. * ​c\_scale (`SIMD[uint32, 1]`): Scale factor for the C matrix. Any non-zero value is translated to `1`. `mma[kind: UMMAKind, //, cta_group: Int = 1, /](a_desc: SIMD[uint32, 1], b_desc: MMASmemDescriptor, c_tmem: SIMD[uint32, 1], inst_desc: UMMAInsDescriptor[kind], c_scale: SIMD[uint32, 1])` Perform a matrix multiply-accumulate operation using the tcgen05.mma instruction. **Parameters:** * ​kind (`UMMAKind`): Data type of the matrices. * ​cta\_group (`Int`): Number of ctas used by MMA.
**Args:** * ​a\_desc (`SIMD[uint32, 1]`): The descriptor for the A matrix. * ​b\_desc (`MMASmemDescriptor`): The descriptor for the B matrix. * ​c\_tmem (`SIMD[uint32, 1]`): The address of the C matrix in the tensor memory. * ​inst\_desc (`UMMAInsDescriptor[kind]`): The descriptor for the MMA instruction. * ​c\_scale (`SIMD[uint32, 1]`): Scale factor for the C matrix. Any non-zero value is interpreted as `1`. `mma[kind: UMMAKind, //, cta_group: Int = 1, /, *, c_scale: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](1)](a_desc: SIMD[uint32, 1], b_desc: MMASmemDescriptor, c_tmem: SIMD[uint32, 1], inst_desc: UMMAInsDescriptor[kind])` Perform a matrix multiply-accumulate operation using the tcgen05.mma instruction. **Parameters:** * ​kind (`UMMAKind`): Data type of the matrices. * ​cta\_group (`Int`): Number of ctas used by MMA. * ​c\_scale (`SIMD[uint32, 1]`): Scale factor for the C matrix, 0 or 1. **Args:** * ​a\_desc (`SIMD[uint32, 1]`): The descriptor for the A matrix. * ​b\_desc (`MMASmemDescriptor`): The descriptor for the B matrix. * ​c\_tmem (`SIMD[uint32, 1]`): The address of the C matrix in the tensor memory. * ​inst\_desc (`UMMAInsDescriptor[kind]`): The descriptor for the MMA instruction. --- ## mma_arrive `mma_arrive[cta_group: Int = 1](mbar_ptr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin])` Arrive at the mbar pointer for the MMA instruction. **Parameters:** * ​cta\_group (`Int`): Number of ctas used by MMA. **Args:** * ​mbar\_ptr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the mbar. --- ## mma_arrive_multicast `mma_arrive_multicast[cta_group: Int = 1](mbar_ptr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], cta_mask: SIMD[uint16, 1])` Arrive at the mbar pointer for the MMA instruction for multiple ctas. **Parameters:** * ​cta\_group (`Int`): Number of ctas used by MMA. **Args:** * ​mbar\_ptr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the mbar. * ​cta\_mask (`SIMD[uint16, 1]`): Mask of ctas to signal. --- ## mma_util Matrix multiply accumulate (MMA) utilities for GPU tensor cores. This module provides functions for loading matrix tiles from memory into registers and storing results back to memory when using tensor cores for matrix multiplication. It supports both NVIDIA and AMD GPUs with functions specialized for different data types (FP32, FP16, BF16). The key functions are: * load\_matrix\_a: Loads tiles from the first input matrix A * load\_matrix\_b: Loads tiles from the second input matrix B * store\_matrix\_d: Stores result tiles to the output matrix D Each function handles the specific memory access patterns required by the tensor core instructions on each GPU architecture. The tile sizes and data layouts match the hardware requirements documented in the NVIDIA PTX and AMD Matrix Cores documentation. ## Functions * [​`load_matrix_a`](/mojo/stdlib/gpu/mma_util/load_matrix_a): Loads a tile of matrix A from memory to registers for TF32 tensor core operations. * [​`load_matrix_a_amd`](/mojo/stdlib/gpu/mma_util/load_matrix_a_amd): Loads a tile of matrix A from memory to registers for AMD FP32 tensor core operations. * [​`load_matrix_b`](/mojo/stdlib/gpu/mma_util/load_matrix_b): Loads a tile of matrix B from memory to registers for TF32 tensor core operations.
* [​`load_matrix_b_amd`](/mojo/stdlib/gpu/mma_util/load_matrix_b_amd): Loads a tile of matrix B from memory to registers for AMD FP32 tensor core operations. * [​`store_matrix_d`](/mojo/stdlib/gpu/mma_util/store_matrix_d): Stores matrix D tile from registers to memory after tensor core operation. --- ## load_matrix_a `load_matrix_a[m: Int, n: Int, k: Int](a_ptr: UnsafePointer[SIMD[float32, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float32, 4]` Loads a tile of matrix A from memory to registers for TF32 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​a\_ptr (`UnsafePointer[SIMD[float32, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing 4 TF32 values loaded from matrix A in the required order. `load_matrix_a[m: Int, n: Int, k: Int](a_ptr: UnsafePointer[SIMD[float16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float16, 4]` Loads a tile of matrix A from memory to registers for FP16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​a\_ptr (`UnsafePointer[SIMD[float16, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing 4 FP16 values loaded from matrix A in the required order. `load_matrix_a[m: Int, n: Int, k: Int](a_ptr: UnsafePointer[SIMD[bfloat16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[bfloat16, k // 2]` Loads a tile of matrix A from memory to registers for BF16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8 or m=16, n=8, k=16. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​a\_ptr (`UnsafePointer[SIMD[bfloat16, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing k//2 BF16 values loaded from matrix A in the required order. --- ## load_matrix_a_amd `load_matrix_a_amd[m: Int, n: Int, k: Int](a_ptr: UnsafePointer[SIMD[float32, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float32, 1]` Loads a tile of matrix A from memory to registers for AMD FP32 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=16, k=4. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication.
**Args:** * ​a\_ptr (`UnsafePointer[SIMD[float32, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing 1 FP32 value loaded from matrix A. `load_matrix_a_amd[m: Int, n: Int, k: Int, n_blocks: Int = 1](a_ptr: UnsafePointer[SIMD[float16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float16, 4]` Loads a tile of matrix A from memory to registers for AMD FP16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=16, k=16 and n\_blocks=1 or m=4, n=4, k=4 and n\_blocks=16. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. * ​n\_blocks (`Int`): Number of blocks. **Args:** * ​a\_ptr (`UnsafePointer[SIMD[float16, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing 4 FP16 values loaded from matrix A. `load_matrix_a_amd[m: Int, n: Int, k: Int, n_blocks: Int = 1](a_ptr: UnsafePointer[SIMD[bfloat16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[bfloat16, 4]` Loads a tile of matrix A from memory to registers for AMD BF16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=16, k=16 and n\_blocks=1 or m=4, n=4, k=4 and n\_blocks=16. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. * ​n\_blocks (`Int`): Number of blocks. **Args:** * ​a\_ptr (`UnsafePointer[SIMD[bfloat16, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing 4 BF16 values loaded from matrix A. --- ## load_matrix_b `load_matrix_b[m: Int, n: Int, k: Int](b_ptr: UnsafePointer[SIMD[float32, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float32, 2]` Loads a tile of matrix B from memory to registers for TF32 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​b\_ptr (`UnsafePointer[SIMD[float32, 1]]`): Pointer to matrix B data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). **Returns:** SIMD vector containing 2 TF32 values loaded from matrix B in the required order. `load_matrix_b[m: Int, n: Int, k: Int](b_ptr: UnsafePointer[SIMD[float16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float16, 2]` Loads a tile of matrix B from memory to registers for FP16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. 
**Args:** * ​b\_ptr (`UnsafePointer[SIMD[float16, 1]]`): Pointer to matrix B data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). **Returns:** SIMD vector containing 2 FP16 values loaded from matrix B in the required order. `load_matrix_b[m: Int, n: Int, k: Int](b_ptr: UnsafePointer[SIMD[bfloat16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[bfloat16, k // 4]` Loads a tile of matrix B from memory to registers for BF16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8 or m=16, n=8, k=16. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​b\_ptr (`UnsafePointer[SIMD[bfloat16, 1]]`): Pointer to matrix B data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). **Returns:** SIMD vector containing k//4 BF16 values loaded from matrix B in the required order. --- ## load_matrix_b_amd `load_matrix_b_amd[m: Int, n: Int, k: Int](b_ptr: UnsafePointer[SIMD[float32, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float32, 1]` Loads a tile of matrix B from memory to registers for AMD FP32 tensor core operations. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​b\_ptr (`UnsafePointer[SIMD[float32, 1]]`): Pointer to matrix B data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). **Returns:** SIMD vector containing 1 FP32 value loaded from matrix B. `load_matrix_b_amd[m: Int, n: Int, k: Int, n_blocks: Int = 1](b_ptr: UnsafePointer[SIMD[float16, 1]], tile_row: Int, tile_col: Int, ldm: Int, tile_loops: Int = 1) -> SIMD[float16, 4]` Loads a tile of matrix B from memory to registers for AMD FP16 tensor core operations. This function loads 4 consecutive FP16 values per thread from matrix B in a pattern optimized for AMD GPU tensor core operations. Each thread loads values based on its position within the warp. Performance: * Optimized for AMD GPU memory access patterns. * Uses thread ID to determine which elements to load. * Loads 4 consecutive elements per thread for efficient vectorization. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. * ​n\_blocks (`Int`): Number of blocks. **Args:** * ​b\_ptr (`UnsafePointer[SIMD[float16, 1]]`): Pointer to matrix B data in memory (FP16 format). * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). * ​tile\_loops (`Int`): Number of tile loops across matrix B's row dimension. **Returns:** SIMD vector containing 4 FP16 values loaded from matrix B.
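As a module-level illustration of how these loaders pair with `store_matrix_d`, here is a minimal sketch for the NVIDIA TF32 path (m=16, n=8, k=8). The single tile at the origin is an assumption for brevity, and the tensor core MMA step on the fragments is elided since its fragment shapes are covered under `gpu.mma.mma`; the placeholder `d` fragment simply has the shape `store_matrix_d` expects.

```mojo
from gpu.mma_util import load_matrix_a, load_matrix_b, store_matrix_d
from memory import UnsafePointer


fn tile_fragments(
    a_ptr: UnsafePointer[Float32],
    b_ptr: UnsafePointer[Float32],
    d_ptr: UnsafePointer[Float32],
    lda: Int, ldb: Int, ldd: Int,
):
    # Each thread in the warp receives its share of the 16x8x8 tile:
    # 4 TF32 values from A and 2 from B.
    var a_frag = load_matrix_a[16, 8, 8](a_ptr, 0, 0, lda)
    var b_frag = load_matrix_b[16, 8, 8](b_ptr, 0, 0, ldb)
    # The accumulate step that consumes a_frag/b_frag is elided here;
    # see `gpu.mma.mma` for the tensor core operation itself.
    _ = a_frag
    _ = b_frag
    var d = SIMD[DType.float32, 4](0)  # Placeholder result fragment.
    store_matrix_d[16, 8, 8](d_ptr, d, 0, 0, ldd)
```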
`load_matrix_b_amd[m: Int, n: Int, k: Int, n_blocks: Int = 1](b_ptr: UnsafePointer[SIMD[bfloat16, 1]], tile_row: Int, tile_col: Int, ldm: Int, tile_loops: Int = 1) -> SIMD[bfloat16, 4]` Loads a tile of matrix B from memory to registers for AMD BF16 tensor core operations. This function loads 4 consecutive BF16 values per thread from matrix B in a pattern optimized for AMD GPU tensor core operations. Each thread loads values based on its position within the warp. Performance: * Optimized for AMD GPU memory access patterns. * Uses thread ID to determine which elements to load. * Loads 4 consecutive elements per thread for efficient vectorization. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. * ​n\_blocks (`Int`): Number of blocks. **Args:** * ​b\_ptr (`UnsafePointer[SIMD[bfloat16, 1]]`): Pointer to matrix B data in memory (BF16 format). * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). * ​tile\_loops (`Int`): Number of tile loops across matrix B's row dimension. **Returns:** SIMD vector containing 4 BF16 values loaded from matrix B. --- ## store_matrix_d `store_matrix_d[dtype: DType, //, m: Int, n: Int, k: Int, n_blocks: Int = 1](d_ptr: UnsafePointer[SIMD[dtype, 1]], d: SIMD[dtype, 4], tile_row: Int, tile_col: Int, ldm: Int)` Stores matrix D tile from registers to memory after tensor core operation. This function dispatches to architecture-specific implementations for storing the results of a tensor core matrix multiply-accumulate operation. It handles the different memory layouts required by NVIDIA and AMD tensor cores. Note: * Automatically selects appropriate implementation based on GPU architecture. * Each thread stores 4 elements in architecture-specific positions. * Must be called by all threads in a warp. **Parameters:** * ​dtype (`DType`): Data type of the matrix elements. * ​m (`Int`): Number of rows in matrix D. * ​n (`Int`): Number of columns in matrix D. * ​k (`Int`): Inner dimension for matrix multiply. * ​n\_blocks (`Int`): Number of blocks. **Args:** * ​d\_ptr (`UnsafePointer[SIMD[dtype, 1]]`): Pointer to destination memory for matrix D. * ​d (`SIMD[dtype, 4]`): SIMD vector containing 4 elements to store. * ​tile\_row (`Int`): Starting row index of the tile in matrix D. * ​tile\_col (`Int`): Starting column index of the tile in matrix D. * ​ldm (`Int`): Leading dimension (stride) of matrix D. --- ## ProfileBlock `struct ProfileBlock[enabled: Bool = False]` A struct for profiling code blocks. This struct provides context manager functionality to profile code blocks. When enabled, it records the start and end time of the block and prints the timing information. ## Parameters * ​enabled (`Bool`): Whether profiling is enabled for this block. ## Fields * ​name (`StringSlice[StaticConstantOrigin]`): Name of the profiling block used for identification in timing output. * ​loc (`_SourceLocation`): Source code location information for the profiling block, including file, line, and column. * ​start\_time (`UInt`): Start time of the profiling block in nanoseconds, captured using perf\_counter\_ns(). ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, name: StringSlice[StaticConstantOrigin])` Initialize a new ProfileBlock. 
**Args:** * ​name (`StringSlice[StaticConstantOrigin]`): Name to identify this profiling block. ### `__enter__` `__enter__(mut self)` Enter the profiling block and record start time if enabled. ### `__exit__` `__exit__(mut self)` Exit the profiling block, record end time and print timing if enabled. --- ## profiler This module provides GPU profiling functionality. The profiler module enables performance profiling of GPU code blocks through a simple context manager interface. It includes: * ProfileBlock: A context manager for timing code blocks * Configurable profiling that can be enabled/disabled at compile time * Nanosecond precision timing using perf\_counter\_ns() * Source location tracking for profiled blocks * Formatted timing output Example: ```mojo from gpu import profiler with profiler.ProfileBlock("my_kernel"): # Code to profile run_gpu_kernel() ``` ## Structs * [​`ProfileBlock`](/mojo/stdlib/gpu/profiler/ProfileBlock): A struct for profiling code blocks. --- ## Random `struct Random[rounds: Int = 6]` A high-performance random number generator using the Philox algorithm. The Philox algorithm is a counter-based random number generator designed for parallel and GPU computing. It provides high-quality random numbers with excellent statistical properties. ## Parameters * ​rounds (`Int`): Number of mixing rounds to perform. Higher values provide better statistical quality at the cost of performance. Default is 6. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, *, seed: SIMD[uint64, 1] = __init__[__mlir_type.!pop.int_literal](0), subsequence: SIMD[uint64, 1] = __init__[__mlir_type.!pop.int_literal](0), offset: SIMD[uint64, 1] = __init__[__mlir_type.!pop.int_literal](0))` Initialize the random number generator. **Args:** * ​seed (`SIMD[uint64, 1]`): Initial seed value for reproducible sequences. Default is 0. * ​subsequence (`SIMD[uint64, 1]`): Subsequence number for generating independent streams. Default is 0. * ​offset (`SIMD[uint64, 1]`): Starting offset in the sequence. Default is 0. ### `step` `step(mut self) -> SIMD[uint32, 4]` Generate 4 random 32-bit unsigned integers. **Returns:** SIMD vector containing 4 random 32-bit unsigned integers. ### `step_uniform` `step_uniform(mut self) -> SIMD[float32, 4]` Generate 4 random floating point numbers uniformly distributed in \[0,1). **Returns:** SIMD vector containing 4 random float32 values in range \[0,1). --- ## random Random number generation for GPU kernels. This module implements a high-performance random number generator using the Philox algorithm, which is designed for parallel and GPU computing. The Philox algorithm is a counter-based random number generator that provides high-quality random numbers with excellent statistical properties. The main class is Random which generates both uniform random numbers and raw 32-bit integers. It supports: * Seeding for reproducible sequences * Multiple independent subsequences * Configurable number of rounds for quality vs performance tradeoff * Vectorized operations for efficiency Example: ```mojo from gpu.random import Random rng = Random(seed=42) uniform_values = rng.step_uniform() # Returns 4 random floats in [0,1) raw_values = rng.step() # Returns 4 raw 32-bit integers ``` ## Structs * [​`Random`](/mojo/stdlib/gpu/random/Random): A high-performance random number generator using the Philox algorithm. --- ## Semaphore `@register_passable` `struct Semaphore` A device-wide semaphore implementation for GPUs. 
This struct provides atomic operations and memory barriers for inter-CTA synchronization. It uses a single thread per CTA to perform atomic operations on a shared lock variable. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(lock: UnsafePointer[SIMD[int32, 1]], thread_id: Int) -> Self` Initialize a new Semaphore instance. **Args:** * ​lock (`UnsafePointer[SIMD[int32, 1]]`): Pointer to shared lock variable in global memory. * ​thread\_id (`Int`): Thread ID within the CTA, used to determine if this thread should perform atomic operations. ### `fetch` `fetch(mut self)` Fetch the current state of the semaphore from global memory. Only the designated wait thread (thread 0) performs the actual load, using an acquire memory ordering to ensure proper synchronization. ### `state` `state(self) -> SIMD[int32, 1]` Get the current state of the semaphore. **Returns:** The current state value of the semaphore. ### `wait` `wait(mut self, status: Int = 0)` Wait until the semaphore reaches the specified state. Uses a barrier-based spin loop where all threads participate in checking the state. Only proceeds when the state matches the expected status. **Args:** * ​status (`Int`): The state value to wait for (defaults to 0). ### `release` `release(mut self, status: SIMD[int32, 1] = __init__[__mlir_type.!pop.int_literal](0))` Release the semaphore by setting it to the specified state. Ensures all threads have reached this point via a barrier before the designated thread updates the semaphore state. **Args:** * ​status (`SIMD[int32, 1]`): The new state value to set (defaults to 0). --- ## semaphore This module provides a device-wide semaphore implementation for NVIDIA GPUs. The Semaphore struct enables inter-CTA (Cooperative Thread Array) synchronization by providing atomic operations and memory barriers. It uses NVIDIA-specific intrinsics to implement efficient thread synchronization. Example:
```mojo
from gpu import Semaphore

var lock = UnsafePointer[Int32](...)
var sem = Semaphore(lock, thread_id)

# Wait for a specific state
sem.wait(0)

# Release the semaphore
sem.release(1)
```
## Structs * [​`Semaphore`](/mojo/stdlib/gpu/semaphore/Semaphore): A device-wide semaphore implementation for GPUs. --- ## AMDScheduleBarrierMask `@register_passable(trivial)` `struct AMDScheduleBarrierMask` Represents different instruction scheduling masks for AMDGPU scheduling instructions. These masks control which types of instructions can be reordered across a barrier for performance optimization. When used with schedule\_barrier(), the mask determines which instructions the compiler is allowed to move across the barrier point. ## Implemented traits `AnyType`, `Copyable`, `Intable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ALL_ALU` `alias ALL_ALU = AMDScheduleBarrierMask(1)` Allows reordering of all arithmetic and logic instructions that don't involve memory operations. ### `ALL_DS` `alias ALL_DS = AMDScheduleBarrierMask(128)` Permits reordering of all Local Data Share (LDS) operations. ### `ALL_VMEM` `alias ALL_VMEM = AMDScheduleBarrierMask(16)` Enables reordering of all vector memory operations (reads and writes). ### `DS_READ` `alias DS_READ = AMDScheduleBarrierMask(256)` Enables reordering of LDS read operations only. ### `DS_WRITE` `alias DS_WRITE = AMDScheduleBarrierMask(512)` Enables reordering of LDS write operations only. ### `MFMA` `alias MFMA = AMDScheduleBarrierMask(8)` Allows reordering of matrix multiplication and WMMA instructions.
### `NONE` `alias NONE = AMDScheduleBarrierMask(0)` No instructions can cross the barrier. Most restrictive option. ### `SALU` `alias SALU = AMDScheduleBarrierMask(4)` Permits reordering of scalar arithmetic/logic unit instructions only. ### `TRANS` `alias TRANS = AMDScheduleBarrierMask(1024)` Allows reordering of transcendental instructions (sin, cos, exp, etc). ### `VALU` `alias VALU = AMDScheduleBarrierMask(2)` Permits reordering of vector arithmetic/logic unit instructions only. ### `VMEM_READ` `alias VMEM_READ = AMDScheduleBarrierMask(32)` Allows reordering of vector memory read operations only. ### `VMEM_WRITE` `alias VMEM_WRITE = AMDScheduleBarrierMask(64)` Allows reordering of vector memory write operations only. ## Methods ### `__init__` `@implicit` `__init__(value: Int) -> Self` Initializes an `AMDScheduleBarrierMask` from an integer value. This implicit constructor allows creating a barrier mask directly from an integer, which is useful for combining multiple mask flags using bitwise operations. **Args:** * ​value (`Int`): The integer value to use for the barrier mask. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compares two `AMDScheduleBarrierMask` instances for equality. **Args:** * ​other (`Self`): The other `AMDScheduleBarrierMask` to compare with. **Returns:** True if the masks have the same value, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compares two `AMDScheduleBarrierMask` instances for inequality. **Args:** * ​other (`Self`): The other `AMDScheduleBarrierMask` to compare with. **Returns:** True if the masks have different values, False otherwise. ### `__str__` `__str__(self) -> String` Returns a string representation of the `AMDScheduleBarrierMask`. Converts the mask to a human-readable string based on its value. **Returns:** A string representation of the mask, or aborts if the value is invalid. ### `__int__` `__int__(self) -> Int` Converts the `AMDScheduleBarrierMask` to an integer. **Returns:** The integer value of the mask, which can be used with low-level APIs. --- ## async_copy_arrive `async_copy_arrive[type: AnyType, address_space: AddressSpace](address: UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin])` Makes a memory barrier track all prior async copy operations from this thread. This function ensures that all previously initiated asynchronous copy operations from the executing thread are tracked by the memory barrier at the specified location. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The data type stored at the barrier location. * ​address\_space (`AddressSpace`): The memory address space where the barrier is located. **Args:** * ​address (`UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory barrier object location. --- ## barrier `barrier()` Performs a synchronization barrier at the block level. This is equivalent to \_\_syncthreads() in CUDA. All threads in a thread block must execute this function before any thread can proceed past the barrier. This ensures memory operations before the barrier are visible to all threads after the barrier. --- ## cp_async_bulk_commit_group `cp_async_bulk_commit_group()` Commits all previously initiated but uncommitted cp.async.bulk instructions into a cp.async.bulk-group.
The cp.async.bulk instructions are used for asynchronous bulk memory transfers on NVIDIA GPUs. The function creates a synchronization point for bulk memory transfers, allowing better control over memory movement and synchronization between different stages of computation. Note: This functionality is only available on NVIDIA GPUs. Attempting to use this function on non-NVIDIA GPUs will result in a compile time error. --- ## cp_async_bulk_wait_group `cp_async_bulk_wait_group[n: SIMD[int32, 1], read: Bool = True]()` Waits for completion of asynchronous bulk memory transfer groups. This function causes the executing thread to wait until a specified number of the most recent bulk async-groups are pending. It provides synchronization control for bulk memory transfers on NVIDIA GPUs. Note: This functionality is only available on NVIDIA GPUs. Attempting to use this function on non-NVIDIA GPUs will result in a compile time error. Example: ```mojo from gpu.sync import cp_async_bulk_wait_group # Wait until at most 2 async groups are pending cp_async_bulk_wait_group[2]() # Wait for all async groups to complete cp_async_bulk_wait_group[0]() ``` **Parameters:** * ​n (`SIMD[int32, 1]`): The number of most recent bulk async-groups allowed to remain pending. When n=0, waits for all prior bulk async-groups to complete. * ​read (`Bool`): If True, indicates that subsequent reads to the transferred memory are expected, enabling optimizations for read access patterns. Defaults to True. --- ## sync This module provides GPU synchronization primitives and barriers. The module includes: * Block-level synchronization barriers (barrier()) * Warp-level synchronization (syncwarp()) * Memory barriers (mbarrier) for NVIDIA GPUs * Instruction scheduling controls for AMD GPUs * Asynchronous copy and bulk transfer synchronization The synchronization primitives help coordinate execution between threads within thread blocks and warps, and manage memory consistency across different memory spaces. ## Structs * [​`AMDScheduleBarrierMask`](/mojo/stdlib/gpu/sync/AMDScheduleBarrierMask): Represents different instruction scheduling masks for AMDGPU scheduling instructions. ## Functions * [​`async_copy_arrive`](/mojo/stdlib/gpu/sync/async_copy_arrive): Makes a memory barrier track all prior async copy operations from this thread. * [​`barrier`](/mojo/stdlib/gpu/sync/barrier): Performs a synchronization barrier at the block level. * [​`cp_async_bulk_commit_group`](/mojo/stdlib/gpu/sync/cp_async_bulk_commit_group): Commits all previously initiated but uncommitted cp.async.bulk instructions into a cp.async.bulk-group. * [​`cp_async_bulk_wait_group`](/mojo/stdlib/gpu/sync/cp_async_bulk_wait_group): Waits for completion of asynchronous bulk memory transfer groups. * [​`mbarrier_arrive`](/mojo/stdlib/gpu/sync/mbarrier_arrive): Signal thread arrival at a shared memory barrier. * [​`mbarrier_arrive_expect_tx_relaxed`](/mojo/stdlib/gpu/sync/mbarrier_arrive_expect_tx_relaxed): Configure a shared memory barrier to expect additional async transactions. * [​`mbarrier_arrive_expect_tx_shared`](/mojo/stdlib/gpu/sync/mbarrier_arrive_expect_tx_shared): Configure a shared memory barrier to expect additional async transactions. * [​`mbarrier_init`](/mojo/stdlib/gpu/sync/mbarrier_init): Initialize a shared memory barrier for synchronizing multiple threads. * [​`mbarrier_test_wait`](/mojo/stdlib/gpu/sync/mbarrier_test_wait): Test if all threads have arrived at the memory barrier.
* [​`mbarrier_try_wait_parity_shared`](/mojo/stdlib/gpu/sync/mbarrier_try_wait_parity_shared): Wait for completion of a barrier phase with timeout. * [​`named_barrier`](/mojo/stdlib/gpu/sync/named_barrier): Performs a named synchronization barrier at the block level. * [​`schedule_barrier`](/mojo/stdlib/gpu/sync/schedule_barrier): Controls instruction scheduling across a barrier point in AMD GPU code. * [​`schedule_group_barrier`](/mojo/stdlib/gpu/sync/schedule_group_barrier): Controls instruction scheduling across a barrier point in AMD GPU code by creating schedule groups. * [​`syncwarp`](/mojo/stdlib/gpu/sync/syncwarp): Synchronizes threads within a warp using a barrier. --- ## mbarrier_arrive `mbarrier_arrive[type: AnyType](shared_mem: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]) -> Int` Signal thread arrival at a shared memory barrier. Records that the calling thread has reached the barrier synchronization point. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The data type stored at the barrier location. **Args:** * ​shared\_mem (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the shared memory barrier. **Returns:** An integer representing the current state of the memory barrier. --- ## mbarrier_arrive_expect_tx_relaxed `mbarrier_arrive_expect_tx_relaxed[type: AnyType, scope: Scope = Scope(3), space: Scope = Scope(3)](addr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], tx_count: SIMD[int32, 1]) -> SIMD[uint64, 1]` Configure a shared memory barrier to expect additional async transactions. Updates the current phase of the memory barrier to track completion of additional asynchronous transactions. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The type of the memory barrier. * ​scope (`Scope`): The scope of the memory barrier. * ​space (`Scope`): The space of the memory barrier. **Args:** * ​addr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the shared memory barrier. * ​tx\_count (`SIMD[int32, 1]`): Number of expected transactions to track. **Returns:** The state of the memory barrier. --- ## mbarrier_arrive_expect_tx_shared `mbarrier_arrive_expect_tx_shared[type: AnyType](addr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], tx_count: SIMD[int32, 1])` Configure a shared memory barrier to expect additional async transactions. Updates the current phase of the memory barrier to track completion of additional asynchronous transactions. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The type of the memory barrier. **Args:** * ​addr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the shared memory barrier. * ​tx\_count (`SIMD[int32, 1]`): Number of expected transactions to track. --- ## mbarrier_init `mbarrier_init[type: AnyType](shared_mem: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], num_threads: SIMD[int32, 1])` Initialize a shared memory barrier for synchronizing multiple threads. Sets up a memory barrier in shared memory that will be used to synchronize the specified number of threads. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The data type stored at the barrier location.
**Args:** * ​shared\_mem (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to shared memory location for the barrier. * ​num\_threads (`SIMD[int32, 1]`): Number of threads that will synchronize on this barrier. --- ## mbarrier_test_wait `mbarrier_test_wait[type: AnyType](shared_mem: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], state: Int) -> Bool` Test if all threads have arrived at the memory barrier. Non-blocking check to see if all participating threads have reached the barrier. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The data type stored at the barrier location. **Args:** * ​shared\_mem (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the shared memory barrier. * ​state (`Int`): Expected state of the memory barrier. **Returns:** True if all threads have arrived, False otherwise. --- ## mbarrier_try_wait_parity_shared `mbarrier_try_wait_parity_shared[type: AnyType](addr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], phase: SIMD[int32, 1], ticks: SIMD[int32, 1])` Wait for completion of a barrier phase with timeout. Waits for the shared memory barrier to complete the specified phase, or until the timeout period expires. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The type of the memory barrier. **Args:** * ​addr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the shared memory barrier. * ​phase (`SIMD[int32, 1]`): Phase number to wait for. * ​ticks (`SIMD[int32, 1]`): Timeout period in nanoseconds. --- ## named_barrier `named_barrier[num_threads: SIMD[int32, 1], id: SIMD[int32, 1] = __init__[__mlir_type.!pop.int_literal](0)]()` Performs a named synchronization barrier at the block level. This function creates a synchronization point using a specific barrier ID, allowing for multiple independent barriers within a thread block. All threads in the block must execute this function with the same barrier ID and thread count before any thread can proceed past the barrier. Notes: * Only supported on NVIDIA GPUs. * Maps directly to the `nvvm.barrier` instruction. * Useful for fine-grained synchronization when different subsets of threads need to synchronize independently. * The barrier ID must not exceed 16. * All threads participating in the barrier must specify the same num\_threads value. **Parameters:** * ​num\_threads (`SIMD[int32, 1]`): The number of threads that must reach the barrier before any can proceed. * ​id (`SIMD[int32, 1]`): The barrier identifier (0-16). Default is 0. --- ## schedule_barrier `schedule_barrier(mask: AMDScheduleBarrierMask = AMDScheduleBarrierMask(0))` Controls instruction scheduling across a barrier point in AMD GPU code. This function creates a scheduling barrier that controls which types of instructions can be reordered across it by the compiler. The mask parameter specifies which instruction categories (ALU, memory, etc) are allowed to cross the barrier during scheduling optimization. Note: This function only has an effect on AMD GPUs. On other platforms it will raise a compile time error. **Args:** * ​mask (`AMDScheduleBarrierMask`): A bit mask of AMDScheduleBarrierMask flags indicating which instruction types can be scheduled across this barrier. Default is NONE, meaning no instructions can cross. 
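As a brief illustration of the mask argument described above, here is a minimal sketch, assuming the calls run inside an AMD GPU kernel (the choice of masks is illustrative only):

```mojo
from gpu.sync import AMDScheduleBarrierMask, schedule_barrier

# Let only non-memory arithmetic/logic instructions be reordered
# across this point during compiler scheduling.
schedule_barrier(AMDScheduleBarrierMask.ALL_ALU)

# Most restrictive form: no instructions may cross the barrier.
schedule_barrier(AMDScheduleBarrierMask.NONE)
```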
--- ## schedule_group_barrier `schedule_group_barrier(mask: AMDScheduleBarrierMask, size: SIMD[int32, 1], sync_id: SIMD[int32, 1])` Controls instruction scheduling across a barrier point in AMD GPU code by creating schedule groups. This function creates a scheduling barrier that groups instructions into sequences with custom ordering. It affects the code that precedes the barrier. The barrier ensures instructions are scheduled according to the specified group parameters. Note: This function only has an effect on AMD GPUs. On other platforms it will raise a compile time error. The sync\_id parameter allows creating multiple schedule groups that can be ordered relative to each other. **Args:** * ​mask (`AMDScheduleBarrierMask`): A bit mask of AMDScheduleBarrierMask flags indicating which instruction types can be scheduled across this barrier. Similar to schedule\_barrier masks. * ​size (`SIMD[int32, 1]`): The number of times to repeat the instruction sequence in the schedule group. * ​sync\_id (`SIMD[int32, 1]`): A unique identifier for the group that determines the ordering of instructions within the same schedule group. --- ## syncwarp `syncwarp(mask: Int = -1)` Synchronizes threads within a warp using a barrier. This function creates a synchronization point where threads in a warp must wait until all threads specified by the mask reach this point. On NVIDIA GPUs, it uses warp-level synchronization primitives. On AMD GPUs, this is a no-op since threads execute in lock-step. Note: * On NVIDIA GPUs, this maps to the nvvm.bar.warp.sync intrinsic. * On AMD GPUs, this is a no-op since threads execute in lock-step. * Threads not participating in the sync must still execute the instruction. **Args:** * ​mask (`Int`): An integer bitmask specifying which lanes (threads) in the warp should be synchronized. Each bit corresponds to a lane, with bit i controlling lane i. A value of 1 means the lane participates in the sync, 0 means it does not. Default value of -1 (all bits set) synchronizes all lanes. --- ## TensorMemory `@register_passable(trivial)` `struct TensorMemory` A wrapper around tensor memory allocated for tcgen05 instructions. ## Fields * ​ptr (`UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3), alignment=16]`): Pointer to the tensor memory address. * ​num\_cols (`SIMD[uint32, 1]`): The number of columns in the tensor memory. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(num_cols: SIMD[uint32, 1]) -> Self` Initialize the TensorMemory struct. **Args:** * ​num\_cols (`SIMD[uint32, 1]`): The number of columns to allocate. --- ## tcgen05 This module includes utilities for working with the tensorcore 5th generation (tcgen05) instructions. ## Aliases ### `check_blackwell_constraint` `alias check_blackwell_constraint = constrained[_has_blackwell_tcgen05(), "The tcgen05 instructions are only applicable on NVIDIA Blackwell (sm_100a, sm_101a) hardware."]` ## Structs * [​`TensorMemory`](/mojo/stdlib/gpu/tcgen05/TensorMemory): A wrapper around tensor memory allocated for tcgen05 instructions. ## Functions * [​`tcgen05_alloc`](/mojo/stdlib/gpu/tcgen05/tcgen05_alloc): Allocates tensor memory for use with tcgen05 instructions. * [​`tcgen05_cp`](/mojo/stdlib/gpu/tcgen05/tcgen05_cp): Copies data from shared memory described by the matrix descriptor `s_desc` to tensor memory `tmem_addr`.
* [​`tcgen05_dealloc`](/mojo/stdlib/gpu/tcgen05/tcgen05_dealloc): Deallocates tensor memory allocated by tcgen05\_alloc(). * [​`tcgen05_fence_after`](/mojo/stdlib/gpu/tcgen05/tcgen05_fence_after): Orders all the subsequent asynchronous `tcgen05` operations. * [​`tcgen05_fence_before`](/mojo/stdlib/gpu/tcgen05/tcgen05_fence_before): Orders all the prior asynchronous `tcgen05` operations. * [​`tcgen05_ld`](/mojo/stdlib/gpu/tcgen05/tcgen05_ld): Loads data from tensor memory into registers. * [​`tcgen05_load_wait`](/mojo/stdlib/gpu/tcgen05/tcgen05_load_wait): Waits for tensor memory loads to complete. * [​`tcgen05_release_allocation_lock`](/mojo/stdlib/gpu/tcgen05/tcgen05_release_allocation_lock): Releases the allocation lock for the current CTA group. * [​`tcgen05_st`](/mojo/stdlib/gpu/tcgen05/tcgen05_st): Stores data from registers into tensor memory. * [​`tcgen05_store_wait`](/mojo/stdlib/gpu/tcgen05/tcgen05_store_wait): Waits for tensor memory stores to complete. --- ## tcgen05_alloc `tcgen05_alloc[cta_group: SIMD[int32, 1]](ptr_tmem_addr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3), alignment=16], num_cols: SIMD[uint32, 1])` Allocates tensor memory for use with tcgen05 instructions. Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+). **Parameters:** * ​cta\_group (`SIMD[int32, 1]`): The cooperative thread array (CTA) group ID. **Args:** * ​ptr\_tmem\_addr (`UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3), alignment=16]`): Shared memory pointer to hold tensor memory address. * ​num\_cols (`SIMD[uint32, 1]`): The number of columns to allocate. --- ## tcgen05_cp `tcgen05_cp[*, cta_group: SIMD[int32, 1], datapaths: Int, bits: Int, src_fmt: String = "", dst_fmt: String = "", multicast: String = ""](tmem_addr: SIMD[uint32, 1], s_desc: MMASmemDescriptor)` Copies data from shared memory described by the matrix descriptor `s_desc` to tensor memory `tmem_addr`. Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+). **Parameters:** * ​cta\_group (`SIMD[int32, 1]`): The cooperative thread array (CTA) group ID. * ​datapaths (`Int`): The first dimension of the shape. * ​bits (`Int`): The second dimension of the shape. * ​src\_fmt (`String`): Source format string. * ​dst\_fmt (`String`): Destination format string. * ​multicast (`String`): Multicast string. **Args:** * ​tmem\_addr (`SIMD[uint32, 1]`): Address of the tensor memory. * ​s\_desc (`MMASmemDescriptor`): Matrix descriptor for the copy operation. --- ## tcgen05_dealloc `tcgen05_dealloc[cta_group: SIMD[int32, 1]](tmem_addr: SIMD[uint32, 1], num_cols: SIMD[uint32, 1])` Deallocates tensor memory allocated by tcgen05\_alloc(). This function deallocates tensor memory that was previously allocated using tcgen05\_alloc(). The deallocation must be performed by the same CTA group that performed the allocation. **Parameters:** * ​cta\_group (`SIMD[int32, 1]`): The cooperative thread array (CTA) group ID. **Args:** * ​tmem\_addr (`SIMD[uint32, 1]`): Address of the tensor memory to deallocate. * ​num\_cols (`SIMD[uint32, 1]`): Number of columns in the tensor memory. --- ## tcgen05_fence_after `tcgen05_fence_after()` Orders all the subsequent asynchronous `tcgen05` operations. Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+). --- ## tcgen05_fence_before `tcgen05_fence_before()` Orders all the prior asynchronous `tcgen05` operations.
Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+). --- ## tcgen05_ld `tcgen05_ld[*, datapaths: Int, bits: Int, repeat: Int, type: DType, pack: Bool, width: Int = (datapaths * bits * repeat) // 1024](tmem_addr: SIMD[uint32, 1]) -> SIMD[type, width]` Loads data from tensor memory into registers. **Parameters:** * ​datapaths (`Int`): The first dimension of the shape. * ​bits (`Int`): The second dimension of the shape. * ​repeat (`Int`): The repeat factor. * ​type (`DType`): The data type to load. * ​pack (`Bool`): Whether to pack two 16-bit chunks of adjacent columns into a single 32-bit register. * ​width (`Int`): The number of elements in the result vector. **Args:** * ​tmem\_addr (`SIMD[uint32, 1]`): The address of the tensor memory to load from. **Returns:** The SIMD register containing the loaded data. --- ## tcgen05_load_wait `tcgen05_load_wait()` Waits for tensor memory loads to complete. Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+). --- ## tcgen05_release_allocation_lock `tcgen05_release_allocation_lock[cta_group: SIMD[int32, 1]]()` Releases the allocation lock for the current CTA group. Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+). **Parameters:** * ​cta\_group (`SIMD[int32, 1]`): The cooperative thread array (CTA) group ID. --- ## tcgen05_st `tcgen05_st[type: DType, width: Int, //, *, datapaths: Int, bits: Int, repeat: Int, pack: Bool](tmem_addr: SIMD[uint32, 1], data: SIMD[type, width])` Stores data from registers into tensor memory. **Parameters:** * ​type (`DType`): The data type to store. * ​width (`Int`): The number of elements in the data vector. * ​datapaths (`Int`): The first dimension of the shape. * ​bits (`Int`): The second dimension of the shape. * ​repeat (`Int`): The repeat factor. * ​pack (`Bool`): Whether to pack two 16-bit chunks of adjacent columns into a single 32-bit register. **Args:** * ​tmem\_addr (`SIMD[uint32, 1]`): The address of the tensor memory to store to. * ​data (`SIMD[type, width]`): The data to store into the tensor memory. --- ## tcgen05_store_wait `tcgen05_store_wait()` Waits for tensor memory stores to complete. Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+). --- ## tensor_ops This module provides tensor core operations and utilities for GPU computation. The module includes functions for: * Tensor core based reductions (tc\_reduce) supporting various data types and SIMD widths * GEVM (General Matrix-Vector Multiplication) reductions using tensor cores * Efficient warp-level reductions leveraging tensor core operations The tensor core operations are optimized for NVIDIA GPUs and support different data types including float32, float16, and bfloat16. The module provides both scalar and vector variants of reduction operations with different SIMD widths for maximum performance. Key functions: * tc\_reduce: Main tensor core reduction function supporting various types and widths * tc\_reduce\_gevm\_8x: 8x GEVM reduction using tensor cores * tc\_reduce\_gevm\_4x: 4x GEVM reduction using tensor cores Note: Most operations require NVIDIA GPUs with tensor core support. Operations are optimized for warp-level execution.
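As a sketch of how the main entry point is typically invoked (the kernel fragment and values are illustrative assumptions; the import path follows the URLs in this reference, and a tensor core capable NVIDIA GPU is assumed):

```mojo
from gpu.tensor_ops import tc_reduce

fn reduce_fragment():
    # Four bfloat16 values per lane, reduced to a single float32 scalar
    # via tensor core operations; in_type and simd_width are inferred
    # from the argument, only out_type is passed explicitly.
    var vals = SIMD[DType.bfloat16, 4](1.0, 2.0, 3.0, 4.0)
    var total = tc_reduce[DType.float32](vals)
    _ = total
```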
## Functions * [​`tc_reduce`](/mojo/stdlib/gpu/tensor_ops/tc_reduce): Performs tensor core based reduction on a SIMD vector. * [​`tc_reduce_gevm_4x`](/mojo/stdlib/gpu/tensor_ops/tc_reduce_gevm_4x): Performs a 4x GEVM reduction using tensor cores. * [​`tc_reduce_gevm_8x`](/mojo/stdlib/gpu/tensor_ops/tc_reduce_gevm_8x): Performs an 8x GEVM reduction using tensor cores. --- ## tc_reduce `tc_reduce[in_type: DType, simd_width: Int, //, out_type: DType](val: SIMD[in_type, simd_width]) -> SIMD[out_type, 1]` Performs tensor core based reduction on a SIMD vector. Note: Dispatches to either scalar or vector reduction implementation based on SIMD width. Supports various input/output type combinations using tensor core operations. **Parameters:** * ​in\_type (`DType`): The input data type of the SIMD vector elements. * ​simd\_width (`Int`): The width of the SIMD vector. * ​out\_type (`DType`): The output data type for the reduced result. **Args:** * ​val (`SIMD[in_type, simd_width]`): Input SIMD vector to reduce. **Returns:** Scalar containing the reduced result. --- ## tc_reduce_gevm_4x `tc_reduce_gevm_4x[out_type: DType, in_type: DType, simd_width: Int](val1: SIMD[in_type, simd_width]) -> SIMD[out_type, simd_width]` Performs a 4x GEVM reduction using tensor cores. Note: Currently only supports bfloat16 input to float32 output conversion. Uses tensor core matrix multiply-accumulate (MMA) operations for reduction. **Parameters:** * ​out\_type (`DType`): The output data type for the reduction result (must be float32). * ​in\_type (`DType`): The input data type of the vector to reduce (must be bfloat16). * ​simd\_width (`Int`): The width of the SIMD vector. **Args:** * ​val1 (`SIMD[in_type, simd_width]`): Input SIMD vector to reduce. **Returns:** SIMD vector containing the reduced result. --- ## tc_reduce_gevm_8x `tc_reduce_gevm_8x[out_type: DType, in_type: DType, simd_width: Int](val1: SIMD[in_type, simd_width], val2: SIMD[in_type, simd_width]) -> SIMD[out_type, simd_width]` Performs an 8x GEVM reduction using tensor cores. Note: Currently only supports bfloat16 input to float32 output conversion. Uses tensor core matrix multiply-accumulate (MMA) operations for reduction. **Parameters:** * ​out\_type (`DType`): The output data type for the reduction result (must be float32). * ​in\_type (`DType`): The input data type of the vectors to reduce (must be bfloat16). * ​simd\_width (`Int`): The width of the SIMD vectors. **Args:** * ​val1 (`SIMD[in_type, simd_width]`): First input SIMD vector to reduce. * ​val2 (`SIMD[in_type, simd_width]`): Second input SIMD vector to reduce. **Returns:** SIMD vector containing the reduced result. --- ## ReductionMethod `@register_passable(trivial)` `struct ReductionMethod` Enumerates the supported reduction methods. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `TENSOR_CORE` `alias TENSOR_CORE = ReductionMethod(0)` Use tensor core for reduction. ### `WARP` `alias WARP = ReductionMethod(1)` Use warp shuffle for reduction. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two `ReductionMethod` values are equal. **Args:** * ​other (`Self`): The other ReductionMethod to compare. **Returns:** True if the `ReductionMethod` values are equal, false otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if two `ReductionMethod` values are not equal. **Args:** * ​other (`Self`): The other ReductionMethod to compare. **Returns:** True if the `ReductionMethod` values are not equal, false otherwise.
### `__is__` `__is__(self, other: Self) -> Bool` Checks if two `ReductionMethod` values are identical. **Args:** * ​other (`Self`): The other ReductionMethod to compare. **Returns:** True if the `ReductionMethod` values are identical, false otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Checks if two `ReductionMethod` values are not identical. **Args:** * ​other (`Self`): The other ReductionMethod to compare. **Returns:** True if the `ReductionMethod` values are not identical, false otherwise. --- ## broadcast `broadcast[val_type: DType, simd_width: Int, //](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Broadcasts a SIMD value from lane 0 to all lanes in the warp. This function takes a SIMD value from lane 0 and copies it to all other lanes in the warp, effectively broadcasting the value across the entire warp. This is useful for sharing data between threads in a warp without using shared memory. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to broadcast from lane 0. **Returns:** A SIMD value where all lanes contain a copy of the input value from lane 0. `broadcast(val: Int) -> Int` Broadcasts an integer value from lane 0 to all lanes in the warp. This function takes an integer value from lane 0 and copies it to all other lanes in the warp. It provides a convenient way to share scalar integer data between threads without using shared memory. **Args:** * ​val (`Int`): The integer value to broadcast from lane 0. **Returns:** The broadcast integer value, where all lanes receive a copy of the input from lane 0. `broadcast(val: UInt) -> UInt` Broadcasts an unsigned integer value from lane 0 to all lanes in the warp. This function takes an unsigned integer value from lane 0 and copies it to all other lanes in the warp. It provides a convenient way to share scalar unsigned integer data between threads without using shared memory. **Args:** * ​val (`UInt`): The unsigned integer value to broadcast from lane 0. **Returns:** The broadcast unsigned integer value, where all lanes receive a copy of the input from lane 0. --- ## warp GPU warp-level operations and utilities. This module provides warp-level operations for NVIDIA and AMD GPUs, including: * Shuffle operations to exchange values between threads in a warp: * shuffle\_idx: Copy value from source lane to other lanes * shuffle\_up: Copy from lower lane IDs * shuffle\_down: Copy from higher lane IDs * shuffle\_xor: Exchange values in butterfly pattern * Warp-wide reductions: * sum: Compute sum across warp * max: Find maximum value across warp * min: Find minimum value across warp * broadcast: Broadcast value to all lanes The module handles both NVIDIA and AMD GPU architectures through architecture-specific implementations of the core operations. It supports various data types including integers, floats, and half-precision floats, with SIMD vectorization. ## Structs * [​`ReductionMethod`](/mojo/stdlib/gpu/warp/ReductionMethod): Enumerates the supported reduction methods. ## Functions * [​`broadcast`](/mojo/stdlib/gpu/warp/broadcast): Broadcasts a SIMD value from lane 0 to all lanes in the warp. * [​`lane_group_max`](/mojo/stdlib/gpu/warp/lane_group_max): Reduces a SIMD value to its maximum within a lane group using warp-level operations.
* [​`lane_group_max_and_broadcast`](/mojo/stdlib/gpu/warp/lane_group_max_and_broadcast): Reduces and broadcasts the maximum value within a lane group using warp-level operations. * [​`lane_group_min`](/mojo/stdlib/gpu/warp/lane_group_min): Reduces a SIMD value to its minimum within a lane group using warp-level operations. * [​`lane_group_reduce`](/mojo/stdlib/gpu/warp/lane_group_reduce): Performs a generic warp-level reduction operation using shuffle operations. * [​`lane_group_sum`](/mojo/stdlib/gpu/warp/lane_group_sum): Computes the sum of values across a group of lanes using warp-level operations. * [​`lane_group_sum_and_broadcast`](/mojo/stdlib/gpu/warp/lane_group_sum_and_broadcast): Computes the sum across a lane group and broadcasts the result to all lanes. * [​`max`](/mojo/stdlib/gpu/warp/max): Computes the maximum value across all lanes in a warp. * [​`min`](/mojo/stdlib/gpu/warp/min): Computes the minimum value across all lanes in a warp. * [​`prefix_sum`](/mojo/stdlib/gpu/warp/prefix_sum): Computes a warp-level prefix sum (scan) operation. * [​`reduce`](/mojo/stdlib/gpu/warp/reduce): Performs a generic warp-wide reduction operation using shuffle operations. * [​`shuffle_down`](/mojo/stdlib/gpu/warp/shuffle_down): Copies values from threads with higher lane IDs in the warp. * [​`shuffle_idx`](/mojo/stdlib/gpu/warp/shuffle_idx): Copies a value from a source lane to other lanes in a warp. * [​`shuffle_up`](/mojo/stdlib/gpu/warp/shuffle_up): Copies values from threads with lower lane IDs in the warp. * [​`shuffle_xor`](/mojo/stdlib/gpu/warp/shuffle_xor): Exchanges values between threads in a warp using a butterfly pattern. * [​`sum`](/mojo/stdlib/gpu/warp/sum): Computes the sum of values across all lanes in a warp. --- ## lane_group_max `lane_group_max[val_type: DType, simd_width: Int, //, num_lanes: Int, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Reduces a SIMD value to its maximum within a lane group using warp-level operations. This function performs a parallel reduction across a group of lanes to find the maximum value. The reduction is done using warp shuffle operations for efficient communication between lanes. The result is stored in all participating lanes. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​num\_lanes (`Int`): The number of threads participating in the reduction. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to find the maximum. **Returns:** A SIMD value where all participating lanes contain the maximum value found across the lane group. Non-participating lanes (lane\_id >= num\_lanes) retain their original values. --- ## lane_group_max_and_broadcast `lane_group_max_and_broadcast[val_type: DType, simd_width: Int, //, num_lanes: Int, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Reduces and broadcasts the maximum value within a lane group using warp-level operations. This function performs a parallel reduction to find the maximum value and broadcasts it to all lanes. The reduction and broadcast are done using warp shuffle operations in a butterfly pattern for efficient all-to-all communication between lanes. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). 
* ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​num\_lanes (`Int`): The number of threads participating in the reduction. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce and broadcast. Each lane contributes its value. **Returns:** A SIMD value where all participating lanes contain the maximum value found across the lane group. Non-participating lanes (lane\_id >= num\_lanes) retain their original values. --- ## lane_group_min `lane_group_min[val_type: DType, simd_width: Int, //, num_lanes: Int, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Reduces a SIMD value to its minimum within a lane group using warp-level operations. This function performs a parallel reduction across a group of lanes to find the minimum value. The reduction is done using warp shuffle operations for efficient communication between lanes. The result is stored in all participating lanes. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​num\_lanes (`Int`): The number of threads participating in the reduction. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to find the minimum. **Returns:** A SIMD value where all participating lanes contain the minimum value found across the lane group. Non-participating lanes (lane\_id >= num\_lanes) retain their original values. --- ## lane_group_reduce `lane_group_reduce[val_type: DType, simd_width: Int, //, shuffle: fn[DType, Int](val: SIMD[$0, $1], offset: SIMD[uint32, 1]) -> SIMD[$0, $1], func: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1], num_lanes: Int, *, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Performs a generic warp-level reduction operation using shuffle operations. This function implements a parallel reduction across threads in a warp using a butterfly pattern. It allows customizing both the shuffle operation and reduction function. Example:

```mojo
from gpu.warp import lane_group_reduce, shuffle_down

# Compute a sum across 16 threads using shuffle down.
@parameter
fn add[type: DType, width: Int](x: SIMD[type, width], y: SIMD[type, width]) -> SIMD[type, width]:
    return x + y

var val = SIMD[DType.float32, 16](42.0)
var result = lane_group_reduce[shuffle_down, add, num_lanes=16](val)
```

**Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​shuffle (`fn[DType, Int](val: SIMD[$0, $1], offset: SIMD[uint32, 1]) -> SIMD[$0, $1]`): A function that performs the warp shuffle operation. Takes a SIMD value and offset and returns the shuffled result. * ​func (`fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]`): A binary function that combines two SIMD values during reduction. This defines the reduction operation (e.g. add, max, min). * ​num\_lanes (`Int`): The number of lanes in a group. The reduction is done within each group. Must be a power of 2. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value. **Returns:** A SIMD value containing the reduction result.
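For the fixed lane-group reductions documented above (`lane_group_max`, `lane_group_min`), usage is simpler than the generic form; a brief sketch, where the 16-lane group size, the scalar width, and the fragment name are illustrative assumptions:

```mojo
from gpu.warp import lane_group_max, lane_group_min

fn minmax_fragment():
    # Each lane contributes one float32 value; the reduction runs within
    # groups of 16 consecutive lanes (stride defaults to 1), and every
    # participating lane receives the group result.
    var x = SIMD[DType.float32, 1](2.0)
    var group_max = lane_group_max[num_lanes=16](x)
    var group_min = lane_group_min[num_lanes=16](x)
    _ = group_max
    _ = group_min
```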
--- ## lane_group_sum `lane_group_sum[val_type: DType, simd_width: Int, //, num_lanes: Int, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Computes the sum of values across a group of lanes using warp-level operations. This function performs a parallel reduction across a group of lanes to compute their sum. The reduction is done using warp shuffle operations for efficient communication between lanes. The result is stored in all participating lanes. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​num\_lanes (`Int`): The number of threads participating in the reduction. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to the sum. **Returns:** A SIMD value where all participating lanes contain the sum found across the lane group. Non-participating lanes (lane\_id >= num\_lanes) retain their original values. --- ## lane_group_sum_and_broadcast `lane_group_sum_and_broadcast[val_type: DType, simd_width: Int, //, num_lanes: Int, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Computes the sum across a lane group and broadcasts the result to all lanes. This function performs a parallel reduction using a butterfly pattern to compute the sum, then broadcasts the result to all participating lanes. The butterfly pattern ensures efficient communication between lanes through warp shuffle operations. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​num\_lanes (`Int`): The number of threads participating in the reduction. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to the sum. **Returns:** A SIMD value where all participating lanes contain the sum found across the lane group. Non-participating lanes (lane\_id >= num\_lanes) retain their original values. --- ## max `max[val_type: DType, simd_width: Int, //](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Computes the maximum value across all lanes in a warp. This is a convenience wrapper around lane\_group\_max that operates on the entire warp. It performs a parallel reduction using warp shuffle operations to find the global maximum value across all lanes in the warp. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to find the maximum. **Returns:** A SIMD value where all lanes contain the maximum value found across the entire warp. --- ## min `min[val_type: DType, simd_width: Int, //](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Computes the minimum value across all lanes in a warp. This is a convenience wrapper around lane\_group\_min that operates on the entire warp. It performs a parallel reduction using warp shuffle operations to find the global minimum value across all lanes in the warp. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). 
* ​simd\_width (`Int`): The number of elements in the SIMD vector. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to find the minimum. **Returns:** A SIMD value where all lanes contain the minimum value found across the entire warp. The minimum value is broadcast to all lanes. --- ## prefix_sum `prefix_sum[type: DType, //, intermediate_type: DType = type, *, output_type: DType = type, exclusive: Bool = False](x: SIMD[type, 1]) -> SIMD[output_type, 1]` Computes a warp-level prefix sum (scan) operation. Performs an inclusive or exclusive prefix sum across threads in a warp using a parallel scan algorithm with warp shuffle operations. This implements an efficient parallel scan with logarithmic complexity. For example, if we have a warp with the following elements: $$ [x_0, x_1, x_2, x_3, x_4] $$ The prefix sum is: $$ [x_0, x_0 + x_1, x_0 + x_1 + x_2, x_0 + x_1 + x_2 + x_3, x_0 + x_1 + x_2 + x_3 + x_4] $$ **Parameters:** * ​type (`DType`): The data type of the input SIMD elements. * ​intermediate\_type (`DType`): Type used for intermediate calculations (defaults to input type). * ​output\_type (`DType`): The desired output data type (defaults to input type). * ​exclusive (`Bool`): If True, performs exclusive scan where each thread receives the sum of all previous threads. If False (default), performs inclusive scan where each thread receives the sum including its own value. **Args:** * ​x (`SIMD[type, 1]`): The SIMD value to include in the prefix sum. **Returns:** A scalar containing the prefix sum at the current thread's position in the warp, cast to the specified output type. --- ## reduce `reduce[val_type: DType, simd_width: Int, //, shuffle: fn[DType, Int](val: SIMD[$0, $1], offset: SIMD[uint32, 1]) -> SIMD[$0, $1], func: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Performs a generic warp-wide reduction operation using shuffle operations. This is a convenience wrapper around lane\_group\_reduce that operates on the entire warp. It allows customizing both the shuffle operation and reduction function. Example:

```mojo
from gpu.warp import reduce, shuffle_down

# Compute a warp-wide sum using shuffle down.
@parameter
fn add[type: DType, width: Int](x: SIMD[type, width], y: SIMD[type, width]) capturing -> SIMD[type, width]:
    return x + y

var val = SIMD[DType.float32, 4](2.0, 4.0, 6.0, 8.0)
var result = reduce[shuffle_down, add](val)
```

**Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​shuffle (`fn[DType, Int](val: SIMD[$0, $1], offset: SIMD[uint32, 1]) -> SIMD[$0, $1]`): A function that performs the warp shuffle operation. Takes a SIMD value and offset and returns the shuffled result. * ​func (`fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]`): A binary function that combines two SIMD values during reduction. This defines the reduction operation (e.g. add, max, min). **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value. **Returns:** A SIMD value containing the reduction result broadcast to all lanes in the warp. --- ## shuffle_down `shuffle_down[type: DType, simd_width: Int, //](val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies values from threads with higher lane IDs in the warp.
Performs a shuffle operation where each thread receives a value from a thread with a higher lane ID, offset by the specified amount. Uses the full warp mask by default. For example, with offset=1: * Thread 0 gets value from thread 1 * Thread 1 gets value from thread 2 * Thread N gets value from thread N+1 * Last N threads get undefined values **Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​val (`SIMD[type, simd_width]`): The SIMD value to be shuffled down the warp. * ​offset (`SIMD[uint32, 1]`): The number of lanes to shift values down by. Must be positive. **Returns:** The SIMD value from the thread offset lanes higher in the warp. Returns undefined values for threads where lane\_id + offset >= WARP\_SIZE. `shuffle_down[type: DType, simd_width: Int, //](mask: UInt, val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies values from threads with higher lane IDs in the warp using a custom mask. Performs a shuffle operation where each thread receives a value from a thread with a higher lane ID, offset by the specified amount. The mask parameter controls which threads participate in the shuffle. For example, with offset=1: * Thread 0 gets value from thread 1 * Thread 1 gets value from thread 2 * Thread N gets value from thread N+1 * Last N threads get undefined values **Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​mask (`UInt`): A bitmask controlling which threads participate in the shuffle. Only threads with their corresponding bit set will exchange values. * ​val (`SIMD[type, simd_width]`): The SIMD value to be shuffled down the warp. * ​offset (`SIMD[uint32, 1]`): The number of lanes to shift values down by. Must be positive. **Returns:** The SIMD value from the thread offset lanes higher in the warp. Returns undefined values for threads where lane\_id + offset >= WARP\_SIZE or where the corresponding mask bit is not set. --- ## shuffle_idx `shuffle_idx[type: DType, simd_width: Int, //](val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies a value from a source lane to other lanes in a warp. Broadcasts a value from a source thread in a warp to all participating threads without using shared memory. This is a convenience wrapper that uses the full warp mask by default. Example:

```mojo
from gpu.warp import shuffle_idx

var val = SIMD[DType.float32, 16](1.0)

# Broadcast value from lane 0 to all lanes
var result = shuffle_idx(val, 0)

# Get value from lane 5
result = shuffle_idx(val, 5)
```

**Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32, half). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​val (`SIMD[type, simd_width]`): The SIMD value to be broadcast from the source lane. * ​offset (`SIMD[uint32, 1]`): The source lane ID to copy the value from. **Returns:** A SIMD vector where all lanes contain the value from the source lane specified by offset. `shuffle_idx[type: DType, simd_width: Int, //](mask: UInt, val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies a value from a source lane to other lanes in a warp with explicit mask control. Broadcasts a value from a source thread in a warp to participating threads specified by the mask.
This provides fine-grained control over which threads participate in the shuffle operation. Example:

```mojo
from gpu.warp import shuffle_idx

# Only broadcast to the first 16 lanes.
var mask = 0xFFFF  # 16 ones
var val = SIMD[DType.float32, 32](1.0)
var result = shuffle_idx(mask, val, 5)
```

**Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32, half). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​mask (`UInt`): A bit mask specifying which lanes participate in the shuffle (1 bit per lane). * ​val (`SIMD[type, simd_width]`): The SIMD value to be broadcast from the source lane. * ​offset (`SIMD[uint32, 1]`): The source lane ID to copy the value from. **Returns:** A SIMD vector where participating lanes (set in mask) contain the value from the source lane specified by offset. Non-participating lanes retain their original values. --- ## shuffle_up `shuffle_up[type: DType, simd_width: Int, //](val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies values from threads with lower lane IDs in the warp. Performs a shuffle operation where each thread receives a value from a thread with a lower lane ID, offset by the specified amount. Uses the full warp mask by default. For example, with offset=1: * Thread N gets value from thread N-1 * Thread 1 gets value from thread 0 * Thread 0 gets undefined value **Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​val (`SIMD[type, simd_width]`): The SIMD value to be shuffled up the warp. * ​offset (`SIMD[uint32, 1]`): The number of lanes to shift values up by. **Returns:** The SIMD value from the thread offset lanes lower in the warp. Returns undefined values for threads where `lane_id - offset < 0`. `shuffle_up[type: DType, simd_width: Int, //](mask: UInt, val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies values from threads with lower lane IDs in the warp. Performs a shuffle operation where each thread receives a value from a thread with a lower lane ID, offset by the specified amount. The operation is performed only for threads specified in the mask. For example, with offset=1: * Thread N gets value from thread N-1 if both threads are in the mask * Thread 1 gets value from thread 0 if both threads are in the mask * Thread 0 gets undefined value * Threads not in the mask get undefined values **Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​mask (`UInt`): The warp mask specifying which threads participate in the shuffle. * ​val (`SIMD[type, simd_width]`): The SIMD value to be shuffled up the warp. * ​offset (`SIMD[uint32, 1]`): The number of lanes to shift values up by. **Returns:** The SIMD value from the thread offset lanes lower in the warp. Returns undefined values for threads where `lane_id - offset < 0`. --- ## shuffle_xor `shuffle_xor[type: DType, simd_width: Int, //](val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Exchanges values between threads in a warp using a butterfly pattern. Performs a butterfly exchange pattern where each thread swaps values with another thread whose lane ID differs by a bitwise XOR with the given offset. This creates a butterfly communication pattern useful for parallel reductions and scans.
**Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​val (`SIMD[type, simd_width]`): The SIMD value to be exchanged with another thread. * ​offset (`SIMD[uint32, 1]`): The lane offset to XOR with the current thread's lane ID to determine the exchange partner. Common values are powers of 2 for butterfly patterns. **Returns:** The SIMD value from the thread at lane (current\_lane XOR offset). `shuffle_xor[type: DType, simd_width: Int, //](mask: UInt, val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Exchanges values between threads in a warp using a butterfly pattern with masking. Performs a butterfly exchange pattern where each thread swaps values with another thread whose lane ID differs by a bitwise XOR with the given offset. The mask parameter allows controlling which threads participate in the exchange. Example:

```mojo
from gpu.warp import shuffle_xor

# Exchange values between odd-numbered lanes 4 lanes apart
# (bit i of the mask controls lane i, so 0xAAAAAAAA selects odd lanes).
var mask = 0xAAAAAAAA
var val = SIMD[DType.float32, 16](42.0)  # Example value
var result = shuffle_xor(mask, val, 4)
```

**Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​mask (`UInt`): A bit mask specifying which threads participate in the exchange. Only threads with their corresponding bit set in the mask will exchange values. * ​val (`SIMD[type, simd_width]`): The SIMD value to be exchanged with another thread. * ​offset (`SIMD[uint32, 1]`): The lane offset to XOR with the current thread's lane ID to determine the exchange partner. Common values are powers of 2 for butterfly patterns. **Returns:** The SIMD value from the thread at lane (current\_lane XOR offset) if both threads are enabled by the mask, otherwise the original value is preserved. --- ## sum `sum[val_type: DType, simd_width: Int, //](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Computes the sum of values across all lanes in a warp. This is a convenience wrapper around lane\_group\_sum\_and\_broadcast that operates on the entire warp. It performs a parallel reduction using warp shuffle operations to find the global sum across all lanes in the warp. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to the sum. **Returns:** A SIMD value where all lanes contain the sum found across the entire warp. The sum is broadcast to all lanes. `sum[dtype: DType, size: Int, //, intermediate_type: DType, *, reduction_method: ReductionMethod, output_type: DType](x: SIMD[dtype, size]) -> SIMD[output_type, 1]` Performs a warp-level reduction to compute the sum of values across threads. This function provides two reduction methods: 1. Warp shuffle: Uses warp shuffle operations to efficiently sum values across threads 2. Tensor core: Leverages tensor cores for high-performance reductions, with type casting The tensor core method will cast the input to the specified intermediate type before reduction to ensure compatibility with tensor core operations. The warp shuffle method requires the output type to match the input type. **Constraints:** * For warp shuffle reduction, output\_type must match the input value type.
* For tensor core reduction, input will be cast to intermediate\_type. **Parameters:** * ​intermediate\_type (`DType`): The data type to cast to when using tensor core reduction. * ​reduction\_method (`ReductionMethod`): `WARP` for warp shuffle or `TENSOR_CORE` for tensor core reduction. * ​output\_type (`DType`): The desired output data type for the reduced value. **Args:** * ​x (`SIMD[dtype, size]`): The SIMD value to reduce across the warp. **Returns:** A scalar containing the sum of the input values across all threads in the warp, cast to the specified output type. --- ## Hashable A trait for types which specify a function to hash their data. This hash function will be used for applications like hash maps, and doesn't need to be cryptographically secure. A good hash function will hash similar or common values to different results, and in particular the *low order bits* of the hash, which are used in smaller dictionaries, should be sensitive to any changes in the data structure. If your type's hash function doesn't meet this criterion it will get poor performance in common hash map implementations.

```mojo
@fieldwise_init
struct Foo(Hashable):
    fn __hash__(self) -> UInt:
        return 4  # chosen by fair random dice roll

var foo = Foo()
print(hash(foo))
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__hash__` `__hash__(self: _Self) -> UInt` Return a 64-bit hash of the type's data. **Returns:** A 64-bit integer hash of this instance's data. --- ## hash `hash[T: Hashable](hashable: T) -> UInt` Hash a Hashable type using its underlying hash implementation. **Parameters:** * ​T (`Hashable`): Any Hashable type. **Args:** * ​hashable (`T`): The input data to hash. **Returns:** A 64-bit integer hash based on the underlying implementation. `hash(bytes: UnsafePointer[SIMD[uint8, 1], alignment=alignment, mut=False, origin=origin], n: Int) -> UInt` Hash a byte array using a SIMD-modified DJBX33A hash algorithm. *This hash function is not suitable for cryptographic purposes.* The algorithm is easy to reverse and can be used to produce deliberate hash collisions. The hash function is designed to have relatively good mixing and statistical properties for use in hash-based data structures. We *do* however initialize a random hash secret which is mixed into the final hash output. This can help prevent denial-of-service attacks on applications which make use of this function for dictionary hashing. As a consequence, hash values are deterministic within an individual runtime instance, i.e. a value will always hash to the same thing, but between runs this value will change based on the hash secret. We take advantage of Mojo's first-class SIMD support to create a SIMD-vectorized hash function, using a simple hash algorithm as a base: * Interpret the input bytes as a SIMD vector, padded with zeros to align to the system SIMD width. * Apply the simple hash function parallelized across SIMD vectors. * Hash the final SIMD vector state to reduce to a single value. Python uses DJBX33A with a hash secret for smaller strings, and then the SipHash algorithm for longer strings. The arguments and tradeoffs are well documented in PEP 456. We should consider this and deeper performance/security tradeoffs as Mojo evolves.
References: * [Wikipedia: Non-cryptographic hash function](https://en.wikipedia.org/wiki/Non-cryptographic_hash_function) * [Python PEP 456](https://peps.python.org/pep-0456/) * [PHP Hash algorithm and collisions](https://www.phpinternalsbook.com/php5/hashtables/hash_algorithm.html)

```mojo
from memory import UnsafePointer
from random import rand

var n = 64
var rand_bytes = UnsafePointer[UInt8].alloc(n)
rand(rand_bytes, n)
print(hash(rand_bytes, n))
rand_bytes.free()
```

**Args:** * ​bytes (`UnsafePointer[SIMD[uint8, 1], alignment=alignment, mut=False, origin=origin]`): The byte array to hash. * ​n (`Int`): The length of the byte array. **Returns:** A 64-bit integer hash. This hash is *not* suitable for cryptographic purposes, but will have good low-bit hash collision statistical properties for common data structures. --- ## hash Implements the `Hashable` trait and `hash()` built-in function. There are a few main tools in this module: * `Hashable` trait for types implementing `__hash__(self) -> UInt` * `hash[T: Hashable](hashable: T) -> UInt` built-in function. * A `hash()` implementation for arbitrary byte strings, `hash(data: UnsafePointer[UInt8], n: Int) -> UInt`, is the workhorse function, which implements efficient hashing via SIMD vectors. See the documentation of this function for more details on the hash implementation. * `hash(SIMD)` and `hash(UInt8)` implementations These are useful helpers to specialize for the general bytes implementation. ## Traits * [​`Hashable`](/mojo/stdlib/hashlib/hash/Hashable): A trait for types which specify a function to hash their data. ## Functions * [​`hash`](/mojo/stdlib/hashlib/hash/hash): Hash a Hashable type using its underlying hash implementation. --- ## hashlib Implements the hashlib package that provides various hash algorithms. ## Modules * [​`hash`](/mojo/stdlib/hashlib/hash/): Implements the `Hashable` trait and `hash()` built-in function. --- ## stdlib ## Packages * [​`algorithm`](/mojo/stdlib/algorithm/): Implements the algorithm package. * [​`base64`](/mojo/stdlib/base64/): Implements the base64 package. * [​`benchmark`](/mojo/stdlib/benchmark/): Implements the benchmark package for runtime benchmarking. * [​`bit`](/mojo/stdlib/bit/): Implements the bit package. * [​`buffer`](/mojo/stdlib/buffer/): Implements the buffer package. * [​`builtin`](/mojo/stdlib/builtin/): Implements the builtin package. * [​`collections`](/mojo/stdlib/collections/): Implements the collections package. * [​`compile`](/mojo/stdlib/compile/): Provides utilities for compiling and inspecting Mojo code at runtime. * [​`complex`](/mojo/stdlib/complex/): Provides types and functions for working with complex numbers. * [​`documentation`](/mojo/stdlib/documentation/): Implements the documentation package. * [​`gpu`](/mojo/stdlib/gpu/): Provides low-level programming constructs for working with GPUs. * [​`hashlib`](/mojo/stdlib/hashlib/): Implements the hashlib package that provides various hash algorithms. * [​`logger`](/mojo/stdlib/logger/): Provides logging functionality with different severity levels. * [​`math`](/mojo/stdlib/math/): Implements the math package. * [​`memory`](/mojo/stdlib/memory/): The memory package provides several pointer types, as well as utility functions for dealing with memory. * [​`os`](/mojo/stdlib/os/): Provides access to operating-system dependent functionality. * [​`pathlib`](/mojo/stdlib/pathlib/): Implements the pathlib package. * [​`prelude`](/mojo/stdlib/prelude/): Implements the prelude package.
This package provides the public entities that are automatically imported into every Mojo program. * [​`pwd`](/mojo/stdlib/pwd/): Provides access to user and group information from the password database. * [​`python`](/mojo/stdlib/python/): Implements the python package. * [​`random`](/mojo/stdlib/random/): Implements the random package. * [​`runtime`](/mojo/stdlib/runtime/): Implements the runtime package. * [​`stat`](/mojo/stdlib/stat/): Implements the stat package. * [​`subprocess`](/mojo/stdlib/subprocess/): Implements the subprocess package. * [​`sys`](/mojo/stdlib/sys/): Implements the sys package. * [​`tempfile`](/mojo/stdlib/tempfile/): Implements the tempfile package. * [​`testing`](/mojo/stdlib/testing/): Implements the testing package. * [​`time`](/mojo/stdlib/time/): Implements the time package. * [​`utils`](/mojo/stdlib/utils/): Implements the utils package. --- ## logger Provides logging functionality with different severity levels. ## Modules * [​`logger`](/mojo/stdlib/logger/logger/): Provides logging functionality with different severity levels. --- ## Level `struct Level` Represents logging severity levels. Defines the available logging levels in ascending order of severity. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `CRITICAL` `alias CRITICAL = Level(50)` A serious error indicating that the program itself may be unable to continue running. ### `DEBUG` `alias DEBUG = Level(10)` Detailed information, typically of interest only when diagnosing problems. ### `ERROR` `alias ERROR = Level(40)` Due to a more serious problem, the software has not been able to perform some function. ### `INFO` `alias INFO = Level(20)` Confirmation that things are working as expected. ### `NOTSET` `alias NOTSET = Level(0)` Lowest level, used when no level is set. ### `WARNING` `alias WARNING = Level(30)` Indication that something unexpected happened, or may happen in the near future. ## Methods ### `__lt__` `__lt__(self, other: Self) -> Bool` Returns True if this level is less than the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is less than the other level, False otherwise. ### `__le__` `__le__(self, other: Self) -> Bool` Returns True if this level is less than or equal to the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is less than or equal to the other level, False otherwise. ### `__eq__` `__eq__(self, other: Self) -> Bool` Returns True if this level equals the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if the levels are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Returns True if this level does not equal the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if the levels are not equal, False otherwise. ### `__gt__` `__gt__(self, other: Self) -> Bool` Returns True if this level is greater than the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is greater than the other level, False otherwise. ### `__ge__` `__ge__(self, other: Self) -> Bool` Returns True if this level is greater than or equal to the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is greater than or equal to the other level, False otherwise.
### `__is__` `__is__(self, other: Self) -> Bool` Returns True if this level is identical to the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is identical to the other level, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Returns True if this level is not identical to the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is not identical to the other level, False otherwise. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the string representation of this level to a writer. **Parameters:** * ​W (`Writer`): The writer type that implements the Writer trait. **Args:** * ​writer (`W`): The writer to write to. ### `__str__` `__str__(self) -> String` Returns the string representation of this level. **Returns:** String: A human-readable string representation of the level (e.g., "DEBUG", "INFO"). ### `__repr__` `__repr__(self) -> String` Returns the detailed string representation of this level. **Returns:** String: A string representation including the type name and level value (e.g., "Level.DEBUG"). --- ## Logger `struct Logger[level: Level = DEFAULT_LEVEL]` A logger that outputs messages at or above a specified severity level. ## Parameters * ​level (`Level`): The minimum severity level for messages to be logged. Defaults to `DEFAULT_LEVEL`, which is read from the LOGGING\_LEVEL environment variable. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, fd: FileDescriptor = FileDescriptor(1))` Initializes a new Logger. **Args:** * ​fd (`FileDescriptor`): The file descriptor to write log messages to (defaults to stdout). ### `debug` `debug[*Ts: Writable](self, *values: *Ts)` Logs a debug message. **Parameters:** * ​\*Ts (`Writable`): The types of values to log. **Args:** * ​\*values (`*Ts`): The values to log. ### `info` `info[*Ts: Writable](self, *values: *Ts)` Logs an info message. **Parameters:** * ​\*Ts (`Writable`): The types of values to log. **Args:** * ​\*values (`*Ts`): The values to log. ### `warning` `warning[*Ts: Writable](self, *values: *Ts)` Logs a warning message. **Parameters:** * ​\*Ts (`Writable`): The types of values to log. **Args:** * ​\*values (`*Ts`): The values to log. ### `error` `error[*Ts: Writable](self, *values: *Ts)` Logs an error message. **Parameters:** * ​\*Ts (`Writable`): The types of values to log. **Args:** * ​\*values (`*Ts`): The values to log. ### `critical` `critical[*Ts: Writable](self, *values: *Ts)` Logs a critical message and aborts execution. **Parameters:** * ​\*Ts (`Writable`): The types of values to log. **Args:** * ​\*values (`*Ts`): The values to log. --- ## logger Provides logging functionality with different severity levels. This module implements a simple logging system with configurable severity levels: `NOTSET`, `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. The logging level can be set via the LOGGING\_LEVEL environment variable. The main components are: * `Level`: An enum-like struct defining the available logging levels * `Logger`: A struct that handles logging messages with different severity levels Example:

```mojo
from logger import Logger

var logger = Logger()  # Uses default level from LOGGING_LEVEL env var
logger.info("Starting process")
logger.debug("Debug information")
logger.error("An error occurred")
```

The logger can be configured to write to different file descriptors (default stdout).
Messages below the configured level will be silently ignored. ## Aliases ### `DEFAULT_LEVEL` `alias DEFAULT_LEVEL = _from_str(env_get_string["LOGGING_LEVEL"]())` The default logging level, read from the LOGGING\_LEVEL environment variable. ## Structs * [​`Level`](/mojo/stdlib/logger/logger/Level): Represents logging severity levels. * [​`Logger`](/mojo/stdlib/logger/logger/Logger): A logger that outputs messages at or above a specified severity level. --- ## constants Defines math utilities. You can import these APIs from the `math` package. For example:

```mojo
from math import pi
```

## Aliases ### `e` `alias e = 2.7182818284590451` Euler's constant e = 2.718281... ### `log2e` `alias log2e = 1.4426950408889634` log2e = log2(e), where e is Euler's constant. ### `pi` `alias pi = 3.1415926535897931` The mathematical constant π = 3.141592... ### `tau` `alias tau = 6.2831853071795862` The mathematical constant τ = 6.283185.... Tau is the ratio of a circle's circumference to its radius (2π). --- ## math Implements the math package. ## Modules * [​`constants`](/mojo/stdlib/math/constants/): Defines math utilities. * [​`math`](/mojo/stdlib/math/math/): Defines math utilities. * [​`polynomial`](/mojo/stdlib/math/polynomial/): Provides two implementations for evaluating polynomials. --- ## CeilDivable The `CeilDivable` trait describes a type that defines a ceil division operation. Types that conform to `CeilDivable` will work with the `math.ceildiv` function. For example:

```mojo
from math import CeilDivable

@fieldwise_init
struct Foo(CeilDivable, Copyable):
    var x: Float64

    fn __ceildiv__(self, denominator: Self) -> Self:
        return Self(self.x // denominator.x)
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__ceildiv__` `__ceildiv__(self: _Self, denominator: _Self) -> _Self` Return the rounded-up result of dividing self by denominator. **Args:** * ​denominator (`_Self`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. --- ## CeilDivableRaising The `CeilDivableRaising` trait describes a type that defines a ceil division operation that can raise. Types that conform to `CeilDivableRaising` will work with the `//` operator as well as the `math.ceildiv` function. For example:

```mojo
from math import CeilDivableRaising

@fieldwise_init
struct Foo(CeilDivableRaising, Copyable):
    var x: Float64

    fn __ceildiv__(self, denominator: Self) raises -> Self:
        return Self(self.x // denominator.x)
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__ceildiv__` `__ceildiv__(self: _Self, denominator: _Self) raises -> _Self` Return the rounded-up result of dividing self by denominator. **Args:** * ​denominator (`_Self`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. --- ## Ceilable The `Ceilable` trait describes a type that defines a ceiling operation. Types that conform to `Ceilable` will work with the builtin `ceil` function. The ceiling operation always returns the same type as the input. For example:

```mojo
from math import Ceilable, ceil

@fieldwise_init
struct Complex(Ceilable, Copyable):
    var re: Float64
    var im: Float64

    fn __ceil__(self) -> Self:
        return Self(ceil(self.re), ceil(self.im))
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__ceil__` `__ceil__(self: _Self) -> _Self` Return the ceiling of this value. **Returns:** The ceiling of this value. --- ## Floorable The `Floorable` trait describes a type that defines a floor operation. Types that conform to `Floorable` will work with the builtin `floor` function.
The floor operation always returns the same type as the input. For example:

```mojo
from math import Floorable, floor

@fieldwise_init
struct Complex(Floorable, Copyable):
    var re: Float64
    var im: Float64

    fn __floor__(self) -> Self:
        return Self(floor(self.re), floor(self.im))
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__floor__` `__floor__(self: _Self) -> _Self` Return the floor of this value. **Returns:** The floor of this value. --- ## Truncable The `Truncable` trait describes a type that defines a truncation operation. Types that conform to `Truncable` will work with the builtin `trunc` function. The truncation operation always returns the same type as the input. For example:

```mojo
from math import Truncable, trunc

@fieldwise_init
struct Complex(Truncable, Copyable):
    var re: Float64
    var im: Float64

    fn __trunc__(self) -> Self:
        return Self(trunc(self.re), trunc(self.im))
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__trunc__` `__trunc__(self: _Self) -> _Self` Return the truncated value. **Returns:** The truncated value. --- ## acos `acos[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `acos` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `acos` of the input. --- ## acosh `acosh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `acosh` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `acosh` of the input. --- ## align_down `align_down(value: Int, alignment: Int) -> Int` Returns the closest multiple of alignment that is less than or equal to value. **Args:** * ​value (`Int`): The value to align. * ​alignment (`Int`): Value to align to. **Returns:** Closest multiple of the alignment that is less than or equal to the input value. In other words, floor(value / alignment) \* alignment. `align_down(value: UInt, alignment: UInt) -> UInt` Returns the closest multiple of alignment that is less than or equal to value. **Args:** * ​value (`UInt`): The value to align. * ​alignment (`UInt`): Value to align to. **Returns:** Closest multiple of the alignment that is less than or equal to the input value. In other words, floor(value / alignment) \* alignment. --- ## align_up `align_up(value: Int, alignment: Int) -> Int` Returns the closest multiple of alignment that is greater than or equal to value. **Args:** * ​value (`Int`): The value to align. * ​alignment (`Int`): Value to align to. **Returns:** Closest multiple of the alignment that is greater than or equal to the input value. In other words, ceiling(value / alignment) \* alignment. `align_up(value: UInt, alignment: UInt) -> UInt` Returns the closest multiple of alignment that is greater than or equal to value. **Args:** * ​value (`UInt`): The value to align. * ​alignment (`UInt`): Value to align to. **Returns:** Closest multiple of the alignment that is greater than or equal to the input value. In other words, ceiling(value / alignment) \* alignment.
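A short worked example of the two alignment helpers (the specific values are illustrative):

```mojo
from math import align_down, align_up

fn main():
    # floor(37 / 8) * 8 = 32, and ceiling(37 / 8) * 8 = 40.
    print(align_down(37, 8))  # 32
    print(align_up(37, 8))    # 40
```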
--- ## asin `asin[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `asin` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `asin` of the input. --- ## asinh `asinh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `asinh` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `asinh` of the input. --- ## atan `atan[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `atan` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `atan` of the input. --- ## atan2 `atan2[dtype: DType, width: Int, //](y: SIMD[dtype, width], x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `atan2` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​y (`SIMD[dtype, width]`): The first input argument. * ​x (`SIMD[dtype, width]`): The second input argument. **Returns:** The `atan2` of the inputs. --- ## atanh `atanh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `atanh` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `atanh` of the input. --- ## cbrt `cbrt[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `cbrt` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `cbrt` of the input. --- ## ceil `ceil[T: Ceilable, //](value: T) -> T` Get the ceiling value of the given object. **Parameters:** * ​T (`Ceilable`): The type conforming to `Ceilable`. **Args:** * ​value (`T`): The object to get the ceiling value of. **Returns:** The ceiling value of the object. --- ## ceildiv `ceildiv[T: CeilDivable, //](numerator: T, denominator: T) -> T` Return the rounded-up result of dividing numerator by denominator. **Parameters:** * ​T (`CeilDivable`): A type that supports ceil division. **Args:** * ​numerator (`T`): The numerator. * ​denominator (`T`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. `ceildiv[T: CeilDivableRaising, //](numerator: T, denominator: T) -> T` Return the rounded-up result of dividing numerator by denominator, potentially raising.
**Parameters:** * ​T (`CeilDivableRaising`): A type that supports ceil division. **Args:** * ​numerator (`T`): The numerator. * ​denominator (`T`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. `ceildiv(numerator: IntLiteral[value], denominator: IntLiteral[value]) -> IntLiteral[(0 - (value // (0 - value)))]` Return the rounded-up result of dividing numerator by denominator. **Args:** * ​numerator (`IntLiteral[value]`): The numerator. * ​denominator (`IntLiteral[value]`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. --- ## clamp `clamp(val: Int, lower_bound: Int, upper_bound: Int) -> Int` Clamps the integer value to be in a certain range. **Args:** * ​val (`Int`): The value to clamp. * ​lower\_bound (`Int`): Minimum of the range to clamp to. * ​upper\_bound (`Int`): Maximum of the range to clamp to. **Returns:** An integer clamped to be within lower\_bound and upper\_bound. `clamp(val: UInt, lower_bound: UInt, upper_bound: UInt) -> UInt` Clamps the integer value to be in a certain range. **Args:** * ​val (`UInt`): The value to clamp. * ​lower\_bound (`UInt`): Minimum of the range to clamp to. * ​upper\_bound (`UInt`): Maximum of the range to clamp to. **Returns:** An integer clamped to be within lower\_bound and upper\_bound. `clamp[dtype: DType, width: Int, //](val: SIMD[dtype, width], lower_bound: SIMD[dtype, width], upper_bound: SIMD[dtype, width]) -> SIMD[dtype, width]` Clamps the values in a SIMD vector to be in a certain range. Clamp cuts values in the input SIMD vector off at the upper bound and lower bound values. For example, SIMD vector `[0, 1, 2, 3]` clamped to a lower bound of 1 and an upper bound of 2 would return `[1, 1, 2, 2]`. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​val (`SIMD[dtype, width]`): The value to clamp. * ​lower\_bound (`SIMD[dtype, width]`): Minimum of the range to clamp to. * ​upper\_bound (`SIMD[dtype, width]`): Maximum of the range to clamp to. **Returns:** A SIMD vector containing val clamped to be within lower\_bound and upper\_bound. --- ## copysign `copysign[dtype: DType, width: Int, //](magnitude: SIMD[dtype, width], sign: SIMD[dtype, width]) -> SIMD[dtype, width]` Returns a value with the magnitude of the first operand and the sign of the second operand. **Constraints:** The type of the input must be numeric. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​magnitude (`SIMD[dtype, width]`): The magnitude to use. * ​sign (`SIMD[dtype, width]`): The sign to copy. **Returns:** Copies the sign from sign to magnitude. --- ## cos `cos[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `cos` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `cos` of the input. --- ## cosh `cosh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `cosh` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector.
* ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `cosh` of the input. --- ## erf `erf[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs the elementwise Erf on a SIMD vector. **Constraints:** The type must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector to perform elementwise Erf on. **Returns:** The result of the elementwise Erf operation. --- ## erfc `erfc[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `erfc` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `erfc` of the input. --- ## exp `exp[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Calculates elementwise exponential of the input vector. Given an input vector $X$ and an output vector $Y$, sets $Y_i = e^{X_i}$ for each position $i$ in the input vector (where $e$ is the mathematical constant $e$). **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input SIMD vector. **Returns:** A SIMD vector containing $e$ raised to the power $X_i$ where $X_i$ is an element in the input SIMD vector. `exp[T: _Expable](x: T) -> T` Computes the exponential of the input value. **Parameters:** * ​T (`_Expable`): The type of the input value. **Args:** * ​x (`T`): The input value. **Returns:** The exponential of the input value. --- ## exp2 `exp2[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes elementwise 2 raised to the power of n, where n is an element of the input SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector to perform exp2 on. **Returns:** Vector containing $2^n$ computed elementwise, where n is an element in the input SIMD vector. --- ## expm1 `expm1[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `expm1` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `expm1` of the input. --- ## factorial `factorial(n: Int) -> Int` Computes the factorial of the integer. **Args:** * ​n (`Int`): The input value. Must be non-negative. **Returns:** The factorial of the input. Results are undefined for negative inputs. --- ## floor `floor[T: Floorable, //](value: T) -> T` Get the floor value of the given object. **Parameters:** * ​T (`Floorable`): The type conforming to `Floorable`. **Args:** * ​value (`T`): The object to get the floor value of. **Returns:** The floor value of the object. 
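As a quick illustration of the rounding helpers documented above, the following sketch applies `floor`, `ceil`, and `ceildiv` to ordinary values (assuming, as the trait examples earlier suggest, that `Float64` conforms to `Ceilable` and `Floorable`, and that `Int` conforms to `CeilDivable`):

```mojo
from math import ceil, ceildiv, floor

def main():
    var x: Float64 = 2.5
    print(floor(x))       # 2.0
    print(ceil(x))        # 3.0
    # Integer ceiling division: ceil(7 / 2) == 4.
    print(ceildiv(7, 2))  # 4
```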
--- ## fma `fma(a: Int, b: Int, c: Int) -> Int` Performs `fma` (fused multiply-add) on the inputs. The result is `(a * b) + c`. **Args:** * ​a (`Int`): The first input. * ​b (`Int`): The second input. * ​c (`Int`): The third input. **Returns:** `(a * b) + c`. `fma(a: UInt, b: UInt, c: UInt) -> UInt` Performs `fma` (fused multiply-add) on the inputs. The result is `(a * b) + c`. **Args:** * ​a (`UInt`): The first input. * ​b (`UInt`): The second input. * ​c (`UInt`): The third input. **Returns:** `(a * b) + c`. `fma[dtype: DType, width: Int, //](a: SIMD[dtype, width], b: SIMD[dtype, width], c: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise `fma` (fused multiply-add) on the inputs. Each element in the result SIMD vector is $(A_i * B_i) + C_i$, where $A_i$, $B_i$ and $C_i$ are elements at index $i$ in a, b, and c respectively. **Parameters:** * ​dtype (`DType`): The `dtype` of the input SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​a (`SIMD[dtype, width]`): The first vector of inputs. * ​b (`SIMD[dtype, width]`): The second vector of inputs. * ​c (`SIMD[dtype, width]`): The third vector of inputs. **Returns:** Elementwise `fma` of a, b and c. --- ## frexp `frexp[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> StaticTuple[SIMD[dtype, width], 2]` Breaks floating point values into a fractional part and an exponent part. This follows C and Python in increasing the exponent by 1 and normalizing the fraction from 0.5 to 1.0 instead of 1.0 to 2.0. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input values. **Returns:** A tuple of two SIMD vectors containing the fractional and exponent parts of the input floating point values. --- ## gamma `gamma[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the Gamma of the input. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The Gamma function evaluated at the input. --- ## gcd `gcd(m: Int, n: Int, /) -> Int` Compute the greatest common divisor of two integers. **Args:** * ​m (`Int`): The first integer. * ​n (`Int`): The second integer. **Returns:** The greatest common divisor of the two integers. `gcd(s: Span[Int, origin], /) -> Int` Computes the greatest common divisor of a span of integers. **Args:** * ​s (`Span[Int, origin]`): A span containing a collection of integers. **Returns:** The greatest common divisor of all the integers in the span. `gcd(l: List[Int, hint_trivial_type], /) -> Int` Computes the greatest common divisor of a list of integers. **Args:** * ​l (`List[Int, hint_trivial_type]`): A list containing a collection of integers. **Returns:** The greatest common divisor of all the integers in the list. `gcd(*values: Int) -> Int` Computes the greatest common divisor of a variadic number of integers. **Args:** * ​\*values (`Int`): A variadic list of integers. **Returns:** The greatest common divisor of the given integers. --- ## hypot `hypot[dtype: DType, width: Int, //](arg0: SIMD[dtype, width], arg1: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `hypot` of the inputs.
**Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​arg0 (`SIMD[dtype, width]`): The first input argument. * ​arg1 (`SIMD[dtype, width]`): The second input argument. **Returns:** The `hypot` of the inputs. --- ## math Defines math utilities. You can import these APIs from the `math` package. For example: ```mojo from math import floor ``` ## Traits * [​`Ceilable`](/mojo/stdlib/math/math/Ceilable): The `Ceilable` trait describes a type that defines a ceiling operation. * [​`CeilDivable`](/mojo/stdlib/math/math/CeilDivable): The `CeilDivable` trait describes a type that defines a ceil division operation. * [​`CeilDivableRaising`](/mojo/stdlib/math/math/CeilDivableRaising): The `CeilDivableRaising` trait describes a type that defines a ceil division operation that can raise. * [​`Floorable`](/mojo/stdlib/math/math/Floorable): The `Floorable` trait describes a type that defines a floor operation. * [​`Truncable`](/mojo/stdlib/math/math/Truncable): The `Truncable` trait describes a type that defines a truncation operation. ## Functions * [​`acos`](/mojo/stdlib/math/math/acos): Computes the `acos` of the inputs. * [​`acosh`](/mojo/stdlib/math/math/acosh): Computes the `acosh` of the inputs. * [​`align_down`](/mojo/stdlib/math/math/align_down): Returns the closest multiple of alignment that is less than or equal to value. * [​`align_up`](/mojo/stdlib/math/math/align_up): Returns the closest multiple of alignment that is greater than or equal to value. * [​`asin`](/mojo/stdlib/math/math/asin): Computes the `asin` of the inputs. * [​`asinh`](/mojo/stdlib/math/math/asinh): Computes the `asinh` of the inputs. * [​`atan`](/mojo/stdlib/math/math/atan): Computes the `atan` of the inputs. * [​`atan2`](/mojo/stdlib/math/math/atan2): Computes the `atan2` of the inputs. * [​`atanh`](/mojo/stdlib/math/math/atanh): Computes the `atanh` of the inputs. * [​`cbrt`](/mojo/stdlib/math/math/cbrt): Computes the `cbrt` of the inputs. * [​`ceil`](/mojo/stdlib/math/math/ceil): Get the ceiling value of the given object. * [​`ceildiv`](/mojo/stdlib/math/math/ceildiv): Return the rounded-up result of dividing numerator by denominator. * [​`clamp`](/mojo/stdlib/math/math/clamp): Clamps the integer value to be in a certain range. * [​`copysign`](/mojo/stdlib/math/math/copysign): Returns a value with the magnitude of the first operand and the sign of the second operand. * [​`cos`](/mojo/stdlib/math/math/cos): Computes the `cos` of the inputs. * [​`cosh`](/mojo/stdlib/math/math/cosh): Computes the `cosh` of the inputs. * [​`erf`](/mojo/stdlib/math/math/erf): Performs the elementwise Erf on a SIMD vector. * [​`erfc`](/mojo/stdlib/math/math/erfc): Computes the `erfc` of the inputs. * [​`exp`](/mojo/stdlib/math/math/exp): Calculates elementwise exponential of the input vector. * [​`exp2`](/mojo/stdlib/math/math/exp2): Computes elementwise 2 raised to the power of n, where n is an element of the input SIMD vector. * [​`expm1`](/mojo/stdlib/math/math/expm1): Computes the `expm1` of the inputs. * [​`factorial`](/mojo/stdlib/math/math/factorial): Computes the factorial of the integer. * [​`floor`](/mojo/stdlib/math/math/floor): Get the floor value of the given object. * [​`fma`](/mojo/stdlib/math/math/fma): Performs `fma` (fused multiply-add) on the inputs.
* [​`frexp`](/mojo/stdlib/math/math/frexp): Breaks floating point values into a fractional part and an exponent part. This follows C and Python in increasing the exponent by 1 and normalizing the fraction from 0.5 to 1.0 instead of 1.0 to 2.0. * [​`gamma`](/mojo/stdlib/math/math/gamma): Computes the Gamma of the input. * [​`gcd`](/mojo/stdlib/math/math/gcd): Compute the greatest common divisor of two integers. * [​`hypot`](/mojo/stdlib/math/math/hypot): Computes the `hypot` of the inputs. * [​`iota`](/mojo/stdlib/math/math/iota): Creates a SIMD vector containing an increasing sequence, starting from offset. * [​`isclose`](/mojo/stdlib/math/math/isclose): Returns a boolean SIMD vector indicating which element pairs of `a` and `b` are equal within a given tolerance. * [​`isqrt`](/mojo/stdlib/math/math/isqrt): Performs elementwise reciprocal square root on a SIMD vector. * [​`j0`](/mojo/stdlib/math/math/j0): Computes the Bessel function of the first kind of order 0 for each input value. * [​`j1`](/mojo/stdlib/math/math/j1): Computes the Bessel function of the first kind of order 1 for each input value. * [​`lcm`](/mojo/stdlib/math/math/lcm): Computes the least common multiple of two integers. * [​`ldexp`](/mojo/stdlib/math/math/ldexp): Computes elementwise ldexp function. * [​`lgamma`](/mojo/stdlib/math/math/lgamma): Computes the `lgamma` of the inputs. * [​`log`](/mojo/stdlib/math/math/log): Performs elementwise natural log (base E) of a SIMD vector. * [​`log10`](/mojo/stdlib/math/math/log10): Computes the `log10` of the inputs. * [​`log1p`](/mojo/stdlib/math/math/log1p): Computes the `log1p` of the inputs. * [​`log2`](/mojo/stdlib/math/math/log2): Performs elementwise log (base 2) of a SIMD vector. * [​`logb`](/mojo/stdlib/math/math/logb): Computes the `logb` of the inputs. * [​`modf`](/mojo/stdlib/math/math/modf): Computes the integral and fractional part of the value. * [​`recip`](/mojo/stdlib/math/math/recip): Performs elementwise reciprocal on a SIMD vector. * [​`remainder`](/mojo/stdlib/math/math/remainder): Computes the `remainder` of the inputs. * [​`scalb`](/mojo/stdlib/math/math/scalb): Computes the `scalb` of the inputs. * [​`sin`](/mojo/stdlib/math/math/sin): Computes the `sin` of the inputs. * [​`sinh`](/mojo/stdlib/math/math/sinh): Computes the `sinh` of the inputs. * [​`sqrt`](/mojo/stdlib/math/math/sqrt): Performs square root on an integer. * [​`tan`](/mojo/stdlib/math/math/tan): Computes the `tan` of the inputs. * [​`tanh`](/mojo/stdlib/math/math/tanh): Performs elementwise evaluation of the tanh function. * [​`trunc`](/mojo/stdlib/math/math/trunc): Get the truncated value of the given object. * [​`ulp`](/mojo/stdlib/math/math/ulp): Computes the ULP (units of last place, also known as units of least precision) of the number. * [​`y0`](/mojo/stdlib/math/math/y0): Computes the Bessel function of the second kind of order 0 for each input value. * [​`y1`](/mojo/stdlib/math/math/y1): Computes the Bessel function of the second kind of order 1 for each input value. --- ## iota `iota[dtype: DType, width: Int](offset: SIMD[dtype, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> SIMD[dtype, width]` Creates a SIMD vector containing an increasing sequence, starting from offset. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​offset (`SIMD[dtype, 1]`): The value to start the sequence at. Default is zero. **Returns:** An increasing sequence of values, starting from offset.
`iota[dtype: DType, //](buff: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], len: Int, offset: Int = 0)` Fill the buffer with numbers ranging from offset to offset + len - 1, spaced by 1. The function doesn't return anything; the buffer is updated in place. **Parameters:** * ​dtype (`DType`): DType of the underlying data. **Args:** * ​buff (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The buffer to fill. * ​len (`Int`): The length of the buffer to fill. * ​offset (`Int`): The value to fill at index 0. `iota[dtype: DType, //](mut v: List[SIMD[dtype, 1], hint_trivial_type], offset: Int = 0)` Fill a list with consecutive numbers starting from the specified offset. **Parameters:** * ​dtype (`DType`): DType of the underlying data. **Args:** * ​v (`List[SIMD[dtype, 1], hint_trivial_type]`): The list to fill with numbers. * ​offset (`Int`): The starting value to fill at index 0. `iota(mut v: List[Int, hint_trivial_type], offset: Int = 0)` Fill a list with consecutive numbers starting from the specified offset. **Args:** * ​v (`List[Int, hint_trivial_type]`): The list to fill with numbers. * ​offset (`Int`): The starting value to fill at index 0. --- ## isclose `isclose[dtype: DType, width: Int, *, symmetrical: Bool = True](a: SIMD[dtype, width], b: SIMD[dtype, width], *, atol: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1.0E-8), rtol: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1.0000000000000001E-5), equal_nan: Bool = False) -> SIMD[bool, width]` Returns a boolean SIMD vector indicating which element pairs of `a` and `b` are equal within a given tolerance. For floating-point dtypes, the following criteria apply: * Symmetric (Python `math.isclose` style), when `symmetrical` is true: ``` |a - b| ≤ max(atol, rtol * max(|a|, |b|)) ``` * Asymmetric (NumPy style), when `symmetrical` is false: ``` |a - b| ≤ atol + rtol * |b| ``` NaN values are considered equal only if `equal_nan` is true. **Parameters:** * ​dtype (`DType`): Element type of the input and output vectors. * ​width (`Int`): Number of lanes in each SIMD vector. * ​symmetrical (`Bool`): If true, use the symmetric comparison formula (default: true). **Args:** * ​a (`SIMD[dtype, width]`): First input vector. * ​b (`SIMD[dtype, width]`): Second input vector. * ​atol (`SIMD[float64, 1]`): Absolute tolerance. * ​rtol (`SIMD[float64, 1]`): Relative tolerance. * ​equal\_nan (`Bool`): If true, treat NaNs as equal (default: false). **Returns:** A boolean vector that is true where `a` and `b` are equal within the given tolerance. --- ## isqrt `isqrt[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise reciprocal square root on a SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector to perform reciprocal square root on. **Returns:** The elementwise reciprocal square root of x. --- ## j0 `j0[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the Bessel function of the first kind of order 0 for each input value. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input vector.
**Returns:** A vector containing the computed value for each value in the input. --- ## j1 `j1[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the Bessel function of the first kind of order 1 for each input value. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input vector. **Returns:** A vector containing the computed value for each value in the input. --- ## lcm `lcm(m: Int, n: Int, /) -> Int` Computes the least common multiple of two integers. **Args:** * ​m (`Int`): The first integer. * ​n (`Int`): The second integer. **Returns:** The least common multiple of the two integers. `lcm(s: Span[Int, origin], /) -> Int` Computes the least common multiple of a span of integers. **Args:** * ​s (`Span[Int, origin]`): A span of integers. **Returns:** The least common multiple of all the integers in the span. `lcm(l: List[Int, hint_trivial_type], /) -> Int` Computes the least common multiple of a list of integers. **Args:** * ​l (`List[Int, hint_trivial_type]`): A list of integers. **Returns:** The least common multiple of all the integers in the list. `lcm(*values: Int) -> Int` Computes the least common multiple of a variadic list of integers. **Args:** * ​\*values (`Int`): A variadic list of integers. **Returns:** The least common multiple of the given integers. --- ## ldexp `ldexp[dtype: DType, width: Int, //](x: SIMD[dtype, width], exp: SIMD[int32, width]) -> SIMD[dtype, width]` Computes the elementwise ldexp function. The ldexp function multiplies a floating point value x by the number 2 raised to the exp power. That is, $ldexp(x, exp)$ calculates the value of $x * 2^{exp}$, and is used within the $erf$ function. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector of floating point values. * ​exp (`SIMD[int32, width]`): SIMD vector containing the exponents. **Returns:** Vector containing elementwise result of ldexp on x and exp. --- ## lgamma `lgamma[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `lgamma` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `lgamma` of the input. --- ## log `log[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise natural log (base E) of a SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): Vector to perform logarithm operation on. **Returns:** Vector containing result of performing natural log base E on x. --- ## log10 `log10[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `log10` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `log10` of the input.
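The logarithm and exponential functions above all share the same elementwise SIMD shape; a small sketch of how they compose (the values and vector width are arbitrary, chosen only for illustration):

```mojo
from math import exp, log, log2, log10

def main():
    var v = SIMD[DType.float32, 4](1.0, 2.0, 4.0, 10.0)
    # log is the inverse of exp, up to floating-point error.
    print(log(exp(v)))  # approximately [1.0, 2.0, 4.0, 10.0]
    print(log2(v))      # [0.0, 1.0, 2.0, ~3.3219]
    print(log10(v))     # [0.0, ~0.3010, ~0.6021, 1.0]
```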
--- ## log1p `log1p[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `log1p` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `log1p` of the input. --- ## log2 `log2[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise log (base 2) of a SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): Vector to perform logarithm operation on. **Returns:** Vector containing result of performing log base 2 on x. --- ## logb `logb[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `logb` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `logb` of the input. --- ## modf `modf[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> Tuple[SIMD[dtype, width], SIMD[dtype, width]]` Computes the integral and fractional part of the value. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input value. **Returns:** A tuple containing the integral and fractional part of the value. --- ## recip `recip[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise reciprocal on a SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector to perform reciprocal on. **Returns:** The elementwise reciprocal of x. --- ## remainder `remainder[dtype: DType, width: Int, //](x: SIMD[dtype, width], y: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `remainder` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The first input argument. * ​y (`SIMD[dtype, width]`): The second input argument. **Returns:** The `remainder` of the inputs. --- ## scalb `scalb[dtype: DType, width: Int, //](arg0: SIMD[dtype, width], arg1: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `scalb` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​arg0 (`SIMD[dtype, width]`): The first input argument. * ​arg1 (`SIMD[dtype, width]`): The second input argument. **Returns:** The `scalb` of the inputs. --- ## sin `sin[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `sin` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. 
* ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `sin` of the input. --- ## sinh `sinh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `sinh` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `sinh` of the input. --- ## sqrt `sqrt(x: Int) -> Int` Performs square root on an integer. **Args:** * ​x (`Int`): The integer value to perform square root on. **Returns:** The square root of x. `sqrt[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise square root on the elements of a SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector to perform square root on. **Returns:** The elementwise square root of x. --- ## tan `tan[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `tan` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `tan` of the input. --- ## tanh `tanh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise evaluation of the tanh function. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The vector to perform the elementwise tanh on. **Returns:** The result of the elementwise tanh operation. --- ## trunc `trunc[T: Truncable, //](value: T) -> T` Get the truncated value of the given object. **Parameters:** * ​T (`Truncable`): The type conforming to `Truncable`. **Args:** * ​value (`T`): The object to get the truncated value of. **Returns:** The truncated value of the object. --- ## ulp `ulp[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the ULP (units of last place, also known as units of least precision) of the number. **Constraints:** The element type of the input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector input. **Returns:** The ULP of x. --- ## y0 `y0[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the Bessel function of the second kind of order 0 for each input value. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input vector. **Returns:** A vector containing the computed value for each value in the input. --- ## y1 `y1[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the Bessel function of the second kind of order 1 for each input value.
**Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input vector. **Returns:** A vector containing the computed value for each value in the input. --- ## polynomial Provides two implementations for evaluating polynomials. You can import these APIs from the `math` package. For example: ```mojo from math.polynomial import polynomial_evaluate ``` ## Functions * [​`polynomial_evaluate`](/mojo/stdlib/math/polynomial/polynomial_evaluate): Evaluates the polynomial. --- ## polynomial_evaluate `polynomial_evaluate[: Bool, dtype: DType, width: Int, //, coefficients: List[SIMD[dtype, 1], $0]](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Evaluates the polynomial. **Parameters:** * ​dtype (`DType`): The dtype of the value. * ​width (`Int`): The width of the computed value. * ​coefficients (`List[SIMD[dtype, 1], $0]`): The coefficients. **Args:** * ​x (`SIMD[dtype, width]`): The value to compute the polynomial with. **Returns:** The polynomial evaluation results using the specified value and the constant coefficients. --- ## ArcPointer `@register_passable` `struct ArcPointer[T: Movable]` Atomic reference-counted pointer. This smart pointer owns an instance of `T` indirectly managed on the heap. This pointer is copyable, including across threads, maintaining a reference count to the underlying data. When you initialize an `ArcPointer` with a value, it allocates memory and moves the value into the allocated memory. Copying an instance of an `ArcPointer` increments the reference count. Destroying an instance decrements the reference count. When the reference count reaches zero, `ArcPointer` destroys the value and frees its memory. This pointer itself is thread-safe using atomic accesses to reference count the underlying data, but references returned to the underlying data are not thread-safe. Subscripting an `ArcPointer` (`ptr[]`) returns a mutable reference to the stored value. This is the only safe way to access the stored value. Other methods, such as using the `unsafe_ptr()` method to retrieve an unsafe pointer to the stored value, or accessing the private fields of an `ArcPointer`, are unsafe and may result in memory errors. For a comparison with other pointer types, see [Intro to pointers](/mojo/manual/pointers/) in the Mojo Manual. Examples: ```mojo from memory import ArcPointer var p = ArcPointer(4) var p2 = p p2[] = 3 print(3 == p[]) ``` ## Parameters * ​T (`Movable`): The type of the stored value. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Identifiable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(owned value: T) -> Self` Construct a new thread-safe, reference-counted smart pointer, and move the value into heap memory managed by the new pointer. **Args:** * ​value (`T`): The value to manage. ### `__copyinit__` `__copyinit__(existing: Self) -> Self` Copy an existing reference. Increment the refcount to the object. **Args:** * ​existing (`Self`): The existing reference. ### `__del__` `__del__(owned self)` Delete the smart pointer. Decrement the reference count for the stored value. If there are no more references, delete the object and free its memory. ### `__getitem__` `__getitem__[self_life: ImmutableOrigin](ref [self_life] self) -> ref [self_life] T` Returns a mutable reference to the managed value.
**Parameters:** * ​self\_life (`ImmutableOrigin`): The origin of self. **Returns:** A reference to the managed value. ### `__is__` `__is__(self, rhs: Self) -> Bool` Returns True if the two `ArcPointer` instances point at the same object. **Args:** * ​rhs (`Self`): The other `ArcPointer`. **Returns:** True if the two `ArcPointers` instances point at the same object and False otherwise. ### `__isnot__` `__isnot__(self, rhs: Self) -> Bool` Returns True if the two `ArcPointer` instances point at different objects. **Args:** * ​rhs (`Self`): The other `ArcPointer`. **Returns:** True if the two `ArcPointer` instances point at different objects and False otherwise. ### `copy` `copy(self) -> Self` Copy the object. **Returns:** A copy of the value. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[T]` Retrieves a pointer to the underlying memory. **Returns:** The `UnsafePointer` to the pointee. ### `count` `count(self) -> SIMD[uint64, 1]` Count the number of current references. **Returns:** The current number of references to the pointee. --- ## arc Reference-counted smart pointers. You can import these APIs from the `memory` package. For example: ```mojo from memory import ArcPointer ``` ## Structs * [​`ArcPointer`](/mojo/stdlib/memory/arc/ArcPointer): Atomic reference-counted pointer. --- ## memory The memory package provides several pointer types, as well as utility functions for dealing with memory. ## Modules * [​`arc`](/mojo/stdlib/memory/arc/): Reference-counted smart pointers. * [​`maybe_uninitialized`](/mojo/stdlib/memory/maybe_uninitialized/): * [​`memory`](/mojo/stdlib/memory/memory/): Defines functions for memory manipulations. * [​`owned_pointer`](/mojo/stdlib/memory/owned_pointer/): Implements `OwnedPointer`, a safe, single-ownership smart pointer. * [​`pointer`](/mojo/stdlib/memory/pointer/): Implements the Pointer type. * [​`span`](/mojo/stdlib/memory/span/): Implements the `Span` type. * [​`unsafe`](/mojo/stdlib/memory/unsafe/): Provides utility functions for unsafe manipulation of SIMD values. * [​`unsafe_pointer`](/mojo/stdlib/memory/unsafe_pointer/): Implement a generic unsafe pointer type. --- ## UnsafeMaybeUninitialized `struct UnsafeMaybeUninitialized[ElementType: AnyType]` A memory location that may or may not be initialized. Note that the destructor is a no-op. If the memory was initialized, the caller is responsible for calling `assume_initialized_destroy` before the memory is deallocated. Every method in this struct is unsafe and the caller must know at all times if the memory is initialized or not. Calling a method that assumes the memory is initialized when it is not will result in undefined behavior. ## Parameters * ​ElementType (`AnyType`): The type of the element to store. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `type` `alias type = array<1, ElementType>` ## Methods ### `__init__` `__init__(out self)` The memory is now considered uninitialized. `__init__[MovableType: Movable](out self: UnsafeMaybeUninitialized[MovableType], owned value: MovableType)` The memory is now considered initialized. **Parameters:** * ​MovableType (`Movable`): The type of the element to store. **Args:** * ​value (`MovableType`): The value to initialize the memory with. ### `__copyinit__` `__copyinit__(out self, other: Self)` Copy another object. This method should never be called as implicit copy should not be done on memory that may be uninitialized. Trying to call this method will abort.
If you wish to perform a copy, you should manually call the method `copy_from` instead. **Args:** * ​other (`Self`): The object to copy. ### `__moveinit__` `__moveinit__(out self, owned other: Self)` Move another object. This method should never be called as implicit moves should not be done on memory that may be uninitialized. Trying to call this method will abort. If you wish to perform a move, you should manually call the method `move_from` instead. **Args:** * ​other (`Self`): The object to move. ### `__del__` `__del__(owned self)` This is a no-op. Calling this method assumes that the memory is uninitialized. If the memory was initialized, the caller should use `assume_initialized_destroy` before. ### `copy_from` `copy_from[CopyableType: ExplicitlyCopyable](mut self: UnsafeMaybeUninitialized[CopyableType], other: UnsafeMaybeUninitialized[CopyableType])` Copy another object. This function assumes that the current memory is uninitialized and the other object is initialized memory. **Parameters:** * ​CopyableType (`ExplicitlyCopyable`): The type object to copy. **Args:** * ​other (`UnsafeMaybeUninitialized[CopyableType]`): The object to copy. `copy_from[CopyableType: ExplicitlyCopyable](mut self: UnsafeMaybeUninitialized[CopyableType], other: CopyableType)` Copy another object. This function assumes that the current memory is uninitialized. **Parameters:** * ​CopyableType (`ExplicitlyCopyable`): The type object to copy. **Args:** * ​other (`CopyableType`): The object to copy. ### `move_from` `move_from[MovableType: Movable](mut self: UnsafeMaybeUninitialized[MovableType], mut other: UnsafeMaybeUninitialized[MovableType])` Move another object. This function assumes that the current memory is uninitialized and the other object is initialized memory. After the function is called, the other object is considered uninitialized. **Parameters:** * ​MovableType (`Movable`): The type object to move. **Args:** * ​other (`UnsafeMaybeUninitialized[MovableType]`): The object to move. `move_from[MovableType: Movable](mut self: UnsafeMaybeUninitialized[MovableType], other: UnsafePointer[MovableType])` Move another object. This function assumes that the current memory is uninitialized and the other object is initialized memory. After the function is called, the `other` object is considered uninitialized. **Parameters:** * ​MovableType (`Movable`): The type object to move. **Args:** * ​other (`UnsafePointer[MovableType]`): The pointer to the object to move. ### `write` `write[MovableType: Movable](mut self: UnsafeMaybeUninitialized[MovableType], owned value: MovableType)` Write a value into an uninitialized memory location. Calling this method assumes that the memory is uninitialized. **Parameters:** * ​MovableType (`Movable`): The type of the element to store. **Args:** * ​value (`MovableType`): The value to write. ### `assume_initialized` `assume_initialized(ref self) -> ref [self] ElementType` Returns a reference to the internal value. Calling this method assumes that the memory is initialized. **Returns:** A reference to the internal value. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[ElementType]` Get a pointer to the underlying element. Note that this method makes no assumptions about whether the memory is initialized. It can always be called. **Returns:** A pointer to the underlying element. ### `assume_initialized_destroy` `assume_initialized_destroy(mut self)` Runs the destructor of the internal value. Calling this method assumes that the memory is initialized.
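Because every method on `UnsafeMaybeUninitialized` assumes the caller tracks initialization state, the typical lifecycle is: write a value, read it, then destroy it explicitly. A minimal sketch (importing from the `memory.maybe_uninitialized` module listed below; the `main` entry point is added for illustration):

```mojo
from memory.maybe_uninitialized import UnsafeMaybeUninitialized

def main():
    # The storage starts out uninitialized.
    var slot = UnsafeMaybeUninitialized[Int]()
    # write() makes the memory initialized from the caller's perspective.
    slot.write(42)
    # Reading is only valid while the memory is initialized.
    print(slot.assume_initialized())  # 42
    # The destructor is a no-op, so the value must be destroyed explicitly.
    slot.assume_initialized_destroy()
```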
--- ## maybe_uninitialized ## Structs * [​`UnsafeMaybeUninitialized`](/mojo/stdlib/memory/maybe_uninitialized/UnsafeMaybeUninitialized): A memory location that may or may not be initialized. --- ## memory Defines functions for memory manipulations. You can import these APIs from the `memory` package. For example: ```mojo from memory import memcmp ``` ## Functions * [​`memcmp`](/mojo/stdlib/memory/memory/memcmp): Compares two buffers. Both buffers are assumed to be of the same length. * [​`memcpy`](/mojo/stdlib/memory/memory/memcpy): Copies a memory area. * [​`memset`](/mojo/stdlib/memory/memory/memset): Fills memory with the given value. * [​`memset_zero`](/mojo/stdlib/memory/memory/memset_zero): Fills memory with zeros. * [​`stack_allocation`](/mojo/stdlib/memory/memory/stack_allocation): Allocates data buffer space on the stack given a data type and number of elements. --- ## memcmp `memcmp[type: AnyType, address_space: AddressSpace](s1: UnsafePointer[type, address_space=address_space], s2: UnsafePointer[type, address_space=address_space], count: Int) -> Int` Compares two buffers. Both buffers are assumed to be of the same length. **Parameters:** * ​type (`AnyType`): The element type. * ​address\_space (`AddressSpace`): The address space of the pointer. **Args:** * ​s1 (`UnsafePointer[type, address_space=address_space]`): The first buffer address. * ​s2 (`UnsafePointer[type, address_space=address_space]`): The second buffer address. * ​count (`Int`): The number of elements in the buffers. **Returns:** Returns 0 if the byte strings are identical, 1 if s1 > s2, and -1 if s1 < s2. --- ## memcpy `memcpy[T: AnyType](dest: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], src: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], count: Int)` Copies a memory area. **Parameters:** * ​T (`AnyType`): The element type. **Args:** * ​dest (`UnsafePointer[T, alignment=alignment, mut=mut, origin=origin]`): The destination pointer. * ​src (`UnsafePointer[T, alignment=alignment, mut=mut, origin=origin]`): The source pointer. * ​count (`Int`): The number of elements to copy. --- ## memset `memset[type: AnyType, address_space: AddressSpace](ptr: UnsafePointer[type, address_space=address_space], value: SIMD[uint8, 1], count: Int)` Fills memory with the given value. **Parameters:** * ​type (`AnyType`): The element type. * ​address\_space (`AddressSpace`): The address space of the pointer. **Args:** * ​ptr (`UnsafePointer[type, address_space=address_space]`): UnsafePointer to the beginning of the memory block to fill. * ​value (`SIMD[uint8, 1]`): The value to fill with. * ​count (`Int`): Number of elements to fill (in elements, not bytes). --- ## memset_zero `memset_zero[type: AnyType, address_space: AddressSpace, //](ptr: UnsafePointer[type, address_space=address_space], count: Int)` Fills memory with zeros. **Parameters:** * ​type (`AnyType`): The element type. * ​address\_space (`AddressSpace`): The address space of the pointer. **Args:** * ​ptr (`UnsafePointer[type, address_space=address_space]`): UnsafePointer to the beginning of the memory block to fill. * ​count (`Int`): Number of elements to fill (in elements, not bytes). `memset_zero[dtype: DType, address_space: AddressSpace, //, *, count: Int](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space])` Fills memory with zeros. **Parameters:** * ​dtype (`DType`): The element type. * ​address\_space (`AddressSpace`): The address space of the pointer. * ​count (`Int`): Number of elements to fill (in elements, not bytes).
**Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space]`): UnsafePointer to the beginning of the memory block to fill. --- ## stack_allocation `stack_allocation[count: Int, dtype: DType, /, alignment: Int = alignof[::DType,__mlir_type.!kgen.target]() if is_gpu() else 1, address_space: AddressSpace = AddressSpace(0)]() -> UnsafePointer[SIMD[dtype, 1], address_space=address_space]` Allocates data buffer space on the stack given a data type and number of elements. **Parameters:** * ​count (`Int`): Number of elements to allocate memory for. * ​dtype (`DType`): The data type of each element. * ​alignment (`Int`): Address alignment of the allocated data. * ​address\_space (`AddressSpace`): The address space of the pointer. **Returns:** A data pointer of the given type pointing to the allocated space. `stack_allocation[count: Int, type: AnyType, /, name: Optional[StringSlice[StaticConstantOrigin]] = Optional(None), alignment: Int = alignof[::AnyType,__mlir_type.!kgen.target]() if is_gpu() else 1, address_space: AddressSpace = AddressSpace(0)]() -> UnsafePointer[type, address_space=address_space]` Allocates data buffer space on the stack given a data type and number of elements. **Parameters:** * ​count (`Int`): Number of elements to allocate memory for. * ​type (`AnyType`): The data type of each element. * ​name (`Optional[StringSlice[StaticConstantOrigin]]`): The name of the global variable (only honored in certain cases). * ​alignment (`Int`): Address alignment of the allocated data. * ​address\_space (`AddressSpace`): The address space of the pointer. **Returns:** A data pointer of the given type pointing to the allocated space. --- ## OwnedPointer `@register_passable` `struct OwnedPointer[T: AnyType]` A safe, owning, smart pointer. This smart pointer is designed for cases where there is clear ownership of the underlying data, and restricts access to it through the origin system such that no more than one mutable alias for the underlying data may exist. For a comparison with other pointer types, see [Intro to pointers](/mojo/manual/pointers/) in the Mojo Manual. ## Parameters * ​T (`AnyType`): The type to be stored in the `OwnedPointer`. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__[T: Movable](owned value: T) -> OwnedPointer[T]` Construct a new `OwnedPointer` by moving the passed value into a new backing allocation. **Parameters:** * ​T (`Movable`): The type of the data to store. It is restricted to `Movable` here to allow efficient move construction. **Args:** * ​value (`T`): The value to move into the `OwnedPointer`. `__init__[T: ExplicitlyCopyable](*, copy_value: T) -> OwnedPointer[T]` Construct a new `OwnedPointer` by explicitly copying the passed value into a new backing allocation. **Parameters:** * ​T (`ExplicitlyCopyable`): The type of the data to store, which must be `ExplicitlyCopyable`. **Args:** * ​copy\_value (`T`): The value to explicitly copy into the `OwnedPointer`. `__init__[T: Copyable, U: NoneType = NoneType(None)](value: T) -> OwnedPointer[T]` Construct a new `OwnedPointer` by copying the passed value into a new backing allocation. **Parameters:** * ​T (`Copyable`): The type of the data to store. * ​U (`NoneType`): A dummy type parameter, to lower the selection priority of this ctor. **Args:** * ​value (`T`): The value to copy into the `OwnedPointer`. 
`__init__[T: ExplicitlyCopyable](*, other: OwnedPointer[T]) -> OwnedPointer[T]` Construct a new `OwnedPointer` by explicitly copying the value from another `OwnedPointer`. **Parameters:** * ​T (`ExplicitlyCopyable`): The type of the data to store. **Args:** * ​other (`OwnedPointer[T]`): The `OwnedPointer` to copy. ### `__del__` `__del__(owned self)` Destroy the `OwnedPointer`. ### `__getitem__` `__getitem__(ref self) -> ref [self] T` Returns a reference to the pointer's underlying data with parametric mutability. **Returns:** A reference to the data underlying the `OwnedPointer`. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[T]` UNSAFE: returns the backing pointer for this `OwnedPointer`. **Returns:** An UnsafePointer to the backing allocation for this `OwnedPointer`. ### `take` `take[T: Movable](owned self: OwnedPointer[T]) -> T` Move the value within the `OwnedPointer` out of it, consuming the `OwnedPointer` in the process. **Parameters:** * ​T (`Movable`): The type of the data backing this `OwnedPointer`. `take()` only exists for `T: Movable` since this consuming operation only makes sense for types that you want to avoid copying. For types that are `Copyable` or `ExplicitlyCopyable` but are not `Movable`, you can copy them through `__getitem__` as in `var v = some_ptr_var[]`. **Returns:** The data that is (was) backing the `OwnedPointer`. ### `steal_data` `steal_data(owned self) -> UnsafePointer[T]` Take ownership over the heap allocated pointer backing this `OwnedPointer`. **Safety:** This function is not unsafe to call, as a memory leak is not considered unsafe. However, to avoid a memory leak, callers should ensure that the returned pointer is eventually deinitialized and deallocated. Failure to do so will leak memory. **Returns:** The pointer owned by this instance. --- ## owned_pointer Implements `OwnedPointer`, a safe, single-ownership smart pointer. You can import these APIs from the `memory` package. For example: ```mojo from memory import OwnedPointer ``` ## Structs * [​`OwnedPointer`](/mojo/stdlib/memory/owned_pointer/OwnedPointer): A safe, owning, smart pointer. --- ## AddressSpace `@register_passable(trivial)` `struct AddressSpace` Address space of the pointer. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Intable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `GENERIC` `alias GENERIC = AddressSpace(0)` Generic address space. ## Methods ### `__init__` `__init__(value: Int) -> Self` Initializes the address space from the underlying integral value. **Args:** * ​value (`Int`): The address space value. `__init__(value: _GPUAddressSpace) -> Self` Initializes the address space from the underlying integral value. **Args:** * ​value (`_GPUAddressSpace`): The address space value. ### `__eq__` `__eq__(self, other: Self) -> Bool` True if the two address spaces are equal and False otherwise. **Args:** * ​other (`Self`): The other address space value. **Returns:** True if the two address spaces are equal and False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` True if the two address spaces are not equal and False otherwise. **Args:** * ​other (`Self`): The other address space value. **Returns:** True if the two address spaces are not equal and False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` True if the two address spaces are equal and False otherwise. **Args:** * ​other (`Self`): The other address space value. **Returns:** True if the two address spaces are equal and False otherwise.
### `__isnot__` `__isnot__(self, other: Self) -> Bool` True if the two address spaces are not equal and False otherwise. **Args:** * ​other (`Self`): The other address space value. **Returns:** True if the two address spaces are not equal and False otherwise. ### `value` `value(self) -> Int` The integral value of the address space. **Returns:** The integral value of the address space. ### `__int__` `__int__(self) -> Int` The integral value of the address space. **Returns:** The integral value of the address space. ### `__index__` `__index__(self) -> index` Convert to index. **Returns:** The corresponding `__mlir_type.index` value. ### `__str__` `__str__(self) -> String` Gets a string representation of the AddressSpace. **Returns:** The string representation of the AddressSpace. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats the address space to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. --- ## Pointer `@register_passable(trivial)` `struct Pointer[mut: Bool, //, type: AnyType, origin: Origin[mut], address_space: AddressSpace = AddressSpace(0)]` Defines a non-nullable safe pointer. For a comparison with other pointer types, see [Intro to pointers](/mojo/manual/pointers/) in the Mojo Manual. ## Parameters * ​mut (`Bool`): Whether the pointee data may be mutated through this pointer. * ​type (`AnyType`): Type of the underlying data. * ​origin (`Origin[mut]`): The origin of the pointer. * ​address\_space (`AddressSpace`): The address space of the pointee data. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Stringable`, `UnknownDestructibility` ## Aliases ### `Immutable` `alias Immutable = Pointer[type, (muttoimm origin._mlir_origin), address_space]` The immutable version of the `Pointer`. ### `Mutable` `alias Mutable = Pointer[type, (mutcast origin._mlir_origin), address_space]` The mutable version of the `Pointer`. ## Methods ### `__init__` `__init__(*, ref [origin, address_space] to: type) -> Self` Constructs a Pointer from a reference to a value. **Args:** * ​to (`type`): The value to construct a pointer to. ### `__getitem__` `__getitem__(self) -> ref [origin, address_space] type` Enable subscript syntax `ptr[]` to access the element. **Returns:** A reference to the underlying value in memory. ### `__eq__` `__eq__(self, rhs: Pointer[type, origin, address_space]) -> Bool` Returns True if the two pointers are equal. **Args:** * ​rhs (`Pointer[type, origin, address_space]`): The value of the other pointer. **Returns:** True if the two pointers are equal and False otherwise. ### `__ne__` `__ne__(self, rhs: Pointer[type, origin, address_space]) -> Bool` Returns True if the two pointers are not equal. **Args:** * ​rhs (`Pointer[type, origin, address_space]`): The value of the other pointer. **Returns:** True if the two pointers are not equal and False otherwise. ### `address_of` `static address_of(ref [origin, address_space] value: type) -> Self` Constructs a Pointer from a reference to a value. **Args:** * ​value (`type`): The value to get the address of. **Returns:** The result Pointer. ### `copy` `copy(self) -> Self` Constructs a copy from another Pointer. Note that this does **not** copy the underlying data. **Returns:** A copy of the value. ### `get_immutable` `get_immutable(self) -> Pointer[type, (muttoimm origin._mlir_origin), address_space]` Constructs a new Pointer with the same underlying target and an ImmutableOrigin.
Notes: This does **not** copy the underlying data. **Returns:** A new Pointer with the same target as self and an ImmutableOrigin. ### `__str__` `__str__(self) -> String` Gets a string representation of the Pointer. **Returns:** The string representation of the Pointer. ### `__merge_with__` `__merge_with__[: Bool, : Origin[$0], //, other_type: AnyStruct[Pointer[type, $1, address_space]]](self) -> Pointer[type, origin, address_space]` Returns a pointer merged with the specified `other_type`. **Parameters:** * ​other\_type (`AnyStruct[Pointer[type, $1, address_space]]`): The type of the pointer to merge with. **Returns:** A pointer merged with the specified `other_type`. --- ## pointer Implements the Pointer type. You can import these APIs from the `memory` package. For example: ```mojo from memory import Pointer ``` ## Structs * [​`AddressSpace`](/mojo/stdlib/memory/pointer/AddressSpace): Address space of the pointer. * [​`Pointer`](/mojo/stdlib/memory/pointer/Pointer): Defines a non-nullable safe pointer. --- ## Span `@register_passable(trivial)` `struct Span[mut: Bool, //, T: Copyable & Movable, origin: Origin[mut], *, address_space: AddressSpace = AddressSpace(0), alignment: Int = _default_alignment[::AnyType]()]` A non-owning view of contiguous data. ## Parameters * ​mut (`Bool`): Whether the span is mutable. * ​T (`Copyable & Movable`): The type of the elements in the span. * ​origin (`Origin[mut]`): The origin of the Span. * ​address\_space (`AddressSpace`): The address space associated with the allocated memory. * ​alignment (`Int`): The minimum alignment of the underlying pointer known statically. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Aliases ### `Immutable` `alias Immutable = Span[T, (muttoimm origin._mlir_origin)]` The immutable version of the `Span`. ### `Mutable` `alias Mutable = Span[T, (mutcast origin._mlir_origin)]` The mutable version of the `Span`. ## Methods ### `__init__` `__init__() -> Self` Create an empty / zero-length span. `__init__(*, ptr: UnsafePointer[T, address_space=address_space, alignment=alignment, mut=mut, origin=origin], length: UInt) -> Self` Unsafe construction from a pointer and length. **Args:** * ​ptr (`UnsafePointer[T, address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The underlying pointer of the span. * ​length (`UInt`): The length of the view. `@implicit` `__init__(ref [origin, address_space] list: List[T, hint_trivial_type]) -> Self` Construct a `Span` from a `List`. **Args:** * ​list (`List[T, hint_trivial_type]`): The list to which the span refers. `@implicit` `__init__[size: Int, //](ref [origin] array: InlineArray[T, size]) -> Self` Construct a `Span` from an `InlineArray`. **Parameters:** * ​size (`Int`): The size of the `InlineArray`. **Args:** * ​array (`InlineArray[T, size]`): The array to which the span refers. ### `__bool__` `__bool__(self) -> Bool` Check if a span is non-empty. **Returns:** True if a span is non-empty, False otherwise. ### `__getitem__` `__getitem__[I: Indexer](self, idx: I) -> ref [origin, address_space] T` Get a reference to an element in the span. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The index of the value to return. **Returns:** An element reference. `__getitem__(self, slc: Slice) -> Self` Get a new span from a slice of the current span. 
Allocation: This function allocates when the step is negative; to avoid a memory leak, take ownership of the returned value. **Args:** * ​slc (`Slice`): The slice specifying the range of the new subslice. **Returns:** A new span that points to the same data as the current span. ### `__eq__` `__eq__[T: EqualityComparable & Copyable & Movable, rhs_alignment: Int, //](self: Span[T, origin, alignment=alignment], rhs: Span[T, origin, alignment=rhs_alignment]) -> Bool` Verify if the span is equal to another span. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the span. Must implement the traits `EqualityComparable`, `Copyable` and `Movable`. * ​rhs\_alignment (`Int`): The inferred alignment of the rhs span. **Args:** * ​rhs (`Span[T, origin, alignment=rhs_alignment]`): The span to compare against. **Returns:** True if the spans are equal in length and contain the same elements, False otherwise. ### `__ne__` `__ne__[T: EqualityComparable & Copyable & Movable, //](self: Span[T, origin, alignment=alignment], rhs: Span[T, origin]) -> Bool` Verify if the span is not equal to another span. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the span. Must implement the traits `EqualityComparable`, `Copyable` and `Movable`. **Args:** * ​rhs (`Span[T, origin]`): The span to compare against. **Returns:** True if the spans are not equal in length or contents, False otherwise. ### `__contains__` `__contains__[dtype: DType, //](self: Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment], value: SIMD[dtype, 1]) -> Bool` Verify if a given value is present in the Span. **Parameters:** * ​dtype (`DType`): The DType of the scalars stored in the Span. **Args:** * ​value (`SIMD[dtype, 1]`): The value to find. **Returns:** True if the value is contained in the span, False otherwise. ### `copy` `copy(self) -> Self` Explicitly construct a copy of the provided `Span`. **Returns:** A copy of the `Span`. ### `__iter__` `__iter__(self) -> _SpanIter[T, origin, address_space=address_space, alignment=alignment]` Get an iterator over the elements of the `Span`. **Returns:** An iterator over the elements of the `Span`. ### `__reversed__` `__reversed__(self) -> _SpanIter[T, origin, False, address_space, alignment]` Iterate backwards over the `Span`. **Returns:** A reversed iterator of the `Span` elements. ### `__len__` `__len__(self) -> Int` Returns the length of the span. This is a known constant value. **Returns:** The size of the span. ### `get_immutable` `get_immutable(self) -> Span[T, (muttoimm origin._mlir_origin)]` Return an immutable version of this `Span`. **Returns:** An immutable version of the same `Span`. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[T, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Retrieves a pointer to the underlying memory. **Returns:** The pointer to the underlying memory. ### `as_ref` `as_ref(self) -> Pointer[T, origin, address_space]` Gets a `Pointer` to the first element of this span. **Returns:** A `Pointer` pointing at the first element of this span. ### `copy_from` `copy_from[origin: MutableOrigin, other_alignment: Int, //](self: Span[T, origin, alignment=alignment], other: Span[T, origin, alignment=other_alignment])` Performs an element-wise copy from all elements of `other` into all elements of `self`. **Parameters:** * ​origin (`MutableOrigin`): The inferred mutable origin of the data within the Span. 
* ​other\_alignment (`Int`): The inferred alignment of the data within the Span. **Args:** * ​other (`Span[T, origin, alignment=other_alignment]`): The `Span` to copy all elements from. ### `fill` `fill[origin: MutableOrigin, //](self: Span[T, origin, alignment=alignment], value: T)` Fill the memory that a span references with a given value. **Parameters:** * ​origin (`MutableOrigin`): The inferred mutable origin of the data within the Span. **Args:** * ​value (`T`): The value to assign to each element. ### `swap_elements` `swap_elements(self: Span[T, origin, alignment=alignment], a: UInt, b: UInt)` Swap the values at indices `a` and `b`. **Args:** * ​a (`UInt`): The first argument index. * ​b (`UInt`): The second argument index. **Raises:** If `a` or `b` is larger than the length of the span. ### `__merge_with__` `__merge_with__[: Bool, : Origin[$0], //, other_type: AnyStruct[Span[T, $1, address_space=address_space, alignment=alignment]]](self) -> Span[T, origin, address_space=address_space, alignment=alignment]` Returns a span merged with the specified `other_type`. **Parameters:** * ​other\_type (`AnyStruct[Span[T, $1, address_space=address_space, alignment=alignment]]`): The type of the span to merge with. **Returns:** A span merged with the specified `other_type`. --- ## span Implements the `Span` type. You can import these APIs from the `memory` package. For example: ```mojo from memory import Span ``` ## Structs * [​`Span`](/mojo/stdlib/memory/span/Span): A non-owning view of contiguous data. --- ## bitcast `bitcast[src_dtype: DType, src_width: Int, //, dtype: DType, width: Int = src_width](val: SIMD[src_dtype, src_width]) -> SIMD[dtype, width]` Bitcasts a SIMD value to another SIMD value. For a discussion of byte order, see [Converting data: bitcasting and byte order](/mojo/manual/pointers/unsafe-pointers#converting-data-bitcasting-and-byte-order) in the Mojo Manual. Examples: The following example uses `bitcast` to break a 32-bit integer into a vector of four 8-bit integers: ```mojo from memory import bitcast u32 = SIMD[DType.uint32, 1](4631) u8x4 = bitcast[DType.uint8, 4](u32) print(u32, u8x4) # 4631 [23, 18, 0, 0] ``` **Constraints:** The bitwidth of the two types must be the same. **Parameters:** * ​src\_dtype (`DType`): The source type. * ​src\_width (`Int`): The source width. * ​dtype (`DType`): The target type. * ​width (`Int`): The target width. **Args:** * ​val (`SIMD[src_dtype, src_width]`): The source value. **Returns:** A new SIMD value with the specified type and width with a bitcopy of the source SIMD value. --- ## unsafe Provides utility functions for unsafe manipulation of SIMD values. You can import these APIs from the `memory` package. For example: ```mojo from memory import bitcast ``` ## Functions * [​`bitcast`](/mojo/stdlib/memory/unsafe/bitcast): Bitcasts a SIMD value to another SIMD value. * [​`pack_bits`](/mojo/stdlib/memory/unsafe/pack_bits): Packs a SIMD vector of `bool` values into an integer. --- ## pack_bits `pack_bits[src_width: Int, //, dtype: DType = ui1 if (src_width == 1) else ui2 if (src_width == 2) else ui4 if (src_width == 4) else uint8 if (src_width == 8) else uint16 if (src_width == 16) else uint32 if (src_width == 32) else uint64 if (src_width == 64) else ui128 if (src_width == 128) else ui256 if (src_width == 256) else invalid, width: Int = 1](val: SIMD[bool, src_width]) -> SIMD[dtype, width]` Packs a SIMD vector of `bool` values into an integer. Examples: This example packs a vector of 8 `bool` values into a single 8-bit integer. 
```mojo from memory import pack_bits bits = SIMD[DType.bool, 8](1, 1, 0, 1, 0, 0, 0, 0) u8 = pack_bits[DType.uint8](bits) print(bits, u8) # [True, True, False, True, False, False, False, False] 11 ``` **Constraints:** The logical bitwidth of the bool vector must be the same as the bitwidth of the target type. The target type must be an unsigned type. **Parameters:** * ​src\_width (`Int`): The source width. * ​dtype (`DType`): The target type. * ​width (`Int`): The target width. **Args:** * ​val (`SIMD[bool, src_width]`): The source value. **Returns:** A new integer scalar which has the same bitwidth as the bool vector. --- ## UnsafePointer `@register_passable(trivial)` `struct UnsafePointer[type: AnyType, *, address_space: AddressSpace = AddressSpace(0), alignment: Int = _default_alignment[::AnyType](), mut: Bool = True, origin: Origin[mut] = SomeAnyOrigin]` UnsafePointer\[T] represents an indirect reference to one or more values of type T consecutively in memory, and can refer to uninitialized memory. Because it supports referring to uninitialized memory, it provides unsafe methods for initializing and destroying instances of T, as well as methods for accessing the values once they are initialized. For more information see [Unsafe pointers](/mojo/manual/pointers/unsafe-pointers) in the Mojo Manual. For a comparison with other pointer types, see [Intro to pointers](/mojo/manual/pointers/). ## Parameters * ​type (`AnyType`): The type the pointer points to. * ​address\_space (`AddressSpace`): The address space associated with the UnsafePointer allocated memory. * ​alignment (`Int`): The minimum alignment of this pointer known statically. * ​mut (`Bool`): Whether the origin is mutable. * ​origin (`Origin[mut]`): The origin of the memory being addressed. ## Fields * ​address (`pointer *"type", #lit.struct.extract, "value">>`): The underlying pointer. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `EqualityComparable`, `ExplicitlyCopyable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `ImplicitlyBoolable`, `Intable`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Create a null pointer. `__init__(*, ref [origin, address_space] to: type) -> Self` Constructs a Pointer from a reference to a value. **Args:** * ​to (`type`): The value to construct a pointer to. `@implicit` `__init__(other: UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> Self` Exclusivity parameter cast of a pointer. **Args:** * ​other (`UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to cast. `__init__(*, ref [origin] unchecked_downcast_value: PythonObject) -> UnsafePointer[type, mut=mut, origin=origin]` Downcast a `PythonObject` known to contain a Mojo object to a pointer. This operation is only valid if the provided Python object contains an initialized Mojo object of matching type. **Args:** * ​unchecked\_downcast\_value (`PythonObject`): The Python object to downcast from. ### `__bool__` `__bool__(self) -> Bool` Return true if the pointer is non-null. **Returns:** Whether the pointer is non-null. ### `__getitem__` `__getitem__(self) -> ref [origin, address_space] type` Return a reference to the underlying data. **Returns:** A reference to the value. 
`__getitem__[I: Indexer, //](self, offset: I) -> ref [origin, address_space] type` Return a reference to the underlying data, offset by the given index. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​offset (`I`): The offset index. **Returns:** An offset reference. ### `__lt__` `__lt__(self, rhs: Self) -> Bool` Returns True if this pointer represents a lower address than rhs. **Args:** * ​rhs (`Self`): The value of the other pointer. **Returns:** True if this pointer represents a lower address and False otherwise. ### `__le__` `__le__(self, rhs: Self) -> Bool` Returns True if this pointer represents a lower than or equal address to rhs. **Args:** * ​rhs (`Self`): The value of the other pointer. **Returns:** True if this pointer represents a lower than or equal address and False otherwise. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Returns True if the two pointers are equal. **Args:** * ​rhs (`Self`): The value of the other pointer. **Returns:** True if the two pointers are equal and False otherwise. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Returns True if the two pointers are not equal. **Args:** * ​rhs (`Self`): The value of the other pointer. **Returns:** True if the two pointers are not equal and False otherwise. ### `__gt__` `__gt__(self, rhs: Self) -> Bool` Returns True if this pointer represents a higher address than rhs. **Args:** * ​rhs (`Self`): The value of the other pointer. **Returns:** True if this pointer represents a higher address and False otherwise. ### `__ge__` `__ge__(self, rhs: Self) -> Bool` Returns True if this pointer represents a higher than or equal address to rhs. **Args:** * ​rhs (`Self`): The value of the other pointer. **Returns:** True if this pointer represents a higher than or equal address and False otherwise. ### `__add__` `__add__[I: Indexer, //](self, offset: I) -> Self` Return a pointer at an offset from the current one. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​offset (`I`): The offset index. **Returns:** An offset pointer. ### `__sub__` `__sub__[I: Indexer, //](self, offset: I) -> Self` Return a pointer at an offset from the current one. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​offset (`I`): The offset index. **Returns:** An offset pointer. ### `__iadd__` `__iadd__[I: Indexer, //](mut self, offset: I)` Add an offset to this pointer. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​offset (`I`): The offset index. ### `__isub__` `__isub__[I: Indexer, //](mut self, offset: I)` Subtract an offset from this pointer. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​offset (`I`): The offset index. ### `copy` `copy(self) -> Self` Copy an existing pointer. **Returns:** A copy of the value. ### `address_of` `static address_of(ref [address_space] arg: type) -> UnsafePointer[type, address_space=address_space, alignment=1, mut=arg_is_mut, origin=arg_is_origin]` Gets the address of the argument. **Args:** * ​arg (`type`): The value to get the address of. **Returns:** An UnsafePointer which contains the address of the argument. ### `alloc` `static alloc(count: Int) -> UnsafePointer[type, alignment=alignment, origin={}]` Allocate an array with specified or default alignment. **Args:** * ​count (`Int`): The number of elements in the array. **Returns:** The pointer to the newly allocated array. 
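As an illustrative sketch of the allocation lifecycle (the count of 4 and the stored values are arbitrary; `init_pointee_copy` and `destroy_pointee` are documented later in this section):

```mojo
from memory import UnsafePointer

def main():
    # Allocate uninitialized storage for 4 Int values.
    var ptr = UnsafePointer[Int].alloc(4)
    # Initialize each slot before reading it.
    for i in range(4):
        (ptr + i).init_pointee_copy(i * 10)
    print(ptr[2])  # 20
    # Destroy the pointees, then release the allocation.
    for i in range(4):
        (ptr + i).destroy_pointee()
    ptr.free()
```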
### `offset` `offset[I: Indexer, //](self, idx: I) -> Self` Returns a new pointer shifted by the specified offset. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The offset of the new pointer. **Returns:** The new constructed UnsafePointer. ### `__merge_with__` `__merge_with__[: Int, : Bool, : Origin[$1], //, other_type: AnyStruct[UnsafePointer[type, address_space=address_space, alignment=$0, mut=$1, origin=$2]]](self) -> UnsafePointer[type, address_space=address_space, alignment=min(alignment, alignment), mut=mut, origin=origin]` Returns a pointer merged with the specified `other_type`. **Parameters:** * ​other\_type (`AnyStruct[UnsafePointer[type, address_space=address_space, alignment=$0, mut=$1, origin=$2]]`): The type of the pointer to merge with. **Returns:** A pointer merged with the specified `other_type`. ### `__as_bool__` `__as_bool__(self) -> Bool` Return true if the pointer is non-null. **Returns:** Whether the pointer is non-null. ### `__int__` `__int__(self) -> Int` Returns the pointer address as an integer. **Returns:** The address of the pointer as an Int. ### `__str__` `__str__(self) -> String` Gets a string representation of the pointer. **Returns:** The string representation of the pointer. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this pointer address to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `as_noalias_ptr` `as_noalias_ptr(self) -> Self` Cast the pointer to a new pointer that is known not to locally alias any other pointer. In other words, the pointer transitively does not alias any other memory value declared in the local function context. This information is relayed to the optimizer. If the pointer does locally alias another memory value, the behaviour is undefined. **Returns:** A noalias pointer. ### `load` `load[dtype: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False, invariant: Bool = _default_invariant[::Bool]()](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> SIMD[dtype, width]` Loads the value the pointer points to. **Constraints:** The width and alignment must be positive integer values. **Parameters:** * ​dtype (`DType`): The data type of SIMD vector. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. * ​invariant (`Bool`): Whether the memory is load invariant. **Returns:** The loaded value. `load[dtype: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False, invariant: Bool = _default_invariant[::Bool]()](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: SIMD[dtype, 1]) -> SIMD[dtype, width]` Loads the value the pointer points to with the given offset. **Constraints:** The width and alignment must be positive integer values. The offset must be an integer. **Parameters:** * ​dtype (`DType`): The data type of SIMD vector elements. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. * ​invariant (`Bool`): Whether the memory is load invariant. **Args:** * ​offset (`SIMD[dtype, 1]`): The offset to load from. 
**Returns:** The loaded value. `load[I: Indexer, dtype: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False, invariant: Bool = _default_invariant[::Bool]()](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: I) -> SIMD[dtype, width]` Loads the value the pointer points to with the given offset. **Constraints:** The width and alignment must be positive integer values. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. * ​dtype (`DType`): The data type of SIMD vector elements. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. * ​invariant (`Bool`): Whether the memory is load invariant. **Args:** * ​offset (`I`): The offset to load from. **Returns:** The loaded value. ### `store` `store[I: Indexer, dtype: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: I, val: SIMD[dtype, width])` Stores a single element value at the given offset. **Constraints:** The width and alignment must be positive integer values. The offset must be an integer. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. * ​dtype (`DType`): The data type of SIMD vector elements. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. **Args:** * ​offset (`I`): The offset to store to. * ​val (`SIMD[dtype, width]`): The value to store. `store[dtype: DType, offset_type: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: SIMD[offset_type, 1], val: SIMD[dtype, width])` Stores a single element value at the given offset. **Constraints:** The width and alignment must be positive integer values. **Parameters:** * ​dtype (`DType`): The data type of SIMD vector elements. * ​offset\_type (`DType`): The data type of the offset value. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. **Args:** * ​offset (`SIMD[offset_type, 1]`): The offset to store to. * ​val (`SIMD[dtype, width]`): The value to store. `store[dtype: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], val: SIMD[dtype, width])` Stores a single element value. **Constraints:** The width and alignment must be positive integer values. **Parameters:** * ​dtype (`DType`): The data type of SIMD vector elements. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. **Args:** * ​val (`SIMD[dtype, width]`): The value to store. 
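To make the offset-and-width interplay of `load`/`store` concrete, a small sketch (dtype, width, and values chosen arbitrarily; `Float32` is a trivial type, so no pointee destruction is needed before `free()`):

```mojo
from memory import UnsafePointer

def main():
    var ptr = UnsafePointer[Float32].alloc(8)
    # Write two 4-wide vectors back to back.
    ptr.store(SIMD[DType.float32, 4](1, 2, 3, 4))
    ptr.store(4, SIMD[DType.float32, 4](5, 6, 7, 8))
    # Read 4 elements starting at offset 2: [3.0, 4.0, 5.0, 6.0]
    print(ptr.load[width=4](2))
    ptr.free()
```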
### `strided_load` `strided_load[dtype: DType, T: Intable, //, width: Int](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], stride: T) -> SIMD[dtype, width]` Performs a strided load of the SIMD vector. **Parameters:** * ​dtype (`DType`): DType of returned SIMD value. * ​T (`Intable`): The Intable type of the stride. * ​width (`Int`): The SIMD width. **Args:** * ​stride (`T`): The stride between loads. **Returns:** A vector which is stride loaded. ### `strided_store` `strided_store[dtype: DType, T: Intable, //, width: Int = 1](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], val: SIMD[dtype, width], stride: T)` Performs a strided store of the SIMD vector. **Parameters:** * ​dtype (`DType`): DType of `val`, the SIMD value to store. * ​T (`Intable`): The Intable type of the stride. * ​width (`Int`): The SIMD width. **Args:** * ​val (`SIMD[dtype, width]`): The SIMD value to store. * ​stride (`T`): The stride between stores. ### `gather` `gather[dtype: DType, //, *, width: Int = 1, alignment: Int = _default_alignment[::DType,::Int]()](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: SIMD[dtype, width], mask: SIMD[bool, width] = SIMD(True), default: SIMD[dtype, width] = __init__[__mlir_type.!pop.int_literal](0)) -> SIMD[dtype, width]` Gathers a SIMD vector from offsets of the current pointer. This method loads from memory addresses calculated by appropriately shifting the current pointer according to the `offset` SIMD vector, or takes from the `default` SIMD vector, depending on the values of the `mask` SIMD vector. If a mask element is `True`, the respective result element is given by the current pointer and the `offset` SIMD vector; otherwise, the result element is taken from the `default` SIMD vector. **Constraints:** The offset type must be an integral type. The alignment must be a power of two integer value. **Parameters:** * ​dtype (`DType`): DType of the return SIMD. * ​width (`Int`): The SIMD width. * ​alignment (`Int`): The minimal alignment of the address. **Args:** * ​offset (`SIMD[dtype, width]`): The SIMD vector of offsets to gather from. * ​mask (`SIMD[bool, width]`): The SIMD vector of boolean values, indicating for each element whether to load from memory or to take from the `default` SIMD vector. * ​default (`SIMD[dtype, width]`): The SIMD vector providing default values to be taken where the `mask` SIMD vector is `False`. **Returns:** The SIMD vector containing the gathered values. ### `scatter` `scatter[dtype: DType, //, *, width: Int = 1, alignment: Int = _default_alignment[::DType,::Int]()](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: SIMD[dtype, width], val: SIMD[dtype, width], mask: SIMD[bool, width] = SIMD(True))` Scatters a SIMD vector into offsets of the current pointer. This method stores at memory addresses calculated by appropriately shifting the current pointer according to the `offset` SIMD vector, depending on the values of the `mask` SIMD vector. If a mask element is `True`, the respective element in the `val` SIMD vector is stored at the memory address defined by the current pointer and the `offset` SIMD vector; otherwise, no action is taken for that element in `val`. 
If the same offset is targeted multiple times, the values are stored in the order they appear in the `val` SIMD vector, from the first to the last element. **Constraints:** The offset type must be an integral type. The alignment must be a power of two integer value. **Parameters:** * ​dtype (`DType`): DType of `val`, the SIMD vector to scatter. * ​width (`Int`): The SIMD width. * ​alignment (`Int`): The minimal alignment of the address. **Args:** * ​offset (`SIMD[dtype, width]`): The SIMD vector of offsets to scatter into. * ​val (`SIMD[dtype, width]`): The SIMD vector containing the values to be scattered. * ​mask (`SIMD[bool, width]`): The SIMD vector of boolean values, indicating for each element whether to store at memory or not. ### `free` `free(self: UnsafePointer[type, alignment=alignment, mut=mut, origin=origin])` Free the memory referenced by the pointer. ### `bitcast` `bitcast[T: AnyType = type](self) -> UnsafePointer[T, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Bitcasts an UnsafePointer to a different type. **Parameters:** * ​T (`AnyType`): The target type. **Returns:** A new UnsafePointer object with the specified type and the same address, as the original UnsafePointer. ### `static_alignment_cast` `static_alignment_cast[alignment: Int = alignment](self) -> UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Changes the `alignment` of an `UnsafePointer`. The static alignment of an UnsafePointer must be greater than or equal to the actual alignment of the runtime pointer value. Casting an UnsafePointer to a static alignment greater than its runtime alignment may cause undefined behavior. This only changes the compile-time alignment encoded in the type of this pointer. This does not change the alignment of the pointer address at runtime. **Parameters:** * ​alignment (`Int`): Alignment of the destination pointer. **Returns:** A new UnsafePointer object with the same type, address\_space, and address, as the original UnsafePointer, and the new specified alignment. ### `origin_cast` `origin_cast[mut: Bool = mut, origin: Origin[mut] = origin](self) -> UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Changes the origin or mutability of a pointer. **Parameters:** * ​mut (`Bool`): Whether the origin is mutable. * ​origin (`Origin[mut]`): Origin of the destination pointer. **Returns:** A new UnsafePointer object with the same type and the same address, as the original UnsafePointer and the new specified mutability and origin. ### `address_space_cast` `address_space_cast[address_space: AddressSpace = address_space](self) -> UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Casts an UnsafePointer to a different address space. **Parameters:** * ​address\_space (`AddressSpace`): The address space of the result. **Returns:** A new UnsafePointer object with the same type and the same address, as the original UnsafePointer and the new address space. ### `destroy_pointee` `destroy_pointee(self: UnsafePointer[type, alignment=alignment, mut=mut, origin=origin])` Destroy the pointed-to value. The pointer must not be null, and the pointer memory location is assumed to contain a valid initialized instance of `type`. This is equivalent to `_ = self.take_pointee()` but doesn't require `Movable` and is more efficient because it doesn't invoke `__moveinit__`. 
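A short sketch contrasting `destroy_pointee` with the move-based methods documented next (`take_pointee`, `init_pointee_move`), using `String` as an arbitrary non-trivial type:

```mojo
from memory import UnsafePointer

def main():
    var ptr = UnsafePointer[String].alloc(1)
    # Emplace into uninitialized memory by moving from the argument.
    ptr.init_pointee_move(String("hello"))
    # Move the value back out; the slot is uninitialized again.
    var s = ptr.take_pointee()
    print(s)  # hello
    # Re-initialize by copy, then destroy in place without moving.
    ptr.init_pointee_copy(String("world"))
    ptr.destroy_pointee()
    ptr.free()
```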
### `take_pointee` `take_pointee[T: Movable, //](self: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin]) -> T` Move the value at the pointer out, leaving it uninitialized. The pointer must not be null, and the pointer memory location is assumed to contain a valid initialized instance of `T`. This performs a *consuming* move, ending the origin of the value stored in this pointer memory location. Subsequent reads of this pointer are not valid. If a new valid value is stored using `init_pointee_move()`, then reading from this pointer becomes valid again. **Parameters:** * ​T (`Movable`): The type the pointer points to, which must be `Movable`. **Returns:** The value at the pointer. ### `init_pointee_move` `init_pointee_move[T: Movable, //](self: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], owned value: T)` Emplace a new value into the pointer location, moving from `value`. The pointer memory location is assumed to contain uninitialized data, and consequently the current contents of this pointer are not destructed before writing `value`. Similarly, ownership of `value` is logically transferred into the pointer location. When compared to `init_pointee_copy`, this avoids an extra copy on the caller side when the value is an `owned` rvalue. **Parameters:** * ​T (`Movable`): The type the pointer points to, which must be `Movable`. **Args:** * ​value (`T`): The value to emplace. ### `init_pointee_copy` `init_pointee_copy[T: Copyable, //](self: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], value: T)` Emplace a copy of `value` into the pointer location. The pointer memory location is assumed to contain uninitialized data, and consequently the current contents of this pointer are not destructed before writing `value`. Similarly, ownership of `value` is logically transferred into the pointer location. When compared to `init_pointee_move`, this avoids an extra move on the callee side when the value must be copied. **Parameters:** * ​T (`Copyable`): The type the pointer points to, which must be `Copyable`. **Args:** * ​value (`T`): The value to emplace. ### `init_pointee_explicit_copy` `init_pointee_explicit_copy[T: ExplicitlyCopyable, //](self: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], value: T)` Emplace a copy of `value` into this pointer location. The pointer memory location is assumed to contain uninitialized data, and consequently the current contents of this pointer are not destructed before writing `value`. Similarly, ownership of `value` is logically transferred into the pointer location. When compared to `init_pointee_move`, this avoids an extra move on the callee side when the value must be copied. **Parameters:** * ​T (`ExplicitlyCopyable`): The type the pointer points to, which must be `ExplicitlyCopyable`. **Args:** * ​value (`T`): The value to emplace. ### `move_pointee_into` `move_pointee_into[T: Movable, //](self: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], dst: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin])` Moves the value `self` points to into the memory location pointed to by `dst`. This performs a consuming move (using `__moveinit__()`) out of the memory location pointed to by `self`. Subsequent reads of this pointer are not valid unless and until a new, valid value has been moved into this pointer's memory location using `init_pointee_move()`. This transfers the value out of `self` and into `dst` using at most one `__moveinit__()` call. 
**Safety:** * `self` must be non-null * `self` must contain a valid, initialized instance of `T` * `dst` must not be null * The contents of `dst` should be uninitialized. If `dst` was previously written with a valid value, that value will be overwritten and its destructor will NOT be run. **Parameters:** * ​T (`Movable`): The type the pointer points to, which must be `Movable`. **Args:** * ​dst (`UnsafePointer[T, alignment=alignment, mut=mut, origin=origin]`): Destination pointer that the value will be moved into. --- ## unsafe_pointer Implement a generic unsafe pointer type. You can import these APIs from the `memory` package. For example: ```mojo from memory import UnsafePointer ``` ## Structs * [​`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer): UnsafePointer\[T] represents an indirect reference to one or more values of type T consecutively in memory, and can refer to uninitialized memory. --- ## Atomic `struct Atomic[dtype: DType, *, scope: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")]` Represents a value with atomic operations. The class provides atomic `add` and `sub` methods for mutating the value. ## Parameters * ​dtype (`DType`): DType of the value. * ​scope (`StringSlice[StaticConstantOrigin]`): The memory synchronization scope. ## Fields * ​value (`SIMD[dtype, 1]`): The atomic value. This is the underlying value of the atomic. Access to the value can only occur through atomic primitive operations. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, value: SIMD[dtype, 1])` Constructs a new atomic value. **Args:** * ​value (`SIMD[dtype, 1]`): Initial value represented as `Scalar[dtype]` type. ### `__iadd__` `__iadd__(mut self, rhs: SIMD[dtype, 1])` Performs atomic in-place add. Atomically replaces the current value with the result of arithmetic addition of the value and arg. That is, it performs atomic post-increment. The operation is a read-modify-write operation. Memory is affected according to the value of order which is sequentially consistent. **Args:** * ​rhs (`SIMD[dtype, 1]`): Value to add. ### `__isub__` `__isub__(mut self, rhs: SIMD[dtype, 1])` Performs atomic in-place sub. Atomically replaces the current value with the result of arithmetic subtraction of the value and arg. That is, it performs atomic post-decrement. The operation is a read-modify-write operation. Memory is affected according to the value of order which is sequentially consistent. **Args:** * ​rhs (`SIMD[dtype, 1]`): Value to subtract. ### `load` `load(mut self) -> SIMD[dtype, 1]` Loads the current value from the atomic. **Returns:** The current value of the atomic. ### `fetch_add` `static fetch_add[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], rhs: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Performs atomic in-place add. Atomically replaces the current value with the result of arithmetic addition of the value and arg. That is, it performs atomic post-increment. The operation is a read-modify-write operation. Memory is affected according to the value of order which is sequentially consistent. **Parameters:** * ​ordering (`Consistency`): The memory ordering. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The source pointer. * ​rhs (`SIMD[dtype, 1]`): Value to add. 
**Returns:** The original value before addition. `fetch_add[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](mut self, rhs: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Performs atomic in-place add. Atomically replaces the current value with the result of arithmetic addition of the value and arg. That is, it performs atomic post-increment. The operation is a read-modify-write operation. Memory is affected according to the value of order which is sequentially consistent. **Parameters:** * ​ordering (`Consistency`): The memory ordering. **Args:** * ​rhs (`SIMD[dtype, 1]`): Value to add. **Returns:** The original value before addition. ### `store` `static store[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], value: SIMD[dtype, 1])` Performs atomic store. The operation is a read-modify-write operation. Memory is affected according to the value of order which is sequentially consistent. **Parameters:** * ​ordering (`Consistency`): The memory ordering. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The source pointer. * ​value (`SIMD[dtype, 1]`): The value to store. ### `fetch_sub` `fetch_sub[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](mut self, rhs: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Performs atomic in-place sub. Atomically replaces the current value with the result of arithmetic subtraction of the value and arg. That is, it performs atomic post-decrement. The operation is a read-modify-write operation. Memory is affected according to the value of order which is sequentially consistent. **Parameters:** * ​ordering (`Consistency`): The memory ordering. **Args:** * ​rhs (`SIMD[dtype, 1]`): Value to subtract. **Returns:** The original value before subtraction. ### `compare_exchange_weak` `compare_exchange_weak[*, failure_ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6)), success_ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](self, mut expected: SIMD[dtype, 1], desired: SIMD[dtype, 1]) -> Bool` Atomically compares the self value with that of the expected value. If the values are equal, then the self value is replaced with the desired value and True is returned. Otherwise, False is returned and the expected value is rewritten with the self value. **Parameters:** * ​failure\_ordering (`Consistency`): The memory ordering for the failure case. * ​success\_ordering (`Consistency`): The memory ordering for the success case. **Args:** * ​expected (`SIMD[dtype, 1]`): The expected value. * ​desired (`SIMD[dtype, 1]`): The desired value. **Returns:** True if self == expected and False otherwise. ### `max` `static max[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], rhs: SIMD[dtype, 1])` Performs atomic in-place max on the pointer. Atomically replaces the current value pointed to by `ptr` with the result of max of the value and arg. The operation is a read-modify-write operation performed according to sequential consistency semantics. **Constraints:** The input type must be either integral or floating-point type. **Parameters:** * ​ordering (`Consistency`): The memory ordering. 
**Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The source pointer. * ​rhs (`SIMD[dtype, 1]`): Value to max. `max[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](self, rhs: SIMD[dtype, 1])` Performs atomic in-place max. Atomically replaces the current value with the result of max of the value and arg. The operation is a read-modify-write operation performed according to sequential consistency semantics. **Constraints:** The input type must be either integral or floating-point type. **Parameters:** * ​ordering (`Consistency`): The memory ordering. **Args:** * ​rhs (`SIMD[dtype, 1]`): Value to max. ### `min` `static min[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], rhs: SIMD[dtype, 1])` Performs atomic in-place min on the pointer. Atomically replaces the current value pointed to by `ptr` with the result of min of the value and arg. The operation is a read-modify-write operation performed according to sequential consistency semantics. **Constraints:** The input type must be either integral or floating-point type. **Parameters:** * ​ordering (`Consistency`): The memory ordering. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The source pointer. * ​rhs (`SIMD[dtype, 1]`): Value to min. `min[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](self, rhs: SIMD[dtype, 1])` Performs atomic in-place min. Atomically replaces the current value with the result of min of the value and arg. The operation is a read-modify-write operation performed according to sequential consistency semantics. **Constraints:** The input type must be either integral or floating-point type. **Parameters:** * ​ordering (`Consistency`): The memory ordering. **Args:** * ​rhs (`SIMD[dtype, 1]`): Value to min. --- ## Consistency `@register_passable(trivial)` `struct Consistency` Represents the consistency model for atomic operations. The class provides a set of constants that represent different consistency models for atomic operations. Attributes: NOT\_ATOMIC: Not atomic. UNORDERED: Unordered. MONOTONIC: Monotonic. ACQUIRE: Acquire. RELEASE: Release. ACQUIRE\_RELEASE: Acquire-release. SEQUENTIAL: Sequentially consistent. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ACQUIRE` `alias ACQUIRE = Consistency(__init__[__mlir_type.!pop.int_literal](3))` Acquire. ### `ACQUIRE_RELEASE` `alias ACQUIRE_RELEASE = Consistency(__init__[__mlir_type.!pop.int_literal](5))` Acquire-release. ### `MONOTONIC` `alias MONOTONIC = Consistency(__init__[__mlir_type.!pop.int_literal](2))` Monotonic. ### `NOT_ATOMIC` `alias NOT_ATOMIC = Consistency(__init__[__mlir_type.!pop.int_literal](0))` Not atomic. ### `RELEASE` `alias RELEASE = Consistency(__init__[__mlir_type.!pop.int_literal](4))` Release. ### `SEQUENTIAL` `alias SEQUENTIAL = Consistency(__init__[__mlir_type.!pop.int_literal](6))` Sequentially consistent. ### `UNORDERED` `alias UNORDERED = Consistency(__init__[__mlir_type.!pop.int_literal](1))` Unordered. ## Methods ### `__init__` `__init__(value: SIMD[uint8, 1]) -> Self` Constructs a new Consistency object. 
**Args:** * ​value (`SIMD[uint8, 1]`): The value of the consistency model. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compares two Consistency objects for equality. **Args:** * ​other (`Self`): The other Consistency object to compare with. **Returns:** True if the objects are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compares two Consistency objects for inequality. **Args:** * ​other (`Self`): The other Consistency object to compare with. **Returns:** True if the objects are not equal, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Checks if the Consistency object is the same as another. **Args:** * ​other (`Self`): The other Consistency object to compare with. **Returns:** True if the objects are the same, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Checks if the Consistency object is not the same as another. **Args:** * ​other (`Self`): The other Consistency object to compare with. **Returns:** True if the objects are not the same, False otherwise. ### `__mlir_attr` `__mlir_attr(self) -> !kgen.deferred` Returns the MLIR attribute representation of the Consistency object. **Returns:** The MLIR attribute representation of the Consistency object. --- ## atomic Implements the `Atomic` struct. You can import these APIs from the `os` package. For example: ```mojo from os import Atomic ``` ## Structs * [​`Atomic`](/mojo/stdlib/os/atomic/Atomic): Represents a value with atomic operations. * [​`Consistency`](/mojo/stdlib/os/atomic/Consistency): Represents the consistency model for atomic operations. --- ## getenv `getenv(owned name: String, default: String = __init__[__mlir_type.!kgen.string]("")) -> String` Returns the value of the given environment variable. **Constraints:** The function only works on macOS or Linux and returns an empty string otherwise. **Args:** * ​name (`String`): The name of the environment variable. * ​default (`String`): The default value to return if the environment variable doesn't exist. **Returns:** The value of the environment variable. --- ## env Provides functions for working with environment variables. You can import these APIs from the `os` package. For example: ```mojo from os import setenv ``` ## Functions * [​`getenv`](/mojo/stdlib/os/env/getenv): Returns the value of the given environment variable. * [​`setenv`](/mojo/stdlib/os/env/setenv): Changes or adds an environment variable. * [​`unsetenv`](/mojo/stdlib/os/env/unsetenv): Unsets an environment variable. --- ## setenv `setenv(owned name: String, owned value: String, overwrite: Bool = True) -> Bool` Changes or adds an environment variable. **Constraints:** The function only works on macOS or Linux and returns False otherwise. **Args:** * ​name (`String`): The name of the environment variable. * ​value (`String`): The value of the environment variable. * ​overwrite (`Bool`): If an environment variable with the given name already exists, its value is not changed unless `overwrite` is True. **Returns:** False if the name is empty or contains an `=` character. In any other case, True is returned. --- ## unsetenv `unsetenv(owned name: String) -> Bool` Unsets an environment variable. **Args:** * ​name (`String`): The name of the environment variable. **Returns:** True if unsetting the variable succeeded. Otherwise, False is returned. --- ## fstat Implements file system status operations. You can import these APIs from the `os` package. 
For example: ```mojo from os import stat ``` ## Structs * [​`stat_result`](/mojo/stdlib/os/fstat/stat_result): Object whose fields correspond to the members of the stat structure. ## Functions * [​`lstat`](/mojo/stdlib/os/fstat/lstat): Get the status of a file or a file descriptor (similar to stat, but does not follow symlinks). * [​`stat`](/mojo/stdlib/os/fstat/stat): Get the status of a file or a file descriptor. --- ## lstat `lstat[PathLike: PathLike](path: PathLike) -> stat_result` Get the status of a file or a file descriptor (similar to stat, but does not follow symlinks). **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. **Returns:** Returns the stat\_result on the path. --- ## stat `stat[PathLike: PathLike](path: PathLike) -> stat_result` Get the status of a file or a file descriptor. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. **Returns:** Returns the stat\_result on the path. --- ## stat_result `struct stat_result` Object whose fields correspond to the members of the stat structure. ## Fields * ​st\_mode (`Int`): File mode: file type and file mode bits (permissions). * ​st\_ino (`Int`): Platform dependent, but if non-zero, uniquely identifies the file for a given value of st\_dev. * ​st\_dev (`Int`): Identifier of the device on which this file resides. * ​st\_nlink (`Int`): Number of hard links. * ​st\_uid (`Int`): User identifier of the file owner. * ​st\_gid (`Int`): Group identifier of the file owner. * ​st\_size (`Int`): Size of the file in bytes, if it is a regular file or a symbolic link. * ​st\_atimespec (`_CTimeSpec`): Time of file most recent access. * ​st\_mtimespec (`_CTimeSpec`): Time of file most recent modification. * ​st\_ctimespec (`_CTimeSpec`): Time of file most recent change. * ​st\_birthtimespec (`_CTimeSpec`): Time of file creation. * ​st\_blocks (`Int`): Number of 512-byte blocks allocated for file. * ​st\_blksize (`Int`): Preferred blocksize for efficient file system I/O. * ​st\_rdev (`Int`): Type of device if an inode device. * ​st\_flags (`Int`): User defined flags for file. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self, *, st_mode: Int, st_ino: Int, st_dev: Int, st_nlink: Int, st_uid: Int, st_gid: Int, st_size: Int, st_atimespec: _CTimeSpec, st_mtimespec: _CTimeSpec, st_ctimespec: _CTimeSpec, st_birthtimespec: _CTimeSpec, st_blocks: Int, st_blksize: Int, st_rdev: Int, st_flags: Int)` Initialize the stat\_result structure. **Args:** * ​st\_mode (`Int`): File mode: file type and file mode bits (permissions). * ​st\_ino (`Int`): Unique identifier for a file. * ​st\_dev (`Int`): Identifier of the device on which this file resides. * ​st\_nlink (`Int`): Number of hard links. * ​st\_uid (`Int`): User identifier of the file owner. * ​st\_gid (`Int`): Group identifier of the file owner. * ​st\_size (`Int`): Size of the file (bytes), if it is a file or a symlink. * ​st\_atimespec (`_CTimeSpec`): Time of file most recent access. * ​st\_mtimespec (`_CTimeSpec`): Time of file most recent modification. * ​st\_ctimespec (`_CTimeSpec`): Time of file most recent change. * ​st\_birthtimespec (`_CTimeSpec`): Time of file creation. * ​st\_blocks (`Int`): Number of 512-byte blocks allocated for file. 
* ​st\_blksize (`Int`): Preferred blocksize for efficient file system I/O. * ​st\_rdev (`Int`): Type of device if an inode device. * ​st\_flags (`Int`): User defined flags for file. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this stat\_result to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__str__` `__str__(self) -> String` Constructs a string representation of stat\_result. **Returns:** A string representation of stat\_result. ### `__repr__` `__repr__(self) -> String` Constructs a representation of stat\_result. **Returns:** A representation of stat\_result. --- ## os Provides access to operating-system dependent functionality. The types and functions in this package primarily provide operating-system independent access to operating-system dependent features, such as file systems and environment variables. For accessing files, see the built-in [`open()`](/mojo/stdlib/builtin/file/open) function and the [`file`](/mojo/stdlib/builtin/file/) module. For manipulating file system paths, see the [`os.path`](/mojo/stdlib/os/path/) package for OS-independent path manipulation functions and the `pathlib` package for the [`Path`](/mojo/stdlib/pathlib/path/Path) struct, an abstraction for handling paths. ## Packages * [​`path`](/mojo/stdlib/os/path/): Provides a set of operating-system independent functions for manipulating file system paths. ## Modules * [​`atomic`](/mojo/stdlib/os/atomic/): Implements the `Atomic` struct. * [​`env`](/mojo/stdlib/os/env/): Provides functions for working with environment variables. * [​`fstat`](/mojo/stdlib/os/fstat/): Implements file system status operations. * [​`os`](/mojo/stdlib/os/os/): Provides functions to access operating-system dependent functionality, including file system operations. * [​`pathlike`](/mojo/stdlib/os/pathlike/): Implements the `PathLike` trait. --- ## abort `abort[result: AnyType = None]() -> result` Calls a target dependent trap instruction if available. **Parameters:** * ​result (`AnyType`): The result type. **Returns:** A null result type. `abort[result: AnyType = None](message: String) -> result` Calls a target dependent trap instruction if available. **Parameters:** * ​result (`AnyType`): The result type. **Args:** * ​message (`String`): The message to include when aborting. **Returns:** A null result type. --- ## getuid `getuid() -> Int` Retrieve the user ID of the calling process. **Constraints:** This function is constrained to run on Linux or macOS operating systems only. **Returns:** The user ID of the calling process. --- ## os Provides functions to access operating-system dependent functionality, including file system operations. You can import a method from the `os` package. For example: ```mojo from os import listdir ``` ## Aliases ### `SEEK_CUR` `alias SEEK_CUR = __init__[__mlir_type.!pop.int_literal](1)` Seek from the current position. ### `SEEK_END` `alias SEEK_END = __init__[__mlir_type.!pop.int_literal](2)` Seek from the end of the file. ### `SEEK_SET` `alias SEEK_SET = __init__[__mlir_type.!pop.int_literal](0)` Seek from the beginning of the file. ### `sep` `alias sep = "\\".__merge_with__[__mlir_type.!kgen.string,AnyStruct[::StringLiteral[$1]]]() if os_is_windows() else "/".__merge_with__[__mlir_type.!kgen.string,AnyStruct[::StringLiteral[$1]]]()` ## Functions * [​`abort`](/mojo/stdlib/os/os/abort): Calls a target dependent trap instruction if available. 
* [​`getuid`](/mojo/stdlib/os/os/getuid): Retrieve the user ID of the calling process. * [​`listdir`](/mojo/stdlib/os/os/listdir): Gets the list of entries contained in the path provided. * [​`makedirs`](/mojo/stdlib/os/os/makedirs): Creates a specified leaf directory along with any necessary intermediate directories that don't already exist. * [​`mkdir`](/mojo/stdlib/os/os/mkdir): Creates a directory at the specified path. * [​`remove`](/mojo/stdlib/os/os/remove): Removes the specified file. * [​`removedirs`](/mojo/stdlib/os/os/removedirs): Removes a leaf directory and all empty intermediate ones. * [​`rmdir`](/mojo/stdlib/os/os/rmdir): Removes the specified directory. * [​`unlink`](/mojo/stdlib/os/os/unlink): Removes the specified file. --- ## listdir `listdir[PathLike: PathLike](path: PathLike) -> List[String]` Gets the list of entries contained in the path provided. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. **Returns:** Returns the list of entries in the path provided. --- ## makedirs `makedirs[PathLike: PathLike](path: PathLike, mode: Int = 511, exist_ok: Bool = False)` Creates a specified leaf directory along with any necessary intermediate directories that don't already exist. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. * ​mode (`Int`): The mode to create the directory with. * ​exist\_ok (`Bool`): Ignore error if `True` and path exists (default `False`). --- ## mkdir `mkdir[PathLike: PathLike](path: PathLike, mode: Int = 511)` Creates a directory at the specified path. If the directory cannot be created, an error is raised. Absolute and relative paths are allowed; relative paths are resolved from cwd. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. * ​mode (`Int`): The mode to create the directory with. --- ## remove `remove[PathLike: PathLike](path: PathLike)` Removes the specified file. If the path is a directory or it cannot be deleted, an error is raised. Absolute and relative paths are allowed; relative paths are resolved from cwd. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the file. --- ## removedirs `removedirs[PathLike: PathLike](path: PathLike)` Removes a leaf directory and all empty intermediate ones. Directories corresponding to rightmost path segments will be pruned away until either the whole path is consumed or an error occurs. Errors during this latter phase, which occur when a directory is not empty, are ignored. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. --- ## rmdir `rmdir[PathLike: PathLike](path: PathLike)` Removes the specified directory. If the path is not a directory or it cannot be deleted, an error is raised. Absolute and relative paths are allowed; relative paths are resolved from cwd. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. --- ## unlink `unlink[PathLike: PathLike](path: PathLike)` Removes the specified file. If the path is a directory or it cannot be deleted, an error is raised. 
Absolute and relative paths are allowed; relative paths are resolved from cwd. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the file. --- ## path Provides a set of operating-system independent functions for manipulating file system paths. ## Modules * [​`path`](/mojo/stdlib/os/path/path/): Provides a set of operating-system independent functions for manipulating file system paths. --- ## basename `basename[PathLike: PathLike, //](path: PathLike) -> String` Returns the tail section of a path. ```mojo from os.path import basename basename("a/path/foo.txt") # returns "foo.txt" ``` **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to retrieve the basename from. **Returns:** The basename from the path. --- ## dirname `dirname[PathLike: PathLike, //](path: PathLike) -> String` Returns the directory component of a pathname. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to a file. **Returns:** The directory component of a pathname. --- ## exists `exists[PathLike: PathLike, //](path: PathLike) -> Bool` Return True if path exists. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to check. **Returns:** True if the path exists and is not a broken symbolic link. --- ## expanduser `expanduser[PathLike: PathLike, //](path: PathLike) -> String` Expands a tilde "\~" prefix in `path` to the user's home directory. For example, `~/folder` becomes `/home/current_user/folder`. On macOS and Linux a path starting with `~user/` will expand to the specified user's home directory, so `~user/folder` becomes `/home/user/folder`. If the home directory cannot be determined, or the `path` is not prefixed with "\~", the original path is returned unchanged. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path that is being expanded. **Returns:** The expanded path. --- ## expandvars `expandvars[PathLike: PathLike, //](path: PathLike) -> String` Replaces `${var}` or `$var` in the path with values from the current environment variables. Malformed variable names and references to non-existing variables are left unchanged. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path that is being expanded. **Returns:** The expanded path. --- ## getsize `getsize[PathLike: PathLike, //](path: PathLike) -> Int` Return the size, in bytes, of the specified path. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the file. **Returns:** The size of the path in bytes. --- ## path Provides a set of operating-system independent functions for manipulating file system paths. You can import these APIs from the `os.path` package. For example: ```mojo from os.path import isdir ``` ## Functions * [​`basename`](/mojo/stdlib/os/path/path/basename): Returns the tail section of a path. * [​`dirname`](/mojo/stdlib/os/path/path/dirname): Returns the directory component of a pathname. * [​`exists`](/mojo/stdlib/os/path/path/exists): Return True if path exists. * [​`expanduser`](/mojo/stdlib/os/path/path/expanduser): Expands a tilde "\~" prefix in `path` to the user's home directory.
* [​`expandvars`](/mojo/stdlib/os/path/path/expandvars): Replaces `${var}` or `$var` in the path with values from the current environment variables. Malformed variable names and references to non-existing variables are left unchanged. * [​`getsize`](/mojo/stdlib/os/path/path/getsize): Return the size, in bytes, of the specified path. * [​`is_absolute`](/mojo/stdlib/os/path/path/is_absolute): Return True if `path` is an absolute path name. On Unix, that means it begins with a slash. * [​`isdir`](/mojo/stdlib/os/path/path/isdir): Return True if path is an existing directory. This follows symbolic links, so both islink() and isdir() can be true for the same path. * [​`isfile`](/mojo/stdlib/os/path/path/isfile): Test whether a path is a regular file. * [​`islink`](/mojo/stdlib/os/path/path/islink): Return True if path refers to an existing directory entry that is a symbolic link. * [​`join`](/mojo/stdlib/os/path/path/join): Join two or more pathname components, inserting '/' as needed. If any component is an absolute path, all previous path components will be discarded. An empty last part will result in a path that ends with a separator. * [​`lexists`](/mojo/stdlib/os/path/path/lexists): Return True if path exists or is a broken symlink. * [​`split`](/mojo/stdlib/os/path/path/split): Split a given pathname into two components: head and tail. This is useful for separating the directory path from the filename. If the input path ends with a separator, the tail component will be empty. If there is no separator in the path, the head component will be empty, and the entire path will be considered the tail. Trailing separators in the head are stripped unless the head is the root directory. * [​`split_extension`](/mojo/stdlib/os/path/path/split_extension): Splits `path` into the root and extension. * [​`splitroot`](/mojo/stdlib/os/path/path/splitroot): Splits `path` into drive, root and tail. The tail contains anything after the root. --- ## is_absolute `is_absolute[PathLike: PathLike, //](path: PathLike) -> Bool` Return True if `path` is an absolute path name. On Unix, that means it begins with a slash. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to check. **Returns:** Return `True` if path is an absolute path name. --- ## isdir `isdir[PathLike: PathLike, //](path: PathLike) -> Bool` Return True if path is an existing directory. This follows symbolic links, so both islink() and isdir() can be true for the same path. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. **Returns:** True if the path is a directory or a link to a directory and False otherwise. --- ## isfile `isfile[PathLike: PathLike, //](path: PathLike) -> Bool` Test whether a path is a regular file. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the file. **Returns:** True if the path is a regular file. --- ## islink `islink[PathLike: PathLike, //](path: PathLike) -> Bool` Return True if path refers to an existing directory entry that is a symbolic link. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to check. **Returns:** True if the path refers to an existing symbolic link and False otherwise.
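As a quick illustration of these predicates, here is a minimal sketch that classifies a path (the `/tmp` path is purely illustrative):

```mojo
from os.path import exists, isdir, isfile, islink


def main():
    var p = String("/tmp")  # illustrative path; substitute your own
    if not exists(p):
        print(p, "does not exist (or is a broken symlink)")
    elif islink(p):
        print(p, "is a symbolic link")
    elif isdir(p):
        print(p, "is a directory")
    elif isfile(p):
        print(p, "is a regular file")
```

Because `isdir()` and `isfile()` follow symbolic links, the `islink()` check comes first to distinguish a link from its target.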
--- ## join `join(owned path: String, *paths: String) -> String` Join two or more pathname components, inserting '/' as needed. If any component is an absolute path, all previous path components will be discarded. An empty last part will result in a path that ends with a separator. **Args:** * ​path (`String`): The path to join. * ​\*paths (`String`): The paths to join. **Returns:** The joined path. --- ## lexists `lexists[PathLike: PathLike, //](path: PathLike) -> Bool` Return True if path exists or is a broken symlink. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to check. **Returns:** True if the path exists or is a broken symbolic link. --- ## split `split[PathLike: PathLike, //](path: PathLike) -> Tuple[String, String]` Split a given pathname into two components: head and tail. This is useful for separating the directory path from the filename. If the input path ends with a separator, the tail component will be empty. If there is no separator in the path, the head component will be empty, and the entire path will be considered the tail. Trailing separators in the head are stripped unless the head is the root directory. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to be split. **Returns:** A tuple containing two strings: (head, tail). --- ## split_extension `split_extension[PathLike: PathLike, //](path: PathLike) -> Tuple[String, String]` Splits `path` into the root and extension. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to be split. **Returns:** A tuple containing two strings: (root, extension). --- ## splitroot `splitroot[PathLike: PathLike, //](path: PathLike) -> Tuple[String, String, String]` Splits `path` into drive, root and tail. The tail contains anything after the root. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to be split. **Returns:** A tuple containing three strings: (drive, root, tail). --- ## PathLike A trait representing file system paths. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__fspath__` `__fspath__(self: _Self) -> String` Return the file system path representation of the object. **Returns:** The file system path representation as a string. --- ## pathlike Implements the `PathLike` trait. You can import the trait from the `os` package. For example: ```mojo from os import PathLike ``` ## Traits * [​`PathLike`](/mojo/stdlib/os/pathlike/PathLike): A trait representing file system paths. --- ## pathlib Implements the pathlib package. ## Modules * [​`path`](/mojo/stdlib/pathlib/path/): Implements `Path` and related functions. --- ## Path `struct Path` The Path object. ## Fields * ​path (`String`): The underlying path string representation. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Hashable`, `Movable`, `PathLike`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Methods ### `__init__` `__init__(out self)` Initializes a path with the current directory. `__init__(out self, path: StringSlice[origin])` Initializes a path with the provided path. **Args:** * ​path (`StringSlice[origin]`): The file system path. `@implicit` `__init__(out self, owned path: String)` Initializes a path with the provided path.
**Args:** * ​path (`String`): The file system path. `@implicit` `__init__(out self, path: StringLiteral[value])` Initializes a path with the provided path. **Args:** * ​path (`StringLiteral[value]`): The file system path. ### `__bool__` `__bool__(self) -> Bool` Checks if the path is not empty. **Returns:** True if the path length is greater than zero, and False otherwise. ### `__eq__` `__eq__(self, other: Self) -> Bool` Returns True if the two paths are equal. **Args:** * ​other (`Self`): The other path to compare against. **Returns:** True if the paths are equal and False otherwise. `__eq__(self, other: StringSlice[origin]) -> Bool` Returns True if the two paths are equal. **Args:** * ​other (`StringSlice[origin]`): The other path to compare against. **Returns:** True if the String and Path are equal, and False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Returns True if the two paths are not equal. **Args:** * ​other (`Self`): The other path to compare against. **Returns:** True if the paths are not equal and False otherwise. ### `__truediv__` `__truediv__(self, suffix: Self) -> Self` Joins two paths using the system-defined path separator. **Args:** * ​suffix (`Self`): The suffix to append to the path. **Returns:** A new path with the suffix appended to the current path. `__truediv__(self, suffix: StringSlice[origin]) -> Self` Joins two paths using the system-defined path separator. **Args:** * ​suffix (`StringSlice[origin]`): The suffix to append to the path. **Returns:** A new path with the suffix appended to the current path. ### `__itruediv__` `__itruediv__(mut self, suffix: StringSlice[origin])` Joins two paths using the system-defined path separator. **Args:** * ​suffix (`StringSlice[origin]`): The suffix to append to the path. ### `copy` `copy(self) -> Self` Copy the object. **Returns:** A copy of the value. ### `__str__` `__str__(self) -> String` Returns a string representation of the path. **Returns:** A string representation of the path. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this path to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__fspath__` `__fspath__(self) -> String` Returns a string representation of the path. **Returns:** A string representation of the path. ### `__repr__` `__repr__(self) -> String` Returns a printable representation of the path. **Returns:** A printable representation of the path. ### `__hash__` `__hash__(self) -> UInt` Hash the underlying path string using builtin hash. **Returns:** An integer value containing the hash of the path string. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with the path string value. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `stat` `stat(self) -> stat_result` Returns the stat information on the path. **Returns:** A stat\_result object containing information about the path. ### `lstat` `lstat(self) -> stat_result` Returns the lstat information on the path. This is similar to stat, but if the file is a symlink then it gives you information about the symlink rather than the target. **Returns:** A stat\_result object containing information about the path. ### `exists` `exists(self) -> Bool` Returns True if the path exists and False otherwise. **Returns:** True if the path exists on disk and False otherwise. 
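For example, a minimal sketch of constructing and checking a path (the `example.txt` file name is hypothetical):

```mojo
from pathlib import Path


def main():
    # Path() starts at the current directory; `/` joins path components.
    var p = Path() / "example.txt"
    if p.exists():
        print(p, "exists")
    else:
        print(p, "does not exist")
```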
### `expanduser` `expanduser(self) -> Self` Expands a prefixed `~` with `$HOME` on POSIX or `$USERPROFILE` on Windows. If environment variables are not set or the `path` is not prefixed with `~`, returns the `path` unmodified. **Returns:** The expanded path. ### `home` `static home() -> Self` Returns `$HOME` on POSIX or `$USERPROFILE` on Windows. If environment variables are not set, it returns `~`. **Returns:** Path to user home directory. ### `is_dir` `is_dir(self) -> Bool` Returns True if the path is a directory and False otherwise. **Returns:** True if the path points to a directory (or a link pointing to a directory). ### `is_file` `is_file(self) -> Bool` Returns True if the path is a file and False otherwise. **Returns:** True if the path points to a file (or a link pointing to a file). ### `read_text` `read_text(self) -> String` Returns content of the file. **Returns:** Contents of file as string. ### `read_bytes` `read_bytes(self) -> List[SIMD[uint8, 1]]` Returns content of the file as bytes. **Returns:** Contents of file as list of bytes. ### `write_text` `write_text[T: Writable](self, value: T)` Writes the value to the file as text. **Parameters:** * ​T (`Writable`): The type of an object conforming to the `Writable` trait. **Args:** * ​value (`T`): The value to write. ### `write_bytes` `write_bytes(self, bytes: Span[SIMD[uint8, 1], origin])` Writes bytes to the file. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The bytes to write to this file. ### `suffix` `suffix(self) -> String` The path's extension, if any. This includes the leading period. For example: '.txt'. If no extension is found, returns the empty string. **Returns:** The path's extension. ### `joinpath` `joinpath(self, *pathsegments: String) -> Self` Joins the Path using the pathsegments. **Args:** * ​\*pathsegments (`String`): The path segments. **Returns:** The path concatenation with the pathsegments using the directory separator. ### `listdir` `listdir(self) -> List[Path]` Gets the list of entries contained in the path provided. **Returns:** The list of entries in the path provided. --- ## cwd `cwd() -> Path` Gets the current directory. **Returns:** The current directory. --- ## path Implements `Path` and related functions. ## Aliases ### `DIR_SEPARATOR` `alias DIR_SEPARATOR = "\\" if os_is_windows() else "/"` The OS-dependent directory separator: `"\\"` on Windows and `"/"` otherwise. ## Structs * [​`Path`](/mojo/stdlib/pathlib/path/Path): The Path object. ## Functions * [​`cwd`](/mojo/stdlib/pathlib/path/cwd): Gets the current directory. --- ## prelude Implements the prelude package. This package provides the public entities that are automatically imported into every Mojo program. --- ## pwd Provides access to user and group information from the password database. Use the [`Passwd`](/mojo/stdlib/pwd/pwd/Passwd) type to access user account information such as user name, ID, group, and home directory. ## Modules * [​`pwd`](/mojo/stdlib/pwd/pwd/): --- ## Passwd `struct Passwd` Represents user account information retrieved from the user password database related to a user ID. ## Fields * ​pw\_name (`String`): User name. * ​pw\_passwd (`String`): User password. * ​pw\_uid (`Int`): User ID. * ​pw\_gid (`Int`): Group ID. * ​pw\_gecos (`String`): Real name or comment field. * ​pw\_dir (`String`): Home directory. * ​pw\_shell (`String`): Shell program.
## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this Passwd struct to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__str__` `__str__(self) -> String` Gets the Passwd struct as a string. **Returns:** A compact string of the Passwd struct. ### `__repr__` `__repr__(self) -> String` Gets the Passwd struct as a string. **Returns:** A compact string representation of Passwd struct. --- ## getpwnam `getpwnam(owned name: String) -> Passwd` Retrieves the password database entry for a given user name. **Constraints:** This function is constrained to run on Linux or macOS operating systems only. **Args:** * ​name (`String`): The name of the user to retrieve the password entry for. **Returns:** An object containing the user's account information, including login name, encrypted password, user ID, group ID, real name, home directory, and shell program. **Raises:** If the user name does not exist or there is an error retrieving the information. --- ## getpwuid `getpwuid(uid: Int) -> Passwd` Retrieve the password database entry for a given user ID. **Constraints:** This function is constrained to run on Linux or macOS operating systems only. **Args:** * ​uid (`Int`): The user ID for which to retrieve the password database entry. **Returns:** An object containing the user's account information, including login name, encrypted password, user ID, group ID, real name, home directory, and shell program. **Raises:** If the user ID does not exist or there is an error retrieving the information. --- ## pwd ## Structs * [​`Passwd`](/mojo/stdlib/pwd/pwd/Passwd): Represents user account information retrieved from the user password database related to a user ID. ## Functions * [​`getpwnam`](/mojo/stdlib/pwd/pwd/getpwnam): Retrieves the password database entry for a given user name. * [​`getpwuid`](/mojo/stdlib/pwd/pwd/getpwuid): Retrieve the password database entry for a given user ID. --- ## PyMojoObject `struct PyMojoObject[T: AnyType]` Storage backing a PyObject\* wrapping a Mojo value. This struct represents the C-level layout of a Python object that contains a wrapped Mojo value. It must be ABI-compatible with CPython's PyObject structure to enable seamless interoperability between Mojo and Python. The struct follows Python's object model where all Python objects begin with a PyObject header (ob\_base), followed by type-specific data. In this case, the type-specific data is a Mojo value of type T. ## Parameters * ​T (`AnyType`): The Mojo type being wrapped. Can be any type that satisfies `AnyType`. ## Fields * ​ob\_base (`PyObject`): The standard Python object header containing reference count and type information. This must be the first field to maintain ABI compatibility with Python's object layout. All Python objects begin with this header structure. * ​mojo\_value (`T`): The actual Mojo value being wrapped and exposed to Python. This field stores the Mojo data that Python code can interact with through the registered type methods and bindings. ## Implemented traits `AnyType`, `UnknownDestructibility` --- ## PythonModuleBuilder `struct PythonModuleBuilder` A builder for creating Python modules with Mojo function and type bindings.
This builder provides a high-level API for declaring Python bindings for Mojo functions and types within a Python module. It manages the registration of functions, types, and their associated metadata, then finalizes everything into a complete Python module object. The builder follows a declarative pattern where you: 1. Create a builder instance with a module name 2. Add function bindings using `def_function()`, `def_py_function()`, `def_py_c_function()` 3. Add type bindings using `add_type[T]()` and configure them 4. Call `finalize()` to finish building the Python module. Example: ```mojo from python.bindings import PythonModuleBuilder var builder = PythonModuleBuilder("my_module") builder.def_function[my_func]("my_func", "Documentation for my_func") _ = builder.add_type[MyType]("MyType").def_method[my_method]("my_method") var module = builder.finalize() ``` Note: After calling `finalize()`, the builder's internal state is cleared and it should not be reused for creating additional modules. TODO: This should be enforced programmatically in the future. ## Fields * ​module (`PythonObject`): The Python module being built. * ​functions (`List[PyMethodDef]`): List of function definitions that will be exposed in the module. * ​type\_builders (`List[PythonTypeBuilder]`): List of type builders for types that will be exposed in the module. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, name: StringSlice[StaticConstantOrigin])` Construct a Python module builder with the given module name. **Args:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the module. **Raises:** If the module creation fails. `__init__(out self, module: PythonObject)` Construct a Python module builder with the given module. **Args:** * ​module (`PythonObject`): The module to build. ### `add_type` `add_type[T: Movable & Defaultable & Representable](mut self, type_name: StringSlice[StaticConstantOrigin]) -> ref [*[0,0].type_builders] PythonTypeBuilder` Add a type to the module and return a builder for it. **Parameters:** * ​T (`Movable & Defaultable & Representable`): The mojo type to bind in the module. **Args:** * ​type\_name (`StringSlice[StaticConstantOrigin]`): The name of the type to expose in the module. **Returns:** A reference to a type builder registered in the module builder. ### `def_py_c_function` `def_py_c_function(mut self, func: fn(PyObjectPtr, PyObjectPtr) -> PyObjectPtr, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PyCFunction signature in the module. **Args:** * ​func (`fn(PyObjectPtr, PyObjectPtr) -> PyObjectPtr`): The function to declare a binding for. * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. ### `def_py_function` `def_py_function[func: fn(mut PythonObject, mut PythonObject) -> PythonObject](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PyFunction signature in the module. **Parameters:** * ​func (`fn(mut PythonObject, mut PythonObject) -> PythonObject`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. 
* ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_py_function[func: fn(mut PythonObject, mut PythonObject) raises -> PythonObject](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PyFunctionRaising signature in the module. **Parameters:** * ​func (`fn(mut PythonObject, mut PythonObject) raises -> PythonObject`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. ### `def_function` `def_function[func_type: AnyTrivialRegType, //, func: PyObjectFunction[func_type, False]](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. These signatures can have any number of positional PythonObject arguments up to 3, can optionally return a PythonObject, and can raise. Example signature types: ```mojo alias F1 = fn (mut PythonObject) raises -> PythonObject alias F2 = fn (mut PythonObject, PythonObject) -> PythonObject alias F3 = fn (mut PythonObject, PythonObject, mut PythonObject) ``` **Parameters:** * ​func\_type (`AnyTrivialRegType`): The type of the function to declare a binding for. * ​func (`PyObjectFunction[func_type, False]`): The function to declare a binding for. Users can pass their function directly, and it will be implicitly converted to a PyObjectFunction if and only if its signature is supported. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. ### `finalize` `finalize(mut self) -> PythonObject` Finalize the module builder, creating the module object. All types and functions added to the builder will be built and exposed in the module. After calling this method, the builder's internal state is cleared and it should not be reused for creating additional modules. **Returns:** The finalized Python module containing all registered functions and types. **Raises:** If the module creation fails or if we fail to add any of the declared functions or types to the module. --- ## PythonTypeBuilder `struct PythonTypeBuilder` A builder for a Python 'type' binding. This is typically used to build a type description of a `PyMojoObject[T]`. This builder is used to declare method bindings for a Python type, and then create the type binding. Finalizing a builder created with `PythonTypeBuilder.bind[T]()` will globally register the resulting Python 'type' object as the single canonical type object for the Mojo type `T`. Subsequent attempts to register a Python type for `T` will raise an exception. Registering a Python type object for `T` is necessary to be able to construct a `PythonObject` from an instance of `T`, or to downcast an existing `PythonObject` to a pointer to the inner `T` value. ## Fields * ​type\_name (`StringSlice[StaticConstantOrigin]`): The name the type will be exposed as in the Python module. * ​basicsize (`Int`): The required allocation size to hold an instance of this type as a Python object. * ​methods (`List[PyMethodDef]`): List of method definitions that will be exposed on the Python type.
## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, type_name: StringSlice[StaticConstantOrigin], *, basicsize: Int)` Construct a new builder for a Python type binding. **Args:** * ​type\_name (`StringSlice[StaticConstantOrigin]`): The name the type will be exposed as in the Python module. * ​basicsize (`Int`): The required allocation size to hold an instance of this type as a Python object. ### `bind` `static bind[T: Movable & Defaultable & Representable](type_name: StringSlice[StaticConstantOrigin]) -> Self` Construct a new builder for a Python type that binds a Mojo type. **Parameters:** * ​T (`Movable & Defaultable & Representable`): The mojo type to bind. **Args:** * ​type\_name (`StringSlice[StaticConstantOrigin]`): The name the type will be exposed as in the Python module. **Returns:** A new type builder instance. ### `finalize` `finalize(mut self, module: PythonObject)` Finalize the builder and add the created type to a Python module. This method completes the type building process by calling the parameterless `finalize()` method to create the Python type object, then automatically adds the resulting type to the specified Python module using the builder's configured type name. After successful completion, the builder's method list is cleared to prevent accidental reuse. This is a convenience method that combines type finalization and module registration in a single operation, which is the most common use case when creating Python-accessible Mojo types. Note: After calling this method, the builder's internal state is modified (methods list is cleared), so the builder should not be reused for creating additional type objects. If you need the type object for further operations, use the parameterless `finalize()` method instead and manually add it to the module. **Args:** * ​module (`PythonObject`): The Python module to which the finalized type will be added. The type will be accessible from Python code that imports this module using the name specified during builder construction. **Raises:** If the type object creation fails (see `finalize()` for details) or if adding the type to the module fails, typically due to name conflicts or module state issues. ### `def_py_c_method` `def_py_c_method(mut self, method: fn(PyObjectPtr, PyObjectPtr) -> PyObjectPtr, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PyObjectPtr signature for the type. **Args:** * ​method (`fn(PyObjectPtr, PyObjectPtr) -> PyObjectPtr`): The method to declare a binding for. * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. ### `def_py_method` `def_py_method[method: fn(mut PythonObject, mut PythonObject) -> PythonObject](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PyObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject, mut PythonObject) -> PythonObject`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. 
* ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. `def_py_method[method: fn(mut PythonObject, mut PythonObject) raises -> PythonObject](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PyObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject, mut PythonObject) raises -> PythonObject`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. ### `def_method` `def_method[method_type: AnyTrivialRegType, //, method: PyObjectFunction[method_type, True]](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PythonObject signature for the type. These signatures can have any number of positional PythonObject arguments up to 3 (including self), can optionally return a PythonObject, and can raise. Example signature types: ```mojo alias F1 = fn (mut PythonObject) raises -> PythonObject alias F2 = fn (mut PythonObject, PythonObject) -> PythonObject alias F3 = fn (mut PythonObject, PythonObject, mut PythonObject) ``` **Parameters:** * ​method\_type (`AnyTrivialRegType`): The type of the method to declare a binding for. * ​method (`PyObjectFunction[method_type, True]`): The method to declare a binding for. Users can pass their function directly, and it will be implicitly converted to a PyObjectFunction if and only if its signature is supported. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. --- ## check_arguments_arity `check_arguments_arity(arity: Int, args: PythonObject)` Validate that the provided arguments match the expected function arity. This function checks if the number of arguments in the provided tuple matches the expected arity for a function call. If the counts don't match, it raises a descriptive error message similar to Python's built-in TypeError messages. **Args:** * ​arity (`Int`): The expected number of arguments for the function. * ​args (`PythonObject`): A tuple containing the actual arguments passed to the function. **Raises:** Error: If the argument count doesn't match the expected arity. The error message follows Python's convention for TypeError messages, indicating whether too few or too many arguments were provided. `check_arguments_arity(arity: Int, args: PythonObject, func_name: StringSlice[origin])` Validate that the provided arguments match the expected function arity. This function checks if the number of arguments in the provided tuple matches the expected arity for a function call. If the counts don't match, it raises a descriptive error message similar to Python's built-in TypeError messages. **Args:** * ​arity (`Int`): The expected number of arguments for the function. * ​args (`PythonObject`): A tuple containing the actual arguments passed to the function. 
* ​func\_name (`StringSlice[origin]`): The name of the function being called, used in error messages to provide better debugging information. **Raises:** Error: If the argument count doesn't match the expected arity. The error message follows Python's convention for TypeError messages, indicating whether too few or too many arguments were provided, along with the specific function name. --- ## bindings ## Aliases ### `MOJO_PYTHON_TYPE_OBJECTS` `alias MOJO_PYTHON_TYPE_OBJECTS = _Global["MOJO_PYTHON_TYPE_OBJECTS", Dict[StringSlice[StaticConstantOrigin], PythonObject], _init_python_type_objects]` Mapping of Mojo type identifiers to the unique `PyTypeObject*` that binds that Mojo type to this CPython interpreter instance. ### `Typed_initproc` `alias Typed_initproc = fn(PyObjectPtr, PythonObject, PyObjectPtr) -> SIMD[int32, 1]` ### `Typed_newfunc` `alias Typed_newfunc = fn(UnsafePointer[PyTypeObject], PythonObject, PyObjectPtr) -> PyObjectPtr` ## Structs * [​`PyMojoObject`](/mojo/stdlib/python/bindings/PyMojoObject): Storage backing a PyObject\* wrapping a Mojo value. * [​`PythonModuleBuilder`](/mojo/stdlib/python/bindings/PythonModuleBuilder): A builder for creating Python modules with Mojo function and type bindings. * [​`PythonTypeBuilder`](/mojo/stdlib/python/bindings/PythonTypeBuilder): A builder for a Python 'type' binding. ## Functions * [​`check_arguments_arity`](/mojo/stdlib/python/bindings/check_arguments_arity): Validate that the provided arguments match the expected function arity. * [​`lookup_py_type_object`](/mojo/stdlib/python/bindings/lookup_py_type_object): Retrieve a reference to the unique Python type describing Python objects containing Mojo values of type `T`. --- ## lookup_py_type_object `lookup_py_type_object[T: AnyType]() -> PythonObject` Retrieve a reference to the unique Python type describing Python objects containing Mojo values of type `T`. This function looks up the Python type object that was previously registered for the Mojo type `T` using a `PythonTypeBuilder`. The returned type object can be used to create Python objects that wrap Mojo values of type `T`. **Parameters:** * ​T (`AnyType`): The Mojo type to look up. **Returns:** A `PythonObject` representing the Python type object that binds the Mojo type `T` to the current CPython interpreter instance. **Raises:** If no `PythonTypeBuilder` was ever finalized for type `T`, or if no Python type object has been registered for the provided type identifier. --- ## python Implements the python package. ## Modules * [​`bindings`](/mojo/stdlib/python/bindings/): * [​`python`](/mojo/stdlib/python/python/): Implements Python interoperability. * [​`python_object`](/mojo/stdlib/python/python_object/): Implements PythonObject. --- ## Python `struct Python` Provides methods that help you use Python code in Mojo. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Default constructor. ### `__copyinit__` `__copyinit__(out self, existing: Self)` Copy constructor. **Args:** * ​existing (`Self`): The existing instance to copy from. ### `cpython` `cpython(self) -> ref [StaticConstantOrigin] CPython` Handle to the low-level C API of the CPython interpreter present in the current process. **Returns:** Handle to the CPython interpreter instance in the current process. ### `eval` `eval(self, owned code: String) -> Bool` Executes the given Python code. **Args:** * ​code (`String`): The Python code to execute.
**Returns:** `True` if the code executed successfully or `False` if the code raised an exception. ### `evaluate` `static evaluate(owned expr: String, file: Bool = False, name: StringSlice[StaticConstantOrigin] = "__main__") -> PythonObject` Executes the given Python code. **Args:** * ​expr (`String`): The Python expression to evaluate. * ​file (`Bool`): Evaluate as a file and return the module. * ​name (`StringSlice[StaticConstantOrigin]`): The name of the module (most relevant if `file` is True). **Returns:** `PythonObject` containing the result of the evaluation. ### `add_to_path` `static add_to_path(dir_path: StringSlice[origin])` Adds a directory to the Python path. This might be necessary to import a Python module via `import_module()`. For example: ```mojo from python import Python # Specify path to `mypython.py` module Python.add_to_path("path/to/module") var mypython = Python.import_module("mypython") var c = mypython.my_algorithm(2, 3) ``` **Args:** * ​dir\_path (`StringSlice[origin]`): The path to a Python module you want to import. ### `import_module` `static import_module(owned module: String) -> PythonObject` Imports a Python module. This provides you with a module object you can use just like you would in Python. For example: ```mojo from python import Python # This is equivalent to Python's `import numpy as np` np = Python.import_module("numpy") a = np.array([1, 2, 3]) ``` **Args:** * ​module (`String`): The Python module name. This module must be visible from the list of available Python paths (you might need to add the module's path with `add_to_path()`). **Returns:** The Python module. ### `create_module` `static create_module(name: StringSlice[StaticConstantOrigin]) -> PythonObject` Creates a Python module using the provided name. TODO: Allow specifying a docstring to attach to the module upon creation, or add one lazily. **Args:** * ​name (`StringSlice[StaticConstantOrigin]`): The Python module name. **Returns:** The Python module. ### `add_functions` `static add_functions(module: PythonObject, owned functions: List[PyMethodDef])` Adds functions to a Python module object. **Args:** * ​module (`PythonObject`): The Python module object. * ​functions (`List[PyMethodDef]`): List of function data. **Raises:** If we fail to add the functions to the module. ### `add_object` `static add_object(module: PythonObject, owned name: String, value: PythonObject)` Add a new object to `module` with the given name and value. The provided object can be any type of Python object: an instance, a type object, a function, etc. The added value will be inserted into the `__dict__` of the provided module. **Args:** * ​module (`PythonObject`): The Python module to modify. * ​name (`String`): The name of the new object. * ​value (`PythonObject`): The Python object value. ### `dict` `static dict[V: PythonConvertible & Copyable & Movable = PythonObject](*, owned **kwargs: V) -> PythonObject` Construct a Python dictionary from keyword arguments. **Parameters:** * ​V (`PythonConvertible & Copyable & Movable`): The type of the values in the dictionary. Must implement the `PythonConvertible`, `Copyable`, and `Movable` traits. **Args:** * ​\*\*kwargs (`V`): The keyword arguments to construct the dictionary with. **Returns:** The constructed Python dictionary. **Raises:** On failure to construct the dictionary or convert the values to Python objects.
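A minimal sketch of the keyword-argument form (the keys and `Int` values here are purely illustrative):

```mojo
from python import Python


def main():
    # Equivalent to Python's dict(a=1, b=2).
    var d = Python.dict(a=1, b=2)
    print(d["a"])  # prints 1
```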
`static dict[K: PythonConvertible & Copyable & Movable = PythonObject, V: PythonConvertible & Copyable & Movable = PythonObject](tuples: Span[Tuple[K, V], origin]) -> PythonObject` Construct a Python dictionary from a list of key-value tuples. **Parameters:** * ​K (`PythonConvertible & Copyable & Movable`): The type of the keys in the dictionary. Must implement the `PythonConvertible`, `Copyable`, and `Movable` traits. * ​V (`PythonConvertible & Copyable & Movable`): The type of the values in the dictionary. Must implement the `PythonConvertible`, `Copyable`, and `Movable` traits. **Args:** * ​tuples (`Span[Tuple[K, V], origin]`): The list of key-value tuples to construct the dictionary with. **Returns:** The constructed Python dictionary. **Raises:** On failure to construct the dictionary or convert the keys or values to Python objects. ### `list` `static list[T: PythonConvertible & Copyable & Movable](values: Span[T, origin]) -> PythonObject` Construct a Python list from a span of values. **Parameters:** * ​T (`PythonConvertible & Copyable & Movable`): The span element type. **Args:** * ​values (`Span[T, origin]`): The values to initialize the list with. **Returns:** A PythonObject representing the list. `static list[*Ts: PythonConvertible & Copyable](owned *values: *Ts) -> PythonObject` Construct a Python list of objects. **Parameters:** * ​\*Ts (`PythonConvertible & Copyable`): The list element types. **Args:** * ​\*values (`*Ts`): The values to initialize the list with. **Returns:** The constructed Python list. ### `tuple` `static tuple[*Ts: PythonConvertible & Copyable](owned *values: *Ts) -> PythonObject` Construct a Python tuple of objects. **Parameters:** * ​\*Ts (`PythonConvertible & Copyable`): The tuple element types. **Args:** * ​\*values (`*Ts`): The values to initialize the tuple with. **Returns:** The constructed Python tuple. ### `as_string_slice` `as_string_slice(self, str_obj: PythonObject) -> StringSlice[MutableAnyOrigin]` Return a string representing the given Python object. **Args:** * ​str\_obj (`PythonObject`): The Python object. **Returns:** Mojo string representing the given Python object. ### `type` `static type(obj: PythonObject) -> PythonObject` Return the type of this PythonObject. **Args:** * ​obj (`PythonObject`): The PythonObject we want the type of. **Returns:** A PythonObject that holds the type object. ### `none` `static none() -> PythonObject` Get a `PythonObject` representing `None`. **Returns:** `PythonObject` representing `None`. ### `str` `static str(obj: PythonObject) -> PythonObject` Convert a PythonObject to a Python `str`. **Args:** * ​obj (`PythonObject`): The PythonObject to convert. **Returns:** A Python `str` object. **Raises:** An error if the conversion failed. ### `int` `static int(obj: PythonObject) -> PythonObject` Convert a PythonObject to a Python `int` (i.e. arbitrary precision integer). **Args:** * ​obj (`PythonObject`): The PythonObject to convert. **Returns:** A PythonObject representing the result of the conversion to `int`. **Raises:** If the conversion to `int` fails. ### `float` `static float(obj: PythonObject) -> PythonObject` Convert a PythonObject to a Python `float` object. **Args:** * ​obj (`PythonObject`): The PythonObject to convert. **Returns:** A Python `float` object. **Raises:** If the conversion fails. ### `py_long_as_ssize_t` `static py_long_as_ssize_t(obj: PythonObject) -> Int` Get the value of a Python `long` object. **Args:** * ​obj (`PythonObject`): The Python `long` object.
**Returns:** The value of the `long` object as a `Py_ssize_t`. **Raises:** If `obj` is not a Python `long` object, or if the `long` object value overflows `Py_ssize_t`. ### `is_true` `static is_true(obj: PythonObject) -> Bool` Check if the PythonObject is truthy. **Args:** * ​obj (`PythonObject`): The PythonObject to check. **Returns:** True if the PythonObject is truthy and False otherwise. **Raises:** If the boolean value of the PythonObject cannot be determined. --- ## python Implements Python interoperability. You can import these APIs from the `python` package. For example: ```mojo from python import Python ``` ## Structs * [​`Python`](/mojo/stdlib/python/python/Python): Provides methods that help you use Python code in Mojo. --- ## ConvertibleFromPython Denotes a type that can attempt construction from a read-only Python object. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self: _Self, obj: PythonObject)` Attempt to construct an instance of this object from a read-only Python value. **Args:** * ​obj (`PythonObject`): The Python object to convert from. **Raises:** If conversion was not successful. ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. --- ## PythonConvertible A trait that indicates a type can be converted to a PythonObject, and that specifies the behavior with a `to_python_object` method. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `to_python_object` `to_python_object(owned self: _Self) -> PythonObject` Convert a value to a PythonObject. **Returns:** A PythonObject representing the value. **Raises:** If the conversion to a PythonObject failed. --- ## PythonObject `@register_passable` `struct PythonObject` A Python object. ## Fields * ​py\_object (`PyObjectPtr`): A pointer to the underlying Python object. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Defaultable`, `Movable`, `PythonConvertible`, `SizedRaising`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Initialize the object with a `None` value. `__init__(*, from_owned_ptr: PyObjectPtr) -> Self` Initialize this object from an owned reference-counted Python object pointer. Ownership of the reference will be assumed by `PythonObject`. **Args:** * ​from\_owned\_ptr (`PyObjectPtr`): The `PyObjectPtr` to take ownership of. `__init__(*, from_borrowed_ptr: PyObjectPtr) -> Self` Initialize this object from a read-only reference-counted Python object pointer. The reference count of the pointee object will be incremented, and ownership of the additional reference count will be assumed by the initialized `PythonObject`. The CPython API documentation indicates the ownership semantics of the returned object on any function that returns a `PyObject*` value. The two possible annotations are: * "Return value: New reference." * "Return value: Borrowed reference." This function should be used to construct a `PythonObject` from the pointer returned by 'Borrowed reference'-type functions. **Args:** * ​from\_borrowed\_ptr (`PyObjectPtr`): A read-only reference counted pointer to a Python object. **Returns:** An owned PythonObject pointer.
`__init__[T: Movable](out self, *, owned alloc: T)` Allocate a new `PythonObject` and store a Mojo value in it. The newly allocated Python object will contain the provided Mojo `T` instance directly, without attempting conversion to an equivalent Python builtin type. Only Mojo types that have a registered Python 'type' object can be stored as a Python object. Mojo types are registered using a `PythonTypeBuilder`. **Parameters:** * ​T (`Movable`): The Mojo type of the value that the resulting Python object holds. **Args:** * ​alloc (`T`): The Mojo value to store in the new Python object. **Raises:** If no Python type object has been registered for `T` by a `PythonTypeBuilder`. `@implicit` `__init__(none: NoneType) -> Self` Initialize a none value object from a `None` literal. **Args:** * ​none (`NoneType`): None. `@implicit` `__init__(value: Bool) -> Self` Initialize the object from a bool. **Args:** * ​value (`Bool`): The boolean value. `@implicit` `__init__(integer: Int) -> Self` Initialize the object with an integer value. **Args:** * ​integer (`Int`): The integer value. `@implicit` `__init__[dtype: DType](value: SIMD[dtype, 1]) -> Self` Initialize the object with a generic scalar value. If the scalar value type is bool, it is converted to a boolean. Otherwise, it is converted to the appropriate integer or floating point type. **Parameters:** * ​dtype (`DType`): The scalar value type. **Args:** * ​value (`SIMD[dtype, 1]`): The scalar value. `@implicit` `__init__(out self, value: StringLiteral[value])` Initialize the object from a string literal. **Args:** * ​value (`StringLiteral[value]`): The string value. `@implicit` `__init__(out self, value: String)` Initialize the object from a string. **Args:** * ​value (`String`): The string value. `@implicit` `__init__(out self, string: StringSlice[origin])` Initialize the object from a string. **Args:** * ​string (`StringSlice[origin]`): The string value. **Raises:** If the string is not valid UTF-8. `@implicit` `__init__(slice: Slice) -> Self` Initialize the object from a Mojo Slice. **Args:** * ​slice (`Slice`): The slice value. `__init__[*Ts: PythonConvertible & Copyable](out self, owned *values: *Ts, *, __list_literal__: Tuple[])` Construct a Python list of objects. **Parameters:** * ​\*Ts (`PythonConvertible & Copyable`): The types of the input values. **Args:** * ​\*values (`*Ts`): The values to initialize the list with. * ​**list\_literal** (`Tuple[]`): Tell Mojo to use this method for list literals. **Returns:** The constructed Python list. `__init__[*Ts: PythonConvertible & Copyable](out self, owned *values: *Ts, *, __set_literal__: Tuple[])` Construct a Python set of objects. **Parameters:** * ​\*Ts (`PythonConvertible & Copyable`): The types of the input values. **Args:** * ​\*values (`*Ts`): The values to initialize the set with. * ​**set\_literal** (`Tuple[]`): Tell Mojo to use this method for set literals. **Returns:** The constructed Python set. `__init__(out self, owned keys: List[PythonObject], owned values: List[PythonObject], __dict_literal__: Tuple[])` Construct a Python dictionary from a list of keys and a list of values. **Args:** * ​keys (`List[PythonObject]`): The keys of the dictionary. * ​values (`List[PythonObject]`): The values of the dictionary. * ​**dict\_literal** (`Tuple[]`): Tell Mojo to use this method for dict literals. ### `__copyinit__` `__copyinit__(existing: Self) -> Self` Copy the object. This increments the underlying refcount of the existing object.
**Args:** * ​existing (`Self`): The value to copy. ### `__del__` `__del__(owned self)` Destroy the object. This decrements the underlying refcount of the pointed-to object. ### `__bool__` `__bool__(self) -> Bool` Evaluate the boolean value of the object. **Returns:** Whether the object evaluates as true. ### `__getitem__` `__getitem__(self, *args: Self) -> Self` Return the value for the given key or keys. **Args:** * ​\*args (`Self`): The key or keys to access on this object. **Returns:** The value corresponding to the given key for this object. `__getitem__(self, *args: Slice) -> Self` Return the sliced value for the given Slice or Slices. **Args:** * ​\*args (`Slice`): The Slice or Slices to apply to this object. **Returns:** The sliced value corresponding to the given Slice(s) for this object. ### `__setitem__` `__setitem__(self, *args: Self, *, value: Self)` Set the value with the given key or keys. **Args:** * ​\*args (`Self`): The key or keys to set on this object. * ​value (`Self`): The value to set. ### `__neg__` `__neg__(self) -> Self` Negative. Calls the underlying object's `__neg__` method. **Returns:** The result of prefixing this object with a `-` operator. For most numerical objects, this returns the negative. ### `__pos__` `__pos__(self) -> Self` Positive. Calls the underlying object's `__pos__` method. **Returns:** The result of prefixing this object with a `+` operator. For most numerical objects, this does nothing. ### `__invert__` `__invert__(self) -> Self` Inversion. Calls the underlying object's `__invert__` method. **Returns:** The logical inverse of this object: a bitwise representation where all bits are flipped, from zero to one, and from one to zero. ### `__lt__` `__lt__(self, rhs: Self) -> Self` Less than (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. **Raises:** If the object doesn't implement the `__lt__` method, or if it fails. ### `__le__` `__le__(self, rhs: Self) -> Self` Less than or equal (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. **Raises:** If the object doesn't implement the `__le__` method, or if it fails. ### `__eq__` `__eq__(self, rhs: Self) -> Self` Equality (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. **Raises:** If the object doesn't implement the `__eq__` method, or if it fails. ### `__ne__` `__ne__(self, rhs: Self) -> Self` Inequality (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. **Raises:** If the object doesn't implement the `__ne__` method, or if it fails. ### `__gt__` `__gt__(self, rhs: Self) -> Self` Greater than (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. **Raises:** If the object doesn't implement the `__gt__` method, or if it fails. ### `__ge__` `__ge__(self, rhs: Self) -> Self` Greater than or equal (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. 
**Raises:** If the object doesn't implement the `__ge__` method, or if it fails. ### `__is__` `__is__(self, other: Self) -> Bool` Test if the PythonObject is the `other` PythonObject, the same as `x is y` in Python. **Args:** * ​other (`Self`): The right-hand-side value in the comparison. **Returns:** True if they are the same object and False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Test if the PythonObject is not the `other` PythonObject, the same as `x is not y` in Python. **Args:** * ​other (`Self`): The right-hand-side value in the comparison. **Returns:** True if they are not the same object and False otherwise. ### `__contains__` `__contains__(self, rhs: Self) -> Bool` Contains dunder. Calls the underlying object's `__contains__` method. **Args:** * ​rhs (`Self`): Right hand value. **Returns:** True if rhs is in self. ### `__add__` `__add__(self, rhs: Self) -> Self` Addition and concatenation. Calls the underlying object's `__add__` method. **Args:** * ​rhs (`Self`): Right hand value. **Returns:** The sum or concatenated values. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Subtraction. Calls the underlying object's `__sub__` method. **Args:** * ​rhs (`Self`): Right hand value. **Returns:** The difference. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Multiplication. Calls the underlying object's `__mul__` method. **Args:** * ​rhs (`Self`): Right hand value. **Returns:** The product. ### `__truediv__` `__truediv__(self, rhs: Self) -> Self` Division. Calls the underlying object's `__truediv__` method. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is divided. **Returns:** The result of dividing this object by the right-hand-side value. ### `__floordiv__` `__floordiv__(self, rhs: Self) -> Self` Return the division of self and rhs rounded down to the nearest integer. Calls the underlying object's `__floordiv__` method. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is divided. **Returns:** The result of dividing this by the right-hand-side value, rounded down to the nearest integer. ### `__mod__` `__mod__(self, rhs: Self) -> Self` Return the remainder of self divided by rhs. Calls the underlying object's `__mod__` method. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** The remainder of dividing self by rhs. ### `__pow__` `__pow__(self, exp: Self) -> Self` Raises this object to the power of the given value. **Args:** * ​exp (`Self`): The exponent. **Returns:** The result of raising this to the given exponent. ### `__lshift__` `__lshift__(self, rhs: Self) -> Self` Bitwise left shift. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is bitwise shifted to the left. **Returns:** This value, shifted left by the given value. ### `__rshift__` `__rshift__(self, rhs: Self) -> Self` Bitwise right shift. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is bitwise shifted to the right. **Returns:** This value, shifted right by the given value. ### `__and__` `__and__(self, rhs: Self) -> Self` Bitwise AND. **Args:** * ​rhs (`Self`): The right-hand-side value with which this object is bitwise AND'ed. **Returns:** The bitwise AND result of this and the given value. ### `__or__` `__or__(self, rhs: Self) -> Self` Bitwise OR. **Args:** * ​rhs (`Self`): The right-hand-side value with which this object is bitwise OR'ed. **Returns:** The bitwise OR result of this and the given value. ### `__xor__` `__xor__(self, rhs: Self) -> Self` Exclusive OR.
**Args:** * ​rhs (`Self`): The right-hand-side value with which this object is exclusive OR'ed. **Returns:** The exclusive OR result of this and the given value. ### `__radd__` `__radd__(self, lhs: Self) -> Self` Reverse addition and concatenation. Calls the underlying object's `__radd__` method. **Args:** * ​lhs (`Self`): The left-hand-side value to which this object is added or concatenated. **Returns:** The sum. ### `__rsub__` `__rsub__(self, lhs: Self) -> Self` Reverse subtraction. Calls the underlying object's `__rsub__` method. **Args:** * ​lhs (`Self`): The left-hand-side value from which this object is subtracted. **Returns:** The result of subtracting this from the given value. ### `__rmul__` `__rmul__(self, lhs: Self) -> Self` Reverse multiplication. Calls the underlying object's `__rmul__` method. **Args:** * ​lhs (`Self`): The left-hand-side value that is multiplied by this object. **Returns:** The product of the multiplication. ### `__rtruediv__` `__rtruediv__(self, lhs: Self) -> Self` Reverse division. Calls the underlying object's `__rtruediv__` method. **Args:** * ​lhs (`Self`): The left-hand-side value that is divided by this object. **Returns:** The result of dividing the given value by this. ### `__rfloordiv__` `__rfloordiv__(self, lhs: Self) -> Self` Reverse floor division. Calls the underlying object's `__rfloordiv__` method. **Args:** * ​lhs (`Self`): The left-hand-side value that is divided by this object. **Returns:** The result of dividing the given value by this, rounded down to the nearest integer. ### `__rmod__` `__rmod__(self, lhs: Self) -> Self` Reverse modulo. Calls the underlying object's `__rmod__` method. **Args:** * ​lhs (`Self`): The left-hand-side value that is divided by this object. **Returns:** The remainder from dividing the given value by this. ### `__rpow__` `__rpow__(self, lhs: Self) -> Self` Reverse power of. **Args:** * ​lhs (`Self`): The number that is raised to the power of this object. **Returns:** The result of raising the given value by this exponent. ### `__rlshift__` `__rlshift__(self, lhs: Self) -> Self` Reverse bitwise left shift. **Args:** * ​lhs (`Self`): The left-hand-side value that is bitwise shifted to the left by this object. **Returns:** The given value, shifted left by this. ### `__rrshift__` `__rrshift__(self, lhs: Self) -> Self` Reverse bitwise right shift. **Args:** * ​lhs (`Self`): The left-hand-side value that is bitwise shifted to the right by this object. **Returns:** The given value, shifted right by this. ### `__rand__` `__rand__(self, lhs: Self) -> Self` Reverse bitwise AND. **Args:** * ​lhs (`Self`): The left-hand-side value that is bitwise AND'ed with this object. **Returns:** The bitwise AND result of the given value and this. ### `__ror__` `__ror__(self, lhs: Self) -> Self` Reverse bitwise OR. **Args:** * ​lhs (`Self`): The left-hand-side value that is bitwise OR'ed with this object. **Returns:** The bitwise OR result of the given value and this. ### `__rxor__` `__rxor__(self, lhs: Self) -> Self` Reverse exclusive OR. **Args:** * ​lhs (`Self`): The left-hand-side value that is exclusive OR'ed with this object. **Returns:** The exclusive OR result of the given value and this. ### `__iadd__` `__iadd__(mut self, rhs: Self)` In-place addition and concatenation. **Args:** * ​rhs (`Self`): The right-hand-side value that is added to this object. ### `__isub__` `__isub__(mut self, rhs: Self)` In-place subtraction. **Args:** * ​rhs (`Self`): The right-hand-side value that is subtracted from this object.
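Taken together, the binary, reflected, and in-place operators above let a `PythonObject` participate in ordinary Mojo expressions, with each operator forwarded to the matching dunder method on the wrapped Python object. A minimal sketch; the values and the results shown in the comments are illustrative only:

```mojo
from python import PythonObject

def main():
    var a = PythonObject(10)
    var b = PythonObject(3)

    # Binary operators forward to the Python dunder methods.
    print(a + b)   # __add__      -> 13
    print(a // b)  # __floordiv__ -> 3
    print(a % b)   # __mod__      -> 1

    # In-place operators forward to __iadd__, __isub__, and friends.
    a += b
    print(a)       # 13

    # Rich comparisons return Python objects, not necessarily booleans.
    print(a < b)   # __lt__ -> False
```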
### `__imul__` `__imul__(mut self, rhs: Self)` In-place multiplication. Calls the underlying object's `__imul__` method. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is multiplied. ### `__itruediv__` `__itruediv__(mut self, rhs: Self)` In-place division. **Args:** * ​rhs (`Self`): The value by which this object is divided. ### `__ifloordiv__` `__ifloordiv__(mut self, rhs: Self)` In-place floor division. **Args:** * ​rhs (`Self`): The value by which this object is divided. ### `__imod__` `__imod__(mut self, rhs: Self)` In-place modulo. **Args:** * ​rhs (`Self`): The right-hand-side value that is used to divide this object. ### `__ipow__` `__ipow__(mut self, rhs: Self)` In-place exponentiation. **Args:** * ​rhs (`Self`): The exponent. ### `__ilshift__` `__ilshift__(mut self, rhs: Self)` In-place bitwise left shift. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is bitwise shifted to the left. ### `__irshift__` `__irshift__(mut self, rhs: Self)` In-place bitwise right shift. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is bitwise shifted to the right. ### `__iand__` `__iand__(mut self, rhs: Self)` In-place bitwise AND. **Args:** * ​rhs (`Self`): The right-hand-side value with which this object is bitwise AND'ed. ### `__ixor__` `__ixor__(mut self, rhs: Self)` In-place exclusive OR. **Args:** * ​rhs (`Self`): The right-hand-side value with which this object is exclusive OR'ed. ### `__ior__` `__ior__(mut self, rhs: Self)` In-place bitwise OR. **Args:** * ​rhs (`Self`): The right-hand-side value with which this object is bitwise OR'ed. ### `copy` `copy(self) -> Self` Copy the object. **Returns:** A copy of the value. ### `__iter__` `__iter__(self) -> _PyIter` Iterate over the object. **Returns:** An iterator object. **Raises:** If the object is not iterable. ### `__getattr__` `__getattr__(self, owned name: String) -> Self` Return the value of the object attribute with the given name. **Args:** * ​name (`String`): The name of the object attribute to return. **Returns:** The value of the object attribute with the given name. ### `__setattr__` `__setattr__(self, owned name: String, new_value: Self)` Set the given value for the object attribute with the given name. **Args:** * ​name (`String`): The name of the object attribute to set. * ​new\_value (`Self`): The new value to be set for that attribute. ### `__call__` `__call__(self, *args: Self, *, owned **kwargs: Self) -> Self` Call the underlying object as if it were a function. **Args:** * ​\*args (`Self`): Positional arguments to the function. * ​\*\*kwargs (`Self`): Keyword arguments to the function. **Returns:** The return value from the called object. **Raises:** If the function cannot be called for any reason. ### `__len__` `__len__(self) -> Int` Returns the length of the object. **Returns:** The length of the object. ### `__hash__` `__hash__(self) -> Int` Returns the hash value of the object. **Returns:** The hash value of the object. ### `__int__` `__int__(self) -> Self` Convert the PythonObject to a Python `int` (i.e. arbitrary precision integer). **Returns:** A Python `int` object. **Raises:** An error if the conversion failed. ### `__float__` `__float__(self) -> Self` Convert the PythonObject to a Python `float` object. **Returns:** A Python `float` object. **Raises:** If the conversion fails. ### `__str__` `__str__(self) -> Self` Convert the PythonObject to a Python `str`. **Returns:** A Python `str` object.
**Raises:** An error if the conversion failed. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this Python object to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `to_python_object` `to_python_object(owned self) -> Self` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. ### `unsafe_as_py_object_ptr` `unsafe_as_py_object_ptr(self) -> PyObjectPtr` Get the underlying PyObject pointer. Safety: Use-after-free: The caller must take care that `self` outlives the usage of the pointer returned by this function. **Returns:** The underlying PyObject pointer. ### `steal_data` `steal_data(owned self) -> PyObjectPtr` Take ownership of the underlying pointer from the Python object. **Returns:** The underlying data. ### `unsafe_get_as_pointer` `unsafe_get_as_pointer[dtype: DType](self) -> UnsafePointer[SIMD[dtype, 1]]` Reinterpret a Python integer as a Mojo pointer. Warning: converting from an integer to a pointer is unsafe! The compiler assumes the resulting pointer DOES NOT alias any Mojo-derived pointer. This is OK if the pointer originates from and is owned by Python, e.g. the data underpinning a torch tensor. **Parameters:** * ​dtype (`DType`): The desired DType of the pointer. **Returns:** An `UnsafePointer` for the underlying Python data. ### `downcast_value_ptr` `downcast_value_ptr[T: AnyType](self, *, func: Optional[StringSlice[StaticConstantOrigin]] = Optional(None)) -> UnsafePointer[T]` Get a pointer to the expected contained Mojo value of type `T`. This method validates that this object actually contains an instance of `T`, and will raise an error if it does not. Mojo values are stored as Python objects backed by the `PyMojoObject[T]` struct. **Parameters:** * ​T (`AnyType`): The type of the Mojo value that this Python object is expected to contain. **Args:** * ​func (`Optional[StringSlice[StaticConstantOrigin]]`): Optional name of bound Mojo function that the raised TypeError should reference if downcasting fails. **Returns:** A pointer to the inner Mojo value. **Raises:** If the Python object does not contain an instance of the Mojo `T` type. ### `unchecked_downcast_value_ptr` `unchecked_downcast_value_ptr[T: AnyType](self) -> UnsafePointer[T]` Get a pointer to the expected Mojo value of type `T`. This function assumes that this Python object was allocated as an instance of `PyMojoObject[T]`. # Safety The user must be certain that this Python object type matches the bound Python type object for `T`. **Parameters:** * ​T (`AnyType`): The type of the Mojo value stored in this object. **Returns:** A pointer to the inner Mojo value. --- ## python_object Implements PythonObject. You can import these APIs from the `python` package. For example: ```mojo from python import PythonObject ``` ## Aliases ### `PyFunction` `alias PyFunction = fn(mut PythonObject, mut PythonObject) -> PythonObject` ### `PyFunctionRaising` `alias PyFunctionRaising = fn(mut PythonObject, mut PythonObject) raises -> PythonObject` ## Structs * [​`PythonObject`](/mojo/stdlib/python/python_object/PythonObject): A Python object. ## Traits * [​`ConvertibleFromPython`](/mojo/stdlib/python/python_object/ConvertibleFromPython): Denotes a type that can attempt construction from a read-only Python object. 
* [​`PythonConvertible`](/mojo/stdlib/python/python_object/PythonConvertible): A trait that indicates a type can be converted to a PythonObject, and that specifies the behavior with a `to_python_object` method. --- ## random Implements the random package. ## Modules * [​`random`](/mojo/stdlib/random/random/): Provides functions for random numbers. --- ## random Provides functions for random numbers. You can import these APIs from the `random` package. For example:

```mojo
from random import seed
```

## Functions * [​`rand`](/mojo/stdlib/random/random/rand): Fills memory with random values from a uniform distribution. * [​`randint`](/mojo/stdlib/random/random/randint): Fills memory with uniform random values in the range \[low, high]. * [​`randn`](/mojo/stdlib/random/random/randn): Fills memory with random values from a Normal(mean, standard\_deviation) distribution. * [​`randn_float64`](/mojo/stdlib/random/random/randn_float64): Returns a random `Float64` sampled from a Normal(mean, standard\_deviation) distribution. * [​`random_float64`](/mojo/stdlib/random/random/random_float64): Returns a random `Float64` number from the given range. * [​`random_si64`](/mojo/stdlib/random/random/random_si64): Returns a random `Int64` number from the given range. * [​`random_ui64`](/mojo/stdlib/random/random/random_ui64): Returns a random `UInt64` number from the given range. * [​`seed`](/mojo/stdlib/random/random/seed): Seeds the random number generator using the current time. * [​`shuffle`](/mojo/stdlib/random/random/shuffle): Shuffles the elements of the list randomly. --- ## rand `rand[dtype: DType](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], size: Int, /, *, min: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](0), max: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1), int_scale: Optional[Int] = Optional(None))` Fills memory with random values from a uniform distribution. **Parameters:** * ​dtype (`DType`): The dtype of the pointer. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The pointer to the memory area to fill. * ​size (`Int`): The number of elements to fill. * ​min (`SIMD[float64, 1]`): The minimum value of the random range. * ​max (`SIMD[float64, 1]`): The maximum value of the random range. * ​int\_scale (`Optional[Int]`): The scale for error checking (float type only). --- ## randint `randint[dtype: DType](ptr: UnsafePointer[SIMD[dtype, 1]], size: Int, low: Int, high: Int)` Fills memory with uniform random values in the range \[low, high]. **Constraints:** The type should be integral. **Parameters:** * ​dtype (`DType`): The dtype of the pointer. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1]]`): The pointer to the memory area to fill. * ​size (`Int`): The number of elements to fill. * ​low (`Int`): The minimum value of the range. * ​high (`Int`): The maximum value of the range.
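Before moving on to the normal-distribution APIs below, here is a small sketch of the uniform APIs above (`seed`, `random_float64`, and `randint`); the seed value and buffer size are arbitrary, and the exact values printed depend on the generator:

```mojo
from memory import UnsafePointer
from random import randint, random_float64, seed

def main():
    # Seed with a fixed value so runs are reproducible.
    seed(42)

    # A single uniform Float64 from the default range.
    print(random_float64())

    # Fill a small heap buffer with uniform integers in [0, 10].
    alias n = 4
    var ptr = UnsafePointer[Int64].alloc(n)
    randint(ptr, n, 0, 10)
    for i in range(n):
        print(ptr[i])
    ptr.free()
```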
--- ## randn `randn[dtype: DType](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], size: Int, mean: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](0), standard_deviation: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1))` Fills memory with random values from a Normal(mean, standard\_deviation) distribution. **Constraints:** The type should be floating point. **Parameters:** * ​dtype (`DType`): The dtype of the pointer. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The pointer to the memory area to fill. * ​size (`Int`): The number of elements to fill. * ​mean (`SIMD[float64, 1]`): Normal distribution mean. * ​standard\_deviation (`SIMD[float64, 1]`): Normal distribution standard deviation. --- ## randn_float64 `randn_float64(mean: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](0), standard_deviation: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1)) -> SIMD[float64, 1]` Returns a random `Float64` sampled from a Normal(mean, standard\_deviation) distribution. **Args:** * ​mean (`SIMD[float64, 1]`): Normal distribution mean. * ​standard\_deviation (`SIMD[float64, 1]`): Normal distribution standard deviation. **Returns:** A random `Float64` sampled from Normal(mean, standard\_deviation). --- ## random_float64 `random_float64(min: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](0), max: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](1)) -> SIMD[float64, 1]` Returns a random `Float64` number from the given range. **Args:** * ​min (`SIMD[float64, 1]`): The minimum number in the range (default is 0.0). * ​max (`SIMD[float64, 1]`): The maximum number in the range (default is 1.0). **Returns:** A random number from the specified range. --- ## random_si64 `random_si64(min: SIMD[int64, 1], max: SIMD[int64, 1]) -> SIMD[int64, 1]` Returns a random `Int64` number from the given range. **Args:** * ​min (`SIMD[int64, 1]`): The minimum number in the range. * ​max (`SIMD[int64, 1]`): The maximum number in the range. **Returns:** A random number from the specified range. --- ## random_ui64 `random_ui64(min: SIMD[uint64, 1], max: SIMD[uint64, 1]) -> SIMD[uint64, 1]` Returns a random `UInt64` number from the given range. **Args:** * ​min (`SIMD[uint64, 1]`): The minimum number in the range. * ​max (`SIMD[uint64, 1]`): The maximum number in the range. **Returns:** A random number from the specified range. --- ## seed `seed()` Seeds the random number generator using the current time. `seed(a: Int)` Seeds the random number generator using the value provided. **Args:** * ​a (`Int`): The seed value. --- ## shuffle `shuffle[T: Copyable & Movable, //](mut list: List[T])` Shuffles the elements of the list randomly. Performs an in-place Fisher-Yates shuffle on the provided list. **Parameters:** * ​T (`Copyable & Movable`): The type of element in the List. **Args:** * ​list (`List[T]`): The list to modify. --- ## DeviceContextPtr `@register_passable(trivial)` `struct DeviceContextPtr` Exposes a pointer to a C++ DeviceContext to Mojo. Note: When initializing a `DeviceContext` from a pointer, the refcount is not incremented. This is considered safe because `get_device_context()` is only used within kernels and the `DeviceContext` lifetime is managed by the graph compiler. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Initialize an empty `DeviceContextPtr` with a null pointer. This creates a `DeviceContextPtr` that doesn't point to any device context. `@implicit` `__init__(handle: UnsafePointer[NoneType]) -> Self` Initialize a `DeviceContextPtr` from a raw pointer. **Args:** * ​handle (`UnsafePointer[NoneType]`): A raw pointer to a C++ `DeviceContext`. `@implicit` `__init__(device: DeviceContext) -> Self` Initialize a DeviceContextPtr from a `DeviceContext`.
This constructor allows implicit conversion from `DeviceContext` to `DeviceContextPtr`. **Args:** * ​device (`DeviceContext`): The `DeviceContext` to wrap in this pointer. ### `__getitem__` `__getitem__(self) -> DeviceContext` Dereference the pointer to get the `DeviceContext`. **Returns:** The `DeviceContext` that this pointer points to. ### `get_device_context` `get_device_context(self) -> DeviceContext` Get the `DeviceContext` that this pointer points to. This is an alias for the dereference operator. **Returns:** The `DeviceContext` that this pointer points to. --- ## DeviceContextPtrList `@register_passable(trivial)` `struct DeviceContextPtrList[size: Int]` A fixed-size collection of `DeviceContextPtr` objects. This struct provides a lightweight, register-passable container for a fixed number of `DeviceContextPtr` objects, with array-like access semantics. ## Parameters * ​size (`Int`): The fixed number of `DeviceContextPtr` objects in the collection. ## Fields * ​ptrs (`StaticTuple[DeviceContextPtr, size]`): The underlying storage for the device context pointers. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(ptrs: StaticTuple[DeviceContextPtr, size]) -> Self` Initialize with a StaticTuple of `DeviceContextPtr` objects. **Args:** * ​ptrs (`StaticTuple[DeviceContextPtr, size]`): A StaticTuple containing the `DeviceContextPtr` objects to store. ### `__getitem__` `__getitem__[index: Int](self) -> DeviceContext` Access a `DeviceContext` at a compile-time known index. **Parameters:** * ​index (`Int`): A compile-time integer index. **Returns:** The `DeviceContext` at the specified index. `__getitem__[I: Indexer, //](self, idx: I) -> DeviceContext` Access a `DeviceContext` using a runtime index value. **Parameters:** * ​I (`Indexer`): A type that conforms to the `Indexer` trait. **Args:** * ​idx (`I`): A runtime index value that conforms to the Indexer trait. **Returns:** The `DeviceContext` at the specified index. ### `__len__` `__len__(self) -> Int` Get the number of `DeviceContextPtr` objects in the collection. **Returns:** The size of the collection as specified by the size parameter. --- ## Task `struct Task[type: AnyType, origins: origin.set]` Represents an asynchronous task that will produce a value of the specified type. A Task encapsulates a coroutine that is executing asynchronously and will eventually produce a result. Tasks can be awaited in async functions or waited on in synchronous code. ## Parameters * ​type (`AnyType`): The type of value that this task will produce when completed. * ​origins (`origin.set`): The set of origins for the coroutine wrapped by this task. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, owned handle: Coroutine[type, origins])` Initialize a task with a coroutine. Takes ownership of the provided coroutine and sets up the task to receive its result when completed. **Args:** * ​handle (`Coroutine[type, origins]`): The coroutine to execute as a task. Ownership is transferred. ### `__del__` `__del__(owned self)` Destroy the memory associated with a task. This must be manually called when a task goes out of scope. ### `__await__` `__await__(self) -> ref [*[0,0]._result] type` Suspend the current async function until the task completes and its result becomes available. This function must be force inlined into the calling async function. 
This method enables the use of the 'await' keyword with Task objects in async functions. **Returns:** A reference to the result value produced by the task. ### `get` `get(self) -> ref [*[0,0]._result] type` Get the task's result value. Calling this on an incomplete task is undefined behavior. **Returns:** A reference to the result value produced by the task. ### `wait` `wait(self) -> ref [*[0,0]._result] type` Block the current thread until the future value becomes available. This method is used in synchronous code to wait for an asynchronous task to complete. Unlike `__await__`, this method does not suspend the current coroutine but instead blocks the entire thread. **Returns:** A reference to the result value produced by the task. --- ## TaskGroup `struct TaskGroup` A group of tasks that can be executed concurrently. TaskGroup manages a collection of coroutines that can be executed in parallel. It provides mechanisms to create, track, and wait for the completion of tasks. ## Fields * ​counter (`Atomic[index]`): Atomic counter tracking the number of active tasks in the group. * ​chain (`_Chain`): Chain used for asynchronous completion notification. * ​tasks (`List[_TaskGroupBox]`): Collection of tasks managed by this TaskGroup. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initialize a new TaskGroup with an empty task list and initialized chain. ### `__del__` `__del__(owned self)` Clean up resources associated with the TaskGroup. ### `__await__` `__await__(mut self)` Make TaskGroup awaitable in async contexts. This allows using 'await task\_group' syntax in async functions. ### `create_task` `create_task(mut self, owned task: Coroutine[None, origins])` Add a new task to the TaskGroup for execution. **Args:** * ​task (`Coroutine[None, origins]`): The coroutine to be executed as a task. ### `await_body_impl` `static await_body_impl(hdl: !co.routine, mut task_group: Self)` Implementation of the await functionality for TaskGroup. **Args:** * ​hdl (`!co.routine`): The coroutine handle to be awaited. * ​task\_group (`Self`): The TaskGroup to be awaited. ### `wait` `wait[origins: origin.set = {}](mut self)` Wait for all tasks in the `TaskGroup` to complete. This is a blocking call that returns only when all tasks have finished. **Parameters:** * ​origins (`origin.set`): The origin set for the wait operation. --- ## TaskGroupContext `@register_passable(trivial)` `struct TaskGroupContext` Context structure for task group operations. This structure holds a callback function and a pointer to a TaskGroup, allowing asynchronous operations to interact with their parent TaskGroup when they complete. ## Fields * ​callback (`fn(mut TaskGroup) -> None`): Callback function to be invoked on the TaskGroup when an operation completes. * ​task\_group (`UnsafePointer[TaskGroup]`): Pointer to the TaskGroup that owns or is associated with this context. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `tg_callback_fn_type` `alias tg_callback_fn_type = fn(mut TaskGroup) -> None` Type definition for callback functions that operate on TaskGroups. --- ## create_task `create_task(owned handle: Coroutine[type, origins], out task: Task[type, origins])` Run the coroutine as a task on the AsyncRT Runtime. This function creates a task from a coroutine and schedules it for execution on the async runtime. The task will execute asynchronously without blocking the current execution context. 
**Args:** * ​handle (`Coroutine[type, origins]`): The coroutine to execute as a task. Ownership is transferred. **Returns:** The `task` output parameter is initialized with the created task. --- ## asyncrt This module implements the low level concurrency library. ## Structs * [​`DeviceContextPtr`](/mojo/stdlib/runtime/asyncrt/DeviceContextPtr): Exposes a pointer to a C++ DeviceContext to Mojo. * [​`DeviceContextPtrList`](/mojo/stdlib/runtime/asyncrt/DeviceContextPtrList): A fixed-size collection of `DeviceContextPtr` objects. * [​`Task`](/mojo/stdlib/runtime/asyncrt/Task): Represents an asynchronous task that will produce a value of the specified type. * [​`TaskGroup`](/mojo/stdlib/runtime/asyncrt/TaskGroup): A group of tasks that can be executed concurrently. * [​`TaskGroupContext`](/mojo/stdlib/runtime/asyncrt/TaskGroupContext): Context structure for task group operations. ## Functions * [​`create_task`](/mojo/stdlib/runtime/asyncrt/create_task): Run the coroutine as a task on the AsyncRT Runtime. * [​`parallelism_level`](/mojo/stdlib/runtime/asyncrt/parallelism_level): Gets the parallelism level of the Runtime. --- ## parallelism_level `parallelism_level() -> Int` Gets the parallelism level of the Runtime. **Returns:** The number of worker threads available in the async runtime. --- ## runtime Implements the runtime package. ## Modules * [​`asyncrt`](/mojo/stdlib/runtime/asyncrt/): This module implements the low level concurrency library. * [​`tracing`](/mojo/stdlib/runtime/tracing/): Provides tracing utilities. --- ## Trace `struct Trace[level: TraceLevel, *, category: TraceCategory = TraceCategory(4), target: Optional[StringSlice[StaticConstantOrigin]] = Optional(None)]` An object representing a specific trace. This struct provides functionality for creating and managing trace events for profiling and debugging purposes. ## Parameters * ​level (`TraceLevel`): The trace level to use. * ​category (`TraceCategory`): The trace category to use (defaults to TraceCategory.MAX). * ​target (`Optional[StringSlice[StaticConstantOrigin]]`): Optional target information to include in the trace. ## Fields * ​int\_payload (`OptionalReg[Int]`): Optional integer payload, typically used for task IDs that are appended to trace names. * ​detail (`String`): Additional details about the trace event, included when detailed tracing is enabled. * ​event\_id (`Int`): Unique identifier for the trace event, assigned when the trace begins. * ​parent\_id (`Int`): Identifier of the parent trace event, used for creating hierarchical trace relationships. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, *, owned _name_value: Variant[String, StringSlice[StaticConstantOrigin]], detail: String = __init__[__mlir_type.!kgen.string](""), parent_id: Int = 0, task_id: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` Creates a Mojo trace with the given name. **Args:** * ​\_name\_value (`Variant[String, StringSlice[StaticConstantOrigin]]`): The name that is used to identify this Mojo trace. * ​detail (`String`): Details of the trace entry. * ​parent\_id (`Int`): Parent to associate the trace with. Trace name will be appended to parent name. 0 (default) indicates no parent. * ​task\_id (`OptionalReg[Int]`): Int that is appended to name. 
`__init__(out self, owned name: String, detail: String = __init__[__mlir_type.!kgen.string](""), parent_id: Int = 0, *, task_id: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` Creates a Mojo trace with the given string name. **Args:** * ​name (`String`): The name that is used to identify this Mojo trace. * ​detail (`String`): Details of the trace entry. * ​parent\_id (`Int`): Parent to associate the trace with. Trace name will be appended to parent name. 0 (default) indicates no parent. * ​task\_id (`OptionalReg[Int]`): Int that is appended to name. `__init__(out self, name: StringSlice[StaticConstantOrigin], detail: String = __init__[__mlir_type.!kgen.string](""), parent_id: Int = 0, *, task_id: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` Creates a Mojo trace with the given static string name. **Args:** * ​name (`StringSlice[StaticConstantOrigin]`): The name that is used to identify this Mojo trace. * ​detail (`String`): Details of the trace entry. * ​parent\_id (`Int`): Parent to associate the trace with. Trace name will be appended to parent name. 0 (default) indicates no parent. * ​task\_id (`OptionalReg[Int]`): Int that is appended to name. `__init__(out self, name: StringLiteral[value], detail: String = __init__[__mlir_type.!kgen.string](""), parent_id: Int = 0, *, task_id: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` Creates a Mojo trace with the given string literal name. **Args:** * ​name (`StringLiteral[value]`): The name that is used to identify this Mojo trace. * ​detail (`String`): Details of the trace entry. * ​parent\_id (`Int`): Parent to associate the trace with. Trace name will be appended to parent name. 0 (default) indicates no parent. * ​task\_id (`OptionalReg[Int]`): Int that is appended to name. ### `__enter__` `__enter__(mut self)` Enters the trace context. This begins recording of the trace event. ### `__exit__` `__exit__(self)` Exits the trace context. This finishes recording of the trace event. ### `mark` `mark(self)` Marks the tracer with the info at a specific point in time. This creates a point event in the trace timeline rather than a range. ### `name` `name(self) -> String` Returns the name of the trace. **Returns:** The name of the trace as a String. ### `start` `start(mut self)` Start recording trace event. This begins recording of the trace event, similar to `__enter__`. ### `end` `end(mut self)` End recording trace event. This finishes recording of the trace event, similar to `__exit__`. --- ## TraceCategory `@register_passable(trivial)` `struct TraceCategory` An enum-like struct specifying the type of tracing to perform. ## Fields * ​value (`Int`): The integer value representing the trace category. Used for bitwise operations when determining if profiling is enabled for a specific category. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Intable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ASYNCRT` `alias ASYNCRT = TraceCategory(1)` ### `Kernel` `alias Kernel = TraceCategory(3)` ### `MAX` `alias MAX = TraceCategory(4)` ### `MEM` `alias MEM = TraceCategory(2)` ### `OTHER` `alias OTHER = TraceCategory(0)` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compares for equality. **Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if they are equal. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Compares for inequality. **Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if they are not equal. ### `__is__` `__is__(self, rhs: Self) -> Bool` Compares for equality.
**Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if they are equal. ### `__isnot__` `__isnot__(self, rhs: Self) -> Bool` Compares for inequality. **Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if they are not equal. ### `__int__` `__int__(self) -> Int` Converts the trace category to an integer. **Returns:** The integer value of the trace category. --- ## TraceLevel `@register_passable(trivial)` `struct TraceLevel` An enum-like struct specifying the level of tracing to perform. ## Fields * ​value (`Int`): The integer value representing the trace level. Lower values indicate higher priority trace levels: * 0 (ALWAYS): Always traced * 1 (OP): Operation-level tracing * 2 (THREAD): Thread-level tracing ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ALWAYS` `alias ALWAYS = TraceLevel(0)` ### `OP` `alias OP = TraceLevel(1)` ### `THREAD` `alias THREAD = TraceLevel(2)` ## Methods ### `__init__` `@implicit` `__init__(value: Int) -> Self` Initializes a TraceLevel with the given integer value. **Args:** * ​value (`Int`): The integer value for the trace level. ### `__le__` `__le__(self, rhs: Self) -> Bool` Performs less than or equal to comparison. **Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if this value is less than or equal to `rhs`. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compares for equality. **Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if they are equal. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Compares for inequality. **Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if they are not equal. ### `__is__` `__is__(self, rhs: Self) -> Bool` Compares for equality. **Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if they are equal. ### `__isnot__` `__isnot__(self, rhs: Self) -> Bool` Compares for inequality. **Args:** * ​rhs (`Self`): The value to compare. **Returns:** True if they are not equal. ### `__int__` `__int__(self) -> Int` Converts the trace level to an integer. **Returns:** The integer value of the trace level. --- ## get_current_trace_id `get_current_trace_id[level: TraceLevel]() -> Int` Returns the id of last created trace entry on the current thread. **Parameters:** * ​level (`TraceLevel`): The trace level to check. **Returns:** The ID of the current trace if profiling is enabled, otherwise 0. --- ## tracing Provides tracing utilities. ## Structs * [​`Trace`](/mojo/stdlib/runtime/tracing/Trace): An object representing a specific trace. * [​`TraceCategory`](/mojo/stdlib/runtime/tracing/TraceCategory): An enum-like struct specifying the type of tracing to perform. * [​`TraceLevel`](/mojo/stdlib/runtime/tracing/TraceLevel): An enum-like struct specifying the level of tracing to perform. ## Functions * [​`get_current_trace_id`](/mojo/stdlib/runtime/tracing/get_current_trace_id): Returns the id of last created trace entry on the current thread. * [​`is_profiling_disabled`](/mojo/stdlib/runtime/tracing/is_profiling_disabled): Returns False if the profiling is enabled for that specific type and level and True otherwise. * [​`is_profiling_enabled`](/mojo/stdlib/runtime/tracing/is_profiling_enabled): Returns True if the profiling is enabled for that specific type and level and False otherwise. * [​`trace_arg`](/mojo/stdlib/runtime/tracing/trace_arg): Helper to stringify the type and shape of a kernel argument for tracing. 
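For example, `Trace` can be used as a context manager so that an event is recorded between `__enter__` and `__exit__`. A minimal sketch; the span name and the traced workload are placeholders:

```mojo
from runtime.tracing import Trace, TraceLevel

fn expensive_step():
    # Stand-in for the work being profiled.
    pass

fn main():
    # Record an operation-level span covering the `with` block.
    # If profiling is disabled for this level, the overhead is minimal.
    with Trace[TraceLevel.OP]("expensive_step"):
        expensive_step()
```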
--- ## is_profiling_disabled `is_profiling_disabled[type: TraceCategory, level: TraceLevel]() -> Bool` Returns False if the profiling is enabled for that specific type and level and True otherwise. **Parameters:** * ​type (`TraceCategory`): The trace category to check. * ​level (`TraceLevel`): The trace level to check. **Returns:** True if profiling is disabled for the specified type and level. --- ## is_profiling_enabled `is_profiling_enabled[type: TraceCategory, level: TraceLevel]() -> Bool` Returns True if the profiling is enabled for that specific type and level and False otherwise. **Parameters:** * ​type (`TraceCategory`): The trace category to check. * ​level (`TraceLevel`): The trace level to check. **Returns:** True if profiling is enabled for the specified type and level. --- ## trace_arg `trace_arg(name: String, shape: IndexList[size, element_type=element_type]) -> String` Helper to stringify the type and shape of a kernel argument for tracing. **Args:** * ​name (`String`): The name of the argument. * ​shape (`IndexList[size, element_type=element_type]`): The shape of the argument. **Returns:** A string representation of the argument with its shape. `trace_arg(name: String, shape: IndexList[size, element_type=element_type], dtype: DType) -> String` Helper to stringify the type and shape of a kernel argument for tracing. **Args:** * ​name (`String`): The name of the argument. * ​shape (`IndexList[size, element_type=element_type]`): The shape of the argument. * ​dtype (`DType`): The data type of the argument. **Returns:** A string representation of the argument with its shape and data type. `trace_arg(name: String, buf: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> String` Helper to stringify the type and shape of a kernel argument for tracing. **Args:** * ​name (`String`): The name of the argument. * ​buf (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The NDBuffer to trace. **Returns:** A string representation of the buffer with its shape and data type. --- ## stat Implements the stat package. ## Modules * [​`stat`](/mojo/stdlib/stat/stat/): Implements the stat module. --- ## S_ISBLK `S_ISBLK[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a block device. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a block device and False otherwise. --- ## S_ISCHR `S_ISCHR[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a character device. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a character device and False otherwise. --- ## S_ISDIR `S_ISDIR[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a directory. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a directory and False otherwise. --- ## S_ISFIFO `S_ISFIFO[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a fifo. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a fifo and False otherwise. --- ## S_ISLNK `S_ISLNK[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a symlink. 
**Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a symlink and False otherwise. --- ## S_ISREG `S_ISREG[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a regular file. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a regular file and False otherwise. --- ## S_ISSOCK `S_ISSOCK[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a socket. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a socket and False otherwise. --- ## stat Implements the stat module. ## Aliases ### `S_IFBLK` `alias S_IFBLK = 24576` Bits that determine the block device. ### `S_IFCHR` `alias S_IFCHR = 8192` Bits that determine the char device. ### `S_IFDIR` `alias S_IFDIR = 16384` Bits that determine the directory. ### `S_IFIFO` `alias S_IFIFO = 4096` Bits that determine the fifo. ### `S_IFLNK` `alias S_IFLNK = 40960` Bits that determine the symlink. ### `S_IFMT` `alias S_IFMT = 61440` Bits that determine the file type. ### `S_IFREG` `alias S_IFREG = 32768` Bits that determine the regular file. ### `S_IFSOCK` `alias S_IFSOCK = 49152` Bits that determine the socket. ## Functions * [​`S_ISBLK`](/mojo/stdlib/stat/stat/S_ISBLK): Returns True if the mode is a block device. * [​`S_ISCHR`](/mojo/stdlib/stat/stat/S_ISCHR): Returns True if the mode is a character device. * [​`S_ISDIR`](/mojo/stdlib/stat/stat/S_ISDIR): Returns True if the mode is a directory. * [​`S_ISFIFO`](/mojo/stdlib/stat/stat/S_ISFIFO): Returns True if the mode is a fifo. * [​`S_ISLNK`](/mojo/stdlib/stat/stat/S_ISLNK): Returns True if the mode is a symlink. * [​`S_ISREG`](/mojo/stdlib/stat/stat/S_ISREG): Returns True if the mode is a regular file. * [​`S_ISSOCK`](/mojo/stdlib/stat/stat/S_ISSOCK): Returns True if the mode is a socket. --- ## subprocess Implements the subprocess package. ## Modules * [​`subprocess`](/mojo/stdlib/subprocess/subprocess/): Implements the subprocess package. --- ## subprocess Implements the subprocess package. ## Functions * [​`run`](/mojo/stdlib/subprocess/subprocess/run): Runs the specified command and returns the output as a string. --- ## run `run(cmd: String) -> String` Runs the specified command and returns the output as a string. This function executes the given command in a subprocess, captures its standard output, and returns it as a string. It automatically handles opening and closing the subprocess. **Args:** * ​cmd (`String`): The command to execute as a string. **Returns:** The standard output of the command as a string, with trailing whitespace removed. **Raises:** This function raises if: * The command cannot be executed. * There is an IO error reading from the subprocess. * The data written by the subprocess is not valid UTF-8. --- ## argv `argv() -> VariadicList[StringSlice[StaticConstantOrigin]]` Gets the list of command line arguments given to the `mojo` CLI. For example:

```mojo title="app.mojo"
from sys import argv

def main():
    args = argv()
    for arg in args:
        print(arg)
```

```sh
mojo app.mojo "Hello world"
```

```output
app.mojo
Hello world
```

**Returns:** The list of command line arguments provided when mojo was invoked. --- ## arg Implements functions and variables for interacting with execution and system environment.
## Functions * [​`argv`](/mojo/stdlib/sys/arg/argv): Gets the list of command line arguments given to the `mojo` CLI. --- ## compile Implements functions that return compile-time information. ## Aliases ### `DebugLevel` `alias DebugLevel = _DebugLevel()` Represents the debug level used during compilation. ### `OptimizationLevel` `alias OptimizationLevel = _OptimizationLevel()` Represents the optimization level used during compilation. ## Functions * [​`is_compile_time`](/mojo/stdlib/sys/compile/is_compile_time): Returns true if the current code is executed at compile time, false otherwise. --- ## is_compile_time `is_compile_time() -> Bool` Returns true if the current code is executed at compile time, false otherwise. **Returns:** A boolean value indicating whether the code is being compiled. --- ## breakpointhook `breakpointhook()` Cause an execution trap with the intention of requesting the attention of a debugger. --- ## debug This module includes the debug hook functions. ## Functions * [​`breakpointhook`](/mojo/stdlib/sys/debug/breakpointhook): Cause an execution trap with the intention of requesting the attention of a debugger. --- ## DLHandle `@register_passable(trivial)` `struct DLHandle` Represents a dynamically linked library that can be loaded and unloaded. The library is loaded on initialization and unloaded by `close`. ## Fields * ​handle (`UnsafePointer[NoneType]`): The handle to the dynamic library. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, flags: Int = (256 if os_is_linux() else 8 | 2))` Initialize a dynamic library handle to all global symbols in the current process. Notes: On POSIX-compatible operating systems, this performs `dlopen(nullptr, flags)`. **Args:** * ​flags (`Int`): The flags to load the dynamic library. `__init__[PathLike: PathLike, //](out self, path: PathLike, flags: Int = (256 if os_is_linux() else 8 | 2))` Initialize a DLHandle object by loading the dynamic library at the given path. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the `os.PathLike` trait. **Args:** * ​path (`PathLike`): The path to the dynamic library file. * ​flags (`Int`): The flags to load the dynamic library. ### `__bool__` `__bool__(self) -> Bool` Checks if the handle is valid. **Returns:** True if the DLHandle is not null and False otherwise. ### `copy` `copy(self) -> Self` Copy the object. **Returns:** A copy of the value. ### `check_symbol` `check_symbol(self, owned name: String) -> Bool` Check that the symbol exists in the dynamic library. **Args:** * ​name (`String`): The symbol to check. **Returns:** `True` if the symbol exists. ### `close` `close(mut self)` Delete the DLHandle object unloading the associated dynamic library. ### `get_function` `get_function[result_type: AnyTrivialRegType](self, owned name: String) -> result_type` Returns a handle to the function with the given name in the dynamic library. **Parameters:** * ​result\_type (`AnyTrivialRegType`): The type of the function pointer to return. **Args:** * ​name (`String`): The name of the function to get the handle for. **Returns:** A handle to the function. ### `get_symbol` `get_symbol[result_type: AnyType](self, name: StringSlice[origin]) -> UnsafePointer[result_type]` Returns a pointer to the symbol with the given name in the dynamic library. **Parameters:** * ​result\_type (`AnyType`): The type of the symbol to return. 
**Args:** * ​name (`StringSlice[origin]`): The name of the symbol to get the handle for. **Returns:** A pointer to the symbol. `get_symbol[result_type: AnyType](self, *, cstr_name: UnsafePointer[SIMD[int8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> UnsafePointer[result_type]` Returns a pointer to the symbol with the given name in the dynamic library. **Parameters:** * ​result\_type (`AnyType`): The type of the symbol to return. **Args:** * ​cstr\_name (`UnsafePointer[SIMD[int8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The name of the symbol to get the handle for. **Returns:** A pointer to the symbol. ### `call` `call[name: StringSlice[StaticConstantOrigin], return_type: AnyTrivialRegType = NoneType, *T: AnyType = *?](self, *args: *T) -> return_type` Call a function with any number of arguments. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the function. * ​return\_type (`AnyTrivialRegType`): The return type of the function. * ​\*T (`AnyType`): The types of `args`. **Args:** * ​\*args (`*T`): The arguments. **Returns:** The result. `call[name: StringSlice[StaticConstantOrigin], return_type: AnyTrivialRegType = NoneType](self, args: VariadicPack[is_owned, origin, AnyType, element_types]) -> return_type` Call a function with any number of arguments. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the function. * ​return\_type (`AnyTrivialRegType`): The return type of the function. **Args:** * ​args (`VariadicPack[is_owned, origin, AnyType, element_types]`): The arguments. **Returns:** The result. --- ## RTLD `struct RTLD` Enumeration of the RTLD flags used during dynamic library loading. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `GLOBAL` `alias GLOBAL = 256 if os_is_linux() else 8` Make symbols available for symbol resolution of subsequently loaded libraries. ### `LAZY` `alias LAZY = 1` Load library lazily (defer function resolution until needed). ### `LOCAL` `alias LOCAL = 4` Make symbols not available for symbol resolution of subsequently loaded libraries. ### `NOW` `alias NOW = 2` Load library immediately (resolve all symbols on load). --- ## external_call `external_call[callee: StringSlice[StaticConstantOrigin], return_type: AnyTrivialRegType, *types: AnyType](*args: *types) -> return_type` Calls an external function. **Parameters:** * ​callee (`StringSlice[StaticConstantOrigin]`): The name of the external function. * ​return\_type (`AnyTrivialRegType`): The return type. * ​\*types (`AnyType`): The argument types. **Args:** * ​\*args (`*types`): The arguments to pass to the external function. **Returns:** The external call result. `external_call[callee: StringSlice[StaticConstantOrigin], return_type: AnyTrivialRegType](args: VariadicPack[is_owned, origin, AnyType, element_types]) -> return_type` Calls an external function. **Parameters:** * ​callee (`StringSlice[StaticConstantOrigin]`): The name of the external function. * ​return\_type (`AnyTrivialRegType`): The return type. **Args:** * ​args (`VariadicPack[is_owned, origin, AnyType, element_types]`): The arguments to pass to the external function. **Returns:** The external call result.
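As a small illustration of this call path, the following sketch calls libc's `getpid` through `external_call`; it assumes a POSIX host where `getpid` is available:

```mojo
from sys.ffi import c_int, external_call

def main():
    # getpid() takes no arguments and returns a C int.
    var pid = external_call["getpid", c_int]()
    print("pid:", pid)
```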
--- ## ffi Implements a foreign functions interface (FFI). ## Aliases ### `c_char` `alias c_char = SIMD[int8, 1]` C `char` type. ### `c_double` `alias c_double = SIMD[float64, 1]` C `double` type. ### `c_float` `alias c_float = SIMD[float32, 1]` C `float` type. ### `c_int` `alias c_int = SIMD[int32, 1]` C `int` type. The C `int` type is typically a signed 32-bit integer on commonly used targets today. ### `c_long` `alias c_long = SIMD[_c_long_dtype(), 1]` C `long` type. The C `long` type is typically a signed 64-bit integer on macOS and Linux, and a 32-bit integer on Windows. ### `c_long_long` `alias c_long_long = SIMD[_c_long_long_dtype(), 1]` C `long long` type. The C `long long` type is typically a signed 64-bit integer on commonly used targets today. ### `c_short` `alias c_short = SIMD[int16, 1]` C `short` type. ### `c_size_t` `alias c_size_t = UInt` C `size_t` type. ### `c_ssize_t` `alias c_ssize_t = Int` C `ssize_t` type. ### `c_uchar` `alias c_uchar = SIMD[uint8, 1]` C `unsigned char` type. ### `c_uint` `alias c_uint = SIMD[uint32, 1]` C `unsigned int` type. ### `c_ushort` `alias c_ushort = SIMD[uint16, 1]` C `unsigned short` type. ### `DEFAULT_RTLD` `alias DEFAULT_RTLD = (256 if os_is_linux() else 8 | 2)` ### `OpaquePointer` `alias OpaquePointer = UnsafePointer[NoneType]` An opaque pointer, equivalent to the C `void*` type. ## Structs * [​`DLHandle`](/mojo/stdlib/sys/ffi/DLHandle): Represents a dynamically linked library that can be loaded and unloaded. * [​`RTLD`](/mojo/stdlib/sys/ffi/RTLD): Enumeration of the RTLD flags used during dynamic library loading. ## Functions * [​`external_call`](/mojo/stdlib/sys/ffi/external_call): Calls an external function. --- ## sys Implements the sys package. ## Modules * [​`arg`](/mojo/stdlib/sys/arg/): Implements functions and variables for interacting with execution and system environment. * [​`compile`](/mojo/stdlib/sys/compile/): Implements functions that return compile-time information. * [​`debug`](/mojo/stdlib/sys/debug/): This module includes the debug hook functions. * [​`ffi`](/mojo/stdlib/sys/ffi/): Implements a foreign functions interface (FFI). * [​`info`](/mojo/stdlib/sys/info/): Implements methods for querying the host target info. * [​`intrinsics`](/mojo/stdlib/sys/intrinsics/): Defines intrinsics. * [​`param_env`](/mojo/stdlib/sys/param_env/): Implements functions for retrieving compile-time defines. * [​`terminate`](/mojo/stdlib/sys/terminate/): This module includes the exit functions. --- ## CompilationTarget `@register_passable(trivial)` `struct CompilationTarget[value: target = _current_target()]` A struct that provides information about a target architecture. This struct encapsulates various methods to query target-specific information such as architecture features, OS details, endianness, and memory characteristics. ## Parameters * ​value (`target`): The target architecture to query. Defaults to the current target. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `has_sse4` `static has_sse4() -> Bool` Checks if the target supports SSE4 instructions. **Returns:** True if the target supports SSE4, False otherwise. ### `is_x86` `static is_x86() -> Bool` Checks if the target is an x86 architecture. **Returns:** True if the target is x86, False otherwise. --- ## alignof `alignof[type: AnyType, target: target = _current_target()]() -> Int` Returns the alignment (in bytes) of the type. **Parameters:** * ​type (`AnyType`): The type in question. * ​target (`target`): The target architecture. **Returns:** The alignment of the type in bytes. `alignof[dtype: DType, target: target = _current_target()]() -> Int` Returns the alignment (in bytes) of the dtype. **Parameters:** * ​dtype (`DType`): The DType in question.
* ​target (`target`): The target architecture. **Returns:** The alignment of the dtype in bytes. --- ## bitwidthof `bitwidthof[type: AnyTrivialRegType, target: target = _current_target()]() -> Int` Returns the size (in bits) of the type. **Parameters:** * ​type (`AnyTrivialRegType`): The type in question. * ​target (`target`): The target architecture. **Returns:** The size of the type in bits. `bitwidthof[dtype: DType, target: target = _current_target()]() -> Int` Returns the size (in bits) of the dtype. **Parameters:** * ​dtype (`DType`): The type in question. * ​target (`target`): The target architecture. **Returns:** The size of the dtype in bits. --- ## has_accelerator `has_accelerator() -> Bool` Returns True if the host system has an accelerator and False otherwise. **Returns:** True if the host system has an accelerator. --- ## has_amd_gpu_accelerator `has_amd_gpu_accelerator() -> Bool` Returns True if the host system has an AMD GPU and False otherwise. **Returns:** True if the host system has an AMD GPU. --- ## has_avx `has_avx() -> Bool` Returns True if the host system has AVX, otherwise returns False. **Returns:** True if the host system has AVX, otherwise returns False. --- ## has_avx2 `has_avx2() -> Bool` Returns True if the host system has AVX2, otherwise returns False. **Returns:** True if the host system has AVX2, otherwise returns False. --- ## has_avx512f `has_avx512f() -> Bool` Returns True if the host system has AVX512, otherwise returns False. **Returns:** True if the host system has AVX512, otherwise returns False. --- ## has_fma `has_fma() -> Bool` Returns True if the host system has FMA (Fused Multiply-Add) support, otherwise returns False. **Returns:** True if the host system has FMA support, otherwise returns False. --- ## has_intel_amx `has_intel_amx() -> Bool` Returns True if the host system has Intel AMX support, otherwise returns False. **Returns:** True if the host system has Intel AMX and False otherwise. --- ## has_neon `has_neon() -> Bool` Returns True if the host system has Neon support, otherwise returns False. **Returns:** True if the host system supports the Neon instruction set. --- ## has_neon_int8_dotprod `has_neon_int8_dotprod() -> Bool` Returns True if the host system has the Neon int8 dot product extension, otherwise returns False. **Returns:** True if the host system supports the Neon int8 dot product extension and False otherwise. --- ## has_neon_int8_matmul `has_neon_int8_matmul() -> Bool` Returns True if the host system has the Neon int8 matrix multiplication extension (I8MM), otherwise returns False. **Returns:** True if the host system supports the Neon int8 matrix multiplication extension (I8MM) and False otherwise. --- ## has_nvidia_gpu_accelerator `has_nvidia_gpu_accelerator() -> Bool` Returns True if the host system has an NVIDIA GPU and False otherwise. **Returns:** True if the host system has an NVIDIA GPU. --- ## has_sse4 `has_sse4() -> Bool` Returns True if the host system has sse4, otherwise returns False. **Deprecated:** Use `CompilationTarget.has_sse4()` instead. **Returns:** True if the host system has sse4, otherwise returns False. --- ## has_vnni `has_vnni() -> Bool` Returns True if the host system has avx512\_vnni, otherwise returns False. **Returns:** True if the host system has avx512\_vnni, otherwise returns False. --- ## info Implements methods for querying the host target info. You can import these APIs from the `sys` package.
For example:

```mojo
from sys import CompilationTarget

print(CompilationTarget.is_x86())
```

## Structs * [​`CompilationTarget`](/mojo/stdlib/sys/info/CompilationTarget): A struct that provides information about a target architecture. ## Functions * [​`alignof`](/mojo/stdlib/sys/info/alignof): Returns the alignment (in bytes) of the type. * [​`bitwidthof`](/mojo/stdlib/sys/info/bitwidthof): Returns the size (in bits) of the type. * [​`has_accelerator`](/mojo/stdlib/sys/info/has_accelerator): Returns True if the host system has an accelerator and False otherwise. * [​`has_amd_gpu_accelerator`](/mojo/stdlib/sys/info/has_amd_gpu_accelerator): Returns True if the host system has an AMD GPU and False otherwise. * [​`has_avx`](/mojo/stdlib/sys/info/has_avx): Returns True if the host system has AVX, otherwise returns False. * [​`has_avx2`](/mojo/stdlib/sys/info/has_avx2): Returns True if the host system has AVX2, otherwise returns False. * [​`has_avx512f`](/mojo/stdlib/sys/info/has_avx512f): Returns True if the host system has AVX512, otherwise returns False. * [​`has_fma`](/mojo/stdlib/sys/info/has_fma): Returns True if the host system has FMA (Fused Multiply-Add) support, otherwise returns False. * [​`has_intel_amx`](/mojo/stdlib/sys/info/has_intel_amx): Returns True if the host system has Intel AMX support, otherwise returns False. * [​`has_neon`](/mojo/stdlib/sys/info/has_neon): Returns True if the host system has Neon support, otherwise returns False. * [​`has_neon_int8_dotprod`](/mojo/stdlib/sys/info/has_neon_int8_dotprod): Returns True if the host system has the Neon int8 dot product extension, otherwise returns False. * [​`has_neon_int8_matmul`](/mojo/stdlib/sys/info/has_neon_int8_matmul): Returns True if the host system has the Neon int8 matrix multiplication extension (I8MM), otherwise returns False. * [​`has_nvidia_gpu_accelerator`](/mojo/stdlib/sys/info/has_nvidia_gpu_accelerator): Returns True if the host system has an NVIDIA GPU and False otherwise. * [​`has_sse4`](/mojo/stdlib/sys/info/has_sse4): Returns True if the host system has sse4, otherwise returns False. * [​`has_vnni`](/mojo/stdlib/sys/info/has_vnni): Returns True if the host system has avx512\_vnni, otherwise returns False. * [​`is_32bit`](/mojo/stdlib/sys/info/is_32bit): Returns True if the maximum integral value is 32 bit. * [​`is_64bit`](/mojo/stdlib/sys/info/is_64bit): Returns True if the maximum integral value is 64 bit. * [​`is_amd_gpu`](/mojo/stdlib/sys/info/is_amd_gpu): Returns True if the target triple of the compiler is `amdgcn-amd-amdhsa` and False otherwise. * [​`is_apple_m1`](/mojo/stdlib/sys/info/is_apple_m1): Returns True if the host system is an Apple M1 with AMX support, otherwise returns False. * [​`is_apple_m2`](/mojo/stdlib/sys/info/is_apple_m2): Returns True if the host system is an Apple M2 with AMX support, otherwise returns False. * [​`is_apple_m3`](/mojo/stdlib/sys/info/is_apple_m3): Returns True if the host system is an Apple M3 with AMX support, otherwise returns False. * [​`is_apple_m4`](/mojo/stdlib/sys/info/is_apple_m4): Returns True if the host system is an Apple M4 with AMX support, otherwise returns False. * [​`is_apple_silicon`](/mojo/stdlib/sys/info/is_apple_silicon): Returns True if the host system is an Apple Silicon with AMX support, otherwise returns False. * [​`is_big_endian`](/mojo/stdlib/sys/info/is_big_endian): Returns True if the host endianness is big and False otherwise. * [​`is_gpu`](/mojo/stdlib/sys/info/is_gpu): Returns True if the target triple is GPU and False otherwise.
* [​`is_little_endian`](/mojo/stdlib/sys/info/is_little_endian): Returns True if the host endianness is little and False otherwise. * [​`is_neoverse_n1`](/mojo/stdlib/sys/info/is_neoverse_n1): Returns True if the host system is a Neoverse N1 system, otherwise returns False. * [​`is_nvidia_gpu`](/mojo/stdlib/sys/info/is_nvidia_gpu): Returns True if the target triple of the compiler is `nvptx64-nvidia-cuda` and False otherwise. * [​`is_triple`](/mojo/stdlib/sys/info/is_triple): Returns True if the target triple of the compiler matches the input and False otherwise. * [​`is_x86`](/mojo/stdlib/sys/info/is_x86): Returns True if the host system architecture is X86 and False otherwise. * [​`num_logical_cores`](/mojo/stdlib/sys/info/num_logical_cores): Returns the number of hardware threads, including hyperthreads across all CPU sockets. * [​`num_performance_cores`](/mojo/stdlib/sys/info/num_performance_cores): Returns the number of physical performance cores across all CPU sockets. If not known, returns the total number of physical cores. * [​`num_physical_cores`](/mojo/stdlib/sys/info/num_physical_cores): Returns the number of physical cores across all CPU sockets. * [​`os_is_linux`](/mojo/stdlib/sys/info/os_is_linux): Returns True if the host operating system is Linux. * [​`os_is_macos`](/mojo/stdlib/sys/info/os_is_macos): Returns True if the host operating system is macOS. * [​`os_is_windows`](/mojo/stdlib/sys/info/os_is_windows): Returns True if the host operating system is Windows. * [​`simdbitwidth`](/mojo/stdlib/sys/info/simdbitwidth): Returns the vector size (in bits) of the specified target. * [​`simdbytewidth`](/mojo/stdlib/sys/info/simdbytewidth): Returns the vector size (in bytes) of the specified target. * [​`simdwidthof`](/mojo/stdlib/sys/info/simdwidthof): Returns the vector size of the type on the host system. * [​`sizeof`](/mojo/stdlib/sys/info/sizeof): Returns the size (in bytes) of the type. --- ## is_32bit `is_32bit[target: target = _current_target()]() -> Bool` Returns True if the maximum integral value is 32 bit. **Parameters:** * ​target (`target`): The target architecture. **Returns:** True if the maximum integral value is 32 bit, False otherwise. --- ## is_64bit `is_64bit[target: target = _current_target()]() -> Bool` Returns True if the maximum integral value is 64 bit. **Parameters:** * ​target (`target`): The target architecture. **Returns:** True if the maximum integral value is 64 bit, False otherwise. --- ## is_amd_gpu `is_amd_gpu() -> Bool` Returns True if the target triple of the compiler is `amdgcn-amd-amdhsa` and False otherwise. **Returns:** True if the target triple is amdgpu and False otherwise. --- ## is_apple_m1 `is_apple_m1() -> Bool` Returns True if the host system is an Apple M1 with AMX support, otherwise returns False. **Returns:** True if the host system is an Apple M1 with AMX support and False otherwise. --- ## is_apple_m2 `is_apple_m2() -> Bool` Returns True if the host system is an Apple M2 with AMX support, otherwise returns False. **Returns:** True if the host system is an Apple M2 with AMX support and False otherwise. --- ## is_apple_m3 `is_apple_m3() -> Bool` Returns True if the host system is an Apple M3 with AMX support, otherwise returns False. **Returns:** True if the host system is an Apple M3 with AMX support and False otherwise. --- ## is_apple_m4 `is_apple_m4() -> Bool` Returns True if the host system is an Apple M4 with AMX support, otherwise returns False.
**Returns:** True if the host system is an Apple M4 with AMX support and False otherwise. --- ## is_apple_silicon `is_apple_silicon() -> Bool` Returns True if the host system is an Apple Silicon with AMX support, otherwise returns False. **Returns:** True if the host system is an Apple Silicon with AMX support and False otherwise. --- ## is_big_endian `is_big_endian[target: target = _current_target()]() -> Bool` Returns True if the host endianness is big and False otherwise. **Parameters:** * ​target (`target`): The target architecture. **Returns:** True if the host target is big endian and False otherwise. --- ## is_gpu `is_gpu() -> Bool` Returns True if the target triple is GPU and False otherwise. **Returns:** True if the target triple is GPU and False otherwise. --- ## is_little_endian `is_little_endian[target: target = _current_target()]() -> Bool` Returns True if the host endianness is little and False otherwise. **Parameters:** * ​target (`target`): The target architecture. **Returns:** True if the host target is little endian and False otherwise. --- ## is_neoverse_n1 `is_neoverse_n1() -> Bool` Returns True if the host system is a Neoverse N1 system, otherwise returns False. **Returns:** True if the host system is a Neoverse N1 system and False otherwise. --- ## is_nvidia_gpu `is_nvidia_gpu() -> Bool` Returns True if the target triple of the compiler is `nvptx64-nvidia-cuda` and False otherwise. **Returns:** True if the target triple is cuda and False otherwise. `is_nvidia_gpu[subarch: StringSlice[StaticConstantOrigin]]() -> Bool` Returns True if the target triple of the compiler is `nvptx64-nvidia-cuda` and we are compiling for the specified sub-architecture, and False otherwise. **Parameters:** * ​subarch (`StringSlice[StaticConstantOrigin]`): The subarchitecture (e.g. sm\_80). **Returns:** True if the target triple is cuda and False otherwise. --- ## is_triple `is_triple[: string, //, name: StringLiteral[$0], target: target = _current_target()]() -> Bool` Returns True if the target triple of the compiler matches the input and False otherwise. **Parameters:** * ​name (`StringLiteral[$0]`): The name of the triple value. * ​target (`target`): The triple value to be checked against. **Returns:** True if the triple matches and False otherwise. --- ## is_x86 `is_x86() -> Bool` Returns True if the host system architecture is X86 and False otherwise. **Deprecated:** Use `CompilationTarget.is_x86()` instead. **Returns:** True if the host system architecture is X86 and False otherwise. --- ## num_logical_cores `num_logical_cores() -> Int` Returns the number of hardware threads, including hyperthreads across all CPU sockets. **Returns:** Int: The number of threads on the system. --- ## num_performance_cores `num_performance_cores() -> Int` Returns the number of physical performance cores across all CPU sockets. If not known, returns the total number of physical cores. **Returns:** Int: The number of physical performance cores on the system. --- ## num_physical_cores `num_physical_cores() -> Int` Returns the number of physical cores across all CPU sockets. **Returns:** Int: The number of physical cores on the system. --- ## os_is_linux `os_is_linux() -> Bool` Returns True if the host operating system is Linux. **Returns:** True if the host operating system is Linux and False otherwise. --- ## os_is_macos `os_is_macos() -> Bool` Returns True if the host operating system is macOS. **Returns:** True if the host operating system is macOS and False otherwise.
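For example, a minimal sketch combining several of the host-query functions above (the printed labels are illustrative, not part of the API):

```mojo
from sys import (
    has_accelerator,
    num_logical_cores,
    num_physical_cores,
    os_is_linux,
    os_is_macos,
)

def main():
    # Core counts are gathered across all CPU sockets.
    print("logical cores:", num_logical_cores())
    print("physical cores:", num_physical_cores())

    # Host OS checks; at most one of these is True.
    if os_is_linux():
        print("host OS: Linux")
    elif os_is_macos():
        print("host OS: macOS")

    # True if a GPU or other accelerator is available to the host.
    print("accelerator present:", has_accelerator())
```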
--- ## os_is_windows `os_is_windows() -> Bool` Returns True if the host operating system is Windows. **Returns:** True if the host operating system is Windows and False otherwise. --- ## simdbitwidth `simdbitwidth[target: target = _current_target()]() -> Int` Returns the vector size (in bits) of the specified target. **Parameters:** * ​target (`target`): The target architecture. **Returns:** The vector size (in bits) of the specified target. --- ## simdbytewidth `simdbytewidth[target: target = _current_target()]() -> Int` Returns the vector size (in bytes) of the specified target. **Parameters:** * ​target (`target`): The target architecture. **Returns:** The vector size (in bytes) of the specified target. --- ## simdwidthof `simdwidthof[type: AnyTrivialRegType, target: target = _current_target()]() -> Int` Returns the vector size of the type on the host system. **Parameters:** * ​type (`AnyTrivialRegType`): The type in question. * ​target (`target`): The target architecture. **Returns:** The vector size of the type on the host system. `simdwidthof[dtype: DType, target: target = _current_target()]() -> Int` Returns the vector size of the type on the host system. **Parameters:** * ​dtype (`DType`): The DType in question. * ​target (`target`): The target architecture. **Returns:** The vector size of the dtype on the host system. --- ## sizeof `sizeof[type: AnyType, target: target = _current_target()]() -> Int` Returns the size (in bytes) of the type. Example:

```mojo
from sys.info import sizeof

def main():
    print(
        sizeof[UInt8]() == 1,
        sizeof[UInt16]() == 2,
        sizeof[Int32]() == 4,
        sizeof[Float64]() == 8,
        sizeof[SIMD[DType.uint8, 4]]() == 4,
    )
```

Note: `alignof` is in the same module. **Parameters:** * ​type (`AnyType`): The type in question. * ​target (`target`): The target architecture. **Returns:** The size of the type in bytes. `sizeof[dtype: DType, target: target = _current_target()]() -> Int` Returns the size (in bytes) of the dtype. **Parameters:** * ​dtype (`DType`): The DType in question. * ​target (`target`): The target architecture. **Returns:** The size of the dtype in bytes. --- ## PrefetchCache `@register_passable(trivial)` `struct PrefetchCache` Prefetch cache type. ## Fields * ​value (`SIMD[int32, 1]`): The cache prefetch. It should be in \[0, 1]. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `DATA` `alias DATA = PrefetchCache(1)` The data prefetching option. ### `INSTRUCTION` `alias INSTRUCTION = PrefetchCache(0)` The instruction prefetching option. ## Methods ### `__init__` `__init__(value: Int) -> Self` Constructs a prefetch option. **Args:** * ​value (`Int`): An integer value representing the prefetch cache option to be used. Should be a value in the range `[0, 1]`. --- ## PrefetchLocality `@register_passable(trivial)` `struct PrefetchLocality` The prefetch locality. The locality, rw, and cache type correspond to LLVM prefetch intrinsic's inputs (see [LLVM prefetch locality](https://llvm.org/docs/LangRef.html#llvm-prefetch-intrinsic)) ## Fields * ​value (`SIMD[int32, 1]`): The prefetch locality to use. It should be a value in \[0, 3]. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `HIGH` `alias HIGH = PrefetchLocality(3)` Extremely local locality (keep in cache). ### `LOW` `alias LOW = PrefetchLocality(1)` Low locality. ### `MEDIUM` `alias MEDIUM = PrefetchLocality(2)` Medium locality. ### `NONE` `alias NONE = PrefetchLocality(0)` No locality.
## Methods ### `__init__` `__init__(value: Int) -> Self` Constructs a prefetch locality option. **Args:** * ​value (`Int`): An integer value representing the locality. Should be a value in the range `[0, 3]`. --- ## PrefetchOptions `@register_passable(trivial)` `struct PrefetchOptions` Collection of configuration parameters for a prefetch intrinsic call. The op configuration follows a similar interface to the LLVM prefetch intrinsic, with a "locality" attribute that specifies the level of temporal locality in the application, that is, how soon the same data will be visited again. Possible locality values are: `NONE`, `LOW`, `MEDIUM`, and `HIGH`. The op also takes a "cache tag" attribute giving hints on how the prefetched data will be used. Possible tags are: `ReadICache`, `ReadDCache` and `WriteDCache`. Note: the actual behavior of the prefetch op and the concrete interpretation of these attributes are target-dependent. ## Fields * ​rw (`PrefetchRW`): Indicates prefetching for read or write. * ​locality (`PrefetchLocality`): Indicates locality level. * ​cache (`PrefetchCache`): Indicates i-cache or d-cache prefetching. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Constructs an instance of PrefetchOptions with default params. ### `for_read` `for_read(self) -> Self` Sets the prefetch purpose to read. **Returns:** The updated prefetch parameter. ### `for_write` `for_write(self) -> Self` Sets the prefetch purpose to write. **Returns:** The updated prefetch parameter. ### `no_locality` `no_locality(self) -> Self` Sets the prefetch locality to none. **Returns:** The updated prefetch parameter. ### `low_locality` `low_locality(self) -> Self` Sets the prefetch locality to low. **Returns:** The updated prefetch parameter. ### `medium_locality` `medium_locality(self) -> Self` Sets the prefetch locality to medium. **Returns:** The updated prefetch parameter. ### `high_locality` `high_locality(self) -> Self` Sets the prefetch locality to high. **Returns:** The updated prefetch parameter. ### `to_data_cache` `to_data_cache(self) -> Self` Sets the prefetch target to data cache. **Returns:** The updated prefetch parameter. ### `to_instruction_cache` `to_instruction_cache(self) -> Self` Sets the prefetch target to instruction cache. **Returns:** The updated prefetch parameter. --- ## PrefetchRW `@register_passable(trivial)` `struct PrefetchRW` Prefetch read or write. ## Fields * ​value (`SIMD[int32, 1]`): The read-write prefetch. It should be in \[0, 1]. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `READ` `alias READ = PrefetchRW(0)` Read prefetch. ### `WRITE` `alias WRITE = PrefetchRW(1)` Write prefetch. ## Methods ### `__init__` `__init__(value: Int) -> Self` Constructs a prefetch read-write option. **Args:** * ​value (`Int`): An integer value representing the prefetch read-write option to be used. Should be a value in the range `[0, 1]`. --- ## assume `assume(val: Bool)` Signals to the optimizer that the condition is always true. This allows the optimizer to optimize the code. **Args:** * ​val (`Bool`): The input value which is assumed to be `True`. --- ## ballot `ballot[dtype: DType](value: Bool) -> SIMD[dtype, 1]` Returns a bitfield (Int32 or Int64) containing the result of its Bool argument in all active lanes, and zero in all inactive lanes. For example, ballot(True) returns the EXEC mask. **Parameters:** * ​dtype (`DType`): The DType of the return type.
**Args:** * ​value (`Bool`): The value to place across the mask. **Returns:** A bitfield (Int32 or Int64) containing the result of its Bool argument in all active lanes. --- ## compressed_store `compressed_store[dtype: DType, size: Int](value: SIMD[dtype, size], addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], mask: SIMD[bool, size])` Compresses the lanes of `value`, skipping `mask` lanes, and stores at `addr`. **Parameters:** * ​dtype (`DType`): DType of `value`, the value to store. * ​size (`Int`): Size of `value`, the value to store. **Args:** * ​value (`SIMD[dtype, size]`): The vector containing data to store. * ​addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The memory location to store the compressed data. * ​mask (`SIMD[bool, size]`): A binary vector which prevents memory access to certain lanes of `value`. --- ## expect `expect[T: AnyTrivialRegType, //, expected_val: T](val: T) -> T` Provides information about the expected (most probable) value of `val`, which can be used by optimizers. Notes: Only works with integer/boolean types. **Parameters:** * ​T (`AnyTrivialRegType`): The type of the input value. * ​expected\_val (`T`): The expected value of `val`. **Args:** * ​val (`T`): The input value. **Returns:** The input value. --- ## gather `gather[dtype: DType, size: Int, //, *, invariant: Bool = False](owned base: SIMD[index, size], mask: SIMD[bool, size], passthrough: SIMD[dtype, size], alignment: Int = 0) -> SIMD[dtype, size]` Reads scalar values from a SIMD vector, and gathers them into one vector. The gather function reads scalar values from a SIMD vector of memory locations and gathers them into one vector. The memory locations are provided in the vector of pointers `base` as addresses. The memory is accessed according to the provided mask. The mask holds a bit for each vector lane, and is used to prevent memory accesses to the masked-off lanes. The masked-off lanes in the result vector are taken from the corresponding lanes of the `passthrough` operand. In general, for some vector of pointers `base`, mask `mask`, and passthrough `passthrough` a call of the form:

```mojo
result = gather(base, mask, passthrough)
```

is equivalent to the following sequence of scalar loads in C++:

```cpp
for (int i = 0; i < size; i++)
    result[i] = mask[i] ? *base[i] : passthrough[i];
```

**Parameters:** * ​dtype (`DType`): DType of the return SIMD buffer. * ​size (`Int`): Size of the return SIMD buffer. * ​invariant (`Bool`): Whether the memory is load invariant. **Args:** * ​base (`SIMD[index, size]`): The vector containing memory addresses that gather will access. * ​mask (`SIMD[bool, size]`): A binary vector which prevents memory access to certain lanes of the base vector. * ​passthrough (`SIMD[dtype, size]`): In the result vector, the masked-off lanes are replaced with the passthrough vector. * ​alignment (`Int`): The alignment of the source addresses. Must be 0 or a power of two constant integer value. **Returns:** A SIMD\[dtype, size] containing the result of the gather operation. --- ## implicitarg_ptr `implicitarg_ptr() -> UnsafePointer[SIMD[uint8, 1], address_space=AddressSpace(4)]` Get a pointer to AMD's implicit arguments table. **Returns:** A pointer to AMD's implicit arguments table. --- ## intrinsics Defines intrinsics. You can import these APIs from the `sys` package.
For example:

```mojo
from sys import PrefetchLocality
```

## Aliases ### `block_dim` `alias block_dim = _BlockDim()` ### `block_id_in_cluster` `alias block_id_in_cluster = _Cluster_BlockIdx()` ### `block_idx` `alias block_idx = _BlockIdx()` ### `cluster_dim` `alias cluster_dim = _ClusterDim()` ### `cluster_idx` `alias cluster_idx = _ClusterIdx()` ### `global_idx` `alias global_idx = _GridIdx()` ### `grid_dim` `alias grid_dim = _GridDim()` ### `thread_idx` `alias thread_idx = _ThreadIdx()` ## Structs * [​`PrefetchCache`](/mojo/stdlib/sys/intrinsics/PrefetchCache): Prefetch cache type. * [​`PrefetchLocality`](/mojo/stdlib/sys/intrinsics/PrefetchLocality): The prefetch locality. * [​`PrefetchOptions`](/mojo/stdlib/sys/intrinsics/PrefetchOptions): Collection of configuration parameters for a prefetch intrinsic call. * [​`PrefetchRW`](/mojo/stdlib/sys/intrinsics/PrefetchRW): Prefetch read or write. ## Functions * [​`assume`](/mojo/stdlib/sys/intrinsics/assume): Signals to the optimizer that the condition is always true. This allows the optimizer to optimize the code. * [​`ballot`](/mojo/stdlib/sys/intrinsics/ballot): Returns a bitfield (Int32 or Int64) containing the result of its Bool argument in all active lanes, and zero in all inactive lanes. For example, ballot(True) returns the EXEC mask. * [​`compressed_store`](/mojo/stdlib/sys/intrinsics/compressed_store): Compresses the lanes of `value`, skipping `mask` lanes, and stores at `addr`. * [​`expect`](/mojo/stdlib/sys/intrinsics/expect): Provides information about the expected (most probable) value of `val`, which can be used by optimizers. * [​`gather`](/mojo/stdlib/sys/intrinsics/gather): Reads scalar values from a SIMD vector, and gathers them into one vector. * [​`implicitarg_ptr`](/mojo/stdlib/sys/intrinsics/implicitarg_ptr): Get a pointer to AMD's implicit arguments table. * [​`lane_id`](/mojo/stdlib/sys/intrinsics/lane_id): Returns the lane ID of the current thread. * [​`likely`](/mojo/stdlib/sys/intrinsics/likely): Provides information that the most probable value of `val` is going to be `True`. This information can be used by optimizers. * [​`llvm_intrinsic`](/mojo/stdlib/sys/intrinsics/llvm_intrinsic): Calls an LLVM intrinsic with the name `intrin` and return type `type`. * [​`masked_load`](/mojo/stdlib/sys/intrinsics/masked_load): Loads data from memory and returns it, replacing masked lanes with values from the passthrough vector. * [​`masked_store`](/mojo/stdlib/sys/intrinsics/masked_store): Stores a value at a memory location, skipping masked lanes. * [​`prefetch`](/mojo/stdlib/sys/intrinsics/prefetch): Prefetches an instruction or data into cache before it is used. * [​`readfirstlane`](/mojo/stdlib/sys/intrinsics/readfirstlane): Get the value in the lowest active lane of the input operand. * [​`scatter`](/mojo/stdlib/sys/intrinsics/scatter): Takes scalar values from a SIMD vector and `scatters` them into a vector of pointers. * [​`sendmsg`](/mojo/stdlib/sys/intrinsics/sendmsg): Send a message to fixed function hardware. Refer to the specific ISA manual for the ops and messages. * [​`strided_load`](/mojo/stdlib/sys/intrinsics/strided_load): Loads values from addr according to a specific stride. * [​`strided_store`](/mojo/stdlib/sys/intrinsics/strided_store): Stores values to addr according to a specific stride. * [​`unlikely`](/mojo/stdlib/sys/intrinsics/unlikely): Provides information that the most probable value of `val` is going to be `False`. This information can be used by optimizers.
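For instance, the branch-probability hints `likely` and `unlikely` (documented below) wrap a `Bool` and return it unchanged while informing the optimizer which branch is expected. A minimal sketch (the function and its fast path are illustrative):

```mojo
from sys import likely

fn clamp_non_negative(x: Int) -> Int:
    # Hint that the fast path (x >= 0) is taken most of the time.
    if likely(x >= 0):
        return x
    return 0

def main():
    print(clamp_non_negative(5))   # 5
    print(clamp_non_negative(-3))  # 0
```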
--- ## lane_id `lane_id() -> UInt` Returns the lane ID of the current thread. **Returns:** The lane ID of the current thread. --- ## likely `likely(val: Bool) -> Bool` Provides information that the most probable value of `val` is going to be `True`. This information can be used by optimizers. **Args:** * ​val (`Bool`): The input value which is likely to be `True` most of the time. **Returns:** The input value. --- ## llvm_intrinsic `llvm_intrinsic[intrin: StringSlice[StaticConstantOrigin], type: AnyTrivialRegType, *types: AnyType, *, has_side_effect: Bool = True](*args: *types) -> type` Calls an LLVM intrinsic with the name `intrin` and return type `type`. **Parameters:** * ​intrin (`StringSlice[StaticConstantOrigin]`): The name of the llvm intrinsic. * ​type (`AnyTrivialRegType`): The return type of the intrinsic. * ​\*types (`AnyType`): The argument types for the function. * ​has\_side\_effect (`Bool`): If `True` the intrinsic will have side effects, otherwise it's pure. **Args:** * ​\*args (`*types`): The arguments to the function. **Returns:** The result of calling the llvm intrinsic with the given arguments. --- ## masked_load `masked_load[dtype: DType, //, size: Int](addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=False, origin=origin], mask: SIMD[bool, size], passthrough: SIMD[dtype, size], alignment: Int = 1) -> SIMD[dtype, size]` Loads data from memory and returns it, replacing masked lanes with values from the passthrough vector. **Parameters:** * ​dtype (`DType`): DType of the return SIMD buffer. * ​size (`Int`): Size of the return SIMD buffer. **Args:** * ​addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=False, origin=origin]`): The base pointer for the load. * ​mask (`SIMD[bool, size]`): A binary vector which prevents memory access to certain lanes of the memory stored at addr. * ​passthrough (`SIMD[dtype, size]`): In the result vector, the masked-off lanes are replaced with the passthrough vector. * ​alignment (`Int`): The alignment of the source addresses. Must be 0 or a power of two constant integer value. Default is 1. **Returns:** The loaded memory stored in a vector of type SIMD\[dtype, size]. --- ## masked_store `masked_store[size: Int](value: SIMD[dtype, size], addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], mask: SIMD[bool, size], alignment: Int = 1)` Stores a value at a memory location, skipping masked lanes. **Parameters:** * ​size (`Int`): Size of `value`, the data to store. **Args:** * ​value (`SIMD[dtype, size]`): The vector containing data to store. * ​addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The memory location to store data at. * ​mask (`SIMD[bool, size]`): A binary vector which prevents memory access to certain lanes of `value`. * ​alignment (`Int`): The alignment of the destination locations. Must be 0 or a power of two constant integer value. --- ## prefetch `prefetch[dtype: DType, //, params: PrefetchOptions = PrefetchOptions()](addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin])` Prefetches an instruction or data into cache before it is used. The prefetch function provides prefetching hints for the target to prefetch instruction or data into cache before they are used. **Parameters:** * ​dtype (`DType`): The DType of value stored in addr.
* ​params (`PrefetchOptions`): Configuration options for the prefetch intrinsic. **Args:** * ​addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The data pointer to prefetch. --- ## readfirstlane `readfirstlane(value: SIMD[int32, 1]) -> SIMD[int32, 1]` Get the value in the lowest active lane of the input operand. **Args:** * ​value (`SIMD[int32, 1]`): The input value. **Returns:** The value in the lowest active lane of the input operand. `readfirstlane(value: UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Get the value in the lowest active lane of the input operand. **Args:** * ​value (`UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The input pointer. **Returns:** The value in the lowest active lane of the input operand. `readfirstlane(value: Int) -> Int` Get the value in the lowest active lane of the input operand. **Args:** * ​value (`Int`): The input value. **Returns:** The value in the lowest active lane of the input operand. --- ## scatter `scatter[dtype: DType, size: Int, //](value: SIMD[dtype, size], owned base: SIMD[index, size], mask: SIMD[bool, size], alignment: Int = 0)` Takes scalar values from a SIMD vector and `scatters` them into a vector of pointers. The scatter operation stores scalar values from a SIMD vector into a vector of memory locations. The memory locations are provided in the vector of pointers `base` as addresses. The memory is stored according to the provided mask. The mask holds a bit for each vector lane, and is used to prevent memory accesses to the masked-off lanes. The `value` operand is a vector value to be written to memory. The `base` operand is a vector of pointers, pointing to where the value elements should be stored. It has the same underlying type as the value operand. The `mask` operand is a vector of boolean values. The types of the `mask` and the `value` operand must have the same number of vector elements. Scatter with overlapping addresses is guaranteed to be ordered from least-significant to most-significant element. In general, for some vector `value`, vector of pointers `base`, and mask `mask` a call of the form:

```mojo
scatter(value, base, mask)
```

is equivalent to the following sequence of scalar stores in C++:

```cpp
for (int i = 0; i < size; i++)
    if (mask[i])
        *base[i] = value[i];
```

**Parameters:** * ​dtype (`DType`): DType of `value`, the vector to scatter. * ​size (`Int`): Size of `value`, the vector to scatter. **Args:** * ​value (`SIMD[dtype, size]`): The vector containing the values to scatter. * ​base (`SIMD[index, size]`): The vector containing memory addresses that scatter will access. * ​mask (`SIMD[bool, size]`): A binary vector which prevents memory access to certain lanes of the base vector. * ​alignment (`Int`): The alignment of the destination addresses. Must be 0 or a power of two constant integer value. --- ## sendmsg `sendmsg(opcode: SIMD[int32, 1], msg: SIMD[int32, 1])` Send a message to fixed function hardware. Refer to the specific ISA manual for the ops and messages. **Args:** * ​opcode (`SIMD[int32, 1]`): The operation to perform. * ​msg (`SIMD[int32, 1]`): The message to send.
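To make the masking semantics concrete, here is a minimal sketch using `masked_load` from above (the buffer contents and mask pattern are illustrative):

```mojo
from sys.intrinsics import masked_load
from memory import UnsafePointer

def main():
    # A small buffer with known contents: [0, 1, 2, 3].
    var ptr = UnsafePointer[Float32].alloc(4)
    for i in range(4):
        ptr[i] = Float32(i)

    # Load only the even lanes; masked-off lanes take the passthrough value.
    var mask = SIMD[DType.bool, 4](True, False, True, False)
    var passthrough = SIMD[DType.float32, 4](-1)
    print(masked_load[4](ptr, mask, passthrough))  # [0.0, -1.0, 2.0, -1.0]

    ptr.free()
```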
--- ## strided_load `strided_load[dtype: DType, //, simd_width: Int, *, invariant: Bool = False](addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=False, origin=origin], stride: Int, mask: SIMD[bool, simd_width] = SIMD(True)) -> SIMD[dtype, simd_width]` Loads values from addr according to a specific stride. **Parameters:** * ​dtype (`DType`): DType of the values to load. * ​simd\_width (`Int`): The width of the SIMD vectors. * ​invariant (`Bool`): Whether the memory is load invariant. **Args:** * ​addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=False, origin=origin]`): The memory location to load data from. * ​stride (`Int`): How many lanes to skip before loading again. * ​mask (`SIMD[bool, simd_width]`): A binary vector which prevents memory access to certain lanes. **Returns:** A vector containing the loaded data. --- ## strided_store `strided_store[dtype: DType, //, simd_width: Int](value: SIMD[dtype, simd_width], addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], stride: Int, mask: SIMD[bool, simd_width] = SIMD(True))` Stores values to addr according to a specific stride. **Parameters:** * ​dtype (`DType`): DType of `value`, the value to store. * ​simd\_width (`Int`): The width of the SIMD vectors. **Args:** * ​value (`SIMD[dtype, simd_width]`): The values to store. * ​addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The location to store values at. * ​stride (`Int`): How many lanes to skip before storing again. * ​mask (`SIMD[bool, simd_width]`): A binary vector which prevents memory access to certain lanes of `value`. --- ## unlikely `unlikely(val: Bool) -> Bool` Provides information that the most probable value of `val` is going to be `False`. This information can be used by optimizers. **Args:** * ​val (`Bool`): The input value which is likely to be `False` most of the time. **Returns:** The input value. --- ## env_get_bool `env_get_bool[name: StringSlice[StaticConstantOrigin]]() -> Bool` Try to get a boolean-valued define. Compilation fails if the name is not defined or the value is neither `True` nor `False`. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the define. **Returns:** A boolean parameter value. `env_get_bool[name: StringSlice[StaticConstantOrigin], default: Bool]() -> Bool` Try to get a bool-valued define. If the name is not defined, return a default value instead. The boolean must be either `True` or `False`. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the define. * ​default (`Bool`): The default value to use. **Returns:** A bool parameter value. --- ## env_get_dtype `env_get_dtype[name: StringSlice[StaticConstantOrigin], default: DType]() -> DType` Try to get a DType-valued define. If the name is not defined, return a default value instead. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the define. * ​default (`DType`): The default value to use. **Returns:** A DType parameter value. --- ## env_get_int `env_get_int[name: StringSlice[StaticConstantOrigin]]() -> Int` Try to get an integer-valued define. Compilation fails if the name is not defined. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the define. **Returns:** An integer parameter value.
`env_get_int[name: StringSlice[StaticConstantOrigin], default: Int]() -> Int` Try to get an integer-valued define. If the name is not defined, return a default value instead. Example:

```mojo
from sys.param_env import env_get_int

def main():
    alias number = env_get_int[
        "favorite_number",
        1  # Default value
    ]()
    parametrized[number]()

fn parametrized[num: Int]():
    print(num)
```

If the program is `app.mojo`: * `mojo run -D favorite_number=2 app.mojo` * `mojo run app.mojo` (uses the default value) Note: useful for parameterizing SIMD vector sizes. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the define. * ​default (`Int`): The default value to use. **Returns:** An integer parameter value. --- ## env_get_string `env_get_string[name: StringSlice[StaticConstantOrigin]]() -> StringSlice[StaticConstantOrigin]` Try to get a string-valued define. Compilation fails if the name is not defined. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the define. **Returns:** A string parameter value. `env_get_string[name: StringSlice[StaticConstantOrigin], default: StringSlice[StaticConstantOrigin]]() -> StringSlice[StaticConstantOrigin]` Try to get a string-valued define. If the name is not defined, return a default value instead. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name of the define. * ​default (`StringSlice[StaticConstantOrigin]`): The default value to use. **Returns:** A string parameter value. --- ## param_env Implements functions for retrieving compile-time defines. You can use these functions to set parameter values or runtime constants based on name-value pairs defined on the command line. For example:

```mojo
from sys import is_defined

alias float_type = DType.float32 if is_defined["FLOAT32"]() else DType.float64

# Use `float_type` as a constant.
```

And on the command line:

```
mojo -D FLOAT32 main.mojo
```

For more information, see the [Mojo build docs](/mojo/cli/build.html#d-keyvalue). The `mojo run` command also supports the `-D` option. You can import these APIs from the `sys` package. For example:

```mojo
from sys import is_defined
```

## Functions * [​`env_get_bool`](/mojo/stdlib/sys/param_env/env_get_bool): Try to get a boolean-valued define. Compilation fails if the name is not defined or the value is neither `True` nor `False`. * [​`env_get_dtype`](/mojo/stdlib/sys/param_env/env_get_dtype): Try to get a DType-valued define. If the name is not defined, return a default value instead. * [​`env_get_int`](/mojo/stdlib/sys/param_env/env_get_int): Try to get an integer-valued define. Compilation fails if the name is not defined. * [​`env_get_string`](/mojo/stdlib/sys/param_env/env_get_string): Try to get a string-valued define. Compilation fails if the name is not defined. * [​`is_defined`](/mojo/stdlib/sys/param_env/is_defined): Returns True if the named value is defined. --- ## is_defined `is_defined[name: StringSlice[StaticConstantOrigin]]() -> Bool` Returns True if the named value is defined. **Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): The name to test. **Returns:** True if the name is defined. --- ## exit `exit()` Exits from Mojo. Unlike the Python implementation this does not raise an exception to exit. `exit[intable: Intable](code: intable)` Exits from Mojo. Unlike the Python implementation this does not raise an exception to exit. **Parameters:** * ​intable (`Intable`): The type of the exit code. **Args:** * ​code (`intable`): The exit code. --- ## terminate This module includes the exit functions.
## Functions * [​`exit`](/mojo/stdlib/sys/terminate/exit): Exits from Mojo. Unlike the Python implementation this does not raise an exception to exit. --- ## tempfile Implements the tempfile package. ## Modules * [​`tempfile`](/mojo/stdlib/tempfile/tempfile/): Implements tempfile methods. --- ## NamedTemporaryFile `struct NamedTemporaryFile` A handle to a temporary file. Example:

```mojo
from tempfile import NamedTemporaryFile
from pathlib import Path

def main():
    var p: Path
    with NamedTemporaryFile(mode="rw") as f:
        p = f.name
        f.write("Hello world!")
        f.seek(0)
        print(
            f.read() == "Hello world!"
        )
    print(String(p), p.exists())  # Removed by default
```

Note: `NamedTemporaryFile.__init__` documents the arguments. ## Fields * ​name (`String`): Name of the file. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, mode: String = __init__[__mlir_type.!kgen.string]("w"), name: Optional[String] = Optional(None), suffix: String = __init__[__mlir_type.!kgen.string](""), prefix: String = __init__[__mlir_type.!kgen.string]("tmp"), dir: Optional[String] = Optional(None), delete: Bool = True)` Create a named temporary file. This is a wrapper around a `FileHandle`; `os.remove()` is called in the `close()` method if `delete` is True. Can be used as a context manager. When used as a context manager, `close()` is called when the context manager exits. **Args:** * ​mode (`String`): The mode to open the file in (the mode can be "r" or "w"). * ​name (`Optional[String]`): The name of the temp file. If it is unspecified, then a random name will be provided. * ​suffix (`String`): Suffix to use for the file name if name is not provided. * ​prefix (`String`): Prefix to use for the file name if name is not provided. * ​dir (`Optional[String]`): Directory in which the file will be created. * ​delete (`Bool`): Whether the file is deleted on close. ### `__moveinit__` `__moveinit__(out self, owned existing: Self)` Move constructor for the file handle. **Args:** * ​existing (`Self`): The existing file handle. ### `__del__` `__del__(owned self)` Closes the file handle. ### `close` `close(mut self)` Closes the file handle. ### `read` `read(self, size: Int = -1) -> String` Reads the data from the file. **Args:** * ​size (`Int`): Requested number of bytes to read. **Returns:** The contents of the file. ### `read_bytes` `read_bytes(self, size: Int = -1) -> List[SIMD[uint8, 1]]` Read from file buffer until we have `size` characters or we hit EOF. If `size` is negative or omitted, read until EOF. **Args:** * ​size (`Int`): Requested number of bytes to read. **Returns:** The contents of the file. ### `seek` `seek(self, offset: SIMD[uint64, 1], whence: SIMD[uint8, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> SIMD[uint64, 1]` Seeks to the given offset in the file. **Args:** * ​offset (`SIMD[uint64, 1]`): The byte offset to seek to from the start of the file. * ​whence (`SIMD[uint8, 1]`): The reference point for the offset: os.SEEK\_SET = 0: start of file (Default). os.SEEK\_CUR = 1: current position. os.SEEK\_END = 2: end of file. **Returns:** The resulting byte offset from the start of the file. **Raises:** An error if this file handle is invalid, or if file seek returned a failure. ### `write` `write[*Ts: Writable](mut self, *args: *Ts)` Write a sequence of Writable arguments to the provided Writer. **Parameters:** * ​\*Ts (`Writable`): Types of the provided argument sequence. **Args:** * ​\*args (`*Ts`): Sequence of arguments to write to this Writer.
### `write_bytes` `write_bytes(mut self, bytes: Span[SIMD[uint8, 1], origin])` Write a span of bytes to the file. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The byte span to write to this file. ### `__enter__` `__enter__(owned self) -> Self` The function to call when entering the context. **Returns:** The file handle. --- ## TemporaryDirectory `struct TemporaryDirectory` A temporary directory. ## Fields * ​name (`String`): The name of the temporary directory. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, suffix: String = __init__[__mlir_type.!kgen.string](""), prefix: String = __init__[__mlir_type.!kgen.string]("tmp"), dir: Optional[String] = Optional(None), ignore_cleanup_errors: Bool = False)` Create a temporary directory. Can be used as a context manager. When used as a context manager, the directory is removed when the context manager exits. **Args:** * ​suffix (`String`): Suffix to use for the directory name. * ​prefix (`String`): Prefix to use for the directory name. * ​dir (`Optional[String]`): Directory in which the directory will be created. * ​ignore\_cleanup\_errors (`Bool`): Whether to ignore cleanup errors. ### `__enter__` `__enter__(self) -> String` The function to call when entering the context. **Returns:** The temporary directory name. ### `__exit__` `__exit__(self)` Called when exiting the context with no error. `__exit__(self, err: Error) -> Bool` Called when exiting the context with an error. **Args:** * ​err (`Error`): The error raised inside the context. **Returns:** True if the temporary directory was removed successfully. --- ## gettempdir `gettempdir() -> Optional[String]` Return the default directory to use for temporary files. **Returns:** The name of the default temporary directory. --- ## tempfile Implements tempfile methods. You can import a method from the `tempfile` package. For example: ```mojo from tempfile import gettempdir ``` ## Aliases ### `TMP_MAX` `alias TMP_MAX = 10000` ## Structs * [​`NamedTemporaryFile`](/mojo/stdlib/tempfile/tempfile/NamedTemporaryFile): A handle to a temporary file. * [​`TemporaryDirectory`](/mojo/stdlib/tempfile/tempfile/TemporaryDirectory): A temporary directory. ## Functions * [​`gettempdir`](/mojo/stdlib/tempfile/tempfile/gettempdir): Return the default directory to use for temporary files. * [​`mkdtemp`](/mojo/stdlib/tempfile/tempfile/mkdtemp): Create a temporary directory. Caller is responsible for deleting the directory when done with it. --- ## mkdtemp `mkdtemp(suffix: String = __init__[__mlir_type.!kgen.string](""), prefix: String = __init__[__mlir_type.!kgen.string]("tmp"), dir: Optional[String] = Optional(None)) -> String` Create a temporary directory. Caller is responsible for deleting the directory when done with it. **Args:** * ​suffix (`String`): Suffix to use for the directory name. * ​prefix (`String`): Prefix to use for the directory name. * ​dir (`Optional[String]`): Directory in which the directory will be created. **Returns:** The name of the created directory. **Raises:** If the directory can not be created. --- ## testing Implements the testing package. ## Modules * [​`testing`](/mojo/stdlib/testing/testing/): Implements various testing utils. 
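The assertion functions documented below follow a common pattern: each raises an `Error` on failure and returns nothing on success, so they compose naturally with `try`/`except` or a test runner. A minimal sketch (the checked values are illustrative):

```mojo
from testing import assert_equal, assert_true, assert_raises

def main():
    assert_equal(2 + 2, 4)
    assert_true(3 > 2, msg="ordering should hold")

    # The block must raise an error containing the given text.
    with assert_raises(contains="SomeError"):
        raise "SomeError"

    print("all assertions passed")
```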
--- ## assert_almost_equal `assert_almost_equal[dtype: DType, size: Int](lhs: SIMD[dtype, size], rhs: SIMD[dtype, size], msg: String = __init__[__mlir_type.!kgen.string](""), *, atol: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1.0E-8), rtol: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1.0000000000000001E-5), equal_nan: Bool = False, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are equal up to a tolerance. If it is not then an Error is raised. When the type is boolean or integral, then equality is checked. When the type is floating-point, then this checks if the two input values are numerically close using the $abs(lhs - rhs) <= max(rtol * max(abs(lhs), abs(rhs)), atol)$ formula. **Parameters:** * ​dtype (`DType`): The dtype of the left- and right-hand-side SIMD vectors. * ​size (`Int`): The width of the left- and right-hand-side SIMD vectors. **Args:** * ​lhs (`SIMD[dtype, size]`): The lhs of the equality. * ​rhs (`SIMD[dtype, size]`): The rhs of the equality. * ​msg (`String`): The message to print. * ​atol (`SIMD[float64, 1]`): The absolute tolerance. * ​rtol (`SIMD[float64, 1]`): The relative tolerance. * ​equal\_nan (`Bool`): Whether to treat NaNs as equal. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. --- ## assert_equal `assert_equal[T: EqualityComparable & Stringable, //](lhs: T, rhs: T, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are equal. If it is not then an Error is raised. **Parameters:** * ​T (`EqualityComparable & Stringable`): The type of the input values. **Args:** * ​lhs (`T`): The lhs of the equality. * ​rhs (`T`): The rhs of the equality. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_equal(lhs: String, rhs: String, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are equal. If it is not then an Error is raised. **Args:** * ​lhs (`String`): The lhs of the equality. * ​rhs (`String`): The rhs of the equality. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_equal[dtype: DType, size: Int](lhs: SIMD[dtype, size], rhs: SIMD[dtype, size], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are equal. If it is not then an Error is raised. **Parameters:** * ​dtype (`DType`): The dtype of the left- and right-hand-side SIMD vectors. * ​size (`Int`): The width of the left- and right-hand-side SIMD vectors. **Args:** * ​lhs (`SIMD[dtype, size]`): The lhs of the equality. * ​rhs (`SIMD[dtype, size]`): The rhs of the equality. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise.
`assert_equal[T: Copyable & Movable & EqualityComparable & Representable, //](lhs: List[T], rhs: List[T], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that two lists are equal. **Parameters:** * ​T (`Copyable & Movable & EqualityComparable & Representable`): The type of the elements in the lists. **Args:** * ​lhs (`List[T]`): The left-hand side list. * ​rhs (`List[T]`): The right-hand side list. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_equal[O1: ImmutableOrigin, O2: ImmutableOrigin](lhs: List[StringSlice[O1]], rhs: List[StringSlice[O2]], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that two lists are equal. **Parameters:** * ​O1 (`ImmutableOrigin`): The origin of lhs. * ​O2 (`ImmutableOrigin`): The origin of rhs. **Args:** * ​lhs (`List[StringSlice[O1]]`): The left-hand side list. * ​rhs (`List[StringSlice[O2]]`): The right-hand side list. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_equal[D: DType](lhs: List[SIMD[D, 1]], rhs: List[SIMD[D, 1]], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that two lists are equal. **Parameters:** * ​D (`DType`): A DType. **Args:** * ​lhs (`List[SIMD[D, 1]]`): The left-hand side list. * ​rhs (`List[SIMD[D, 1]]`): The right-hand side list. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_equal(lhs: PythonObject, rhs: PythonObject, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are equal. If it is not then an Error is raised. **Args:** * ​lhs (`PythonObject`): The lhs of the equality. * ​rhs (`PythonObject`): The rhs of the equality. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails. --- ## assert_false `assert_false[T: Boolable, //](val: T, msg: String = __init__[__mlir_type.!kgen.string]("condition was unexpectedly True"), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input value is False and raises an Error if it's not. **Parameters:** * ​T (`Boolable`): The type of the value argument. **Args:** * ​val (`T`): The value to assert to be False. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise.
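As with `assert_true`, the default message surfaces in the raised `Error`. A minimal sketch that catches and prints it (the arithmetic is illustrative):

```mojo
from testing import assert_false

def main():
    try:
        assert_false(1 + 1 == 2)
    except e:
        # The error message includes "condition was unexpectedly True".
        print(e)
```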
--- ## assert_is `assert_is[T: Stringable & Identifiable](lhs: T, rhs: T, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values have the same identity. If they do not then an Error is raised. **Parameters:** * ​T (`Stringable & Identifiable`): A Stringable and Identifiable type. **Args:** * ​lhs (`T`): The lhs of the `is` statement. * ​rhs (`T`): The rhs of the `is` statement. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. --- ## assert_is_not `assert_is_not[T: Stringable & Identifiable](lhs: T, rhs: T, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values have different identities. If they do not then an Error is raised. **Parameters:** * ​T (`Stringable & Identifiable`): A Stringable and Identifiable type. **Args:** * ​lhs (`T`): The lhs of the `is not` statement. * ​rhs (`T`): The rhs of the `is not` statement. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. --- ## assert_not_equal `assert_not_equal[T: EqualityComparable & Stringable, //](lhs: T, rhs: T, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are not equal. If they are equal then an Error is raised. **Parameters:** * ​T (`EqualityComparable & Stringable`): The type of the input values. **Args:** * ​lhs (`T`): The lhs of the inequality. * ​rhs (`T`): The rhs of the inequality. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_not_equal(lhs: String, rhs: String, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are not equal. If they are equal then an Error is raised. **Args:** * ​lhs (`String`): The lhs of the inequality. * ​rhs (`String`): The rhs of the inequality. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_not_equal[dtype: DType, size: Int](lhs: SIMD[dtype, size], rhs: SIMD[dtype, size], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are not equal. If they are equal then an Error is raised. **Parameters:** * ​dtype (`DType`): The dtype of the left- and right-hand-side SIMD vectors. * ​size (`Int`): The width of the left- and right-hand-side SIMD vectors. **Args:** * ​lhs (`SIMD[dtype, size]`): The lhs of the inequality. * ​rhs (`SIMD[dtype, size]`): The rhs of the inequality. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`).
**Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_not_equal[T: Copyable & Movable & EqualityComparable & Representable, //](lhs: List[T], rhs: List[T], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that two lists are not equal. **Parameters:** * ​T (`Copyable & Movable & EqualityComparable & Representable`): The type of the elements in the lists. **Args:** * ​lhs (`List[T]`): The left-hand side list. * ​rhs (`List[T]`): The right-hand side list. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. --- ## assert_raises `struct assert_raises` Context manager that asserts that the block raises an exception. You can use this to test expected error cases, and to test that the correct errors are raised. For instance:

```mojo
from testing import assert_raises

# Good! Caught the raised error, test passes
with assert_raises():
    raise "SomeError"

# Also good!
with assert_raises(contains="Some"):
    raise "SomeError"

# This will assert, we didn't raise
with assert_raises():
    pass

# This will let the underlying error propagate, failing the test
with assert_raises(contains="Some"):
    raise "OtherError"
```

## Fields * ​message\_contains (`Optional[String]`): If present, check that the error message contains this literal string. * ​call\_location (`_SourceLocation`): Assigned the value returned by `__call_location()` at `Self.__init__`. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, *, location: Optional[_SourceLocation] = Optional(None))` Construct a context manager with no message pattern. **Args:** * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). `__init__(out self, *, contains: String, location: Optional[_SourceLocation] = Optional(None))` Construct a context manager matching specific errors. **Args:** * ​contains (`String`): The test will only pass if the error message includes the literal text passed. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). ### `__enter__` `__enter__(self)` Enter the context manager. ### `__exit__` `__exit__(self)` Exit the context manager with no error. **Raises:** AssertionError: Always. The block must raise to pass the test. `__exit__(self, error: Error) -> Bool` Exit the context manager with an error. **Args:** * ​error (`Error`): The error raised. **Returns:** True if the error message contained the expected string. **Raises:** Error: If the error raised doesn't include the expected string. --- ## assert_true `assert_true[T: Boolable, //](val: T, msg: String = __init__[__mlir_type.!kgen.string]("condition was unexpectedly False"), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input value is True and raises an Error if it's not. **Parameters:** * ​T (`Boolable`): The type of the value argument. **Args:** * ​val (`T`): The value to assert to be True. * ​msg (`String`): The message to be printed if the assertion fails. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. --- ## testing Implements various testing utils.
You can import these APIs from the `testing` package. For example:

```mojo
from testing import assert_true

def main():
    x = 1
    y = 2
    try:
        assert_true(x==1)
        assert_true(y==2)
        assert_true((x+y)==3)
        print("All assertions succeeded")
    except e:
        print("At least one assertion failed:")
        print(e)
```

## Structs * [​`assert_raises`](/mojo/stdlib/testing/testing/assert_raises): Context manager that asserts that the block raises an exception. ## Functions * [​`assert_almost_equal`](/mojo/stdlib/testing/testing/assert_almost_equal): Asserts that the input values are equal up to a tolerance. If it is not then an Error is raised. * [​`assert_equal`](/mojo/stdlib/testing/testing/assert_equal): Asserts that the input values are equal. If it is not then an Error is raised. * [​`assert_false`](/mojo/stdlib/testing/testing/assert_false): Asserts that the input value is False and raises an Error if it's not. * [​`assert_is`](/mojo/stdlib/testing/testing/assert_is): Asserts that the input values have the same identity. If they do not then an Error is raised. * [​`assert_is_not`](/mojo/stdlib/testing/testing/assert_is_not): Asserts that the input values have different identities. If they do not then an Error is raised. * [​`assert_not_equal`](/mojo/stdlib/testing/testing/assert_not_equal): Asserts that the input values are not equal. If they are equal then an Error is raised. * [​`assert_true`](/mojo/stdlib/testing/testing/assert_true): Asserts that the input value is True and raises an Error if it's not. --- ## time Implements the time package. ## Modules * [​`time`](/mojo/stdlib/time/time/): Implements basic utils for working with time. --- ## time Implements basic utils for working with time. You can import these APIs from the `time` package. For example:

```mojo
from time import perf_counter_ns
```

## Functions * [​`monotonic`](/mojo/stdlib/time/time/monotonic): Returns the current monotonic time in nanoseconds. This function queries the current platform's monotonic clock, making it useful for measuring time differences, but the significance of the returned value varies depending on the underlying implementation. * [​`perf_counter`](/mojo/stdlib/time/time/perf_counter): Return the value (in fractional seconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide. The reference point of the returned value is undefined, so that only the difference between the results of two calls is valid. * [​`perf_counter_ns`](/mojo/stdlib/time/time/perf_counter_ns): Return the value (in nanoseconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide. The reference point of the returned value is undefined, so that only the difference between the results of two calls is valid. * [​`sleep`](/mojo/stdlib/time/time/sleep): Suspends the current thread for the seconds specified. * [​`time_function`](/mojo/stdlib/time/time/time_function): Measures the time spent in the function. --- ## monotonic `monotonic() -> UInt` Returns the current monotonic time in nanoseconds. This function queries the current platform's monotonic clock, making it useful for measuring time differences, but the significance of the returned value varies depending on the underlying implementation. **Returns:** The current time in ns.
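A minimal timing sketch using the nanosecond counter described below (the sleep duration is illustrative):

```mojo
from time import perf_counter_ns, sleep

def main():
    var start = perf_counter_ns()
    sleep(0.01)  # suspend for roughly 10 ms
    var elapsed = perf_counter_ns() - start
    # Only the difference between two calls is meaningful.
    print("elapsed ns:", elapsed)
```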
--- ## perf_counter `perf_counter() -> SIMD[float64, 1]` Return the value (in fractional seconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide. The reference point of the returned value is undefined, so that only the difference between the results of two calls is valid. **Returns:** The current time in fractional seconds. --- ## perf_counter_ns `perf_counter_ns() -> UInt` Return the value (in nanoseconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide. The reference point of the returned value is undefined, so that only the difference between the results of two calls is valid. **Returns:** The current time in ns. --- ## sleep `sleep(sec: SIMD[float64, 1])` Suspends the current thread for the seconds specified. **Args:** * ​sec (`SIMD[float64, 1]`): The number of seconds to sleep for. `sleep(sec: UInt)` Suspends the current thread for the seconds specified. **Args:** * ​sec (`UInt`): The number of seconds to sleep for. --- ## time_function `time_function[: origin.set, //, func: fn() raises capturing -> None]() -> UInt` Measures the time spent in the function. **Parameters:** * ​func (`fn() raises capturing -> None`): The function to time. **Returns:** The time elapsed in the function in ns. `time_function[: origin.set, //, func: fn() capturing -> None]() -> UInt` Measures the time spent in the function. **Parameters:** * ​func (`fn() capturing -> None`): The function to time. **Returns:** The time elapsed in the function in ns. --- ## utils Implements the utils package. ## Modules * [​`index`](/mojo/stdlib/utils/index_/): Implements `IndexList` which is commonly used to represent N-D indices. * [​`lock`](/mojo/stdlib/utils/lock/): * [​`numerics`](/mojo/stdlib/utils/numerics/): Defines utilities to work with numeric types. * [​`static_tuple`](/mojo/stdlib/utils/static_tuple/): Implements StaticTuple, a statically-sized uniform container. * [​`variant`](/mojo/stdlib/utils/variant/): Defines a Variant type. * [​`write`](/mojo/stdlib/utils/write/): Establishes the contract between `Writer` and `Writable` types. --- ## Index `Index[T0: Intable, //, *, dtype: DType = int64](x: T0) -> IndexList[1, element_type=dtype]` Constructs a 1-D Index from the given value. **Parameters:** * ​T0 (`Intable`): The type of the 1st argument. * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`T0`): The initial value. **Returns:** The constructed IndexList. `Index[*, dtype: DType = int64](x: UInt) -> IndexList[1, element_type=dtype]` Constructs a 1-D Index from the given value. **Parameters:** * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`UInt`): The initial value. **Returns:** The constructed IndexList. `Index[T0: Intable, T1: Intable, //, *, dtype: DType = int64](x: T0, y: T1) -> IndexList[2, element_type=dtype]` Constructs a 2-D Index from the given values. **Parameters:** * ​T0 (`Intable`): The type of the 1st argument. * ​T1 (`Intable`): The type of the 2nd argument. * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`T0`): The 1st initial value. * ​y (`T1`): The 2nd initial value. **Returns:** The constructed IndexList. `Index[*, dtype: DType = int64](x: UInt, y: UInt) -> IndexList[2, element_type=dtype]` Constructs a 2-D Index from the given values.
**Parameters:** * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`UInt`): The 1st initial value. * ​y (`UInt`): The 2nd initial value. **Returns:** The constructed IndexList. `Index[T0: Intable, T1: Intable, T2: Intable, //, *, dtype: DType = int64](x: T0, y: T1, z: T2) -> IndexList[3, element_type=dtype]` Constructs a 3-D Index from the given values. **Parameters:** * ​T0 (`Intable`): The type of the 1st argument. * ​T1 (`Intable`): The type of the 2nd argument. * ​T2 (`Intable`): The type of the 3rd argument. * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`T0`): The 1st initial value. * ​y (`T1`): The 2nd initial value. * ​z (`T2`): The 3rd initial value. **Returns:** The constructed IndexList. `Index[T0: Intable, T1: Intable, T2: Intable, T3: Intable, //, *, dtype: DType = int64](x: T0, y: T1, z: T2, w: T3) -> IndexList[4, element_type=dtype]` Constructs a 4-D Index from the given values. **Parameters:** * ​T0 (`Intable`): The type of the 1st argument. * ​T1 (`Intable`): The type of the 2nd argument. * ​T2 (`Intable`): The type of the 3rd argument. * ​T3 (`Intable`): The type of the 4th argument. * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`T0`): The 1st initial value. * ​y (`T1`): The 2nd initial value. * ​z (`T2`): The 3rd initial value. * ​w (`T3`): The 4th initial value. **Returns:** The constructed IndexList. `Index[T0: Intable, T1: Intable, T2: Intable, T3: Intable, T4: Intable, //, *, dtype: DType = int64](x: T0, y: T1, z: T2, w: T3, v: T4) -> IndexList[5, element_type=dtype]` Constructs a 5-D Index from the given values. **Parameters:** * ​T0 (`Intable`): The type of the 1st argument. * ​T1 (`Intable`): The type of the 2nd argument. * ​T2 (`Intable`): The type of the 3rd argument. * ​T3 (`Intable`): The type of the 4th argument. * ​T4 (`Intable`): The type of the 5th argument. * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`T0`): The 1st initial value. * ​y (`T1`): The 2nd initial value. * ​z (`T2`): The 3rd initial value. * ​w (`T3`): The 4th initial value. * ​v (`T4`): The 5th initial value. **Returns:** The constructed IndexList. --- ## IndexList `@register_passable(trivial)` `struct IndexList[size: Int, *, element_type: DType = int64]` A base struct that implements size agnostic index functions. ## Parameters * ​size (`Int`): The size of the tuple. * ​element\_type (`DType`): The underlying dtype of the integer element value. ## Fields * ​data (`StaticTuple[SIMD[element_type, 1], size]`): The underlying storage of the tuple value. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `EqualityComparable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Methods ### `__init__` `__init__() -> Self` Constructs a static int tuple of the given size. `@implicit` `__init__(data: StaticTuple[SIMD[element_type, 1], size]) -> Self` Constructs a static int tuple of the given size. **Args:** * ​data (`StaticTuple[SIMD[element_type, 1], size]`): The StaticTuple to construct the IndexList from. `@implicit` `__init__(elems: Tuple[Int, Int]) -> Self` Constructs a static int tuple given a tuple of integers. **Args:** * ​elems (`Tuple[Int, Int]`): The tuple to copy from. `@implicit` `__init__(elems: Tuple[Int, Int, Int]) -> Self` Constructs a static int tuple given a tuple of integers. 
**Args:** * ​elems (`Tuple[Int, Int, Int]`): The tuple to copy from. `@implicit` `__init__(elems: Tuple[Int, Int, Int, Int]) -> Self` Constructs a static int tuple given a tuple of integers. **Args:** * ​elems (`Tuple[Int, Int, Int, Int]`): The tuple to copy from. `@implicit` `__init__(*elems: Int, *, __list_literal__: Tuple[] = Tuple()) -> Self` Constructs a static int tuple given a set of arguments. **Args:** * ​\*elems (`Int`): The elements to construct the tuple. * ​\_\_list\_literal\_\_ (`Tuple[]`): Specifies that this constructor can be used for list literals. `@implicit` `__init__(elem: Int) -> Self` Constructs a static int tuple by splatting a single value to all elements. **Args:** * ​elem (`Int`): The value to splat into the tuple. `__init__(*, other: Self) -> Self` Copy constructor. **Args:** * ​other (`Self`): The other tuple to copy from. `@implicit` `__init__(values: VariadicList[Int]) -> Self` Creates a tuple constant using the specified values. **Args:** * ​values (`VariadicList[Int]`): The list of values. ### `__getitem__` `__getitem__[idx: Int](self) -> Int` Gets an element from the tuple by index. **Parameters:** * ​idx (`Int`): The element index. **Returns:** The tuple element value. `__getitem__[I: Indexer](self, idx: I) -> Int` Gets an element from the tuple by index. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The element index. **Returns:** The tuple element value. ### `__setitem__` `__setitem__[idx: Int](mut self, val: Int)` Sets an element in the tuple at the given static index. **Parameters:** * ​idx (`Int`): The element index. **Args:** * ​val (`Int`): The value to store. `__setitem__[idx: Int](mut self, val: SIMD[element_type, 1])` Sets an element in the tuple at the given static index. **Parameters:** * ​idx (`Int`): The element index. **Args:** * ​val (`SIMD[element_type, 1]`): The value to store. `__setitem__(mut self, idx: Int, val: Int)` Sets an element in the tuple at the given index. **Args:** * ​idx (`Int`): The element index. * ​val (`Int`): The value to store. ### `__lt__` `__lt__(self, rhs: Self) -> Bool` Compares this tuple to another tuple using LT comparison. A tuple is less than another tuple if all corresponding elements of lhs are less than those of rhs. Note: This is **not** a lexical comparison. **Args:** * ​rhs (`Self`): Right hand side tuple. **Returns:** The comparison result. ### `__le__` `__le__(self, rhs: Self) -> Bool` Compares this tuple to another tuple using LE comparison. A tuple is less than or equal to another tuple if all corresponding elements of lhs are less than or equal to those of rhs. Note: This is **not** a lexical comparison. **Args:** * ​rhs (`Self`): Right hand side tuple. **Returns:** The comparison result. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compares this tuple to another tuple for equality. The tuples are equal if all corresponding elements are equal. **Args:** * ​rhs (`Self`): The other tuple. **Returns:** The comparison result. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Compares this tuple to another tuple for non-equality. The tuples are non-equal if at least one element of LHS isn't equal to the corresponding element from RHS. **Args:** * ​rhs (`Self`): The other tuple. **Returns:** The comparison result. ### `__gt__` `__gt__(self, rhs: Self) -> Bool` Compares this tuple to another tuple using GT comparison. A tuple is greater than another tuple if all corresponding elements of lhs are greater than those of rhs. Note: This is **not** a lexical comparison.
**Args:** * ​rhs (`Self`): Right hand side tuple. **Returns:** The comparison result. ### `__ge__` `__ge__(self, rhs: Self) -> Bool` Compares this tuple to another tuple using GE comparison. A tuple is greater than or equal to another tuple if all corresponding elements of lhs are greater than or equal to those of rhs. Note: This is **not** a lexical comparison. **Args:** * ​rhs (`Self`): Right hand side tuple. **Returns:** The comparison result. ### `__add__` `__add__(self, rhs: Self) -> Self` Performs element-wise integer add. **Args:** * ​rhs (`Self`): Right hand side operand. **Returns:** The resulting index tuple. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Performs element-wise integer subtract. **Args:** * ​rhs (`Self`): Right hand side operand. **Returns:** The resulting index tuple. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Performs element-wise integer multiply. **Args:** * ​rhs (`Self`): Right hand side operand. **Returns:** The resulting index tuple. ### `__floordiv__` `__floordiv__(self, rhs: Self) -> Self` Performs element-wise integer floor division. **Args:** * ​rhs (`Self`): The elementwise divisor. **Returns:** The resulting index tuple. ### `__rfloordiv__` `__rfloordiv__(self, rhs: Self) -> Self` Floor divides rhs by this object. **Args:** * ​rhs (`Self`): The value to elementwise divide by self. **Returns:** The resulting index tuple. ### `__len__` `__len__(self) -> Int` Returns the size of the tuple. **Returns:** The tuple size. ### `as_tuple` `as_tuple(self) -> StaticTuple[Int, size]` Converts this IndexList to StaticTuple. **Returns:** The corresponding StaticTuple object. ### `canonicalize` `canonicalize(self) -> IndexList[size]` Canonicalizes the IndexList. **Returns:** The canonicalized IndexList. ### `flattened_length` `flattened_length(self) -> Int` Returns the flattened length of the tuple. **Returns:** The flattened length of the tuple. ### `remu` `remu(self, rhs: Self) -> Self` Performs element-wise integer unsigned modulo. **Args:** * ​rhs (`Self`): Right hand side operand. **Returns:** The resulting index tuple. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this IndexList value to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__str__` `__str__(self) -> String` Get the tuple as a string. **Returns:** A string representation. ### `cast` `cast[dtype: DType](self) -> IndexList[size, element_type=dtype]` Casts to the target DType. **Parameters:** * ​dtype (`DType`): The dtype to cast towards. **Returns:** The list cast to the target type. ### `__hash__` `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with the underlying bytes. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. --- ## index Implements `IndexList` which is commonly used to represent N-D indices. You can import these APIs from the `utils` package. For example: ```mojo from utils import IndexList ``` ## Structs * [​`IndexList`](/mojo/stdlib/utils/index_/IndexList): A base struct that implements size agnostic index functions. ## Functions * [​`Index`](/mojo/stdlib/utils/index_/Index-function): Constructs a 1-D Index from the given value. * [​`product`](/mojo/stdlib/utils/index_/product): Computes a product of values in the tuple up to the given index.
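The following usage sketch is not part of the original reference; it exercises `Index`, `IndexList`, and the element-wise arithmetic described above. The shape and stride values are arbitrary illustrations:

```mojo
from utils import Index, IndexList

def main():
    # A 3-D shape and its row-major strides (values chosen for illustration).
    var shape = Index(2, 3, 4)
    var strides = IndexList[3](12, 4, 1)
    print(shape + strides)           # element-wise add: 14, 7, 5
    print(shape * strides)           # element-wise multiply: 24, 12, 4
    print(shape.flattened_length())  # 2 * 3 * 4 = 24
```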
--- ## product `product[size: Int](tuple: IndexList[size, element_type=element_type], end_idx: Int = size) -> Int` Computes a product of values in the tuple up to the given index. **Parameters:** * ​size (`Int`): The tuple size. **Args:** * ​tuple (`IndexList[size, element_type=element_type]`): The tuple to get a product of. * ​end\_idx (`Int`): The end index. **Returns:** The product of all tuple elements in the given range. `product[size: Int](tuple: IndexList[size, element_type=element_type], start_idx: Int, end_idx: Int) -> Int` Computes a product of values in the tuple in the given index range. **Parameters:** * ​size (`Int`): The tuple size. **Args:** * ​tuple (`IndexList[size, element_type=element_type]`): The tuple to get a product of. * ​start\_idx (`Int`): The start index of the range. * ​end\_idx (`Int`): The end index of the range. **Returns:** The product of all tuple elements in the given range. --- ## BlockingScopedLock `struct BlockingScopedLock` A scope adapter for BlockingSpinLock. ## Fields * ​lock (`UnsafePointer[BlockingSpinLock]`): The underlying lock instance. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `LockType` `alias LockType = BlockingSpinLock` The type of the lock. ## Methods ### `__init__` `__init__(out self, lock: UnsafePointer[BlockingSpinLock])` Primary constructor. **Args:** * ​lock (`UnsafePointer[BlockingSpinLock]`): A pointer to the underlying lock. `__init__(out self, mut lock: BlockingSpinLock)` Secondary constructor. **Args:** * ​lock (`BlockingSpinLock`): A mutable reference to the underlying lock. ### `__enter__` `__enter__(mut self)` Acquire the lock on entry. This is done by setting the owner of the lock to its own address. ### `__exit__` `__exit__(mut self)` Release the lock on exit. Reset the address on the underlying lock. --- ## BlockingSpinLock `struct BlockingSpinLock` A basic locking implementation that uses an integer to represent the owner of the lock. ## Fields * ​counter (`Atomic[int64]`): The atomic counter implementing the spin lock. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Aliases ### `UNLOCKED` `alias UNLOCKED = -1` A value of -1 means unlocked; any other value is the owner token and means locked. ## Methods ### `__init__` `__init__(out self)` Default constructor. ### `lock` `lock(mut self, owner: Int)` Acquires the lock. **Args:** * ​owner (`Int`): The lock's owner (usually an address). ### `unlock` `unlock(mut self, owner: Int) -> Bool` Releases the lock. **Args:** * ​owner (`Int`): The lock's owner (usually an address). **Returns:** Whether the lock was successfully released. --- ## SpinWaiter `struct SpinWaiter` A proxy for the C++ runtime's SpinWaiter type. ## Fields * ​storage (`UnsafePointer[NoneType]`): Pointer to the underlying SpinWaiter instance. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initializes a SpinWaiter instance. ### `__del__` `__del__(owned self)` Destroys the SpinWaiter instance. ### `wait` `wait(self)` Blocks the current task for a duration determined by the underlying policy. --- ## lock ## Structs * [​`BlockingScopedLock`](/mojo/stdlib/utils/lock/BlockingScopedLock): A scope adapter for BlockingSpinLock. * [​`BlockingSpinLock`](/mojo/stdlib/utils/lock/BlockingSpinLock): A basic locking implementation that uses an integer to represent the owner of the lock. * [​`SpinWaiter`](/mojo/stdlib/utils/lock/SpinWaiter): A proxy for the C++ runtime's SpinWaiter type.
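A minimal sketch of the `BlockingSpinLock` API above, not from the original reference. It assumes the `utils.lock` import path implied by the listing, and the owner token is an arbitrary integer chosen for illustration (an address is more typical):

```mojo
from utils.lock import BlockingSpinLock

def main():
    var lock = BlockingSpinLock()
    var owner = 42  # arbitrary owner token for illustration
    lock.lock(owner)
    # ... critical section ...
    var released = lock.unlock(owner)  # True on successful release
    print("released:", released)
```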
--- ## FPUtils `struct FPUtils[dtype: DType, *, _constraint: NoneType = NoneType(_constrain_fp_type[::DType]())]` Collection of utility functions for working with FP values. **Constraints:** The dtype is floating point. ## Parameters * ​dtype (`DType`): The concrete FP dtype (FP32/FP64/etc). * ​\_constraint (`NoneType`): Implements the constraint. Do not pass explicitly. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `integral_type` `alias integral_type = _integral_type_of[::DType]()` The equivalent integer dtype of the float type. ### `uint_type` `alias uint_type = _unsigned_integral_type_of[::DType]()` The equivalent uint dtype of the float type. ## Methods ### `mantissa_width` `static mantissa_width() -> Int` Returns the mantissa width of a floating point type. **Returns:** The mantissa width. ### `max_exponent` `static max_exponent() -> Int` Returns the max exponent of a floating point dtype without accounting for inf representations. This is not the maximum representable exponent, which is generally equal to the exponent\_bias. **Returns:** The max exponent. ### `exponent_width` `static exponent_width() -> Int` Returns the exponent width of a floating point type. **Returns:** The exponent width. ### `mantissa_mask` `static mantissa_mask() -> Int` Returns the mantissa mask of a floating point type. **Returns:** The mantissa mask. ### `exponent_bias` `static exponent_bias() -> Int` Returns the exponent bias of a floating point type. **Returns:** The exponent bias. ### `sign_mask` `static sign_mask() -> Int` Returns the sign mask of a floating point type. It is computed by `1 << (exponent_width + mantissa_width)`. **Returns:** The sign mask. ### `exponent_mask` `static exponent_mask() -> Int` Returns the exponent mask of a floating point type. It is computed by `~(sign_mask | mantissa_mask)`. **Returns:** The exponent mask. ### `exponent_mantissa_mask` `static exponent_mantissa_mask() -> Int` Returns the exponent and mantissa mask of a floating point type. It is computed by `exponent_mask | mantissa_mask`. **Returns:** The exponent and mantissa mask. ### `quiet_nan_mask` `static quiet_nan_mask() -> Int` Returns the quiet NaN mask for a floating point type. The mask sets all exponent bits plus the most significant mantissa bit, i.e. `exponent_mask | (1 << (mantissa_width - 1))`. **Returns:** The quiet NaN mask. ### `bitcast_to_integer` `static bitcast_to_integer(value: SIMD[dtype, 1]) -> Int` Bitcasts the floating-point value to an integer. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point type. **Returns:** An integer representation of the floating-point value. ### `bitcast_to_uint` `static bitcast_to_uint(value: SIMD[dtype, 1]) -> SIMD[_unsigned_integral_type_of[::DType](), 1]` Bitcasts the floating-point value to an unsigned integer. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point type. **Returns:** An unsigned integer representation of the floating-point value. ### `bitcast_from_integer` `static bitcast_from_integer(value: Int) -> SIMD[dtype, 1]` Bitcasts the floating-point value from an integer. **Args:** * ​value (`Int`): The int value. **Returns:** A floating-point representation of the Int. ### `get_sign` `static get_sign(value: SIMD[dtype, 1]) -> Bool` Returns the sign of the floating point value. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point type. **Returns:** True if the sign is set and False otherwise. ### `set_sign` `static set_sign(value: SIMD[dtype, 1], sign: Bool) -> SIMD[dtype, 1]` Sets the sign of the floating point value. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point value. * ​sign (`Bool`): True to set the sign and False otherwise.
**Returns:** The floating point value with the sign set. ### `get_exponent` `static get_exponent(value: SIMD[dtype, 1]) -> Int` Returns the exponent bits of the floating-point value. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point value. **Returns:** The exponent bits. ### `get_exponent_biased` `static get_exponent_biased(value: SIMD[dtype, 1]) -> Int` Returns the biased exponent of the floating-point value as an Int. This is how the value is stored before subtracting the exponent bias. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point value. **Returns:** The biased exponent as an Int. ### `set_exponent` `static set_exponent(value: SIMD[dtype, 1], exponent: Int) -> SIMD[dtype, 1]` Sets the exponent bits of the floating-point value. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point value. * ​exponent (`Int`): The exponent bits. **Returns:** The floating-point value with the exponent bits set. ### `get_mantissa` `static get_mantissa(value: SIMD[dtype, 1]) -> Int` Gets the mantissa bits of the floating-point value. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point value. **Returns:** The mantissa bits. ### `get_mantissa_uint` `static get_mantissa_uint(value: SIMD[dtype, 1]) -> SIMD[_unsigned_integral_type_of[::DType](), 1]` Gets the mantissa bits of the floating-point value. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point value. **Returns:** The mantissa bits. ### `set_mantissa` `static set_mantissa(value: SIMD[dtype, 1], mantissa: Int) -> SIMD[dtype, 1]` Sets the mantissa bits of the floating-point value. **Args:** * ​value (`SIMD[dtype, 1]`): The floating-point value. * ​mantissa (`Int`): The mantissa bits. **Returns:** The floating-point value with the mantissa bits set. ### `pack` `static pack(sign: Bool, exponent: Int, mantissa: Int) -> SIMD[dtype, 1]` Construct a floating-point value from its constituent sign, exponent, and mantissa. **Args:** * ​sign (`Bool`): The sign of the floating-point value. * ​exponent (`Int`): The exponent of the floating-point value. * ​mantissa (`Int`): The mantissa of the floating-point value. **Returns:** The floating-point value. --- ## FlushDenormals `struct FlushDenormals` Denormals are flushed to zero within the context, and the state is restored to the prior value on exit. ## Fields * ​state (`SIMD[int32, 1]`): The current state. ## Implemented traits `AnyType`, `Defaultable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initializes the FlushDenormals. ### `__enter__` `__enter__(self)` Enters the context. This will set denormals to zero. ### `__exit__` `__exit__(self)` Exits the context. This will restore the prior FPState. --- ## get_accum_type `get_accum_type[dtype: DType, *, preferred_accum_type: DType = float32]() -> DType` Returns the recommended dtype for accumulation operations. Half-precision and float8 types can introduce numerical error if they are used in reduction/accumulation operations. This method returns a higher-precision dtype to use for accumulation if a half-precision type is provided, otherwise it returns the original dtype. The rules are as follows: * If the dtype is a float8 type, return a float16 type. * If the dtype is a bfloat16 type, return a float32 type. * If the dtype is a float16 type, return a float32 dtype if the preferred\_accum\_type is float32, otherwise return a float16 type. * Otherwise, return the original dtype.
**Parameters:** * ​dtype (`DType`): The dtype of some accumulation operation. * ​preferred\_accum\_type (`DType`): The preferred dtype for accumulation. **Returns:** The recommended dtype for accumulation operations based on the input dtype and the preferred accumulation type. --- ## numerics Defines utilities to work with numeric types. You can import these APIs from the `utils` package. For example: ```mojo from utils.numerics import FPUtils ``` ## Structs * [​`FlushDenormals`](/mojo/stdlib/utils/numerics/FlushDenormals): Denormals are flushed to zero within the context, and the state is restored to the prior value on exit. * [​`FPUtils`](/mojo/stdlib/utils/numerics/FPUtils): Collection of utility functions for working with FP values. ## Functions * [​`get_accum_type`](/mojo/stdlib/utils/numerics/get_accum_type): Returns the recommended dtype for accumulation operations. * [​`inf`](/mojo/stdlib/utils/numerics/inf): Gets a +inf value for the given dtype. * [​`isfinite`](/mojo/stdlib/utils/numerics/isfinite): Checks if the value is not infinite. * [​`isinf`](/mojo/stdlib/utils/numerics/isinf): Checks if the value is infinite. * [​`isnan`](/mojo/stdlib/utils/numerics/isnan): Checks if the value is Not a Number (NaN). * [​`max_finite`](/mojo/stdlib/utils/numerics/max_finite): Returns the maximum finite value of type. * [​`max_or_inf`](/mojo/stdlib/utils/numerics/max_or_inf): Returns the maximum (potentially infinite) value of type. * [​`min_finite`](/mojo/stdlib/utils/numerics/min_finite): Returns the minimum (lowest) finite value of type. * [​`min_or_neg_inf`](/mojo/stdlib/utils/numerics/min_or_neg_inf): Returns the minimum (potentially negative infinite) value of type. * [​`nan`](/mojo/stdlib/utils/numerics/nan): Gets a NaN value for the given dtype. * [​`neg_inf`](/mojo/stdlib/utils/numerics/neg_inf): Gets a -inf value for the given dtype. * [​`nextafter`](/mojo/stdlib/utils/numerics/nextafter): Computes next representable value of `arg0` in the direction of `arg1`. --- ## inf `inf[dtype: DType]() -> SIMD[dtype, 1]` Gets a +inf value for the given dtype. **Constraints:** Can only be used for FP dtypes. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The +inf value of the given dtype. --- ## isfinite `isfinite[dtype: DType, simd_width: Int](val: SIMD[dtype, simd_width]) -> SIMD[bool, simd_width]` Checks if the value is not infinite. This is always True for non-FP data types. **Parameters:** * ​dtype (`DType`): The value dtype. * ​simd\_width (`Int`): The width of the SIMD vector. **Args:** * ​val (`SIMD[dtype, simd_width]`): The value to check. **Returns:** True if val is finite and False otherwise. --- ## isinf `isinf[dtype: DType, simd_width: Int](val: SIMD[dtype, simd_width]) -> SIMD[bool, simd_width]` Checks if the value is infinite. This is always False for non-FP data types. **Parameters:** * ​dtype (`DType`): The value dtype. * ​simd\_width (`Int`): The width of the SIMD vector. **Args:** * ​val (`SIMD[dtype, simd_width]`): The value to check. **Returns:** True if val is infinite and False otherwise. --- ## isnan `isnan[dtype: DType, simd_width: Int](val: SIMD[dtype, simd_width]) -> SIMD[bool, simd_width]` Checks if the value is Not a Number (NaN). **Parameters:** * ​dtype (`DType`): The value dtype. * ​simd\_width (`Int`): The width of the SIMD vector. **Args:** * ​val (`SIMD[dtype, simd_width]`): The value to check. **Returns:** True if val is NaN and False otherwise.
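A short sketch, not part of the original reference, exercising the special-value helpers above:

```mojo
from utils.numerics import inf, isfinite, isinf, isnan, nan

def main():
    var x = nan[DType.float32]()  # a quiet NaN
    var y = inf[DType.float32]()  # positive infinity
    print(isnan(x))     # True
    print(isinf(y))     # True
    print(isfinite(y))  # False
```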
--- ## max_finite `max_finite[dtype: DType]() -> SIMD[dtype, 1]` Returns the maximum finite value of type. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The maximum representable value of the type. Does not include infinity for floating-point types. --- ## max_or_inf `max_or_inf[dtype: DType]() -> SIMD[dtype, 1]` Returns the maximum (potentially infinite) value of type. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The maximum representable value of the type. Can include infinity for floating-point types. --- ## min_finite `min_finite[dtype: DType]() -> SIMD[dtype, 1]` Returns the minimum (lowest) finite value of type. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The minimum representable value of the type. Does not include negative infinity for floating-point types. --- ## min_or_neg_inf `min_or_neg_inf[dtype: DType]() -> SIMD[dtype, 1]` Returns the minimum (potentially negative infinite) value of type. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The minimum representable value of the type. Can include negative infinity for floating-point types. --- ## nan `nan[dtype: DType]() -> SIMD[dtype, 1]` Gets a NaN value for the given dtype. **Constraints:** Can only be used for FP dtypes. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The NaN value of the given dtype. --- ## neg_inf `neg_inf[dtype: DType]() -> SIMD[dtype, 1]` Gets a -inf value for the given dtype. **Constraints:** Can only be used for FP dtypes. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The -inf value of the given dtype. --- ## nextafter `nextafter[dtype: DType, simd_width: Int](arg0: SIMD[dtype, simd_width], arg1: SIMD[dtype, simd_width]) -> SIMD[dtype, simd_width]` Computes next representable value of `arg0` in the direction of `arg1`. **Constraints:** The element dtype of the input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​simd\_width (`Int`): The width of the input and output SIMD vector. **Args:** * ​arg0 (`SIMD[dtype, simd_width]`): The first input argument. * ​arg1 (`SIMD[dtype, simd_width]`): The second input argument. **Returns:** The `nextafter` of the inputs. --- ## StaticTuple `@register_passable(trivial)` `struct StaticTuple[element_type: AnyTrivialRegType, size: Int]` A statically sized tuple type which contains elements of homogeneous types. ## Parameters * ​element\_type (`AnyTrivialRegType`): The type of the elements in the tuple. * ​size (`Int`): The size of the tuple. ## Fields * ​array (`array<size, element_type>`): The underlying storage for the static tuple. ## Implemented traits `AnyType`, `Copyable`, `Defaultable`, `Movable`, `Sized`, `UnknownDestructibility` ## Aliases ### `type` `alias type = array<size, element_type>` ## Methods ### `__init__` `__init__() -> Self` Constructs an empty (undefined) tuple. `@implicit` `__init__(array: array<size, element_type>) -> Self` Constructs from an array type. **Args:** * ​array (`array<size, element_type>`): Underlying MLIR array type. `@implicit` `__init__(*elems: element_type) -> Self` Constructs a static tuple given a set of arguments. **Args:** * ​\*elems (`element_type`): The element types. `@implicit` `__init__(values: VariadicList[element_type]) -> Self` Creates a tuple constant using the specified values. **Args:** * ​values (`VariadicList[element_type]`): The list of values. `__init__(*, other: Self) -> Self` Explicitly copy the provided StaticTuple.
**Args:** * ​other (`Self`): The StaticTuple to copy. ### `__getitem__` `__getitem__[index: Int](self) -> element_type` Returns the value of the tuple at the given index. **Parameters:** * ​index (`Int`): The index into the tuple. **Returns:** The value at the specified position. `__getitem__[I: Indexer, //](self, idx: I) -> element_type` Returns the value of the tuple at the given dynamic index. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The index into the tuple. **Returns:** The value at the specified position. ### `__setitem__` `__setitem__[I: Indexer, //](mut self, idx: I, val: element_type)` Stores a single value into the tuple at the specified dynamic index. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The index into the tuple. * ​val (`element_type`): The value to store. `__setitem__[idx: Int](mut self, val: element_type)` Stores a single value into the tuple at the specified index. **Parameters:** * ​idx (`Int`): The index into the tuple. **Args:** * ​val (`element_type`): The value to store. ### `__len__` `__len__(self) -> Int` Returns the length of the tuple. This is a known constant value. **Returns:** The size of the tuple. --- ## static_tuple Implements StaticTuple, a statically-sized uniform container. You can import these APIs from the `utils` package. For example: ```mojo from utils import StaticTuple ``` ## Structs * [​`StaticTuple`](/mojo/stdlib/utils/static_tuple/StaticTuple): A statically sized tuple type which contains elements of homogeneous types. --- ## Variant `struct Variant[*Ts: Copyable & Movable]` A runtime-variant type. Data for this type is stored internally. Currently, its size is the largest size of any of its variants plus a 16-bit discriminant. You can: * use `isa[T]()` to check what type a variant is. * use `unsafe_take[T]()` to take a value from the variant. * use `[T]` to get a value out of a variant. This currently does an extra copy/move until we have origins, and it temporarily requires the value to be mutable. * use `set[T](owned new_value: T)` to reset the variant to a new value. * use `is_type_supported[T]()` to check if the variant permits the type `T`. Example: ```mojo from utils import Variant alias IntOrString = Variant[Int, String] fn to_string(mut x: IntOrString) -> String: if x.isa[String](): return x[String] # x.isa[Int]() return String(x[Int]) # They have to be mutable for now, and implement Copyable & Movable var an_int = IntOrString(4) var a_string = IntOrString(String("I'm a string!")) var who_knows = IntOrString(0) import random if random.random_ui64(0, 1): who_knows.set[String]("I'm actually a string too!") print(to_string(an_int)) print(to_string(a_string)) print(to_string(who_knows)) ``` ## Parameters * ​\*Ts (`Copyable & Movable`): The elements of the variadic. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, *, unsafe_uninitialized: Tuple[])` Unsafely create an uninitialized Variant. **Args:** * ​unsafe\_uninitialized (`Tuple[]`): Marker argument indicating this initializer is unsafe. `@implicit` `__init__[T: Copyable & Movable](out self, owned value: T)` Create a variant with one of the types. **Parameters:** * ​T (`Copyable & Movable`): The type to initialize the variant to. Generally this should be able to be inferred from the call type, eg. `Variant[Int, String](4)`.
**Args:** * ​value (`T`): The value to initialize the variant with. ### `__copyinit__` `__copyinit__(out self, other: Self)` Creates a deep copy of an existing variant. **Args:** * ​other (`Self`): The variant to copy from. ### `__moveinit__` `__moveinit__(out self, owned other: Self)` Move initializer for the variant. **Args:** * ​other (`Self`): The variant to move. ### `__del__` `__del__(owned self)` Destroy the variant. ### `__getitem__` `__getitem__[T: Copyable & Movable](ref self) -> ref [self] T` Get the value out of the variant as a type-checked type. This explicitly checks that your value is of that type! If you haven't verified the type correctness at runtime, the program will abort! For now this has the limitation that it requires the variant value to be mutable. **Parameters:** * ​T (`Copyable & Movable`): The type of the value to get out. **Returns:** A reference to the internal data. ### `copy` `copy(self, out copy: Self)` Explicitly creates a deep copy of an existing variant. **Returns:** A copy of the value. ### `take` `take[T: Copyable & Movable](mut self) -> T` Take the current value of the variant with the provided type. The caller takes ownership of the underlying value. This explicitly checks that your value is of that type! If you haven't verified the type correctness at runtime, the program will abort! **Parameters:** * ​T (`Copyable & Movable`): The type to take out. **Returns:** The underlying data to be taken out as an owned value. ### `unsafe_take` `unsafe_take[T: Copyable & Movable](mut self) -> T` Unsafely take the current value of the variant with the provided type. The caller takes ownership of the underlying value. This doesn't explicitly check that your value is of that type! If you haven't verified the type correctness at runtime, you'll get a type that *looks* like your type, but has potentially unsafe and garbage member data. **Parameters:** * ​T (`Copyable & Movable`): The type to take out. **Returns:** The underlying data to be taken out as an owned value. ### `replace` `replace[Tin: Copyable & Movable, Tout: Copyable & Movable](mut self, owned value: Tin) -> Tout` Replace the current value of the variant with the provided type. The caller takes ownership of the underlying value. This explicitly checks that your value is of that type! If you haven't verified the type correctness at runtime, the program will abort! **Parameters:** * ​Tin (`Copyable & Movable`): The type to put in. * ​Tout (`Copyable & Movable`): The type to take out. **Args:** * ​value (`Tin`): The value to put in. **Returns:** The underlying data to be taken out as an owned value. ### `unsafe_replace` `unsafe_replace[Tin: Copyable & Movable, Tout: Copyable & Movable](mut self, owned value: Tin) -> Tout` Unsafely replace the current value of the variant with the provided type. The caller takes ownership of the underlying value. This doesn't explicitly check that your value is of that type! If you haven't verified the type correctness at runtime, you'll get a type that *looks* like your type, but has potentially unsafe and garbage member data. **Parameters:** * ​Tin (`Copyable & Movable`): The type to put in. * ​Tout (`Copyable & Movable`): The type to take out. **Args:** * ​value (`Tin`): The value to put in. **Returns:** The underlying data to be taken out as an owned value. ### `set` `set[T: Copyable & Movable](mut self, owned value: T)` Set the variant value. This will call the destructor on the old value, and update the variant's internal type and data to the new value.
**Parameters:** * ​T (`Copyable & Movable`): The new variant type. Must be one of the Variant's type arguments. **Args:** * ​value (`T`): The new value to set the variant to. ### `isa` `isa[T: Copyable & Movable](self) -> Bool` Check if the variant contains the required type. **Parameters:** * ​T (`Copyable & Movable`): The type to check. **Returns:** True if the variant contains the requested type. ### `unsafe_get` `unsafe_get[T: Copyable & Movable](ref self) -> ref [self] T` Get the value out of the variant as a type-checked type. This doesn't explicitly check that your value is of that type! If you haven't verified the type correctness at runtime, you'll get a type that *looks* like your type, but has potentially unsafe and garbage member data. For now this has the limitation that it requires the variant value to be mutable. **Parameters:** * ​T (`Copyable & Movable`): The type of the value to get out. **Returns:** A reference to the internal data. ### `is_type_supported` `static is_type_supported[T: Copyable & Movable]() -> Bool` Check if a type can be used by the `Variant`. Example: ```mojo from utils import Variant def takes_variant(mut arg: Variant): if arg.is_type_supported[Float64](): arg = Float64(1.5) def main(): var x = Variant[Int, Float64](1) takes_variant(x) if x.isa[Float64](): print(x[Float64]) # 1.5 ``` For example, the `Variant[Int, Bool]` permits `Int` and `Bool`. **Parameters:** * ​T (`Copyable & Movable`): The type of the value to check support for. **Returns:** `True` if type `T` is supported by the `Variant`. --- ## variant Defines a Variant type. You can use this type to implement variant/sum types. For example: ```mojo from utils import Variant alias IntOrString = Variant[Int, String] fn to_string(mut x: IntOrString) -> String: if x.isa[String](): return x[String] # x.isa[Int]() return String(x[Int]) # They have to be mutable for now, and implement Copyable & Movable var an_int = IntOrString(4) var a_string = IntOrString(String("I'm a string!")) var who_knows = IntOrString(0) import random if random.random_ui64(0, 1): who_knows.set[String]("I'm actually a string too!") print(to_string(an_int)) print(to_string(a_string)) print(to_string(who_knows)) ``` ## Structs * [​`Variant`](/mojo/stdlib/utils/variant/Variant): A runtime-variant type. --- ## Writable The `Writable` trait describes how a type is written into a `Writer`. You must implement `write_to` which takes `self` and a type conforming to `Writer`: ```mojo struct Point(Writable): var x: Float64 var y: Float64 fn write_to[W: Writer](self, mut writer: W): var string = "Point" # Write a single `Span[Byte]`: writer.write_bytes(string.as_bytes()) # Pass multiple args that can be converted to a `Span[Byte]`: writer.write("(", self.x, ", ", self.y, ")") ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `write_to` `write_to[W: Writer](self: _Self, mut writer: W)` Formats the string representation of this type to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to.
There is also a general `write` that takes multiple args that implement `write_to`. Example: ```mojo from memory import Span @fieldwise_init struct NewString(Writer, Writable, Copyable, Movable): var s: String # Writer requirement to write a Span of Bytes fn write_bytes(mut self, bytes: Span[Byte, _]): self.s._iadd(bytes) # Writer requirement to take multiple args fn write[*Ts: Writable](mut self, *args: *Ts): @parameter for i in range(args.__len__()): args[i].write_to(self) # Also make it Writable to allow `print` to write the inner String fn write_to[W: Writer](self, mut writer: W): writer.write(self.s) @fieldwise_init struct Point(Writable, Copyable, Movable): var x: Int var y: Int # Pass multiple args to the Writer. The Int and StaticString types # call `writer.write_bytes` in their own `write_to` implementations. fn write_to[W: Writer](self, mut writer: W): writer.write("Point(", self.x, ", ", self.y, ")") # Enable conversion to a String using `String(point)` fn __str__(self) -> String: return String.write(self) fn main(): var point = Point(1, 2) var new_string = NewString(String(point)) new_string.write("\n", Point(3, 4)) print(new_string) ``` Output: ```plaintext Point(1, 2) Point(3, 4) ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `write_bytes` `write_bytes(mut self: _Self, bytes: Span[SIMD[uint8, 1], origin])` Write a `Span[Byte]` to this `Writer`. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The string slice to write to this Writer. Must NOT be null-terminated. ### `write` `write[*Ts: Writable](mut self: _Self, *args: *Ts)` Write a sequence of Writable arguments to the provided Writer. **Parameters:** * ​\*Ts (`Writable`): Types of the provided argument sequence. **Args:** * ​\*args (`*Ts`): Sequence of arguments to write to this Writer. --- ## write Establishes the contract between `Writer` and `Writable` types. ## Aliases ### `HEAP_BUFFER_BYTES` `alias HEAP_BUFFER_BYTES = env_get_int[::StringSlice[::Bool()` How much memory to pre-allocate for the heap buffer, will abort if exceeded. ### `STACK_BUFFER_BYTES` `alias STACK_BUFFER_BYTES = env_get_int[::StringSlice[::Bool()` The size of the stack buffer for IO operations from CPU. ## Traits * [​`Writable`](/mojo/stdlib/utils/write/Writable): The `Writable` trait describes how a type is written into a `Writer`. * [​`Writer`](/mojo/stdlib/utils/write/Writer): Describes a type that can be written to by any type that implements the `write_to` function.