Python module
kernels
Helper functions for wrapping custom kv cache/attention related ops.
AttentionMaskVariant
class max.nn.kernels.AttentionMaskVariant(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
CAUSAL_MASK
CAUSAL_MASK = 'causal_mask'
CHUNKED_CAUSAL_MASK
CHUNKED_CAUSAL_MASK = 'chunked_causal_mask'
NULL_MASK
NULL_MASK = 'null_mask'
SLIDING_WINDOW_MASK
SLIDING_WINDOW_MASK = 'sliding_window_mask'
TENSOR_MASK
TENSOR_MASK = 'tensor_mask'
MHAMaskConfig
class max.nn.kernels.MHAMaskConfig(attention_mask_variant: 'AttentionMaskVariant', positional_encoding_variant: 'PositionalEncodingVariant')
attention_mask_variant
attention_mask_variant: AttentionMaskVariant
positional_encoding_variant
positional_encoding_variant: PositionalEncodingVariant
MHAMaskVariant
class max.nn.kernels.MHAMaskVariant(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
CAUSAL_ALIBI_MASK
CAUSAL_ALIBI_MASK = '1'
CAUSAL_MASK
CAUSAL_MASK = '0'
CHUNKED_CAUSAL_MASK
CHUNKED_CAUSAL_MASK = '3'
NULL_MASK
NULL_MASK = '2'
SLIDING_WINDOW_MASK
SLIDING_WINDOW_MASK = '4'
PositionalEncodingVariant
class max.nn.kernels.PositionalEncodingVariant(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
ALIBI_POS
ALIBI_POS = 'alibi_pos'
NO_POS
NO_POS = 'no_pos'
cross_attention_ragged()
max.nn.kernels.cross_attention_ragged(kv_params: KVCacheParams, input: TensorValue, input_row_offsets: TensorValue, kv_collection: ContinuousBatchingKVCacheCollection | PagedKVCacheCollection, layer_idx: TensorValue, mask_variant: MHAMaskVariant, kv_input_row_offsets: TensorValue, q_max_seq_len: TensorValue, scale: float) → TensorValue
Computes cross attention provided the !mo.opaque KV Cache.
Notably, this materializes the attention mask (dependent on MHAMaskVariant) within the kernel. input and input_row_offsets are used together to implement the ragged tensor. input_row_offsets indicates where each batch starts and ends in input.
Unlike self attention, kv_input_row_offsets represents the KV sequence length.
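As an illustration of the ragged layout, here is a small NumPy sketch (the variable names are hypothetical and this is not the kernel itself):

```python
import numpy as np

# Three sequences of lengths 2, 3, and 1 packed along one ragged token axis.
seq_lens = np.array([2, 3, 1])
input_row_offsets = np.concatenate(([0], np.cumsum(seq_lens)))  # [0, 2, 5, 6]

# Tokens of batch element i live in input[input_row_offsets[i]:input_row_offsets[i + 1]].
total_tokens = int(input_row_offsets[-1])                    # 6
batch_1 = slice(input_row_offsets[1], input_row_offsets[2])  # rows 2..4 belong to sequence 1
print(input_row_offsets, total_tokens, batch_1)
```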
flare_mla_decode_ragged()
max.nn.kernels.flare_mla_decode_ragged(kv_params: KVCacheParams, input: TensorValue, input_row_offsets: TensorValue, kv_collection: PagedKVCacheCollection, layer_idx: TensorValue, mask_variant: MHAMaskVariant, scale: float, qk_rope_dim: int = 64) → TensorValue
Computes flash (self) attention provided the !mo.opaque KV Cache.
Notably, this materializes the attention mask (dependent on MHAMaskVariant) within the kernel. input and input_row_offsets are used together to implement the ragged tensor. input_row_offsets indicates where each batch starts and ends in input.
Note that this is self attention and the KV sequence length is assumed to be equal to the Q sequence length. For KV sequence length != Q sequence length, use cross_attention_ragged.
flare_mla_decompress_k_cache()
max.nn.kernels.flare_mla_decompress_k_cache(kv_params: KVCacheParams, buffer_row_offsets_1d: TensorValue, cache_offsets_1d: TensorValue, buffer_length: TensorValue, weight: TensorValue, kv_collection: PagedKVCacheCollection, layer_idx: TensorValue, buffer_size: int) → TensorValue
This kernel decompresses the key cache by up-projecting latent representations into the KV space using a weight matrix.
The process involves:
1. Copying buffer_length latent vectors from the key cache into a contiguous buffer (k_latent)
2. Computing k = k_latent @ weight.T to obtain the decompressed keys
Returns:
A tensor of shape [buffer_size, weight.shape[0]] containing the decompressed keys. Note that only the first buffer_length tokens are valid.
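A rough NumPy sketch of the shapes involved (sizes are hypothetical; this only illustrates the up-projection described above, not the kernel itself):

```python
import numpy as np

# Hypothetical sizes for illustration only.
buffer_size, buffer_length, latent_dim, kv_dim = 8, 5, 64, 128

k_latent = np.zeros((buffer_size, latent_dim), dtype=np.float32)  # latent vectors copied from the key cache
weight = np.zeros((kv_dim, latent_dim), dtype=np.float32)         # up-projection weight

k = k_latent @ weight.T      # shape [buffer_size, weight.shape[0]] == (8, 128)
valid_k = k[:buffer_length]  # only the first buffer_length rows hold valid keys
print(k.shape, valid_k.shape)
```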
flare_mla_prefill_plan()
max.nn.kernels.flare_mla_prefill_plan(kv_params: KVCacheParams, input_row_offsets: TensorValue, kv_collection: PagedKVCacheCollection, layer_idx: TensorValue, buffer_size: int, max_chunks: int = 16) → tuple[max.graph.value.TensorValue, max.graph.value.TensorValue, max.graph.value.TensorValue]
This kernel plans how to process a batch of sequences with varying lengths using a fixed-size buffer.
Each sequence in the batch has some existing cached tokens and new input tokens. The kernel divides the total tokens into chunks of buffer_size.
For each chunk (iteration), it calculates:
1. Buffer offsets for each sequence in each chunk
2. Cache offsets for each sequence in each chunk
3. Total buffer lengths for each processing iteration
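For example, assuming the chunking is over the combined token count of the batch, the number of planning iterations could be estimated as in this hypothetical sketch:

```python
import math

# Hypothetical batch: cached + new tokens per sequence, and a fixed-size buffer.
tokens_per_seq = [100, 50]
buffer_size = 64

total_tokens = sum(tokens_per_seq)                      # 150
num_iterations = math.ceil(total_tokens / buffer_size)  # 3 chunks of at most 64 tokens each
print(total_tokens, num_iterations)
```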
flare_mla_prefill_ragged()
max.nn.kernels.flare_mla_prefill_ragged(kv_params: KVCacheParams, input: TensorValue, k: TensorValue, v: TensorValue, input_row_offsets: TensorValue, buffer_row_offsets: TensorValue, cache_offsets: TensorValue, kv_collection: PagedKVCacheCollection, layer_idx: TensorValue, mask_variant: MHAMaskVariant, scale: float, qk_rope_dim: int = 64) → TensorValue
Performs MLA prefill.
flash_attention()
max.nn.kernels.flash_attention(kv_params: KVCacheParams, input: TensorValue, kv_collection: ContinuousBatchingKVCacheCollection, layer_idx: TensorValue, attention_mask: TensorValue, valid_lengths: TensorValue, scale: float) → TensorValue
Computes flash attention provided the mo.opaque KV Cache.
flash_attention_ragged()
max.nn.kernels.flash_attention_ragged(kv_params: KVCacheParams, input: TensorValue, input_row_offsets: TensorValue, kv_collection: ContinuousBatchingKVCacheCollection | PagedKVCacheCollection | PagedKVCacheCollectionFA3Fallback, layer_idx: TensorValue, mask_variant: MHAMaskVariant, scale: float, context_lengths: TensorValue | None = None, local_window_size: int = 8192) → TensorValue
Computes flash (self) attention provided the !mo.opaque KV Cache.
Notably, this materializes the attention mask (dependent on MHAMaskVariant) within the kernel. input and input_row_offsets are used together to implement the ragged tensor. input_row_offsets indicates where each batch starts and ends in input.
Note that this is self attention and the KV sequence length is assumed to be equal to the Q sequence length. For KV sequence length != Q sequence length, use cross_attention_ragged.
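A minimal call sketch, assuming kv_params, x, row_offsets, kv_collection, and layer_idx are values that have already been constructed in the graph (they are not defined here):

```python
from max.nn.kernels import MHAMaskVariant, flash_attention_ragged

head_dim = 128  # hypothetical

attn_out = flash_attention_ragged(
    kv_params=kv_params,            # KVCacheParams for the model
    input=x,                        # ragged activations; assumed [total_tokens, n_heads, head_dim]
    input_row_offsets=row_offsets,  # batch start/end offsets into `input`
    kv_collection=kv_collection,
    layer_idx=layer_idx,
    mask_variant=MHAMaskVariant.CAUSAL_MASK,
    scale=head_dim**-0.5,
)
```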
flash_attention_ragged_paged_fa3_fallback()
max.nn.kernels.flash_attention_ragged_paged_fa3_fallback(kv_params: KVCacheParams, input: TensorValue, input_row_offsets: TensorValue, kv_collection: PagedKVCacheCollectionFA3Fallback, context_lengths: TensorValue, layer_idx: TensorValue) → TensorValue
Computes flash attention provided the !mo.opaque KV Cache, using the FA3 fallback kernel.
flash_attention_with_causal_mask()
max.nn.kernels.flash_attention_with_causal_mask(kv_params: KVCacheParams, input: TensorValue, kv_collection: ContinuousBatchingKVCacheCollection, layer_idx: TensorValue, valid_lengths: TensorValue, scale: float) → TensorValue
Computes flash attention provided the mo.opaque KV Cache. Notably, it materializes the causal mask within the kernel.
fused_qk_ragged_rope()
max.nn.kernels.fused_qk_ragged_rope(kv_params: KVCacheParams, input: TensorValue, input_row_offsets: TensorValue, kv_collection: ContinuousBatchingKVCacheCollection | PagedKVCacheCollection, freqs_cis: TensorValue, layer_idx: TensorValue, interleaved: bool = True) → TensorValue
Computes fused query-key attention with rotary positional encodings and ragged inputs.
Parameters:
- input – [batch_size * seq_len, n_heads, head_dim]
- input_row_offsets –
- freqs_cis – tensor of shape (max_seq_len * 2, head_dim)
- layer_idx –
- interleaved –
input and input_row_offsets are used together to implement the ragged tensor. input_row_offsets indicates where each batch starts and ends in input.
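A minimal call sketch, assuming the values below (kv_params, x, row_offsets, kv_collection, freqs_cis, layer_idx) already exist in the graph:

```python
from max.nn.kernels import fused_qk_ragged_rope

q_roped = fused_qk_ragged_rope(
    kv_params=kv_params,
    input=x,                        # [batch_size * seq_len, n_heads, head_dim]
    input_row_offsets=row_offsets,  # batch start/end offsets into `input`
    kv_collection=kv_collection,
    freqs_cis=freqs_cis,            # (max_seq_len * 2, head_dim)
    layer_idx=layer_idx,
    interleaved=True,
)
```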
fused_qk_rope()
max.nn.kernels.fused_qk_rope(kv_params: KVCacheParams, input: TensorValue, kv_collection: ContinuousBatchingKVCacheCollection, freqs_cis_2d: TensorValue, layer_idx: TensorValue, interleaved: bool = True) → TensorValue
Computes fused query-key attention with rotary positional encodings.
fused_qkv_matmul()
max.nn.kernels.fused_qkv_matmul(kv_params: KVCacheParams, input: TensorValue, wqkv: TensorValue, kv_collection: ContinuousBatchingKVCacheCollection, layer_idx: TensorValue, n_heads: int) → TensorValue
Computes fused query, key and value projections.
fused_qkv_ragged_matmul()
max.nn.kernels.fused_qkv_ragged_matmul(kv_params: KVCacheParams, input: TensorValue, input_row_offsets: TensorValue, wqkv: TensorValue, kv_collection: ContinuousBatchingKVCacheCollection | PagedKVCacheCollection, layer_idx: TensorValue, n_heads: int, bias: TensorValue | None = None) → TensorValue
Computes fused query, key, and value projections with ragged input.
input and input_row_offsets are used together to implement the ragged tensor. input_row_offsets indicates where each batch starts and ends in input.
Raises:
ValueError – on input shapes/dtypes that are invalid for the kernel.
fused_qkv_ragged_matmul_quantized()
max.nn.kernels.fused_qkv_ragged_matmul_quantized(kv_params: KVCacheParams, input: TensorValue, input_row_offsets: TensorValue, wqkv: TensorValue, kv_collection: ContinuousBatchingKVCacheCollection | PagedKVCacheCollection, layer_idx: TensorValue, n_heads: int, quantization_config: QuantizationConfig, perm_idx: TensorValue | None = None, bias: TensorValue | None = None) → TensorValue
Computes fused query, key, and value projections with ragged input and quantized weight matrices. A quantization_config must be provided.
input and input_row_offsets are used together to implement the ragged tensor. input_row_offsets indicates where each batch starts and ends in input.
Raises:
ValueError – on input shapes/dtypes that are invalid for the kernel.
grouped_matmul_ragged()
max.nn.kernels.grouped_matmul_ragged(hidden_states: TensorValue, weight: TensorValue, expert_start_indices: TensorValue, expert_ids: TensorValue, expert_usage_stats_host: TensorValue) → TensorValue
Grouped matmul used in MoE layer.
hidden_states and expert_start_indices are used together to implement the ragged tensor. expert_start_indices indicates where each group starts and ends in hidden_states.
expert_ids contains the ID of the expert for each group in hidden_states.
expert_usage_stats_host is the maximum number of tokens assigned to any expert, and the number of active experts.
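A rough NumPy illustration of the grouping semantics described above (not the kernel itself; the weight layout and the transpose are assumptions):

```python
import numpy as np

hidden_states = np.random.rand(6, 16).astype(np.float32)  # 6 tokens, hidden size 16
weight = np.random.rand(4, 32, 16).astype(np.float32)     # assumed [num_experts, out_dim, in_dim]

expert_start_indices = np.array([0, 2, 5, 6])  # groups: rows [0:2], [2:5], [5:6]
expert_ids = np.array([0, 2, 3])               # expert assigned to each group

out = np.empty((6, 32), dtype=np.float32)
for g, expert in enumerate(expert_ids):
    start, end = expert_start_indices[g], expert_start_indices[g + 1]
    out[start:end] = hidden_states[start:end] @ weight[expert].T
```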
kv_cache_get_max_seq_len()
max.nn.kernels.kv_cache_get_max_seq_len(kv_collection: PagedKVCacheCollection) → TensorValue
This kernel returns the maximum sequence length.
matmul_k_cache_ragged()
max.nn.kernels.matmul_k_cache_ragged(kv_params: KVCacheParams, hidden_states: TensorValue, input_row_offsets: TensorValue, weight: TensorValue, kv_collection: PagedKVCacheCollection, layer_idx: int | integer) → None
Computes key projections with ragged input.
hidden_states and input_row_offsets are used together to implement the ragged tensor. input_row_offsets indicates where each batch starts and ends in input.
matmul_kv_cache_ragged()
max.nn.kernels.matmul_kv_cache_ragged(kv_params: KVCacheParams, hidden_states: TensorValue, input_row_offsets: TensorValue, weight: TensorValue, kv_collection: ContinuousBatchingKVCacheCollection, layer_idx: int | integer) → None
Computes key and value projections with ragged input.
hidden_states and input_row_offsets are used together to implement the ragged tensor. input_row_offsets indicates where each batch starts and ends in input.
moe_create_indices()
max.nn.kernels.moe_create_indices(topk_ids: TensorValue, num_local_experts: int) → tuple[max.graph.value.TensorValue, max.graph.value.TensorValue, max.graph.value.TensorValue, max.graph.value.TensorValue, max.graph.value.TensorValue]
Creates indices for the MoE layer.
Parameters:
- topk_ids – The expert assignments for each token from the router.
- num_local_experts – The number of experts on this device.
Returns:
- token_expert_order: The reordered token indices, grouped by assigned expert.
- expert_start_indices: The starting index for each expert’s token group in the reordered sequence.
- restore_token_order: The indices to restore the original token ordering after expert computation.
- expert_ids: The IDs of the active experts selected for the tokens.
- expert_usage_stats: The maximum number of tokens assigned to any expert, and the number of active experts.
Return type:
A tuple of five tensors
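A rough NumPy illustration of the returned values for a toy top-1 routing (this only mimics the documented semantics, not the kernel itself):

```python
import numpy as np

topk_ids = np.array([2, 0, 2, 1])  # expert assignment for each of 4 tokens
num_local_experts = 4

token_expert_order = np.argsort(topk_ids, kind="stable")        # [1, 3, 0, 2]
sorted_ids = topk_ids[token_expert_order]                        # [0, 1, 2, 2]
expert_ids, counts = np.unique(sorted_ids, return_counts=True)   # active experts, tokens per expert
expert_start_indices = np.concatenate(([0], np.cumsum(counts)))  # [0, 1, 2, 4]
restore_token_order = np.argsort(token_expert_order)             # undoes the reordering
expert_usage_stats = (counts.max(), len(expert_ids))             # (2 tokens max, 3 active experts)
```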
rms_norm_key_cache()
max.nn.kernels.rms_norm_key_cache(kv_params: KVCacheParams, kv_collection: ContinuousBatchingKVCacheCollection | PagedKVCacheCollection, gamma: TensorValue, epsilon: float | floating, layer_idx: int | integer, total_seq_len: Dim, input_row_offsets: TensorValue, rms_norm_cols: int | None = None) → None
Computes RMSNorm on the _new_ entries in the KVCache.
This function applies RMSNorm to either all dimensions or a subset of dimensions in each head of the key cache. The size of the gamma tensor determines how many dimensions will be normalized. If gamma’s size doesn’t match head_dim, rms_norm_cols must be explicitly specified to confirm the intention to normalize only a subset of dimensions.
Currently, the KVCacheT class itself isn’t aware of the new cache entries until the cache length is incremented, which happens after the model forward pass, so input_row_offsets is used to do this bookkeeping.
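A minimal call sketch, assuming kv_params, kv_collection, gamma, layer_idx, total_seq_len, and row_offsets already exist (none are defined here):

```python
from max.nn.kernels import rms_norm_key_cache

rms_norm_key_cache(
    kv_params=kv_params,
    kv_collection=kv_collection,
    gamma=gamma,                    # size head_dim, or fewer columns together with rms_norm_cols
    epsilon=1e-6,
    layer_idx=layer_idx,
    total_seq_len=total_seq_len,
    input_row_offsets=row_offsets,  # marks the new cache entries to normalize
    rms_norm_cols=None,             # set explicitly when gamma covers only a subset of head_dim
)
```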
swish_glu()
max.nn.kernels.swish_glu(a: Value | BufferValue | TensorValue | Shape | Dim | int | float | integer | floating | ndarray, b0: Value | BufferValue | TensorValue | Shape | Dim | int | float | integer | floating | ndarray, b1: Value | BufferValue | TensorValue | Shape | Dim | int | float | integer | floating | ndarray) → TensorValue
unfused_qkv_ragged_matmul_gguf_quantized()
max.nn.kernels.unfused_qkv_ragged_matmul_gguf_quantized(kv_params: KVCacheParams, input: TensorValue, input_row_offsets: TensorValue, n_heads: int, q_weight: TensorValue, k_weight: TensorValue, v_weight: TensorValue, quantization_encoding_q: QuantizationEncoding, quantization_encoding_k: QuantizationEncoding, quantization_encoding_v: QuantizationEncoding, kv_collection: ContinuousBatchingKVCacheCollection | PagedKVCacheCollection, layer_idx: TensorValue) → TensorValue
Computes query, key, and value projections with ragged input and GGUF-quantized weight matrices. A quantization encoding must be provided for each of the query, key, and value weights.
input and input_row_offsets are used together to implement the ragged tensor. input_row_offsets indicates where each batch starts and ends in input.
Raises:
ValueError – on input shapes/dtypes that are invalid for the kernel.