Python module
attention_with_rope
An opaque, KV-cache-optimized attention mechanism with rotary position embeddings (RoPE).
AttentionWithRope
class max.nn.attention.attention_with_rope.AttentionWithRope(n_heads: 'int', kv_params: 'KVCacheParams', layer_idx: 'TensorValue', wqkv: 'TensorValue', wo: 'Linear', scale: 'float', rope: 'OptimizedRotaryEmbedding', bias: 'Optional[TensorValue]' = None, perm_idx: 'Optional[TensorValue]' = None, quantization_config: 'Optional[QuantizationConfig]' = None)
bias
bias: TensorValue | None = None
perm_idx
perm_idx: TensorValue | None = None
quantization_config
quantization_config: QuantizationConfig | None = None
rope
rope: OptimizedRotaryEmbedding
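A minimal construction sketch (not from the official docs): kv_params, layer_idx, wqkv, wo, and rope are assumed to have been built elsewhere as graph objects, and only the keyword names come from the signature above; the 1/sqrt(head_dim) scale is just a common convention, not a requirement of this class.

```python
# Hypothetical helper: every argument except n_heads, head_dim, and scale is
# assumed to be pre-built (a KVCacheParams, TensorValues, a Linear, and an
# OptimizedRotaryEmbedding). Keyword names mirror the documented signature.
import math

from max.nn.attention.attention_with_rope import AttentionWithRope


def build_rope_attention(kv_params, layer_idx, wqkv, wo, rope,
                         n_heads: int = 32, head_dim: int = 128):
    return AttentionWithRope(
        n_heads=n_heads,
        kv_params=kv_params,
        layer_idx=layer_idx,              # TensorValue holding this layer's index
        wqkv=wqkv,                        # stacked QKV weight TensorValue
        wo=wo,                            # output projection (Linear)
        scale=1.0 / math.sqrt(head_dim),  # common attention scaling choice
        rope=rope,                        # OptimizedRotaryEmbedding
    )
```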
AttentionWithRopeQKV
class max.nn.attention.attention_with_rope.AttentionWithRopeQKV(n_heads: 'int', kv_params: 'KVCacheParams', layer_idx: 'int', wq: 'TensorValueLike', wk: 'TensorValueLike', wv: 'TensorValueLike', wo: 'Linear', scale: 'float', rope: 'OptimizedRotaryEmbedding')
rope
rope: OptimizedRotaryEmbedding
AttentionWithRopeV2
class max.nn.attention.attention_with_rope.AttentionWithRopeV2(*, rope: max.nn.rotary_embedding.OptimizedRotaryEmbedding, num_attention_heads: int, num_key_value_heads: int, hidden_size: int, kv_params: max.pipelines.kv_cache.cache_params.KVCacheParams, layer_idx: int, dtype: max._core.dtype.DType = DType.float32, devices: list[max.graph.type.DeviceRef] | None = None, linear_cls: typing.Callable[[...], max.nn.linear.LinearV2] = <class 'max.nn.linear.LinearV2'>, stacked_qkv: bool = False, scale: float | None = None, has_bias: bool = False, clip_qkv: float | None = None)
Implementation of attention that uses the RoPE frequencies.
AttentionWithRopeV2 will replace AttentionWithRope as we roll out the new Layer API.
Initializes the attention layer.
Parameters:
- rope – The rope layer to borrow the freq_cis value from.
- num_attention_heads – The number of attention heads.
- num_key_value_heads – Number of key/value heads.
- hidden_size – The dimension of the hidden states.
- kv_params – KV Cache Params, including the number of kv heads, the head dim, and data type.
- layer_idx – The layer number associated with this Attention block.
- dtype – DType of the weights.
- devices – Device to place the weights and run the computation. If multiple are provided, the first device is used. Use DistributedAttentionWithRope to use all devices during attention computation.
- linear_cls – Linear class to use for the output dense layer.
- stacked_qkv – Whether the Q, K, and V weights are stacked together in a single tensor.
- scale – Value used to scale the results of the attention output.
- has_bias – Whether to use an attention bias.
- clip_qkv – If provided, the QKV weights are clamped to the range [-clip_qkv, clip_qkv].
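As a rough illustration of the keyword arguments above, the following hedged sketch constructs the layer; rope, kv_params, and device are assumed to be created elsewhere (an OptimizedRotaryEmbedding, a KVCacheParams, and a DeviceRef), and the head counts and hidden size are placeholder values.

```python
# Sketch only: rope, kv_params, and device are assumed objects built elsewhere.
# Argument names mirror the documented signature; dtype is left at its
# DType.float32 default.
from max.nn.attention.attention_with_rope import AttentionWithRopeV2


def build_attention_v2(rope, kv_params, device, layer_idx: int):
    return AttentionWithRopeV2(
        rope=rope,
        num_attention_heads=32,      # illustrative query-head count
        num_key_value_heads=8,       # fewer KV heads -> grouped-query attention
        hidden_size=4096,            # illustrative hidden dimension
        kv_params=kv_params,
        layer_idx=layer_idx,
        devices=[device],            # only the first device is used by this class
        stacked_qkv=False,           # separate Q/K/V weights
        has_bias=False,
        clip_qkv=None,               # no clamping of the QKV weights
    )
```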
rope
rope: OptimizedRotaryEmbedding
wqkv
property wqkv: TensorValue
The concatenation of q, k, and v weight vectors.
wqkv_bias
property wqkv_bias: TensorValue | None
The concatenation of q, k, and v bias weight vectors.
DistributedAttentionWithRope
class max.nn.attention.attention_with_rope.DistributedAttentionWithRope(**kwargs)
Distributed, multi-device version of AttentionWithRopeV2 in which all provided devices participate in the attention computation.
Initializes the attention layer.
Parameters:
- rope – The rope layer to borrow the freq_cis value from.
- num_attention_heads – The number of attention heads.
- num_key_value_heads – Number of key/value heads.
- hidden_size – The dimension of the hidden states.
- kv_params – KV Cache Params, including the number of kv heads, the head dim, and data type.
- layer_idx – The layer number associated with this Attention block.
- dtype – DType of the weights.
- devices – Devices to place the weights and run the computation. All provided devices are used for the attention computation.
- linear_cls – Linear class to use for the output dense layer.
- stacked_qkv – Whether the Q, K, and V weights are stacked together in a single tensor.
- scale – Value used to scale the results of the attention output.
- has_bias – Whether to use an attention bias.
- clip_qkv – If provided, the QKV weights are clamped to the range [-clip_qkv, clip_qkv].
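The sketch below mirrors the one for AttentionWithRopeV2, but passes a list of devices that all participate in the computation; rope, kv_params, and the DeviceRef list are assumed to exist already, and the sizes are placeholders.

```python
# Sketch only: `devices` is a list of DeviceRef values, one per GPU, and all
# of them take part in the attention computation (unlike AttentionWithRopeV2,
# which uses only the first device in the list).
from max.nn.attention.attention_with_rope import DistributedAttentionWithRope


def build_distributed_attention(rope, kv_params, devices, layer_idx: int):
    return DistributedAttentionWithRope(
        rope=rope,
        num_attention_heads=32,
        num_key_value_heads=8,
        hidden_size=4096,
        kv_params=kv_params,
        layer_idx=layer_idx,
        devices=devices,
    )
```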
GGUFQAttentionWithRope
class max.nn.attention.attention_with_rope.GGUFQAttentionWithRope(*, rope: max.nn.rotary_embedding.OptimizedRotaryEmbedding, num_attention_heads: int, num_key_value_heads: int, hidden_size: int, kv_params: max.pipelines.kv_cache.cache_params.KVCacheParams, layer_idx: int, dtype: max._core.dtype.DType, quantization_encoding: max.graph.quantization.QuantizationEncoding, devices: list[max.graph.type.DeviceRef] | None = None, linear_cls: typing.Callable[[...], max.nn.linear.LinearV2] = <class 'max.nn.linear.LinearV2'>, scale: float | None = None, has_bias: bool = False, clip_qkv: float | None = None)
Implementation of attention with GGUF quantized weights.
Initializes the attention layer.
Parameters:
- rope – The rope layer to borrow the freq_cis value from.
- num_attention_heads – The number of attention heads.
- num_key_value_heads – Number of key/value heads.
- hidden_size – The dimension of the hidden states.
- kv_params – KV Cache Params, including the number of kv heads, the head dim, and data type.
- layer_idx – The layer number associated with this Attention block.
- dtype – DType of the weights; this should always be uint8.
- devices – Device to place the weights and run the computation. If multiple are provided, the first device is used. Use DistributedAttentionWithRope to use all devices during attention computation.
- quantization_encoding – Quantization encoding of the weights.
- linear_cls – Linear class to use for the output dense layer.
- scale – Value used to scale the results of the attention output.
- has_bias – Whether to use an attention bias.
- clip_qkv – If provided, the QKV weights are clamped to the range [-clip_qkv, clip_qkv].
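A hedged sketch for the GGUF case: rope, kv_params, and device are assumed objects, the DType import path is assumed to be the public alias of the max._core.dtype module shown in the signature, and Q4_K is only an example encoding value.

```python
# Sketch only: dtype must be uint8 for the raw quantized weight data (per the
# parameter docs above), and the quantization encoding should match whatever
# your GGUF checkpoint actually uses.
from max.dtype import DType
from max.graph.quantization import QuantizationEncoding
from max.nn.attention.attention_with_rope import GGUFQAttentionWithRope


def build_gguf_attention(rope, kv_params, device, layer_idx: int):
    return GGUFQAttentionWithRope(
        rope=rope,
        num_attention_heads=32,
        num_key_value_heads=8,
        hidden_size=4096,
        kv_params=kv_params,
        layer_idx=layer_idx,
        dtype=DType.uint8,                                # raw quantized bytes
        quantization_encoding=QuantizationEncoding.Q4_K,  # example encoding
        devices=[device],
    )
```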
rope
rope: OptimizedRotaryEmbedding
wqkv
property wqkv: TensorValue
The concatenation of q, k, and v weight vectors.
wqkv_bias
property wqkv_bias: TensorValue | None
The concatenation of q, k, and v bias weight vectors.
GPTQAttentionWithRope
class max.nn.attention.attention_with_rope.GPTQAttentionWithRope(quantization_config: max.graph.quantization.QuantizationConfig, rope: max.nn.rotary_embedding.OptimizedRotaryEmbedding, num_attention_heads: int, num_key_value_heads: int, hidden_size: int, kv_params: max.pipelines.kv_cache.cache_params.KVCacheParams, layer_idx: int, dtype: max._core.dtype.DType = DType.float32, devices: list[max.graph.type.DeviceRef] | None = None, scale: float | None = None, linear_cls: typing.Callable[[...], max.nn.linear.LinearV2] = <class 'max.nn.linear.LinearV2'>)
Implementation of the GPT-Q attention layer.
Initializes the attention layer.
Parameters:
- rope – The rope layer to borrow the freq_cis value from.
- num_attention_heads – The number of attention heads.
- num_key_value_heads – Number of key/value heads.
- hidden_size – The dimension of the hidden states.
- kv_params – KV Cache Params, including the number of kv heads, the head dim, and data type.
- layer_idx – The layer number associated with this Attention block.
- dtype – DType of the weights.
- devices – Device to place the weights and run the computation. If multiple are provided, the first device is used. Use DistributedAttentionWithRope to use all devices during attention computation.
- linear_cls – Linear class to use for the output dense layer.
- scale – Value used to scale the results of the attention output.
- quantization_config – The GPTQ quantization configuration for the weights.
wqkv
property wqkv: TensorValue
The concatenation of q, k, and v weight vectors.
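A hedged construction sketch for the GPTQ case: the QuantizationConfig (from max.graph.quantization, as in the signature), rope, kv_params, and device are all assumed to be built elsewhere, typically from the checkpoint's quantization settings.

```python
# Sketch only: quantization_config is a max.graph.quantization.QuantizationConfig
# built elsewhere (e.g. from the checkpoint's GPTQ settings); the remaining
# keyword names follow the documented signature.
from max.nn.attention.attention_with_rope import GPTQAttentionWithRope


def build_gptq_attention(quantization_config, rope, kv_params, device,
                         layer_idx: int):
    return GPTQAttentionWithRope(
        quantization_config=quantization_config,
        rope=rope,
        num_attention_heads=32,
        num_key_value_heads=8,
        hidden_size=4096,
        kv_params=kv_params,
        layer_idx=layer_idx,
        devices=[device],
    )
```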
distribute_value()
max.nn.attention.attention_with_rope.distribute_value(v: TensorValue, devices: List[DeviceRef]) → List[TensorValue]
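distribute_value carries no description above; judging from its signature it returns one TensorValue per device. A hypothetical usage sketch under that assumption:

```python
# Hypothetical usage inferred from the signature only: fan a single graph
# value out so each device has its own copy.
from max.nn.attention.attention_with_rope import distribute_value


def replicate_across_devices(value, devices):
    per_device = distribute_value(value, devices)
    assert len(per_device) == len(devices)  # one TensorValue per DeviceRef
    return per_device
```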