Python module

rotary_embedding

The RoPE (rotary position embedding) layers used within the model.

DeepseekYarnRopeScalingParams

class max.nn.rotary_embedding.DeepseekYarnRopeScalingParams(scaling_factor: float, original_max_position_embeddings: int, beta_fast: int, beta_slow: int, mscale: float, mscale_all_dim: float)

Parameters:

  • scaling_factor (float)
  • original_max_position_embeddings (int)
  • beta_fast (int)
  • beta_slow (int)
  • mscale (float)
  • mscale_all_dim (float)

beta_fast

beta_fast: int

Fast interpolation rate.

beta_slow

beta_slow: int

Slow interpolation rate.

mscale

mscale: float

Scaling factor for middle frequencies.

mscale_all_dim

mscale_all_dim: float

Scaling factor applied to all dimensions.

original_max_position_embeddings

original_max_position_embeddings: int

Original maximum sequence length during training.

scaling_factor

scaling_factor: float

Scaling factor for frequency interpolation.
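As a rough illustration of how parameters like these are typically used, the following sketch follows the publicly available YaRN reference formulation, not necessarily the code inside MAX. The helper names (yarn_correction_dim, yarn_inv_freq, yarn_mscale) are hypothetical and exist only for this example. It blends unscaled ("extrapolated") and scaled ("interpolated") inverse frequencies with a linear ramp whose bounds come from beta_fast and beta_slow, and derives a magnitude factor from an mscale-style value.

```python
import math


def yarn_correction_dim(num_rotations, dim, theta, original_max_position_embeddings):
    # Dimension index at which a frequency completes `num_rotations` full
    # rotations over the original training context.
    return (
        dim * math.log(original_max_position_embeddings / (num_rotations * 2 * math.pi))
    ) / (2 * math.log(theta))


def yarn_inv_freq(dim, theta, scaling_factor, original_max_position_embeddings,
                  beta_fast, beta_slow):
    # Hypothetical helper: blended inverse frequencies, one per rotated pair.
    half = dim // 2
    low = max(math.floor(
        yarn_correction_dim(beta_fast, dim, theta, original_max_position_embeddings)), 0)
    high = min(math.ceil(
        yarn_correction_dim(beta_slow, dim, theta, original_max_position_embeddings)), dim - 1)
    if low == high:
        high += 0.001  # avoid division by zero in the ramp
    inv_freq = []
    for i in range(half):
        extrapolated = theta ** (-2.0 * i / dim)      # original RoPE frequency
        interpolated = extrapolated / scaling_factor  # fully scaled frequency
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        # ramp == 0 keeps the original frequency, ramp == 1 uses the scaled one.
        inv_freq.append(interpolated * ramp + extrapolated * (1.0 - ramp))
    return inv_freq


def yarn_mscale(scaling_factor, mscale):
    # Magnitude correction used by the YaRN reference code; 1.0 when no scaling.
    if scaling_factor <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(scaling_factor) + 1.0


inv = yarn_inv_freq(dim=64, theta=10000.0, scaling_factor=40.0,
                    original_max_position_embeddings=4096, beta_fast=32, beta_slow=1)
```

In this sketch the highest-frequency pairs are left untouched and the lowest-frequency pairs are divided by scaling_factor, with a smooth transition in between. How MAX combines mscale and mscale_all_dim internally is not shown here.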

DeepseekYarnRotaryEmbedding

class max.nn.rotary_embedding.DeepseekYarnRotaryEmbedding(dim, n_heads, theta, max_seq_len, device, head_dim=None, _freqs_cis=None, interleaved=True, scaling_params=None)

Deepseek’s YaRN (Yet another RoPE extensioN) Rotary Position Embedding layer.

Unlike Llama3RotaryEmbedding, the dim argument here is the RoPE dimension of the model, not the hidden dimension.

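Parameters:

  • dim (int)
  • n_heads (int)
  • theta (float)
  • max_seq_len (int)
  • device (DeviceRef)
  • head_dim (int)
  • _freqs_cis (Value[TensorType] | TensorValue | Shape | Dim | int | float | integer | floating | ndarray | None)
  • interleaved (bool)
  • scaling_params (DeepseekYarnRopeScalingParams | None)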

compute_scale()

compute_scale(user_scale=None)

Parameters:

user_scale (float | None)

Return type:

float

freqs_cis_base()

freqs_cis_base()

Computes the frequency tensor of complex exponentials (cis) over the supported sequence positions. The frequencies are scaled by the theta parameter, and the resulting tensor is required to apply Rotary Position Embedding (RoPE) to a tensor. See ‘RoFormer: Enhanced Transformer with Rotary Position Embedding’ (arxiv.org/pdf/2104.09864).

Returns:

The frequency tensor for complex exponentials with shape (max_seq_len, rope_dim // 2, 2)

Return type:

TensorValue

scaling_params

scaling_params: DeepseekYarnRopeScalingParams | None = None

LinearScalingParams

class max.nn.rotary_embedding.LinearScalingParams(factor: float)

Parameters:

factor (float)

factor

factor: float

Main scaling factor for the frequency components of the RoPE.
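A minimal sketch of what linear scaling of RoPE frequencies commonly means, assuming the usual position-interpolation convention in which every inverse frequency is divided by the same factor. The function name linear_scaled_inv_freq is hypothetical, and MAX may apply the factor differently.

```python
def linear_scaled_inv_freq(head_dim, theta, factor):
    # Standard RoPE inverse frequencies, one per rotated pair of dimensions.
    base = [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    # Assumption: linear scaling divides every frequency by `factor`, which is
    # equivalent to compressing position indices by the same amount.
    return [f / factor for f in base]


scaled = linear_scaled_inv_freq(head_dim=8, theta=10000.0, factor=4.0)
```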

Llama3RopeScalingParams

class max.nn.rotary_embedding.Llama3RopeScalingParams(factor: float, low_freq_factor: float, high_freq_factor: float, orig_max_position: int)

Parameters:

  • factor (float)
  • low_freq_factor (float)
  • high_freq_factor (float)
  • orig_max_position (int)

factor

factor: float

Main scaling factor for the frequency components of the RoPE.

high_freq_factor

high_freq_factor: float

Factor to scale the high-frequency components of the RoPE.

low_freq_factor

low_freq_factor: float

Factor to scale the low-frequency components of the RoPE.

orig_max_position

orig_max_position: int

The original maximum position length supported by the model.
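To make the role of these four fields concrete, here is a sketch of the frequency adjustment used by the publicly released Llama 3.1 reference code. It is an illustration under that assumption, not a copy of the MAX implementation, and the function name llama3_scale_inv_freq is hypothetical. High-frequency components are kept, low-frequency components are divided by factor, and the band in between is smoothly interpolated.

```python
import math


def llama3_scale_inv_freq(inv_freq, factor, low_freq_factor, high_freq_factor,
                          orig_max_position):
    # Wavelength thresholds derived from the original context length.
    low_freq_wavelen = orig_max_position / low_freq_factor
    high_freq_wavelen = orig_max_position / high_freq_factor
    scaled = []
    for f in inv_freq:
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:
            # High-frequency component: left unchanged.
            scaled.append(f)
        elif wavelen > low_freq_wavelen:
            # Low-frequency component: fully scaled down.
            scaled.append(f / factor)
        else:
            # Mid band: blend between the scaled and unscaled frequency.
            smooth = (orig_max_position / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * f / factor + smooth * f)
    return scaled


head_dim, theta = 128, 500000.0
base = [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
scaled = llama3_scale_inv_freq(base, factor=8.0, low_freq_factor=1.0,
                               high_freq_factor=4.0, orig_max_position=8192)
```

The values used in the example (theta of 500000, factor of 8, and so on) are illustrative inputs chosen for this sketch.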

Llama3RotaryEmbedding

class max.nn.rotary_embedding.Llama3RotaryEmbedding(dim, n_heads, theta, max_seq_len, device, head_dim=None, _freqs_cis=None, interleaved=True, scaling_params=None)

RotaryEmbedding for Llama3 that takes RoPE scaling into account.

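Parameters:

  • dim (int)
  • n_heads (int)
  • theta (float)
  • max_seq_len (int)
  • device (DeviceRef)
  • head_dim (int)
  • _freqs_cis (Value[TensorType] | TensorValue | Shape | Dim | int | float | integer | floating | ndarray | None)
  • interleaved (bool)
  • scaling_params (Llama3RopeScalingParams | None)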

scaling_params

scaling_params: Llama3RopeScalingParams | None = None

Scaling parameters that enable Llama to function with a longer context length.

LongRoPERotaryEmbedding

class max.nn.rotary_embedding.LongRoPERotaryEmbedding(dim, n_heads, theta, max_seq_len, device, head_dim=None, _freqs_cis=None, interleaved=True, scaling_params=None)

Rotary position embedding with LongRoPE scaling for Phi-3.5 models.

Initialize LongRoPE rotary embeddings.

Parameters:

  • dim (int) – Model dimension
  • n_heads (int) – Number of attention heads
  • theta (float) – Base for computing frequencies (usually 10000.0)
  • max_seq_len (int) – Maximum sequence length
  • device (DeviceRef) – Device to place tensors on
  • head_dim (int) – Head dimension (if None, computed as dim // n_heads)
  • _freqs_cis (Value[TensorType] | TensorValue | Shape | Dim | int | float | integer | floating | ndarray | None) – Pre-computed frequency tensor (optional)
  • interleaved (bool) – Whether to use interleaved RoPE weights
  • scaling_params (LongRoPEScalingParams | None) – LongRoPE scaling parameters

compute_scale()

compute_scale(user_scale=None)

Compute attention scale with LongRoPE adjustment.

Parameters:

user_scale (float | None)

Return type:

float

freqs_cis_base()

freqs_cis_base()

Computes the frequency tensor for complex exponentials (cis) with LongRoPE scaling. Creates a “stitched” table where:

  • Positions 0 to original_max_position use short_factor
  • Positions from original_max_position onwards use long_factor

Returns:

The frequency tensor for complex exponentials with shape (max_seq_len * 2, head_dim / 2, 2)

Return type:

TensorValue
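The "stitched" table described for freqs_cis_base() can be sketched in plain Python as follows. This is a simplified illustration: the function name longrope_stitched_table is hypothetical, the per-dimension factors are assumed to divide the standard RoPE inverse frequencies (as in the public Phi-3 reference code), and the number of rows here is simply max_position_embeddings rather than the exact shape MAX returns.

```python
import math


def longrope_stitched_table(head_dim, theta, short_factor, long_factor,
                            original_max_position, max_position_embeddings):
    half = head_dim // 2

    def inv_freq(factors):
        # Assumption: each factor rescales its own frequency component.
        return [1.0 / (factors[i] * theta ** (2.0 * i / head_dim)) for i in range(half)]

    short_inv = inv_freq(short_factor)
    long_inv = inv_freq(long_factor)

    table = []
    for pos in range(max_position_embeddings):
        # Positions inside the original window use short_factor,
        # positions beyond it use long_factor.
        freqs = short_inv if pos < original_max_position else long_inv
        table.append([[math.cos(pos * f), math.sin(pos * f)] for f in freqs])
    return table


table = longrope_stitched_table(
    head_dim=4, theta=10000.0,
    short_factor=[1.0, 1.0], long_factor=[2.0, 2.0],
    original_max_position=2, max_position_embeddings=4,
)
```

In this toy configuration, position 2 falls in the long_factor region, so its first frequency is halved and it ends up with the same rotation angle as position 1.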

LongRoPEScalingParams

class max.nn.rotary_embedding.LongRoPEScalingParams(short_factor, long_factor, original_max_position, max_position_embeddings)

Parameters for LongRoPE scaling as used in Phi-3.5 models.

Parameters:

  • short_factor (list[float])
  • long_factor (list[float])
  • original_max_position (int)
  • max_position_embeddings (int)

long_factor

long_factor: list[float]

Scaling factors for long sequences (can be much larger).

max_position_embeddings

max_position_embeddings: int

Current max position embeddings after scaling.

original_max_position

original_max_position: int

Original max position embeddings the model was trained with.

short_factor

short_factor: list[float]

Scaling factors for short sequences (typically close to 1.0).

RotaryEmbedding

class max.nn.rotary_embedding.RotaryEmbedding(dim, n_heads, theta, max_seq_len, device, head_dim=None, _freqs_cis=None, interleaved=True)

RotaryEmbedding layer to calculate and apply the frequency tensor for complex exponentials.

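Parameters:

  • dim (int)
  • n_heads (int)
  • theta (float)
  • max_seq_len (int)
  • device (DeviceRef)
  • head_dim (int)
  • _freqs_cis (Value[TensorType] | TensorValue | Shape | Dim | int | float | integer | floating | ndarray | None)
  • interleaved (bool)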

compute_scale()

compute_scale(user_scale=None)

Parameters:

user_scale (float | None)

Return type:

float

device

device: DeviceRef

dim

dim: int

freqs_cis

property freqs_cis: TensorValue

freqs_cis_base()

freqs_cis_base()

Computes the frequency tensor of complex exponentials (cis) over the supported sequence positions. The frequencies are scaled by the theta parameter, and the resulting tensor is required to apply Rotary Position Embedding (RoPE) to a tensor. See ‘RoFormer: Enhanced Transformer with Rotary Position Embedding’ (arxiv.org/pdf/2104.09864).

Returns:

The frequency tensor for complex exponentials with shape (max_seq_len * 2, head_dim / 2, 2)

Return type:

TensorValue
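For orientation, the standard RoFormer frequency table can be written in a few lines of plain Python. This is a sketch of the math only: rope_freqs_cis is a hypothetical name, the result is a nested list rather than a TensorValue, and the number of rows is passed in explicitly instead of being derived from max_seq_len.

```python
import math


def rope_freqs_cis(head_dim, theta, num_positions):
    # One inverse frequency per rotated pair: theta^(-2i / head_dim).
    half = head_dim // 2
    inv_freq = [theta ** (-2.0 * i / head_dim) for i in range(half)]
    table = []
    for pos in range(num_positions):
        # Each entry stores (cos, sin) of pos * inv_freq, i.e. the real and
        # imaginary parts of exp(1j * pos * inv_freq).
        table.append([[math.cos(pos * f), math.sin(pos * f)] for f in inv_freq])
    return table  # shape: (num_positions, head_dim // 2, 2)


table = rope_freqs_cis(head_dim=8, theta=10000.0, num_positions=4)
```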

head_dim

head_dim: int

Defaults to dim // n_heads if not specified in the config.

interleaved

interleaved: bool = True

max_seq_len

max_seq_len: int

The maximum sequence length for the model’s input.

n_heads

n_heads: int

theta

theta: float

Hyperparameter used to control the frequency scaling of the sinusoidal components of the embeddings.
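To show how theta-derived frequencies are applied to a vector, and what an interleaved layout usually refers to, here is a small plain-Python sketch. The function apply_rope is hypothetical, and the mapping of interleaved=True to adjacent pairs (x0, x1), (x2, x3) versus split halves (x_i, x_{i + d/2}) is the common convention, assumed here rather than confirmed for MAX.

```python
import math


def apply_rope(x, pos, theta=10000.0, interleaved=True):
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        angle = pos * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        # Assumed convention: interleaved pairs adjacent elements,
        # non-interleaved pairs element i with element i + d/2.
        a, b = (2 * i, 2 * i + 1) if interleaved else (i, i + half)
        # Rotate the pair (x[a], x[b]) by `angle`.
        out[a] = x[a] * c - x[b] * s
        out[b] = x[a] * s + x[b] * c
    return out


vec = [1.0, 2.0, 3.0, 4.0]
rotated = apply_rope(vec, pos=5)
rotated_split = apply_rope(vec, pos=5, interleaved=False)
```

Because each pair is only rotated, the vector's norm is unchanged in both layouts, and position 0 leaves the input as it is.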