Python module

core

AudioGenerationRequest

class max.pipelines.core.AudioGenerationRequest(id: 'str', index: 'int', model: 'str', input: 'Optional[str]' = None, audio_prompt_tokens: 'list[int]' = <factory>, audio_prompt_transcription: 'str' = '', sampling_params: 'SamplingParams' = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0), _assistant_message_override: 'str | None' = None, prompt: 'Optional[list[int] | str]' = None)

Parameters:

audio_prompt_tokens

audio_prompt_tokens: list[int]

The prompt speech IDs to use for audio generation.

audio_prompt_transcription

audio_prompt_transcription: str = ''

The audio prompt transcription to use for audio generation.

id

id: str

A unique identifier for the request. This ID can be used to trace and log the request throughout its lifecycle, facilitating debugging and tracking.

index

index: int

The sequence order of this request within a batch. This is useful for maintaining the order of requests when processing multiple requests simultaneously, ensuring that responses can be matched back to their corresponding requests accurately.

input

input: str | None = None

The text to generate audio for. The maximum length is 4096 characters.

model

model: str

The name of the model to be used for generating audio chunks. This should match the available models on the server and determines the behavior and capabilities of the response generation.

prompt

prompt: list[int] | str | None = None

Optionally provide a preprocessed list of token IDs or a prompt string to pass directly to the model as input. This bypasses the automatic construction of TokenGeneratorRequestMessages from the input, audio_prompt_tokens, and audio_prompt_transcription fields.

sampling_params

sampling_params: SamplingParams = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

Request sampling configuration options.
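
For illustration, a minimal sketch of constructing a request in Python (the model name and field values below are placeholders, not library defaults):

import numpy as np  # not required here; shown only if you extend the sketch
from max.pipelines.core import AudioGenerationRequest, SamplingParams

# Hypothetical single TTS request; "my-tts-model" is a placeholder model name.
request = AudioGenerationRequest(
    id="req-0",                 # unique identifier used for tracing
    index=0,                    # position of this request within its batch
    model="my-tts-model",
    input="Hello from the audio pipeline.",
    sampling_params=SamplingParams(top_k=1, max_new_tokens=256),
)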

AudioGenerationResponse

class max.pipelines.core.AudioGenerationResponse(final_status, audio=None)

Parameters:

  • final_status (TextGenerationStatus )
  • audio (ndarray | None )

audio_data

property audio_data: ndarray

final_status

property final_status: TextGenerationStatus

has_audio_data

property has_audio_data: bool

is_done

property is_done: bool

AudioGenerator

class max.pipelines.core.AudioGenerator(*args, **kwargs)

Interface for audio generation models.

decoder_sample_rate

property decoder_sample_rate: int

The sample rate of the decoder.

next_chunk()

next_chunk(batch, num_tokens)

Computes the next audio chunk for a single batch.

The new speech tokens are saved to the context. The most recently generated audio is returned through the AudioGenerationResponse.

Parameters:

  • batch (dict [ str , AudioGeneratorContext ] ) – Batch of contexts.
  • num_tokens (int ) – Number of speech tokens to generate.

Returns:

Dictionary mapping request IDs to audio generation responses.

Return type:

dict[str, AudioGenerationResponse]

release()

release(context)

Releases resources associated with this context.

Parameters:

context (AudioGeneratorContext ) – Finished context.

Return type:

None
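
As a sketch of how an AudioGenerator is typically driven (the concrete generator and the batch of AudioGeneratorContext objects are assumed to exist already; this is not a complete pipeline):

def run_audio_batch(generator, contexts, tokens_per_step=64):
    # `generator` is assumed to implement AudioGenerator; `contexts` maps
    # request IDs to AudioGeneratorContext objects.
    finished = {}
    while contexts:
        responses = generator.next_chunk(contexts, num_tokens=tokens_per_step)
        for req_id, response in responses.items():
            if response.is_done:
                finished[req_id] = response
                generator.release(contexts.pop(req_id))
    return finished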

AudioGeneratorOutput

class max.pipelines.core.AudioGeneratorOutput(audio_data: 'torch.Tensor', metadata: 'dict[str, Any]', is_done: 'bool')

Parameters:

  • audio_data (torch.Tensor )
  • metadata (dict [ str , Any ] )
  • is_done (bool )

audio_data

audio_data: torch.Tensor

is_done

is_done: bool

metadata

metadata: dict[str, Any]

EmbeddingsGenerator

class max.pipelines.core.EmbeddingsGenerator(*args, **kwargs)

Interface for LLM embeddings-generator models.

encode()

encode(batch)

Computes embeddings for a batch of inputs.

Parameters:

batch (dict [ str , EmbeddingsGeneratorContext ] ) – Batch of contexts to generate embeddings for.

Returns:

Dictionary mapping request IDs to their corresponding embeddings. Each embedding is typically a numpy array or tensor of floating point values.

Return type:

dict[str, Any]
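
A minimal usage sketch, assuming a concrete EmbeddingsGenerator implementation and an already-prepared batch of contexts:

def embed_batch(embedder, contexts):
    # `embedder` is assumed to implement EmbeddingsGenerator; `contexts` maps
    # request IDs to EmbeddingsGeneratorContext objects.
    embeddings = embedder.encode(contexts)
    for request_id, vector in embeddings.items():
        print(request_id, getattr(vector, "shape", None))
    return embeddings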

EmbeddingsResponse

class max.pipelines.core.EmbeddingsResponse(embeddings)

Container for the response from embeddings pipeline.

Parameters:

embeddings (ndarray )

embeddings

embeddings: ndarray

InputContext

class max.pipelines.core.InputContext(*args, **kwargs)

A base class for model contexts, representing model inputs for TokenGenerators.

Token array layout:

.                      +---------- full prompt ----------+   CHUNK_SIZE*N v
. +--------------------+---------------+-----------------+----------------+
. |      completed     |  next_tokens  |                 |  preallocated  |
. +--------------------+---------------+-----------------+----------------+
.            start_idx ^    active_idx ^         end_idx ^
  • completed: The tokens that have already been processed and encoded.
  • next_tokens: The tokens that will be processed in the next iteration. This may be a subset of the full prompt due to chunked prefill.
  • preallocated: The token slots that have been preallocated. The token array resizes to multiples of CHUNK_SIZE to accommodate the new tokens.
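
The index semantics can be illustrated with plain NumPy; this is an illustration of the layout above, not code from the library:

import numpy as np

# A 5-token prompt stored in a buffer preallocated to one CHUNK_SIZE multiple.
CHUNK_SIZE = 128
prompt = np.array([101, 2023, 2003, 1037, 3231], dtype=np.int64)
tokens = np.zeros(CHUNK_SIZE, dtype=np.int64)
tokens[: len(prompt)] = prompt

start_idx, active_idx, end_idx = 0, len(prompt), len(prompt)
next_tokens = tokens[start_idx:active_idx]   # fed to the model this iteration
preallocated = tokens[end_idx:]              # empty slots for generated tokens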

active_idx

property active_idx: int

active_length

property active_length: int

Current sequence length: the number of tokens input this iteration.

This will be the prompt size for context encoding, and simply 1 for token generation.

all_tokens

property all_tokens: ndarray

All prompt and generated tokens in the context.

assign_to_cache()

assign_to_cache(cache_seq_id)

Assigns the context to a cache slot.

Parameters:

cache_seq_id (int )

Return type:

None

bump_token_indices()

bump_token_indices(start_idx=0, active_idx=0, end_idx=0, committed_idx=0)

Update the start_idx, active_idx, end_idx, and committed_idx without manipulating the token array.

Parameters:

  • start_idx (int )
  • active_idx (int )
  • end_idx (int )
  • committed_idx (int )

Return type:

None

cache_seq_id

property cache_seq_id: int

Returns the cache slot assigned to the context, raising an error if not assigned.

committed_idx

property committed_idx: int

compute_num_available_steps()

compute_num_available_steps(max_seq_len)

Compute the max number of steps we can execute for a given context without exceeding the max_seq_len.

Parameters:

max_seq_len (int )

Return type:

int

current_length

property current_length: int

The current length of the sequence, including completed and active tokens.

end_idx

property end_idx: int

eos_token_ids

property eos_token_ids: set[int]

generated_tokens

property generated_tokens: ndarray

All generated tokens in the context.

get_min_token_logit_mask()

get_min_token_logit_mask(num_steps)

Returns a set of indices for the tokens in the output that should be masked.

This is primarily used for the min_tokens setting, where we mask eos tokens in the logits to avoid generating them before we reach min_tokens.

Parameters:

num_steps (int )

Return type:

list[ndarray[Any, dtype[int32]]]

is_assigned_to_cache

property is_assigned_to_cache: bool

Returns True if input is assigned to a cache slot, False otherwise.

is_ce

property is_ce: bool

Returns True if the context is a context encoding context, False otherwise.

is_done

property is_done: bool

is_initial_prompt

property is_initial_prompt: bool

Returns true if the context has not been updated with tokens.

json_schema

property json_schema: str | None

A json schema to use during constrained decoding.

jump_ahead()

jump_ahead(new_token)

Updates the token array, while ensuring the new token is returned to the user.

Parameters:

new_token (int )

Return type:

None

log_probabilities

property log_probabilities: int

When > 0, returns the log probabilities for the top N tokens for each token in the sequence.

log_probabilities_echo

property log_probabilities_echo: bool

When True, the input tokens are added to the returned logprobs.

matcher

property matcher: xgr.GrammarMatcher | None

An optional xgr Grammar Matcher provided when using structured output.

max_length

property max_length: int | None

The maximum length of this sequence.

min_tokens

property min_tokens: int

The minimum number of new tokens to generate.

next_tokens

property next_tokens: ndarray

The next prompt tokens to be input during this iteration.

This should be a 1D array of tokens of length active_length.

outstanding_completion_tokens()

outstanding_completion_tokens()

Return the list of outstanding completion tokens and log probabilities that must be returned to the user.

Return type:

list[tuple[int, LogProbabilities | None]]

prompt_tokens

property prompt_tokens: ndarray

Prompt tokens in the context.

reset()

reset()

Resets the context’s state by combining all tokens into a new prompt. This method is used when a request is evicted, meaning that the context needs to be re-encoded in the following CE iteration.

Return type:

None

rollback()

rollback(idx)

Rolls back the context by removing the last idx tokens.

Parameters:

idx (int )

Return type:

None

sampling_params

property sampling_params: SamplingParams

Returns the per-request sampling configuration.

set_draft_offset()

set_draft_offset(idx)

Parameters:

idx (int )

Return type:

None

set_matcher()

set_matcher(matcher)

Set a grammar matcher for use during constrained decoding.

Parameters:

matcher (xgr.GrammarMatcher )

Return type:

None

set_token_indices()

set_token_indices(start_idx=None, active_idx=None, end_idx=None, committed_idx=None)

Set the token indices without manipulating the token array.

Parameters:

  • start_idx (int | None )
  • active_idx (int | None )
  • end_idx (int | None )
  • committed_idx (int | None )

Return type:

None

start_idx

property start_idx: int

status

property status: TextGenerationStatus

tokens

property tokens: ndarray

All tokens (including padded tokens) in the context. In most scenarios, use all_tokens to get the active full token array.

unassign_from_cache()

unassign_from_cache()

Unassigns the context from a cache slot.

Return type:

None

update()

update(new_token, log_probabilities=None)

Updates the next_tokens and extends existing tokens to include all generated tokens.

Parameters:

  • new_token (int )
  • log_probabilities (LogProbabilities | None )

Return type:

None

update_status()

update_status(status)

Parameters:

status (TextGenerationStatus )

Return type:

None

LogProbabilities

class max.pipelines.core.LogProbabilities(token_log_probabilities, top_log_probabilities)

Log probabilities for an individual output token.

Parameters:

token_log_probabilities

token_log_probabilities

Log probabilities of each output token.

Type:

list[float]

top_log_probabilities

top_log_probabilities

Top tokens and their corresponding log probabilities.

Type:

list[dict[int, float]]
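
For example, a single generated step might be represented as follows (the values are illustrative):

from max.pipelines.core import LogProbabilities

# One generated token (id 42) with its log probability, plus the top-2
# candidate tokens considered at that step.
lp = LogProbabilities(
    token_log_probabilities=[-0.105],
    top_log_probabilities=[{42: -0.105, 7: -2.31}],
)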

PipelineTask

class max.pipelines.core.PipelineTask(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

AUDIO_GENERATION

AUDIO_GENERATION = 'audio_generation'

EMBEDDINGS_GENERATION

EMBEDDINGS_GENERATION = 'embeddings_generation'

SPEECH_TOKEN_GENERATION

SPEECH_TOKEN_GENERATION = 'speech_token_generation'

TEXT_GENERATION

TEXT_GENERATION = 'text_generation'

PipelineTokenizer

class max.pipelines.core.PipelineTokenizer(*args, **kwargs)

Interface for LLM tokenizers.

decode()

async decode(context, encoded, **kwargs)

Decodes response tokens to text.

Parameters:

  • context (TokenGeneratorContext ) – Current generation context.
  • encoded (TokenizerEncoded ) – Encoded response tokens.

Returns:

Un-encoded response text.

Return type:

str

encode()

async encode(prompt, add_special_tokens)

Encodes text prompts as tokens.

Parameters:

  • prompt (str ) – Un-encoded prompt text.
  • add_special_tokens (bool )

Raises:

ValueError – If the prompt exceeds the configured maximum length.

Return type:

TokenizerEncoded

eos

property eos: int

The end of sequence token for this tokenizer.

expects_content_wrapping

property expects_content_wrapping: bool

If true, this tokenizer expects messages to have a content property.

Text messages are formatted as:

{ "type": "text", "content": "text content" }
{ "type": "text", "content": "text content" }

instead of the OpenAI spec:

{ "type": "text", "text": "text content" }
{ "type": "text", "text": "text content" }

NOTE: Multimodal messages omit the content property. Both image_urls and image content parts are converted to:

{ "type": "image" }
{ "type": "image" }

Their content is provided as byte arrays through the top-level property on the request object, i.e., PipelineTokenizerRequest.images.

new_context()

async new_context(request)

Creates a new context from a request object. This is sent to the worker process once and then cached locally.

Parameters:

request (PipelineTokenizerRequest ) – Incoming request.

Returns:

Initialized context.

Return type:

TokenGeneratorContext
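
A minimal async usage sketch, assuming a concrete PipelineTokenizer implementation; the tokenizer and request objects are placeholders:

import asyncio

async def tokenize_round_trip(tokenizer, request, prompt_text):
    # `tokenizer` is assumed to implement PipelineTokenizer and `request` to be
    # a PipelineTokenizerRequest.
    context = await tokenizer.new_context(request)
    encoded = await tokenizer.encode(prompt_text, add_special_tokens=True)
    return await tokenizer.decode(context, encoded)

# asyncio.run(tokenize_round_trip(tokenizer, request, "Hello, world"))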

SamplingParams

class max.pipelines.core.SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

Request-specific sampling parameters that are only known at run time.

Parameters:

detokenize

detokenize: bool = True

Whether to detokenize the output tokens into text.

frequency_penalty

frequency_penalty: float = 0.0

The frequency penalty to apply to the model’s output. A positive value will penalize new tokens based on their frequency in the generated text: tokens will receive a penalty proportional to the count of appearances.

ignore_eos

ignore_eos: bool = False

If True, the response will ignore the EOS token, and continue to generate until the max tokens or a stop string is hit.

max_new_tokens

max_new_tokens: int | None = None

The maximum number of new tokens to generate in the response. If not set, the model may generate tokens until it reaches its internal limits or based on other stopping criteria.

min_new_tokens

min_new_tokens: int = 0

The minimum number of tokens to generate in the response.

min_p

min_p: float = 0.0

Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.

presence_penalty

presence_penalty: float = 0.0

The presence penalty to apply to the model’s output. A positive value will penalize new tokens that have already appeared in the generated text at least once by applying a constant penalty.

repetition_penalty

repetition_penalty: float = 1.0

The repetition penalty to apply to the model’s output. Values > 1 will penalize new tokens that have already appeared in the generated text at least once by dividing the logits by the repetition penalty.

seed

seed: int = 0

The seed to use for the random number generator.

stop

stop: list[str] | None = None

A list of detokenized sequences that can be used as stop criteria when generating a new sequence.

stop_token_ids

stop_token_ids: list[int] | None = None

A list of token ids that are used as stopping criteria when generating a new sequence.

temperature

temperature: float = 1

Controls the randomness of the model’s output; higher values produce more diverse responses.

top_k

top_k: int = 1

Limits the sampling to the K most probable tokens. This defaults to 1, which enables greedy sampling.

top_p

top_p: float = 1

Only use the tokens whose cumulative probability is within the top_p threshold. This applies to the top_k tokens.
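
For example, the defaults produce greedy decoding, while a sampled configuration might look like the following (the specific values are illustrative):

from max.pipelines.core import SamplingParams

# The documented defaults (top_k=1) already give greedy decoding.
greedy = SamplingParams()

# A more exploratory configuration.
creative = SamplingParams(
    top_k=40,
    top_p=0.95,
    temperature=0.8,
    max_new_tokens=512,
    stop=["\n\n"],
    seed=42,
)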

TTSContext

class max.pipelines.core.TTSContext(audio_prompt_tokens=<factory>, prev_samples_beyond_offset=0, _speech_token_size=128, _speech_token_end_idx=0, _speech_tokens=<factory>, _decoded_index=0, _block_counter=0, _arrival_time=<factory>, _audio_generation_status=TextGenerationStatus.ACTIVE, *, prompt, max_length, tokens, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=None, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, _matcher=None, _status=TextGenerationStatus.ACTIVE, _cache_seq_id=None, _size=-1, _start_idx=0, _active_idx=-1, _end_idx=-1, _completion_start_idx=-1, _completion_end_idx=-1, _prompt_len=-1, _committed_idx=0, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0)

A context for Text-to-Speech (TTS) model inference.

This class extends TextContext to handle speech token generation and management. It maintains buffers for audio prompt tokens and generated speech tokens, along with tracking indices for decoding progress.

Parameters:

  • audio_prompt_tokens (ndarray ) – Array of input audio prompt tokens used for voice cloning
  • prev_samples_beyond_offset (int )
  • _speech_token_size (int ) – Size of the speech token buffer, defaults to SPEECH_TOKEN_audio_chunk_size
  • _speech_token_end_idx (int ) – Index marking the end of valid speech tokens
  • _speech_tokens (ndarray ) – Buffer containing the generated speech tokens
  • _decoded_index (int ) – Index tracking how many tokens have been decoded to audio
  • _block_counter (int ) – Counter tracking number of speech token blocks generated
  • _arrival_time (float )
  • _audio_generation_status (TextGenerationStatus )
  • prompt (str | Sequence [ int ] )
  • max_length (int )
  • tokens (ndarray )
  • eos_token_ids (set [ int ] )
  • eos_sequences (list [ list [ int ] ] )
  • log_probabilities (int | None )
  • log_probabilities_echo (bool )
  • ignore_eos (bool )
  • json_schema (str | None )
  • sampling_params (SamplingParams )
  • _matcher (Any | None )
  • _status (TextGenerationStatus )
  • _cache_seq_id (int | None )
  • _size (int )
  • _start_idx (int )
  • _active_idx (int )
  • _end_idx (int )
  • _completion_start_idx (int )
  • _completion_end_idx (int )
  • _prompt_len (int )
  • _committed_idx (int )
  • _log_probabilities_data (dict [ int , LogProbabilities ] )
  • _is_initial_prompt (bool )
  • _draft_offset (int )

audio_generation_status

property audio_generation_status: TextGenerationStatus

audio_prompt_tokens

audio_prompt_tokens: ndarray

block_counter

property block_counter: int

decoded_index

property decoded_index: int

has_undecoded_speech_tokens()

has_undecoded_speech_tokens(exclude_last_n=0)

Checks whether there are undecoded speech tokens.

Parameters:

exclude_last_n (int ) – Number of tokens to exclude from the end when checking for undecoded tokens. For example, if set to 1, the last token will not be considered when checking for undecoded tokens.

Returns:

True if there are undecoded speech tokens (excluding the last n tokens), False otherwise.

Return type:

bool

is_done

property is_done: bool

next_speech_tokens()

next_speech_tokens(audio_chunk_size=None, buffer=None)

Returns a chunk of the next unseen speech tokens.

Calling this function will update the index of the last seen token.

Parameters:

  • audio_chunk_size (int | None ) – The number of speech tokens to return.
  • buffer (int | None ) – The number of previous speech tokens to pass to the audio decoder on each generation step.

Returns:

A tuple of (chunk of speech tokens, buffer).

Return type:

tuple[ndarray, int]
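
A small sketch of fetching a chunk for decoding, assuming a TTSContext that a speech-token generator has been filling; the chunk size is illustrative:

def fetch_speech_chunk(ctx, chunk_size=50):
    # `ctx` is assumed to be a TTSContext with undecoded speech tokens.
    if ctx.has_undecoded_speech_tokens():
        chunk, buffer = ctx.next_speech_tokens(audio_chunk_size=chunk_size)
        return chunk, buffer   # hand these to the audio decoder
    return None, 0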

prev_samples_beyond_offset

prev_samples_beyond_offset: int

set_decoded_index()

set_decoded_index(idx)

Parameters:

idx (int )

Return type:

None

speech_token_status

property speech_token_status: TextGenerationStatus

Returns the status of the speech token generation.

speech_tokens

property speech_tokens: ndarray

status

property status: TextGenerationStatus

update_audio_generation_status()

update_audio_generation_status(status)

Parameters:

status (TextGenerationStatus )

Return type:

None

update_speech_token_status()

update_speech_token_status(status)

Parameters:

status (TextGenerationStatus )

Return type:

None

update_speech_tokens()

update_speech_tokens(new_tokens)

Updates the stored speech tokens with new_tokens.

Parameters:

new_tokens (ndarray )

Return type:

None

update_status()

update_status(status)

Parameters:

status (TextGenerationStatus )

Return type:

None

TextAndVisionContext

class max.pipelines.core.TextAndVisionContext(*, prompt, max_length, tokens, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=None, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, _matcher=None, _status=TextGenerationStatus.ACTIVE, _cache_seq_id=None, _size=-1, _start_idx=0, _active_idx=-1, _end_idx=-1, _completion_start_idx=-1, _completion_end_idx=-1, _prompt_len=-1, _committed_idx=0, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, pixel_values=<factory>, extra_model_args=<factory>)

A base class for model context, specifically for Vision model variants.

Parameters:

extra_model_args

extra_model_args: dict[str, ndarray]

pixel_values

pixel_values: tuple[ndarray, ...]

update()

update(new_token, log_probabilities=None)

Updates the next_tokens and extends existing tokens to include all generated tokens.

Parameters:

  • new_token (int )
  • log_probabilities (LogProbabilities | None )

Return type:

None

TextContext

class max.pipelines.core.TextContext(*, prompt, max_length, tokens, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=None, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, _matcher=None, _status=TextGenerationStatus.ACTIVE, _cache_seq_id=None, _size=-1, _start_idx=0, _active_idx=-1, _end_idx=-1, _completion_start_idx=-1, _completion_end_idx=-1, _prompt_len=-1, _committed_idx=0, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0)

A base class for model context, specifically for Text model variants.

This class manages the state and processing of text generation, including token management, caching, and generation parameters.

Parameters:

  • prompt (str | Sequence [ int ] ) – The input prompt as either a string or sequence of token IDs
  • max_length (int ) – Maximum allowed length of the generated sequence
  • tokens (ndarray ) – NumPy array containing the token IDs
  • eos_token_ids (set [ int ] ) – Set of token IDs that indicate end of sequence
  • eos_sequences (list [ list [ int ] ] )
  • log_probabilities (int | None ) – Number of top tokens to return log probabilities for, or None to disable
  • log_probabilities_echo (bool ) – Whether to return log probabilities for prompt tokens
  • ignore_eos (bool ) – Whether to ignore end of sequence tokens and continue generating
  • json_schema (str | None ) – Optional JSON schema for structured output
  • sampling_params (SamplingParams ) – Parameters controlling the token sampling strategy
  • _matcher (Any | None )
  • _status (TextGenerationStatus ) – Current generation status (active, finished, etc)
  • _cache_seq_id (int | None ) – ID of KV cache slot assigned to this context
  • _size (int ) – Current allocated size of token array
  • _start_idx (int ) – Start index of current generation window
  • _active_idx (int ) – Current position in token sequence
  • _end_idx (int ) – End index of valid tokens
  • _completion_start_idx (int ) – Start index of completion tokens
  • _completion_end_idx (int ) – End index of completion tokens
  • _prompt_len (int ) – Length of original prompt
  • _committed_idx (int ) – Index up to which tokens are committed
  • _log_probabilities_data (dict [ int , LogProbabilities ] ) – Token log probabilities data
  • _is_initial_prompt (bool ) – Whether this is the initial prompt encoding
  • _draft_offset (int ) – Offset for draft decoding
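
A minimal construction sketch using only the documented public fields; the token IDs and EOS id are illustrative and not tied to any particular tokenizer:

import numpy as np
from max.pipelines.core import SamplingParams, TextContext

prompt_ids = [101, 7592, 2088, 102]
context = TextContext(
    prompt=prompt_ids,
    max_length=512,
    tokens=np.array(prompt_ids, dtype=np.int64),
    eos_token_ids={102},
    sampling_params=SamplingParams(top_k=1),
)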

active_idx

property active_idx: int

active_length

property active_length: int

Current sequence length: the number of tokens input this iteration.

This will be the prompt size for context encoding, and 1 (or more) for token generation.

all_tokens

property all_tokens: ndarray

assign_to_cache()

assign_to_cache(cache_seq_id)

Assigns this context to a cache slot.

The cache slot is used to store and retrieve KV-cache entries for this context during token generation.

Parameters:

cache_seq_id (int ) – The ID of the cache slot to assign this context to.

Raises:

RuntimeError – If this context is already assigned to a cache slot.

Return type:

None

bump_token_indices()

bump_token_indices(start_idx=0, active_idx=0, end_idx=0, committed_idx=0)

Update the start_idx, active_idx, end_idx, and committed_idx without manipulating the token array.

Parameters:

  • start_idx (int )
  • active_idx (int )
  • end_idx (int )
  • committed_idx (int )

Return type:

None

cache_seq_id

property cache_seq_id: int

Gets the ID of the cache slot this context is assigned to.

The cache_seq_id is used to look up KV-cache entries for this context during token generation.

Returns:

The cache slot ID.

Return type:

int

Raises:

ValueError – If this context is not currently assigned to a cache slot.

committed_idx

property committed_idx: int

compute_num_available_steps()

compute_num_available_steps(max_seq_len)

Compute the max number of steps we can execute for a given context without exceeding the max_seq_len.

Parameters:

max_seq_len (int )

Return type:

int

current_length

property current_length: int

The current length of the sequence, including completed and active tokens.

end_idx

property end_idx: int

eos_sequences

eos_sequences: list[list[int]]

eos_token_ids

eos_token_ids: set[int]

generated_tokens

property generated_tokens: ndarray

Returns all tokens that have been generated after the prompt.

Returns:

Array of generated tokens from prompt_len to end_idx.

Return type:

np.ndarray

get_min_token_logit_mask()

get_min_token_logit_mask(num_steps)

Returns a set of indices for the tokens in the output that should be masked.

This is primarily used for the min_tokens setting, where we mask eos tokens in the logits to avoid generating them before we reach min_tokens.

Returns:

A set of indices for the tokens in the output that should be masked.

Parameters:

num_steps (int )

Return type:

list[ndarray[Any, dtype[int32]]]

ignore_eos

ignore_eos: bool

is_assigned_to_cache

property is_assigned_to_cache: bool

Returns whether this context is currently assigned to a cache slot.

The cache assignment status indicates whether this context can currently access KV-cache entries for token generation.

Returns:

True if assigned to a cache slot, False otherwise.

Return type:

bool

is_ce

property is_ce: bool

Returns whether this context is in context encoding (CE) mode.

CE mode indicates that the context has more than one active token to process, typically during the initial encoding of a prompt or after a rollback.

Returns:

True if in CE mode (active_length > 1), False otherwise.

Return type:

bool

is_done

property is_done: bool

is_initial_prompt

property is_initial_prompt: bool

Returns true if the context has not been updated with tokens.

json_schema

json_schema: str | None

jump_ahead()

jump_ahead(new_token)

Updates the token array, while ensuring the new token is returned to the user.

Parameters:

new_token (int )

Return type:

None

log_probabilities

log_probabilities: int | None

log_probabilities_echo

log_probabilities_echo: bool

matcher

property matcher: xgr.GrammarMatcher | None

max_length

max_length: int

min_tokens

property min_tokens: int

The minimum number of new tokens to generate.

next_tokens

property next_tokens: ndarray

Returns the tokens between start_idx and active_idx.

Returns:

Array of tokens that have been generated but not yet processed.

Return type:

np.ndarray

outstanding_completion_tokens()

outstanding_completion_tokens()

Return the list of outstanding completion tokens and log probabilities that must be returned to the user.

Return type:

list[tuple[int, LogProbabilities | None]]

prompt

prompt: str | Sequence[int]

prompt_tokens

property prompt_tokens: ndarray

Returns the original prompt tokens.

Returns:

Array of tokens from the initial prompt.

Return type:

np.ndarray

reset()

reset()

Resets the context’s state by combining all tokens into a new prompt.

Return type:

None

rollback()

rollback(idx)

Parameters:

idx (int )

Return type:

None

sampling_params

sampling_params: SamplingParams

set_draft_offset()

set_draft_offset(idx)

Sets the draft offset index used for speculative decoding.

Parameters:

idx (int ) – The index to set as the draft offset.

Return type:

None

set_matcher()

set_matcher(matcher)

Parameters:

matcher (xgr.GrammarMatcher )

Return type:

None

set_token_indices()

set_token_indices(start_idx=None, active_idx=None, end_idx=None, committed_idx=None)

Set the token indices without manipulating the token array.

Parameters:

  • start_idx (int | None )
  • active_idx (int | None )
  • end_idx (int | None )
  • committed_idx (int | None )

Return type:

None

start_idx

property start_idx: int

status

property status: TextGenerationStatus

tokens

tokens: ndarray

unassign_from_cache()

unassign_from_cache()

Unassigns this context from its current cache slot.

This clears the cache_seq_id, allowing the cache slot to be reused by other contexts. Should be called when the context is no longer actively generating tokens.

Return type:

None

update()

update(new_token, log_probabilities=None)

Updates the next_tokens and extends existing tokens to include all generated tokens.

Parameters:

  • new_token (int )
  • log_probabilities (LogProbabilities | None )

Return type:

None

update_status()

update_status(status)

Parameters:

status (TextGenerationStatus )

Return type:

None

TextGenerationResponse

class max.pipelines.core.TextGenerationResponse(tokens, final_status)

Parameters:

  • tokens (list [ TextResponse ] )
  • final_status (TextGenerationStatus )

append_token()

append_token(token)

Parameters:

token (TextResponse )

Return type:

None

final_status

property final_status: TextGenerationStatus

is_done

property is_done: bool

tokens

property tokens: list[TextResponse]

update_status()

update_status(status)

Parameters:

status (TextGenerationStatus )

Return type:

None

TextGenerationStatus

class max.pipelines.core.TextGenerationStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

ACTIVE

ACTIVE = 'active'

END_OF_SEQUENCE

END_OF_SEQUENCE = 'end_of_sequence'

MAXIMUM_LENGTH

MAXIMUM_LENGTH = 'maximum_length'

is_done

property is_done: bool

TextResponse

class max.pipelines.core.TextResponse(next_token, log_probabilities=None)

A base class for model response, specifically for Text model variants.

Parameters:

next_token

next_token

Encoded predicted next token.

Type:

int | str

log_probabilities

log_probabilities

Log probabilities of each output token.

Type:

LogProbabilities | None

TokenGenerator

class max.pipelines.core.TokenGenerator(*args, **kwargs)

Interface for LLM token-generator models.

next_token()

next_token(batch, num_steps)

Computes the next token response for a single batch.

Parameters:

  • batch (dict [ str , TokenGeneratorContext ] ) – Batch of contexts.
  • num_steps (int ) – Number of tokens to generate.

Returns:

List of encoded responses (indexed by request ID).

Return type:

list[dict[str, TextResponse]]

release()

release(context)

Releases resources associated with this context.

Parameters:

context (TokenGeneratorContext ) – Finished context.

Return type:

None
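
A sketch of one generation step, assuming a concrete TokenGenerator and an already-built batch of contexts; only the documented calls are shown:

def generate_step(generator, batch, num_steps=8):
    # `generator` is assumed to implement TokenGenerator; `batch` maps request
    # IDs to TokenGeneratorContext objects.
    responses = generator.next_token(batch, num_steps=num_steps)
    for ctx in list(batch.values()):
        if ctx.is_done:
            generator.release(ctx)
    return responses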

TokenGeneratorRequest

class max.pipelines.core.TokenGeneratorRequest(id: 'str', index: 'int', model_name: 'str', prompt: 'Union[str, Sequence[int], None]' = None, messages: 'Optional[list[TokenGeneratorRequestMessage]]' = None, images: 'Optional[list[bytes]]' = None, tools: 'Optional[list[TokenGeneratorRequestTool]]' = None, response_format: 'Optional[TokenGeneratorResponseFormat]' = None, timestamp_ns: 'int' = 0, request_path: 'str' = '/', logprobs: 'int' = 0, echo: 'bool' = False, stop: 'Optional[Union[str, list[str]]]' = None, chat_template_options: 'Optional[dict[str, Any]]' = None, sampling_params: 'SamplingParams' = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0))

Parameters:

chat_template_options

chat_template_options: dict[str, Any] | None = None

Optional dictionary of options to pass when applying the chat template.

echo

echo: bool = False

If set to True, the response will include the original prompt along with the generated output. This can be useful for debugging or when you want to see how the input relates to the output.

id

id: str

A unique identifier for the request. This ID can be used to trace and log the request throughout its lifecycle, facilitating debugging and tracking.

images

images: list[bytes] | None = None

A list of image byte arrays that can be included as part of the request. This field is optional and may be used for multimodal inputs where images are relevant to the prompt or task.

index

index: int

The sequence order of this request within a batch. This is useful for maintaining the order of requests when processing multiple requests simultaneously, ensuring that responses can be matched back to their corresponding requests accurately.

logprobs

logprobs: int = 0

The number of top log probabilities to return for each generated token. A value of 0 means that log probabilities will not be returned. Useful for analyzing model confidence in its predictions.

messages

messages: list[TokenGeneratorRequestMessage] | None = None

A list of messages for chat-based interactions. This is used in chat completion APIs, where each message represents a turn in the conversation. If provided, the model will generate responses based on these messages.

model_name

model_name: str

The name of the model to be used for generating tokens. This should match the available models on the server and determines the behavior and capabilities of the response generation.

prompt

prompt: str | Sequence[int] | None = None

The prompt to be processed by the model. This field supports legacy completion APIs and can accept either a string or a sequence of integers representing token IDs. If not provided, the model may generate output based on the messages field.

request_path

request_path: str = '/'

The endpoint path for the request. This is typically used for routing and logging requests within the server infrastructure.

response_format

response_format: TokenGeneratorResponseFormat | None = None

Specifies the desired format for the model’s output. When set, it enables structured generation, which adheres to the json_schema provided.

sampling_params

sampling_params: SamplingParams = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

Token sampling configuration parameters for the request.

stop

stop: str | list[str] | None = None

Optional list of stop expressions (see https://platform.openai.com/docs/api-reference/chat/create#chat-create-stop).

timestamp_ns

timestamp_ns: int = 0

The time (in nanoseconds) when the request was received by the server. This can be useful for performance monitoring and logging purposes.

tools

tools: list[TokenGeneratorRequestTool] | None = None

A list of tools that can be invoked during the generation process. This allows the model to utilize external functionalities or APIs to enhance its responses.
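
For illustration, a chat-style request might be constructed as follows (the model name is a placeholder, and plain dicts stand in for TokenGeneratorRequestMessage entries):

from max.pipelines.core import SamplingParams, TokenGeneratorRequest

request = TokenGeneratorRequest(
    id="chat-1",
    index=0,
    model_name="my-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize this API in one sentence."},
    ],
    sampling_params=SamplingParams(temperature=0.7, max_new_tokens=128),
)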

TokenGeneratorRequestFunction

class max.pipelines.core.TokenGeneratorRequestFunction

description

description: str

name

name: str

parameters

parameters: dict

TokenGeneratorRequestMessage

class max.pipelines.core.TokenGeneratorRequestMessage

content

content: str | list[dict[str, Any]]

Content can be a simple string or a list of message parts of different modalities.

For example:

{
  "role": "user",
  "content": "What's the weather like in Boston today?"
}

Or:

{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What's in this image?"
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
      }
    }
  ]
}

role

role: Literal['system', 'user', 'assistant']

TokenGeneratorRequestTool

class max.pipelines.core.TokenGeneratorRequestTool

function

function: TokenGeneratorRequestFunction

type

type: str

TokenGeneratorResponseFormat

class max.pipelines.core.TokenGeneratorResponseFormat

json_schema

json_schema: dict

type

type: str

msgpack_numpy_decoder()

max.pipelines.core.msgpack_numpy_decoder(type_, copy=True)

Create a decoder function for the specified type.

Parameters:

  • type_ (Any ) – The type to decode into
  • copy (bool ) – Copy numpy arrays if true

Returns:

A function that decodes bytes into the specified type

Return type:

Callable[[bytes], Any]

msgpack_numpy_encoder()

max.pipelines.core.msgpack_numpy_encoder()

Create an encoder function that handles numpy arrays.

Returns:

A function that encodes objects into bytes

Return type:

Callable[[Any], bytes]
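
A round-trip sketch using both helpers; passing dict as the target type is an assumption made for this example:

import numpy as np
from max.pipelines.core import msgpack_numpy_decoder, msgpack_numpy_encoder

encode = msgpack_numpy_encoder()
decode = msgpack_numpy_decoder(dict)   # assumed: a plain dict target type

payload = {"embedding": np.arange(4, dtype=np.float32)}
restored = decode(encode(payload))     # round-trips the numpy array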