
Python module

core

TTSContext

class max.pipelines.core.TTSContext(audio_prompt_tokens=<factory>, buffer_speech_tokens=None, audio_buffer=None, prev_samples_beyond_offset=0, streaming=False, _speech_token_size=128, _speech_token_end_idx=0, _speech_tokens=<factory>, decoded_index=0, _block_counter=0, _arrival_time=<factory>, audio_generation_status=GenerationStatus.ACTIVE, *, max_length, tokens, request_id=<factory>, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _size=-1, _start_idx=0, _active_idx=-1, _end_idx=-1, _completion_start_idx=-1, _completion_end_idx=-1, _prompt_len=-1, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, target_endpoint=None)

A context for Text-to-Speech (TTS) model inference.

This class extends TextContext to handle speech token generation and management. It maintains buffers for audio prompt tokens and generated speech tokens, along with tracking indices for decoding progress.

Parameters:

audio_buffer

audio_buffer: ndarray[tuple[int, ...], dtype[floating[Any]]] | None

audio_generation_status

audio_generation_status: GenerationStatus

audio_prompt_tokens

audio_prompt_tokens: ndarray[tuple[int, ...], dtype[integer[Any]]]

block_counter

property block_counter: int

buffer_speech_tokens

buffer_speech_tokens: ndarray[tuple[int, ...], dtype[integer[Any]]] | None

decoded_index

decoded_index: int

is_done

property is_done: bool

next_speech_tokens()

next_speech_tokens(audio_chunk_size=None, buffer=None)

Returns a chunk of the next unseen speech tokens.

Calling this function will not update the index of the last seen token. This must be done by setting decoded_index after the chunk is processed.

Parameters:

  • audio_chunk_size (int | None) – The number of speech tokens to return.
  • buffer (int | None) – The number of previous speech tokens to pass to the audio decoder on each generation step.

Returns:

A tuple of (chunk of speech tokens, buffer).

Return type:

tuple[ndarray[tuple[int, …], dtype[integer[Any]]], int]
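
To illustrate the call pattern described above, here is a minimal, hedged sketch. It assumes an existing TTSContext named ctx and a hypothetical decode_audio_chunk helper (not part of this module); the exact bookkeeping for decoded_index, in particular how the returned buffer length is counted, is an assumption.

  # Sketch only: `ctx` is an existing TTSContext; `decode_audio_chunk` is a
  # hypothetical downstream audio decoder, not part of max.pipelines.core.
  chunk, buffer_len = ctx.next_speech_tokens(audio_chunk_size=64, buffer=16)

  if chunk.size > 0:
      decode_audio_chunk(chunk)
      # next_speech_tokens() does not advance the read position itself, so the
      # caller marks the chunk as consumed. Subtracting buffer_len assumes the
      # returned chunk starts with buffer_len previously seen tokens.
      ctx.decoded_index += chunk.size - buffer_len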

prev_samples_beyond_offset

prev_samples_beyond_offset: int

speech_tokens

property speech_tokens: ndarray[tuple[int, ...], dtype[integer[Any]]]

streaming

streaming: bool

update_speech_tokens()

update_speech_tokens(new_tokens)

Updates the stored speech tokens with new_tokens.

Parameters:

new_tokens (ndarray[tuple[int, ...], dtype[integer[Any]]])

Return type:

None

TextAndVisionContext

class max.pipelines.core.TextAndVisionContext(*, max_length, tokens, request_id=<factory>, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _size=-1, _start_idx=0, _active_idx=-1, _end_idx=-1, _completion_start_idx=-1, _completion_end_idx=-1, _prompt_len=-1, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, target_endpoint=None, vision_token_ids, images=<factory>, extra_model_args=<factory>)

A base class for model context, specifically for Vision model variants.

For example, suppose:

  • <vision_start_token_id> = 97
  • <vision_token_id> = 98
  • <vision_end_token_id> = 99

Token array:

  idx:       [  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ]
  token_ids: [ 51 52 53 54 97 98 98 98 98 99 55 56 57 58 97 98 98 98 98 99 59 60 61 62 ]
                           ^----- img0 ----^              ^----- img1 ----^
                                                  ^ start_idx=11 (image_idx=1)

Then we would have:

  • ImageMetadata(start_idx=5, end_idx=9, …)   # img0
  • ImageMetadata(start_idx=15, end_idx=19, …) # img1

These image ranges should be non-overlapping.

The image_idx is determined by the value of start_idx: it is the index of the first image that has not yet been encoded. In the diagram above, start_idx=11 implies image_idx=1 (a small illustrative sketch follows at the end of this example).

Currently, start_idx and active_idx are not allowed to fall in the middle of an image. This is verified by the _validate_state methods, which are called before and after mutating methods such as bump_token_indices.

Note that for Llama Vision, the number of token ids for the image is 1, due to that model's specific implementation.
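
To make this rule concrete, the following self-contained sketch re-implements the image_idx logic described above using a simplified stand-in for ImageMetadata (the real class has more fields). It is an illustration of the rule, not the library's implementation.

  from dataclasses import dataclass

  @dataclass
  class ImageRange:  # simplified stand-in for ImageMetadata
      start_idx: int
      end_idx: int

  def image_idx_for(start_idx: int, images: list[ImageRange]) -> int:
      """Index of the first image whose tokens have not yet been encoded."""
      for i, img in enumerate(images):
          if start_idx <= img.start_idx:
              return i
      return len(images)

  images = [ImageRange(5, 9), ImageRange(15, 19)]   # img0, img1 from the diagram
  assert image_idx_for(11, images) == 1   # start_idx=11: img0 encoded, img1 pending
  assert image_idx_for(0, images) == 0    # nothing encoded yet
  assert image_idx_for(20, images) == 2   # all images encoded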

    Parameters:

    bump_token_indices()

    bump_token_indices(start_idx=0, active_idx=0, end_idx=0)

    Update the start_idx, active_idx and end_idx without manipulating the token array.

    Parameters:

    • start_idx (int)
    • active_idx (int)
    • end_idx (int)

    Return type:

    None

    compute_image_aligned_idx()

    compute_image_aligned_idx(idx)

    Possibly aligns an index value downward if it lies in the middle of an image.

    Parameters:

    idx (int)

    Return type:

    int

    extra_model_args

    extra_model_args: dict[str, ndarray[tuple[int, ...], dtype[Any]]]

    Extra model arguments for the vision model. These are model-specific arguments.

    image_idx

    property image_idx: int

    Index of the next unencoded image in the prompt.

    images

    images: list[ImageMetadata]

    Metadata about each image in the prompt.

    needs_vision_encoding

    property needs_vision_encoding: bool

    Returns whether vision encoding is needed for this context.

    next_images

    property next_images: list[ImageMetadata]

    Returns the images that are not yet encoded.

    set_token_indices()

    set_token_indices(start_idx=None, active_idx=None, end_idx=None)

    Set the token indices without manipulating the token array.

    Parameters:

    • start_idx (int | None)
    • active_idx (int | None)
    • end_idx (int | None)

    Return type:

    None

    update()

    update(new_token, log_probabilities=None)

    Updates the next_tokens and extends existing tokens to include all generated tokens.

    Parameters:

    Return type:

    None

    vision_token_ids

    vision_token_ids: list[int]

    The value of the <vision_token_id> special token. This is a list primarily because of Pixtral, which also has an image_break_token_id.

    TextContext

    class max.pipelines.core.TextContext(*, max_length, tokens, request_id=<factory>, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _size=-1, _start_idx=0, _active_idx=-1, _end_idx=-1, _completion_start_idx=-1, _completion_end_idx=-1, _prompt_len=-1, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, target_endpoint=None)

    A base class for model context, specifically for Text model variants.

    This class manages the state and processing of text generation, including token management, caching, and generation parameters.

    Parameters:

    • max_length (int) – Maximum allowed length of the generated sequence
    • tokens (ndarray[tuple[int, ...], dtype[integer[Any]]]) – NumPy array containing the token IDs
    • request_id (RequestID) – A unique identifier for this sequence.
    • eos_token_ids (set[int]) – Set of token IDs that indicate end of sequence
    • eos_sequences (list[list[int]])
    • log_probabilities (int) – Whether to return token log probabilities
    • log_probabilities_echo (bool) – Whether to return log probabilities for prompt tokens
    • ignore_eos (bool) – Whether to ignore end of sequence tokens and continue generating
    • json_schema (str | None) – Optional JSON schema for structured output
    • sampling_params (SamplingParams) – Parameters controlling the token sampling strategy
    • model_name (str)
    • _matcher (Any | None)
    • status (GenerationStatus)
    • _size (int) – Current allocated size of token array
    • _start_idx (int) – Start index of current generation window
    • _active_idx (int) – Current position in token sequence
    • _end_idx (int) – End index of valid tokens
    • _completion_start_idx (int) – Start index of completion tokens
    • _completion_end_idx (int) – End index of completion tokens
    • _prompt_len (int) – Length of original prompt
    • _log_probabilities_data (dict[int, LogProbabilities]) – Token log probabilities data
    • _is_initial_prompt (bool) – Whether this is the initial prompt encoding
    • _draft_offset (int) – Offset for draft decoding
    • target_endpoint (str | None) – Optional target endpoint identifier for routing requests
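
    As a hedged illustration of construction: only max_length and tokens are required, and everything else falls back to the defaults in the signature above. The token values here are arbitrary placeholders.

      import numpy as np
      from max.pipelines.core import TextContext

      prompt = np.array([1, 15043, 29892, 3186], dtype=np.int64)  # placeholder token ids
      ctx = TextContext(max_length=128, tokens=prompt)

      # Per the property descriptions in this section, a freshly constructed
      # context still needs context encoding, and its active length equals
      # the prompt size during that first pass.
      print(ctx.needs_ce)       # expected: True
      print(ctx.active_length)  # expected: 4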

    active_idx

    property active_idx: int

    active_length

    property active_length: int

    Current sequence length: the number of tokens input this iteration.

    This will be the prompt size for context encoding, and simply 1 (or more) for token generation.

    all_tokens

    property all_tokens: ndarray[tuple[int, ...], dtype[integer[Any]]]

    bump_token_indices()

    bump_token_indices(start_idx=0, active_idx=0, end_idx=0)

    Update the start_idx, active_idx and end_idx without manipulating the token array.

    Parameters:

    • start_idx (int)
    • active_idx (int)
    • end_idx (int)

    Return type:

    None

    compute_num_available_steps()

    compute_num_available_steps(max_seq_len)

    Compute the max number of steps we can execute for a given context without exceeding the max_seq_len.

    Parameters:

    max_seq_len (int)

    Return type:

    int
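
    For example (a sketch, assuming an existing TextContext named ctx and an illustrative model limit), a scheduler might bound its decode loop with the returned step count. Intuitively this is roughly the model limit minus the tokens already in the context, though the exact accounting is whatever the method implements.

      MAX_SEQ_LEN = 2048  # assumed model limit, for illustration only

      # Number of further generation steps that fit without exceeding MAX_SEQ_LEN.
      num_steps = ctx.compute_num_available_steps(MAX_SEQ_LEN)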

    current_length

    property current_length: int

    The current length of the sequence, including completed and active tokens.

    end_idx

    property end_idx: int

    eos_sequences

    eos_sequences: list[list[int]]

    eos_token_ids

    eos_token_ids: set[int]

    generated_tokens

    property generated_tokens: ndarray[tuple[int, ...], dtype[integer[Any]]]

    Returns all tokens that have been generated after the prompt.

    Returns:

    Array of generated tokens from prompt_len to end_idx.

    Return type:

    np.ndarray

    get_min_token_logit_mask()

    get_min_token_logit_mask(num_steps)

    Returns a set of indices for the tokens in the output that should be masked.

    This is primarily used for the min_tokens setting, where we mask eos tokens in the logits to avoid generating them before we reach min_tokens.

    Parameters:

    num_steps (int)

    Returns:

    A set of indices for the tokens in the output that should be masked.

    Return type:

    list[ndarray[tuple[int, …], dtype[int32]]]

    ignore_eos

    ignore_eos: bool

    is_done

    property is_done: bool

    is_initial_prompt

    property is_initial_prompt: bool

    Returns true if the context has not been updated with tokens.

    json_schema

    json_schema: str | None

    jump_ahead()

    jump_ahead(new_token)

    Updates the token array, while ensuring the new token is returned to the user.

    Parameters:

    new_token (int)

    Return type:

    None

    last_generated_token

    property last_generated_token: int

    Returns the most recently generated token. If no tokens have been generated, raises an error.

    Returns:

    The most recently generated token.

    Return type:

    int

    log_probabilities

    log_probabilities: int

    log_probabilities_echo

    log_probabilities_echo: bool

    matcher

    property matcher: LLMatcher | None

    max_length

    max_length: int

    min_tokens

    property min_tokens: int

    The minimum number of new tokens to generate.

    model_name

    model_name: str

    needs_ce

    property needs_ce: bool

    Returns whether this context needs context encoding (CE).

    CE mode indicates that the context has additional prompt tokens to encode.

    Returns:

    True if the context needs CE, False otherwise.

    Return type:

    bool

    next_tokens

    property next_tokens: ndarray[tuple[int, ...], dtype[integer[Any]]]

    Returns the tokens between start_idx and active_idx.

    Returns:

    Array of tokens that have been generated but not yet processed.

    Return type:

    np.ndarray

    prompt_tokens

    property prompt_tokens: ndarray[tuple[int, ...], dtype[integer[Any]]]

    Returns the original prompt tokens.

    Returns:

    Array of tokens from the initial prompt.

    Return type:

    np.ndarray

    request_id

    request_id: RequestID

    reset()

    reset()

    Resets the context’s state by combining all tokens into a new prompt.

    Return type:

    None

    sampling_params

    sampling_params: SamplingParams

    set_matcher()

    set_matcher(matcher)

    Parameters:

    matcher (LLMatcher)

    Return type:

    None

    set_token_indices()

    set_token_indices(start_idx=None, active_idx=None, end_idx=None)

    Set the token indices without manipulating the token array.

    Parameters:

    • start_idx (int | None)
    • active_idx (int | None)
    • end_idx (int | None)

    Return type:

    None

    start_idx

    property start_idx: int

    status

    status: GenerationStatus

    target_endpoint

    target_endpoint: str | None

    to_generation_output()

    to_generation_output()

    Get completion tokens that are ready to be returned to the user.

    This method retrieves tokens that have been generated but not yet delivered to the user, along with their associated log probability data.

    Returns:

    The completion tokens and their associated log probabilities, if available.

    Return type:

    TextGenerationOutput

    tokens

    tokens: ndarray[tuple[int, ...], dtype[integer[Any]]]

    update()

    update(new_token, log_probabilities=None)

    Updates the next_tokens and extends existing tokens to include all generated tokens.

    Parameters:

    Return type:

    None
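
    Putting the pieces together, a minimal decode-loop sketch might look like the following. Here ctx is an existing TextContext holding an encoded prompt, and sample_next_token is a hypothetical stand-in for a model forward pass plus sampling; it is not part of this module.

      while not ctx.is_done:
          # next_tokens holds the tokens between start_idx and active_idx,
          # i.e. the input for this step (the full prompt on the first pass).
          token = sample_next_token(ctx.next_tokens)  # hypothetical helper
          # update() appends the sampled token and advances the internal
          # indices; once an EOS condition or max_length is reached, is_done
          # becomes true and the loop exits.
          ctx.update(token)

      # Completion tokens (and log probabilities, if requested) for the user.
      output = ctx.to_generation_output()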

    validate_aspect_ratio_args()

    max.pipelines.core.validate_aspect_ratio_args(context)

    Validates that required aspect ratio arguments are present for vision input.

    Parameters:

    context (TextContext | TextAndVisionContext) – The context to validate.

    Raises:

    InputError – If required aspect ratio arguments are missing.

    Return type:

    None

    validate_image_grid_thw_args()

    max.pipelines.core.validate_image_grid_thw_args(context)

    Validates that image_grid_thw is present when vision encoding is needed.

    Parameters:

    context (TextContext | TextAndVisionContext) – The context to validate.

    Raises:

    InputError – If image_grid_thw is missing from extra_model_args when vision encoding is needed.

    Return type:

    None

    validate_image_shape_5d()

    max.pipelines.core.validate_image_shape_5d(context)

    Validates that images have the expected 5-dimensional shape.

    Parameters:

    context (TextContext | TextAndVisionContext) – The context to validate.

    Raises:

    InputError – If the image shape is not 5-dimensional.

    Return type:

    None

    validate_initial_prompt_has_image()

    max.pipelines.core.validate_initial_prompt_has_image(context)

    Validates that initial prompts contain an image for vision models.

    Parameters:

    context (TextContext | TextAndVisionContext) – The context to validate.

    Raises:

    InputError – If the initial prompt doesn’t contain an image.

    Return type:

    None

    validate_only_one_image()

    max.pipelines.core.validate_only_one_image(context)

    Validates that at most one image is provided in the context.

    Parameters:

    context (TextContext | TextAndVisionContext) – The context to validate.

    Raises:

    InputError – If more than one image is provided.

    Return type:

    None

    validate_requires_vision_context()

    max.pipelines.core.validate_requires_vision_context(context)

    Validates that the context is a TextAndVisionContext.

    Parameters:

    context (TextContext | TextAndVisionContext) – The context to validate.

    Raises:

    InputError – If the context is not a TextAndVisionContext.

    Return type:

    None

    validate_vision_position_ids()

    max.pipelines.core.validate_vision_position_ids(context)

    Validates that vision_position_ids is present when vision encoding is needed.

    Parameters:

    context (TextContext | TextAndVisionContext) – The context to validate.

    Raises:

    InputError – If vision_position_ids is missing from extra_model_args when vision encoding is needed.

    Return type:

    None
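
    These validators are plain functions that raise InputError when their precondition fails, so a pipeline can call the relevant ones in sequence before running the model. A hedged sketch (the grouping and order shown here are illustrative, not prescribed):

      from max.pipelines.core import (
          validate_initial_prompt_has_image,
          validate_only_one_image,
          validate_requires_vision_context,
      )

      def check_single_image_request(context) -> None:
          # Each call raises InputError if its check fails.
          validate_requires_vision_context(context)
          validate_initial_prompt_has_image(context)
          validate_only_one_image(context)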
