Python module

core

AudioGenerationRequest

class max.pipelines.core.AudioGenerationRequest(id: 'str', index: 'int', model: 'str', input: 'Optional[str]' = None, audio_prompt_tokens: 'list[int]' = <factory>, audio_prompt_transcription: 'str' = '', sampling_params: 'SamplingParams' = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0), _assistant_message_override: 'str | None' = None, prompt: 'Optional[list[int] | str]' = None)

Parameters:

audio_prompt_tokens

audio_prompt_tokens: list[int]

The prompt speech IDs to use for audio generation.

audio_prompt_transcription

audio_prompt_transcription: str = ''

The audio prompt transcription to use for audio generation.

id

id: str

A unique identifier for the request. This ID can be used to trace and log the request throughout its lifecycle, facilitating debugging and tracking.

index

index: int

The sequence order of this request within a batch. This is useful for maintaining the order of requests when processing multiple requests simultaneously, ensuring that responses can be matched back to their corresponding requests accurately.

input

input: str | None = None

The text to generate audio for. The maximum length is 4096 characters.

model

model: str

The name of the model to be used for generating audio chunks. This should match the available models on the server and determines the behavior and capabilities of the response generation.

prompt

prompt: list[int] | str | None = None

Optionally provide a preprocessed list of token IDs or a prompt string to pass directly to the model as input. This bypasses the automatic construction of TokenGeneratorRequestMessages from the input, audio_prompt_tokens, and audio_prompt_transcription fields.

sampling_params

sampling_params: SamplingParams = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

Request sampling configuration options.
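
For illustration, a minimal sketch of constructing a request in Python (the model name and field values below are placeholders, not library defaults):

import numpy as np  # not required here; shown only if you extend the sketch
from max.pipelines.core import AudioGenerationRequest, SamplingParams

# Hypothetical single TTS request; "my-tts-model" is a placeholder model name.
request = AudioGenerationRequest(
    id="req-0",                 # unique identifier used for tracing
    index=0,                    # position of this request within its batch
    model="my-tts-model",
    input="Hello from the audio pipeline.",
    sampling_params=SamplingParams(top_k=1, max_new_tokens=256),
)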

AudioGenerationResponse

class max.pipelines.core.AudioGenerationResponse(final_status, audio=None)

Parameters:

  • final_status (TextGenerationStatus )
  • audio (ndarray | None )

audio_data

property audio_data: ndarray

final_status

property final_status: TextGenerationStatus

has_audio_data

property has_audio_data: bool

is_done

property is_done: bool

AudioGenerator

class max.pipelines.core.AudioGenerator(*args, **kwargs)

Interface for audio generation models.

decoder_sample_rate

property decoder_sample_rate: int

The sample rate of the decoder.

next_chunk()

next_chunk(batch, num_tokens)

Computes the next audio chunk for a single batch.

The new speech tokens are saved to the context. The most recently generated audio is returned through the AudioGenerationResponse.

Parameters:

  • batch (dict [ str , AudioGeneratorContext ] ) – Batch of contexts.
  • num_tokens (int ) – Number of speech tokens to generate.

Returns:

Dictionary mapping request IDs to audio generation responses.

Return type:

dict[str, AudioGenerationResponse]

release()

release(context)

Releases resources associated with this context.

Parameters:

context (AudioGeneratorContext ) – Finished context.

Return type:

None
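
As a sketch of how an AudioGenerator is typically driven (the concrete generator and the batch of AudioGeneratorContext objects are assumed to exist already; this is not a complete pipeline):

def run_audio_batch(generator, contexts, tokens_per_step=64):
    # `generator` is assumed to implement AudioGenerator; `contexts` maps
    # request IDs to AudioGeneratorContext objects.
    finished = {}
    while contexts:
        responses = generator.next_chunk(contexts, num_tokens=tokens_per_step)
        for req_id, response in responses.items():
            if response.is_done:
                finished[req_id] = response
                generator.release(contexts.pop(req_id))
    return finished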

AudioGeneratorOutput

class max.pipelines.core.AudioGeneratorOutput(audio_data: 'torch.Tensor', metadata: 'dict[str, Any]', is_done: 'bool')

Parameters:

  • audio_data (torch.Tensor )
  • metadata (dict [ str , Any ] )
  • is_done (bool )

audio_data

audio_data: torch.Tensor

is_done

is_done: bool

metadata

metadata: dict[str, Any]

EmbeddingsGenerator

class max.pipelines.core.EmbeddingsGenerator(*args, **kwargs)

Interface for LLM embeddings-generator models.

encode()

encode(batch)

Computes embeddings for a batch of inputs.

Parameters:

batch (dict [ str , EmbeddingsGeneratorContext ] ) – Batch of contexts to generate embeddings for.

Returns:

Dictionary mapping request IDs to their corresponding embeddings. Each embedding is typically a numpy array or tensor of floating point values.

Return type:

dict[str, Any]
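
A minimal usage sketch, assuming a concrete EmbeddingsGenerator implementation and an already-prepared batch of contexts:

def embed_batch(embedder, contexts):
    # `embedder` is assumed to implement EmbeddingsGenerator; `contexts` maps
    # request IDs to EmbeddingsGeneratorContext objects.
    embeddings = embedder.encode(contexts)
    for request_id, vector in embeddings.items():
        print(request_id, getattr(vector, "shape", None))
    return embeddings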

EmbeddingsResponse

class max.pipelines.core.EmbeddingsResponse(embeddings)

Container for the response from embeddings pipeline.

Parameters:

embeddings (ndarray )

embeddings

embeddings: ndarray

InputContext

class max.pipelines.core.InputContext(*args, **kwargs)

A base class for model contexts, representing model inputs for TokenGenerators.

Token array layout:

.                      +---------- full prompt ----------+   CHUNK_SIZE*N v
. +--------------------+---------------+-----------------+----------------+
. |      completed     |  next_tokens  |                 |  preallocated  |
. +--------------------+---------------+-----------------+----------------+
.            start_idx ^    active_idx ^         end_idx ^
  • completed: The tokens that have already been processed and encoded.
  • next_tokens: The tokens that will be processed in the next iteration. This may be a subset of the full prompt due to chunked prefill.
  • preallocated: The token slots that have been preallocated. The token array resizes to multiples of CHUNK_SIZE to accommodate the new tokens.
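
The index semantics can be illustrated with plain NumPy; this is an illustration of the layout above, not code from the library:

import numpy as np

# A 5-token prompt stored in a buffer preallocated to one CHUNK_SIZE multiple.
CHUNK_SIZE = 128
prompt = np.array([101, 2023, 2003, 1037, 3231], dtype=np.int64)
tokens = np.zeros(CHUNK_SIZE, dtype=np.int64)
tokens[: len(prompt)] = prompt

start_idx, active_idx, end_idx = 0, len(prompt), len(prompt)
next_tokens = tokens[start_idx:active_idx]   # fed to the model this iteration
preallocated = tokens[end_idx:]              # empty slots for generated tokens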

active_idx

property active_idx: int

active_length

property active_length: int

Current sequence length: the number of tokens input this iteration.

This will be the prompt size for context encoding, and simply 1 for token generation.

all_tokens

property all_tokens: ndarray

All prompt and generated tokens in the context.

assign_to_cache()

assign_to_cache(cache_seq_id)

Assigns the context to a cache slot.

Parameters:

cache_seq_id (int )

Return type:

None

bump_token_indices()

bump_token_indices(start_idx=0, active_idx=0, end_idx=0, committed_idx=0)

Update the start_idx, active_idx, end_idx, and committed_idx without manipulating the token array.

Parameters:

  • start_idx (int )
  • active_idx (int )
  • end_idx (int )
  • committed_idx (int )

Return type:

None

cache_seq_id

property cache_seq_id: int

Returns the cache slot assigned to the context, raising an error if not assigned.

committed_idx

property committed_idx: int

compute_num_available_steps()

compute_num_available_steps(max_seq_len)

Compute the max number of steps we can execute for a given context without exceeding the max_seq_len.

Parameters:

max_seq_len (int )

Return type:

int

current_length

property current_length: int

The current length of the sequence, including completed and active tokens.

end_idx

property end_idx: int

eos_token_ids

property eos_token_ids: set[int]

generated_tokens

property generated_tokens: ndarray

All generated tokens in the context.

get_min_token_logit_mask()

get_min_token_logit_mask(num_steps)

Returns a set of indices for the tokens in the output that should be masked.

This is primarily used for the min_tokens setting, where we mask eos tokens in the logits to avoid generating them before we reach min_tokens.

Parameters:

num_steps (int )

Return type:

list[ndarray[Any, dtype[int32]]]

is_assigned_to_cache

property is_assigned_to_cache: bool

Returns True if input is assigned to a cache slot, False otherwise.

is_ce

property is_ce: bool

Returns True if the context is a context encoding context, False otherwise.

is_done

property is_done: bool

is_initial_prompt

property is_initial_prompt: bool

Returns true if the context has not been updated with tokens.

json_schema

property json_schema: str | None

A json schema to use during constrained decoding.

jump_ahead()

jump_ahead(new_token)

Updates the token array, while ensuring the new token is returned to the user.

Parameters:

new_token (int )

Return type:

None

log_probabilities

property log_probabilities: int

When > 0, returns the log probabilities for the top N tokens for each token in the sequence.

log_probabilities_echo

property log_probabilities_echo: bool

When True, the input tokens are added to the returned logprobs.

matcher

property matcher: xgr.GrammarMatcher | None

An optional xgr Grammar Matcher provided when using structured output.

max_length

property max_length: int | None

The maximum length of this sequence.

min_tokens

property min_tokens: int

The minimum number of new tokens to generate.

next_tokens

property next_tokens: ndarray

The next prompt tokens to be input during this iteration.

This should be a 1D array of tokens of length active_length.

outstanding_completion_tokens()

outstanding_completion_tokens()

Return the list of outstanding completion tokens and log probabilities that must be returned to the user.

Return type:

list[tuple[int, LogProbabilities | None]]

prompt_tokens

property prompt_tokens: ndarray

Prompt tokens in the context.

reset()

reset()

Resets the context’s state by combining all tokens into a new prompt. This method is used when a request is evicted, meaning that the context needs to be re-encoded in the following CE iteration.

Return type:

None

rollback()

rollback(idx)

Rolls back the context by removing the last idx tokens.

Parameters:

idx (int )

Return type:

None

sampling_params

property sampling_params: SamplingParams

Returns the per-request sampling configuration.

set_draft_offset()

set_draft_offset(idx)

Parameters:

idx (int )

Return type:

None

set_matcher()

set_matcher(matcher)

Set a grammar matcher for use during constrained decoding.

Parameters:

matcher (xgr.GrammarMatcher )

Return type:

None

set_token_indices()

set_token_indices(start_idx=None, active_idx=None, end_idx=None, committed_idx=None)

Set the token indices without manipulating the token array.

Parameters:

  • start_idx (int | None )
  • active_idx (int | None )
  • end_idx (int | None )
  • committed_idx (int | None )

Return type:

None

start_idx

property start_idx: int

status

property status: TextGenerationStatus

tokens

property tokens: ndarray

All tokens (including padded tokens) in the context. In most scenarios, use all_tokens to get the active full token array.

unassign_from_cache()

unassign_from_cache()

Unassigns the context from a cache slot.

Return type:

None

update()

update(new_token, log_probabilities=None)

Updates the next_tokens and extends existing tokens to include all generated tokens.

Parameters:

  • new_token (int )
  • log_probabilities (LogProbabilities | None )

Return type:

None

update_status()

update_status(status)

Parameters:

status (TextGenerationStatus )

Return type:

None

LogProbabilities

class max.pipelines.core.LogProbabilities(token_log_probabilities, top_log_probabilities)

Log probabilities for an individual output token.

Parameters:

token_log_probabilities

token_log_probabilities

Log probabilities of each output token.

Type:

list[float]

top_log_probabilities

top_log_probabilities

Top tokens and their corresponding log probabilities.

Type:

list[dict[int, float]]
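
For example, a single generated step might be represented as follows (the values are illustrative):

from max.pipelines.core import LogProbabilities

# One generated token (id 42) with its log probability, plus the top-2
# candidate tokens considered at that step.
lp = LogProbabilities(
    token_log_probabilities=[-0.105],
    top_log_probabilities=[{42: -0.105, 7: -2.31}],
)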

PipelineTask

class max.pipelines.core.PipelineTask(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

AUDIO_GENERATION

AUDIO_GENERATION = 'audio_generation'

EMBEDDINGS_GENERATION

EMBEDDINGS_GENERATION = 'embeddings_generation'

SPEECH_TOKEN_GENERATION

SPEECH_TOKEN_GENERATION = 'speech_token_generation'

TEXT_GENERATION

TEXT_GENERATION = 'text_generation'

PipelineTokenizer

class max.pipelines.core.PipelineTokenizer(*args, **kwargs)

Interface for LLM tokenizers.

decode()

async decode(context, encoded, **kwargs)

Decodes response tokens to text.

Parameters:

  • context (TokenGeneratorContext ) – Current generation context.
  • encoded (TokenizerEncoded ) – Encoded response tokens.

Returns:

Un-encoded response text.

Return type:

str

encode()

async encode(prompt, add_special_tokens)

Encodes text prompts as tokens.

Parameters:

  • prompt (str ) – Un-encoded prompt text.
  • add_special_tokens (bool )

Raises:

ValueError – If the prompt exceeds the configured maximum length.

Return type:

TokenizerEncoded

eos

property eos: int

The end of sequence token for this tokenizer.

expects_content_wrapping

property expects_content_wrapping: bool

If true, this tokenizer expects messages to have a content property.

Text messages are formatted as:

{ "type": "text", "content": "text content" }
{ "type": "text", "content": "text content" }

instead of the OpenAI spec:

{ "type": "text", "text": "text content" }
{ "type": "text", "text": "text content" }

NOTE: Multimodal messages omit the content property. Both image_urls and image content parts are converted to:

{ "type": "image" }
{ "type": "image" }

Their content is provided as byte arrays through the top-level property on the request object, i.e., PipelineTokenizerRequest.images.

new_context()

async new_context(request)

Creates a new context from a request object. This is sent to the worker process once and then cached locally.

Parameters:

request (PipelineTokenizerRequest ) – Incoming request.

Returns:

Initialized context.

Return type:

TokenGeneratorContext
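
A minimal async usage sketch, assuming a concrete PipelineTokenizer implementation; the tokenizer and request objects are placeholders:

import asyncio

async def tokenize_round_trip(tokenizer, request, prompt_text):
    # `tokenizer` is assumed to implement PipelineTokenizer and `request` to be
    # a PipelineTokenizerRequest.
    context = await tokenizer.new_context(request)
    encoded = await tokenizer.encode(prompt_text, add_special_tokens=True)
    return await tokenizer.decode(context, encoded)

# asyncio.run(tokenize_round_trip(tokenizer, request, "Hello, world"))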

SamplingParams

class max.pipelines.core.SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

Request-specific sampling parameters that are only known at run time.

Parameters:

detokenize

detokenize: bool = True

Whether to detokenize the output tokens into text.

frequency_penalty

frequency_penalty: float = 0.0

The frequency penalty to apply to the model’s output. A positive value will penalize new tokens based on their frequency in the generated text: tokens will receive a penalty proportional to the count of appearances.

ignore_eos

ignore_eos: bool = False

If True, the response will ignore the EOS token, and continue to generate until the max tokens or a stop string is hit.

max_new_tokens

max_new_tokens: int | None = None

The maximum number of new tokens to generate in the response. If not set, the model may generate tokens until it reaches its internal limits or based on other stopping criteria.

min_new_tokens

min_new_tokens: int = 0

The minimum number of tokens to generate in the response.

min_p

min_p: float = 0.0

Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.

presence_penalty

presence_penalty: float = 0.0

The presence penalty to apply to the model’s output. A positive value will penalize new tokens that have already appeared in the generated text at least once by applying a constant penalty.

repetition_penalty

repetition_penalty: float = 1.0

The repetition penalty to apply to the model’s output. Values > 1 will penalize new tokens that have already appeared in the generated text at least once by dividing the logits by the repetition penalty.

seed

seed: int = 0

The seed to use for the random number generator.

stop

stop: list[str] | None = None

A list of detokenized sequences that can be used as stop criteria when generating a new sequence.

stop_token_ids

stop_token_ids: list[int] | None = None

A list of token ids that are used as stopping criteria when generating a new sequence.

temperature

temperature: float = 1

Controls the randomness of the model’s output; higher values produce more diverse responses.

top_k

top_k: int = 1

Limits the sampling to the K most probable tokens. This defaults to 1, which enables greedy sampling.

top_p

top_p: float = 1

Only use the tokens whose cumulative probability is within the top_p threshold. This applies to the top_k tokens.
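
For example, the defaults produce greedy decoding, while a sampled configuration might look like the following (the specific values are illustrative):

from max.pipelines.core import SamplingParams

# The documented defaults (top_k=1) already give greedy decoding.
greedy = SamplingParams()

# A more exploratory configuration.
creative = SamplingParams(
    top_k=40,
    top_p=0.95,
    temperature=0.8,
    max_new_tokens=512,
    stop=["\n\n"],
    seed=42,
)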

TTSContext

class max.pipelines.core.TTSContext(audio_prompt_tokens=<factory>, prev_samples_beyond_offset=0, _speech_token_size=128, _speech_token_end_idx=0, _speech_tokens=<factory>, _decoded_index=0, _block_counter=0, _arrival_time=<factory>, _audio_generation_status=TextGenerationStatus.ACTIVE, *, prompt, max_length, tokens, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=None, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, _matcher=None, _status=TextGenerationStatus.ACTIVE, _cache_seq_id=None, _size=-1, _start_idx=0, _active_idx=-1, _end_idx=-1, _completion_start_idx=-1, _completion_end_idx=-1, _prompt_len=-1, _committed_idx=0, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0)

A context for Text-to-Speech (TTS) model inference.

This class extends TextContext to handle speech token generation and management. It maintains buffers for audio prompt tokens and generated speech tokens, along with tracking indices for decoding progress.

Parameters:

  • audio_prompt_tokens (ndarray ) – Array of input audio prompt tokens used for voice cloning
  • prev_samples_beyond_offset (int )
  • _speech_token_size (int ) – Size of the speech token buffer, defaults to SPEECH_TOKEN_audio_chunk_size
  • _speech_token_end_idx (int ) – Index marking the end of valid speech tokens
  • _speech_tokens (ndarray ) – Buffer containing the generated speech tokens
  • _decoded_index (int ) – Index tracking how many tokens have been decoded to audio
  • _block_counter (int ) – Counter tracking number of speech token blocks generated
  • _arrival_time (float )
  • _audio_generation_status (TextGenerationStatus )
  • prompt (str | Sequence [ int ] )
  • max_length (int )
  • tokens (ndarray )
  • eos_token_ids (set [ int ] )
  • eos_sequences (list [ list [ int ] ] )
  • log_probabilities (int | None )
  • log_probabilities_echo (bool )
  • ignore_eos (bool )
  • json_schema (str | None )
  • sampling_params (SamplingParams )
  • _matcher (Any | None )
  • _status (TextGenerationStatus )
  • _cache_seq_id (int | None )
  • _size (int )
  • _start_idx (int )
  • _active_idx (int )
  • _end_idx (int )
  • _completion_start_idx (int )
  • _completion_end_idx (int )
  • _prompt_len (int )
  • _committed_idx (int )
  • _log_probabilities_data (dict [ int , LogProbabilities ] )
  • _is_initial_prompt (bool )
  • _draft_offset (int )

audio_generation_status

property audio_generation_status: TextGenerationStatus

audio_prompt_tokens

audio_prompt_tokens: ndarray

block_counter

property block_counter: int

decoded_index

property decoded_index: int

has_undecoded_speech_tokens()

has_undecoded_speech_tokens(exclude_last_n=0)

Checks whether there are undecoded speech tokens.

Parameters:

exclude_last_n (int ) – Number of tokens to exclude from the end when checking for undecoded tokens. For example, if set to 1, the last token will not be considered when checking for undecoded tokens.

Returns:

True if there are undecoded speech tokens (excluding the last n tokens), False otherwise.

Return type:

bool

is_done

property is_done: bool

next_speech_tokens()

next_speech_tokens(audio_chunk_size=None, buffer=None)

Returns a chunk of the next unseen speech tokens.

Calling this function will update the index of the last seen token.

Parameters:

  • audio_chunk_size (int | None ) – The number of speech tokens to return.
  • buffer (int | None ) – The number of previous speech tokens to pass to the audio decoder on each generation step.

Returns:

A tuple of (chunk of speech tokens, buffer).

Return type:

tuple[ndarray, int]
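
A small sketch of fetching a chunk for decoding, assuming a TTSContext that a speech-token generator has been filling; the chunk size is illustrative:

def fetch_speech_chunk(ctx, chunk_size=50):
    # `ctx` is assumed to be a TTSContext with undecoded speech tokens.
    if ctx.has_undecoded_speech_tokens():
        chunk, buffer = ctx.next_speech_tokens(audio_chunk_size=chunk_size)
        return chunk, buffer   # hand these to the audio decoder
    return None, 0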

prev_samples_beyond_offset

prev_samples_beyond_offset: int

set_decoded_index()

set_decoded_index(idx)

Parameters:

idx (int )

Return type:

None

speech_token_status

property speech_token_status: TextGenerationStatus

Returns the status of the speech token generation.

speech_tokens

property speech_tokens: ndarray

status

property status: TextGenerationStatus

update_audio_generation_status()

update_audio_generation_status(status)

Parameters:

status (TextGenerationStatus )

Return type:

None

update_speech_token_status()

update_speech_token_status(status)

Parameters:

status (TextGenerationStatus )

Return type:

None

update_speech_tokens()

update_speech_tokens(new_tokens)

Updates the stored speech tokens with new_tokens.

Parameters:

new_tokens (ndarray )

Return type:

None

update_status()

update_status(status)

Parameters:

status (TextGenerationStatus )

Return type:

None

TextAndVisionContext

class max.pipelines.core.TextAndVisionContext(*, prompt, max_length, tokens, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=None, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, _matcher=None, _status=TextGenerationStatus.ACTIVE, _cache_seq_id=None, _size=-1, _start_idx=0, _active_idx=-1, _end_idx=-1, _completion_start_idx=-1, _completion_end_idx=-1, _prompt_len=-1, _committed_idx=0, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, pixel_values=<factory>, extra_model_args=<factory>)

A base class for model context, specifically for Vision model variants.

Parameters:

extra_model_args

extra_model_args: dict[str, ndarray]

pixel_values

pixel_values: tuple[ndarray, ...]

update()

update(new_token, log_probabilities=None)

Updates the next_tokens and extends existing tokens to include all generated tokens.

Parameters:

  • new_token (int )
  • log_probabilities (LogProbabilities | None )

Return type:

None

TextContext

class max.pipelines.core.TextContext(*, prompt, max_length, tokens, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=None, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, _matcher=None, _status=TextGenerationStatus.ACTIVE, _cache_seq_id=None, _size=-1, _start_idx=0, _active_idx=-1, _end_idx=-1, _completion_start_idx=-1, _completion_end_idx=-1, _prompt_len=-1, _committed_idx=0, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0)

A base class for model context, specifically for Text model variants.

This class manages the state and processing of text generation, including token management, caching, and generation parameters.

Parameters:

  • prompt (str | Sequence [ int ] ) – The input prompt as either a string or sequence of token IDs
  • max_length (int ) – Maximum allowed length of the generated sequence
  • tokens (ndarray ) – NumPy array containing the token IDs
  • eos_token_ids (set [ int ] ) – Set of token IDs that indicate end of sequence
  • eos_sequences (list [ list [ int ] ] )
  • log_probabilities (int | None ) – Number of top tokens to return log probabilities for, or None to disable
  • log_probabilities_echo (bool ) – Whether to return log probabilities for prompt tokens
  • ignore_eos (bool ) – Whether to ignore end of sequence tokens and continue generating
  • json_schema (str | None ) – Optional JSON schema for structured output
  • sampling_params (SamplingParams ) – Parameters controlling the token sampling strategy
  • _matcher (Any | None )
  • _status (TextGenerationStatus ) – Current generation status (active, finished, etc)
  • _cache_seq_id (int | None ) – ID of KV cache slot assigned to this context
  • _size (int ) – Current allocated size of token array
  • _start_idx (int ) – Start index of current generation window
  • _active_idx (int ) – Current position in token sequence
  • _end_idx (int ) – End index of valid tokens
  • _completion_start_idx (int ) – Start index of completion tokens
  • _completion_end_idx (int ) – End index of completion tokens
  • _prompt_len (int ) – Length of original prompt
  • _committed_idx (int ) – Index up to which tokens are committed
  • _log_probabilities_data (dict [ int , LogProbabilities ] ) – Token log probabilities data
  • _is_initial_prompt (bool ) – Whether this is the initial prompt encoding
  • _draft_offset (int ) – Offset for draft decoding
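
A minimal construction sketch using only the documented public fields; the token IDs and EOS id are illustrative and not tied to any particular tokenizer:

import numpy as np
from max.pipelines.core import SamplingParams, TextContext

prompt_ids = [101, 7592, 2088, 102]
context = TextContext(
    prompt=prompt_ids,
    max_length=512,
    tokens=np.array(prompt_ids, dtype=np.int64),
    eos_token_ids={102},
    sampling_params=SamplingParams(top_k=1),
)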

active_idx

property active_idx: int

active_length

property active_length: int

Current sequence length: the number of tokens input this iteration.

This will be the prompt size for context encoding, and 1 (or more) for token generation.

all_tokens

property all_tokens: ndarray

assign_to_cache()

assign_to_cache(cache_seq_id)

Assigns this context to a cache slot.

The cache slot is used to store and retrieve KV-cache entries for this context during token generation.

Parameters:

cache_seq_id (int ) – The ID of the cache slot to assign this context to.

Raises:

RuntimeError – If this context is already assigned to a cache slot.

Return type:

None

bump_token_indices()

bump_token_indices(start_idx=0, active_idx=0, end_idx=0, committed_idx=0)

Update the start_idx, active_idx, end_idx, and committed_idx without manipulating the token array.

Parameters:

  • start_idx (int )
  • active_idx (int )
  • end_idx (int )
  • committed_idx (int )

Return type:

None

cache_seq_id

property cache_seq_id: int

Gets the ID of the cache slot this context is assigned to.

The cache_seq_id is used to look up KV-cache entries for this context during token generation.

Returns:

The cache slot ID.

Return type:

int

Raises:

ValueError – If this context is not currently assigned to a cache slot.

committed_idx

property committed_idx: int

compute_num_available_steps()

compute_num_available_steps(max_seq_len)

Compute the max number of steps we can execute for a given context without exceeding the max_seq_len.

Parameters:

max_seq_len (int )

Return type:

int

current_length

property current_length: int

The current length of the sequence, including completed and active tokens.

end_idx

property end_idx: int

eos_sequences

eos_sequences: list[list[int]]

eos_token_ids

eos_token_ids: set[int]

generated_tokens

property generated_tokens: ndarray

Returns all tokens that have been generated after the prompt.

Returns:

Array of generated tokens from prompt_len to end_idx.

Return type:

np.ndarray

get_min_token_logit_mask()

get_min_token_logit_mask(num_steps)

Returns a set of indices for the tokens in the output that should be masked.

This is primarily used for the min_tokens setting, where we mask eos tokens in the logits to avoid generating them before we reach min_tokens.

Returns:

A set of indices for the tokens in the output that should be masked.

Parameters:

num_steps (int )

Return type:

list[ndarray[Any, dtype[int32]]]

ignore_eos

ignore_eos: bool

is_assigned_to_cache

property is_assigned_to_cache: bool

Returns whether this context is currently assigned to a cache slot.

The cache assignment status indicates whether this context can currently access KV-cache entries for token generation.

Returns:

True if assigned to a cache slot, False otherwise.

Return type:

bool

is_ce

property is_ce: bool

Returns whether this context is in context encoding (CE) mode.

CE mode indicates that the context has more than one active token to process, typically during the initial encoding of a prompt or after a rollback.

Returns:

True if in CE mode (active_length > 1), False otherwise.

Return type:

bool

is_done

property is_done: bool

is_initial_prompt

property is_initial_prompt: bool

Returns true if the context has not been updated with tokens.

json_schema

json_schema: str | None

jump_ahead()

jump_ahead(new_token)

Updates the token array, while ensuring the new token is returned to the user.

Parameters:

new_token (int )

Return type:

None

log_probabilities

log_probabilities: int | None

log_probabilities_echo

log_probabilities_echo: bool

matcher

property matcher: xgr.GrammarMatcher | None

max_length

max_length: int

min_tokens

property min_tokens: int

The minimum number of new tokens to generate.

next_tokens

property next_tokens: ndarray

Returns the tokens between start_idx and active_idx.

Returns:

Array of tokens that have been generated but not yet processed.

Return type:

np.ndarray

outstanding_completion_tokens()

outstanding_completion_tokens()

Return the list of outstanding completion tokens and log probabilities that must be returned to the user.

Return type:

list[tuple[int, LogProbabilities | None]]

prompt

prompt: str | Sequence[int]

prompt_tokens

property prompt_tokens: ndarray

Returns the original prompt tokens.

Returns:

Array of tokens from the initial prompt.

Return type:

np.ndarray

reset()

reset()

Resets the context’s state by combining all tokens into a new prompt.

Return type:

None

rollback()

rollback(idx)

Parameters:

idx (int )

Return type:

None

sampling_params

sampling_params: SamplingParams

set_draft_offset()

set_draft_offset(idx)

Sets the draft offset index used for speculative decoding.

Parameters:

idx (int ) – The index to set as the draft offset.

Return type:

None

set_matcher()

set_matcher(matcher)

Parameters:

matcher (xgr.GrammarMatcher )

Return type:

None

set_token_indices()

set_token_indices(start_idx=None, active_idx=None, end_idx=None, committed_idx=None)

Set the token indices without manipulating the token array.

Parameters:

  • start_idx (int | None )
  • active_idx (int | None )
  • end_idx (int | None )
  • committed_idx (int | None )

Return type:

None

start_idx

property start_idx: int

status

property status: TextGenerationStatus

tokens

tokens: ndarray

unassign_from_cache()

unassign_from_cache()

Unassigns this context from its current cache slot.

This clears the cache_seq_id, allowing the cache slot to be reused by other contexts. Should be called when the context is no longer actively generating tokens.

Return type:

None

update()

update(new_token, log_probabilities=None)

Updates the next_tokens and extends existing tokens to include all generated tokens.

Parameters:

  • new_token (int )
  • log_probabilities (LogProbabilities | None )

Return type:

None

update_status()

update_status(status)

Parameters:

status (TextGenerationStatus )

Return type:

None

TextGenerationResponse

class max.pipelines.core.TextGenerationResponse(tokens, final_status)

Parameters:

  • tokens (list [ TextResponse ] )
  • final_status (TextGenerationStatus )

append_token()

append_token(token)

Parameters:

token (TextResponse )

Return type:

None

final_status

property final_status: TextGenerationStatus

is_done

property is_done: bool

tokens

property tokens: list[TextResponse]

update_status()

update_status(status)

Parameters:

status (TextGenerationStatus )

Return type:

None

TextGenerationStatus

class max.pipelines.core.TextGenerationStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

ACTIVE

ACTIVE = 'active'

END_OF_SEQUENCE

END_OF_SEQUENCE = 'end_of_sequence'

MAXIMUM_LENGTH

MAXIMUM_LENGTH = 'maximum_length'

is_done

property is_done: bool

TextResponse

class max.pipelines.core.TextResponse(next_token, log_probabilities=None)

A base class for model response, specifically for Text model variants.

Parameters:

next_token

next_token

Encoded predicted next token.

Type:

int | str

log_probabilities

log_probabilities

Log probabilities of each output token.

Type:

LogProbabilities | None

TokenGenerator

class max.pipelines.core.TokenGenerator(*args, **kwargs)

Interface for LLM token-generator models.

next_token()

next_token(batch, num_steps)

Computes the next token response for a single batch.

Parameters:

  • batch (dict [ str , TokenGeneratorContext ] ) – Batch of contexts.
  • num_steps (int ) – Number of tokens to generate.

Returns:

List of encoded responses (indexed by request ID).

Return type:

list[dict[str, TextResponse]]

release()

release(context)

Releases resources associated with this context.

Parameters:

context (TokenGeneratorContext ) – Finished context.

Return type:

None
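
A sketch of one generation step, assuming a concrete TokenGenerator and an already-built batch of contexts; only the documented calls are shown:

def generate_step(generator, batch, num_steps=8):
    # `generator` is assumed to implement TokenGenerator; `batch` maps request
    # IDs to TokenGeneratorContext objects.
    responses = generator.next_token(batch, num_steps=num_steps)
    for ctx in list(batch.values()):
        if ctx.is_done:
            generator.release(ctx)
    return responses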

TokenGeneratorRequest

class max.pipelines.core.TokenGeneratorRequest(id: 'str', index: 'int', model_name: 'str', prompt: 'Union[str, Sequence[int], None]' = None, messages: 'Optional[list[TokenGeneratorRequestMessage]]' = None, images: 'Optional[list[bytes]]' = None, tools: 'Optional[list[TokenGeneratorRequestTool]]' = None, response_format: 'Optional[TokenGeneratorResponseFormat]' = None, timestamp_ns: 'int' = 0, request_path: 'str' = '/', logprobs: 'int' = 0, echo: 'bool' = False, stop: 'Optional[Union[str, list[str]]]' = None, chat_template_options: 'Optional[dict[str, Any]]' = None, sampling_params: 'SamplingParams' = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0))

Parameters:

chat_template_options

chat_template_options: dict[str, Any] | None = None

Optional dictionary of options to pass when applying the chat template.

echo

echo: bool = False

If set to True, the response will include the original prompt along with the generated output. This can be useful for debugging or when you want to see how the input relates to the output.

id

id: str

A unique identifier for the request. This ID can be used to trace and log the request throughout its lifecycle, facilitating debugging and tracking.

images

images: list[bytes] | None = None

A list of image byte arrays that can be included as part of the request. This field is optional and may be used for multimodal inputs where images are relevant to the prompt or task.

index

index: int

The sequence order of this request within a batch. This is useful for maintaining the order of requests when processing multiple requests simultaneously, ensuring that responses can be matched back to their corresponding requests accurately.

logprobs

logprobs: int = 0

The number of top log probabilities to return for each generated token. A value of 0 means that log probabilities will not be returned. Useful for analyzing model confidence in its predictions.

messages

messages: list[TokenGeneratorRequestMessage] | None = None

A list of messages for chat-based interactions. This is used in chat completion APIs, where each message represents a turn in the conversation. If provided, the model will generate responses based on these messages.

model_name

model_name: str

The name of the model to be used for generating tokens. This should match the available models on the server and determines the behavior and capabilities of the response generation.

prompt

prompt: str | Sequence[int] | None = None

The prompt to be processed by the model. This field supports legacy completion APIs and can accept either a string or a sequence of integers representing token IDs. If not provided, the model may generate output based on the messages field.

request_path

request_path: str = '/'

The endpoint path for the request. This is typically used for routing and logging requests within the server infrastructure.

response_format

response_format: TokenGeneratorResponseFormat | None = None

Specifies the desired format for the model’s output. When set, it enables structured generation, which adheres to the json_schema provided.

sampling_params

sampling_params: SamplingParams = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

Token sampling configuration parameters for the request.

stop

stop: str | list[str] | None = None

Optional list of stop expressions (see https://platform.openai.com/docs/api-reference/chat/create#chat-create-stop).

timestamp_ns

timestamp_ns: int = 0

The time (in nanoseconds) when the request was received by the server. This can be useful for performance monitoring and logging purposes.

tools

tools: list[TokenGeneratorRequestTool] | None = None

A list of tools that can be invoked during the generation process. This allows the model to utilize external functionalities or APIs to enhance its responses.
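
For illustration, a chat-style request might be constructed as follows (the model name is a placeholder, and plain dicts stand in for TokenGeneratorRequestMessage entries):

from max.pipelines.core import SamplingParams, TokenGeneratorRequest

request = TokenGeneratorRequest(
    id="chat-1",
    index=0,
    model_name="my-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize this API in one sentence."},
    ],
    sampling_params=SamplingParams(temperature=0.7, max_new_tokens=128),
)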

TokenGeneratorRequestFunction

class max.pipelines.core.TokenGeneratorRequestFunction

description

description: str

name

name: str

parameters

parameters: dict

TokenGeneratorRequestMessage

class max.pipelines.core.TokenGeneratorRequestMessage

content

content: str | list[dict[str, Any]]

Content can be a simple string or a list of message parts of different modalities.

For example:

{
  "role": "user",
  "content": "What's the weather like in Boston today?"
}

Or:

{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What's in this image?"
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
      }
    }
  ]
}

role

role: Literal['system', 'user', 'assistant']

TokenGeneratorRequestTool

class max.pipelines.core.TokenGeneratorRequestTool

function

function: TokenGeneratorRequestFunction

type

type: str

TokenGeneratorResponseFormat

class max.pipelines.core.TokenGeneratorResponseFormat

json_schema

json_schema: dict

type

type: str

msgpack_numpy_decoder()

max.pipelines.core.msgpack_numpy_decoder(type_, copy=True)

Create a decoder function for the specified type.

Parameters:

  • type_ (Any ) – The type to decode into
  • copy (bool ) – Copy numpy arrays if true

Returns:

A function that decodes bytes into the specified type

Return type:

Callable[[bytes], Any]

msgpack_numpy_encoder()

max.pipelines.core.msgpack_numpy_encoder()

Create an encoder function that handles numpy arrays.

Returns:

A function that encodes objects into bytes

Return type:

Callable[[Any], bytes]
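
A round-trip sketch using both helpers; passing dict as the target type is an assumption made for this example:

import numpy as np
from max.pipelines.core import msgpack_numpy_decoder, msgpack_numpy_encoder

encode = msgpack_numpy_encoder()
decode = msgpack_numpy_decoder(dict)   # assumed: a plain dict target type

payload = {"embedding": np.arange(4, dtype=np.float32)}
restored = decode(encode(payload))     # round-trips the numpy array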