Python module
core
AudioGenerationRequest
class max.pipelines.core.AudioGenerationRequest(id: 'str', index: 'int', model: 'str', input: 'Optional[str]' = None, audio_prompt_tokens: 'list[int]' = <factory>, audio_prompt_transcription: 'str' = '', sampling_params: 'SamplingParams' = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0), _assistant_message_override: 'str | None' = None, prompt: 'Optional[list[int] | str]' = None)
Parameters:
audio_prompt_tokens
The prompt speech IDs to use for audio generation.
audio_prompt_transcription
audio_prompt_transcription: str = ''
The audio prompt transcription to use for audio generation.
id
id: str
A unique identifier for the request. This ID can be used to trace and log the request throughout its lifecycle, facilitating debugging and tracking.
index
index: int
The sequence order of this request within a batch. This is useful for maintaining the order of requests when processing multiple requests simultaneously, ensuring that responses can be matched back to their corresponding requests accurately.
input
The text to generate audio for. The maximum length is 4096 characters.
model
model: str
The name of the model to be used for generating audio chunks. This should match the available models on the server and determines the behavior and capabilities of the response generation.
prompt
Optionally provide a preprocessed list of token IDs or a prompt string to pass as input directly into the model. This replaces automatically generating TokenGeneratorRequestMessages from the input, audio prompt tokens, and audio prompt transcription fields.
sampling_params
sampling_params: SamplingParams = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)
Request sampling configuration options.
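For illustration, a minimal sketch of constructing a request directly from the dataclass signature above; the model name and input text are placeholders, not values from a real deployment.

```python
from max.pipelines.core import AudioGenerationRequest, SamplingParams

# Minimal sketch based on the signature above; "example/tts-model" is a
# placeholder model name, not a real served model.
request = AudioGenerationRequest(
    id="audio-req-001",
    index=0,
    model="example/tts-model",
    input="Hello from the audio pipeline.",
    sampling_params=SamplingParams(top_k=5, temperature=0.8),
)
```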
AudioGenerationResponse
class max.pipelines.core.AudioGenerationResponse(final_status, audio=None)
Parameters:
- final_status (TextGenerationStatus)
- audio (np.ndarray | None)
audio_data
property audio_data: ndarray
final_status
property final_status: TextGenerationStatus
has_audio_data
property has_audio_data: bool
is_done
property is_done: bool
AudioGenerator
class max.pipelines.core.AudioGenerator(*args, **kwargs)
Interface for audio generation models.
decoder_sample_rate
property decoder_sample_rate: int
The sample rate of the decoder.
next_chunk()
next_chunk(batch, num_tokens)
Computes the next audio chunk for a single batch.
The new speech tokens are saved to the context. The most recently generated audio is returned through the AudioGenerationResponse.
release()
release(context)
Releases resources associated with this context.
Parameters:
- context (AudioGeneratorContext) – Finished context.
Return type: None
AudioGeneratorOutput
class max.pipelines.core.AudioGeneratorOutput(audio_data: 'torch.Tensor', metadata: 'dict[str, Any]', is_done: 'bool')
audio_data
audio_data: torch.Tensor
is_done
is_done: bool
metadata
metadata: dict[str, Any]
EmbeddingsGenerator
class max.pipelines.core.EmbeddingsGenerator(*args, **kwargs)
Interface for LLM embeddings-generator models.
encode()
encode(batch)
Computes embeddings for a batch of inputs.
EmbeddingsResponse
class max.pipelines.core.EmbeddingsResponse(embeddings)
Container for the response from embeddings pipeline.
Parameters:
- embeddings (ndarray)
embeddings
embeddings: ndarray
InputContext
class max.pipelines.core.InputContext(*args, **kwargs)
A base class for model contexts, representing model inputs for TokenGenerators.
Token array layout:
+---------- full prompt ----------+               CHUNK_SIZE*N v
+--------------------+---------------+-----------------+----------------+
|      completed     |  next_tokens  |                 |  preallocated  |
+--------------------+---------------+-----------------+----------------+
           start_idx ^    active_idx ^         end_idx ^
- completed: The tokens that have already been processed and encoded.
- next_tokens: The tokens that will be processed in the next iteration. This may be a subset of the full prompt due to chunked prefill.
- preallocated: The token slots that have been preallocated. The token array resizes to multiples of CHUNK_SIZE to accommodate the new tokens.
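As a rough illustration of these index semantics (plain NumPy slicing, not API calls; the index values and the CHUNK_SIZE of 8 are invented for the example):

```python
import numpy as np

# Illustrative only: a token array resized to a multiple of an assumed
# CHUNK_SIZE of 8, with example index values.
tokens = np.zeros(16, dtype=np.int64)
start_idx, active_idx, end_idx = 4, 7, 7

completed = tokens[:start_idx]               # already processed and encoded
next_tokens = tokens[start_idx:active_idx]   # processed in the next iteration
valid = tokens[:end_idx]                     # all tokens written so far
preallocated = tokens[end_idx:]              # spare slots from CHUNK_SIZE rounding
```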
active_idx
property active_idx: int
active_length
property active_length: int
The number of tokens input this iteration (i.e., the current sequence length). This will be the prompt size for context encoding, and simply 1 for token generation.
all_tokens
property all_tokens: ndarray
All prompt and generated tokens in the context.
assign_to_cache()
assign_to_cache(cache_seq_id)
Assigns the context to a cache slot.
Parameters:
- cache_seq_id (int)
Return type: None
bump_token_indices()
bump_token_indices(start_idx=0, active_idx=0, end_idx=0, committed_idx=0)
Update the start_idx, active_idx and end_idx without manipulating the token array.
cache_seq_id
property cache_seq_id: int
Returns the cache slot assigned to the context, raising an error if not assigned.
committed_idx
property committed_idx: int
compute_num_available_steps()
compute_num_available_steps(max_seq_len)
Compute the max number of steps we can execute for a given context without exceeding the max_seq_len.
current_length
property current_length: int
The current length of the sequence, including completed and active tokens.
end_idx
property end_idx: int
eos_token_ids
generated_tokens
property generated_tokens: ndarray
All generated tokens in the context.
get_min_token_logit_mask()
get_min_token_logit_mask(num_steps)
Returns a set of indices for the tokens in the output that should be masked.
This is primarily used for the min_tokens setting, where we mask eos tokens in the logits to avoid generating them before we reach min_tokens.
is_assigned_to_cache
property is_assigned_to_cache: bool
Returns True if input is assigned to a cache slot, False otherwise.
is_ce
property is_ce: bool
Returns True if the context is a context encoding context, False otherwise.
is_done
property is_done: bool
is_initial_prompt
property is_initial_prompt: bool
Returns true if the context has not been updated with tokens.
json_schema
A json schema to use during constrained decoding.
jump_ahead()
jump_ahead(new_token)
Updates the token array, while ensuring the new token is returned to the user.
Parameters:
- new_token (int)
Return type: None
log_probabilities
property log_probabilities: int
When > 0, returns the log probabilities for the top N tokens for each token in the sequence.
log_probabilities_echo
property log_probabilities_echo: bool
When True, the input tokens are added to the returned logprobs.
matcher
property matcher: xgr.GrammarMatcher | None
An optional xgr Grammar Matcher provided when using structured output.
max_length
The maximum length of this sequence.
min_tokens
property min_tokens: int
The minimum number of new tokens to generate.
next_tokens
property next_tokens: ndarray
The next prompt tokens to be input during this iteration.
This should be a 1D array of tokens of length active_length.
outstanding_completion_tokens()
outstanding_completion_tokens()
Return the list of outstanding completion tokens and log probabilities that must be returned to the user.
Return type: list[tuple[int, LogProbabilities | None]]
prompt_tokens
property prompt_tokens: ndarray
Prompt tokens in the context.
reset()
reset()
Resets the context’s state by combining all tokens into a new prompt. This method is used when a request is evicted, meaning that the context needs to be re-encoded in the following CE iteration.
Return type: None
rollback()
rollback(idx)
Rollback and remove the last idx tokens.
Parameters:
- idx (int)
Return type: None
sampling_params
property sampling_params: SamplingParams
Returns the per-request sampling configuration
set_draft_offset()
set_draft_offset(idx)
Parameters:
- idx (int)
Return type: None
set_matcher()
set_matcher(matcher)
Set a grammar matcher for use during constrained decoding.
Parameters:
- matcher (xgr.GrammarMatcher)
Return type: None
set_token_indices()
set_token_indices(start_idx=None, active_idx=None, end_idx=None, committed_idx=None)
Set the token indices without manipulating the token array.
start_idx
property start_idx: int
status
property status: TextGenerationStatus
tokens
property tokens: ndarray
All tokens (including padded tokens) in the context. In most scenarios, use all_tokens to get the active full token array.
unassign_from_cache()
unassign_from_cache()
Unassigns the context from a cache slot.
Return type: None
update()
update(new_token, log_probabilities=None)
Updates the next_tokens and extends existing tokens to include all generated tokens.
Parameters:
- new_token (int)
- log_probabilities (LogProbabilities | None)
Return type: None
update_status()
update_status(status)
Parameters:
- status (TextGenerationStatus)
Return type: None
LogProbabilities
class max.pipelines.core.LogProbabilities(token_log_probabilities, top_log_probabilities)
Log probabilities for an individual output token.
Parameters:
token_log_probabilities
token_log_probabilities
Probabilities of each token.
top_log_probabilities
top_log_probabilities
Top tokens and their corresponding probabilities.
PipelineTask
class max.pipelines.core.PipelineTask(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
AUDIO_GENERATION
AUDIO_GENERATION = 'audio_generation'
EMBEDDINGS_GENERATION
EMBEDDINGS_GENERATION = 'embeddings_generation'
SPEECH_TOKEN_GENERATION
SPEECH_TOKEN_GENERATION = 'speech_token_generation'
TEXT_GENERATION
TEXT_GENERATION = 'text_generation'
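Each member maps to the serialized string value shown above, for example:

```python
from max.pipelines.core import PipelineTask

task = PipelineTask.TEXT_GENERATION
assert task.value == "text_generation"
```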
PipelineTokenizer
class max.pipelines.core.PipelineTokenizer(*args, **kwargs)
Interface for LLM tokenizers.
decode()
async decode(context, encoded, **kwargs)
Decodes response tokens to text.
Parameters:
- context (TokenGeneratorContext) – Current generation context.
- encoded (TokenizerEncoded) – Encoded response tokens.
Returns: Un-encoded response text.
encode()
async encode(prompt, add_special_tokens)
Encodes text prompts as tokens.
Raises:
- ValueError – If the prompt exceeds the configured maximum length.
Return type: TokenizerEncoded
eos
property eos: int
The end of sequence token for this tokenizer.
expects_content_wrapping
property expects_content_wrapping: bool
If true, this tokenizer expects messages to have a content property.
Text messages are formatted as:
{ "type": "text", "content": "text content" }
instead of the OpenAI spec:
{ "type": "text", "text": "text content" }
NOTE: Multimodal messages omit the content property. Both image_urls and image content parts are converted to:
{ "type": "image" }
Their content is provided as byte arrays through the top-level property on the request object, i.e., PipelineTokenizerRequest.images.
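A minimal sketch (not part of the API) of reshaping an OpenAI-style content part into the wrapped form described above:

```python
def wrap_content_part(part: dict) -> dict:
    # Hypothetical helper for illustration only.
    if part.get("type") == "text":
        # "text" becomes "content" when expects_content_wrapping is True.
        return {"type": "text", "content": part["text"]}
    # Multimodal parts omit the content property; image bytes travel
    # separately via the top-level images field on the request object.
    return {"type": "image"}

assert wrap_content_part({"type": "text", "text": "hi"}) == {
    "type": "text",
    "content": "hi",
}
```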
new_context()
async new_context(request)
Creates a new context from a request object. This is sent to the worker process once and then cached locally.
Parameters:
- request (PipelineTokenizerRequest) – Incoming request.
Returns: Initialized context.
Return type: TokenGeneratorContext
SamplingParams
class max.pipelines.core.SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)
Request-specific sampling parameters that are only known at runtime.
Parameters:
- top_k (int)
- top_p (float)
- min_p (float)
- temperature (float)
- frequency_penalty (float)
- presence_penalty (float)
- repetition_penalty (float)
- max_new_tokens (int | None)
- min_new_tokens (int)
- ignore_eos (bool)
- stop (list[str] | None)
- stop_token_ids (list[int] | None)
- detokenize (bool)
- seed (int)
detokenize
detokenize: bool = True
Whether to detokenize the output tokens into text.
frequency_penalty
frequency_penalty: float = 0.0
The frequency penalty to apply to the model’s output. A positive value will penalize new tokens based on their frequency in the generated text: tokens will receive a penalty proportional to the count of appearances.
ignore_eos
ignore_eos: bool = False
If True, the response will ignore the EOS token, and continue to generate until the max tokens or a stop string is hit.
max_new_tokens
The maximum number of new tokens to generate in the response. If not set, the model may generate tokens until it reaches its internal limits or based on other stopping criteria.
min_new_tokens
min_new_tokens: int = 0
The minimum number of tokens to generate in the response.
min_p
min_p: float = 0.0
Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.
presence_penalty
presence_penalty: float = 0.0
The presence penalty to apply to the model’s output. A positive value will penalize new tokens that have already appeared in the generated text at least once by applying a constant penalty.
repetition_penalty
repetition_penalty: float = 1.0
The repetition penalty to apply to the model’s output. Values > 1 will penalize new tokens that have already appeared in the generated text at least once by dividing the logits by the repetition penalty.
seed
seed: int = 0
The seed to use for the random number generator.
stop
A list of detokenized sequences that can be used as stop criteria when generating a new sequence.
stop_token_ids
A list of token ids that are used as stopping criteria when generating a new sequence.
temperature
temperature: float = 1
Controls the randomness of the model’s output; higher values produce more diverse responses.
top_k
top_k: int = 1
Limits the sampling to the K most probable tokens. This defaults to 1, which enables greedy sampling.
top_p
top_p: float = 1
Only use the tokens whose cumulative probability is within the top_p threshold. This applies to the top_k tokens.
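A minimal construction sketch using the fields documented above; the values are arbitrary examples.

```python
from max.pipelines.core import SamplingParams

params = SamplingParams(
    top_k=40,
    top_p=0.9,
    temperature=0.7,
    max_new_tokens=256,
    stop=["\n\n"],  # stop generation at a blank line
    seed=42,
)
```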
TTSContext
class max.pipelines.core.TTSContext(audio_prompt_tokens=<factory>, prev_samples_beyond_offset=0, _speech_token_size=128, _speech_token_end_idx=0, _speech_tokens=<factory>, _decoded_index=0, _block_counter=0, _arrival_time=<factory>, _audio_generation_status=TextGenerationStatus.ACTIVE, *, prompt, max_length, tokens, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=None, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, _matcher=None, _status=TextGenerationStatus.ACTIVE, _cache_seq_id=None, _size=-1, _start_idx=0, _active_idx=-1, _end_idx=-1, _completion_start_idx=-1, _completion_end_idx=-1, _prompt_len=-1, _committed_idx=0, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0)
A context for Text-to-Speech (TTS) model inference.
This class extends TextContext to handle speech token generation and management. It maintains buffers for audio prompt tokens and generated speech tokens, along with tracking indices for decoding progress.
Parameters:
- audio_prompt_tokens (ndarray) – Array of input audio prompt tokens used for voice cloning
- prev_samples_beyond_offset (int)
- _speech_token_size (int) – Size of the speech token buffer, defaults to SPEECH_TOKEN_audio_chunk_size
- _speech_token_end_idx (int) – Index marking the end of valid speech tokens
- _speech_tokens (ndarray) – Buffer containing the generated speech tokens
- _decoded_index (int) – Index tracking how many tokens have been decoded to audio
- _block_counter (int) – Counter tracking number of speech token blocks generated
- _arrival_time (float)
- _audio_generation_status (TextGenerationStatus)
- prompt (str | Sequence[int])
- max_length (int)
- tokens (ndarray)
- eos_token_ids (set[int])
- eos_sequences (list[list[int]])
- log_probabilities (int | None)
- log_probabilities_echo (bool)
- ignore_eos (bool)
- json_schema (str | None)
- sampling_params (SamplingParams)
- _matcher (Any | None)
- _status (TextGenerationStatus)
- _cache_seq_id (int | None)
- _size (int)
- _start_idx (int)
- _active_idx (int)
- _end_idx (int)
- _completion_start_idx (int)
- _completion_end_idx (int)
- _prompt_len (int)
- _committed_idx (int)
- _log_probabilities_data (dict[int, LogProbabilities])
- _is_initial_prompt (bool)
- _draft_offset (int)
audio_generation_status
property audio_generation_status: TextGenerationStatus
audio_prompt_tokens
audio_prompt_tokens: ndarray
block_counter
property block_counter: int
decoded_index
property decoded_index: int
has_undecoded_speech_tokens()
has_undecoded_speech_tokens(exclude_last_n=0)
Checks whether there are undecoded speech tokens.
Parameters:
- exclude_last_n (int) – Number of tokens to exclude from the end when checking for undecoded tokens. For example, if set to 1, the last token will not be considered when checking for undecoded tokens.
Returns: True if there are undecoded speech tokens (excluding the last n tokens), False otherwise.
Return type: bool
is_done
property is_done: bool
next_speech_tokens()
next_speech_tokens(audio_chunk_size=None, buffer=None)
Returns a chunk of the next unseen speech tokens.
Calling this function will update the index of the last seen token.
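A hedged sketch of draining speech tokens from a context as they become available; `ctx` and the downstream `vocode` call are assumptions (not part of this API), and the chunk size is arbitrary.

```python
# Assumes `ctx` is a TTSContext being filled by a running TTS pipeline and
# `vocode` is a hypothetical function that turns speech tokens into audio.
while ctx.has_undecoded_speech_tokens():
    # Assumed to return the next unseen chunk and advance the decoded index.
    chunk = ctx.next_speech_tokens(audio_chunk_size=128)
    vocode(chunk)
```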
prev_samples_beyond_offset
prev_samples_beyond_offset: int
set_decoded_index()
set_decoded_index(idx)
Parameters:
- idx (int)
Return type: None
speech_token_status
property speech_token_status: TextGenerationStatus
Returns the status of the speech token generation.
speech_tokens
property speech_tokens: ndarray
status
property status: TextGenerationStatus
update_audio_generation_status()
update_audio_generation_status(status)
Parameters:
- status (TextGenerationStatus)
Return type: None
update_speech_token_status()
update_speech_token_status(status)
Parameters:
- status (TextGenerationStatus)
Return type: None
update_speech_tokens()
update_speech_tokens(new_tokens)
Updates the speech token buffer with the new tokens.
Parameters:
- new_tokens (ndarray)
Return type: None
update_status()
update_status(status)
Parameters:
- status (TextGenerationStatus)
Return type: None
TextAndVisionContext
class max.pipelines.core.TextAndVisionContext(*, prompt, max_length, tokens, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=None, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, _matcher=None, _status=TextGenerationStatus.ACTIVE, _cache_seq_id=None, _size=-1, _start_idx=0, _active_idx=-1, _end_idx=-1, _completion_start_idx=-1, _completion_end_idx=-1, _prompt_len=-1, _committed_idx=0, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, pixel_values=<factory>, extra_model_args=<factory>)
A base class for model context, specifically for Vision model variants.
Parameters:
- prompt (str | Sequence[int])
- max_length (int)
- tokens (ndarray)
- eos_token_ids (set[int])
- eos_sequences (list[list[int]])
- log_probabilities (int | None)
- log_probabilities_echo (bool)
- ignore_eos (bool)
- json_schema (str | None)
- sampling_params (SamplingParams)
- _matcher (Any | None)
- _status (TextGenerationStatus)
- _cache_seq_id (int | None)
- _size (int)
- _start_idx (int)
- _active_idx (int)
- _end_idx (int)
- _completion_start_idx (int)
- _completion_end_idx (int)
- _prompt_len (int)
- _committed_idx (int)
- _log_probabilities_data (dict[int, LogProbabilities])
- _is_initial_prompt (bool)
- _draft_offset (int)
- pixel_values (tuple[ndarray, ...])
- extra_model_args (dict[str, ndarray])
extra_model_args
pixel_values
update()
update(new_token, log_probabilities=None)
Updates the next_tokens and extends existing tokens to include all generated tokens.
Parameters:
- new_token (int)
- log_probabilities (LogProbabilities | None)
Return type: None
TextContext
class max.pipelines.core.TextContext(*, prompt, max_length, tokens, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=None, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, _matcher=None, _status=TextGenerationStatus.ACTIVE, _cache_seq_id=None, _size=-1, _start_idx=0, _active_idx=-1, _end_idx=-1, _completion_start_idx=-1, _completion_end_idx=-1, _prompt_len=-1, _committed_idx=0, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0)
A base class for model context, specifically for Text model variants.
This class manages the state and processing of text generation, including token management, caching, and generation parameters.
Parameters:
- prompt (str | Sequence[int]) – The input prompt as either a string or sequence of token IDs
- max_length (int) – Maximum allowed length of the generated sequence
- tokens (ndarray) – NumPy array containing the token IDs
- eos_token_ids (set[int]) – Set of token IDs that indicate end of sequence
- eos_sequences (list[list[int]])
- log_probabilities (int | None) – Whether to return token log probabilities (None or int)
- log_probabilities_echo (bool) – Whether to return log probabilities for prompt tokens
- ignore_eos (bool) – Whether to ignore end of sequence tokens and continue generating
- json_schema (str | None) – Optional JSON schema for structured output
- sampling_params (SamplingParams) – Parameters controlling the token sampling strategy
- _matcher (Any | None)
- _status (TextGenerationStatus) – Current generation status (active, finished, etc.)
- _cache_seq_id (int | None) – ID of KV cache slot assigned to this context
- _size (int) – Current allocated size of token array
- _start_idx (int) – Start index of current generation window
- _active_idx (int) – Current position in token sequence
- _end_idx (int) – End index of valid tokens
- _completion_start_idx (int) – Start index of completion tokens
- _completion_end_idx (int) – End index of completion tokens
- _prompt_len (int) – Length of original prompt
- _committed_idx (int) – Index up to which tokens are committed
- _log_probabilities_data (dict[int, LogProbabilities]) – Token log probabilities data
- _is_initial_prompt (bool) – Whether this is the initial prompt encoding
- _draft_offset (int) – Offset for draft decoding
active_idx
property active_idx: int
active_length
property active_length: int
The number of tokens input this iteration (i.e., the current sequence length). This will be the prompt size for context encoding, and simply 1 (or more) for token generation.
all_tokens
property all_tokens: ndarray
assign_to_cache()
assign_to_cache(cache_seq_id)
Assigns this context to a cache slot.
The cache slot is used to store and retrieve KV-cache entries for this context during token generation.
Parameters:
- cache_seq_id (int) – The ID of the cache slot to assign this context to.
Raises:
- RuntimeError – If this context is already assigned to a cache slot.
Return type: None
bump_token_indices()
bump_token_indices(start_idx=0, active_idx=0, end_idx=0, committed_idx=0)
Update the start_idx, active_idx and end_idx without manipulating the token array.
cache_seq_id
property cache_seq_id: int
Gets the ID of the cache slot this context is assigned to.
The cache_seq_id is used to look up KV-cache entries for this context during token generation.
Returns: The cache slot ID.
Return type: int
Raises:
- ValueError – If this context is not currently assigned to a cache slot.
committed_idx
property committed_idx: int
compute_num_available_steps()
compute_num_available_steps(max_seq_len)
Compute the max number of steps we can execute for a given context without exceeding the max_seq_len.
current_length
property current_length: int
The current length of the sequence, including completed and active tokens.
end_idx
property end_idx: int
eos_sequences
eos_token_ids
generated_tokens
property generated_tokens: ndarray
Returns all tokens that have been generated after the prompt.
Returns: Array of generated tokens from prompt_len to end_idx.
Return type: np.ndarray
get_min_token_logit_mask()
get_min_token_logit_mask(num_steps)
Returns a set of indices for the tokens in the output that should be masked.
This is primarily used for the min_tokens setting, where we mask eos tokens in the logits to avoid generating them before we reach min_tokens.
ignore_eos
ignore_eos: bool
is_assigned_to_cache
property is_assigned_to_cache: bool
Returns whether this context is currently assigned to a cache slot.
The cache assignment status indicates whether this context can currently access KV-cache entries for token generation.
Returns: True if assigned to a cache slot, False otherwise.
Return type: bool
is_ce
property is_ce: bool
Returns whether this context is in context encoding (CE) mode.
CE mode indicates that the context has more than one active token to process, typically during the initial encoding of a prompt or after a rollback.
Returns: True if in CE mode (active_length > 1), False otherwise.
Return type: bool
is_done
property is_done: bool
is_initial_prompt
property is_initial_prompt: bool
Returns true if the context has not been updated with tokens.
json_schema
jump_ahead()
jump_ahead(new_token)
Updates the token array, while ensuring the new token is returned to the user.
Parameters:
- new_token (int)
Return type: None
log_probabilities
log_probabilities_echo
log_probabilities_echo: bool
matcher
property matcher: xgr.GrammarMatcher | None
max_length
max_length: int
min_tokens
property min_tokens: int
The minimum number of new tokens to generate.
next_tokens
property next_tokens: ndarray
Returns the tokens between start_idx and active_idx.
Returns: Array of tokens that have been generated but not yet processed.
Return type: np.ndarray
outstanding_completion_tokens()
outstanding_completion_tokens()
Return the list of outstanding completion tokens and log probabilities that must be returned to the user.
Return type: list[tuple[int, LogProbabilities | None]]
prompt
prompt_tokens
property prompt_tokens: ndarray
Returns the original prompt tokens.
Returns: Array of tokens from the initial prompt.
Return type: np.ndarray
reset()
reset()
Resets the context’s state by combining all tokens into a new prompt.
Return type: None
rollback()
rollback(idx)
Parameters:
- idx (int)
Return type: None
sampling_params
sampling_params: SamplingParams
set_draft_offset()
set_draft_offset(idx)
Sets the draft offset index used for speculative decoding.
Parameters:
- idx (int) – The index to set as the draft offset.
Return type: None
set_matcher()
set_matcher(matcher)
Parameters:
- matcher (xgr.GrammarMatcher)
Return type: None
set_token_indices()
set_token_indices(start_idx=None, active_idx=None, end_idx=None, committed_idx=None)
Set the token indices without manipulating the token array.
start_idx
property start_idx: int
status
property status: TextGenerationStatus
tokens
tokens: ndarray
unassign_from_cache()
unassign_from_cache()
Unassigns this context from its current cache slot.
This clears the cache_seq_id, allowing the cache slot to be reused by other contexts. Should be called when the context is no longer actively generating tokens.
Return type: None
update()
update(new_token, log_probabilities=None)
Updates the next_tokens and extends existing tokens to include all generated tokens.
Parameters:
- new_token (int)
- log_probabilities (LogProbabilities | None)
Return type: None
update_status()
update_status(status)
Parameters:
- status (TextGenerationStatus)
Return type: None
TextGenerationResponse
class max.pipelines.core.TextGenerationResponse(tokens, final_status)
Parameters:
- tokens (list[TextResponse])
- final_status (TextGenerationStatus)
append_token()
append_token(token)
Parameters:
- token (TextResponse)
Return type: None
final_status
property final_status: TextGenerationStatus
is_done
property is_done: bool
tokens
property tokens: list[TextResponse]
update_status()
update_status(status)
Parameters:
- status (TextGenerationStatus)
Return type: None
TextGenerationStatus
class max.pipelines.core.TextGenerationStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
ACTIVE
ACTIVE = 'active'
END_OF_SEQUENCE
END_OF_SEQUENCE = 'end_of_sequence'
MAXIMUM_LENGTH
MAXIMUM_LENGTH = 'maximum_length'
is_done
property is_done: bool
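For example (treating terminal statuses as done is an assumption based on the member names above):

```python
from max.pipelines.core import TextGenerationStatus

assert TextGenerationStatus.ACTIVE.value == "active"
print(TextGenerationStatus.ACTIVE.is_done)           # expected: False
print(TextGenerationStatus.END_OF_SEQUENCE.is_done)  # expected: True
```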
TextResponse
class max.pipelines.core.TextResponse(next_token, log_probabilities=None)
A base class for model response, specifically for Text model variants.
Parameters:
- next_token (int | str)
- log_probabilities (LogProbabilities | None)
next_token
next_token
Encoded predicted next token.
log_probabilities
log_probabilities: LogProbabilities | None
Log probabilities of each output token.
TokenGenerator
class max.pipelines.core.TokenGenerator(*args, **kwargs)
Interface for LLM token-generator models.
next_token()
next_token(batch, num_steps)
Computes the next token response for a single batch.
release()
release(context)
Releases resources associated with this context.
Parameters:
- context (TokenGeneratorContext) – Finished context.
Return type: None
TokenGeneratorRequest
class max.pipelines.core.TokenGeneratorRequest(id: 'str', index: 'int', model_name: 'str', prompt: 'Union[str, Sequence[int], None]' = None, messages: 'Optional[list[TokenGeneratorRequestMessage]]' = None, images: 'Optional[list[bytes]]' = None, tools: 'Optional[list[TokenGeneratorRequestTool]]' = None, response_format: 'Optional[TokenGeneratorResponseFormat]' = None, timestamp_ns: 'int' = 0, request_path: 'str' = '/', logprobs: 'int' = 0, echo: 'bool' = False, stop: 'Optional[Union[str, list[str]]]' = None, chat_template_options: 'Optional[dict[str, Any]]' = None, sampling_params: 'SamplingParams' = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0))
Parameters:
- id (str)
- index (int)
- model_name (str)
- prompt (str | Sequence[int] | None)
- messages (list[TokenGeneratorRequestMessage] | None)
- images (list[bytes] | None)
- tools (list[TokenGeneratorRequestTool] | None)
- response_format (TokenGeneratorResponseFormat | None)
- timestamp_ns (int)
- request_path (str)
- logprobs (int)
- echo (bool)
- stop (str | list[str] | None)
- chat_template_options (dict[str, Any] | None)
- sampling_params (SamplingParams)
chat_template_options
Optional dictionary of options to pass when applying the chat template.
echo
echo: bool = False
If set to True, the response will include the original prompt along with the generated output. This can be useful for debugging or when you want to see how the input relates to the output.
id
id: str
A unique identifier for the request. This ID can be used to trace and log the request throughout its lifecycle, facilitating debugging and tracking.
images
A list of image byte arrays that can be included as part of the request. This field is optional and may be used for multimodal inputs where images are relevant to the prompt or task.
index
index: int
The sequence order of this request within a batch. This is useful for maintaining the order of requests when processing multiple requests simultaneously, ensuring that responses can be matched back to their corresponding requests accurately.
logprobs
logprobs: int = 0
The number of top log probabilities to return for each generated token. A value of 0 means that log probabilities will not be returned. Useful for analyzing model confidence in its predictions.
messages
messages: list[TokenGeneratorRequestMessage] | None = None
A list of messages for chat-based interactions. This is used in chat completion APIs, where each message represents a turn in the conversation. If provided, the model will generate responses based on these messages.
model_name
model_name: str
The name of the model to be used for generating tokens. This should match the available models on the server and determines the behavior and capabilities of the response generation.
prompt
The prompt to be processed by the model. This field supports legacy completion APIs and can accept either a string or a sequence of integers representing token IDs. If not provided, the model may generate output based on the messages field.
request_path
request_path: str = '/'
The endpoint path for the request. This is typically used for routing and logging requests within the server infrastructure.
response_format
response_format: TokenGeneratorResponseFormat | None = None
Specifies the desired format for the model’s output. When set, it enables structured generation, which adheres to the json_schema provided.
sampling_params
sampling_params: SamplingParams = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)
Token sampling configuration parameters for the request.
stop
Optional list of stop expressions (see https://platform.openai.com/docs/api-reference/chat/create#chat-create-stop).
timestamp_ns
timestamp_ns: int = 0
The time (in nanoseconds) when the request was received by the server. This can be useful for performance monitoring and logging purposes.
tools
tools: list[TokenGeneratorRequestTool] | None = None
A list of tools that can be invoked during the generation process. This allows the model to utilize external functionalities or APIs to enhance its responses.
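A minimal sketch of a chat-style request built from the signature above; the model name is a placeholder, and plain dicts are assumed to be acceptable for TokenGeneratorRequestMessage entries.

```python
from max.pipelines.core import SamplingParams, TokenGeneratorRequest

request = TokenGeneratorRequest(
    id="req-001",
    index=0,
    model_name="example/chat-model",  # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
    logprobs=0,
    sampling_params=SamplingParams(temperature=0.7, max_new_tokens=128),
)
```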
TokenGeneratorRequestFunction
class max.pipelines.core.TokenGeneratorRequestFunction
description
description: str
name
name: str
parameters
parameters: dict
TokenGeneratorRequestMessage
class max.pipelines.core.TokenGeneratorRequestMessage
content
Content can be simple string or a list of message parts of different modalities.
For example:
{
  "role": "user",
  "content": "What's the weather like in Boston today?"
}
Or:
{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What's in this image?"
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
      }
    }
  ]
}
role
role: Literal['system', 'user', 'assistant']
TokenGeneratorRequestTool
class max.pipelines.core.TokenGeneratorRequestTool
function
function: TokenGeneratorRequestFunction
type
type: str
TokenGeneratorResponseFormat
class max.pipelines.core.TokenGeneratorResponseFormat
json_schema
json_schema: dict
type
type: str
msgpack_numpy_decoder()
max.pipelines.core.msgpack_numpy_decoder(type_, copy=True)
Create a decoder function for the specified type.
msgpack_numpy_encoder()
max.pipelines.core.msgpack_numpy_encoder()
Create an encoder function that handles numpy arrays.
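A hedged round-trip sketch; that the returned encoder/decoder are plain callables and that np.ndarray can be passed directly as the decoder's target type are assumptions.

```python
import numpy as np

from max.pipelines.core import msgpack_numpy_decoder, msgpack_numpy_encoder

encode = msgpack_numpy_encoder()            # callable that serializes numpy-bearing objects
decode = msgpack_numpy_decoder(np.ndarray)  # assumed: target type passed directly

payload = encode(np.arange(8, dtype=np.float32))
restored = decode(payload)                  # expected: the original array
```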