Python module

interfaces

Universal interfaces between all aspects of the MAX Inference Stack.

AudioGenerationResponse

class max.interfaces.AudioGenerationResponse(final_status, audio=None, buffer_speech_tokens=None)

Represents a response from the audio generation API.

Parameters:

  • final_status (GenerationStatus) – The final status of the generation process.
  • audio (ndarray | None) – The generated audio data, if available.
  • buffer_speech_tokens (ndarray | None) – Buffered speech tokens, if available.

audio

audio: ndarray | None

audio_data

property audio_data: ndarray

Returns the audio data if available.

Returns:

The generated audio data.

Return type:

np.ndarray

Raises:

AssertionError – If audio data is not available.

buffer_speech_tokens

buffer_speech_tokens: ndarray | None

final_status

final_status: GenerationStatus

has_audio_data

property has_audio_data: bool

Checks if audio data is present in the response.

Returns:

True if audio data is available, False otherwise.

Return type:

bool

is_done

property is_done: bool

Indicates whether the audio generation process is complete.

Returns:

True if generation is done, False otherwise.

Return type:

bool
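
The following is a minimal usage sketch: it constructs a response and reads the audio safely. The sine-wave samples are placeholder data, and the import path assumes these classes are exposed from max.interfaces as documented on this page.

```python
import numpy as np

from max.interfaces import AudioGenerationResponse, GenerationStatus

# Placeholder audio: one second of a 440 Hz sine wave at 16 kHz
# (stand-in data, not real model output).
samples = np.sin(np.linspace(0, 2 * np.pi * 440, 16_000)).astype(np.float32)

response = AudioGenerationResponse(
    final_status=GenerationStatus.END_OF_SEQUENCE,
    audio=samples,
)

# audio_data raises AssertionError when no audio is present,
# so guard with has_audio_data before accessing it.
if response.has_audio_data:
    print(response.audio_data.shape, response.is_done)
```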

EmbeddingsResponse

class max.interfaces.EmbeddingsResponse(embeddings)

Response structure for embedding generation.

Parameters:

embeddings (ndarray) – The generated embeddings as a NumPy array.

embeddings

embeddings: ndarray
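
A minimal construction sketch, assuming EmbeddingsResponse is importable from max.interfaces; the zero-filled array stands in for real model embeddings.

```python
import numpy as np

from max.interfaces import EmbeddingsResponse

# Placeholder embeddings: a batch of 2 vectors with 4 dimensions each.
response = EmbeddingsResponse(embeddings=np.zeros((2, 4), dtype=np.float32))
print(response.embeddings.shape)  # (2, 4)
```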

EngineResult

class max.interfaces.EngineResult(status, result)

Structure representing the result of an engine operation.

Parameters:

  • status (EngineStatus) – The status of the operation.
  • result (T | None) – The result data of the operation.

active()

classmethod active(result)

Create an EngineResult representing an active operation.

Parameters:

result (T) – The result data of the operation.

Returns:

An EngineResult with ACTIVE status and the provided result.

Return type:

EngineResult

cancelled()

classmethod cancelled()

Create an EngineResult representing a cancelled operation.

Returns:

An EngineResult with CANCELLED status and no result.

Return type:

EngineResult

complete()

classmethod complete(result)

Create an EngineResult representing a completed operation.

Parameters:

result (T) – The result data of the operation.

Returns:

An EngineResult with COMPLETE status and the provided result.

Return type:

EngineResult

result

result: T | None

status

status: EngineStatus

stop_stream

property stop_stream: bool

Determine whether the stream should stop based on the current status.

Returns:

True if the stream should stop, False otherwise.

Return type:

bool
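
The sketch below shows how a consumer loop might combine the classmethod constructors with stop_stream; the hard-coded list stands in for whatever iterable of results a real engine would produce.

```python
from max.interfaces import EngineResult

# Stand-in for a stream of results from an engine.
results = [
    EngineResult.active("partial output"),
    EngineResult.complete("final output"),
]

for engine_result in results:
    if engine_result.result is not None:
        print(engine_result.status, engine_result.result)
    # stop_stream is True for CANCELLED and COMPLETE: no further data.
    if engine_result.stop_stream:
        break
```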

EngineStatus

class max.interfaces.EngineStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Represents the status of an engine operation.

Status values:
  • ACTIVE – Indicates that the engine executed the operation successfully and the request remains active.
  • CANCELLED – Indicates that the request was cancelled before completion; no further data will be provided.
  • COMPLETE – Indicates that the engine executed the operation successfully and the request is completed.

ACTIVE

ACTIVE = 'active'

Indicates that the engine executed the operation successfully and the request remains active.

CANCELLED

CANCELLED = 'cancelled'

Indicates that the request was cancelled before completion; no further data will be provided.

COMPLETE

COMPLETE = 'complete'

Indicates that the request was previously finished and no further data should be streamed.

GenerationStatus

class max.interfaces.GenerationStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Enum representing the status of a generation process in the MAX API.

ACTIVE

ACTIVE = 'active'

The generation process is ongoing.

END_OF_SEQUENCE

END_OF_SEQUENCE = 'end_of_sequence'

The generation process has reached the end of the sequence.

MAXIMUM_LENGTH

MAXIMUM_LENGTH = 'maximum_length'

The generation process has reached the maximum allowed length.

is_done

property is_done: bool

Returns True if the generation process is complete (not ACTIVE).

Returns:

True if the status is not ACTIVE, indicating completion.

Return type:

bool
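
A small sketch illustrating is_done across the documented status values, assuming the enum is importable from max.interfaces.

```python
from max.interfaces import GenerationStatus

# is_done is True for every status except ACTIVE.
for status in (
    GenerationStatus.ACTIVE,
    GenerationStatus.END_OF_SEQUENCE,
    GenerationStatus.MAXIMUM_LENGTH,
):
    print(status.value, status.is_done)
# active False
# end_of_sequence True
# maximum_length True
```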

LogProbabilities

class max.interfaces.LogProbabilities(token_log_probabilities=<factory>, top_log_probabilities=<factory>)

Log probabilities for an individual output token.

This is a data-only class that serves as a serializable data structure for transferring log probability information. It does not provide any functionality for calculating or manipulating log probabilities - it is purely for data storage and serialization purposes.

Parameters:

  • token_log_probabilities (list[float]) – Log probabilities of each output token.
  • top_log_probabilities (list[dict[int, float]]) – Top tokens and their corresponding log probabilities.

token_log_probabilities

token_log_probabilities: list[float]

top_log_probabilities

top_log_probabilities: list[dict[int, float]]
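
A minimal construction sketch; the token IDs and log-probability values below are placeholders, not real model output.

```python
from max.interfaces import LogProbabilities

# Two generated tokens, each with its chosen log probability and the
# top-2 candidate tokens at that step (placeholder IDs and values).
log_probs = LogProbabilities(
    token_log_probabilities=[-0.11, -1.62],
    top_log_probabilities=[
        {42: -0.11, 7: -2.35},   # step 1: token 42 was sampled
        {99: -0.70, 42: -1.62},  # step 2: token 42 was sampled again
    ],
)
print(log_probs.token_log_probabilities)
```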

PipelineTask

class max.interfaces.PipelineTask(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Enum representing the types of pipeline tasks supported.

AUDIO_GENERATION

AUDIO_GENERATION = 'audio_generation'

Task for generating audio.

EMBEDDINGS_GENERATION

EMBEDDINGS_GENERATION = 'embeddings_generation'

Task for generating embeddings.

SPEECH_TOKEN_GENERATION

SPEECH_TOKEN_GENERATION = 'speech_token_generation'

Task for generating speech tokens.

TEXT_GENERATION

TEXT_GENERATION = 'text_generation'

Task for generating text.

output_type

property output_type: type

Get the output type for the pipeline task.

Returns:

The output type for the pipeline task.

Return type:

type
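
A quick inspection sketch; the concrete classes behind output_type are not spelled out on this page, so the loop simply prints whatever mapping the enum provides.

```python
from max.interfaces import PipelineTask

# Print the response type associated with each supported task.
for task in PipelineTask:
    print(task.value, "->", task.output_type.__name__)
```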

SamplingParams

class max.interfaces.SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

Request-specific sampling parameters that are only known at run time.

detokenize

detokenize: bool = True

Whether to detokenize the output tokens into text.

frequency_penalty

frequency_penalty: float = 0.0

The frequency penalty to apply to the model’s output. A positive value will penalize new tokens based on their frequency in the generated text: tokens will receive a penalty proportional to the count of appearances.

ignore_eos

ignore_eos: bool = False

If True, the response will ignore the EOS token and continue generating until the maximum number of tokens is reached or a stop string is hit.

max_new_tokens

max_new_tokens: int | None = None

The maximum number of new tokens to generate in the response. If not set, the model may generate tokens until it reaches its internal limits or based on other stopping criteria.

min_new_tokens

min_new_tokens: int = 0

The minimum number of tokens to generate in the response.

min_p

min_p: float = 0.0

Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.

presence_penalty

presence_penalty: float = 0.0

The presence penalty to apply to the model’s output. A positive value will penalize new tokens that have already appeared in the generated text at least once by applying a constant penalty.

repetition_penalty

repetition_penalty: float = 1.0

The repetition penalty to apply to the model’s output. Values > 1 will penalize new tokens that have already appeared in the generated text at least once by dividing the logits by the repetition penalty.

seed

seed: int = 0

The seed to use for the random number generator.

stop

stop: list[str] | None = None

A list of detokenized sequences that can be used as stop criteria when generating a new sequence.

stop_token_ids

stop_token_ids: list[int] | None = None

A list of token ids that are used as stopping criteria when generating a new sequence.

temperature

temperature: float = 1

Controls the randomness of the model’s output; higher values produce more diverse responses.

top_k

top_k: int = 1

Limits the sampling to the K most probable tokens. This defaults to 1, which enables greedy sampling.

top_p

top_p: float = 1

Only use the tokens whose cumulative probability is within the top_p threshold. This applies to the top_k tokens.
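
A minimal construction sketch; the specific values below are illustrative, not tuning recommendations.

```python
from max.interfaces import SamplingParams

# Nucleus sampling with a length cap, a stop string, and a fixed seed.
params = SamplingParams(
    top_k=40,
    top_p=0.9,
    temperature=0.7,
    max_new_tokens=256,
    stop=["\n\n"],
    seed=42,
)
print(params.top_k, params.max_new_tokens)
```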

TextGenerationResponse

class max.interfaces.TextGenerationResponse(tokens, final_status)

Response structure for text generation.

Parameters:

  • tokens (list[TextResponse]) – The list of token responses generated so far.
  • final_status (GenerationStatus) – The final status of the generation process.

append_token()

append_token(token)

Appends a token response to the list of generated tokens.

Parameters:

token (TextResponse)

Return type:

None

final_status

final_status: GenerationStatus

is_done

property is_done: bool

Indicates whether the text generation process is complete.

tokens

tokens: list[TextResponse]

update_status()

update_status(status)

Updates the final status of the generation process.

Parameters:

status (GenerationStatus)

Return type:

None
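
The sketch below accumulates streamed tokens and then marks the response finished. It assumes is_done reflects the updated final_status; the token strings are placeholders, and TextResponse is documented in the next entry.

```python
from max.interfaces import (
    GenerationStatus,
    TextGenerationResponse,
    TextResponse,
)

response = TextGenerationResponse(tokens=[], final_status=GenerationStatus.ACTIVE)

# Append tokens as they stream in (placeholder strings).
response.append_token(TextResponse(next_token="Hello"))
response.append_token(TextResponse(next_token=","))

# Mark the generation finished once the model signals end of sequence.
response.update_status(GenerationStatus.END_OF_SEQUENCE)
print(len(response.tokens), response.is_done)  # presumably: 2 True
```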

TextResponse

class max.interfaces.TextResponse(next_token, log_probabilities=None)

A base class for model responses, specifically for text model variants.

Parameters:

  • next_token (int | str) – Encoded predicted next token.
  • log_probabilities (LogProbabilities | None) – Log probabilities of each output token.

log_probabilities

log_probabilities: LogProbabilities | None

next_token

next_token: int | str

TokenGenerator

class max.interfaces.TokenGenerator(*args, **kwargs)

Interface for LLM token-generator models.

next_token()

next_token(batch, num_steps)

Computes the next token response for a single batch.

Parameters:

  • batch (dict[str, T]) – Batch of contexts.
  • num_steps (int) – Number of tokens to generate.

Returns:

List of encoded responses (indexed by request ID).

Return type:

list[dict[str, TextResponse]]

release()

release(context)

Releases resources associated with this context.

Parameters:

context (T) – Finished context.

Return type:

None
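
To close, a stub that matches the interface shape described above. EchoGenerator is hypothetical, the sketch assumes TokenGenerator can be used as a base class, and a real implementation would run model inference inside next_token().

```python
from max.interfaces import TextResponse, TokenGenerator

class EchoGenerator(TokenGenerator):
    """Hypothetical stub: emits token ID 0 for every request at every step."""

    def next_token(self, batch, num_steps):
        # One dict of responses per generation step, keyed by request ID.
        return [
            {req_id: TextResponse(next_token=0) for req_id in batch}
            for _ in range(num_steps)
        ]

    def release(self, context):
        # No per-context resources to free in this stub.
        pass
```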