Python module

interfaces

Universal interfaces between all aspects of the MAX Inference Stack.

AudioGenerationResponse

class max.interfaces.AudioGenerationResponse(final_status, audio=None, buffer_speech_tokens=None)

Represents a response from the audio generation API.

Parameters:

  • final_status (GenerationStatus) – The final status of the generation process.
  • audio (ndarray | None) – The generated audio data, if available.
  • buffer_speech_tokens (ndarray | None) – Buffered speech tokens, if available.

audio

audio: ndarray | None

audio_data

property audio_data: ndarray

Returns the audio data if available.

Returns:

The generated audio data.

Return type:

np.ndarray

Raises:

AssertionError – If audio data is not available.

buffer_speech_tokens

buffer_speech_tokens: ndarray | None

final_status

final_status: GenerationStatus

has_audio_data

property has_audio_data: bool

Checks if audio data is present in the response.

Returns:

True if audio data is available, False otherwise.

Return type:

bool

is_done

property is_done: bool

Indicates whether the audio generation process is complete.

Returns:

True if generation is done, False otherwise.

Return type:

bool
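
The following is a minimal usage sketch: it constructs a response and reads the audio safely. The sine-wave samples are placeholder data, and the import path assumes these classes are exposed from max.interfaces as documented on this page.

```python
import numpy as np

from max.interfaces import AudioGenerationResponse, GenerationStatus

# Placeholder audio: one second of a 440 Hz sine wave at 16 kHz
# (stand-in data, not real model output).
samples = np.sin(np.linspace(0, 2 * np.pi * 440, 16_000)).astype(np.float32)

response = AudioGenerationResponse(
    final_status=GenerationStatus.END_OF_SEQUENCE,
    audio=samples,
)

# audio_data raises AssertionError when no audio is present,
# so guard with has_audio_data before accessing it.
if response.has_audio_data:
    print(response.audio_data.shape, response.is_done)
```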

EmbeddingsResponse

class max.interfaces.EmbeddingsResponse(embeddings)

Response structure for embedding generation.

Parameters:

embeddings (ndarray) – The generated embeddings as a NumPy array.

embeddings

embeddings: ndarray
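
A minimal construction sketch, assuming EmbeddingsResponse is importable from max.interfaces; the zero-filled array stands in for real model embeddings.

```python
import numpy as np

from max.interfaces import EmbeddingsResponse

# Placeholder embeddings: a batch of 2 vectors with 4 dimensions each.
response = EmbeddingsResponse(embeddings=np.zeros((2, 4), dtype=np.float32))
print(response.embeddings.shape)  # (2, 4)
```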

EngineResult

class max.interfaces.EngineResult(status, result)

Structure representing the result of an engine operation.

Parameters:

  • status (EngineStatus) – The status of the operation.
  • result (T | None) – The result data of the operation.

active()

classmethod active(result)

Create an EngineResult representing an active operation.

Parameters:

result (T) – The result data of the operation.

Returns:

An EngineResult with ACTIVE status and the provided result.

Return type:

EngineResult

cancelled()

classmethod cancelled()

Create an EngineResult representing a cancelled operation.

Returns:

An EngineResult with CANCELLED status and no result.

Return type:

EngineResult

complete()

classmethod complete(result)

Create an EngineResult representing a completed operation.

Parameters:

result (T) – The result data of the operation.

Returns:

An EngineResult with COMPLETE status and the provided result.

Return type:

EngineResult

result

result: T | None

status

status: EngineStatus

stop_stream

property stop_stream: bool

Determine whether the stream should stop based on the current status.

Returns:

True if the stream should stop, False otherwise.

Return type:

bool
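
The sketch below shows how a consumer loop might combine the classmethod constructors with stop_stream; the hard-coded list stands in for whatever iterable of results a real engine would produce.

```python
from max.interfaces import EngineResult

# Stand-in for a stream of results from an engine.
results = [
    EngineResult.active("partial output"),
    EngineResult.complete("final output"),
]

for engine_result in results:
    if engine_result.result is not None:
        print(engine_result.status, engine_result.result)
    # stop_stream is True for CANCELLED and COMPLETE: no further data.
    if engine_result.stop_stream:
        break
```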

EngineStatus

class max.interfaces.EngineStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Represents the status of an engine operation.

Status values:
  • ACTIVE – Indicates that the engine executed the operation successfully and the request remains active.
  • CANCELLED – Indicates that the request was cancelled before completion; no further data will be provided.
  • COMPLETE – Indicates that the engine executed the operation successfully and the request is completed.

ACTIVE

ACTIVE = 'active'

Indicates that the engine executed the operation successfully and the request remains active.

CANCELLED

CANCELLED = 'cancelled'

Indicates that the request was cancelled before completion; no further data will be provided.

COMPLETE

COMPLETE = 'complete'

Indicates that the request was previously finished and no further data should be streamed.

GenerationStatus

class max.interfaces.GenerationStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Enum representing the status of a generation process in the MAX API.

ACTIVE

ACTIVE = 'active'

The generation process is ongoing.

END_OF_SEQUENCE

END_OF_SEQUENCE = 'end_of_sequence'

The generation process has reached the end of the sequence.

MAXIMUM_LENGTH

MAXIMUM_LENGTH = 'maximum_length'

The generation process has reached the maximum allowed length.

is_done

property is_done: bool

Returns True if the generation process is complete (not ACTIVE).

Returns:

True if the status is not ACTIVE, indicating completion.

Return type:

bool
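
A small sketch illustrating is_done across the documented status values, assuming the enum is importable from max.interfaces.

```python
from max.interfaces import GenerationStatus

# is_done is True for every status except ACTIVE.
for status in (
    GenerationStatus.ACTIVE,
    GenerationStatus.END_OF_SEQUENCE,
    GenerationStatus.MAXIMUM_LENGTH,
):
    print(status.value, status.is_done)
# active False
# end_of_sequence True
# maximum_length True
```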

LogProbabilities

class max.interfaces.LogProbabilities(token_log_probabilities=<factory>, top_log_probabilities=<factory>)

Log probabilities for an individual output token.

This is a data-only class that serves as a serializable data structure for transferring log probability information. It does not provide any functionality for calculating or manipulating log probabilities - it is purely for data storage and serialization purposes.

Parameters:

  • token_log_probabilities (list[float]) – Log probabilities of each output token.
  • top_log_probabilities (list[dict[int, float]]) – Top tokens and their corresponding log probabilities.

token_log_probabilities

token_log_probabilities: list[float]

top_log_probabilities

top_log_probabilities: list[dict[int, float]]
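
A minimal construction sketch; the token IDs and log-probability values below are placeholders, not real model output.

```python
from max.interfaces import LogProbabilities

# Two generated tokens, each with its chosen log probability and the
# top-2 candidate tokens at that step (placeholder IDs and values).
log_probs = LogProbabilities(
    token_log_probabilities=[-0.11, -1.62],
    top_log_probabilities=[
        {42: -0.11, 7: -2.35},   # step 1: token 42 was sampled
        {99: -0.70, 42: -1.62},  # step 2: token 42 was sampled again
    ],
)
print(log_probs.token_log_probabilities)
```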

PipelineTask

class max.interfaces.PipelineTask(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Enum representing the types of pipeline tasks supported.

AUDIO_GENERATION

AUDIO_GENERATION = 'audio_generation'

Task for generating audio.

EMBEDDINGS_GENERATION

EMBEDDINGS_GENERATION = 'embeddings_generation'

Task for generating embeddings.

SPEECH_TOKEN_GENERATION

SPEECH_TOKEN_GENERATION = 'speech_token_generation'

Task for generating speech tokens.

TEXT_GENERATION

TEXT_GENERATION = 'text_generation'

Task for generating text.

output_type

property output_type: type

Get the output type for the pipeline task.

Returns:

The output type for the pipeline task.

Return type:

type
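
A quick inspection sketch; the concrete classes behind output_type are not spelled out on this page, so the loop simply prints whatever mapping the enum provides.

```python
from max.interfaces import PipelineTask

# Print the response type associated with each supported task.
for task in PipelineTask:
    print(task.value, "->", task.output_type.__name__)
```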

SamplingParams

class max.interfaces.SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

Request-specific sampling parameters that are only known at run time.

detokenize

detokenize: bool = True

Whether to detokenize the output tokens into text.

frequency_penalty

frequency_penalty: float = 0.0

The frequency penalty to apply to the model’s output. A positive value will penalize new tokens based on their frequency in the generated text: tokens will receive a penalty proportional to the count of appearances.

ignore_eos

ignore_eos: bool = False

If True, the response will ignore the EOS token and continue generating until the maximum number of tokens is reached or a stop string is hit.

max_new_tokens

max_new_tokens: int | None = None

The maximum number of new tokens to generate in the response. If not set, the model may generate tokens until it reaches its internal limits or based on other stopping criteria.

min_new_tokens

min_new_tokens: int = 0

The minimum number of tokens to generate in the response.

min_p

min_p: float = 0.0

Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.

presence_penalty

presence_penalty: float = 0.0

The presence penalty to apply to the model’s output. A positive value will penalize new tokens that have already appeared in the generated text at least once by applying a constant penalty.

repetition_penalty

repetition_penalty: float = 1.0

The repetition penalty to apply to the model’s output. Values > 1 will penalize new tokens that have already appeared in the generated text at least once by dividing the logits by the repetition penalty.

seed

seed: int = 0

The seed to use for the random number generator.

stop

stop: list[str] | None = None

A list of detokenized sequences that can be used as stop criteria when generating a new sequence.

stop_token_ids

stop_token_ids: list[int] | None = None

A list of token ids that are used as stopping criteria when generating a new sequence.

temperature

temperature: float = 1

Controls the randomness of the model’s output; higher values produce more diverse responses.

top_k

top_k: int = 1

Limits the sampling to the K most probable tokens. This defaults to 1, which enables greedy sampling.

top_p

top_p: float = 1

Only use the tokens whose cumulative probability is within the top_p threshold. This applies to the top_k tokens.
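
A minimal construction sketch; the specific values below are illustrative, not tuning recommendations.

```python
from max.interfaces import SamplingParams

# Nucleus sampling with a length cap, a stop string, and a fixed seed.
params = SamplingParams(
    top_k=40,
    top_p=0.9,
    temperature=0.7,
    max_new_tokens=256,
    stop=["\n\n"],
    seed=42,
)
print(params.top_k, params.max_new_tokens)
```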

TextGenerationResponse

class max.interfaces.TextGenerationResponse(tokens, final_status)

Response structure for text generation.

Parameters:

  • tokens (list[TextResponse]) – The list of token responses generated so far.
  • final_status (GenerationStatus) – The final status of the generation process.

append_token()

append_token(token)

Appends a token response to the list of generated tokens.

Parameters:

token (TextResponse)

Return type:

None

final_status

final_status: GenerationStatus

is_done

property is_done: bool

Indicates whether the text generation process is complete.

tokens

tokens: list[TextResponse]

update_status()

update_status(status)

Updates the final status of the generation process.

Parameters:

status (GenerationStatus)

Return type:

None
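
The sketch below accumulates streamed tokens and then marks the response finished. It assumes is_done reflects the updated final_status; the token strings are placeholders, and TextResponse is documented in the next entry.

```python
from max.interfaces import (
    GenerationStatus,
    TextGenerationResponse,
    TextResponse,
)

response = TextGenerationResponse(tokens=[], final_status=GenerationStatus.ACTIVE)

# Append tokens as they stream in (placeholder strings).
response.append_token(TextResponse(next_token="Hello"))
response.append_token(TextResponse(next_token=","))

# Mark the generation finished once the model signals end of sequence.
response.update_status(GenerationStatus.END_OF_SEQUENCE)
print(len(response.tokens), response.is_done)  # presumably: 2 True
```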

TextResponse

class max.interfaces.TextResponse(next_token, log_probabilities=None)

A base class for model responses, specifically for text model variants.

Parameters:

  • next_token (int | str) – Encoded predicted next token.
  • log_probabilities (LogProbabilities | None) – Log probabilities of each output token.

log_probabilities

log_probabilities: LogProbabilities | None

next_token

next_token: int | str

TokenGenerator

class max.interfaces.TokenGenerator(*args, **kwargs)

Interface for LLM token-generator models.

next_token()

next_token(batch, num_steps)

Computes the next token response for a single batch.

Parameters:

  • batch (dict[str, T]) – Batch of contexts.
  • num_steps (int) – Number of tokens to generate.

Returns:

List of encoded responses (indexed by request ID).

Return type:

list[dict[str, TextResponse]]

release()

release(context)

Releases resources associated with this context.

Parameters:

context (T) – Finished context.

Return type:

None
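
To close, a stub that matches the interface shape described above. EchoGenerator is hypothetical, the sketch assumes TokenGenerator can be used as a base class, and a real implementation would run model inference inside next_token().

```python
from max.interfaces import TextResponse, TokenGenerator

class EchoGenerator(TokenGenerator):
    """Hypothetical stub: emits token ID 0 for every request at every step."""

    def next_token(self, batch, num_steps):
        # One dict of responses per generation step, keyed by request ID.
        return [
            {req_id: TextResponse(next_token=0) for req_id in batch}
            for _ in range(num_steps)
        ]

    def release(self, context):
        # No per-context resources to free in this stub.
        pass
```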