Python module

interfaces

Universal interfaces between all aspects of the MAX Inference Stack.

AudioGenerationMetadata

class max.interfaces.AudioGenerationMetadata(*, sample_rate=None, duration=None, chunk_id=None, timestamp=None, final_chunk=None, model_name=None, request_id=None, tokens_generated=None, processing_time=None, echo=None)

Represents metadata associated with audio generation.

This class will eventually replace the metadata dictionary used throughout the AudioGenerationOutput object, providing a structured and type-safe alternative for audio generation metadata.

Parameters:

  • sample_rate (int | None) – The sample rate of the generated audio in Hz.
  • duration (float | None) – The duration of the generated audio in seconds.
  • chunk_id (int | None) – Identifier for the audio chunk (useful for streaming).
  • timestamp (str | None) – Timestamp when the audio was generated.
  • final_chunk (bool | None) – Whether this is the final chunk in a streaming sequence.
  • model_name (str | None) – Name of the model used for generation.
  • request_id (str | None) – Unique identifier for the generation request.
  • tokens_generated (int | None) – Number of tokens generated for this audio.
  • processing_time (float | None) – Time taken to process this audio chunk in seconds.
  • echo (str | None) – Echo of the input prompt or identifier for verification.

chunk_id

chunk_id: int | None

duration

duration: float | None

echo

echo: str | None

final_chunk

final_chunk: bool | None

model_name

model_name: str | None

processing_time

processing_time: float | None

request_id

request_id: str | None

sample_rate

sample_rate: int | None

timestamp

timestamp: str | None

to_dict()

to_dict()

Convert the metadata to a dictionary format.

Returns:

Dictionary representation of the metadata.

Return type:

dict[str, Any]

tokens_generated

tokens_generated: int | None
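
The snippet below is a minimal sketch of building metadata for a streaming chunk and serializing it with to_dict(); the model name and request ID are placeholder values.

from max.interfaces import AudioGenerationMetadata

# All fields are optional keyword arguments; unset fields default to None.
meta = AudioGenerationMetadata(
    sample_rate=24_000,        # Hz
    duration=1.5,              # seconds
    chunk_id=0,
    final_chunk=False,
    model_name="example-tts",  # placeholder model name
    request_id="req-123",      # placeholder request ID
)

# to_dict() flattens the metadata into a plain dictionary, e.g. for logging.
print(meta.to_dict())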

AudioGenerationRequest

class max.interfaces.AudioGenerationRequest(request_id: str, index: 'int', model: 'str', lora: 'str | None' = None, input: 'Optional[str]' = None, audio_prompt_tokens: 'list[int]' = <factory>, audio_prompt_transcription: 'str' = '', sampling_params: 'SamplingParams' = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0), _assistant_message_override: 'str | None' = None, prompt: 'Optional[list[int] | str]' = None, streaming: 'bool' = True, buffer_speech_tokens: 'np.ndarray | None' = None)

Parameters:

audio_prompt_tokens

audio_prompt_tokens: list[int]

The prompt speech IDs to use for audio generation.

audio_prompt_transcription

audio_prompt_transcription: str = ''

The audio prompt transcription to use for audio generation.

buffer_speech_tokens

buffer_speech_tokens: np.ndarray | None = None

An optional field potentially containing the last N speech tokens generated by the model from a previous request.

When this field is specified, this tensor is used to buffer the tokens sent to the audio decoder.

index

index: int

The sequence order of this request within a batch. This is useful for maintaining the order of requests when processing multiple requests simultaneously, ensuring that responses can be matched back to their corresponding requests accurately.

input

input: str | None = None

The text to generate audio for. The maximum length is 4096 characters.

lora

lora: str | None = None

The name of the LoRA to be used for generating audio chunks. This should match the available models on the server and determines the behavior and capabilities of the response generation.

model

model: str

The name of the model to be used for generating audio chunks. This should match the available models on the server and determines the behavior and capabilities of the response generation.

prompt

prompt: list[int] | str | None = None

Optionally provide a preprocessed list of token ids or a prompt string to pass as input directly into the model. This replaces automatically generating TokenGeneratorRequestMessages from the input, audio prompt tokens, and audio prompt transcription fields.

sampling_params

sampling_params: SamplingParams = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

Request sampling configuration options.

streaming

streaming: bool = True

Whether to stream the audio generation.
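
A minimal sketch of constructing a streaming request; the model name is a placeholder and must match a model available on the server.

from max.interfaces import AudioGenerationRequest, SamplingParams

request = AudioGenerationRequest(
    request_id="req-123",
    index=0,
    model="example-tts",                        # placeholder model name
    input="Hello from the MAX audio pipeline.",
    sampling_params=SamplingParams(top_k=10, temperature=0.8),
    streaming=True,
)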

AudioGenerationResponse

class max.interfaces.AudioGenerationResponse(final_status, audio=None, buffer_speech_tokens=None)

Represents a response from the audio generation API.

This class encapsulates the result of an audio generation request, including the final status, generated audio data, and optional buffered speech tokens.

Parameters:

audio

audio: ndarray | None

The generated audio data, if available.

audio_data

property audio_data: ndarray

Returns the audio data if available.

Returns:

The generated audio data.

Return type:

ndarray

Raises:

AssertionError – If audio data is not available.

buffer_speech_tokens

buffer_speech_tokens: ndarray | None

Buffered speech tokens, if available.

final_status

final_status: GenerationStatus

The final status of the generation process.

has_audio_data

property has_audio_data: bool

Checks if audio data is present in the response.

Returns:

True if audio data is available, False otherwise.

Return type:

bool

is_done

property is_done: bool

Indicates whether the audio generation process is complete.

Returns:

True if generation is done, False otherwise.

Return type:

bool
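
A short sketch of constructing and inspecting a response; the audio buffer here is a placeholder NumPy array.

import numpy as np

from max.interfaces import AudioGenerationResponse, GenerationStatus

response = AudioGenerationResponse(
    final_status=GenerationStatus.END_OF_SEQUENCE,
    audio=np.zeros(24_000, dtype=np.float32),  # placeholder audio buffer
)

if response.has_audio_data:
    samples = response.audio_data  # raises AssertionError when audio is absent
assert response.is_done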

AudioGenerator

class max.interfaces.AudioGenerator(*args, **kwargs)

Interface for audio generation models.

decoder_sample_rate

property decoder_sample_rate: int

The sample rate of the decoder.

next_chunk()

next_chunk(batch)

Computes the next audio chunk for a single batch.

The new speech tokens are saved to the context. The most recently generated audio is returned through the AudioGenerationResponse.

Parameters:

batch (dict[str, AudioGeneratorContext]) – Batch of contexts.

Returns:

Dictionary mapping request IDs to audio generation responses.

Return type:

dict[str, AudioGenerationResponse]

prev_num_steps

property prev_num_steps: int

The number of speech tokens that were generated during the processing of the previous batch.

release()

release(context)

Releases resources associated with this context.

Parameters:

context (AudioGeneratorContext) – Finished context.

Return type:

None
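
Assuming generator is an object implementing this interface and batch maps request IDs to contexts created elsewhere, a single decode step might look like the hedged sketch below (handle_audio is a hypothetical callback).

def run_audio_step(generator, batch, handle_audio):
    # `generator` implements AudioGenerator; `batch` maps request IDs to contexts;
    # `handle_audio` is a hypothetical callback for consuming audio chunks.
    responses = generator.next_chunk(batch)
    for request_id, response in responses.items():
        if response.has_audio_data:
            handle_audio(request_id, response.audio_data)
        if response.is_done:
            # Release per-request resources once generation finishes.
            generator.release(batch[request_id])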

AudioGeneratorOutput

class max.interfaces.AudioGeneratorOutput(audio_data, metadata, is_done, buffer_speech_tokens=None)

Represents the output of an audio generation step.

Parameters:

audio_data

audio_data: ndarray

The generated audio data as a NumPy array.

buffer_speech_tokens

buffer_speech_tokens: ndarray | None

An optional field containing the last N speech tokens generated by the model. This can be used to buffer speech tokens for a follow-up request, enabling seamless continuation of audio generation.

is_done

is_done: bool

Indicates whether the audio generation is complete (True) or if more chunks are expected (False).

metadata

metadata: AudioGenerationMetadata

Metadata associated with the audio generation, such as chunk information, prompt details, or other relevant context.

EmbeddingsGenerator

class max.interfaces.EmbeddingsGenerator(*args, **kwargs)

Interface for LLM embeddings-generator models.

encode()

encode(batch)

Computes embeddings for a batch of inputs.

Parameters:

batch (dict[str, EmbeddingsGeneratorContext]) – Batch of contexts to generate embeddings for.

Returns:

Dictionary mapping request IDs to their corresponding embeddings. Each embedding is typically a numpy array or tensor of floating point values.

Return type:

dict[str, Any]
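
Assuming embedder implements this interface and batch maps request IDs to contexts prepared by the pipeline's tokenizer, usage is a single call, sketched below.

def embed_batch(embedder, batch):
    # Returns a dict mapping request IDs to their embeddings.
    results = embedder.encode(batch)
    for request_id, embedding in results.items():
        print(request_id, type(embedding))
    return results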

EmbeddingsOutput

class max.interfaces.EmbeddingsOutput(embeddings)

Response structure for embedding generation.

Parameters:

embeddings (ndarray) – The generated embeddings as a NumPy array.

embeddings

embeddings: ndarray

is_done

property is_done: bool

Indicates whether the embedding generation process is complete.

Returns:

Always True, as embedding generation is a single-step operation.

Return type:

bool

GenerationStatus

class max.interfaces.GenerationStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Enum representing the status of a generation process in the MAX API.

ACTIVE

ACTIVE = 'active'

The generation process is ongoing.

END_OF_SEQUENCE

END_OF_SEQUENCE = 'end_of_sequence'

The generation process has reached the end of the sequence.

MAXIMUM_LENGTH

MAXIMUM_LENGTH = 'maximum_length'

The generation process has reached the maximum allowed length.

is_done

property is_done: bool

Returns True if the generation process is complete (not ACTIVE).

Returns:

True if the status is not ACTIVE, indicating completion.

Return type:

bool
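
A quick sketch of the is_done semantics:

from max.interfaces import GenerationStatus

assert not GenerationStatus.ACTIVE.is_done
assert GenerationStatus.END_OF_SEQUENCE.is_done
assert GenerationStatus.MAXIMUM_LENGTH.is_done  # any non-ACTIVE status is done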

InputContext

class max.interfaces.InputContext(*args, **kwargs)

A base class for model contexts, representing model inputs for TokenGenerators.

Token array layout:

.                      +---------- full prompt ----------+   CHUNK_SIZE*N v
. +--------------------+---------------+-----------------+----------------+
. |     completed      |  next_tokens  |                 |  preallocated  |
. +--------------------+---------------+-----------------+----------------+
.            start_idx ^    active_idx ^         end_idx ^
  • completed: The tokens that have already been processed and encoded.
  • next_tokens: The tokens that will be processed in the next iteration. This may be a subset of the full prompt due to chunked prefill.
  • preallocated: The token slots that have been preallocated. The token array resizes to multiples of CHUNK_SIZE to accommodate the new tokens.

active_idx

property active_idx: int

active_length

property active_length: int

Current sequence length: the number of tokens input this iteration.

This will be the prompt size for context encoding, and 1 for token generation.

all_tokens

property all_tokens: ndarray

All prompt and generated tokens in the context.

assign_to_cache()

assign_to_cache(cache_seq_id)

Assigns the context to a cache slot.

Parameters:

cache_seq_id (int)

Return type:

None

bump_token_indices()

bump_token_indices(start_idx=0, active_idx=0, end_idx=0, committed_idx=0)

Update the start_idx, active_idx and end_idx without manipulating the token array.

Parameters:

  • start_idx (int)
  • active_idx (int)
  • end_idx (int)
  • committed_idx (int)

Return type:

None

cache_seq_id

property cache_seq_id: int

Returns the cache slot assigned to the context, raising an error if not assigned.

committed_idx

property committed_idx: int

compute_num_available_steps()

compute_num_available_steps(max_seq_len)

Compute the max number of steps we can execute for a given context without exceeding the max_seq_len.

Parameters:

max_seq_len (int)

Return type:

int

current_length

property current_length: int

The current length of the sequence, including completed and active tokens.

end_idx

property end_idx: int

eos_token_ids

property eos_token_ids: set[int]

generated_tokens

property generated_tokens: ndarray

All generated tokens in the context.

get_min_token_logit_mask()

get_min_token_logit_mask(num_steps)

Returns a set of indices for the tokens in the output that should be masked.

This is primarily used for the min_tokens setting, where we mask eos tokens in the logits to avoid generating them before we reach min_tokens.

Parameters:

num_steps (int)

Return type:

list[ndarray[Any, dtype[int32]]]

is_assigned_to_cache

property is_assigned_to_cache: bool

Returns True if input is assigned to a cache slot, False otherwise.

is_ce

property is_ce: bool

Returns True if the context is a context encoding context, False otherwise.

is_done

property is_done: bool

is_initial_prompt

property is_initial_prompt: bool

Returns True if the context has not been updated with tokens.

json_schema

property json_schema: str | None

A json schema to use during constrained decoding.

jump_ahead()

jump_ahead(new_token)

Updates the token array, while ensuring the new token is returned to the user.

Parameters:

new_token (int)

Return type:

None

log_probabilities

property log_probabilities: int

When > 0, returns the log probabilities for the top N tokens at each position in the sequence.

log_probabilities_echo

property log_probabilities_echo: bool

When True, the input tokens are added to the returned logprobs.

matcher

property matcher: Any | None

An optional Grammar Matcher provided when using structured output.

max_length

property max_length: int | None

The maximum length of this sequence.

min_tokens

property min_tokens: int

The minimum number of new tokens to generate.

next_tokens

property next_tokens: ndarray

The next prompt tokens to be input during this iteration.

This should be a 1D array of tokens of length active_length.

outstanding_completion_tokens()

outstanding_completion_tokens()

Return the list of outstanding completion tokens and log probabilities that must be returned to the user.

Return type:

list[tuple[int, LogProbabilities | None]]

prompt_tokens

property prompt_tokens: ndarray

Prompt tokens in the context.

request_id

property request_id: str

reset()

reset()

Resets the context's state by combining all tokens into a new prompt. This method is used when a request is evicted, meaning that the context needs to be re-encoded in the following CE iteration.

Return type:

None

rollback()

rollback(idx)

Rollback and remove the last idx tokens.

Parameters:

idx (int)

Return type:

None

sampling_params

property sampling_params: SamplingParams

Returns the per-request sampling configuration.

set_draft_offset()

set_draft_offset(idx)

Parameters:

idx (int)

Return type:

None

set_matcher()

set_matcher(matcher)

Set a grammar matcher for use during constrained decoding.

Parameters:

matcher (Any)

Return type:

None

set_token_indices()

set_token_indices(start_idx=None, active_idx=None, end_idx=None, committed_idx=None)

Set the token indices without manipulating the token array.

Parameters:

  • start_idx (int | None)
  • active_idx (int | None)
  • end_idx (int | None)
  • committed_idx (int | None)

Return type:

None

start_idx

property start_idx: int

status

property status: GenerationStatus

tokens

property tokens: ndarray

All tokens (including padded tokens) in the context. In most scenarios, use all_tokens to get the active full token array.

unassign_from_cache()

unassign_from_cache()

Unassigns the context from a cache slot.

Return type:

None

update()

update(new_token, log_probabilities=None)

Updates the next_tokens and extends existing tokens to include all generated tokens.

Parameters:

  • new_token (int)
  • log_probabilities (LogProbabilities | None)

Return type:

None

update_status()

update_status(status)

Parameters:

status (GenerationStatus)

Return type:

None

LogProbabilities

class max.interfaces.LogProbabilities(token_log_probabilities, top_log_probabilities)

Log probabilities for an individual output token.

This is a data-only class that serves as a serializable data structure for transferring log probability information. It does not provide any functionality for calculating or manipulating log probabilities - it is purely for data storage and serialization purposes.

Parameters:

token_log_probabilities

token_log_probabilities: list[float]

Probabilities of each token.

top_log_probabilities

top_log_probabilities: list[dict[int, float]]

Top tokens and their corresponding probabilities.
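
A minimal sketch of constructing the structure; the token IDs and log-probability values below are illustrative.

from max.interfaces import LogProbabilities

logprobs = LogProbabilities(
    token_log_probabilities=[-0.12, -1.37],        # one entry per generated token
    top_log_probabilities=[{42: -0.12, 7: -2.50},  # top-N candidates per position
                           {99: -1.37, 13: -1.90}],
)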

PipelineOutput

class max.interfaces.PipelineOutput(*args, **kwargs)

Abstract base class representing the output of a pipeline operation.

Subclasses must implement the is_done property to indicate whether the pipeline operation has completed.

is_done

property is_done: bool

Indicates whether the pipeline operation has completed.

Returns:

True if the operation is done, False otherwise.

Return type:

bool

PipelineTask

class max.interfaces.PipelineTask(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Enum representing the types of pipeline tasks supported.

AUDIO_GENERATION

AUDIO_GENERATION = 'audio_generation'

Task for generating audio.

EMBEDDINGS_GENERATION

EMBEDDINGS_GENERATION = 'embeddings_generation'

Task for generating embeddings.

SPEECH_TOKEN_GENERATION

SPEECH_TOKEN_GENERATION = 'speech_token_generation'

Task for generating speech tokens.

TEXT_GENERATION

TEXT_GENERATION = 'text_generation'

Task for generating text.

output_type

property output_type: type

Get the output type for the pipeline task.

Returns:

The output type for the pipeline task.

Return type:

type

PipelineTokenizer

class max.interfaces.PipelineTokenizer(*args, **kwargs)

Interface for LLM tokenizers.

decode()

async decode(encoded, **kwargs)

Decodes response tokens to text.

Parameters:

encoded (TokenizerEncoded) – Encoded response tokens.

Returns:

Un-encoded response text.

Return type:

str

encode()

async encode(prompt, add_special_tokens)

Encodes text prompts as tokens.

Parameters:

  • prompt (str) – Un-encoded prompt text.
  • add_special_tokens (bool)

Raises:

ValueError – If the prompt exceeds the configured maximum length.

Return type:

TokenizerEncoded

eos

property eos: int

The end of sequence token for this tokenizer.

expects_content_wrapping

property expects_content_wrapping: bool

If true, this tokenizer expects messages to have a content property.

Text messages are formatted as:

{ "type": "text", "content": "text content" }
{ "type": "text", "content": "text content" }

instead of the OpenAI spec:

{ "type": "text", "text": "text content" }
{ "type": "text", "text": "text content" }

NOTE: Multimodal messages omit the content property. Both image_urls and image content parts are converted to:

{ "type": "image" }
{ "type": "image" }

Their content is provided as byte arrays through the top-level property on the request object, i.e., RequestType.images.

new_context()

async new_context(request)

Creates a new context from a request object. This is sent to the worker process once and then cached locally.

Parameters:

request (RequestType) – Incoming request.

Returns:

Initialized context.

Return type:

UnboundContextType
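
A hedged sketch of an encode/decode round trip, assuming tokenizer is an object implementing this interface (for example, the tokenizer attached to a loaded pipeline):

import asyncio

async def round_trip(tokenizer, text: str) -> str:
    # Encode the prompt, then decode the tokens back into text.
    encoded = await tokenizer.encode(text, add_special_tokens=True)
    return await tokenizer.decode(encoded)

# asyncio.run(round_trip(tokenizer, "Hello, world!"))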

Request

class max.interfaces.Request(request_id)

Base class representing a generic request within the MAX API.

This class provides a unique identifier for each request, ensuring that all requests can be tracked and referenced consistently throughout the system. Subclasses can extend this class to include additional fields specific to their request types.

Parameters:

request_id (str)

request_id

request_id: str

RequestID

max.interfaces.RequestID

alias of str

SamplingParams

class max.interfaces.SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

Request-specific sampling parameters that are only known at run time.

Parameters:

detokenize

detokenize: bool = True

Whether to detokenize the output tokens into text.

frequency_penalty

frequency_penalty: float = 0.0

The frequency penalty to apply to the model’s output. A positive value will penalize new tokens based on their frequency in the generated text: tokens will receive a penalty proportional to the count of appearances.

ignore_eos

ignore_eos: bool = False

If True, the response will ignore the EOS token, and continue to generate until the max tokens or a stop string is hit.

max_new_tokens

max_new_tokens: int | None = None

The maximum number of new tokens to generate in the response. If not set, the model may generate tokens until it reaches its internal limits or based on other stopping criteria.

min_new_tokens

min_new_tokens: int = 0

The minimum number of tokens to generate in the response.

min_p

min_p: float = 0.0

Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.

presence_penalty

presence_penalty: float = 0.0

The presence penalty to apply to the model’s output. A positive value will penalize new tokens that have already appeared in the generated text at least once by applying a constant penalty.

repetition_penalty

repetition_penalty: float = 1.0

The repetition penalty to apply to the model’s output. Values > 1 will penalize new tokens that have already appeared in the generated text at least once by dividing the logits by the repetition penalty.

seed

seed: int = 0

The seed to use for the random number generator.

stop

stop: list[str] | None = None

A list of detokenized sequences that can be used as stop criteria when generating a new sequence.

stop_token_ids

stop_token_ids: list[int] | None = None

A list of token ids that are used as stopping criteria when generating a new sequence.

temperature

temperature: float = 1

Controls the randomness of the model’s output; higher values produce more diverse responses.

top_k

top_k: int = 1

Limits the sampling to the K most probable tokens. This defaults to 1, which enables greedy sampling.

top_p

top_p: float = 1

Only use the tokens whose cumulative probability is within the top_p threshold. This applies to the top_k tokens.
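
A sketch of a typical non-greedy configuration (nucleus sampling with a repetition penalty and a cap on new tokens):

from max.interfaces import SamplingParams

params = SamplingParams(
    top_k=40,
    top_p=0.95,
    temperature=0.7,
    repetition_penalty=1.1,
    max_new_tokens=256,
    seed=42,
)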

SchedulerResult

class max.interfaces.SchedulerResult(status, result)

Structure representing the result of a scheduler operation for a specific pipeline execution.

This class encapsulates the outcome of a pipeline operation as managed by the scheduler, including both the execution status and any resulting data from the pipeline. The scheduler uses this structure to communicate the state of pipeline operations back to clients, whether the operation is still running, has completed successfully, or was cancelled.

The generic type parameter allows this result to work with different types of pipeline outputs while maintaining type safety.

Parameters:

active()

classmethod active(result)

Create a SchedulerResult representing an active pipeline operation.

Parameters:

result (PipelineOutputType) – The current pipeline output data (may be partial for streaming operations).

Returns:

A SchedulerResult with ACTIVE status and the provided result.

Return type:

SchedulerResult

cancelled()

classmethod cancelled()

Create a SchedulerResult representing a cancelled pipeline operation.

Returns:

A SchedulerResult with CANCELLED status and no result.

Return type:

SchedulerResult

complete()

classmethod complete(result)

Create a SchedulerResult representing a completed pipeline operation.

Parameters:

result (PipelineOutputType) – The final pipeline output data.

Returns:

A SchedulerResult with COMPLETE status and the final result.

Return type:

SchedulerResult

result

result: PipelineOutputType | None

The pipeline output data, if any. May be None for cancelled operations or during intermediate states of streaming operations.

status

status: SchedulerStatus

The current status of the pipeline operation from the scheduler’s perspective.

stop_stream

property stop_stream: bool

Determine if the pipeline operation stream should continue based on the current status.

Returns:

True if the pipeline operation stream should stop (CANCELLED or COMPLETE), False if it should continue (ACTIVE).

Return type:

bool
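
A sketch of how intermediate and final outputs might be wrapped, using TextGenerationOutput as the pipeline output type; the request ID and token values are illustrative.

from max.interfaces import (
    GenerationStatus,
    SchedulerResult,
    TextGenerationOutput,
)

partial = SchedulerResult.active(
    TextGenerationOutput(
        request_id="req-1", tokens=[101], final_status=GenerationStatus.ACTIVE
    )
)
final = SchedulerResult.complete(
    TextGenerationOutput(
        request_id="req-1",
        tokens=[101, 102],
        final_status=GenerationStatus.END_OF_SEQUENCE,
    )
)

assert not partial.stop_stream  # keep streaming while ACTIVE
assert final.stop_stream        # stop once COMPLETE (or CANCELLED)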

SchedulerStatus

class max.interfaces.SchedulerStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Represents the status of a scheduler operation for a specific pipeline execution.

The scheduler manages the execution of pipeline operations and returns status updates to indicate the current state of the pipeline execution. This enum defines the possible states that a pipeline operation can be in from the scheduler’s perspective.

ACTIVE

ACTIVE = 'active'

Indicates that the scheduler executed the pipeline operation successfully and the request remains active.

CANCELLED

CANCELLED = 'cancelled'

Indicates that the pipeline operation was cancelled before completion; no further data will be provided.

COMPLETE

COMPLETE = 'complete'

Indicates that the pipeline operation was previously finished and no further data should be streamed.

SharedMemoryArray

class max.interfaces.SharedMemoryArray(name, shape, dtype)

Wrapper for numpy array stored in shared memory.

This class is used as a placeholder in pixel_values during serialization. It will be encoded as a dict with __shm__ flag and decoded back to a numpy array.

Parameters:

TextGenerationInputs

class max.interfaces.TextGenerationInputs(batch, num_steps)

Input parameters for text generation pipeline operations.

This class encapsulates the batch of contexts and number of steps required for token generation in a single input object, replacing the previous pattern of passing batch and num_steps as separate parameters.

Parameters:

batch

batch: dict[str, T]

Dictionary mapping request IDs to context objects.

num_steps

num_steps: int

Number of tokens to generate.
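
A minimal sketch, assuming contexts is a dict mapping request IDs to context objects created by the pipeline's tokenizer:

from max.interfaces import TextGenerationInputs

def make_inputs(contexts):
    # Bundle the batch and the number of decode steps into a single input object.
    return TextGenerationInputs(batch=contexts, num_steps=8)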

TextGenerationOutput

class max.interfaces.TextGenerationOutput(request_id, tokens, final_status, log_probabilities=None)

Represents the output of a text generation operation, combining token IDs, final generation status, request ID, and optional log probabilities for each token.

Parameters:

final_status

final_status: GenerationStatus

The final status of the generation process.

is_done

property is_done: bool

Indicates whether the text generation process is complete.

Returns:

True if the generation is done, False otherwise.

Return type:

bool

log_probabilities

log_probabilities: list[LogProbabilities] | None

Optional list of log probabilities for each token.

request_id

request_id: str

The unique identifier for the generation request.

tokens

tokens: list[int]

List of generated token IDs.

TextGenerationRequest

class max.interfaces.TextGenerationRequest(request_id: str, index: 'int', model_name: 'str', lora_name: 'str | None' = None, prompt: 'Union[str, Sequence[int], None]' = None, messages: 'Optional[list[TextGenerationRequestMessage]]' = None, images: 'Optional[list[bytes]]' = None, tools: 'Optional[list[TextGenerationRequestTool]]' = None, response_format: 'Optional[TextGenerationResponseFormat]' = None, timestamp_ns: 'int' = 0, request_path: 'str' = '/', logprobs: 'int' = 0, echo: 'bool' = False, stop: 'Optional[Union[str, list[str]]]' = None, chat_template_options: 'Optional[dict[str, Any]]' = None, sampling_params: 'SamplingParams' = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0))

Parameters:

chat_template_options

chat_template_options: dict[str, Any] | None = None

Optional dictionary of options to pass when applying the chat template.

echo

echo: bool = False

If set to True, the response will include the original prompt along with the generated output. This can be useful for debugging or when you want to see how the input relates to the output.

images

images: list[bytes] | None = None

A list of image byte arrays that can be included as part of the request. This field is optional and may be used for multimodal inputs where images are relevant to the prompt or task.

index

index: int

The sequence order of this request within a batch. This is useful for maintaining the order of requests when processing multiple requests simultaneously, ensuring that responses can be matched back to their corresponding requests accurately.

logprobs

logprobs: int = 0

The number of top log probabilities to return for each generated token. A value of 0 means that log probabilities will not be returned. Useful for analyzing model confidence in its predictions.

lora_name

lora_name: str | None = None

The name of the lora to be used for generating tokens. This should match the available models on the server and determines the behavior and capabilities of the response generation.

messages

messages: list[TextGenerationRequestMessage] | None = None

A list of messages for chat-based interactions. This is used in chat completion APIs, where each message represents a turn in the conversation. If provided, the model will generate responses based on these messages.

model_name

model_name: str

The name of the model to be used for generating tokens. This should match the available models on the server and determines the behavior and capabilities of the response generation.

prompt

prompt: str | Sequence[int] | None = None

The prompt to be processed by the model. This field supports legacy completion APIs and can accept either a string or a sequence of integers representing token IDs. If not provided, the model may generate output based on the messages field.

request_path

request_path: str = '/'

The endpoint path for the request. This is typically used for routing and logging requests within the server infrastructure.

response_format

response_format: TextGenerationResponseFormat | None = None

Specifies the desired format for the model’s output. When set, it enables structured generation, which adheres to the json_schema provided.

sampling_params

sampling_params: SamplingParams = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

Token sampling configuration parameters for the request.

stop

stop: str | list[str] | None = None

Optional list of stop expressions (see https://platform.openai.com/docs/api-reference/chat/create#chat-create-stop).

timestamp_ns

timestamp_ns: int = 0

The time (in nanoseconds) when the request was received by the server. This can be useful for performance monitoring and logging purposes.

tools

tools: list[TextGenerationRequestTool] | None = None

A list of tools that can be invoked during the generation process. This allows the model to utilize external functionalities or APIs to enhance its responses.
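
A sketch of a chat-style request; the model name is a placeholder, and the messages are plain dicts with the role/content fields described by TextGenerationRequestMessage.

from max.interfaces import SamplingParams, TextGenerationRequest

request = TextGenerationRequest(
    request_id="req-42",
    index=0,
    model_name="example-llm",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the max.interfaces module."},
    ],
    sampling_params=SamplingParams(top_k=40, temperature=0.7, max_new_tokens=128),
)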

TextGenerationRequestFunction

class max.interfaces.TextGenerationRequestFunction

Represents a function definition for a text generation request.

description

description: str

A human-readable description of the function’s purpose.

name

name: str

The name of the function to be invoked.

parameters

parameters: dict

A dictionary describing the function’s parameters, typically following a JSON schema.

TextGenerationRequestMessage

class max.interfaces.TextGenerationRequestMessage

content

content: str | list[dict[str, Any]]

Content can be a simple string or a list of message parts of different modalities.

For example:

{
  "role": "user",
  "content": "What's the weather like in Boston today?"
}

Or:

{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What's in this image?"
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
      }
    }
  ]
}

role

role: Literal['system', 'user', 'assistant']

The role of the message sender, indicating whether the message is from the system, user, or assistant.

TextGenerationRequestTool

class max.interfaces.TextGenerationRequestTool

Represents a tool definition for a text generation request.

function

function: TextGenerationRequestFunction

The function definition associated with the tool, including its name, description, and parameters.

type

type: str

The type of the tool, typically indicating the tool’s category or usage.

TextGenerationResponseFormat

class max.interfaces.TextGenerationResponseFormat

Represents the response format specification for a text generation request.

json_schema

json_schema: dict

A JSON schema dictionary that defines the structure and validation rules for the generated response.

type

type: str

The type of response format, e.g., “json_object”.
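
An illustrative response format with the two documented fields; the schema contents below are assumptions made for the sake of the example.

response_format = {
    "type": "json_object",
    "json_schema": {
        "type": "object",
        "properties": {"answer": {"type": "string"}},
        "required": ["answer"],
    },
}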

TokenGenerator

class max.interfaces.TokenGenerator(*args, **kwargs)

Interface for LLM token-generator models.

next_token()

next_token(inputs)

Computes the next token response for a single batch.

Parameters:

inputs (TextGenerationInputs[T]) – Input data containing batch of contexts and number of steps to generate.

Returns:

Dictionary of responses indexed by request ID.

Return type:

dict[str, TextGenerationOutput]

release()

release(request_id)

Releases resources associated with this request ID.

Parameters:

request_id (str) – Unique identifier for the finished request.

Return type:

None
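
Assuming generator implements this interface and contexts maps request IDs to contexts created elsewhere, a single decode step might look like this hedged sketch:

from max.interfaces import TextGenerationInputs

def decode_step(generator, contexts, num_steps=8):
    # `generator` implements TokenGenerator; `contexts` maps request IDs to contexts.
    inputs = TextGenerationInputs(batch=contexts, num_steps=num_steps)
    outputs = generator.next_token(inputs)
    for request_id, output in outputs.items():
        if output.is_done:
            # Free resources for requests that have finished generating.
            generator.release(request_id)
    return outputs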