Python module

config

Standardized configuration for Pipeline Inference.

AudioGenerationConfig

class max.pipelines.lib.config.AudioGenerationConfig(audio_decoder: 'str', audio_decoder_weights: 'str' = '', block_sizes: 'list[int] | None' = None, buffer: 'int' = 0, block_causal: 'bool' = False, prepend_prompt_speech_tokens: 'PrependPromptSpeechTokens' = <PrependPromptSpeechTokens.NEVER: 'never'>, prepend_prompt_speech_tokens_causal: 'bool' = False, run_model_test_mode: 'bool' = False, **kwargs: 'Any')

Parameters:

  • audio_decoder (str)
  • audio_decoder_weights (str)
  • block_sizes (list[int] | None)
  • buffer (int)
  • block_causal (bool)
  • prepend_prompt_speech_tokens (PrependPromptSpeechTokens)
  • prepend_prompt_speech_tokens_causal (bool)
  • run_model_test_mode (bool)
  • kwargs (Any)
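
A minimal construction sketch based on the signature above. The decoder name and weights path are placeholders; which values are valid depends on the audio decoder architectures available in your installation:

from max.pipelines.lib.config import AudioGenerationConfig, PrependPromptSpeechTokens

# Placeholder decoder name and weights path; substitute real values.
audio_config = AudioGenerationConfig(
    audio_decoder="my_audio_decoder",
    audio_decoder_weights="/path/to/decoder/weights",
    block_sizes=[32, 64, 64],  # variable-size streaming blocks
    buffer=128,                # pass the previous 128 speech tokens each step
    block_causal=False,
    prepend_prompt_speech_tokens=PrependPromptSpeechTokens.ONCE,
)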

audio_decoder

audio_decoder: str = ''

The name of the audio decoder model architecture.

audio_decoder_config

audio_decoder_config: dict[str, Any]

Parameters to pass to the audio decoder model.

audio_decoder_weights

audio_decoder_weights: str = ''

The path to the audio decoder weights file.

block_causal

block_causal: bool = False

Whether prior buffered tokens should attend to tokens in the current block. Has no effect if buffer is not set.

block_sizes

block_sizes: list[int] | None = None

The block sizes to use for streaming. If this is an int, then fixed-size blocks of the given size are used. If this is a list, then variable block sizes are used.

buffer

buffer: int = 0

The number of previous speech tokens to pass to the audio decoder on each generation step.

from_flags()

classmethod from_flags(audio_flags, **config_flags)

Parameters:

  • audio_flags
  • config_flags

Return type:

AudioGenerationConfig
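
The flag types are not documented on this page; the sketch below assumes audio_flags is a mapping of audio-specific flag names to values and that the remaining pipeline flags are passed as keyword arguments:

# Assumption: audio_flags is a mapping of flag names to values, and
# config_flags are ordinary pipeline config options. Not confirmed here.
config = AudioGenerationConfig.from_flags(
    {"audio_decoder": "my_audio_decoder", "buffer": "128"},
    max_batch_size=1,
)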

prepend_prompt_speech_tokens

prepend_prompt_speech_tokens: PrependPromptSpeechTokens = 'once'

Whether the prompt speech tokens should be forwarded to the audio decoder. If “never”, the prompt tokens are not forwarded. If “once”, the prompt tokens are only forwarded on the first block. If “always”, the prompt tokens are forwarded on all blocks.

prepend_prompt_speech_tokens_causal

prepend_prompt_speech_tokens_causal: bool = False

Whether the prompt speech tokens should attend to tokens in the currently generated audio block. Has no effect if prepend_prompt_speech_tokens is “never”. If False (default), the prompt tokens do not attend to the current block. If True, the prompt tokens attend to the current block.

PipelineConfig

class max.pipelines.lib.config.PipelineConfig(**kwargs)

Configuration for a pipeline.

WIP - Once a PipelineConfig is fully initialized, it should be as immutable as possible (frozen=True). All underlying dataclass fields should have been initialized to their default values, be it user specified via some CLI flag, config file, environment variable, or internally set to a reasonable default.

Parameters:

kwargs (Any)
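
A minimal sketch, assuming the documented fields below can be supplied as keyword arguments (which the **kwargs signature suggests). Model-specific options such as the model path live on MAXModelConfig and are omitted here:

from max.pipelines.lib.config import PipelineConfig

pipeline_config = PipelineConfig(
    max_batch_size=16,
    max_length=4096,
    enable_chunked_prefill=True,
)
pipeline_config.resolve()  # validate and resolve all fields (see resolve() below)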

ce_delay_ms

ce_delay_ms: float = 0.0

Duration of scheduler sleep prior to starting a prefill batch.

This is an experimental flag solely for the TTS scheduler. Do not use unless you know what you are doing.

custom_architectures

custom_architectures: list[str]

A list of custom architecture implementations to register. Each input can either be a raw module name or an import path followed by a colon and the module name. For example:

  • my_module
  • folder/path/to/import:my_module

Each module must expose an ARCHITECTURES list of architectures to register.
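
A sketch of what such a module and its registration might look like. MyCustomArchitecture is a hypothetical placeholder; a real implementation must follow the MAX architecture API, which is not covered on this page:

# my_module.py -- hypothetical custom architecture module.
class MyCustomArchitecture:
    ...  # placeholder; a real architecture implementation goes here

# The pipeline looks for a module-level ARCHITECTURES list in each registered module.
ARCHITECTURES = [MyCustomArchitecture]

Referencing the module from the pipeline configuration, using either accepted form:

from max.pipelines.lib.config import PipelineConfig

config = PipelineConfig(
    custom_architectures=["my_module", "folder/path/to/import:my_module"],
)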

draft_model_config

property draft_model_config: MAXModelConfig | None

enable_chunked_prefill

enable_chunked_prefill: bool = True

Enable chunked prefill to split context encoding requests into multiple chunks based on ‘target_num_new_tokens’.

enable_echo

enable_echo: bool = False

Whether the model should be built with echo capabilities.

enable_in_flight_batching

enable_in_flight_batching: bool = False

When enabled, prioritizes token generation by batching it with context encoding requests.

enable_prioritize_first_decode

enable_prioritize_first_decode: bool = False

When enabled, the scheduler will always run a TG batch immediately after a CE batch, with the same requests. This may be useful for decreasing time-to-first-chunk latency.

This is an experimental flag solely for the TTS scheduler. Do not use unless you know what you are doing.

engine

engine: PipelineEngine | None = None

Engine backend to use for serving: 'max' for the MAX engine, or 'huggingface' as a fallback option for improved model coverage.

graph_quantization_encoding

property graph_quantization_encoding: QuantizationEncoding | None

Converts the CLI encoding to a MAX graph quantization encoding.

Returns:

The graph quantization encoding corresponding to the CLI encoding.

help()

static help()

Documentation for this config class. Return a dictionary of config options and their descriptions.

Return type:

dict[str, str]
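
For example, the returned mapping can be printed to inspect the available options:

from max.pipelines.lib.config import PipelineConfig

# help() maps each config option name to its description.
for option, description in PipelineConfig.help().items():
    print(f"{option}: {description}")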

ignore_eos

ignore_eos: bool = False

Ignore EOS and continue generating tokens, even when an EOS token is hit.

lora_config

property lora_config: LoRAConfig | None

lora_manager

property lora_manager: LoRAManager | None

max_batch_size

max_batch_size: int | None = None

Maximum batch size to execute with the model. This defaults to 1 to minimize memory consumption for the base case, in which a user is running a local server to test out MAX. Users launching in a server scenario should set this value higher based on server capacity.

max_ce_batch_size

max_ce_batch_size: int = 192

Maximum cache size to reserve for a single context encoding batch. The actual limit is the lesser of this and max_batch_size.
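
Illustrative only: the effective context encoding batch limit reduces to the following.

def effective_ce_batch_limit(max_ce_batch_size: int, max_batch_size: int | None) -> int:
    # The CE limit is the lesser of the two when max_batch_size is set.
    if max_batch_size is None:
        return max_ce_batch_size
    return min(max_ce_batch_size, max_batch_size)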

max_length

max_length: int | None = None

Maximum sequence length of the model.

max_new_tokens

max_new_tokens: int = -1

Maximum number of new tokens to generate during a single inference pass of the model.

max_num_steps

max_num_steps: int = -1

The number of steps to run for multi-step scheduling. -1 specifies a default value based on configuration and platform. Ignored for models which are not auto-regressive (e.g. embedding models).

max_queue_size_tg

max_queue_size_tg: int | None = None

Maximum number of requests in decode queue. By default, this is max-batch-size.

min_batch_size_tg

min_batch_size_tg: int | None = None

Specifies a soft floor on the decode batch size.

If the TG batch size is larger than this value, the scheduler will continue to run TG batches. If it falls below, the scheduler will prioritize CE. Note that this is NOT a strict minimum! By default, this is max-queue-size-tg.

This is an experimental flag solely for the TTS scheduler. Do not use unless you know what you are doing.

model_config

property model_config: MAXModelConfig

pad_to_multiple_of

pad_to_multiple_of: int = 2

Pad input tensors to be a multiple of the value provided.

pdl_level

pdl_level: str = '0'

Level of overlap of kernel launch via programmatic dependent grid control.

pipeline_role

pipeline_role: PipelineRole = 'prefill_and_decode'

Whether the pipeline should serve a prefill role, a decode role, or both.

pool_embeddings

pool_embeddings: bool = True

Whether to pool embedding outputs.

profiling_config

property profiling_config: ProfilingConfig

resolve()

resolve()

Validates and resolves the config.

This method is called after the config is initialized, to ensure that all config fields have been initialized to a valid state.

Return type:

None

sampling_config

property sampling_config: SamplingConfig

target_num_new_tokens

target_num_new_tokens: int | None = None

The target number of un-encoded tokens to include in each batch. If not set, this will be set to a best-guess optimal value based on model, hardware, and available memory.

use_experimental_kernels

use_experimental_kernels: str = 'false'

PrependPromptSpeechTokens

class max.pipelines.lib.config.PrependPromptSpeechTokens(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

ALWAYS

ALWAYS = 'always'

Prepend the prompt speech tokens to all blocks of speech tokens sent to the audio decoder.

NEVER

NEVER = 'never'

Never prepend the prompt speech tokens sent to the audio decoder.

ONCE

ONCE = 'once'

Prepend the prompt speech tokens to the first block of the audio decoder.
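
PrependPromptSpeechTokens controls when the prompt speech tokens are prepended to the blocks sent to the audio decoder. Since the members carry string values, they can also be looked up from those strings (standard enum behavior, shown as a quick illustration):

from max.pipelines.lib.config import PrependPromptSpeechTokens

# Members can be referenced directly or constructed from their string values.
assert PrependPromptSpeechTokens("once") is PrependPromptSpeechTokens.ONCE
assert PrependPromptSpeechTokens.NEVER.value == "never"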