
Python module

config

Standardized configuration for Pipeline Inference.

HuggingFaceRepo

class max.pipelines.config.HuggingFaceRepo(repo_id: 'str', trust_remote_code: 'bool' = False, repo_type: 'Optional[RepoType]' = None)

download()

download(filename: str, force_download: bool = False) → Path

encoding_for_file()

encoding_for_file(file: str | Path) → SupportedEncoding

file_exists()

file_exists(filename: str) → bool

files_for_encoding()

files_for_encoding(encoding: SupportedEncoding, weights_format: WeightsFormat | None = None, alternate_encoding: SupportedEncoding | None = None) → dict[max.graph.weights.format.WeightsFormat, list[pathlib.Path]]

formats_available

property formats_available*: list[max.graph.weights.format.WeightsFormat]*

info

property info*: ModelInfo*

repo_id

repo_id*: str*

repo_type

repo_type*: RepoType | None* = None

size_of()

size_of(filename: str) → int | None

supported_encodings

property supported_encodings*: list[max.pipelines.config.SupportedEncoding]*

trust_remote_code

trust_remote_code*: bool* = False

weight_files

property weight_files*: dict[max.graph.weights.format.WeightsFormat, list[str]]*
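
A minimal usage sketch based only on the signatures above; the repository id and filename are placeholders, and calls that touch Hugging Face may require network access or a warm local cache:

from max.pipelines.config import HuggingFaceRepo, SupportedEncoding

# Placeholder repo id; substitute a real Hugging Face repository.
repo = HuggingFaceRepo(repo_id="org/model-name")

# Check for a file before fetching it (the filename is hypothetical).
if repo.file_exists("config.json"):
    local_path = repo.download("config.json")

# Inspect which weight formats and encodings the repository provides.
print(repo.formats_available)
print(repo.supported_encodings)

# Collect the weight files for a specific encoding, if it is offered.
if SupportedEncoding.bfloat16 in repo.supported_encodings:
    files = repo.files_for_encoding(encoding=SupportedEncoding.bfloat16)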

KVCacheConfig

class max.pipelines.config.KVCacheConfig(cache_strategy: 'KVCacheStrategy' = <KVCacheStrategy.MODEL_DEFAULT: 'model_default'>, kv_cache_page_size: 'int' = 128, enable_prefix_caching: 'bool' = False, device_memory_utilization: 'float' = 0.9, _available_cache_memory: 'Optional[int]' = None)

cache_strategy

cache_strategy*: KVCacheStrategy* = 'model_default'

The cache strategy to use. Defaults to model_default, which selects the cache strategy preferred by the requested model architecture.

You can also force the engine to use a specific caching strategy: naive | continuous | paged.

device_memory_utilization

device_memory_utilization*: float* = 0.9

The fraction of available device memory that the process should consume.

This is used to inform the size of the KVCache workspace. The calculation is:

kv_cache_workspace = (total_free_memory * device_memory_utilization) - model_weights_size
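
For example, with 80 GiB of free device memory, a utilization of 0.9, and 16 GiB of model weights, the workspace would be 80 * 0.9 - 16 = 56 GiB.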

enable_prefix_caching

enable_prefix_caching*: bool* = False

Whether to enable prefix caching for the paged attention KVCache.

help()

static help() → dict[str, str]

Documentation for this config class. Return a dictionary of config options and their descriptions.

kv_cache_page_size

kv_cache_page_size*: int* = 128

The number of tokens in a single page in the paged KVCache.
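
A short construction sketch using only the fields documented above; the specific values are illustrative rather than recommended:

from max.pipelines.config import KVCacheConfig

# Paged-attention cache with prefix caching enabled; values are illustrative.
kv_config = KVCacheConfig(
    kv_cache_page_size=128,
    enable_prefix_caching=True,
    device_memory_utilization=0.85,
)

# Field names mapped to their descriptions, as defined by help().
print(KVCacheConfig.help())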

MAXConfig

class max.pipelines.config.MAXConfig

Abstract base class for all MAX configs.

There are some invariants that MAXConfig classes should follow:

  • All config classes should be dataclasses.
  • All config classes should have a help() method that returns a dictionary of config options and their descriptions.
  • All config classes dataclass fields should have default values, and hence can be trivially initialized via cls().
  • All config classes should be frozen (except KVCacheConfig for now), to avoid accidental modification of config objects.
  • All config classes must have mutually exclusive dataclass fields among themselves.

help()

abstract help() → dict[str, str]

Documentation for this config class. Return a dictionary of config options and their descriptions.
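
As an illustration of the invariants above, a hypothetical subclass might look like the following; the class and its field are invented for this example and are not part of the module, and depending on how the base class is declared the frozen flag may need adjusting:

from dataclasses import dataclass

from max.pipelines.config import MAXConfig

@dataclass(frozen=True)  # a frozen dataclass, per the invariants above
class MyFeatureConfig(MAXConfig):
    # Every field has a default, so MyFeatureConfig() is trivially valid.
    enable_my_feature: bool = False

    @staticmethod
    def help() -> dict[str, str]:
        return {
            "enable_my_feature": "Whether to enable the hypothetical feature.",
        }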

MAXModelConfig

class max.pipelines.config.MAXModelConfig(model_path: str = '', huggingface_repo_id: str = '', weight_path: list[pathlib.Path] = <factory>, quantization_encoding: ~max.pipelines.config.SupportedEncoding | None = None, huggingface_revision: str = 'main', trust_remote_code: bool = False, device_specs: list[max.driver.driver.DeviceSpec] = <factory>, force_download: bool = False, _weights_repo_id: str | None = None, _quant_config: ~max.graph.quantization.QuantizationConfig | None = None)

Abstract base class for all MAX model configs.

This class is used to configure a model to use for a pipeline.

device_specs

device_specs*: list[max.driver.driver.DeviceSpec]*

Devices to run inference upon. This option is not documented in help() as it shouldn’t be used directly via the CLI entrypoint.

finalize_encoding_config()

finalize_encoding_config()

force_download

force_download*: bool* = False

Whether to force download a given file if it’s already present in the local cache.

graph_quantization_encoding

property graph_quantization_encoding*: QuantizationEncoding | None*

Converts the CLI encoding to a MAX Graph quantization encoding.

  • Returns:

    The graph quantization encoding corresponding to the CLI encoding.

  • Raises:

    ValueError – If no CLI encoding was specified.

help()

static help() → dict[str, str]

Documentation for this config class. Return a dictionary of config options and their descriptions.

huggingface_repo_id

huggingface_repo_id*: str* = ''

DEPRECATED: repo_id of a Hugging Face model repository to use. Use model_path instead.

huggingface_revision

huggingface_revision*: str* = 'main'

Branch or Git revision of the Hugging Face model repository to use.

huggingface_weights_repo()

huggingface_weights_repo() → HuggingFaceRepo

model_path

model_path*: str* = ''

repo_id of a Hugging Face model repository to use.

quantization_encoding

quantization_encoding*: SupportedEncoding | None* = None

Weight encoding type.

trust_remote_code

trust_remote_code*: bool* = False

Whether to allow custom modeling files from Hugging Face repositories.

validate()

validate()

Validates the config.

This method is called after the model config is initialized, to ensure that all config fields have been initialized to a valid state. It will also set and update other fields which may not be determined / initialized in the default factory.

weight_path

weight_path*: list[pathlib.Path]*

Optional path or URL of the model weights to use.

weights_size()

weights_size() → int
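
A minimal sketch of constructing the model config directly, using only fields from the signature above; in practice these fields are usually supplied through PipelineConfig, the repository id is a placeholder, and calls that resolve Hugging Face repositories may require network access:

from max.pipelines.config import MAXModelConfig, SupportedEncoding

model_config = MAXModelConfig(
    model_path="org/model-name",                       # placeholder repo id
    quantization_encoding=SupportedEncoding.bfloat16,
    trust_remote_code=False,
)

# Check and finish initializing the fields, then resolve the weights repo.
model_config.validate()
repo = model_config.huggingface_weights_repo()

# Graph-level encoding derived from the encoding set above.
print(model_config.graph_quantization_encoding)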

PipelineConfig

class max.pipelines.config.PipelineConfig(**kwargs: Any)

Configuration for a pipeline.

WIP - Once a PipelineConfig is fully initialized, it should be as immutable as possible (frozen=True). All underlying dataclass fields should have been initialized to their default values, be it user specified via some CLI flag, config file, environment variable, or internally set to a reasonable default.

draft_model

draft_model*: str | None* = None

Draft model for use during Speculative Decoding.

enable_chunked_prefill

enable_chunked_prefill*: bool* = True

Enable chunked prefill to split context encoding requests into multiple chunks based on ‘target_num_new_tokens’.

enable_echo

enable_echo*: bool* = False

Whether the model should be built with echo capabilities.

enable_in_flight_batching

enable_in_flight_batching*: bool* = False

When enabled, prioritizes token generation by batching it with context encoding requests. Requires chunked prefill.

engine

engine*: PipelineEngine | None* = None

Engine backend to use for serving: 'max' for the MAX engine, or 'huggingface' as a fallback option for improved model coverage.

graph_quantization_encoding

property graph_quantization_encoding*: QuantizationEncoding | None*

Converts the CLI encoding to a MAX graph quantization encoding.

  • Returns:

    The graph quantization encoding corresponding to the CLI encoding.

help()

static help() → dict[str, str]

Documentation for this config class. Return a dictionary of config options and their descriptions.

kv_cache_config

property kv_cache_config*: KVCacheConfig*

max_batch_size

max_batch_size*: int | None* = None

Maximum batch size to execute with the model. This defaults to 1 to minimize memory consumption for the base case of a single user running a local server to try out MAX. For server deployments, set this value higher based on server capacity.

max_cache_batch_size

max_cache_batch_size*: int | None* = None

DEPRECATED: The maximum cache batch size to use for the model. Use max_batch_size instead.

max_ce_batch_size

max_ce_batch_size*: int* = 192

Maximum cache size to reserve for a single context encoding batch. The actual limit is the lesser of this and max_batch_size.

max_length

max_length*: int | None* = None

Maximum sequence length of the model.

max_new_tokens

max_new_tokens*: int* = -1

Maximum number of new tokens to generate during a single inference pass of the model.

max_num_steps

max_num_steps*: int* = -1

The number of steps to run for multi-step scheduling. -1 specifies a default value based on configuration and platform. Ignored for models which are not auto-regressive (e.g. embedding models).

model_config

property model_config*: MAXModelConfig*

pad_to_multiple_of

pad_to_multiple_of*: int* = 2

Pad input tensors to a multiple of the provided value.

pool_embeddings

pool_embeddings*: bool* = True

Whether to pool embedding outputs.

profiling_config

property profiling_config*: ProfilingConfig*

rope_type

rope_type*: RopeType | None* = None

Force using a specific rope type: none | normal | neox. Only matters for GGUF weights.

sampling_config

property sampling_config*: SamplingConfig*

save_to_serialized_model_path

save_to_serialized_model_path*: str | None* = None

DEPRECATED: Serialization paths are no longer supported.

serialized_model_path

serialized_model_path*: str | None* = None

DEPRECATED: Serialization paths are no longer supported.

target_num_new_tokens

target_num_new_tokens*: int | None* = None

The target number of un-encoded tokens to include in each batch. If not set, this will be set to a best-guess optimal value based on model, hardware, and available memory.

use_experimental_kernels

use_experimental_kernels*: str* = 'false'

validate()

validate() → None

Validate the config.

This method is called after the config is initialized, to ensure that all config fields have been initialized to a valid state.
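
A hedged end-to-end sketch: PipelineConfig accepts keyword arguments (per the signature above) that, in current MAX releases, typically include its own fields as well as the model fields documented under MAXModelConfig; the repository id is a placeholder and the accepted keys depend on the installed MAX version:

from max.pipelines.config import PipelineConfig

config = PipelineConfig(
    model_path="org/model-name",   # placeholder Hugging Face repo id
    max_batch_size=8,
    max_length=4096,
)

config.validate()                       # ensure all fields are in a valid state
print(config.model_config.model_path)   # nested MAXModelConfig
print(config.kv_cache_config)           # nested KVCacheConfig
print(config.sampling_config.top_k)     # nested SamplingConfig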

PipelineEngine

class max.pipelines.config.PipelineEngine(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

HUGGINGFACE

HUGGINGFACE = 'huggingface'

MAX

MAX = 'max'

ProfilingConfig

class max.pipelines.config.ProfilingConfig(gpu_profiling: 'GPUProfilingMode' = <GPUProfilingMode.OFF: 'off'>)

gpu_profiling

gpu_profiling*: GPUProfilingMode* = 'off'

Whether to enable GPU profiling of the model.

help()

static help() → dict[str, str]

Documentation for this config class. Return a dictionary of config options and their descriptions.

RepoType

class max.pipelines.config.RepoType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

local

local = 'local'

online

online = 'online'

RopeType

class max.pipelines.config.RopeType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

neox

neox = 'neox'

none

none = 'none'

normal

normal = 'normal'

SamplingConfig

class max.pipelines.config.SamplingConfig(top_k: 'int' = 1, enable_structured_output: 'bool' = False, in_dtype: 'DType' = float32, out_dtype: 'DType' = float32)

enable_structured_output

enable_structured_output*: bool* = False

Enable structured generation/guided decoding for the server. This allows the user to pass a JSON schema in the response_format field, which the LLM will adhere to.

help()

static help() → dict[str, str]

Documentation for this config class. Return a dictionary of config options and their descriptions.

in_dtype

in_dtype*: DType* = float32

The data type of the input tokens.

out_dtype

out_dtype*: DType* = float32

The data type of the output logits.

top_k

top_k*: int* = 1

Limits the sampling to the K most probable tokens. This defaults to 1, which enables greedy sampling.
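
A brief sketch of the two common setups, greedy decoding versus top-k sampling; the values are illustrative:

from max.pipelines.config import SamplingConfig

greedy = SamplingConfig()                  # top_k=1, i.e. greedy sampling
top_k_sampling = SamplingConfig(top_k=40)  # sample from the 40 most probable tokens

# Field names mapped to their descriptions, as defined by help().
print(SamplingConfig.help())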

SupportedEncoding

class max.pipelines.config.SupportedEncoding(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

All possible encodings which may be supported by a particular model.

bfloat16

bfloat16 = 'bfloat16'

cache_dtype

property cache_dtype*: DType*

The underlying dtype used in the kvcache for correctness.

dtype

property dtype*: DType*

The underlying model dtype associated with a quantization_encoding.

float32

float32 = 'float32'

gptq

gptq = 'gptq'

parse_from_file_name()

classmethod parse_from_file_name(name: str)

q4_0

q4_0 = 'q4_0'

q4_k

q4_k = 'q4_k'

q6_k

q6_k = 'q6_k'

quantization_encoding

property quantization_encoding*: QuantizationEncoding | None*

supported_on()

supported_on(device_spec: DeviceSpec) → bool

Returns whether this quantization encoding is supported on a device.
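
A small sketch using only members documented above; the filename passed to parse_from_file_name() is hypothetical, and the printed values depend on the installed MAX version:

from max.pipelines.config import SupportedEncoding

enc = SupportedEncoding.bfloat16
print(enc.dtype)                   # underlying model dtype
print(enc.cache_dtype)             # dtype used in the KV cache
print(enc.quantization_encoding)   # graph quantization encoding, if any

# Infer an encoding from a weights filename (the filename is made up).
parsed = SupportedEncoding.parse_from_file_name("model-q4_k.gguf")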