Python module

config

Standardized configuration for Pipeline Inference.

AudioGenerationConfig

class max.pipelines.lib.config.AudioGenerationConfig(audio_decoder: 'str', audio_decoder_weights: 'str' = '', block_sizes: 'list[int] | None' = None, buffer: 'int' = 0, block_causal: 'bool' = False, prepend_prompt_speech_tokens: 'PrependPromptSpeechTokens' = <PrependPromptSpeechTokens.NEVER: 'never'>, prepend_prompt_speech_tokens_causal: 'bool' = False, run_model_test_mode: 'bool' = False, **kwargs: 'Any')

Parameters:

  • audio_decoder (str)
  • audio_decoder_weights (str)
  • block_sizes (list[int] | None)
  • buffer (int)
  • block_causal (bool)
  • prepend_prompt_speech_tokens (PrependPromptSpeechTokens)
  • prepend_prompt_speech_tokens_causal (bool)
  • run_model_test_mode (bool)
  • kwargs (Any)
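
A minimal construction sketch based on the signature above. The decoder name and weights path are placeholders; which values are valid depends on the audio decoder architectures available in your installation:

from max.pipelines.lib.config import AudioGenerationConfig, PrependPromptSpeechTokens

# Placeholder decoder name and weights path; substitute real values.
audio_config = AudioGenerationConfig(
    audio_decoder="my_audio_decoder",
    audio_decoder_weights="/path/to/decoder/weights",
    block_sizes=[32, 64, 64],  # variable-size streaming blocks
    buffer=128,                # pass the previous 128 speech tokens each step
    block_causal=False,
    prepend_prompt_speech_tokens=PrependPromptSpeechTokens.ONCE,
)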

audio_decoder

audio_decoder: str = ''

The name of the audio decoder model architecture.

audio_decoder_config

audio_decoder_config: dict[str, Any]

Parameters to pass to the audio decoder model.

audio_decoder_weights

audio_decoder_weights: str = ''

The path to the audio decoder weights file.

block_causal

block_causal: bool = False

Whether prior buffered tokens should attend to tokens in the current block. Has no effect if buffer is not set.

block_sizes

block_sizes: list[int] | None = None

The block sizes to use for streaming. If this is an int, then fixed-size blocks of the given size are used. If this is a list, then variable block sizes are used.

buffer

buffer: int = 0

The number of previous speech tokens to pass to the audio decoder on each generation step.

from_flags()

classmethod from_flags(audio_flags, **config_flags)

Parameters:

  • audio_flags
  • config_flags

Return type:

AudioGenerationConfig
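
The flag types are not documented on this page; the sketch below assumes audio_flags is a mapping of audio-specific flag names to values and that the remaining pipeline flags are passed as keyword arguments:

# Assumption: audio_flags is a mapping of flag names to values, and
# config_flags are ordinary pipeline config options. Not confirmed here.
config = AudioGenerationConfig.from_flags(
    {"audio_decoder": "my_audio_decoder", "buffer": "128"},
    max_batch_size=1,
)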

prepend_prompt_speech_tokens

prepend_prompt_speech_tokens: PrependPromptSpeechTokens = 'once'

Whether the prompt speech tokens should be forwarded to the audio decoder. If “never”, the prompt tokens are not forwarded. If “once”, the prompt tokens are only forwarded on the first block. If “always”, the prompt tokens are forwarded on all blocks.

prepend_prompt_speech_tokens_causal

prepend_prompt_speech_tokens_causal: bool = False

Whether the prompt speech tokens should attend to tokens in the currently generated audio block. Has no effect if prepend_prompt_speech_tokens is “never”. If False (default), the prompt tokens do not attend to the current block. If True, the prompt tokens attend to the current block.

PipelineConfig

class max.pipelines.lib.config.PipelineConfig(**kwargs)

Configuration for a pipeline.

WIP - Once a PipelineConfig is fully initialized, it should be as immutable as possible (frozen=True). All underlying dataclass fields should have been initialized to their default values, be it user specified via some CLI flag, config file, environment variable, or internally set to a reasonable default.

Parameters:

kwargs (Any)
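
A minimal sketch, assuming the documented fields below can be supplied as keyword arguments (which the **kwargs signature suggests). Model-specific options such as the model path live on MAXModelConfig and are omitted here:

from max.pipelines.lib.config import PipelineConfig

pipeline_config = PipelineConfig(
    max_batch_size=16,
    max_length=4096,
    enable_chunked_prefill=True,
)
pipeline_config.resolve()  # validate and resolve all fields (see resolve() below)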

ce_delay_ms

ce_delay_ms: float = 0.0

Duration of scheduler sleep prior to starting a prefill batch.

This is an experimental flag solely for the TTS scheduler. Do not use unless you know what you are doing.

custom_architectures

custom_architectures: list[str]

A list of custom architecture implementations to register. Each input can either be a raw module name or an import path followed by a colon and the module name. For example:

  • my_module
  • folder/path/to/import:my_module

Each module must expose an ARCHITECTURES list of architectures to register.
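
A sketch of what such a module and its registration might look like. MyCustomArchitecture is a hypothetical placeholder; a real implementation must follow the MAX architecture API, which is not covered on this page:

# my_module.py -- hypothetical custom architecture module.
class MyCustomArchitecture:
    ...  # placeholder; a real architecture implementation goes here

# The pipeline looks for a module-level ARCHITECTURES list in each registered module.
ARCHITECTURES = [MyCustomArchitecture]

Referencing the module from the pipeline configuration, using either accepted form:

from max.pipelines.lib.config import PipelineConfig

config = PipelineConfig(
    custom_architectures=["my_module", "folder/path/to/import:my_module"],
)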

draft_model_config

property draft_model_config: MAXModelConfig | None

enable_chunked_prefill

enable_chunked_prefill: bool = True

Enable chunked prefill to split context encoding requests into multiple chunks based on ‘target_num_new_tokens’.

enable_echo

enable_echo: bool = False

Whether the model should be built with echo capabilities.

enable_in_flight_batching

enable_in_flight_batching: bool = False

When enabled, prioritizes token generation by batching it with context encoding requests.

enable_prioritize_first_decode

enable_prioritize_first_decode: bool = False

When enabled, the scheduler will always run a TG batch immediately after a CE batch, with the same requests. This may be useful for decreasing time-to-first-chunk latency.

This is an experimental flag solely for the TTS scheduler. Do not use unless you know what you are doing.

engine

engine: PipelineEngine | None = None

Engine backend to use for serving: 'max' for the MAX engine, or 'huggingface' as a fallback option for improved model coverage.

graph_quantization_encoding

property graph_quantization_encoding: QuantizationEncoding | None

Converts the CLI encoding to a MAX graph quantization encoding.

Returns:

The graph quantization encoding corresponding to the CLI encoding.

help()

static help()

Documentation for this config class. Return a dictionary of config options and their descriptions.

Return type:

dict[str, str]
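
For example, the returned mapping can be printed to inspect the available options:

from max.pipelines.lib.config import PipelineConfig

# help() maps each config option name to its description.
for option, description in PipelineConfig.help().items():
    print(f"{option}: {description}")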

ignore_eos

ignore_eos: bool = False

Ignore EOS and continue generating tokens, even when an EOS token is hit.

lora_config

property lora_config: LoRAConfig | None

lora_manager

property lora_manager: LoRAManager | None

max_batch_size

max_batch_size: int | None = None

Maximum batch size to execute with the model. This defaults to 1 to minimize memory consumption for the base case, in which a user is running a local server to test out MAX. Users launching in a server scenario should set this value higher based on server capacity.

max_ce_batch_size

max_ce_batch_size: int = 192

Maximum cache size to reserve for a single context encoding batch. The actual limit is the lesser of this and max_batch_size.
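
Illustrative only: the effective context encoding batch limit reduces to the following.

def effective_ce_batch_limit(max_ce_batch_size: int, max_batch_size: int | None) -> int:
    # The CE limit is the lesser of the two when max_batch_size is set.
    if max_batch_size is None:
        return max_ce_batch_size
    return min(max_ce_batch_size, max_batch_size)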

max_length

max_length: int | None = None

Maximum sequence length of the model.

max_new_tokens

max_new_tokens: int = -1

Maximum number of new tokens to generate during a single inference pass of the model.

max_num_steps

max_num_steps: int = -1

The number of steps to run for multi-step scheduling. -1 specifies a default value based on configuration and platform. Ignored for models which are not auto-regressive (e.g. embedding models).

max_queue_size_tg

max_queue_size_tg: int | None = None

Maximum number of requests in decode queue. By default, this is max-batch-size.

min_batch_size_tg

min_batch_size_tg: int | None = None

Specifies a soft floor on the decode batch size.

If the TG batch size is larger than this value, the scheduler will continue to run TG batches. If it falls below, the scheduler will prioritize CE. Note that this is NOT a strict minimum! By default, this is max-queue-size-tg.

This is an experimental flag solely for the TTS scheduler. Do not use unless you know what you are doing.

model_config

property model_config: MAXModelConfig

pad_to_multiple_of

pad_to_multiple_of: int = 2

Pad input tensors to be a multiple of the value provided.

pdl_level

pdl_level: str = '0'

Level of overlap of kernel launch via programmatic dependent grid control.

pipeline_role

pipeline_role: PipelineRole = 'prefill_and_decode'

Whether the pipeline should serve a prefill role, a decode role, or both.

pool_embeddings

pool_embeddings: bool = True

Whether to pool embedding outputs.

profiling_config

property profiling_config: ProfilingConfig

resolve()

resolve()

Validates and resolves the config.

This method is called after the config is initialized, to ensure that all config fields have been initialized to a valid state.

Return type:

None

sampling_config

property sampling_config: SamplingConfig

target_num_new_tokens

target_num_new_tokens: int | None = None

The target number of un-encoded tokens to include in each batch. If not set, this will be set to a best-guess optimal value based on model, hardware, and available memory.

use_experimental_kernels

use_experimental_kernels: str = 'false'

PrependPromptSpeechTokens

class max.pipelines.lib.config.PrependPromptSpeechTokens(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

ALWAYS

ALWAYS = 'always'

Prepend the prompt speech tokens to all blocks of speech tokens sent to the audio decoder.

NEVER

NEVER = 'never'

Never prepend the prompt speech tokens sent to the audio decoder.

ONCE

ONCE = 'once'

Prepend the prompt speech tokens to the first block of the audio decoder.
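
PrependPromptSpeechTokens controls when the prompt speech tokens are prepended to the blocks sent to the audio decoder. Since the members carry string values, they can also be looked up from those strings (standard enum behavior, shown as a quick illustration):

from max.pipelines.lib.config import PrependPromptSpeechTokens

# Members can be referenced directly or constructed from their string values.
assert PrependPromptSpeechTokens("once") is PrependPromptSpeechTokens.ONCE
assert PrependPromptSpeechTokens.NEVER.value == "never"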