Python module
config
Standardized configuration for Pipeline Inference.
HuggingFaceRepo
class max.pipelines.config.HuggingFaceRepo(repo_id: 'str', trust_remote_code: 'bool' = False, repo_type: 'Optional[RepoType]' = None)
download()
download(filename: str, force_download: bool = False) → Path
encoding_for_file()
encoding_for_file(file: str | Path) → SupportedEncoding
file_exists()
files_for_encoding()
files_for_encoding(encoding: SupportedEncoding, weights_format: WeightsFormat | None = None, alternate_encoding: SupportedEncoding | None = None) → dict[max.graph.weights.format.WeightsFormat, list[pathlib.Path]]
formats_available
property formats_available*: list[max.graph.weights.format.WeightsFormat]*
info
property info*: ModelInfo*
repo_id
repo_id*: str*
repo_type
size_of()
supported_encodings
property supported_encodings*: list[max.pipelines.config.SupportedEncoding]*
trust_remote_code
trust_remote_code*: bool* = False
weight_files
property weight_files*: dict[max.graph.weights.format.WeightsFormat, list[str]]*
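The snippet below is a minimal sketch of how these members fit together, using only the signatures documented above; the repo id is illustrative, not a real repository.

```python
from max.pipelines.config import HuggingFaceRepo, SupportedEncoding

# Hypothetical repo id, for illustration only.
repo = HuggingFaceRepo(repo_id="example-org/example-model")

# Inspect which encodings and weight formats the repo provides.
print(repo.supported_encodings)
print(repo.formats_available)

# Resolve weight files for a specific encoding; the result is keyed by
# weights format, as documented for files_for_encoding().
files = repo.files_for_encoding(encoding=SupportedEncoding.bfloat16)
for fmt, paths in files.items():
    print(fmt, [p.name for p in paths])
```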
KVCacheConfig
class max.pipelines.config.KVCacheConfig(cache_strategy: 'KVCacheStrategy' = <KVCacheStrategy.MODEL_DEFAULT: 'model_default'>, kv_cache_page_size: 'int' = 128, enable_prefix_caching: 'bool' = False, device_memory_utilization: 'float' = 0.9, _available_cache_memory: 'Optional[int]' = None)
cache_strategy
cache_strategy*: KVCacheStrategy* = 'model_default'
The cache strategy to use. This defaults to model_default, which will set the cache strategy based on the default strategy for the architecture requested. You can also force the engine to use a specific caching strategy: naive | continuous | paged.
device_memory_utilization
device_memory_utilization*: float* = 0.9
The fraction of available device memory that the process should consume.
This is used to inform the size of the KVCache workspace: the workspace is sized as the available device memory multiplied by this fraction.
enable_prefix_caching
enable_prefix_caching*: bool* = False
Whether to enable prefix caching for the paged attention KVCache.
help()
Documentation for this config class. Return a dictionary of config options and their descriptions.
kv_cache_page_size
kv_cache_page_size*: int* = 128
The number of tokens in a single page in the paged KVCache.
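As a rough sketch (using only the fields documented above, and leaving cache_strategy at its model default):

```python
from max.pipelines.config import KVCacheConfig

kv_config = KVCacheConfig(
    kv_cache_page_size=128,         # tokens per page in the paged KVCache
    enable_prefix_caching=True,     # reuse cached prefixes for paged attention
    device_memory_utilization=0.9,  # fraction of device memory for the workspace
)
print(kv_config.help())  # dictionary of config options and their descriptions
```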
MAXConfig
class max.pipelines.config.MAXConfig
Abstract base class for all MAX configs.
There are some invariants that MAXConfig classes should follow:
- All config classes should be dataclasses.
- All config classes should have a help() method that returns a dictionary of config options and their descriptions.
- All config classes' dataclass fields should have default values, and hence can be trivially initialized via cls().
- All config classes should be frozen (except KVCacheConfig for now), to avoid accidental modification of config objects.
- All config classes must have mutually exclusive dataclass fields among themselves.
A minimal sketch following these invariants is shown after the help() entry below.
help()
Documentation for this config class. Return a dictionary of config options and their descriptions.
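The following sketch illustrates the invariants above with a hypothetical config class; the class name and fields are not part of the MAX API, and a real config would subclass MAXConfig.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExampleConfig:
    # All fields have defaults, so ExampleConfig() is trivially constructible.
    enable_feature: bool = False
    feature_threshold: float = 0.5

    @staticmethod
    def help() -> dict[str, str]:
        # Dictionary of config options and their descriptions.
        return {
            "enable_feature": "Whether to enable the hypothetical feature.",
            "feature_threshold": "Threshold controlling the hypothetical feature.",
        }

config = ExampleConfig()  # frozen: later assignment raises FrozenInstanceError
```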
MAXModelConfig
class max.pipelines.config.MAXModelConfig(model_path: str = '', huggingface_repo_id: str = '', weight_path: list[pathlib.Path] = <factory>, quantization_encoding: ~max.pipelines.config.SupportedEncoding | None = None, huggingface_revision: str = 'main', trust_remote_code: bool = False, device_specs: list[max.driver.driver.DeviceSpec] = <factory>, force_download: bool = False, _weights_repo_id: str | None = None, _quant_config: ~max.graph.quantization.QuantizationConfig | None = None)
Abstract base class for all MAX model configs.
This class is used to configure a model to use for a pipeline.
device_specs
device_specs*: list[max.driver.driver.DeviceSpec]*
Devices to run inference upon. This option is not documented in help()
as it shouldn’t be used directly via the CLI entrypoint.
finalize_encoding_config()
finalize_encoding_config()
force_download
force_download*: bool* = False
Whether to force download a given file if it’s already present in the local cache.
graph_quantization_encoding
property graph_quantization_encoding*: QuantizationEncoding | None*
Converts the CLI encoding to a MAX Graph quantization encoding.
Returns: The graph quantization encoding corresponding to the CLI encoding.
Raises: ValueError – If no CLI encoding was specified.
help()
Documentation for this config class. Return a dictionary of config options and their descriptions.
huggingface_repo_id
huggingface_repo_id*: str* = ''
DEPRECATED: repo_id of a Hugging Face model repository to use. Use model_path instead.
huggingface_revision
huggingface_revision*: str* = 'main'
Branch or Git revision of Hugging Face model repository to use.
huggingface_weights_repo()
huggingface_weights_repo() → HuggingFaceRepo
model_path
model_path*: str* = ''
repo_id of a Hugging Face model repository to use.
quantization_encoding
quantization_encoding*: SupportedEncoding | None* = None
Weight encoding type.
trust_remote_code
trust_remote_code*: bool* = False
Whether or not to allow for custom modelling files on Hugging Face.
validate()
validate()
Validates the config.
This method is called after the model config is initialized, to ensure that all config fields have been initialized to a valid state. It will also set and update other fields which may not be determined / initialized in the default factory.
weight_path
weight_path*: list[pathlib.Path]*
Optional path or url of the model weights to use.
weights_size()
weights_size() → int
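A sketch of constructing a model config from the documented fields; the repo id is illustrative.

```python
from max.pipelines.config import MAXModelConfig, SupportedEncoding

model_config = MAXModelConfig(
    model_path="example-org/example-model",  # Hugging Face repo_id (illustrative)
    quantization_encoding=SupportedEncoding.bfloat16,
    trust_remote_code=False,
)
model_config.validate()                          # ensure fields are in a valid state
repo = model_config.huggingface_weights_repo()   # HuggingFaceRepo holding the weights
print(model_config.weights_size())
```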
PipelineConfig
class max.pipelines.config.PipelineConfig(**kwargs: Any)
Configuration for a pipeline.
WIP - Once a PipelineConfig is fully initialized, it should be as immutable as possible (frozen=True). All underlying dataclass fields should have been initialized to their default values, whether user-specified via a CLI flag, config file, or environment variable, or internally set to a reasonable default.
draft_model
Draft model for use during Speculative Decoding.
enable_chunked_prefill
enable_chunked_prefill*: bool* = True
Enable chunked prefill to split context encoding requests into multiple chunks based on ‘target_num_new_tokens’.
enable_echo
enable_echo*: bool* = False
Whether the model should be built with echo capabilities.
enable_in_flight_batching
enable_in_flight_batching*: bool* = False
When enabled, prioritizes token generation by batching it with context encoding requests. Requires chunked prefill.
engine
engine*: PipelineEngine | None* = None
Engine backend to use for serving, ‘max’ for the max engine, or ‘huggingface’ as fallback option for improved model coverage.
graph_quantization_encoding
property graph_quantization_encoding*: QuantizationEncoding | None*
Converts the CLI encoding to a MAX graph quantization encoding.
Returns: The graph quantization encoding corresponding to the CLI encoding.
help()
Documentation for this config class. Return a dictionary of config options and their descriptions.
kv_cache_config
property kv_cache_config*: KVCacheConfig*
max_batch_size
Maximum batch size to execute with the model. This defaults to 1 to minimize memory consumption for the base case of running a local server to try out MAX. Users launching in a server scenario should set this value higher based on server capacity.
max_cache_batch_size
DEPRECATED: The maximum cache batch size to use for the model. Use max_batch_size instead.
max_ce_batch_size
max_ce_batch_size*: int* = 192
Maximum cache size to reserve for a single context encoding batch. The actual limit is the lesser of this and max_batch_size.
max_length
Maximum sequence length of the model.
max_new_tokens
max_new_tokens*: int* = -1
Maximum number of new tokens to generate during a single inference pass of the model.
max_num_steps
max_num_steps*: int* = -1
The number of steps to run for multi-step scheduling. -1 specifies a default value based on configuration and platform. Ignored for models which are not auto-regressive (e.g. embedding models).
model_config
property model_config*: MAXModelConfig*
pad_to_multiple_of
pad_to_multiple_of*: int* = 2
Pad input tensors to a multiple of the provided value.
pool_embeddings
pool_embeddings*: bool* = True
Whether to pool embedding outputs.
profiling_config
property profiling_config*: ProfilingConfig*
rope_type
Force using a specific rope type: none | normal | neox. Only matters for GGUF weights.
sampling_config
property sampling_config*: SamplingConfig*
save_to_serialized_model_path
DEPRECATED: Serialization paths are no longer supported.
serialized_model_path
DEPRECATED: Serialization paths are no longer supported.
target_num_new_tokens
The target number of un-encoded tokens to include in each batch. If not set, this will be set to a best-guess optimal value based on model, hardware, and available memory.
use_experimental_kernels
use_experimental_kernels*: str* = 'false'
validate()
validate() → None
Validate the config.
This method is called after the config is initialized, to ensure that all config fields have been initialized to a valid state.
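A sketch of constructing a pipeline config; this assumes the keyword arguments map onto the dataclass fields documented above (and onto the underlying MAXModelConfig fields such as model_path), and the repo id is illustrative.

```python
from max.pipelines.config import PipelineConfig

pipeline_config = PipelineConfig(
    model_path="example-org/example-model",  # forwarded to MAXModelConfig (assumed)
    max_batch_size=8,
    enable_chunked_prefill=True,
    max_num_steps=-1,                        # pick a platform-based default
)
pipeline_config.validate()
print(pipeline_config.model_config.model_path)
print(pipeline_config.kv_cache_config.kv_cache_page_size)
```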
PipelineEngine
class max.pipelines.config.PipelineEngine(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
HUGGINGFACE
HUGGINGFACE = 'huggingface'
MAX
MAX = 'max'
ProfilingConfig
class max.pipelines.config.ProfilingConfig(gpu_profiling: 'GPUProfilingMode' = <GPUProfilingMode.OFF: 'off'>)
gpu_profiling
gpu_profiling*: GPUProfilingMode* = 'off'
Whether to enable GPU profiling of the model.
help()
Documentation for this config class. Return a dictionary of config options and their descriptions.
RepoType
class max.pipelines.config.RepoType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
local
local = 'local'
online
online = 'online'
RopeType
class max.pipelines.config.RopeType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
neox
neox = 'neox'
none
none = 'none'
normal
normal = 'normal'
SamplingConfig
class max.pipelines.config.SamplingConfig(top_k: 'int' = 1, enable_structured_output: 'bool' = False, in_dtype: 'DType' = float32, out_dtype: 'DType' = float32)
enable_structured_output
enable_structured_output*: bool* = False
Enable structured generation/guided decoding for the server. This allows the user to pass a json schema in the response_format field, which the LLM will adhere to.
help()
Documentation for this config class. Return a dictionary of config options and their descriptions.
in_dtype
in_dtype*: DType* = float32
The data type of the input tokens.
out_dtype
out_dtype*: DType* = float32
The data type of the output logits.
top_k
top_k*: int* = 1
Limits the sampling to the K most probable tokens. This defaults to 1, which enables greedy sampling.
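For example, greedy sampling with structured output enabled (a sketch; in_dtype and out_dtype keep their float32 defaults):

```python
from max.pipelines.config import SamplingConfig

sampling = SamplingConfig(top_k=1, enable_structured_output=True)
print(sampling.help())
```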
SupportedEncoding
class max.pipelines.config.SupportedEncoding(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
All possible encodings which may be supported by a particular model.
bfloat16
bfloat16 = 'bfloat16'
cache_dtype
property cache_dtype*: DType*
The underlying dtype used in the kvcache for correctness.
dtype
property dtype*: DType*
The underlying model dtype associated with a quantization_encoding.
float32
float32 = 'float32'
gptq
gptq = 'gptq'
parse_from_file_name()
classmethod parse_from_file_name(name: str)
q4_0
q4_0 = 'q4_0'
q4_k
q4_k = 'q4_k'
q6_k
q6_k = 'q6_k'
quantization_encoding
property quantization_encoding*: QuantizationEncoding | None*
supported_on()
supported_on(device_spec: DeviceSpec) → bool
Returns whether this quantization encoding is supported on a device.
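A sketch of querying an encoding; importing DeviceSpec from max.driver and using DeviceSpec.cpu() are assumptions about the driver API, not part of this module's documentation.

```python
from max.driver import DeviceSpec
from max.pipelines.config import SupportedEncoding

encoding = SupportedEncoding.bfloat16
print(encoding.dtype)        # underlying model dtype for this encoding
print(encoding.cache_dtype)  # dtype used in the KVCache

# Check device support; DeviceSpec.cpu() is an assumed constructor.
if encoding.supported_on(DeviceSpec.cpu()):
    print("bfloat16 is supported on CPU")
```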