max CLI

The max CLI tool accelerates GenAI tasks by creating optimized inference pipelines with OpenAI-compatible endpoints. It supports models from Hugging Face and MAX Graph optimized versions of models like Llama 3.1, Mistral, and Replit Code.

Generate text or start an OpenAI-compatible endpoint with a single command using the max CLI tool.
Set up

Create a Python project to install our APIs and CLI tools, using pip, uv, or magic.

pip

- Create a project folder:

  mkdir quickstart && cd quickstart

- Create and activate a virtual environment:

  python3 -m venv .venv \
    && source .venv/bin/activate

- Install the modular Python package:

  Nightly:

  pip install modular \
    --index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://dl.modular.com/public/max-nightly/python/simple/

  Stable:

  pip install modular \
    --index-url https://download.pytorch.org/whl/cpu
uv

- Install uv:

  curl -LsSf https://astral.sh/uv/install.sh | sh

  Then restart your terminal to make uv accessible.

- Create a project:

  uv init quickstart && cd quickstart

- Create and start a virtual environment:

  uv venv && source .venv/bin/activate

- Install the modular Python package:

  Nightly:

  uv pip install modular \
    --index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://dl.modular.com/public/max-nightly/python/simple/

  Stable:

  uv pip install modular \
    --index-url https://download.pytorch.org/whl/cpu
magic

- Install magic:

  curl -ssL https://magic.modular.com/ | bash

  Then run the source command that's printed in your terminal.

- Create a project:

  magic init quickstart --format pyproject && cd quickstart

- Install the max-pipelines conda package:

  Nightly:

  magic add max-pipelines

  Stable:

  magic add "max-pipelines==25.3"

- Start the virtual environment:

  magic shell
Run your first model

Now that you have max installed, you can run your first model:

max generate --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --prompt "Generate a story about a robot"
Uninstall

To remove the modular Python package:

pip:

pip uninstall modular

uv:

uv pip uninstall modular

magic:

magic remove modular
Commands

max provides the following commands. You can also print the available commands and documentation with --help. For example:

max --help
max serve --help
encode

Converts input text into embeddings for semantic search, text similarity, and NLP applications.

max encode [OPTIONS]

Example

Basic embedding generation:

max encode \
  --model-path sentence-transformers/all-MiniLM-L6-v2 \
  --prompt "Convert this text into embeddings"
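For a text-similarity workflow, you might encode each piece of text separately and then compare the resulting vectors with your own code (for example, cosine similarity). A minimal sketch; the prompts below are only illustrative:

# Encode two related sentences; closer embeddings indicate more similar meaning.
max encode \
  --model-path sentence-transformers/all-MiniLM-L6-v2 \
  --prompt "How do I reset my password?"
max encode \
  --model-path sentence-transformers/all-MiniLM-L6-v2 \
  --prompt "I forgot my login credentials"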
generate

Performs text generation based on a provided prompt.

max generate [OPTIONS]

Examples

Text generation:

max generate \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --max-length 1024 \
  --max-new-tokens 100 \
  --prompt "Generate a story about a robot"

Text generation with controls:

max generate \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --max-length 1024 \
  --max-new-tokens 500 \
  --top-k 40 \
  --quantization-encoding q4_k \
  --cache-strategy paged \
  --prompt "Explain quantum computing"
Process an image using a vision-language model given a URL to an image:

Llama 3.2 Vision

Llama Vision models take prompts with <|image|> and <|begin_of_text|> tokens. For more information, see the Llama 3.2 Vision documentation.

max generate \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --prompt "<|image|><|begin_of_text|>What is in this image?" \
  --image_url https://upload.wikimedia.org/wikipedia/commons/5/53/Almendro_en_flor_Sierras_de_Tejeda%2C_Almijara_y_Alhama.jpg \
  --max-new-tokens 100 \
  --max-batch-size 1 \
  --max-length 108172
Pixtral

Pixtral models take prompts with [IMG] tokens. For more information, see the Pixtral documentation.

max generate \
  --model-path mistral-community/pixtral-12b \
  --max-length 6491 \
  --image_url https://upload.wikimedia.org/wikipedia/commons/5/53/Almendro_en_flor_Sierras_de_Tejeda%2C_Almijara_y_Alhama.jpg \
  --prompt "<s>[INST]Describe the images.\n[IMG][/INST]"

For more information on how to use the generate command with vision models, see Generate image descriptions with Llama 3.2 Vision.
list

Displays available model architectures and configurations, including:

- Hugging Face model repositories
- Supported encoding types
- Available cache strategies

max list
serve

Launches an OpenAI-compatible REST API server for production deployments. For more detail, see the Serve API docs.

max serve [OPTIONS]

Examples

CPU serving:

max serve \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF

Optimized GPU serving:

max serve \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --devices gpu \
  --quantization-encoding bfloat16 \
  --max-batch-size 4 \
  --cache-strategy paged

Production setup:

max serve \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --devices gpu:0,1 \
  --max-batch-size 8 \
  --device-memory-utilization 0.9
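Once the server is running, you can send requests from any OpenAI-compatible client. A minimal sketch using curl, assuming the server listens on localhost port 8000 (adjust the host, port, and model name to match your deployment):

# Hypothetical request to the OpenAI-compatible chat completions endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [{"role": "user", "content": "Generate a story about a robot"}]
  }'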
warm-cache

Preloads and compiles the model to optimize initialization time by:

- Pre-compiling models before deployment
- Warming up the Hugging Face cache

This command is useful to run before serving a model.

max warm-cache [OPTIONS]

Example

Basic cache warming:

max warm-cache \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF
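For example, a deployment script might warm the cache immediately before launching the server, so that model downloads and compilation happen ahead of the first request. A minimal sketch, reusing the model path from the examples above:

# Pre-compile the model and populate the Hugging Face cache, then start serving.
max warm-cache --model-path modularai/Llama-3.1-8B-Instruct-GGUF
max serve --model-path modularai/Llama-3.1-8B-Instruct-GGUF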
Configuration options

Model configuration

Core settings for model loading and execution.

Option | Description | Default | Values |
---|---|---|---|
--engine | Backend engine | max | max, huggingface |
--model-path TEXT | (required) Path to model | | Any valid path or Hugging Face repo ID (e.g. mistralai/Mistral-7B-v0.1) |
--quantization-encoding | Weight encoding type | | float32, bfloat16, q4_k, q4_0, q6_k, gptq |
--weight-path PATH | Custom model weights path | | Valid file path (supports multiple paths via repeated flags) |
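As a sketch of the repeated --weight-path flag described above, you can pass the flag multiple times to point at local weight files; the file paths below are hypothetical placeholders:

# Hypothetical local weight files; replace with paths to your own weights.
max generate \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --weight-path /path/to/weights-part1.gguf \
  --weight-path /path/to/weights-part2.gguf \
  --prompt "Generate a story about a robot"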
Device configuration

Controls hardware placement and memory usage.

Option | Description | Default | Values |
---|---|---|---|
--devices | Target devices | | cpu, gpu, gpu:{id} (e.g. gpu:0,1) |
--device-specs | Specific device configuration | CPU | DeviceSpec format (e.g. DeviceSpec(id=-1, device_type='cpu')) |
--device-memory-utilization | Device memory fraction | 0.9 | Float between 0.0 and 1.0 |
Performance tuning

Optimization settings for batch processing, caching, and sequence handling.

Option | Description | Default | Values |
---|---|---|---|
--cache-strategy | Cache strategy | | naive, continuous |
--kv-cache-page-size | Token count per KVCache page | 128 | Positive integer |
--max-batch-size | Maximum cache size per batch | 1 | Positive integer |
--max-ce-batch-size | Maximum context encoding batch size | 32 | Positive integer |
--max-length | Maximum input sequence length | The Hugging Face model's default max length | Positive integer (must be less than the model's max config) |
--max-new-tokens | Maximum tokens to generate | -1 | Integer (-1 for model max) |
--pad-to-multiple-of | Input tensor padding multiple | 2 | Positive integer |
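As an illustration of how these options combine when serving, here is a sketch with placeholder values (not tuned recommendations):

# Illustrative values only; tune batch sizes and lengths for your workload.
max serve \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --max-batch-size 8 \
  --max-ce-batch-size 16 \
  --max-length 2048 \
  --kv-cache-page-size 128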
Model state control

Options for saving or loading model states and handling external code.

Option | Description | Default | Values |
---|---|---|---|
--force-download | Force re-download of cached files | false | true, false |
--trust-remote-code | Allow custom Hugging Face code | false | true, false |
Generation parameters

Controls for generation behavior.

Option | Description | Default | Values |
---|---|---|---|
--enable-constrained-decoding | Enable constrained generation | false | true, false |
--enable-echo | Enable model echo | false | true, false |
--image_url | URLs of images to include with the prompt; ignored if the model doesn't support image inputs | [] | List of valid URLs |
--rope-type | RoPE type for GGUF weights | | none, normal, neox |
--top-k | Limit sampling to the top K tokens | 1 | Positive integer (1 for greedy sampling) |