max CLI

The max CLI tool accelerates GenAI tasks by creating optimized inference pipelines with OpenAI-compatible endpoints. It supports models from Hugging Face and MAX Graph optimized versions of models like Llama 3.1, Mistral, and Replit Code.

Generate text or start an OpenAI-compatible endpoint with a single command using the max CLI tool.
Set up

Create a Python project to install our APIs and CLI tools, using pip, uv, or magic.

pip

- Create a project folder:

  mkdir quickstart && cd quickstart

- Create and activate a virtual environment:

  python3 -m venv .venv \
    && source .venv/bin/activate

- Install the modular Python package:

  Nightly:

  pip install modular \
    --index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://dl.modular.com/public/max-nightly/python/simple/

  Stable:

  pip install modular \
    --index-url https://download.pytorch.org/whl/cpu
uv

- Install uv:

  curl -LsSf https://astral.sh/uv/install.sh | sh

  Then restart your terminal to make uv accessible.

- Create a project:

  uv init quickstart && cd quickstart

- Create and start a virtual environment:

  uv venv && source .venv/bin/activate

- Install the modular Python package:

  Nightly:

  uv pip install modular \
    --index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://dl.modular.com/public/max-nightly/python/simple/

  Stable:

  uv pip install modular \
    --index-url https://download.pytorch.org/whl/cpu
magic

- Install magic:

  curl -ssL https://magic.modular.com/ | bash

  Then run the source command that's printed in your terminal.

- Create a project:

  magic init quickstart --format pyproject && cd quickstart

- Install the max-pipelines conda package:

  Nightly:

  magic add max-pipelines

  Stable:

  magic add "max-pipelines==25.3"

- Start the virtual environment:

  magic shell
Run your first model

Now that you have max installed, you can run your first model:

max generate --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --prompt "Generate a story about a robot"
Uninstall

To remove the modular Python package:

pip:

pip uninstall modular

uv:

uv pip uninstall modular

magic:

magic remove modular
Commands

max provides the following commands. You can also print the available commands and documentation with --help. For example:

max --help
max serve --help
encode

Converts input text into embeddings for semantic search, text similarity, and NLP applications.

max encode [OPTIONS]

Example

Basic embedding generation:

max encode \
  --model-path sentence-transformers/all-MiniLM-L6-v2 \
  --prompt "Convert this text into embeddings"
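For a text-similarity workflow, you might encode each piece of text separately and then compare the resulting vectors with your own code (for example, cosine similarity). A minimal sketch; the prompts below are only illustrative:

# Encode two related sentences; closer embeddings indicate more similar meaning.
max encode \
  --model-path sentence-transformers/all-MiniLM-L6-v2 \
  --prompt "How do I reset my password?"
max encode \
  --model-path sentence-transformers/all-MiniLM-L6-v2 \
  --prompt "I forgot my login credentials"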
generate

Performs text generation based on a provided prompt.

max generate [OPTIONS]

Examples

Text generation:

max generate \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --max-length 1024 \
  --max-new-tokens 100 \
  --prompt "Generate a story about a robot"

Text generation with controls:

max generate \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --max-length 1024 \
  --max-new-tokens 500 \
  --top-k 40 \
  --quantization-encoding q4_k \
  --cache-strategy paged \
  --prompt "Explain quantum computing"
Process an image using a vision-language model given a URL to an image:

Llama 3.2 Vision

Llama Vision models take prompts with <|image|> and <|begin_of_text|> tokens. For more information, see the Llama 3.2 Vision documentation.

max generate \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --prompt "<|image|><|begin_of_text|>What is in this image?" \
  --image_url https://upload.wikimedia.org/wikipedia/commons/5/53/Almendro_en_flor_Sierras_de_Tejeda%2C_Almijara_y_Alhama.jpg \
  --max-new-tokens 100 \
  --max-batch-size 1 \
  --max-length 108172
Pixtral

Pixtral models take prompts with [IMG] tokens. For more information, see the Pixtral documentation.

max generate \
  --model-path mistral-community/pixtral-12b \
  --max-length 6491 \
  --image_url https://upload.wikimedia.org/wikipedia/commons/5/53/Almendro_en_flor_Sierras_de_Tejeda%2C_Almijara_y_Alhama.jpg \
  --prompt "<s>[INST]Describe the images.\n[IMG][/INST]"

For more information on how to use the generate command with vision models, see Generate image descriptions with Llama 3.2 Vision.
list

Displays available model architectures and configurations, including:

- Hugging Face model repositories
- Supported encoding types
- Available cache strategies

max list
serve

Launches an OpenAI-compatible REST API server for production deployments. For more detail, see the Serve API docs.

max serve [OPTIONS]

Examples

CPU serving:

max serve \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF

Optimized GPU serving:

max serve \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --devices gpu \
  --quantization-encoding bfloat16 \
  --max-batch-size 4 \
  --cache-strategy paged

Production setup:

max serve \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --devices gpu:0,1 \
  --max-batch-size 8 \
  --device-memory-utilization 0.9
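Once the server is running, you can send requests from any OpenAI-compatible client. A minimal sketch using curl, assuming the server listens on localhost port 8000 (adjust the host, port, and model name to match your deployment):

# Hypothetical request to the OpenAI-compatible chat completions endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [{"role": "user", "content": "Generate a story about a robot"}]
  }'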
warm-cache

Preloads and compiles the model to optimize initialization time by:

- Pre-compiling models before deployment
- Warming up the Hugging Face cache

This command is useful to run before serving a model.

max warm-cache [OPTIONS]

Example

Basic cache warming:

max warm-cache \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF
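For example, a deployment script might warm the cache immediately before launching the server, so that model downloads and compilation happen ahead of the first request. A minimal sketch, reusing the model path from the examples above:

# Pre-compile the model and populate the Hugging Face cache, then start serving.
max warm-cache --model-path modularai/Llama-3.1-8B-Instruct-GGUF
max serve --model-path modularai/Llama-3.1-8B-Instruct-GGUF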
Configuration options

Model configuration

Core settings for model loading and execution.

Option | Description | Default | Values |
---|---|---|---|
--engine | Backend engine | max | max, huggingface |
--model-path TEXT | (required) Path to model | | Any valid path or Hugging Face repo ID (e.g. mistralai/Mistral-7B-v0.1) |
--quantization-encoding | Weight encoding type | | float32, bfloat16, q4_k, q4_0, q6_k, gptq |
--weight-path PATH | Custom model weights path | | Valid file path (supports multiple paths via repeated flags) |
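As a sketch of the repeated --weight-path flag described above, you can pass the flag multiple times to point at local weight files; the file paths below are hypothetical placeholders:

# Hypothetical local weight files; replace with paths to your own weights.
max generate \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --weight-path /path/to/weights-part1.gguf \
  --weight-path /path/to/weights-part2.gguf \
  --prompt "Generate a story about a robot"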
Device configuration

Controls hardware placement and memory usage.

Option | Description | Default | Values |
---|---|---|---|
--devices | Target devices | | cpu, gpu, gpu:{id} (e.g. gpu:0,1) |
--device-specs | Specific device configuration | CPU | DeviceSpec format (e.g. DeviceSpec(id=-1, device_type='cpu')) |
--device-memory-utilization | Device memory fraction | 0.9 | Float between 0.0 and 1.0 |
Performance tuning

Optimization settings for batch processing, caching, and sequence handling.

Option | Description | Default | Values |
---|---|---|---|
--cache-strategy | Cache strategy | | naive, continuous |
--kv-cache-page-size | Token count per KVCache page | 128 | Positive integer |
--max-batch-size | Maximum cache size per batch | 1 | Positive integer |
--max-ce-batch-size | Maximum context encoding batch size | 32 | Positive integer |
--max-length | Maximum input sequence length | The Hugging Face model's default max length | Positive integer (must be less than the model's max config) |
--max-new-tokens | Maximum tokens to generate | -1 | Integer (-1 for model max) |
--pad-to-multiple-of | Input tensor padding multiple | 2 | Positive integer |
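As an illustration of how these options combine when serving, here is a sketch with placeholder values (not tuned recommendations):

# Illustrative values only; tune batch sizes and lengths for your workload.
max serve \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --max-batch-size 8 \
  --max-ce-batch-size 16 \
  --max-length 2048 \
  --kv-cache-page-size 128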
Model state control

Options for saving or loading model states and handling external code.

Option | Description | Default | Values |
---|---|---|---|
--force-download | Force re-download of cached files | false | true, false |
--trust-remote-code | Allow custom Hugging Face code | false | true, false |
Generation parameters

Controls for generation behavior.

Option | Description | Default | Values |
---|---|---|---|
--enable-constrained-decoding | Enable constrained generation | false | true, false |
--enable-echo | Enable model echo | false | true, false |
--image_url | URLs of images to include with the prompt; ignored if the model doesn't support image inputs | [] | List of valid URLs |
--rope-type | RoPE type for GGUF weights | | none, normal, neox |
--top-k | Limit sampling to the top K tokens | 1 | Positive integer (1 for greedy sampling) |