max CLI

The max CLI tool accelerates GenAI tasks by creating optimized inference pipelines with OpenAI-compatible endpoints. It supports models from Hugging Face as well as MAX Graph-optimized versions of models such as Llama 3.1, Mistral, and Replit Code.

Generate text or start an OpenAI-compatible endpoint with a single command.

Set up

Create a Python project to install our APIs and CLI tools.

  1. Create a project folder:

    mkdir quickstart && cd quickstart
  2. Create and activate a virtual environment:

    python3 -m venv .venv \
    && source .venv/bin/activate
  3. Install the modular Python package:

    pip install modular \
    --index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://dl.modular.com/public/max-nightly/python/simple/
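
To confirm the install, you can print the installed package version with Python's standard importlib.metadata module. This is a minimal check; it verifies only that pip installed the modular distribution, not that your hardware is supported:

import importlib.metadata

# Print the installed version of the modular package; raises
# importlib.metadata.PackageNotFoundError if the install failed.
print(importlib.metadata.version("modular"))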

Run your first model

Now that you have max installed, you can run your first model:

max generate --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--prompt "Generate a story about a robot"

Uninstall

To remove the modular Python package:

pip uninstall modular

Commands

max provides the following commands.

You can also print the available commands and documentation with --help. For example:

max --help
max serve --help

encode

Converts input text into embeddings for semantic search, text similarity, and NLP applications.

max encode [OPTIONS]

Example

Basic embedding generation:

max encode \
--model-path sentence-transformers/all-MiniLM-L6-v2 \
--prompt "Convert this text into embeddings"

generate

Performs text generation based on a provided prompt.

max generate [OPTIONS]

Examples

Text generation:

max generate \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--max-length 1024 \
--max-new-tokens 100 \
--prompt "Generate a story about a robot"

Text generation with controls:

max generate \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--max-length 1024 \
--max-new-tokens 500 \
--top-k 40 \
--quantization-encoding q4_k \
--cache-strategy paged \
--prompt "Explain quantum computing"

Process an image with a vision-language model by passing an image URL:

Llama 3.2 Vision

Llama Vision models take prompts with <|image|> and <|begin_of_text|> tokens. For more information, see the Llama 3.2 Vision documentation.

max generate \
--model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
--prompt "<|image|><|begin_of_text|>What is in this image?" \
--image_url https://upload.wikimedia.org/wikipedia/commons/5/53/Almendro_en_flor_Sierras_de_Tejeda%2C_Almijara_y_Alhama.jpg \
--max-new-tokens 100 \
--max-batch-size 1 \
--max-length 108172

Pixtral

Pixtral models take prompts with [IMG] tokens. For more information, see the Pixtral documentation.

max generate \
--model-path mistral-community/pixtral-12b \
--max-length 6491 \
--image_url https://upload.wikimedia.org/wikipedia/commons/5/53/Almendro_en_flor_Sierras_de_Tejeda%2C_Almijara_y_Alhama.jpg \
--prompt "<s>[INST]Describe the images.\n[IMG][/INST]"

For more information on how to use the generate command with vision models, see Generate image descriptions with Llama 3.2 Vision.

list

Displays available model architectures and configurations, including:

  • Hugging Face model repositories
  • Supported encoding types
  • Available cache strategies
max list

serve

Launches an OpenAI-compatible REST API server for production deployments. For more detail, see the Serve API docs.

max serve [OPTIONS]

Examples

CPU serving:

max serve \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF

Optimized GPU serving:

max serve \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--devices gpu \
--quantization-encoding bfloat16 \
--max-batch-size 4 \
--cache-strategy paged

Production setup:

max serve \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--devices gpu:0,1 \
--max-batch-size 8 \
--device-memory-utilization 0.9
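
With any of these servers running, requests go to the OpenAI-compatible REST API. Here is a minimal sketch of a chat completion request using Python's requests package, assuming the server's default address of http://localhost:8000 (check the server startup logs for the actual host and port):

import requests

# Chat completion against the OpenAI-compatible endpoint started by
# `max serve`. The model field matches the served --model-path.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
        "messages": [
            {"role": "user", "content": "Generate a story about a robot"},
        ],
        "max_tokens": 100,
    },
)
print(response.json()["choices"][0]["message"]["content"])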

warm-cache

Preloads and compiles the model to optimize initialization time by:

  • Pre-compiling models before deployment
  • Warming up the Hugging Face cache

This command is useful to run before serving a model.

max warm-cache [OPTIONS]

Example

Basic cache warming:

max warm-cache \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF

Configuration options

Model configuration

Core settings for model loading and execution.

| Option | Description | Default | Values |
| --- | --- | --- | --- |
| --engine | Backend engine | max | max, huggingface |
| --model-path TEXT | (required) Path to model | | Any valid path or Hugging Face repo ID (e.g. mistralai/Mistral-7B-v0.1) |
| --quantization-encoding | Weight encoding type | | float32, bfloat16, q4_k, q4_0, q6_k, gptq |
| --weight-path PATH | Custom model weights path | | Valid file path (supports multiple paths via repeated flags) |

Device configuration

Controls hardware placement and memory usage.

| Option | Description | Default | Values |
| --- | --- | --- | --- |
| --devices | Target devices | | cpu, gpu, gpu:{id} (e.g. gpu:0,1) |
| --device-specs | Specific device configuration | CPU | DeviceSpec format (e.g. DeviceSpec(id=-1, device_type='cpu')) |
| --device-memory-utilization | Device memory fraction | 0.9 | Float between 0.0 and 1.0 |
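
For example, with --device-memory-utilization 0.9 on a GPU that has 24 GB of memory, MAX would treat roughly 0.9 × 24 GB ≈ 21.6 GB as available for weights and KV cache, leaving the remainder for other processes.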

Performance tuning

Optimization settings for batch processing, caching, and sequence handling.

| Option | Description | Default | Values |
| --- | --- | --- | --- |
| --cache-strategy | Cache strategy | | naive, continuous |
| --kv-cache-page-size | Token count per KVCache page | 128 | Positive integer |
| --max-batch-size | Maximum cache size per batch | 1 | Positive integer |
| --max-ce-batch-size | Maximum context encoding batch size | 32 | Positive integer |
| --max-length | Maximum input sequence length | The Hugging Face model's default max length | Positive integer (must be less than the model's max config) |
| --max-new-tokens | Maximum tokens to generate | -1 | Integer (-1 for model max) |
| --pad-to-multiple-of | Input tensor padding multiple | 2 | Positive integer |

Model state control

Options for saving or loading model states and handling external code.

| Option | Description | Default | Values |
| --- | --- | --- | --- |
| --force-download | Force re-download of cached files | false | true, false |
| --trust-remote-code | Allow custom Hugging Face code | false | true, false |

Generation parameters

Controls for generation behavior.

| Option | Description | Default | Values |
| --- | --- | --- | --- |
| --enable-constrained-decoding | Enable constrained generation | false | true, false |
| --enable-echo | Enable model echo | false | true, false |
| --image_url | URLs of images to include with the prompt; ignored if the model doesn't support image inputs | [] | List of valid URLs |
| --rope-type | RoPE type for GGUF weights | | none, normal, neox |
| --top-k | Limit sampling to the top K tokens | 1 | Positive integer (1 for greedy sampling) |
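
As an illustration of the --top-k behavior, here is a minimal, self-contained sketch of top-k sampling over a hypothetical next-token distribution. This is not max's internal sampler; it only shows the idea, including why a value of 1 reduces to greedy sampling:

import random

def top_k_sample(probs, k):
    # Keep the k highest-probability tokens, then sample from the
    # renormalized reduced distribution.
    top = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    r = random.uniform(0, total)
    cumulative = 0.0
    for token_id, p in top:
        cumulative += p
        if r <= cumulative:
            return token_id
    return top[-1][0]  # guard against floating-point rounding

# Hypothetical next-token distribution over a 5-token vocabulary.
probs = [0.5, 0.2, 0.15, 0.1, 0.05]
print(top_k_sample(probs, k=3))  # samples among tokens 0, 1, and 2
print(top_k_sample(probs, k=1))  # always token 0: greedy sampling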