max CLI

The max CLI tool accelerates GenAI tasks by creating optimized inference pipelines with OpenAI-compatible endpoints. It supports models from Hugging Face and MAX Graph-optimized versions of models such as Llama 3.1, Mistral, and Replit Code.

Generate text or start an OpenAI-compatible endpoint with a single command using the max CLI tool.

Install

Create a Python project to install our APIs and the max CLI.

  1. If you don't have it, install pixi:
    curl -fsSL https://pixi.sh/install.sh | sh

    Then restart your terminal for the changes to take effect.

  2. Create a project:
    pixi init example-project \
      -c https://conda.modular.com/max-nightly/ -c conda-forge \
      && cd example-project
  3. Install the modular conda package:
    pixi add modular
  4. Start the virtual environment:
    pixi shell

When you install the modular package, you'll get access to the max CLI tool automatically. You can check your version like this:

max --version

Run your first model

Now that you have max installed, you can run your first model:

max generate --model modularai/Llama-3.1-8B-Instruct-GGUF \
  --prompt "Generate a story about a robot"

Commands

max provides the following commands.

You can also print the available commands and documentation with --help. For example:

max --help
max serve --help

benchmark

Runs comprehensive benchmark tests on an active model server to measure performance metrics including throughput, latency, and resource utilization.

max benchmark [OPTIONS]

Before running this command, make sure the model server is already running (start it with max serve).
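
For example, you can start a compatible server in a separate terminal, using the same model as the benchmark example below:

max serve \
  --model modularai/Llama-3.1-8B-Instruct-GGUF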

Example

Benchmark the Llama-3.1-8B-Instruct-GGUF model already running on localhost:

max benchmark \
  --model modularai/Llama-3.1-8B-Instruct-GGUF \
  --backend modular \
  --endpoint /v1/chat/completions \
  --host localhost \
  --port 8000 \
  --num-prompts 50 \
  --dataset-name arxiv-summarization \
  --arxiv-summarization-input-len 12000 \
  --max-output-len 1200

Instead of passing all the benchmark options on the command line, you can pass a configuration file. See Configuration file below.

Options

This list of options is not exhaustive. For more information, run max benchmark --help or see the benchmarking script source code.

  • Backend configuration:

    • --backend: Choose from modular (MAX v1/completions endpoint), modular-chat (MAX v1/chat/completions endpoint), vllm (vLLM), or trt-llm (TensorRT-LLM)

    • --model: Hugging Face model ID or local path

  • Load generation:

    • --num-prompts: Number of prompts to process (int, default: 500)

    • --request-rate: Request rate in requests/second (int, default: inf)

    • --seed: The random seed used to sample the dataset (int, default: 0)

  • Serving options:

    • --base-url: Base URL of the API service

    • --endpoint: Specific API endpoint (/v1/completions or /v1/chat/completions)

    • --tokenizer: Hugging Face tokenizer to use (can be different from model)

    • --dataset-name: (Required; default: sharegpt) Specifies which type of benchmark dataset to use. This determines the dataset class and processing logic. See Datasets below.

    • --dataset-path: Path to a local dataset file that overrides the default dataset source for the specified dataset-name. The file format must match the expected format for the specified dataset-name (such as JSON for axolotl, JSONL for obfuscated-conversations, plain text for sonnet).

  • Additional options:

    • --collect-gpu-stats: Report GPU utilization and memory consumption. Only works when running max benchmark on the same instance as the server, and only on NVIDIA GPUs.

    • --save-results: Saves results to a local JSON file.

    • --config-file: Path to a YAML file containing key-value pairs for all your benchmark configurations, as a replacement for individual command-line options. See Configuration file below.

Output

Here's an explanation of the most important metrics printed upon completion:

  • Request throughput: Number of complete requests processed per second
  • Input token throughput: Number of input tokens processed per second
  • Output token throughput: Number of tokens generated per second
  • TTFT: Time to first token—the time from request start to first token generation
  • TPOT: Time per output token—the average time taken to generate each output token
  • ITL: Inter-token latency—the average time between consecutive token or token-chunk generations

If --collect-gpu-stats is set, you'll also see these:

  • GPU utilization: Percentage of time during which at least one GPU kernel is being executed
  • Peak GPU memory used: Peak memory usage during benchmark run

Datasets

The --dataset-name option supports the following dataset names and formats for benchmarking:

  • arxiv-summarization - Research paper summarization dataset containing academic papers with abstracts for training summarization models, from Hugging Face Datasets.

  • axolotl - Local dataset in Axolotl format with conversation segments labeled as human/assistant text.

  • code_debug - Long-context code debugging dataset containing code with multiple choice debugging questions for testing long-context understanding, from Hugging Face Datasets.

  • obfuscated-conversations - Local dataset with obfuscated conversation data. You must pair this with the --dataset-path option to specify the local JSONL file.

  • random - Synthetically generated random dataset that creates random token sequences with configurable input/output lengths and distributions.

  • sharegpt - Conversational dataset containing human-AI conversations for chat model evaluation, from Hugging Face Datasets.

  • sonnet - Poetry dataset using local text files containing poem lines.

  • vision-arena - Vision-language benchmark dataset containing images with associated questions for multimodal model evaluation, from Hugging Face Datasets.

You can override the default dataset source for any of these using the --dataset-path option (except for generated datasets like random), but you must always specify a --dataset-name so the tool knows how to process the dataset format.
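
For example, here's a minimal sketch of benchmarking against a local sonnet-style text file (the file path is hypothetical; the other options match the benchmark example above):

max benchmark \
  --model modularai/Llama-3.1-8B-Instruct-GGUF \
  --backend modular \
  --host localhost \
  --port 8000 \
  --num-prompts 50 \
  --dataset-name sonnet \
  --dataset-path ./my-poem-lines.txt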

Configuration file

The --config-file option allows you to specify a YAML file containing all your benchmark configurations, as a replacement for individual command line options. Simply define all the configuration options (corresponding to the max benchmark command line options) in a YAML file, all nested under the benchmark_config key.

For example, without a configuration file, you must specify all configurations with command line options like this:

max benchmark \
  --model google/gemma-3-27b-it \
  --backend modular \
  --endpoint /v1/chat/completions \
  --host localhost \
  --port 8000 \
  --num-prompts 50 \
  --dataset-name arxiv-summarization \
  --arxiv-summarization-input-len 12000 \
  --max-output-len 1200

Instead, you can create a configuration file:

gemma-benchmark.yaml
benchmark_config:
  model: google/gemma-3-27b-it
  backend: modular
  endpoint: /v1/chat/completions
  host: localhost
  port: 8000
  num_prompts: 50
  dataset_name: arxiv-summarization
  arxiv_summarization_input_len: 12000
  max_output_len: 1200

And then run the benchmark by passing that file:

max benchmark --config-file gemma-benchmark.yaml

For more information about running benchmarks, see the benchmarking tutorial.

encode

Converts input text into embeddings for semantic search, text similarity, and NLP applications.

max encode [OPTIONS]

Example

Basic embedding generation:

max encode \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --prompt "Convert this text into embeddings"

generate

Performs text generation based on a provided prompt.

max generate [OPTIONS]

Examples

Text generation:

max generate \
  --model modularai/Llama-3.1-8B-Instruct-GGUF \
  --max-length 1024 \
  --max-new-tokens 100 \
  --prompt "Generate a story about a robot"

Text generation with controls:

max generate \
  --model modularai/Llama-3.1-8B-Instruct-GGUF \
  --max-length 1024 \
  --max-new-tokens 500 \
  --top-k 40 \
  --temperature 0.7 \
  --seed 42 \
  --quantization-encoding q4_k \
  --cache-strategy paged \
  --prompt "Explain quantum computing"

Process an image with a vision-language model by passing an image URL:

Llama 3.2 Vision

Llama Vision models take prompts with <|image|> and <|begin_of_text|> tokens. For more information, see the Llama 3.2 Vision documentation.

max generate \
  --model meta-llama/Llama-3.2-11B-Vision-Instruct \
  --prompt "<|image|><|begin_of_text|>What is in this image?" \
  --image_url https://upload.wikimedia.org/wikipedia/commons/5/53/Almendro_en_flor_Sierras_de_Tejeda%2C_Almijara_y_Alhama.jpg \
  --max-new-tokens 100 \
  --max-batch-size 1 \
  --max-length 108172

Pixtral

Pixtral models take prompts with [IMG] tokens. For more information, see the Pixtral documentation.

max generate \
  --model mistral-community/pixtral-12b \
  --max-length 6491 \
  --image_url https://upload.wikimedia.org/wikipedia/commons/5/53/Almendro_en_flor_Sierras_de_Tejeda%2C_Almijara_y_Alhama.jpg \
  --prompt "<s>[INST]Describe the images.\n[IMG][/INST]"

For more information on how to use the generate command with vision models, see Generate image descriptions with Llama 3.2 Vision.

list

Displays available model architectures and configurations, including:

  • Hugging Face model repositories
  • Supported encoding types
  • Available cache strategies

max list

serve

Launches an OpenAI-compatible REST API server for production deployments. For more detail, see the Serve API docs.

max serve [OPTIONS]

Examples

CPU serving:

max serve \
  --model modularai/Llama-3.1-8B-Instruct-GGUF

Optimized GPU serving:

max serve \
  --model modularai/Llama-3.1-8B-Instruct-GGUF \
  --devices gpu \
  --quantization-encoding bfloat16 \
  --max-batch-size 4 \
  --cache-strategy paged

Production setup:

max serve \
  --model modularai/Llama-3.1-8B-Instruct-GGUF \
  --devices gpu:0,1 \
  --max-batch-size 8 \
  --device-memory-utilization 0.9
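
Once the server is running, clients can call its OpenAI-compatible endpoints. Here's a minimal sketch using curl, assuming the server is listening on localhost:8000 (as in the benchmark example above) and using the standard OpenAI chat completions request format:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
      {"role": "user", "content": "Generate a story about a robot"}
    ]
  }'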

Custom architectures

The max CLI supports loading custom model architectures through the --custom-architectures flag. This allows you to extend MAX's capabilities with your own model implementations:

max serve \
  --model modularai/Llama-3.1-8B-Instruct-GGUF \
  --custom-architectures path/to/module1:module1 \
  --custom-architectures path/to/module2:module2

warm-cache

Preloads and compiles the model to optimize initialization time by:

  • Pre-compiling models before deployment
  • Warming up the Hugging Face cache

This command is useful to run before serving a model.

max warm-cache [OPTIONS]

Example

Basic cache warming:

max warm-cache \
  --model modularai/Llama-3.1-8B-Instruct-GGUF
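
For example, a typical workflow is to warm the cache first and then serve the same model, so the server starts without waiting on downloads or compilation:

max warm-cache \
  --model modularai/Llama-3.1-8B-Instruct-GGUF

max serve \
  --model modularai/Llama-3.1-8B-Instruct-GGUF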

Model configuration

Core settings for model loading and execution.

  • --custom-architectures: Load custom pipeline architectures (module path format: folder/path/to/import:my_module)
  • --model TEXT: Model ID or local path (Hugging Face repo ID, e.g. mistralai/Mistral-7B-v0.1, or a local path)
  • --model-path TEXT: Model ID or local path, as an alternative to --model (Hugging Face repo ID, e.g. mistralai/Mistral-7B-v0.1, or a local path)
  • --quantization-encoding: Weight encoding type (float32|bfloat16|q4_k|q4_0|q6_k|gptq)
  • --served-model-name: Override the default model name reported to clients; serve command only (any string identifier)
  • --weight-path PATH: Custom model weights path (valid file path; supports multiple paths via repeated flags, as shown in the sketch after this list)
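
For example, here's a minimal sketch of passing multiple weight files with repeated --weight-path flags (the local file names are hypothetical and must match the model's architecture):

max generate \
  --model modularai/Llama-3.1-8B-Instruct-GGUF \
  --weight-path ./weights/model-00001.gguf \
  --weight-path ./weights/model-00002.gguf \
  --prompt "Generate a story about a robot"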

Device configuration

Controls hardware placement and memory usage.

  • --devices: Target devices (cpu, gpu, or gpu:{id}, e.g. gpu:0,1)
  • --device-specs: Specific device configuration (default: CPU; DeviceSpec format, e.g. DeviceSpec(id=-1, device_type='cpu'))
  • --device-memory-utilization: Device memory fraction (default: 0.9; float between 0.0 and 1.0)

Performance tuning

Optimization settings for batch processing, caching, and sequence handling.

  • --cache-strategy: Cache strategy (naive|continuous)
  • --kv-cache-page-size: Token count per KVCache page (default: 128; positive integer)
  • --max-batch-size: Maximum cache size per batch (default: 1; positive integer)
  • --max-ce-batch-size: Maximum context encoding batch size (default: 32; positive integer)
  • --max-length: Maximum input sequence length (default: the Hugging Face model's maximum length; positive integer, must be less than the model's configured maximum)
  • --max-new-tokens: Maximum tokens to generate (default: -1, which uses the model's maximum; integer)
  • --data-parallel-degree: Number of devices for data parallelism (default: 1; positive integer)

Model state control

Options for saving or loading model states and handling external code

  • --force-download: Force re-download of cached files (default: false; true|false)
  • --trust-remote-code: Allow custom Hugging Face code (default: false; true|false). See the sketch after this list.
  • --allow-safetensors-weights-fp32-bf16-bidirectional-cast: Allow automatic bidirectional dtype casts between float32 and bfloat16 (default: false; true|false)
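
If a Hugging Face repository ships custom modeling code, you can opt in to running it with --trust-remote-code. A minimal sketch, assuming a hypothetical repo your-org/custom-model that requires remote code:

max serve \
  --model your-org/custom-model \
  --trust-remote-code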

Generation parameters

Controls for generation behavior.

  • --enable-constrained-decoding: Enable constrained generation (default: false; true|false)
  • --enable-echo: Enable model echo (default: false; true|false)
  • --image_url: URLs of images to include with the prompt; ignored if the model doesn't support image inputs (default: []; list of valid URLs)
  • --rope-type: RoPE type for GGUF weights (none|normal|neox)
  • --seed: Random seed for generation reproducibility (integer)
  • --temperature: Sampling temperature for generation randomness (default: 1.0; float between 0.0 and 2.0)
  • --top-k: Limit sampling to the top K tokens (default: 255; positive integer; 1 for greedy sampling)
  • --chat-template: Custom chat template for the model (valid chat template string)
