Benchmark MAX on NVIDIA or AMD GPUs

Ehsan M. Kermani

Developer Relations

Judy Heflin

Technical Writer

benchmark

gpu

Datacenter GPU required

For the best performance, use an NVIDIA B200 / H200 / H100 or AMD MI355X / MI325X / MI300X.

MAX can serve models on a wide range of CPUs and GPUs, but the big LLMs most customers want require an amount of memory that's only available on the latest datacenter GPUs. Specifically, this tutorial uses the Gemma 3 27B model, which requires at least 60 GiB of memory.

Performance optimization is a key challenge in deploying AI inference workloads, especially when balancing factors like accuracy, latency, and cost. In this tutorial, we'll show you how to benchmark a Gemma 3 endpoint using the max benchmark command. This tool provides key metrics to evaluate the performance of the model server, including:

Request throughput
Input and output token throughput
Time-to-first-token (TTFT)
Time per output token (TPOT)

Our benchmarking script is adapted from vLLM with additional features, such as client-side GPU metric collection to ensure consistent and comprehensive performance measurement that's tailored to MAX. You can see the benchmark script source here.

System requirements:

Linux

WSL

GPU

Docker

This tutorial is intended for production environments using Docker, which can be difficult to set up with GPU access on some systems. If you have any trouble with Docker, you can instead run benchmarks on an endpoint created with the max serve command—for instructions, see the quickstart guide.

Get access to the model

From here on, you should be running commands on the system with your GPU. If you haven't already, open a shell to that system now.

You'll first need to authorize your Hugging Face account to access the Gemma model:

Obtain a Hugging Face access token and set it as an environment variable:
```
export HF_TOKEN="hf_..."
```
Agree to the Gemma 3 license on Hugging Face.

Start the model endpoint

We provide a pre-configured GPU-enabled Docker container that simplifies the process to deploy an endpoint with MAX. For more information, see MAX container.

Use this command to pull the MAX container and start the model endpoint:

NVIDIA
AMD

docker run --rm --gpus=all \
  --ipc=host \
  -p 8000:8000 \
  --env "HF_TOKEN=${HF_TOKEN}" \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  modular/max-nvidia-full:latest \
  --model-path google/gemma-3-27b-it

docker run \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add video \
  --ipc=host \
  -p 8000:8000 \
  --env "HF_TOKEN=${HF_TOKEN}" \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  modular/max-amd:latest \
  --model-path google/gemma-3-27b-it

If you want to try a different model, see our model repository.

The server is running when you see the following terminal message (beware Docker prints JSON logs by default):

🚀 Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)

Start benchmarking

Open a second terminal and install the modular package to get the max CLI tool we'll use to perform benchmarking.

Set up your environment

pixi
uv
pip
conda

If you don't have it, install pixi:
```
curl -fsSL https://pixi.sh/install.sh | sh
```
Then restart your terminal for the changes to take effect.

Create a project:

pixi init max-benchmark \
  -c https://conda.modular.com/max-nightly/ -c conda-forge \
  && cd max-benchmark

Tip: You can skip the -c options if you add these channels as defaults.

Install the modular conda package:
- Nightly
- Stable
pixi add modular
pixi add "modular==25.6"
Start the virtual environment:
```
pixi shell
```

If you don't have it, install uv:
```
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Then restart your terminal to make uv accessible.

Create a project:

uv init max-benchmark && cd max-benchmark

Create and start a virtual environment:
```
uv venv && source .venv/bin/activate
```

Install the modular Python package:

Nightly
Stable

uv pip install modular \
  --index-url https://dl.modular.com/public/nightly/python/simple/ \
  --prerelease allow

uv pip install modular \
  --extra-index-url https://modular.gateway.scarf.sh/simple/

Create a project folder:
```
mkdir max-benchmark && cd max-benchmark
```

Create and activate a virtual environment:

python3 -m venv .venv/max-benchmark \
  && source .venv/max-benchmark/bin/activate

Install the modular Python package:

Nightly
Stable

pip install --pre modular \
  --index-url https://dl.modular.com/public/nightly/python/simple/

pip install modular \
  --extra-index-url https://modular.gateway.scarf.sh/simple/

If you don't have it, install conda. A common choice is with brew:
```
brew install miniconda
```
Initialize conda for shell interaction:
```
conda init
```
If you're on a Mac, instead use:
```
conda init zsh
```
Then restart your terminal for the changes to take effect.
Create a project:
```
conda create -n max-benchmark
```
Start the virtual environment:
```
conda activate max-benchmark
```

Install the modular conda package:

Nightly
Stable

conda install -c conda-forge -c https://conda.modular.com/max-nightly/ modular

conda install -c conda-forge -c https://conda.modular.com/max/ modular

Benchmark the model

To benchmark MAX with the sharegpt dataset, use this command:

max benchmark \
  --backend modular \
  --model google/gemma-3-27b-it \
  --dataset-name sharegpt \
  --endpoint /v1/completions \
  --num-prompts 250

By default, this sends requests to localhost:8000, but you can override with the --host and --port arguments.

In order to download the dataset, you must have permission to write to your Hugging Face cache. You might need to change permissions with chown.

For more information about available max benchmark arguments, including other datasets you can use, see the max benchmark documentation.

Use your own dataset

The command above uses the sharegpt dataset from Hugging Face, but you can also provide a path to your own dataset.

For example, you can download the ShareGPT dataset with this command:

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

You can then use the local dataset with the --dataset-path argument:

max benchmark \
  ...
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \

Interpret the results

Of course, your results depend on your hardware, but the structure of the output should look like this:

============ Serving Benchmark Result ============
Successful requests:                     50
Failed requests:                         0
Benchmark duration (s):                  25.27
Total input tokens:                      12415
Total generated tokens:                  11010
Total nonempty serving response chunks:  11010
Input request rate (req/s):              inf
Request throughput (req/s):              1.97837
------------Client Experience Metrics-------------
Max Concurrency:                         50
Mean input token throughput (tok/s):     282.37
Std input token throughput (tok/s):      304.38
Median input token throughput (tok/s):   140.81
P90 input token throughput (tok/s):      9.76
P95 input token throughput (tok/s):      7.44
P99 input token throughput (tok/s):      4.94
Mean output token throughput (tok/s):    27.31
Std output token throughput (tok/s):     8.08
Median output token throughput (tok/s):  30.64
P90 output token throughput (tok/s):     12.84
P95 output token throughput (tok/s):     9.11
P99 output token throughput (tok/s):     4.71
---------------Time to First Token----------------
Mean TTFT (ms):                          860.54
Std TTFT (ms):                           228.57
Median TTFT (ms):                        809.41
P90 TTFT (ms):                           1214.68
P95 TTFT (ms):                           1215.34
P99 TTFT (ms):                           1215.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          46.72
Std TPOT (ms):                           39.77
Median TPOT (ms):                        32.63
P90 TPOT (ms):                           78.24
P95 TPOT (ms):                           111.87
P99 TPOT (ms):                           216.31
---------------Inter-token Latency----------------
Mean ITL (ms):                           31.16
Std ITL (ms):                            91.79
Median ITL (ms):                         1.04
P90 ITL (ms):                            176.93
P95 ITL (ms):                            272.52
P99 ITL (ms):                            276.72
-------------Per-Request E2E Latency--------------
Mean Request Latency (ms):               7694.01
Std Request Latency (ms):                6284.40
Median Request Latency (ms):             5667.19
P90 Request Latency (ms):                16636.07
P95 Request Latency (ms):                21380.10
P99 Request Latency (ms):                25251.18

For more information about each metric, see the MAX benchmarking key metrics.

Measure latency with finite request rates

Latency metrics like time-to-first-token (TTFT) and time per output token (TPOT) matter most when the server isn't overloaded. An overloaded server will queue requests, which results in a massive increase in latency that varies depending on the size of the benchmark more than the actual latency of the server. Benchmarks with a larger number of prompts result in a deeper queue.

If you'd like to vary the size of the queue, you can adjust the request rate with the --request-rate flag. This creates a stochastic request load with an average rate of N requests per second.

Comparing to alternatives

You can run the benchmarking script using the Modular, vLLM, or TensorRT-LLM backends to compare performance with alternative LLM serving frameworks. Before running the benchmark, make sure you set up and launch the corresponding inference engine so the script can send requests to it.

When using the TensorRT-LLM backend, be sure to change the --endpoint to /v2/models/ensemble/generate_stream. MAX achieves competitive throughput on most workloads and will further improve with upcoming optimizations.

Optional cleanup

When you're done benchmarking, you can clean up the Docker image with the following command:

NVIDIA
AMD

docker rmi $(docker images -q modular/max-nvidia-full:latest)

docker rmi $(docker images -q modular/max-amd:latest)

Next steps

Now that you have detailed benchmarking results for Gemma 3 on MAX using an NVIDIA or AMD GPU, here are some other topics to explore next:

Deploy Llama 3 on GPU with MAX

Learn how to deploy Llama 3 on GPU with MAX.

Deploy Llama 3 on GPU-powered Kubernetes clusters

Learn how to deploy Llama 3 using Kubernetes, MAX, and NVIDIA GPUs

To read more about our performance methodology, check our our blog post, MAX GPU: State of the Art Throughput on a New GenAI platform.

You can also share your experience on the Modular Forum and in our Discord Community. Be sure to stay up to date with all the performance improvements coming soon by signing up for our newsletter.

Get access to the model​

Start the model endpoint​

Start benchmarking​

Set up your environment​

Benchmark the model​

Use your own dataset​

Interpret the results​

Measure latency with finite request rates​

Comparing to alternatives​

Next steps​