Quickstart
In this quickstart guide, you'll learn how to install Modular in a Python environment and run inference with a GenAI model. We'll first use our Python API to run offline inference, then start a local endpoint and use the OpenAI Python API to send inference requests.
System requirements: Mac, Linux, WSL, or Docker.
Set up your project
First, install the max CLI and Python library with your preferred package manager: pip, uv, or magic.
pip:

- Create a project folder:

  mkdir quickstart && cd quickstart

- Create and activate a virtual environment:

  python3 -m venv .venv/quickstart \
    && source .venv/quickstart/bin/activate

- Install the modular Python package:

  Nightly:

  pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://dl.modular.com/public/nightly/python/simple/

  Stable:

  pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu
uv:

- Install uv:

  curl -LsSf https://astral.sh/uv/install.sh | sh

  Then restart your terminal to make uv accessible.

- Create a project:

  uv init quickstart && cd quickstart

- Create and start a virtual environment:

  uv venv && source .venv/bin/activate

- Install the modular Python package:

  Nightly:

  uv pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://dl.modular.com/public/nightly/python/simple/ \
    --index-strategy unsafe-best-match

  Stable:

  uv pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    --index-strategy unsafe-best-match
magic:

- Install magic:

  curl -ssL https://magic.modular.com/ | bash

  Then run the source command that's printed in your terminal.

- Create a project:

  magic init quickstart --format pyproject && cd quickstart

- Install the max-pipelines conda package:

  Nightly:

  magic add max-pipelines

  Stable:

  magic add "max-pipelines==25.3"

- Start the virtual environment:

  magic shell
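To confirm the install worked, you can try importing the library from inside your active virtual environment. This is just an optional sanity check; the import path is the same one used in the example below:

python -c "from max.entrypoints.llm import LLM; print('MAX import OK')"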
Run offline inference
You can run inference locally with the max Python API. Just specify the Hugging Face model you want, then generate results for one or more prompts.

In this example, we use a Llama 3.1 model that's not gated on Hugging Face, so you don't need an access token. Save the following code as offline-inference.py:
from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig

def main():
    model_path = "modularai/Llama-3.1-8B-Instruct-GGUF"
    pipeline_config = PipelineConfig(model_path=model_path)
    llm = LLM(pipeline_config)

    prompts = [
        "In the beginning, there was",
        "I believe the meaning of life is",
        "The fastest way to learn python is",
    ]

    print("Generating responses...")
    responses = llm.generate(prompts, max_new_tokens=50)

    for i, (prompt, response) in enumerate(zip(prompts, responses)):
        print(f"========== Response {i} ==========")
        print(prompt + response)
        print()

if __name__ == "__main__":
    main()
Run it and you should see a response similar to this:
python offline-inference.py
========== Response 0 ==========
In the beginning, there was Andromeda. The Andromeda galaxy, that is. It's the closest major galaxy to our own Milky Way, and it's been a source of fascination for astronomers and space enthusiasts for centuries. But what if I told you that there's
========== Response 1 ==========
I believe the meaning of life is to find your gift. The purpose of life is to give it away to others.
I believe that the meaning of life is to find your gift. The purpose of life is to give it away to others.
I believe that the meaning of life is
========== Response 2 ==========
The fastest way to learn python is to practice with real-world projects. Here are some ideas for projects that you can use to learn Python:
1. **Command Line Calculator**: Create a command line calculator that can perform basic arithmetic operations like addition, subtraction, multiplication, and division.
More information about this API is available in the offline inference guide.
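As a small variation, here's a sketch of the same offline workflow that reads the prompt from the command line. The LLM, PipelineConfig, and generate() calls are exactly the ones shown above; the argument parsing is illustrative only and not part of the MAX API:

import argparse

from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig

def main():
    # Illustrative CLI wrapper around the same generate() call used above.
    parser = argparse.ArgumentParser(description="Generate text with a MAX pipeline")
    parser.add_argument("prompt", help="Prompt text to complete")
    parser.add_argument("--max-new-tokens", type=int, default=50)
    args = parser.parse_args()

    llm = LLM(PipelineConfig(model_path="modularai/Llama-3.1-8B-Instruct-GGUF"))
    # generate() takes a list of prompts and returns one response per prompt.
    responses = llm.generate([args.prompt], max_new_tokens=args.max_new_tokens)
    print(args.prompt + responses[0])

if __name__ == "__main__":
    main()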
Run inference with an endpoint
Now let's start a local server that runs the model using an OpenAI-compatible endpoint:
- Install the openai client library with the same tool you used for setup:

  pip: pip install openai

  uv: uv add openai

  magic: magic add openai
- Start the endpoint with the max CLI:

  max serve --model-path=modularai/Llama-3.1-8B-Instruct-GGUF
- Create a new file named generate-text.py that sends an inference request:

  from openai import OpenAI

  client = OpenAI(
      base_url="http://0.0.0.0:8000/v1",
      api_key="EMPTY",
  )

  completion = client.chat.completions.create(
      model="modularai/Llama-3.1-8B-Instruct-GGUF",
      messages=[
          {
              "role": "user",
              "content": "Who won the world series in 2020?",
          },
      ],
  )

  print(completion.choices[0].message.content)

  Notice that the OpenAI API requires the api_key argument, but our endpoint doesn't use it.
- Run it and you should see results like this:

  python generate-text.py

  The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in the series 4 games to 2. This was the Dodgers' first World Series title since 1988.
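If you want tokens to appear as they're generated, the same client can stream the response. This is a minimal sketch using the standard OpenAI streaming interface, assuming the endpoint you started above supports streamed chat completions:

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[{"role": "user", "content": "Who won the world series in 2020?"}],
    stream=True,  # request incremental chunks instead of one final message
)

for chunk in stream:
    # Each chunk's delta holds newly generated text; it can be None on the final chunk.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()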
That's it. You just served Llama 3.1 on your local CPU and ran inference using our OpenAI-compatible Serve API. You can also deploy the same endpoint to a cloud GPU using our Docker container.

To run a different model, change the --model-path to something else from our model repository.
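Because the endpoint speaks the OpenAI chat completions protocol, you can also query it without any client library. Here's a sketch using curl against the standard /v1/chat/completions route, assuming the server from the steps above is still running on port 8000:

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'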
Keep going
There's still a lot more to learn. Here are some directions you can go:
Docs

- Serving: Try more serving features like function calling, tool use, structured output, and more.
- Deploying: Try a tutorial to deploy a model on a cloud GPU using our Docker container.
- Developing: Discover all the ways you can customize your AI deployments, such as writing custom ops and GPU kernels in Mojo.
- Mojo manual: Learn to program in Mojo, a Pythonic systems programming language that allows you to write code for both CPUs and GPUs.
Resources

- Stay in touch: Stay up to date with announcements and releases. We're moving fast over here.
- Talk to an AI Expert: Connect with our product experts to explore how we can help you deploy and serve AI models with high performance, scalability, and cost-efficiency.