Text to text

MAX makes it easy to generate text with large language models, whether for conversational applications, single-turn prompts, or offline inference workflows. MAX text completion endpoints are fully compatible with the OpenAI API, so you can use familiar tools and libraries.

Text completions let you instruct a model to produce new text based on a prompt or an ongoing conversation. They can be used for a wide range of tasks, including writing content, generating synthetic data, building chatbots, or powering multi-turn assistants. MAX provides two main endpoints for text completions: v1/chat/completions and v1/completions.

Endpoints

The v1/chat/completions endpoint is recommended as the default for most text use cases. It supports both single-turn and multi-turn scenarios. The v1/completions endpoint is also supported for traditional single-turn text generation tasks, which is useful for offline inference or generating text from a prompt without conversational context.

`v1/chat/completions`

The v1/chat/completions endpoint is designed for chat-based models and supports both single-turn and multi-turn interactions. You provide a sequence of structured messages with roles (system, user, assistant), and the model generates a response.

For example, within the v1/chat/completions request body, the "messages" array might look similar to the following:

"messages": [
  {
    "role:" "system",
    "content": "You are a helpful assistant."
  },
  {
    "role": "user",
    "content": "Who won the world series in 2020?"
  }
]

Use a combination of roles to give the model the context it needs. A system message can define overall model response behavior, user messages represent instructions or prompts from the end-user interacting with the model, and assistant messages are a way to incorporate past model responses into the message context.

Use this endpoint whenever you want conversational interaction, such as:

Building chatbots or assistants
Implementing Q&A systems
Supporting multi-turn dialogue in applications

It's also fully compatible with single-turn use cases, making it versatile enough for general text generation workflows.

`v1/completions`

The v1/completions endpoint supports traditional text completions. You provide a prompt, and the model returns generated text. This endpoint is ideal when you only need a single response per request, such as:

Offline inference workflows
Synthetic text generation
One-off text generation tasks

Quickstart

Get started quickly serving Gemma 3 locally with the max CLI and interact with it through the MAX REST and Python APIs. You'll learn to configure the server and make requests using the OpenAI client libraries as a drop-in replacement.

System requirements:

Mac

Linux

WSL

Set up your environment

Create a Python project to install our APIs and CLI tools:

pixi
uv
pip
conda

If you don't have it, install pixi:
```
curl -fsSL https://pixi.sh/install.sh | sh
```
Then restart your terminal for the changes to take effect.

Create a project:

pixi init chat-quickstart \
  -c https://conda.modular.com/max-nightly/ -c conda-forge \
  && cd chat-quickstart

Tip: You can skip the -c options if you add these channels as defaults.

Install the modular conda package:
- Nightly
- Stable
pixi add modular
pixi add "modular==25.6"
Start the virtual environment:
```
pixi shell
```

If you don't have it, install uv:
```
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Then restart your terminal to make uv accessible.

Create a project:

uv init chat-quickstart && cd chat-quickstart

Create and start a virtual environment:
```
uv venv && source .venv/bin/activate
```

Install the modular Python package:

Nightly
Stable

uv pip install modular \
  --index-url https://dl.modular.com/public/nightly/python/simple/ \
  --prerelease allow

uv pip install modular \
  --extra-index-url https://modular.gateway.scarf.sh/simple/

Create a project folder:

mkdir chat-quickstart && cd chat-quickstart

Create and activate a virtual environment:

python3 -m venv .venv/chat-quickstart \
  && source .venv/chat-quickstart/bin/activate

Install the modular Python package:

Nightly
Stable

pip install --pre modular \
  --index-url https://dl.modular.com/public/nightly/python/simple/

pip install modular \
  --extra-index-url https://modular.gateway.scarf.sh/simple/

If you don't have it, install conda. A common choice is with brew:
```
brew install miniconda
```
Initialize conda for shell interaction:
```
conda init
```
If you're on a Mac, instead use:
```
conda init zsh
```
Then restart your terminal for the changes to take effect.
Create a project:
```
conda create -n chat-quickstart
```
Start the virtual environment:
```
conda activate chat-quickstart
```

Install the modular conda package:

Nightly
Stable

conda install -c conda-forge -c https://conda.modular.com/max-nightly/ modular

conda install -c conda-forge -c https://conda.modular.com/max/ modular

Serve your model

Use the max serve command to start a local server with the Gemma 3 model:

max serve --model google/gemma-3-12b-it

This creates a server running the google/gemma-3-12b-it large language model on http://localhost:8000/v1/chat/completions, an OpenAI compatible endpoint.

While this example uses the Gemma 3 model, you can replace it with any of the models listed in the MAX Builds site.

When searching for a model using the MAX Builds site, ensure that the model type can fit into memory of your machine. You can filter and sort models by hardware type, and size of the model. For more information and to learn how to use the MAX Builds site, see MAX Builds in 60 seconds.

The endpoint is ready when you see this message printed in your terminal:

Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)

For a complete list of max CLI commands and options, refer to the MAX CLI reference.

Generate a text chat completion

MAX supports OpenAI's REST APIs and you can interact with the model using either the OpenAI Python SDK or curl:

Python
curl

You can use OpenAI's Python client to interact with the model. First, install the OpenAI API:

pixi
uv
pip
conda

pixi add openai

uv add openai

pip install openai

conda install openai

Then, create a client and make a request to the model:

generate-text.py
from openai import OpenAI

client = OpenAI(
    base_url = 'http://0.0.0.0:8000/v1',
    api_key='EMPTY', # required by the API, but not used by MAX
)

response = client.chat.completions.create(
  model="google/gemma-3-12b-it",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The LA Dodgers won in 2020."},
    {"role": "user", "content": "Where was it played?"}
  ]
)
print(response.choices[0].message.content)

In this example, you're using the OpenAI Python client to interact with the MAX endpoint running on local host 8000. The client object is initialized with the base URL http://0.0.0.0:8000/v1 and the API key is ignored.

When you run this code, the model should respond with information about the 2020 World Series location:

python generate-text.py

The 2020 World Series was played at Globe Life Field in Arlington, Texas. It was a neutral site due to the COVID-19 pandemic.

The following curl command sends a chat request to the model's chat completions endpoint:

curl http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "google/gemma-3-12b-it",
        "messages": [
            {
            "role": "system",
            "content": "You are a helpful assistant."
            },
            {
            "role": "user",
            "content": "Hello, how are you?"
            }
        ],
        "max_tokens": 100
    }'

You should receive a response similar to this:

{
  "id": "18b0abd2d2fd463ea43efe2c147bcac0",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": " I'm doing well, thank you for asking. How can I assist you today?",
        "refusal": "",
        "tool_calls": null,
        "role": "assistant",
        "function_call": null
      },
      "logprobs": {
        "content": [],
        "refusal": []
      }
    }
  ],
  "created": 1743543698,
  "model": "google/gemma-3-12b-it",
  "service_tier": null,
  "system_fingerprint": null,
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 17,
    "prompt_tokens": null,
    "total_tokens": 17
  }
}

For complete details on all available API endpoints and options, see the REST API documentation.

Next steps

Now that you have successfully set up MAX with an OpenAI-compatible chat endpoint, check out additional serving optimizations specific to your use case.

Endpoints​

v1/chat/completions​

v1/completions​

Quickstart​

Set up your environment​

Serve your model​

Generate a text chat completion​

Next steps​