Text to text
MAX makes it easy to generate text with large language models, whether for conversational applications, single-turn prompts, or offline inference workflows. MAX text completion endpoints are fully compatible with the OpenAI API, so you can use familiar tools and libraries.
Text completions let you instruct a model to produce new text based on a prompt or an ongoing conversation. They can be used for a wide range of tasks, including writing content, generating synthetic data, building chatbots, or powering multi-turn assistants. MAX provides two main endpoints for text completions: `v1/chat/completions` and `v1/completions`.
Endpoints
The `v1/chat/completions` endpoint is recommended as the default for most text use cases. It supports both single-turn and multi-turn scenarios. The `v1/completions` endpoint is also supported for traditional single-turn text generation tasks, which is useful for offline inference or generating text from a prompt without conversational context.
v1/chat/completions
The `v1/chat/completions` endpoint is designed for chat-based models and supports both single-turn and multi-turn interactions. You provide a sequence of structured messages with roles (`system`, `user`, `assistant`), and the model generates a response.

For example, within the `v1/chat/completions` request body, the `"messages"` array might look similar to the following:
"messages": [
{
"role:" "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who won the world series in 2020?"
}
]
Use a combination of roles to give the model the context it needs. A `system` message can define overall model response behavior, `user` messages represent instructions or prompts from the end-user interacting with the model, and `assistant` messages are a way to incorporate past model responses into the message context.
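For a multi-turn assistant, your application typically maintains the `messages` array itself, appending the model's reply and the next user message before each new request. The following sketch shows that pattern with the OpenAI Python client; the base URL, model name, and prompts are illustrative, and it assumes a MAX server is already running locally (see the quickstart below):

```python
from openai import OpenAI

# Point the standard OpenAI client at a locally running MAX server
# (see the quickstart below for how to start one).
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

# The system message sets overall behavior; user and assistant messages
# accumulate as the conversation progresses.
messages = [{"role": "system", "content": "You are a helpful assistant."}]

for user_input in ["Who won the world series in 2020?", "Where was it played?"]:
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="modularai/Llama-3.1-8B-Instruct-GGUF",
        messages=messages,
    )
    reply = response.choices[0].message.content
    # Keep the assistant's reply in the history so the next turn has context.
    messages.append({"role": "assistant", "content": reply})
    print(reply)
```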
Use this endpoint whenever you want conversational interaction, such as:
- Building chatbots or assistants
- Implementing Q&A systems
- Supporting multi-turn dialogue in applications
It's also fully compatible with single-turn use cases, making it versatile enough for general text generation workflows.
v1/completions
The `v1/completions` endpoint supports traditional text completions. You provide a prompt, and the model returns generated text. This endpoint is ideal when you only need a single response per request, such as:
- Offline inference workflows
- Synthetic text generation
- One-off text generation tasks
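For example, a single-turn request to this endpoint with the OpenAI Python client might look like the following sketch. It assumes a MAX server is already running locally (as shown in the quickstart below), and the prompt and `max_tokens` value are illustrative:

```python
from openai import OpenAI

# The same OpenAI-compatible client works for the completions endpoint.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    prompt="Write a one-sentence tagline for a coffee shop that only serves espresso.",
    max_tokens=50,
)

# Plain completions return generated text directly, with no chat roles.
print(response.choices[0].text)
```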
Quickstart
Get started quickly serving Llama 3.1 locally with the `max` CLI and interact with it through the MAX REST and Python APIs. You'll learn to configure the server and make requests using the OpenAI client libraries as a drop-in replacement.

System requirements: Mac, Linux, or WSL.
Set up your environment
Create a Python project to install our APIs and CLI tools:
You can use any of the following package managers: pixi, uv, pip, or conda.

pixi

- If you don't have it, install pixi:

  curl -fsSL https://pixi.sh/install.sh | sh

  Then restart your terminal for the changes to take effect.

- Create a project:

  pixi init chat-quickstart \
    -c https://conda.modular.com/max-nightly/ -c conda-forge \
    && cd chat-quickstart

- Install the modular conda package:

  Nightly: pixi add modular

  Stable: pixi add "modular=25.5"

- Start the virtual environment:

  pixi shell
uv

- If you don't have it, install uv:

  curl -LsSf https://astral.sh/uv/install.sh | sh

  Then restart your terminal to make uv accessible.

- Create a project:

  uv init chat-quickstart && cd chat-quickstart

- Create and start a virtual environment:

  uv venv && source .venv/bin/activate

- Install the modular Python package:

  Nightly:

  uv pip install modular \
    --index-url https://dl.modular.com/public/nightly/python/simple/ \
    --prerelease allow

  Stable:

  uv pip install modular \
    --extra-index-url https://modular.gateway.scarf.sh/simple/
pip

- Create a project folder:

  mkdir chat-quickstart && cd chat-quickstart

- Create and activate a virtual environment:

  python3 -m venv .venv/chat-quickstart \
    && source .venv/chat-quickstart/bin/activate

- Install the modular Python package:

  Nightly:

  pip install --pre modular \
    --index-url https://dl.modular.com/public/nightly/python/simple/

  Stable:

  pip install modular \
    --extra-index-url https://modular.gateway.scarf.sh/simple/
conda

- If you don't have it, install conda. A common choice is with brew:

  brew install miniconda

- Initialize conda for shell interaction:

  conda init

  If you're on a Mac, instead use:

  conda init zsh

  Then restart your terminal for the changes to take effect.

- Create a project:

  conda create -n chat-quickstart

- Start the virtual environment:

  conda activate chat-quickstart

- Install the modular conda package:

  Nightly:

  conda install -c conda-forge -c https://conda.modular.com/max-nightly/ modular

  Stable:

  conda install -c conda-forge -c https://conda.modular.com/max/ modular
Serve your model
Use the `max serve` command to start a local model server with the Llama 3.1 model:
max serve \
--model modularai/Llama-3.1-8B-Instruct-GGUF
This creates a server running the `Llama-3.1-8B-Instruct-GGUF` large language model on `http://localhost:8000/v1/chat/completions`, an OpenAI-compatible endpoint.
While this example uses the Llama 3.1 model, you can replace it with any of the models listed on the MAX Builds site.
The endpoint is ready when you see this message printed in your terminal:
Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
For a complete list of `max` CLI commands and options, refer to the MAX CLI reference.
Generate a text chat completion
MAX supports OpenAI's REST APIs, so you can interact with the model using either the OpenAI Python SDK or curl:
Python

You can use OpenAI's Python client to interact with the model. First, install the OpenAI Python package with your package manager:

- pixi: pixi add openai
- uv: uv add openai
- pip: pip install openai
- conda: conda install openai
Then, create a client and make a request to the model:
from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:8000/v1",
    api_key="EMPTY",  # required by the API, but not used by MAX
)

response = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The LA Dodgers won in 2020."},
        {"role": "user", "content": "Where was it played?"},
    ],
)

print(response.choices[0].message.content)
In this example, you're using the OpenAI Python client to interact with the MAX endpoint running locally on port 8000. The `client` object is initialized with the base URL `http://0.0.0.0:8000/v1`, and the API key is ignored.

Save the code to a file such as `generate-text.py` and run it:

python generate-text.py

The model should respond with information about the 2020 World Series location:

The 2020 World Series was played at Globe Life Field in Arlington, Texas. It was a neutral site due to the COVID-19 pandemic.
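The chat completions endpoint also accepts the standard OpenAI `stream` parameter, so you can print tokens as they arrive instead of waiting for the full response. Here's a minimal sketch of the same kind of request with streaming enabled, assuming the server started above is still running:

```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

# Request a streamed response; chunks arrive as tokens are generated.
stream = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Share a short fun fact about baseball."},
    ],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (such as the final one) carry no content
        print(delta, end="", flush=True)
print()
```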
curl

The following `curl` command sends a chat request to the model's chat completions endpoint:
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello, how are you?"
      }
    ],
    "max_tokens": 100
  }'
You should receive a response similar to this:
{
  "id": "18b0abd2d2fd463ea43efe2c147bcac0",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": " I'm doing well, thank you for asking. How can I assist you today?",
        "refusal": "",
        "tool_calls": null,
        "role": "assistant",
        "function_call": null
      },
      "logprobs": {
        "content": [],
        "refusal": []
      }
    }
  ],
  "created": 1743543698,
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "service_tier": null,
  "system_fingerprint": null,
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 17,
    "prompt_tokens": null,
    "total_tokens": 17
  }
}
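Because this is a plain REST endpoint, any HTTP client works, not just the OpenAI SDK or curl. The following sketch sends the same request with Python's standard library and pulls the generated text and token usage out of a JSON response like the one shown above (the field names match that response):

```python
import json
import urllib.request

# Build the same chat request body used in the curl example above.
payload = {
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"},
    ],
    "max_tokens": 100,
}

request = urllib.request.Request(
    "http://0.0.0.0:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    body = json.load(response)

# Extract the assistant's message and the reported token usage.
print(body["choices"][0]["message"]["content"])
print("Total tokens:", body["usage"]["total_tokens"])
```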
For complete details on all available API endpoints and options, see the REST API documentation.
Next steps
Now that you have successfully set up MAX with an OpenAI-compatible chat endpoint, check out additional serving optimizations specific to your use case.