
Start a chat endpoint
MAX provides OpenAI-compatible endpoints for using open source models with the same API interface as OpenAI. This allows you to replace commercial models with alternatives from the MAX Builds site with minimal code changes.
This tutorial shows you how to serve Llama 3.1 locally with the max
CLI and interact with it through REST and Python APIs. You'll learn to configure
the server and make requests using the OpenAI client libraries as a drop-in
replacement.
Install required packages
To get started with MAX, you need to install the Magic CLI and the max
CLI.
- If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:
  curl -ssL https://magic.modular.com/ | bash
  Then run the source command that's printed in your terminal.
- Install the max CLI:
  magic global install max
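Before moving on, it can help to confirm that both commands are reachable from your shell. The following is a minimal sketch using only the Python standard library; the helper name missing_tools is hypothetical and not part of MAX.

```python
import shutil


def missing_tools(names):
    """Return the command names from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # "magic" and "max" are the two CLIs installed in the steps above.
    missing = missing_tools(["magic", "max"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Both CLIs are available.")
```

If a command is reported missing right after installation, the most likely cause is that the source command printed by the installer has not been run in the current shell yet.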
For a complete list of max
CLI commands and options, refer to the MAX
Pipelines reference.
Serve your model
Use the max serve
command to start a
local model server with the Llama 3.1 model:
max serve \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF
While this example uses the Llama 3.1 model, you can replace it with any of the models listed on the MAX Builds site.
The server is ready when you see a message indicating it's running on http://0.0.0.0:8000:
Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
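If you start the server from a script rather than watching the terminal, you may want to wait until the port accepts connections before sending requests. The sketch below polls the port with the standard library only. The function name wait_for_port is hypothetical. An open port only shows that the HTTP server is listening, so treat the log message above as the authoritative readiness signal.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the server in this tutorial, a call such as wait_for_port("127.0.0.1", 8000, timeout=120) would be a reasonable starting point. The timeout value is an assumption, since the time to download and load the model varies by machine.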
Interact with the model
After the server is running, you can interact with the model in several ways.
The MAX endpoint supports the OpenAI REST API, so you can send requests with
curl or with the openai Python client.
The following curl command sends a simple chat request to the model's chat completions endpoint:
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello, how are you?"
      }
    ],
    "max_tokens": 100
  }'
You should receive a response similar to this:
{
  "id": "18b0abd2d2fd463ea43efe2c147bcac0",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": " I'm doing well, thank you for asking. How can I assist you today?",
        "refusal": "",
        "tool_calls": null,
        "role": "assistant",
        "function_call": null
      },
      "logprobs": {
        "content": [],
        "refusal": []
      }
    }
  ],
  "created": 1743543698,
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "service_tier": null,
  "system_fingerprint": null,
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 17,
    "prompt_tokens": null,
    "total_tokens": 17
  }
}
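The same request can be built, and the response unpacked, from Python with only the standard library. This is a sketch rather than the recommended client: the function names build_chat_request, extract_reply, and send_chat are hypothetical, and the field names are taken from the example response above. Note that usage.prompt_tokens is null in that response, so the code reads usage values defensively.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, max_tokens=100):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(body):
    """Return (reply_text, completion_tokens) from a decoded response dict."""
    text = body["choices"][0]["message"]["content"].strip()
    # "usage" or its fields may be null, as in the sample response above.
    usage = body.get("usage") or {}
    return text, usage.get("completion_tokens")


def send_chat(request):
    """Send the request and decode the JSON body (requires a running server)."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server from this tutorial running, passing the result of build_chat_request("http://0.0.0.0:8000/v1", ...) to send_chat and then to extract_reply should yield the assistant's text with the leading space from the sample response stripped.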
You can also use OpenAI's Python client to interact with the model.
To get started, install the OpenAI Python client:
pip install openai
Then, create a client and make a request to the model:
from openai import OpenAI

client = OpenAI(
    base_url='http://0.0.0.0:8000/v1',
    api_key='max-serve',  # required, but unused for local deployment
)

response = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The LA Dodgers won in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)

print(response.choices[0].message.content)
In this example, you're using the OpenAI Python client to interact with the MAX endpoint running locally on port 8000.
The client object is initialized with the base URL http://0.0.0.0:8000/v1 and the API key max-serve.
The api_key is required, but the value is not used and can be set to anything.
When you run this code, the model should respond with information about the 2020 World Series location:
The 2020 World Series was played at Globe Life Field in Arlington, Texas. It was a neutral site due to the COVID-19 pandemic.
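The request above sends the whole conversation so far in messages, including an earlier assistant turn. A small helper can keep that history up to date across turns. This is a sketch: the function name ask is hypothetical, and it assumes only that client exposes chat.completions.create as the OpenAI client in this tutorial does.

```python
def ask(client, model, history, user_text):
    """Append a user turn, request a completion, and record the assistant's reply.

    `history` is a list of {"role": ..., "content": ...} dicts and is
    modified in place, so the next call sends the full conversation.
    """
    history.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(model=model, messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```

With the client from the example above, you could start from a history that contains only the system message and call ask(client, "modularai/Llama-3.1-8B-Instruct-GGUF", history, "Who won the world series in 2020?"), followed by a second call asking "Where was it played?". Because every call resends the whole history, long conversations grow the prompt with each turn.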
For complete details on all available API endpoints and options, see the MAX Serve API documentation.
Next steps
Now that you have successfully set up MAX with OpenAI-compatible endpoints, check out the other MAX tutorials.