Start a chat endpoint

MAX provides OpenAI-compatible endpoints, so you can serve open source models behind the same API interface you already use with OpenAI. This allows you to replace commercial models with alternatives from the MAX Builds site with minimal code changes.

This tutorial shows you how to serve Llama 3.1 locally with the max CLI and interact with it through REST and Python APIs. You'll learn to configure the server and make requests using the OpenAI client libraries as a drop-in replacement.

Install required packages

To get started with MAX, you need to install the Magic CLI and the max CLI.

  1. If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:

    curl -ssL https://magic.modular.com/ | bash

    Then run the source command that's printed in your terminal.

  2. Install the max CLI:

    magic global install max

For a complete list of max CLI commands and options, refer to the MAX Pipelines reference.

Serve your model

Use the max serve command to start a local model server with the Llama 3.1 model:

max serve \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF

While this example uses the Llama 3.1 model, you can replace it with any of the models listed on the MAX Builds site.

The server is ready when you see a message indicating it's running on http://0.0.0.0:8000:

Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
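
Before sending chat requests, you can optionally confirm the endpoint is reachable from a second terminal. The sketch below uses only the Python standard library and assumes the server exposes the standard OpenAI-compatible /v1/models listing; check the MAX Serve API documentation for the endpoints your version provides.

import json
import urllib.request

# Query the OpenAI-compatible model listing on the local server.
with urllib.request.urlopen("http://0.0.0.0:8000/v1/models") as resp:
    payload = json.load(resp)

# Print the ID of each model the server reports.
for model in payload.get("data", []):
    print(model["id"])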

Interact with the model

Once the server is running, you can interact with the model in several ways. Because the endpoint implements the OpenAI REST API, you can send requests with any HTTP client, such as curl, or with the openai Python library.

The following curl command sends a simple chat request to the model's chat completions endpoint:

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello, how are you?"
      }
    ],
    "max_tokens": 100
  }'

You should receive a response similar to this:

{
  "id": "18b0abd2d2fd463ea43efe2c147bcac0",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": " I'm doing well, thank you for asking. How can I assist you today?",
        "refusal": "",
        "tool_calls": null,
        "role": "assistant",
        "function_call": null
      },
      "logprobs": {
        "content": [],
        "refusal": []
      }
    }
  ],
  "created": 1743543698,
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "service_tier": null,
  "system_fingerprint": null,
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 17,
    "prompt_tokens": null,
    "total_tokens": 17
  }
}
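
Because the server implements the OpenAI chat completions API, you can make the same request from Python with the official openai client by pointing it at the local server. The following is a minimal sketch: the api_key value is a placeholder for a local server that doesn't require authentication, and it assumes you have the openai package installed (for example, with pip install openai).

from openai import OpenAI

# Point the client at the local MAX server instead of api.openai.com.
# The API key is a placeholder; a local server typically does not check it.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"},
    ],
    max_tokens=100,
)

# The reply text lives in the first choice's message, as in the JSON response above.
print(response.choices[0].message.content)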
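
If you want tokens to appear as they're generated rather than waiting for the full reply, the OpenAI chat completions API also defines streamed responses via stream=True. The sketch below shows the same request in streaming form, assuming the server supports streaming; see the MAX Serve API documentation to confirm.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

# stream=True returns an iterator of chunks instead of a single response.
stream = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=100,
    stream=True,
)

# Each chunk carries a small delta of the assistant's reply.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()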

For complete details on all available API endpoints and options, see the MAX Serve API documentation.

Next steps

Now that you have successfully set up MAX with OpenAI-compatible endpoints, check out these other tutorials:
