Start a chat endpoint

The MAX framework makes it straightforward to serve open source models behind the same API interface as OpenAI. This lets you replace commercial models with alternatives from the MAX Builds site with minimal code changes.

This tutorial shows you how to serve Llama 3.1 locally with the max CLI and interact with it through REST and Python APIs. You'll learn to configure the server and make requests using the OpenAI client libraries as a drop-in replacement.

System requirements: Mac, Linux, or WSL.

Set up your environment

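Create a Python project to install our APIs and CLI tools. You can use pixi, uv, pip, or conda. The steps for each tool follow in that order, so complete only the set for the tool you choose.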

If you don't have it, install pixi:

curl -fsSL https://pixi.sh/install.sh | sh

Then restart your terminal for the changes to take effect.

Create a project:

pixi init chat-tutorial \
  -c https://conda.modular.com/max-nightly/ -c conda-forge \
  && cd chat-tutorial

Install the modular conda package:

Nightly:

pixi add modular

Stable:

pixi add "modular=25.4"

Start the virtual environment:
```
pixi shell
```

If you don't have it, install uv:

curl -LsSf https://astral.sh/uv/install.sh | sh

Then restart your terminal to make uv accessible.

Create a project:

uv init chat-tutorial && cd chat-tutorial

Create and start a virtual environment:

uv venv && source .venv/bin/activate

Install the modular Python package:

Nightly:

uv pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --index-url https://dl.modular.com/public/nightly/python/simple/ \
  --index-strategy unsafe-best-match --prerelease allow

Stable:

uv pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --extra-index-url https://modular.gateway.scarf.sh/simple/ \
  --index-strategy unsafe-best-match

If you're using pip, create a project folder:

mkdir chat-tutorial && cd chat-tutorial

Create and activate a virtual environment:

python3 -m venv .venv/chat-tutorial \
  && source .venv/chat-tutorial/bin/activate

Install the modular Python package:

Nightly:

pip install --pre modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --index-url https://dl.modular.com/public/nightly/python/simple/

Stable:

pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --extra-index-url https://modular.gateway.scarf.sh/simple/

If you don't have it, install conda. A common choice is to install it with brew:
```
brew install miniconda
```
Initialize conda for shell interaction:
```
conda init
```
If you're on a Mac, instead use:
```
conda init zsh
```
Then restart your terminal for the changes to take effect.

Create a project:

conda create -n chat-tutorial

Start the virtual environment:

conda activate chat-tutorial


Install the modular conda package:

Nightly:

conda install -c conda-forge -c https://conda.modular.com/max-nightly/ modular

Stable:

conda install -c conda-forge -c https://conda.modular.com/max/ modular
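
Whichever tool you used, it can help to confirm that the max CLI is reachable from the active environment before moving on. The following is a minimal sketch, assuming the install places an executable named `max` on your PATH; the `find_cli` helper is a hypothetical name used only for illustration.

```python
import shutil
from typing import Optional


def find_cli(name: str = "max") -> Optional[str]:
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_cli()
    if path is None:
        # Assumption: a missing executable usually means the environment is not active.
        print("max not found; is the virtual environment active?")
    else:
        print(f"max found at {path}")
```

If the check reports that max is missing, re-activate the environment you created above and try again.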

Serve your model

Use the max serve command to start a local model server with the Llama 3.1 model:

max serve \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF

While this example uses the Llama 3.1 model, you can replace it with any of the models listed on the MAX Builds site.

When searching for a model on the MAX Builds site, make sure the model fits into your machine's memory. You can filter and sort models by hardware type and model size. For more information on how to use the MAX Builds site, see MAX Builds in 60 seconds.
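
As a rough way to reason about whether a model fits in memory, the weights alone occupy about the parameter count times the number of bits stored per weight. The sketch below works through that arithmetic. It rests on several assumptions: the 8 billion parameter count is inferred from the "8B" in the model name, the bit widths are illustrative rather than the exact encoding of any particular model file, and the hypothetical `estimate_weight_memory_gb` helper ignores the KV cache, activations, and runtime overhead, so treat the result as a lower bound.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough size of the model weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Illustrative bit widths for an 8-billion-parameter model (assumed values).
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        size_gb = estimate_weight_memory_gb(8e9, bits)
        print(f"{label}: about {size_gb:.0f} GB for weights")
```

The model size filter on the MAX Builds site remains the authoritative guide; this estimate is only a quick sanity check.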

The server is ready when you see a message indicating it's running on http://0.0.0.0:8000:

Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
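
If you start the server from a script, you may prefer to poll for readiness rather than watch the log. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the URL you probe is an assumption: many OpenAI-compatible servers answer on a model-listing path such as `/v1/models`, but check the MAX Serve API documentation for the path that MAX exposes.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if the URL answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_probe,
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
) -> bool:
    """Call probe(url) repeatedly until it succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example call (assumed path; requires a running server):
#   wait_until_ready("http://0.0.0.0:8000/v1/models")
```

The probe is passed in as a parameter so that you can swap in a different readiness check without changing the polling loop.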


For a complete list of max CLI commands and options, refer to the MAX CLI reference.

Interact with the model

After the server is running, you can interact with the model from Python or with cURL. The MAX endpoint supports OpenAI REST APIs, so you can send requests with the openai Python client or with any HTTP client.

You can use OpenAI's Python client to interact with the model.

To get started, install the OpenAI Python client:

pip install openai

Then, create a client and make a request to the model:

generate-text.py
from openai import OpenAI

client = OpenAI(
    base_url = 'http://0.0.0.0:8000/v1',
    api_key='EMPTY', # required by the API, but not used by MAX
)

response = client.chat.completions.create(
  model="modularai/Llama-3.1-8B-Instruct-GGUF",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The LA Dodgers won in 2020."},
    {"role": "user", "content": "Where was it played?"}
  ]
)
print(response.choices[0].message.content)

In this example, you're using the OpenAI Python client to interact with the MAX endpoint running locally on port 8000. The client object is initialized with the base URL http://0.0.0.0:8000/v1. The client requires an API key value, but MAX ignores it.

When you run this code, the model should respond with information about the 2020 World Series location:

python generate-text.py


The 2020 World Series was played at Globe Life Field in Arlington, Texas. It was a neutral site due to the COVID-19 pandemic.
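
The request above passes the earlier turns of the conversation in the messages list, which is how the model sees context for the follow-up question. To continue a conversation over several requests, one approach is to keep that list and append each exchange to it. The following is a minimal sketch of that idea; the `ask` helper is a hypothetical name, and it expects a client object like the one created in generate-text.py.

```python
from typing import Any, Dict, List

Message = Dict[str, str]


def ask(client: Any, model: str, history: List[Message], user_text: str) -> str:
    """Send one user turn and record both sides of the exchange in history."""
    history.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(model=model, messages=history)
    reply = response.choices[0].message.content
    # Store the reply so the next request includes it as context.
    history.append({"role": "assistant", "content": reply})
    return reply


# Usage with the client from generate-text.py (requires a running server):
#   model = "modularai/Llama-3.1-8B-Instruct-GGUF"
#   history = [{"role": "system", "content": "You are a helpful assistant."}]
#   print(ask(client, model, history, "Who won the world series in 2020?"))
#   print(ask(client, model, history, "Where was it played?"))
```

Because the history grows with every turn, long conversations send more tokens per request, so you may want to trim older messages.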


The following curl command sends a simple chat request to the model's chat completions endpoint:

curl http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
        "messages": [
            {
            "role": "system",
            "content": "You are a helpful assistant."
            },
            {
            "role": "user",
            "content": "Hello, how are you?"
            }
        ],
        "max_tokens": 100
    }'
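
The same request can also be sent from Python without the OpenAI client, which makes the shape of the HTTP call explicit. The following is a minimal sketch using only the standard library; the `build_chat_request` and `send` helpers are hypothetical names, and the URL, headers, and payload mirror the curl command above.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_chat_request(
    base_url: str,
    model: str,
    messages: List[Dict[str, str]],
    max_tokens: int = 100,
) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60) -> Dict[str, Any]:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Usage (requires a running server):
#   req = build_chat_request(
#       "http://0.0.0.0:8000/v1",
#       "modularai/Llama-3.1-8B-Instruct-GGUF",
#       [{"role": "system", "content": "You are a helpful assistant."},
#        {"role": "user", "content": "Hello, how are you?"}],
#   )
#   print(send(req))
```

Either way, the endpoint returns the same JSON structure.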

You should receive a response similar to this:

{
  "id": "18b0abd2d2fd463ea43efe2c147bcac0",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": " I'm doing well, thank you for asking. How can I assist you today?",
        "refusal": "",
        "tool_calls": null,
        "role": "assistant",
        "function_call": null
      },
      "logprobs": {
        "content": [],
        "refusal": []
      }
    }
  ],
  "created": 1743543698,
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "service_tier": null,
  "system_fingerprint": null,
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 17,
    "prompt_tokens": null,
    "total_tokens": 17
  }
}
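
In most cases you only need a few fields from this response: the reply text, the finish reason, and the token usage. The following is a minimal sketch of pulling those out of the decoded JSON. The `summarize_completion` helper is a hypothetical name, and the sample below is a trimmed copy of the response shown above.

```python
import json
from typing import Any, Dict, Optional, Tuple


def summarize_completion(body: Dict[str, Any]) -> Tuple[str, str, Optional[int]]:
    """Return (reply text, finish reason, total tokens) from a chat completion."""
    choice = body["choices"][0]
    # The sample reply starts with a space, so strip surrounding whitespace.
    text = choice["message"]["content"].strip()
    # Usage fields can be null, as prompt_tokens is in the sample response.
    usage = body.get("usage") or {}
    return text, choice["finish_reason"], usage.get("total_tokens")


sample = json.loads("""
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " I'm doing well, thank you for asking. How can I assist you today?"
      }
    }
  ],
  "usage": {"completion_tokens": 17, "prompt_tokens": null, "total_tokens": 17}
}
""")

if __name__ == "__main__":
    print(summarize_completion(sample))
```

A finish reason of "stop" means the model ended its reply on its own instead of being cut off by the max_tokens limit.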

For complete details on all available API endpoints and options, see the MAX Serve API documentation.

Next steps

Now that you have successfully set up MAX with OpenAI-compatible endpoints, check out these other tutorials:
