
Generate image descriptions with Llama 3.2 Vision
The MAX framework simplifies the process of creating an endpoint for multimodal models that handle both text and images, such as Llama 3.2 11B Vision Instruct, which excels at tasks like image captioning and visual question answering. This tutorial walks you through installing the necessary tools, configuring access, and serving the model locally with an OpenAI-compatible endpoint.
System requirements: Mac, Linux, WSL, GPU
Set up your environment
Create a Python project to install our APIs and CLI tools:
Choose one of the following tools: pip, uv, or magic.

Using pip

- Create a project folder:

mkdir vision-tutorial && cd vision-tutorial

- Create and activate a virtual environment:

python3 -m venv .venv/vision-tutorial \
  && source .venv/vision-tutorial/bin/activate

- Install the modular Python package (choose Nightly or Stable):

Nightly:

pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --extra-index-url https://dl.modular.com/public/nightly/python/simple/

Stable:

pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu

Using uv

- Install uv:

curl -LsSf https://astral.sh/uv/install.sh | sh

Then restart your terminal to make uv accessible.

- Create a project:

uv init vision-tutorial && cd vision-tutorial

- Create and start a virtual environment:

uv venv && source .venv/bin/activate

- Install the modular Python package (choose Nightly or Stable):

Nightly:

uv pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --extra-index-url https://dl.modular.com/public/nightly/python/simple/ \
  --index-strategy unsafe-best-match

Stable:

uv pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --index-strategy unsafe-best-match

Using magic

- Install magic:

curl -ssL https://magic.modular.com/ | bash

Then run the source command that's printed in your terminal.

- Create a project:

magic init vision-tutorial --format pyproject && cd vision-tutorial

- Install the max-pipelines conda package (choose Nightly or Stable):

Nightly:

magic add max-pipelines

Stable:

magic add "max-pipelines==25.3"

- Start the virtual environment:

magic shell
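To confirm the install succeeded, you can check that the MAX Python package is importable from inside the environment you just activated. This is only a sanity check and a sketch: it uses nothing beyond the standard library and assumes the package exposes a module named max, which is what the modular and max-pipelines packages provide.

# check_install.py: confirm the MAX Python package is importable (no model is loaded).
import importlib.util

spec = importlib.util.find_spec("max")
if spec is None:
    raise SystemExit("The max module was not found. Re-run the install step above.")
print(f"Found the max package at: {spec.origin}")

Run it with python check_install.py from inside the activated environment.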
Configure Hugging Face access
To download the model used in this tutorial, you must have a Hugging Face user access token and approved access to the Llama 3.2 11B Vision Instruct Hugging Face repo.
To create a Hugging Face user access token, see Access Tokens. Then, within your local environment, save your access token as an environment variable:
export HF_TOKEN="hf_..."
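Optionally, you can verify that the token is picked up from your environment and is valid by querying Hugging Face with the huggingface_hub Python package. This is just a sketch; it assumes huggingface_hub is available in your environment (install it with pip install huggingface_hub if it isn't).

# verify_token.py: optional check that HF_TOKEN resolves to a valid Hugging Face account.
import os
from huggingface_hub import whoami

info = whoami(token=os.environ["HF_TOKEN"])
print(f"Authenticated to Hugging Face as: {info['name']}")

Note that this confirms the token is valid, but not that your request for access to the Llama 3.2 repo has been approved.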
Generate a sample description
You can generate an image description using the max generate command. Downloading the Llama 3.2 11B Vision Instruct model weights takes some time.
max generate \
--model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
--prompt "<|image|><|begin_of_text|>What is in this image?" \
--image_url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg" \
--max-new-tokens 100 \
--max-batch-size 1 \
--max-length 108172
When using the max CLI tool with multimodal input, you must provide both a --prompt and an --image_url. Additionally, the prompt must be in a valid format for the model used. For Llama 3.2 11B Vision Instruct, you must include the <|image|> tag in the prompt if the input includes an image to reason about. For more information about Llama 3.2 Vision prompt templates, see Vision Model Inputs and Outputs.
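If you want to script this step rather than typing the command by hand, you can build the prompt and shell out to the max CLI from Python. The sketch below is only illustrative: it reuses the exact model path, image URL, and flags from the command above and adds nothing else.

# generate_description.py: illustrative Python wrapper around the max generate command above.
import subprocess

question = "What is in this image?"
# Llama 3.2 Vision requires the <|image|> tag when the input includes an image.
prompt = f"<|image|><|begin_of_text|>{question}"
image_url = (
    "https://huggingface.co/datasets/huggingface/documentation-images/"
    "resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
)

subprocess.run(
    [
        "max", "generate",
        "--model-path", "meta-llama/Llama-3.2-11B-Vision-Instruct",
        "--prompt", prompt,
        f"--image_url={image_url}",
        "--max-new-tokens", "100",
        "--max-batch-size", "1",
        "--max-length", "108172",
    ],
    check=True,
)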
Serve the Llama 3.2 Vision model
You can alternatively serve the Llama 3.2 Vision model and make multiple requests to a local endpoint. If you already tested the model with the max generate command, you do not have to wait for the model to download again.
Serve the model with the max serve command:
max serve \
--model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
--max-length 108172 \
--max-batch-size 1
The endpoint is ready when you see this message printed in your terminal:
Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
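If you're driving the server from a script, you can poll the endpoint until it's ready instead of watching for the log line. The sketch below uses the requests package (pip install requests) and assumes the server exposes the standard OpenAI-compatible /v1/models route; if your version doesn't, poll the chat completions route from the next section instead.

# wait_for_server.py: poll the local endpoint until it accepts requests.
import time
import requests

URL = "http://localhost:8000/v1/models"  # assumed OpenAI-compatible route

for _ in range(60):
    try:
        if requests.get(URL, timeout=2).ok:
            print("Server is ready.")
            break
    except requests.ConnectionError:
        pass  # The server isn't accepting connections yet.
    time.sleep(5)
else:
    raise SystemExit("Server did not become ready in time.")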
Test the endpoint
After the server is running, you can test it by opening a new terminal window and sending a curl request.
The following request includes an image URL and a question to answer about the provided image:
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
This sends an image along with a text prompt to the model, and you should receive a response describing the image. You can test the endpoint with any local base64-encoded image or any image URL.
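Because the server speaks the OpenAI-compatible chat completions API, you can also call it from Python with the openai client (pip install openai). The following sketch sends the same request as the curl example; the api_key value is a placeholder because the local server in this tutorial doesn't require authentication, and the commented lines show how you might swap in a local image as a base64 data URL.

# chat_with_image.py: query the local OpenAI-compatible endpoint from Python.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is unused locally

image_url = (
    "https://huggingface.co/datasets/huggingface/documentation-images/"
    "resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
)

# To send a local image instead, encode it as a base64 data URL:
# import base64
# with open("my_image.jpg", "rb") as f:
#     image_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)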
Next steps
Now that you have successfully deployed Llama 3.2 Vision, you can:
- Experiment with different images and prompts
- Explore deployment configurations and additional features, such as function calling, prefix caching, and structured output
- Deploy the model to a containerized cloud environment for scalable serving