
Generate image descriptions with Llama 3.2 Vision
The MAX framework simplifies the process of creating an endpoint for multimodal models that handle both text and images, such as Llama 3.2 11B Vision Instruct, which excels at tasks like image captioning and visual question answering. This tutorial walks you through installing the necessary tools, configuring access, and serving the model locally with an OpenAI-compatible endpoint.
System requirements: Mac, Linux, WSL, GPU
Set up your environment
Create a Python project to install our APIs and CLI tools:
Choose one of the following tools: pip, uv, or magic.

Using pip

- Create a project folder:

mkdir vision-tutorial && cd vision-tutorial

- Create and activate a virtual environment:

python3 -m venv .venv/vision-tutorial \
  && source .venv/vision-tutorial/bin/activate

- Install the modular Python package (choose Nightly or Stable):

Nightly:

pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --extra-index-url https://dl.modular.com/public/nightly/python/simple/

Stable:

pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu

Using uv

- Install uv:

curl -LsSf https://astral.sh/uv/install.sh | sh

Then restart your terminal to make uv accessible.

- Create a project:

uv init vision-tutorial && cd vision-tutorial

- Create and start a virtual environment:

uv venv && source .venv/bin/activate

- Install the modular Python package (choose Nightly or Stable):

Nightly:

uv pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --extra-index-url https://dl.modular.com/public/nightly/python/simple/ \
  --index-strategy unsafe-best-match

Stable:

uv pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --index-strategy unsafe-best-match

Using magic

- Install magic:

curl -ssL https://magic.modular.com/ | bash

Then run the source command that's printed in your terminal.

- Create a project:

magic init vision-tutorial --format pyproject && cd vision-tutorial

- Install the max-pipelines conda package (choose Nightly or Stable):

Nightly:

magic add max-pipelines

Stable:

magic add "max-pipelines==25.3"

- Start the virtual environment:

magic shell
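To confirm the install succeeded, you can check that the MAX Python package is importable from inside the environment you just activated. This is only a sanity check and a sketch: it uses nothing beyond the standard library and assumes the package exposes a module named max, which is what the modular and max-pipelines packages provide.

# check_install.py: confirm the MAX Python package is importable (no model is loaded).
import importlib.util

spec = importlib.util.find_spec("max")
if spec is None:
    raise SystemExit("The max module was not found. Re-run the install step above.")
print(f"Found the max package at: {spec.origin}")

Run it with python check_install.py from inside the activated environment.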
Configure Hugging Face access
To download the model used in this tutorial, you must have a Hugging Face user access token and approved access to the Llama 3.2 11B Vision Instruct Hugging Face repo.
To create a Hugging Face user access token, see Access Tokens. Then, within your local environment, save your access token as an environment variable:
export HF_TOKEN="hf_..."
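Optionally, you can verify that the token is picked up from your environment and is valid by querying Hugging Face with the huggingface_hub Python package. This is just a sketch; it assumes huggingface_hub is available in your environment (install it with pip install huggingface_hub if it isn't).

# verify_token.py: optional check that HF_TOKEN resolves to a valid Hugging Face account.
import os
from huggingface_hub import whoami

info = whoami(token=os.environ["HF_TOKEN"])
print(f"Authenticated to Hugging Face as: {info['name']}")

Note that this confirms the token is valid, but not that your request for access to the Llama 3.2 repo has been approved.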
Generate a sample description
You can generate an image description using the max generate command. Downloading the Llama 3.2 11B Vision Instruct model weights takes some time.
max generate \
--model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
--prompt "<|image|><|begin_of_text|>What is in this image?" \
--image_url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg" \
--max-new-tokens 100 \
--max-batch-size 1 \
--max-length 108172
When using the max CLI tool with multimodal input, you must provide both a --prompt and an --image_url. Additionally, the prompt must be in a valid format for the model used. For Llama 3.2 11B Vision Instruct, you must include the <|image|> tag in the prompt if the input includes an image to reason about. For more information about Llama 3.2 Vision prompt templates, see Vision Model Inputs and Outputs.
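If you want to script this step rather than typing the command by hand, you can build the prompt and shell out to the max CLI from Python. The sketch below is only illustrative: it reuses the exact model path, image URL, and flags from the command above and adds nothing else.

# generate_description.py: illustrative Python wrapper around the max generate command above.
import subprocess

question = "What is in this image?"
# Llama 3.2 Vision requires the <|image|> tag when the input includes an image.
prompt = f"<|image|><|begin_of_text|>{question}"
image_url = (
    "https://huggingface.co/datasets/huggingface/documentation-images/"
    "resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
)

subprocess.run(
    [
        "max", "generate",
        "--model-path", "meta-llama/Llama-3.2-11B-Vision-Instruct",
        "--prompt", prompt,
        f"--image_url={image_url}",
        "--max-new-tokens", "100",
        "--max-batch-size", "1",
        "--max-length", "108172",
    ],
    check=True,
)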
Serve the Llama 3.2 Vision model
You can alternatively serve the Llama 3.2 Vision model and make multiple requests to a local endpoint. If you already tested the model with the max generate command, you do not have to wait for the model to download again.
Serve the model with the max serve command:
max serve \
--model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
--max-length 108172 \
--max-batch-size 1
The endpoint is ready when you see this message printed in your terminal:
Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
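If you're driving the server from a script, you can poll the endpoint until it's ready instead of watching for the log line. The sketch below uses the requests package (pip install requests) and assumes the server exposes the standard OpenAI-compatible /v1/models route; if your version doesn't, poll the chat completions route from the next section instead.

# wait_for_server.py: poll the local endpoint until it accepts requests.
import time
import requests

URL = "http://localhost:8000/v1/models"  # assumed OpenAI-compatible route

for _ in range(60):
    try:
        if requests.get(URL, timeout=2).ok:
            print("Server is ready.")
            break
    except requests.ConnectionError:
        pass  # The server isn't accepting connections yet.
    time.sleep(5)
else:
    raise SystemExit("Server did not become ready in time.")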
Test the endpoint
After the server is running, you can test it by opening a new terminal window and sending a curl request.
The following request includes an image URL and a question to answer about the provided image:
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
This sends an image along with a text prompt to the model, and you should receive a response describing the image. You can test the endpoint with any local base64-encoded image or any image URL.
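Because the server speaks the OpenAI-compatible chat completions API, you can also call it from Python with the openai client (pip install openai). The following sketch sends the same request as the curl example; the api_key value is a placeholder because the local server in this tutorial doesn't require authentication, and the commented lines show how you might swap in a local image as a base64 data URL.

# chat_with_image.py: query the local OpenAI-compatible endpoint from Python.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is unused locally

image_url = (
    "https://huggingface.co/datasets/huggingface/documentation-images/"
    "resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
)

# To send a local image instead, encode it as a base64 data URL:
# import base64
# with open("my_image.jpg", "rb") as f:
#     image_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)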
Next steps
Now that you have successfully deployed Llama 3.2 Vision, you can:
- Experiment with different images and prompts
- Explore deployment configurations and additional features, such as function calling, prefix caching, and structured output
- Deploy the model to a containerized cloud environment for scalable serving