Generate image descriptions with Llama 3.2 Vision

Judy Heflin

MAX (Modular Accelerated Xecution) now supports multimodal models, simplifying the deployment of AI systems that handle both text and images. You can now serve models like Llama 3.2 11B Vision Instruct, which excels at tasks such as image captioning and visual question answering. This guide walks you through installing the necessary tools, configuring access, and serving the model with MAX.

Install max-pipelines

We'll use the max-pipelines CLI tool to create a local endpoint.

  1. If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:

    curl -ssL https://magic.modular.com/ | bash

    Then run the source command that's printed in your terminal.

  2. Install max-pipelines:

    magic global install max-pipelines
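
With both install steps done, you can optionally confirm that the tools landed on your PATH. This is just a convenience check in Python and isn't required; the installer's source command normally takes care of PATH setup for you.

# Optional sketch: confirm that magic and max-pipelines are on your PATH.
import shutil

for tool in ("magic", "max-pipelines"):
    location = shutil.which(tool)
    print(f"{tool}: {location or 'not found -- re-run the source command or open a new shell'}")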

Configure Hugging Face access

To download and use Llama 3.2 11B Vision Instruct from Hugging Face, you need a Hugging Face account, a user access token, and access to the gated Llama 3.2 11B Vision Instruct repository.

To create a Hugging Face user access token, see Access Tokens. Then, in your local environment, save the token as an environment variable:

export HF_TOKEN="hf_..."
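
Before downloading the 11B weights, you can optionally confirm that your token has access to the gated repository. The following sketch assumes the huggingface_hub Python package is available (it isn't required by this tutorial); an authorization error means you still need to request access to the repo.

# Optional sketch: check that HF_TOKEN can reach the gated Llama 3.2 Vision repo.
# Assumes huggingface_hub is installed (pip install huggingface_hub).
import os

from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
try:
    info = api.model_info("meta-llama/Llama-3.2-11B-Vision-Instruct")
    print(f"Access confirmed: {info.id}")
except Exception as err:  # gated-repo and auth errors surface here
    print(f"No access yet: {err}")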

Generate a sample description

You can generate an image description using the max-pipelines generate command. The first run downloads the Llama 3.2 11B Vision Instruct model weights, which takes some time.

max-pipelines generate \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --prompt "<|image|><|begin_of_text|>What is in this image?" \
  --image_url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg" \
  --max-new-tokens 100 \
  --max-batch-size 1 \
  --max-length 108172

When using the max-pipelines CLI tool with multimodal input, you must provide both a --prompt and an --image_url, and the prompt must follow a valid format for the model you're using. For Llama 3.2 11B Vision Instruct, include the <|image|> tag in the prompt whenever the input contains an image to reason about. For more information about Llama 3.2 Vision prompt templates, see Vision Model Inputs and Outputs.
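
If you prefer to drive the CLI from a script, the same call can be wrapped with Python's subprocess module. This is only a thin wrapper around the command shown above; the flags and prompt format are taken directly from that example.

# Sketch: drive the same command from Python; the prompt format (including the
# required <|image|> tag) and all flag values come from the example above.
import subprocess

question = "What is in this image?"
image_url = (
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/"
    "0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
)

subprocess.run(
    [
        "max-pipelines", "generate",
        "--model-path", "meta-llama/Llama-3.2-11B-Vision-Instruct",
        "--prompt", f"<|image|><|begin_of_text|>{question}",
        f"--image_url={image_url}",
        "--max-new-tokens", "100",
        "--max-batch-size", "1",
        "--max-length", "108172",
    ],
    check=True,
)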

Serve the Llama 3.2 Vision model

Alternatively, you can serve the Llama 3.2 Vision model and make multiple requests to a local endpoint. If you already tested the model with the max-pipelines generate command, you don't have to wait for the model to download again.

Serve the model with the max-pipelines serve command:

max-pipelines serve \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --max-length 108172 \
  --max-batch-size 1

The endpoint is ready when you see this message printed in your terminal:

Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
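
If you script against the endpoint, it can help to wait for the port to accept connections before sending the first request. The sketch below only probes TCP port 8000 and assumes nothing about the server's HTTP routes.

# Sketch: wait until something is listening on port 8000 before sending requests.
import socket
import time

deadline = time.time() + 600  # loading the 11B model can take a while
while time.time() < deadline:
    try:
        with socket.create_connection(("localhost", 8000), timeout=2):
            print("Endpoint is accepting connections.")
            break
    except OSError:
        time.sleep(2)
else:
    raise SystemExit("Timed out waiting for the server on port 8000.")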

Test the endpoint

After the server is running, you can test it by opening a new terminal window and sending a curl request.

The following request includes an image URL and a question to answer about the provided image:

curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'

This sends an image along with a text prompt to the model, and you should receive a response describing the image. You can test the endpoint with any local base64-encoded image or any image URL.
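
Because the server exposes the same chat completions API shape used in the curl request above, you can also call it from Python. The sketch below assumes you have the openai client package installed and a local image file (rabbit.jpg here is just a placeholder name); it demonstrates the base64 route by embedding the file as a data URL, and you can swap in a plain image URL instead.

# Sketch: query the local endpoint with the OpenAI-compatible Python client,
# sending a local image as a base64 data URL (pip install openai assumed).
import base64

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # placeholder; a local server typically ignores the key
)

with open("rabbit.jpg", "rb") as f:  # any local image file
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)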

Next steps

Now that you have successfully deployed Llama 3.2 Vision, you can:

  • Experiment with different images and prompts
  • Explore deployment configurations and additional features, such as function calling, prefix caching, and structured output
  • Deploy the model to a containerized cloud environment for scalable serving
