Image to text

Multimodal large language models are capable of processing images and text together in a single request. They can describe visual content, answer questions about images, and support tasks such as image captioning, document analysis, chart interpretation, optical character recognition (OCR), and content moderation.

Endpoint

You can interact with a multimodal LLM through the v1/chat/completions endpoint by including image inputs alongside text in the request. Each image is passed as either an image URL or a base64-encoded image within the conversation, so a single request can caption an image, answer a question about a photo, summarize a chart, or combine a text prompt with visual context.

Within the v1/chat/completions request body, the "messages" array might look similar to the following for image-to-text:

"messages": [
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "What is in this image?"
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "https://example.com/path/to/your-image.jpg"
        }
      }
    ]
  }
]
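
To send a local image rather than a hosted one, the image can be base64-encoded and placed in the same "url" field. The following is a minimal sketch, assuming the server accepts the data-URL form (data:<mime>;base64,<data>) that OpenAI's API uses for inline images. The helper names image_to_data_url and build_image_message are hypothetical and used only for illustration, and the placeholder bytes stand in for a real image file.

```python
import base64
import mimetypes


def image_to_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URL."""
    # Guess the MIME type from the file extension.
    # Falling back to JPEG is an assumption made for this sketch.
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        mime_type = "image/jpeg"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(prompt: str, image_url: str) -> dict:
    """Build one user message that pairs a text prompt with an image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


# Placeholder bytes; in practice use open(path, "rb").read().
data_url = image_to_data_url(b"\xff\xd8\xff", "photo.jpg")
messages = [build_image_message("What is in this image?", data_url)]
print(messages[0]["content"][1]["image_url"]["url"])
```

The resulting messages list has the same shape as the example above, so it can be passed to the endpoint unchanged.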

Quickstart

In this quickstart, learn how to set up and run Llama 3.2 11B Vision Instruct, which excels at tasks such as image captioning and visual question answering.

System requirements:

Set up your environment

Create a Python project to install our APIs and CLI tools:

  1. If you don't have it, install pixi:
    curl -fsSL https://pixi.sh/install.sh | sh

    Then restart your terminal for the changes to take effect.

  2. Create a project:
    pixi init vision-quickstart \
      -c https://conda.modular.com/max-nightly/ -c conda-forge \
      && cd vision-quickstart
  3. Install the modular conda package:
    pixi add modular
  4. Start the virtual environment:
    pixi shell

Serve your model

To get Llama 3.2 11B Vision Instruct, you must have a Hugging Face user access token and approved access to the Llama 3.2 11B Vision Instruct Hugging Face repo.

To create a Hugging Face user access token, see Access Tokens. Within your local environment, save your access token as an environment variable:

export HF_TOKEN="hf_..."

Use the max serve command to start a local model server with the Llama 3.2 Vision model:

max serve \
  --model meta-llama/Llama-3.2-11B-Vision-Instruct \
  --max-length 108172 \
  --max-batch-size 1

This starts a server for the Llama-3.2-11B-Vision-Instruct multimodal model at http://localhost:8000/v1/chat/completions, an OpenAI-compatible endpoint.

While this example uses the Llama 3.2 Vision model, you can replace it with any of the models listed on the MAX Builds site.

The endpoint is ready when you see this message printed in your terminal:

Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
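
If you start the server from a script instead of watching the terminal, you can wait for the port to open before sending requests. The following is a rough sketch that uses only the Python standard library. The function name wait_for_port and its default timeouts are hypothetical, and an open port only shows that the process is listening, so treat it as an approximation of the "Server ready" message and not a substitute for it.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection succeeds, or False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Not listening yet (or unreachable); wait, then retry.
            time.sleep(interval_s)
    return False
```

For example, wait_for_port("localhost", 8000, timeout_s=600) blocks for up to ten minutes, which leaves time for the model weights to download on the first run.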

For a complete list of max CLI commands and options, refer to the MAX CLI reference.

Interact with your model

MAX supports OpenAI's REST APIs, and you can interact with the model using either the OpenAI Python SDK or curl:

You can use OpenAI's Python client to interact with the vision model. First, install the openai package:

pixi add openai

Then, create a client and make a request to the model:

generate-image-description.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
                    }
                }
            ]
        }
    ],
    max_tokens=300
)

print(response.choices[0].message.content)

In this example, you're using the OpenAI Python client to interact with the MAX endpoint running on localhost, port 8000. The client object is initialized with the base URL http://localhost:8000/v1. The client requires an api_key value, but the local server ignores it, so any placeholder string such as "EMPTY" works.

When you run this code, the model should respond with information about the image:

python generate-image-description.py
A rabbit is sitting in a field. It has long ears and a white belly. It is looking at the camera.
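
Because the endpoint accepts plain JSON over HTTP, you can send the same request without the SDK, which is essentially what a curl call does. The following is a minimal sketch using Python's standard library, assuming the server from the previous step is running on localhost:8000. The function names build_chat_request and describe_image are hypothetical and not part of MAX or the OpenAI SDK.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # address used by max serve above
MODEL = "meta-llama/Llama-3.2-11B-Vision-Instruct"


def build_chat_request(prompt: str, image_url: str,
                       max_tokens: int = 300) -> urllib.request.Request:
    """Build the POST request for /chat/completions without sending it."""
    payload = {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def describe_image(prompt: str, image_url: str) -> str:
    """Send the request and return the text of the first choice."""
    request = build_chat_request(prompt, image_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling describe_image("What is in this image?", url) with the rabbit image URL from the SDK example should return a description similar to the one shown above, although the exact wording can differ between runs.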

For complete details on all available API endpoints and options, see the MAX Serve API documentation.

Next steps

Now that you have successfully set up MAX with an OpenAI-compatible endpoint, check out these other tutorials:
