Structured output
MAX supports structured output generation using llguidance as a backend. Structured output, sometimes referred to as constrained decoding, lets you enforce a specific output format, ensuring structured and predictable responses from a model.
When to use structured output
If you want to structure a model's output when it responds to a user, you should use a structured output response_format.
If you are connecting a model to tools, functions, data, or other systems, you should use function calling instead of structured output.
How structured output works
To use structured output, pass the --enable-structured-output flag when serving your model with the max CLI.
max serve \
--model-path="modularai/Llama-3.1-8B-Instruct-GGUF" \
--enable-structured-output
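Once the server is running, it exposes an OpenAI-compatible API; the examples below assume it's reachable at the default endpoint, http://0.0.0.0:8000/v1.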
Then, when making inference requests, you must specify a response_format JSON schema. Both the /chat/completions and /completions API endpoints are compatible with structured output.
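For example, with the OpenAI Python client, the legacy completions API has no typed response_format parameter, so you can pass one through extra_body. The following is a minimal sketch, assuming MAX accepts the same json_schema format on /completions as on /chat/completions (the format is described in the next section):
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

# Assumption: MAX accepts the same response_format payload on
# /completions as on /chat/completions. The OpenAI client's legacy
# completions API has no typed parameter for it, so use extra_body.
completion = client.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    prompt="Extract the calendar event: Alice and Bob are going to a movie on Friday.",
    extra_body={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "CalendarEvent",
                "schema": {
                    "type": "object",
                    "properties": {
                        "activity": {"type": "string"},
                        "day": {"type": "string"},
                        "participants": {"type": "array", "items": {"type": "string"}},
                    },
                    "required": ["activity", "day", "participants"],
                    "additionalProperties": False,
                },
            },
        }
    },
)
print(completion.choices[0].text)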
We recommend testing your structured output responses thoroughly, because results are sensitive to the way the model was trained.
JSON schema
To specify a structured output within your inference request, use the following format:
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
      {
        "role": "system",
        "content": "You are an assistant that extracts calendar events from text."
      },
      {
        "role": "user",
        "content": "Alice and Bob are going to a movie on Friday."
      }
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "CalendarEvent",
        "schema": {
          "type": "object",
          "properties": {
            "activity": { "type": "string" },
            "day": { "type": "string" },
            "participants": {
              "type": "array",
              "items": { "type": "string" }
            }
          },
          "required": ["activity", "day", "participants"],
          "additionalProperties": false
        }
      }
    }
  }'
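If the request succeeds, the assistant message's content is a JSON string that conforms to the schema. An illustrative response (actual values depend on the model) might look like:
{"activity": "movie", "day": "Friday", "participants": ["Alice", "Bob"]}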
Schema validation
You can also define your structured output using a Pydantic BaseModel to validate your JSON schema in Python.
Here's an example:
from pydantic import BaseModel
from openai import OpenAI

# Point the OpenAI client at the local MAX server.
client = OpenAI(
    base_url="http://0.0.0.0:8000/v1",
    api_key="EMPTY",
)

# A Pydantic model that defines the expected response schema.
class CalendarEvent(BaseModel):
    activity: str
    day: str
    participants: list[str]

completion = client.chat.completions.parse(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "Extract the calendar event information."},
        {"role": "user", "content": "Alice and Bob are going to a movie on Friday."},
    ],
    response_format=CalendarEvent,
)

# The client parses the JSON response into a CalendarEvent instance.
event = completion.choices[0].message.parsed
print(event)
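Because parse validates the response against the Pydantic model, parsed is a typed CalendarEvent instance, so you can access its fields directly:
print(event.activity)      # e.g. "movie"
print(event.participants)  # e.g. ["Alice", "Bob"]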
Supported models
All text generation models served with MAX support structured output, and models added in the future will be compatible as well, because the functionality is implemented at the pipeline level, ensuring consistency across different models.
However, structured output currently doesn't support PyTorch models or CPU deployments; it's only available for MAX models deployed on GPUs.
Next steps
For more examples, explore the structured output recipes.
After defining your output structure, you can deploy your workflow on GPUs.