Structured output
MAX supports structured output generation using llguidance as a backend. Structured output, sometimes referred to as constrained decoding, lets you enforce a specific output format, ensuring structured and predictable responses from a model.
When to use structured output
If you want to structure a model's output when it responds to a user, you should use a structured output response_format.
If you are connecting a model to tools, functions, data, or other systems, you should use function calling instead of structured output.
How structured output works
To use structured output, pass the --enable-structured-output flag when serving your model with the max CLI.
max serve \
--model-path="modularai/Llama-3.1-8B-Instruct-GGUF" \
--enable-structured-output
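Once the server is running, it exposes an OpenAI-compatible API; the examples below assume it's reachable at the default endpoint, http://0.0.0.0:8000/v1.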
Then, when making inference requests, you must specify a response_format JSON schema. Both the /chat/completions and /completions API endpoints are compatible with structured output.
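For example, with the OpenAI Python client, the legacy completions API has no typed response_format parameter, so you can pass one through extra_body. The following is a minimal sketch, assuming MAX accepts the same json_schema format on /completions as on /chat/completions (the format is described in the next section):
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

# Assumption: MAX accepts the same response_format payload on
# /completions as on /chat/completions. The OpenAI client's legacy
# completions API has no typed parameter for it, so use extra_body.
completion = client.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    prompt="Extract the calendar event: Alice and Bob are going to a movie on Friday.",
    extra_body={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "CalendarEvent",
                "schema": {
                    "type": "object",
                    "properties": {
                        "activity": {"type": "string"},
                        "day": {"type": "string"},
                        "participants": {"type": "array", "items": {"type": "string"}},
                    },
                    "required": ["activity", "day", "participants"],
                    "additionalProperties": False,
                },
            },
        }
    },
)
print(completion.choices[0].text)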
We recommend testing your structured output responses thoroughly, because results are sensitive to the way the model was trained.
JSON schema
To specify a structured output within your inference request, use the following format:
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
      {
        "role": "system",
        "content": "You are an assistant that extracts calendar events from text."
      },
      {
        "role": "user",
        "content": "Alice and Bob are going to a movie on Friday."
      }
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "CalendarEvent",
        "schema": {
          "type": "object",
          "properties": {
            "activity": { "type": "string" },
            "day": { "type": "string" },
            "participants": {
              "type": "array",
              "items": { "type": "string" }
            }
          },
          "required": ["activity", "day", "participants"],
          "additionalProperties": false
        }
      }
    }
  }'
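If the request succeeds, the assistant message's content is a JSON string that conforms to the schema. An illustrative response (actual values depend on the model) might look like:
{"activity": "movie", "day": "Friday", "participants": ["Alice", "Bob"]}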
Schema validation
You can also define your structured output using a Pydantic BaseModel to validate your JSON schema in Python.
Here's an example:
from pydantic import BaseModel
from openai import OpenAI

# Point the OpenAI client at the local MAX server.
client = OpenAI(
    base_url="http://0.0.0.0:8000/v1",
    api_key="EMPTY",
)

# A Pydantic model that defines the expected response schema.
class CalendarEvent(BaseModel):
    activity: str
    day: str
    participants: list[str]

completion = client.chat.completions.parse(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "Extract the calendar event information."},
        {"role": "user", "content": "Alice and Bob are going to a movie on Friday."},
    ],
    response_format=CalendarEvent,
)

# The client parses the JSON response into a CalendarEvent instance.
event = completion.choices[0].message.parsed
print(event)
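Because parse validates the response against the Pydantic model, parsed is a typed CalendarEvent instance, so you can access its fields directly:
print(event.activity)      # e.g. "movie"
print(event.participants)  # e.g. ["Alice", "Bob"]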
Supported models
All text generation models served with MAX support structured output, and models added in the future will be compatible as well, because the functionality is implemented at the pipeline level, ensuring consistency across different models.
However, structured output currently doesn't support PyTorch models or CPU deployments; it's only available for MAX models deployed on GPUs.
Next steps
For more examples, explore the structured output recipes.
After defining your output structure, you can deploy your workflow on GPUs.