
Structured output

MAX supports structured output generation using XGrammar as a backend. Structured output, also known as constrained decoding, lets you enforce a specific output format, ensuring structured and predictable responses from a model.

When to use structured output

If you want to structure a model's output when it responds to a user, use a structured output response_format.

If you are connecting a model to tools, functions, data, or other systems, use function calling instead of structured output.

How structured output works

To enable structured output, pass the --enable-structured-output flag when serving your model with the max CLI.

max serve \
  --model-path="modularai/Llama-3.1-8B-Instruct-GGUF" \
  --enable-structured-output
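
MAX serves an OpenAI-compatible API, so once the server is running you can sanity-check it before sending structured requests. For example, you can list the served models; the /v1/models endpoint path and port 8000 are assumed here from the OpenAI-compatible serving defaults:

# Sanity check: list models on the assumed default port.
curl http://0.0.0.0:8000/v1/models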

Then, when making inference requests, include a response_format field containing a JSON schema. Both the /chat/completions and /completions API endpoints support structured output.

JSON schema

To specify a structured output within your inference request, use the following format:

curl -N http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
      {"role": "system", "content": "You are a helpful math tutor. Guide the user through the solution step by step. Provide your guidance in JSON format."},
      {"role": "user", "content": "How can I solve 8x + 7 = -23"}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "math_response",
        "schema": {
          "type": "object",
          "properties": {
            "steps": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "explanation": {"type": "string"},
                  "output": {"type": "string"}
                },
                "required": ["explanation", "output"],
                "additionalProperties": false
              }
            },
            "final_answer": {"type": "string"}
          },
          "required": ["steps", "final_answer"],
          "additionalProperties": false
        }
      }
    }
  }'
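
When the request succeeds, the schema-conforming JSON is returned as the content of the assistant message (choices[0].message.content). For the request above, that content might look like the following; the step text is illustrative, since actual model output will vary:

{
  "steps": [
    {"explanation": "Subtract 7 from both sides of the equation.", "output": "8x = -30"},
    {"explanation": "Divide both sides by 8.", "output": "x = -30/8 = -15/4"}
  ],
  "final_answer": "x = -15/4"
}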

Schema validation

You can also define your structured output with a Pydantic BaseModel, which the OpenAI Python client converts into a JSON schema and validates for you.

Here's an example:

from pydantic import BaseModel
from openai import OpenAI

# Point the client at the local MAX server; the api_key value is a
# placeholder, since the local endpoint does not require authentication.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.beta.chat.completions.parse(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a movie on Friday."},
    ],
    response_format=CalendarEvent,
)

event = completion.choices[0].message.parsed
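
Here, event is a validated CalendarEvent instance, so you can access its typed fields directly. For example (the printed values are illustrative):

# Access the typed fields of the parsed Pydantic model.
print(event.name)          # e.g. "movie"
print(event.date)          # e.g. "Friday"
print(event.participants)  # e.g. ["Alice", "Bob"]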

Supported models

All text generation models served with MAX support structured output, and new models are compatible with it as they are added. This functionality is implemented at the pipeline level, ensuring consistent behavior across models.

However, structured output is currently available only for MAX models deployed on GPUs; PyTorch models and CPU deployments are not supported.

Next steps

For more examples, explore the structured output recipes.

After defining your output structure, you can deploy your workflow on GPUs.