
Serve custom model architectures
MAX comes with built-in support for popular model architectures like Gemma3ForCausalLM, Qwen2ForCausalLM, and LlamaForCausalLM, so you can instantly deploy them by passing a specific Hugging Face model name to the max serve command (explore our model repo). You can also use MAX to serve a custom model architecture with the max serve command, which provides an OpenAI-compatible API.
In this tutorial, you'll implement a custom architecture based on the Qwen2 model by extending MAX's existing Llama3 implementation. This approach demonstrates how to leverage MAX's built-in architectures to quickly support new models with similar structures. By the end of this tutorial, you'll understand how to:
- Set up the required file structure for custom architectures.
- Extend existing MAX model implementations.
- Register your model architecture with MAX.
- Serve your model and make inference requests.
Set up your environment
Create a Python project and install the necessary dependencies using one of the following package managers: pixi, uv, pip, or conda.

pixi

- If you don't have it, install pixi:

  curl -fsSL https://pixi.sh/install.sh | sh

  Then restart your terminal for the changes to take effect.

- Create a project:

  pixi init qwen2 \
      -c https://conda.modular.com/max-nightly/ -c conda-forge \
      && cd qwen2

- Install the modular conda package. For the nightly build:

  pixi add modular

  For the stable release:

  pixi add "modular=25.4"

- Start the virtual environment:

  pixi shell
uv

- If you don't have it, install uv:

  curl -LsSf https://astral.sh/uv/install.sh | sh

  Then restart your terminal to make uv accessible.

- Create a project:

  uv init qwen2 && cd qwen2

- Create and start a virtual environment:

  uv venv && source .venv/bin/activate

- Install the modular Python package. For the nightly build:

  uv pip install modular \
      --extra-index-url https://download.pytorch.org/whl/cpu \
      --index-url https://dl.modular.com/public/nightly/python/simple/ \
      --index-strategy unsafe-best-match --prerelease allow

  For the stable release:

  uv pip install modular \
      --extra-index-url https://download.pytorch.org/whl/cpu \
      --extra-index-url https://modular.gateway.scarf.sh/simple/ \
      --index-strategy unsafe-best-match
pip

- Create a project folder:

  mkdir qwen2 && cd qwen2

- Create and activate a virtual environment:

  python3 -m venv .venv/qwen2 \
      && source .venv/qwen2/bin/activate

- Install the modular Python package. For the nightly build:

  pip install --pre modular \
      --extra-index-url https://download.pytorch.org/whl/cpu \
      --index-url https://dl.modular.com/public/nightly/python/simple/

  For the stable release:

  pip install modular \
      --extra-index-url https://download.pytorch.org/whl/cpu \
      --extra-index-url https://modular.gateway.scarf.sh/simple/
conda

- If you don't have it, install conda. A common choice is with brew:

  brew install miniconda

- Initialize conda for shell interaction:

  conda init

  If you're on a Mac, instead use:

  conda init zsh

  Then restart your terminal for the changes to take effect.

- Create a project:

  conda create -n qwen2

- Start the virtual environment:

  conda activate qwen2

- Install the modular conda package. For the nightly build:

  conda install -c conda-forge -c https://conda.modular.com/max-nightly/ modular

  For the stable release:

  conda install -c conda-forge -c https://conda.modular.com/max/ modular
Understand the architecture structure
Before creating your custom architecture, let's understand how to organize your custom model project. Create the following structure in your project directory:
qwen2/
├── __init__.py
├── arch.py
└── model.py
Here's what each file does:
- __init__.py: Makes your architecture discoverable by MAX.
- arch.py: Registers your model with MAX, specifying supported encodings, capabilities, and which existing components to reuse.
- model.py: Contains your model implementation that extends an existing MAX model class.
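If you prefer to create these placeholder files from a script rather than by hand, here is a minimal sketch using Python's pathlib; it simply creates the empty files shown in the tree above.

from pathlib import Path

# Create the qwen2/ package with the three empty files shown above.
pkg = Path("qwen2")
pkg.mkdir(exist_ok=True)
for name in ("__init__.py", "arch.py", "model.py"):
    (pkg / name).touch()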
When extending an existing architecture, you can often reuse configuration handling and weight adapters from the parent model, significantly reducing the amount of code you need to write.
Implement the main model class
When your model is similar to an existing architecture, you can extend that model class instead of building from scratch. In this example, we'll extend the Llama3Model class to implement the Qwen2Model class:
from __future__ import annotations

from typing import Optional

from max.driver import Device
from max.engine import InferenceSession
from max.graph.weights import Weights, WeightsAdapter
from max.nn import ReturnLogits
from max.pipelines.architectures.llama3.model import Llama3Model
from max.pipelines.lib import (
    KVCacheConfig,
    PipelineConfig,
    SupportedEncoding,
)
from transformers import AutoConfig


class Qwen2Model(Llama3Model):
    """Qwen2 pipeline model implementation."""

    attention_bias: bool = True
    """Whether to use attention bias."""

    def __init__(
        self,
        pipeline_config: PipelineConfig,
        session: InferenceSession,
        huggingface_config: AutoConfig,
        encoding: SupportedEncoding,
        devices: list[Device],
        kv_cache_config: KVCacheConfig,
        weights: Weights,
        adapter: Optional[WeightsAdapter] = None,
        return_logits: ReturnLogits = ReturnLogits.LAST_TOKEN,
    ) -> None:
        super().__init__(
            pipeline_config,
            session,
            huggingface_config,
            encoding,
            devices,
            kv_cache_config,
            weights,
            adapter,
            return_logits,
        )
By inheriting from Llama3Model, the Qwen2 implementation automatically gets:
- The execute, prepare_initial_token_inputs, and prepare_next_token_inputs methods required by MAX (see the quick check after this list).
- Graph building logic for transformer architectures.
- Configuration handling from Hugging Face models.
- Weight loading and conversion capabilities.
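If you'd like to verify this inheritance before wiring everything up, a small sanity check (assuming the modular package is installed in your active environment) is to inspect the parent class directly:

from max.pipelines.architectures.llama3.model import Llama3Model

# Confirm the methods MAX requires are defined on Llama3Model and therefore
# inherited by Qwen2Model. This only inspects the class; no weights are loaded.
for method in ("execute", "prepare_initial_token_inputs", "prepare_next_token_inputs"):
    print(f"{method}: {hasattr(Llama3Model, method)}")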
The only modification needed is setting attention_bias = True
to match Qwen2's
architecture specifics. This approach works because Qwen2 and Llama3 share
similar transformer architectures.
Define your architecture registration
The arch.py file tells MAX about your model's capabilities. When extending an existing architecture, you can reuse many components:
from max.graph.weights import WeightsFormat
from max.interfaces import PipelineTask
from max.nn.kv_cache import KVCacheStrategy
from max.pipelines.architectures.llama3 import weight_adapters
from max.pipelines.lib import (
    RopeType,
    SupportedArchitecture,
    SupportedEncoding,
    TextTokenizer,
)

from .model import Qwen2Model

qwen2_arch = SupportedArchitecture(
    name="Qwen2ForCausalLM",
    task=PipelineTask.TEXT_GENERATION,
    example_repo_ids=["Qwen/Qwen2.5-7B-Instruct", "Qwen/QwQ-32B"],
    default_weights_format=WeightsFormat.safetensors,
    default_encoding=SupportedEncoding.bfloat16,
    supported_encodings={
        SupportedEncoding.float32: [KVCacheStrategy.PAGED],
        SupportedEncoding.bfloat16: [KVCacheStrategy.PAGED],
    },
    pipeline_model=Qwen2Model,
    tokenizer=TextTokenizer,
    rope_type=RopeType.normal,
    weight_adapters={
        WeightsFormat.safetensors: weight_adapters.convert_safetensor_state_dict,
        WeightsFormat.gguf: weight_adapters.convert_gguf_state_dict,
    },
)
This configuration demonstrates several key features of MAX's architecture system. The name parameter must match the model class name in Hugging Face configs, while task specifies the pipeline task type using PipelineTask from max.interfaces. The rope_type parameter specifies the type of rotary position embeddings used by the model.
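For example, you can check the class name a checkpoint reports by reading its Hugging Face config; a quick sketch, assuming transformers is installed and the Hugging Face Hub is reachable:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# The `architectures` field in config.json is what the `name` parameter must match.
print(config.architectures)  # ['Qwen2ForCausalLM']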
One of the significant advantages of extending existing architectures is the ability to reuse components. In this case, we're reusing Llama3's weight adapters instead of creating custom ones, which handles the conversion between different weight formats like SafeTensors and GGUF. This reuse pattern is common when extending existing architectures—you can often leverage adapters, configuration handling, and other utilities from the parent model.
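To make the adapter idea concrete: a weight adapter is essentially a function that maps a checkpoint's tensor names (and sometimes layouts) onto the names the MAX graph expects. The following is a purely illustrative sketch of that key-renaming pattern with made-up names; it is not MAX's actual adapter API.

# Hypothetical illustration of the weight-adapter pattern: rename checkpoint
# keys from one naming scheme to another. The reused Llama3 adapters above do
# more (format-specific handling for SafeTensors and GGUF), but follow this idea.
def rename_state_dict(state_dict: dict, prefix_map: dict[str, str]) -> dict:
    renamed = {}
    for name, tensor in state_dict.items():
        new_name = name
        for old_prefix, new_prefix in prefix_map.items():
            if name.startswith(old_prefix):
                new_name = new_prefix + name[len(old_prefix):]
                break
        renamed[new_name] = tensor
    return renamed

# Example with made-up key names:
print(rename_state_dict(
    {"model.layers.0.self_attn.q_proj.weight": "..."},
    {"model.": "language_model."},
))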
Load your architecture
Create an __init__.py
file to make your architecture discoverable by MAX:
from .arch import qwen2_arch

ARCHITECTURES = [qwen2_arch]

__all__ = ["qwen2_arch", "ARCHITECTURES"]
MAX automatically loads any architectures listed in the ARCHITECTURES
variable
when you specify your module with the
--custom-architectures
flag.
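Before serving, you can optionally confirm that the package imports cleanly and exposes the list MAX looks for. A minimal sketch, run from the directory that contains the qwen2/ package (it assumes SupportedArchitecture exposes the registered name as a name attribute):

from qwen2 import ARCHITECTURES

# MAX discovers architectures through this list; printing it confirms the module
# imports and the SupportedArchitecture object was constructed.
for arch in ARCHITECTURES:
    print(arch.name)  # expected: Qwen2ForCausalLM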
Test your custom architecture
You can now test your custom architecture using the --custom-architectures
flag. From your project directory, run the following command:
max serve \
    --model-path Qwen/Qwen2.5-7B-Instruct \
    --custom-architectures qwen2
The --model-path flag tells MAX which model to serve; you can pass a Hugging Face model name or the path to a local directory containing a model. The --custom-architectures flag tells MAX to load custom architectures from the Python module we just built.
The server is ready when you see this message:

Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
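Because the server exposes an OpenAI-compatible API, you can optionally confirm it sees your model by listing the served models first; a short sketch with the openai Python client, assuming the standard model-listing route is available:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The Qwen2.5 repo ID should appear if the custom architecture loaded correctly.
for model in client.models.list():
    print(model.id)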
Now you can test your custom architecture. If you implemented an architecture to do text generation, you can send a request to that endpoint. For example:
cURL

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello! Can you help me with a simple task?"}
    ],
    "max_tokens": 100
  }'

Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # Required by the API but not used by MAX
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "Hello! Can you help me with a simple task?"}
    ],
    max_tokens=100,
)

print(response.choices[0].message.content)
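If you want tokens as they are generated rather than a single response, the same endpoint can be called with streaming enabled; a brief sketch with the openai client, assuming the server supports the standard OpenAI streaming protocol:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Stream the completion chunk by chunk instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()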
Next steps
Congratulations! You've successfully created a custom architecture for MAX
pipelines and served it with the max serve
command.
While this tutorial showed the simplified approach of extending an existing architecture, you may need to implement a model from scratch if your architecture differs significantly from MAX's built-in models. In that case, you would:
- Implement the full PipelineModel interface, including the execute, prepare_initial_token_inputs, and prepare_next_token_inputs methods.
- Create custom configuration classes to handle model parameters.
- Write custom weight adapters for converting between different formats.
- Build the computation graph using MAX's graph API.
For implementation details, explore the existing supported model architectures on GitHub. Each subdirectory represents a different model family with its own implementation. You can examine these architectures to understand different approaches and find the best base for your custom architecture.