Serve custom model architectures

MAX comes with built-in support for popular model architectures such as Gemma3ForCausalLM, Qwen2ForCausalLM, and LlamaForCausalLM, so you can deploy them instantly by passing a Hugging Face model name to the max serve command (explore our model repo). You can also use the same max serve command, which provides an OpenAI-compatible API, to serve a custom model architecture.
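For example, serving one of the built-in architectures takes a single command (the checkpoint shown here is the Qwen2 model used later in this tutorial):

max serve --model-path Qwen/Qwen2.5-7B-Instruct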

In this tutorial, you'll implement a custom architecture based on the Qwen2 model by extending MAX's existing Llama3 implementation. This approach demonstrates how to leverage MAX's built-in architectures to quickly support new models with similar structures. By the end of this tutorial, you'll understand how to:

  • Set up the required file structure for custom architectures.
  • Extend existing MAX model implementations.
  • Register your model architecture with MAX.
  • Serve your model and make inference requests.

Set up your environment

Create a Python project and install the necessary dependencies:

  1. If you don't have it, install pixi:
    curl -fsSL https://pixi.sh/install.sh | sh

    Then restart your terminal for the changes to take effect.

  2. Create a project:
    pixi init qwen2 \
    -c https://conda.modular.com/max-nightly/ -c conda-forge \
    && cd qwen2
  3. Install the modular conda package:
    pixi add modular
  4. Start the virtual environment:
    pixi shell
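With the environment active, you can optionally confirm that the MAX CLI is available before moving on (the exact output depends on the release you installed):

max --version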

Understand the architecture structure

Before creating your custom architecture, let's understand how to organize your custom model project. Create the following structure in your project directory:

qwen2/
├── __init__.py
├── arch.py
└── model.py

Here's what each file does:

  • __init__.py: Makes your architecture discoverable by MAX.

  • arch.py: Registers your model with MAX, specifying supported encodings, capabilities, and which existing components to reuse.

  • model.py: Contains your model implementation that extends an existing MAX model class.

When extending an existing architecture, you can often reuse configuration handling and weight adapters from the parent model, significantly reducing the amount of code you need to write.
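If you'd like a quick way to create these files, you can run something like the following from your project root (this assumes the qwen2 package directory lives inside the project, so it's importable from the directory where you'll later run max serve):

mkdir -p qwen2
touch qwen2/__init__.py qwen2/arch.py qwen2/model.py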

Implement the main model class

When your model is similar to an existing architecture, you can extend that model class instead of building from scratch. In this example, we'll extend the Llama3Model class to implement the Qwen2Model class:

model.py
from __future__ import annotations

from typing import Optional

from max.driver import Device
from max.engine import InferenceSession
from max.graph.weights import Weights, WeightsAdapter
from max.nn import ReturnLogits
from max.pipelines.architectures.llama3.model import Llama3Model
from max.pipelines.lib import (
    KVCacheConfig,
    PipelineConfig,
    SupportedEncoding,
)
from transformers import AutoConfig


class Qwen2Model(Llama3Model):
    """Qwen2 pipeline model implementation."""

    attention_bias: bool = True
    """Whether to use attention bias."""

    def __init__(
        self,
        pipeline_config: PipelineConfig,
        session: InferenceSession,
        huggingface_config: AutoConfig,
        encoding: SupportedEncoding,
        devices: list[Device],
        kv_cache_config: KVCacheConfig,
        weights: Weights,
        adapter: Optional[WeightsAdapter] = None,
        return_logits: ReturnLogits = ReturnLogits.LAST_TOKEN,
    ) -> None:
        super().__init__(
            pipeline_config,
            session,
            huggingface_config,
            encoding,
            devices,
            kv_cache_config,
            weights,
            adapter,
            return_logits,
        )

By inheriting from Llama3Model, the Qwen2 implementation automatically reuses the parent model's configuration handling, weight loading, and graph construction. The only modification needed is setting attention_bias = True to match Qwen2's architecture specifics. This approach works because Qwen2 and Llama3 share similar transformer architectures.

Define your architecture registration

The arch.py file tells MAX about your model's capabilities. When extending an existing architecture, you can reuse many components:

arch.py
from max.graph.weights import WeightsFormat
from max.interfaces import PipelineTask
from max.nn.kv_cache import KVCacheStrategy
from max.pipelines.architectures.llama3 import weight_adapters
from max.pipelines.lib import (
    RopeType,
    SupportedArchitecture,
    SupportedEncoding,
    TextTokenizer,
)

from .model import Qwen2Model

qwen2_arch = SupportedArchitecture(
    name="Qwen2ForCausalLM",
    task=PipelineTask.TEXT_GENERATION,
    example_repo_ids=["Qwen/Qwen2.5-7B-Instruct", "Qwen/QwQ-32B"],
    default_weights_format=WeightsFormat.safetensors,
    default_encoding=SupportedEncoding.bfloat16,
    supported_encodings={
        SupportedEncoding.float32: [KVCacheStrategy.PAGED],
        SupportedEncoding.bfloat16: [KVCacheStrategy.PAGED],
    },
    pipeline_model=Qwen2Model,
    tokenizer=TextTokenizer,
    rope_type=RopeType.normal,
    weight_adapters={
        WeightsFormat.safetensors: weight_adapters.convert_safetensor_state_dict,
        WeightsFormat.gguf: weight_adapters.convert_gguf_state_dict,
    },
)

This configuration demonstrates several key features of MAX's architecture system. The name parameter must match the model class name in Hugging Face configs, while task specifies the pipeline task type using PipelineTask from max.interfaces. The rope_type parameter specifies the type of rotary position embeddings used by the model.

One of the significant advantages of extending existing architectures is the ability to reuse components. In this case, we're reusing Llama3's weight adapters instead of creating custom ones, which handles the conversion between different weight formats like SafeTensors and GGUF. This reuse pattern is common when extending existing architectures—you can often leverage adapters, configuration handling, and other utilities from the parent model.
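If you're not sure which architecture name a checkpoint declares, you can read it directly from the Hugging Face config with the transformers library (a quick check; the repo ID is the example model used in this tutorial):

from transformers import AutoConfig

# Downloads (or loads from cache) only the model's config.json.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# The `architectures` field holds the class name that `name=` must match,
# for example ["Qwen2ForCausalLM"].
print(config.architectures)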

Load your architecture

Create an __init__.py file to make your architecture discoverable by MAX:

__init__.py
from .arch import qwen2_arch

ARCHITECTURES = [qwen2_arch]

__all__ = ["qwen2_arch", "ARCHITECTURES"]

MAX automatically loads any architectures listed in the ARCHITECTURES variable when you specify your module with the --custom-architectures flag.
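Before starting the server, you can optionally check that the package imports cleanly and exposes the registration list (run this from your project directory so the qwen2 package is on the import path):

python -c "import qwen2; print(qwen2.ARCHITECTURES)"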

Test your custom architecture

You can now test your custom architecture using the --custom-architectures flag. From your project directory, run the following command:

max serve \
--model-path Qwen/Qwen2.5-7B-Instruct \
--custom-architectures qwen2

The --model-path flag tells MAX which model to use; you can point it at a Hugging Face repo ID or a local directory containing a model. The --custom-architectures flag tells MAX to load custom architectures from the Python module we just built.

The server is ready when you see this message:

Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)

Now you can test your custom architecture. Because this architecture performs text generation, you can send a request to the chat completions endpoint. For example:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello! Can you help me with a simple task?"}
    ],
    "max_tokens": 100
  }'
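Because the server exposes an OpenAI-compatible API, you can also call it from Python with the openai client instead of curl (this assumes you've added the openai package to your environment, for example with pixi add openai):

from openai import OpenAI

# Point the client at the local MAX server. The client requires an API key,
# but a placeholder value is typically fine for a local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "Hello! Can you help me with a simple task?"}
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)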

Next steps

Congratulations! You've successfully created a custom architecture for MAX pipelines and served it with the max serve command.

While this tutorial showed the simplified approach of extending an existing architecture, you may need to implement a model from scratch if your architecture differs significantly from MAX's built-in models. In that case, you would:

  1. Implement the full PipelineModel interface including execute, prepare_initial_token_inputs, and prepare_next_token_inputs methods.
  2. Create custom configuration classes to handle model parameters.
  3. Write custom weight adapters for converting between different formats.
  4. Build the computation graph using MAX's graph API.

For implementation details, explore the existing supported model architectures on GitHub. Each subdirectory represents a different model family with its own implementation. You can examine these architectures to understand different approaches and find the best base for your custom architecture.
