Scale your GenAI deployments

The Modular Platform provides dedicated endpoints and enterprise-grade scaling for inference workloads. This scaling logic is powered by Mammoth, a Kubernetes-native distributed AI serving tool that makes it easier to run and manage LLMs at scale using MAX as a backend for optimal model performance. It's designed to maximize hardware efficiency with minimal configuration, even when running multiple models across thousands of nodes.

Figure 1. A simplified diagram of how the Modular Platform scales your GenAI deployment.

The Mammoth control plane automatically selects the best available hardware to meet performance targets when deploying a model and supports both manual and automatic scaling. Mammoth's built-in orchestrator intelligently routes traffic, taking into account hardware load, GPU memory, and KV cache state. You can deploy and serve multiple models simultaneously across different hardware types or versions without complex setup or duplication of infrastructure.

Access to Mammoth

If you need to serve one or more LLMs at scale with high performance and minimal operational overhead, you can do so with Modular's Dedicated Endpoint or Enterprise editions, which use Mammoth to power routing and scaling capabilities.

Mammoth makes a difference when:

  • You're running inference across heterogeneous GPU clusters (NVIDIA and AMD) and need optimized, vendor-agnostic orchestration.
  • You want a self-hosted, low-configuration deployment experience that works out of the box, regardless of hardware or cloud provider.
  • You need to dynamically scale workloads based on traffic and resource availability, with fine-grained control over model placement and scheduling.
  • You're managing fleets of models and want a unified serving layer without duplicating infrastructure.
  • You're working in a Kubernetes environment and want native integration that's easy to operate and extend.
  • You want to optimize total cost of ownership with cluster-level efficiency features like disaggregated inference and KV cache-aware routing.

Additionally, because Mammoth is built on the MAX framework, you can use its APIs and tools to customize and optimize every layer of the stack, from high-level orchestration down to GPU kernels written in Mojo.

How Mammoth works

Mammoth consists of a lightweight control plane, an intelligent orchestrator, and advanced optimizations such as disaggregated inference, all working together to efficiently deploy and run models across diverse hardware environments.

Figure 2. An overview of the Mammoth components, including the control plane, orchestrator, and disaggregated inference on separate prefill and decode nodes.

At the heart of Mammoth is its control plane, which takes care of setting up, running, and scaling models automatically. Just provide the model ID (such as modularai/Llama-3.1-8B-Instruct) or a path to the model on an external storage provider like S3, and the control plane handles the rest.

You can interact with the control plane for:

  • Model deployment: Launch models with a single command.
  • Model management: Modify or delete deployed models.
  • Multi-model orchestration: Run multiple models efficiently across shared infrastructure.
  • Scaling: Adjust replicas manually or let Mammoth autoscale intelligently.
  • Resource allocation: Automatically allocate GPU resources to each model deployment.

The Mammoth control plane extends the Kubernetes API with custom resource definitions (CRDs) and controls those resources with an operator. When you create, update, or delete a resource, the control plane provisions infrastructure, deploys or reconfigures models, and cleans up resources as needed.
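As a sketch of what interacting with such a resource might look like, the snippet below creates a hypothetical Mammoth model deployment with the official Kubernetes Python client. The API group, version, kind, and spec fields are illustrative assumptions, not Mammoth's published CRD schema.

```python
# Minimal sketch: creating a hypothetical Mammoth model-deployment custom
# resource. Group, version, kind, and field names are assumptions for
# illustration only.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
api = client.CustomObjectsApi()

deployment = {
    "apiVersion": "example.modular.com/v1alpha1",   # hypothetical API group/version
    "kind": "ModelDeployment",                       # hypothetical kind
    "metadata": {"name": "llama-3-1-8b"},
    "spec": {
        "model": "modularai/Llama-3.1-8B-Instruct",  # model ID, as described above
        "replicas": 2,                               # initial replica count
    },
}

# The operator watches resources like this and reconciles cluster state:
# provisioning infrastructure, deploying the model, and cleaning up on delete.
api.create_namespaced_custom_object(
    group="example.modular.com",
    version="v1alpha1",
    namespace="default",
    plural="modeldeployments",
    body=deployment,
)
```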

Deploy models

With Mammoth running behind the scenes, deploying models in Modular's Dedicated Endpoint and Enterprise editions is designed to be simple. You choose the model you want to serve and define your resource requirements, and Mammoth's control plane takes care of the rest. It automatically discovers available NVIDIA or AMD GPUs, schedules the workload across the cluster, and scales as needed.

Whether you're serving a single large model or multiple models at once, Mammoth handles orchestration and optimization so you can focus on your application rather than infrastructure.
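Because MAX serves an OpenAI-compatible API, sending requests to a deployed model looks like any other chat-completions call. A minimal sketch, assuming a placeholder endpoint URL and API key for your own deployment:

```python
# Minimal sketch: querying a deployed endpoint with the standard OpenAI client.
# The base URL and API key are placeholders, not real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.example.com/v1",  # placeholder endpoint URL
    api_key="YOUR_API_KEY",                            # placeholder credential
)

response = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize what Mammoth does."}],
)
print(response.choices[0].message.content)
```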

Scale deployments

The control plane adjusts the deployment to the desired number of replicas and allocates resources accordingly. For production use, intelligent autoscaling is built in and configurable.
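One way to picture manual scaling is patching the replica count on the same hypothetical custom resource from the earlier sketch; the group, plural, and field names remain assumptions rather than Mammoth's actual schema.

```python
# Minimal sketch: manually scaling a hypothetical model deployment by patching
# its replica count. Autoscaling would instead be configured in the spec.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

api.patch_namespaced_custom_object(
    group="example.modular.com",
    version="v1alpha1",
    namespace="default",
    plural="modeldeployments",
    name="llama-3-1-8b",
    body={"spec": {"replicas": 4}},  # desired replica count
)
```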

Allocate resources

You can fine-tune resource allocation for each deployment. For example, with disaggregated inference, you can assign separate GPU resources to nodes that handle prefill and decode stages independently.
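As a sketch, per-stage allocation might be expressed as separate prefill and decode sections in the hypothetical resource spec used above. The field names and values are illustrative assumptions only; the general idea is that compute-heavy prefill and memory-bound decode get sized independently.

```python
# Minimal sketch: a hypothetical spec fragment assigning separate GPU resources
# to prefill and decode stages for disaggregated inference.
spec = {
    "model": "modularai/Llama-3.1-8B-Instruct",
    "disaggregated": {
        "prefill": {"replicas": 2, "gpusPerReplica": 4},  # compute-heavy prompt processing
        "decode": {"replicas": 4, "gpusPerReplica": 1},   # memory-bound token generation
    },
}
```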

Become a design partner

Mammoth is currently available only through Modular's early access program, where we partner with select organizations as design partners. Design partners get early access to new features and share feedback to help shape the future of Mammoth.

Talk to an AI expert to learn more about how Mammoth can support your use case and help you scale with confidence.
