Mammoth
Mammoth (formerly the MAX Inference Cluster) is a Kubernetes-native distributed AI serving tool that makes it easier to run and manage LLMs at scale, using MAX as its backend for optimal model performance. Built on the Modular Platform, it's designed to use your hardware efficiently with minimal configuration, even when running multiple models across thousands of nodes.
When you deploy a model, the Mammoth control plane automatically selects the best available hardware to meet your performance targets, and it supports both manual and automatic scaling. Mammoth's built-in router intelligently distributes traffic, accounting for hardware load, GPU memory, and cache state. You can deploy and serve multiple models simultaneously across different hardware types or versions without complex setup or duplicated infrastructure.
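To make the deployment model above concrete, here is a sketch of what a declarative, Kubernetes-native configuration along these lines could look like. This is purely illustrative: the resource kind and every field name below are assumptions for the sake of the example, not Mammoth's actual API.

```yaml
# Hypothetical manifest. The "ModelDeployment" kind, API group, and all
# field names are illustrative assumptions, not Mammoth's real schema.
apiVersion: example.modular.com/v1alpha1
kind: ModelDeployment
metadata:
  name: llama-serving
spec:
  model: meta-llama/Llama-3.1-8B-Instruct
  # Let the control plane choose hardware to meet a performance target.
  performanceTarget:
    p99LatencyMs: 500
  scaling:
    mode: automatic        # or "manual" with a fixed replica count
    minReplicas: 2
    maxReplicas: 16
  # The router weighs per-replica load, GPU memory, and cache state.
  routing:
    policy: load-and-cache-aware
```

The point of the sketch is the division of responsibility: you declare the model and targets, while hardware selection, scaling, and routing are handled by the control plane.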
Become a design partner
Mammoth is currently available only through Modular's early access program, where we're actively partnering with select organizations as design partners. Design partners collaborate directly with Modular's engineering and product teams, gain early access to in-development features, and receive tailored guidance on integrating the Modular Platform into their existing generative AI workloads.
Get the latest updates
Stay up to date with announcements and releases. We're moving fast over here.
Talk to an AI Expert
Connect with our product experts to explore how we can help you deploy and serve AI models with high performance, scalability, and cost-efficiency.