Offline inference

Offline inference with MAX allows you to run large language models directly in Python without relying on external API endpoints. This is in contrast to online inference, where you would send requests to a remote service.

When to use offline inference

You'll want to use offline inference in scenarios where you want to perform model inference without the need for a separate model inference server. Typically this includes where you have to process a batch of inputs concurrently.

This approach is beneficial for tasks that require high throughput and can be executed in a controlled environment, such as data preprocessing, model evaluation, or when working with large datasets that need to be processed in batches.

How offline inference works

The core of offline inference revolves around the the LLM class which provides a Python interface to load and run language models.

Specify the model from a Hugging Face repository or a local path and MAX handles the process of downloading the model. The PipelineConfig class allows you to specify parameters related to the inference pipeline, such as max_length and max_num_steps. The generate() function is used to generate text from the model.

The Python API for offline inference currently supports text-only input and does not support multi-modal models. If you need to work with vision capabilities, see the tutorial on Generate image descriptions with Llama 3.2 Vision.

Quickstart

This quickstart demonstrates how to use offline inference using a Hugging Face model with MAX in Python.

Set up your project:

pixi
uv
pip
conda

If you don't have it, install pixi:

curl -fsSL https://pixi.sh/install.sh | sh
curl -fsSL https://pixi.sh/install.sh | sh

Then restart your terminal for the changes to take effect.

Create a project:

pixi init example-project \
  -c https://conda.modular.com/max-nightly/ -c conda-forge \
  && cd example-project
pixi init example-project \
  -c https://conda.modular.com/max-nightly/ -c conda-forge \
  && cd example-project

Install the modular conda package:

Nightly
Stable

pixi add modular
pixi add modular

pixi add "modular=25.4"
pixi add "modular=25.4"

Start the virtual environment:
```
pixi shell
```
```
pixi shell
```

If you don't have it, install uv:

curl -LsSf https://astral.sh/uv/install.sh | sh
curl -LsSf https://astral.sh/uv/install.sh | sh

Then restart your terminal to make uv accessible.

Create a project:

uv init example-project && cd example-project
uv init example-project && cd example-project

Create and start a virtual environment:

uv venv && source .venv/bin/activate
uv venv && source .venv/bin/activate

Install the modular Python package:

Nightly
Stable

uv pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --index-url https://dl.modular.com/public/nightly/python/simple/ \
  --index-strategy unsafe-best-match --prerelease allow
uv pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --index-url https://dl.modular.com/public/nightly/python/simple/ \
  --index-strategy unsafe-best-match --prerelease allow

uv pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --extra-index-url https://modular.gateway.scarf.sh/simple/ \
  --index-strategy unsafe-best-match
uv pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --extra-index-url https://modular.gateway.scarf.sh/simple/ \
  --index-strategy unsafe-best-match

Create a project folder:

mkdir example-project && cd example-project
mkdir example-project && cd example-project

Create and activate a virtual environment:

python3 -m venv .venv/example-project \
  && source .venv/example-project/bin/activate
python3 -m venv .venv/example-project \
  && source .venv/example-project/bin/activate

Install the modular Python package:

Nightly
Stable

pip install --pre modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --index-url https://dl.modular.com/public/nightly/python/simple/
pip install --pre modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --index-url https://dl.modular.com/public/nightly/python/simple/

pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --extra-index-url https://modular.gateway.scarf.sh/simple/
pip install modular \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  --extra-index-url https://modular.gateway.scarf.sh/simple/

If you don't have it, install conda. A common choice is with brew:
```
brew install miniconda
```
```
brew install miniconda
```
Initialize conda for shell interaction:
```
conda init
```
```
conda init
```
If you're on a Mac, instead use:
```
conda init zsh
```
```
conda init zsh
```
Then restart your terminal for the changes to take effect.

Create a project:

conda create -n example-project
conda create -n example-project

Start the virtual environment:

conda activate example-project

conda activate example-project

Install the modular conda package:

Nightly
Stable

conda install -c conda-forge -c https://conda.modular.com/max-nightly/ modular
conda install -c conda-forge -c https://conda.modular.com/max-nightly/ modular

conda install -c conda-forge -c https://conda.modular.com/max/ modular
conda install -c conda-forge -c https://conda.modular.com/max/ modular

Create a file named main.py with the following code:

from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig


def main():
    model_path = "modularai/Llama-3.1-8B-Instruct-GGUF"
    pipeline_config = PipelineConfig(model_path=model_path)
    llm = LLM(pipeline_config)

    prompts = [
        "In the beginning, there was",
        "I believe the meaning of life is",
        "The fastest way to learn python is",
    ]

    print("Generating responses...")
    responses = llm.generate(prompts, max_new_tokens=50)
    for i, (prompt, response) in enumerate(zip(prompts, responses)):
        print(f"========== Response {i} ==========")
        print(prompt + response)
        print()


if __name__ == "__main__":
    main()
from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig


def main():
    model_path = "modularai/Llama-3.1-8B-Instruct-GGUF"
    pipeline_config = PipelineConfig(model_path=model_path)
    llm = LLM(pipeline_config)

    prompts = [
        "In the beginning, there was",
        "I believe the meaning of life is",
        "The fastest way to learn python is",
    ]

    print("Generating responses...")
    responses = llm.generate(prompts, max_new_tokens=50)
    for i, (prompt, response) in enumerate(zip(prompts, responses)):
        print(f"========== Response {i} ==========")
        print(prompt + response)
        print()


if __name__ == "__main__":
    main()

This script downloads the modularai/Llama-3.1-8B-Instruct-GGUF model (if not already downloaded) and then run inference locally. While the initial model download requires internet access, the actual inference process is self-contained and does not send requests to a remote service for generating text.

You can update the script to use a different model or modify the prompts to generate different responses. For a list of available models, see our Model repository. We chose the Llama-3.1-8B-Instruct-GGUF model for this example because it's not gated, meaning it's freely available without requiring special access permissions or authentication.

For offline inference, MAX supports models in GGUF format. This includes most generative LLMs with "Chat" modality, but the specific configuration parameters might vary between models. Always refer to the model's documentation for compatibility details and optimal configuration settings.

Run the script:

python main.py

python main.py

This command will download the model and generate responses for the prompts.

You should see output like the following:

Generating responses...
========== Response 0 ==========
In the beginning, there was Andromeda. The Andromeda galaxy, that is. It's the
closest major galaxy to our own Milky Way, and it's been a source of fascination
for astronomers and space enthusiasts for centuries. But what if I told you that
there's

========== Response 1 ==========
I believe the meaning of life is to find your gift. The purpose of life is to give it away to others.
I believe that the meaning of life is to find your gift. The purpose of life is to give it away to others.
I believe that the meaning of life is

========== Response 2 ==========
The fastest way to learn python is to practice with real-world projects. Here are
some ideas for projects that you can use to learn Python:

1. **Command Line Calculator**: Create a command line calculator that can perform
basic arithmetic operations like addition, subtraction, multiplication, and
division. 
Generating responses...
========== Response 0 ==========
In the beginning, there was Andromeda. The Andromeda galaxy, that is. It's the
closest major galaxy to our own Milky Way, and it's been a source of fascination
for astronomers and space enthusiasts for centuries. But what if I told you that
there's

========== Response 1 ==========
I believe the meaning of life is to find your gift. The purpose of life is to give it away to others.
I believe that the meaning of life is to find your gift. The purpose of life is to give it away to others.
I believe that the meaning of life is

========== Response 2 ==========
The fastest way to learn python is to practice with real-world projects. Here are
some ideas for projects that you can use to learn Python:

1. **Command Line Calculator**: Create a command line calculator that can perform
basic arithmetic operations like addition, subtraction, multiplication, and
division. 

Next steps

For more information on offline inference, see the following:

When to use offline inference​

How offline inference works​

Quickstart​

Next steps​

When to use offline inference

How offline inference works

Quickstart

Next steps