Quickstart

In this quickstart guide, you'll learn how to install Modular in a Python environment and run inference with a GenAI model. We'll first use our Python API to run offline inference, then start a local endpoint and use the OpenAI Python API to send inference requests.

Set up your project

First, install the max CLI and Python library:

  1. Create a project folder:
    mkdir modular && cd modular
  2. Create and activate a virtual environment:
    python3 -m venv .venv/modular \
    && source .venv/modular/bin/activate
  3. Install the modular Python package:
    pip install modular \
    --index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://dl.modular.com/public/nightly/python/simple/
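
To confirm the installation, you can print the installed package's metadata (pip show is a standard pip command; the version you see will vary with the nightly build):

pip show modular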

Run offline inference

You can run inference locally with the max Python API. Just specify the Hugging Face model you want and then generate results with one or more prompts.

In this example, we use a Llama 3.1 model that's not gated on Hugging Face, so you don't need an access token:

offline-inference.py
from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig
from max.serve.config import Settings


def main():
    model_path = "modularai/Llama-3.1-8B-Instruct-GGUF"
    pipeline_config = PipelineConfig(model_path=model_path)
    settings = Settings()
    llm = LLM(settings, pipeline_config)

    prompts = [
        "In the beginning, there was",
        "I believe the meaning of life is",
        "The fastest way to learn python is",
    ]

    print("Generating responses...")
    responses = llm.generate(prompts, max_new_tokens=50)
    for i, (prompt, response) in enumerate(zip(prompts, responses)):
        print(f"========== Response {i} ==========")
        print(prompt + response)
        print()


if __name__ == "__main__":
    main()

Run it and you should see a response similar to this:

python offline-inference.py
========== Response 0 ==========
In the beginning, there was Andromeda. The Andromeda galaxy, that is. It's the closest major galaxy to our own Milky Way, and it's been a source of fascination for astronomers and space enthusiasts for centuries. But what if I told you that there's

========== Response 1 ==========
I believe the meaning of life is to find your gift. The purpose of life is to give it away to others.
I believe that the meaning of life is to find your gift. The purpose of life is to give it away to others.
I believe that the meaning of life is

========== Response 2 ==========
The fastest way to learn python is to practice with real-world projects. Here are some ideas for projects that you can use to learn Python:

1. **Command Line Calculator**: Create a command line calculator that can perform basic arithmetic operations like addition, subtraction, multiplication, and division.

More information about this API is available in the offline inference guide.

Run inference with an endpoint

Now let's start a local server that runs the model using an OpenAI-compatible endpoint:

  1. Install the openai client library:

    pip install openai
  2. Start the endpoint with the max CLI:

    max serve --model-path=modularai/Llama-3.1-8B-Instruct-GGUF
  3. Create a new file that sends an inference request:

    generate-text.py
    from openai import OpenAI

    client = OpenAI(
        base_url="http://0.0.0.0:8000/v1",
        api_key="EMPTY",
    )

    completion = client.chat.completions.create(
        model="modularai/Llama-3.1-8B-Instruct-GGUF",
        messages=[
            {
                "role": "user",
                "content": "Who won the world series in 2020?",
            },
        ],
    )

    print(completion.choices[0].message.content)

    Notice that the OpenAI API requires the api_key argument, but our endpoint doesn't use it.

  4. Run it and you should see results like this:

    python generate-text.py
    The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in the series 4 games to 2. This was the Dodgers' first World Series title since 1988.

That's it. You just served Llama 3.1 on your local CPU and ran inference using our OpenAI-compatible Serve API.
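
Because the endpoint follows the OpenAI protocol, you can also stream tokens as they arrive instead of waiting for the full completion. Here's a minimal sketch reusing the client setup from above (stream=True is standard in the openai library; this assumes the local endpoint supports streamed responses, as most OpenAI-compatible servers do):

stream-text.py
from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:8000/v1",
    api_key="EMPTY",
)

# Request a streamed completion; chunks arrive as the model generates them.
stream = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries a small delta of the response text.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()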

You can also deploy the same endpoint to a cloud GPU using our Docker container.
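
As a rough sketch of what that deployment looks like (the image name below is a placeholder and the invocation is an assumption; see the container documentation for the exact image and arguments), it's a standard docker run that publishes the port and passes the GPUs through:

# Placeholder image name; substitute the image from Modular's container docs.
docker run --gpus all -p 8000:8000 \
  <max-container-image> \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF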

To run a different model, change the --model-path to something else from our model repository.
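
For example, to serve a different ungated model (the repo shown here is illustrative; it assumes the model appears in the supported-models list):

max serve --model-path=Qwen/Qwen2.5-1.5B-Instruct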

Try a tutorial

For a more detailed walkthrough of how to build and deploy with MAX, check out these tutorials.