Quickstart
In this quickstart guide, you'll learn how to install Modular in a Python environment and run inference with a GenAI model. We'll first use our Python API to run offline inference, then start a local endpoint and use the OpenAI Python API to send inference requests.
System requirements:

- Mac
- Linux
- WSL
- Docker
Set up your project
First, install the `max` CLI and Python library, using pip, uv, or magic:
**pip**

- Create a project folder:

  ```sh
  mkdir modular && cd modular
  ```

- Create and activate a virtual environment:

  ```sh
  python3 -m venv .venv/modular \
    && source .venv/modular/bin/activate
  ```

- Install the `modular` Python package:

  Nightly:

  ```sh
  pip install modular \
    --index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://dl.modular.com/public/nightly/python/simple/
  ```

  Stable:

  ```sh
  pip install modular \
    --index-url https://download.pytorch.org/whl/cpu
  ```
**uv**

- Install `uv`:

  ```sh
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

  Then restart your terminal to make `uv` accessible.

- Create a project:

  ```sh
  uv init modular && cd modular
  ```

- Create and start a virtual environment:

  ```sh
  uv venv && source .venv/bin/activate
  ```

- Install the `modular` Python package:

  Nightly:

  ```sh
  uv pip install modular \
    --index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://dl.modular.com/public/nightly/python/simple/
  ```

  Stable:

  ```sh
  uv pip install modular \
    --index-url https://download.pytorch.org/whl/cpu
  ```
**magic**

- Install `magic`:

  ```sh
  curl -ssL https://magic.modular.com/ | bash
  ```

  Then run the `source` command that's printed in your terminal.

- Create a project:

  ```sh
  magic init modular --format pyproject && cd modular
  ```

- Install the `max-pipelines` conda package:

  Nightly:

  ```sh
  magic add max-pipelines
  ```

  Stable:

  ```sh
  magic add "max-pipelines==25.3"
  ```

- Start the virtual environment:

  ```sh
  magic shell
  ```
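Whichever package manager you chose, you can optionally confirm the install by importing the MAX classes this guide uses. This is just a quick sanity check, and the filename `check-install.py` is a suggestion, not part of the official setup:

```python
# check-install.py -- optional sanity check (hypothetical filename).
# If these imports succeed, the modular package is installed in the
# active virtual environment; they're the same classes used in the
# offline inference example below.
from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig
from max.serve.config import Settings

print("MAX imports OK:", LLM.__name__, PipelineConfig.__name__, Settings.__name__)
```

Run it with `python check-install.py`; an ImportError means the package isn't visible in the environment you activated.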
Run offline inference
You can run inference locally with the MAX Python API: specify the Hugging Face model you want, then generate results for one or more prompts.

In this example, we use a Llama 3.1 model that's not gated on Hugging Face, so you don't need an access token. Save the following as offline-inference.py:
```python
from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig
from max.serve.config import Settings


def main():
    # Any compatible Hugging Face model; this one is not gated.
    model_path = "modularai/Llama-3.1-8B-Instruct-GGUF"
    pipeline_config = PipelineConfig(model_path=model_path)
    settings = Settings()
    llm = LLM(settings, pipeline_config)

    prompts = [
        "In the beginning, there was",
        "I believe the meaning of life is",
        "The fastest way to learn python is",
    ]

    print("Generating responses...")
    responses = llm.generate(prompts, max_new_tokens=50)

    # Responses come back in the same order as the prompts.
    for i, (prompt, response) in enumerate(zip(prompts, responses)):
        print(f"========== Response {i} ==========")
        print(prompt + response)
        print()


if __name__ == "__main__":
    main()
```
Run it and you should see a response similar to this:

```sh
python offline-inference.py
```
```
========== Response 0 ==========
In the beginning, there was Andromeda. The Andromeda galaxy, that is. It's the closest major galaxy to our own Milky Way, and it's been a source of fascination for astronomers and space enthusiasts for centuries. But what if I told you that there's
========== Response 1 ==========
I believe the meaning of life is to find your gift. The purpose of life is to give it away to others.
I believe that the meaning of life is to find your gift. The purpose of life is to give it away to others.
I believe that the meaning of life is
========== Response 2 ==========
The fastest way to learn python is to practice with real-world projects. Here are some ideas for projects that you can use to learn Python:
1. **Command Line Calculator**: Create a command line calculator that can perform basic arithmetic operations like addition, subtraction, multiplication, and division.
```
More information about this API is available in the offline inference guide.
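In practice you'll often want to keep one LLM object around and call generate() on it more than once, rather than reloading the model for every batch. Here's a minimal sketch under that assumption (reuse across calls isn't confirmed by this quickstart); it uses only the imports and calls shown above, and the filename is hypothetical:

```python
# reuse-llm.py -- hypothetical filename. A sketch that assumes one LLM
# instance can serve several generate() calls; only the API calls from
# offline-inference.py above are used.
from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig
from max.serve.config import Settings


def main():
    pipeline_config = PipelineConfig(
        model_path="modularai/Llama-3.1-8B-Instruct-GGUF"
    )
    llm = LLM(Settings(), pipeline_config)

    # Two separate batches, served by the same LLM instance.
    batches = [
        ["Summarize the rules of tic-tac-toe in one sentence."],
        ["List three uses for a paperclip."],
    ]
    for batch in batches:
        for response in llm.generate(batch, max_new_tokens=40):
            print(response)
            print("-" * 40)


if __name__ == "__main__":
    main()
```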
Run inference with an endpoint
Now let's start a local server that runs the model using an OpenAI-compatible endpoint:
- Install the `openai` client library, using the same package manager as before:

  pip:

  ```sh
  pip install openai
  ```

  uv:

  ```sh
  uv add openai
  ```

  magic:

  ```sh
  magic add openai
  ```
- Start the endpoint with the `max` CLI:

  ```sh
  max serve --model-path=modularai/Llama-3.1-8B-Instruct-GGUF
  ```
- Create a new file named generate-text.py that sends an inference request:

  ```python
  from openai import OpenAI

  # Point the client at the local MAX endpoint started above.
  client = OpenAI(
      base_url="http://0.0.0.0:8000/v1",
      api_key="EMPTY",
  )

  completion = client.chat.completions.create(
      model="modularai/Llama-3.1-8B-Instruct-GGUF",
      messages=[
          {
              "role": "user",
              "content": "Who won the world series in 2020?",
          },
      ],
  )

  print(completion.choices[0].message.content)
  ```

  Notice that the OpenAI API requires the `api_key` argument, but our endpoint doesn't use it.
- Run it and you should see results like this:

  ```sh
  python generate-text.py
  ```

  ```
  The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in the series 4 games to 2. This was the Dodgers' first World Series title since 1988.
  ```
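Because the endpoint speaks the OpenAI chat completions protocol, you can also try streaming tokens as they're generated. The `stream=True` flag is standard in the `openai` client; whether the local endpoint supports streaming is an assumption here, so treat this as a sketch rather than a guaranteed feature:

```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

# stream=True asks the server to send back chunks as they are generated,
# following the standard OpenAI streaming protocol. This assumes the local
# endpoint supports streaming; if it doesn't, use the non-streaming
# example above instead.
stream = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries a small delta of the response text.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```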
That's it. You just served Llama 3.1 on your local CPU and ran inference using our OpenAI-compatible Serve API.

You can also deploy the same endpoint to a cloud GPU using our Docker container.

To run a different model, change the `--model-path` value to another model from our model repository.
Stay in touch
Get the latest updates
Stay up to date with announcements and releases. We're moving fast over here.
Talk to an AI Expert
Connect with our product experts to explore how we can help you deploy and serve AI models with high performance, scalability, and cost-efficiency.
Try a tutorial
For a more detailed walkthrough of how to build and deploy with MAX, check out these tutorials.