Quickstart
In this quickstart guide, you'll learn how to install Modular in a Python environment and run inference with a GenAI model. We'll first use our Python API to run offline inference, then start a local endpoint and use the OpenAI Python API to send inference requests.
System requirements: Mac, Linux, WSL, or Docker.
Set up your project
First, install the max CLI and Python library with your preferred package manager: pip, uv, or magic.
pip:

- Create a project folder:

  mkdir quickstart && cd quickstart

- Create and activate a virtual environment:

  python3 -m venv .venv/quickstart \
    && source .venv/quickstart/bin/activate

- Install the modular Python package:

  Nightly:

  pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://dl.modular.com/public/nightly/python/simple/

  Stable:

  pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu
uv:

- Install uv:

  curl -LsSf https://astral.sh/uv/install.sh | sh

  Then restart your terminal to make uv accessible.

- Create a project:

  uv init quickstart && cd quickstart

- Create and start a virtual environment:

  uv venv && source .venv/bin/activate

- Install the modular Python package:

  Nightly:

  uv pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://dl.modular.com/public/nightly/python/simple/ \
    --index-strategy unsafe-best-match

  Stable:

  uv pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    --index-strategy unsafe-best-match
magic:

- Install magic:

  curl -ssL https://magic.modular.com/ | bash

  Then run the source command that's printed in your terminal.

- Create a project:

  magic init quickstart --format pyproject && cd quickstart

- Install the max-pipelines conda package:

  Nightly:

  magic add max-pipelines

  Stable:

  magic add "max-pipelines==25.3"

- Start the virtual environment:

  magic shell
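To confirm the install worked, you can try importing the library from inside your active virtual environment. This is just an optional sanity check; the import path is the same one used in the example below:

python -c "from max.entrypoints.llm import LLM; print('MAX import OK')"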
Run offline inference
You can run inference locally with the max Python API. Just specify the Hugging Face model you want, then generate results for one or more prompts.

In this example, we use a Llama 3.1 model that's not gated on Hugging Face, so you don't need an access token. Save the following code as offline-inference.py:
from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig

def main():
    model_path = "modularai/Llama-3.1-8B-Instruct-GGUF"
    pipeline_config = PipelineConfig(model_path=model_path)
    llm = LLM(pipeline_config)

    prompts = [
        "In the beginning, there was",
        "I believe the meaning of life is",
        "The fastest way to learn python is",
    ]

    print("Generating responses...")
    responses = llm.generate(prompts, max_new_tokens=50)

    for i, (prompt, response) in enumerate(zip(prompts, responses)):
        print(f"========== Response {i} ==========")
        print(prompt + response)
        print()

if __name__ == "__main__":
    main()
Run it and you should see a response similar to this:
python offline-inference.py
========== Response 0 ==========
In the beginning, there was Andromeda. The Andromeda galaxy, that is. It's the closest major galaxy to our own Milky Way, and it's been a source of fascination for astronomers and space enthusiasts for centuries. But what if I told you that there's
========== Response 1 ==========
I believe the meaning of life is to find your gift. The purpose of life is to give it away to others.
I believe that the meaning of life is to find your gift. The purpose of life is to give it away to others.
I believe that the meaning of life is
========== Response 2 ==========
The fastest way to learn python is to practice with real-world projects. Here are some ideas for projects that you can use to learn Python:
1. **Command Line Calculator**: Create a command line calculator that can perform basic arithmetic operations like addition, subtraction, multiplication, and division.
More information about this API is available in the offline inference guide.
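As a small variation, here's a sketch of the same offline workflow that reads the prompt from the command line. The LLM, PipelineConfig, and generate() calls are exactly the ones shown above; the argument parsing is illustrative only and not part of the MAX API:

import argparse

from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig

def main():
    # Illustrative CLI wrapper around the same generate() call used above.
    parser = argparse.ArgumentParser(description="Generate text with a MAX pipeline")
    parser.add_argument("prompt", help="Prompt text to complete")
    parser.add_argument("--max-new-tokens", type=int, default=50)
    args = parser.parse_args()

    llm = LLM(PipelineConfig(model_path="modularai/Llama-3.1-8B-Instruct-GGUF"))
    # generate() takes a list of prompts and returns one response per prompt.
    responses = llm.generate([args.prompt], max_new_tokens=args.max_new_tokens)
    print(args.prompt + responses[0])

if __name__ == "__main__":
    main()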
Run inference with an endpoint
Now let's start a local server that runs the model using an OpenAI-compatible endpoint:
- Install the openai client library with the same tool you used for setup:

  pip: pip install openai

  uv: uv add openai

  magic: magic add openai
- Start the endpoint with the max CLI:

  max serve --model-path=modularai/Llama-3.1-8B-Instruct-GGUF
- Create a new file named generate-text.py that sends an inference request:

  from openai import OpenAI

  client = OpenAI(
      base_url="http://0.0.0.0:8000/v1",
      api_key="EMPTY",
  )

  completion = client.chat.completions.create(
      model="modularai/Llama-3.1-8B-Instruct-GGUF",
      messages=[
          {
              "role": "user",
              "content": "Who won the world series in 2020?",
          },
      ],
  )

  print(completion.choices[0].message.content)

  Notice that the OpenAI API requires the api_key argument, but our endpoint doesn't use it.
- Run it and you should see results like this:

  python generate-text.py

  The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in the series 4 games to 2. This was the Dodgers' first World Series title since 1988.
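If you want tokens to appear as they're generated, the same client can stream the response. This is a minimal sketch using the standard OpenAI streaming interface, assuming the endpoint you started above supports streamed chat completions:

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[{"role": "user", "content": "Who won the world series in 2020?"}],
    stream=True,  # request incremental chunks instead of one final message
)

for chunk in stream:
    # Each chunk's delta holds newly generated text; it can be None on the final chunk.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()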
That's it. You just served Llama 3.1 on your local CPU and ran inference using our OpenAI-compatible Serve API. You can also deploy the same endpoint to a cloud GPU using our Docker container.

To run a different model, change the --model-path to something else from our model repository.
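Because the endpoint speaks the OpenAI chat completions protocol, you can also query it without any client library. Here's a sketch using curl against the standard /v1/chat/completions route, assuming the server from the steps above is still running on port 8000:

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'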
Keep going
There's still a lot more to learn. Here are some directions you can go:
Docs

- Serving: Try more serving features like function calling, tool use, structured output, and more.
- Deploying: Try a tutorial to deploy a model on a cloud GPU using our Docker container.
- Developing: Discover all the ways you can customize your AI deployments, such as writing custom ops and GPU kernels in Mojo.
- Mojo manual: Learn to program in Mojo, a Pythonic systems programming language that allows you to write code for both CPUs and GPUs.
Resources

- Stay in touch: Stay up to date with announcements and releases. We're moving fast over here.
- Talk to an AI Expert: Connect with our product experts to explore how we can help you deploy and serve AI models with high performance, scalability, and cost-efficiency.