
Quickstart

A major component of the Modular Platform is MAX, our developer framework that abstracts away the complexity of building and serving high-performance GenAI models on a wide range of hardware, including NVIDIA and AMD GPUs.

In this quickstart, you'll create an endpoint for an open-source LLM using MAX, send it an inference request from a Python client, and then benchmark the endpoint.

If you'd rather create an endpoint with Docker, see our tutorial to benchmark MAX.

Set up your project

First, install the max CLI that you'll use to start the model endpoint.

  1. If you don't have it, install pixi:
    curl -fsSL https://pixi.sh/install.sh | sh

    Then restart your terminal for the changes to take effect.

  2. Create a project:
    pixi init quickstart \
      -c https://conda.modular.com/max-nightly/ -c conda-forge \
      && cd quickstart
  3. Install the modular conda package:
    pixi add modular
  4. Start the virtual environment:
    pixi shell

Start a model endpoint

Now you'll serve an LLM from a local endpoint using max serve.

MAX can serve models for text-to-text or image-to-text (multimodal) inference, in a range of sizes. We use a single model here to keep things simple, but you can explore more models in our model repository.

This quickstart serves Google's Gemma 3 27B model, which requires more than 60 GiB of GPU RAM; we suggest an H100, MI300, or better. Gemma 3 models are multimodal, but MAX currently supports text input only for Gemma 3. The family comes in several sizes, and all of them require a compatible GPU.

Start the endpoint with the max CLI:

  1. Add your Hugging Face access token as an environment variable:

    export HF_TOKEN="hf_..."
  2. Agree to the Gemma 3 license.
  3. Start the endpoint:

    max serve --model google/gemma-3-27b-it

It will take some time to download the model, compile it, and start the server. While that's working, you can get started on the next step.
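While you wait, you can confirm when the endpoint is up by polling it from another terminal. Here's a minimal sketch using only the Python standard library; it assumes the server is listening on the default port 8000 and exposes the OpenAI-compatible /v1/models route (the file name check-server.py is just a suggestion):

check-server.py
import json
import urllib.request

# Default address that `max serve` listens on (assumes you didn't change the port).
URL = "http://localhost:8000/v1/models"

try:
    with urllib.request.urlopen(URL, timeout=5) as resp:
        models = json.load(resp)
    print("Server is up. Models reported by the endpoint:")
    for entry in models.get("data", []):
        print(" -", entry.get("id"))
except OSError as err:
    # Connection refused or timed out: the server isn't ready yet.
    print("Server not ready yet:", err)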

Run inference with the endpoint

Open a new terminal and send an inference request using the openai Python package:

  1. Navigate to the project you created above and then install the openai package:

    pixi add openai
  2. Activate the virtual environment:

    pixi shell
  3. Create a new file that sends an inference request:

    generate-text.py
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    completion = client.chat.completions.create(
        model="google/gemma-3-27b-it",
        messages=[
            {
                "role": "user",
                "content": "Who won the world series in 2020?"
            },
        ],
    )

    print(completion.choices[0].message.content)

    Notice that the OpenAI client requires the api_key argument, but MAX doesn't check it, so any placeholder value works.

  4. Wait until the model server is ready—when it is, you'll see this message in your first terminal:

    🚀 Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)

    Then run the Python script from your second terminal, and you should see results like this (your results may vary, especially for different model sizes):

    python generate-text.py
    The **Los Angeles Dodgers** won the World Series in 2020!
    
    They defeated the Tampa Bay Rays 4 games to 2. It was their first World Series title since 1988.
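The script above waits for the complete answer before printing anything. If you'd rather print tokens as they arrive, you can request a streamed response instead. This is a minimal sketch (the file name stream-text.py is our own) and assumes the endpoint supports streamed chat completions, as OpenAI-compatible servers generally do:

stream-text.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# stream=True returns an iterator of chunks instead of a single completion.
stream = client.chat.completions.create(
    model="google/gemma-3-27b-it",
    messages=[
        {"role": "user", "content": "Write a haiku about GPUs."}
    ],
    stream=True,
)

for chunk in stream:
    # Each chunk carries a small delta of generated text; some chunks may be empty.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()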

Benchmark the endpoint

While still in your second terminal, run the following command to benchmark your endpoint:

max benchmark \
  --model google/gemma-3-27b-it \
  --backend modular \
  --endpoint /v1/chat/completions \
  --dataset-name sonnet \
  --num-prompts 500 \
  --sonnet-input-len 550 \
  --output-lengths 256 \
  --sonnet-prefix-len 200

When it's done, you'll see the results printed to the terminal.

If you want to save the results, add the --save-result flag and it'll save a JSON file in the current directory. You can specify the file name with --result-filename and change the directory with --result-dir. For example:

max benchmark \
  ...
  --save-result \
  --result-filename "quickstart-benchmark.json" \
  --result-dir "results"
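Once you've saved a result, you can skim the metrics without scrolling back through terminal output. This sketch (inspect-benchmark.py is just an illustrative name) loads the JSON produced by the example above and prints its top-level scalar fields, so it doesn't depend on the exact result schema:

inspect-benchmark.py
import json
from pathlib import Path

# Path matches the --result-dir and --result-filename values used above.
results = json.loads(Path("results/quickstart-benchmark.json").read_text())

# Benchmark output fields vary by version, so just print the scalar metrics.
for key, value in results.items():
    if isinstance(value, (int, float, str, bool)):
        print(f"{key}: {value}")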

The benchmark options above are just a starting point. When you want to save your own benchmark configurations, you can define them in a YAML file and pass it to the --config-file option. For example configurations, see our benchmark config files on GitHub.

For more details about the tool, including other datasets and configuration options, see the max benchmark documentation.

Next steps

Now that you have an endpoint, connect to it with our GenAI Cookbook, an open-source project for building React-based interfaces for any model endpoint. Just clone the repo, run it with npm, and pick a recipe, such as a chat interface or a drag-and-drop image caption tool, or build your own.

To get started, see the project README.

Stay in touch

If you have any issues or want to share your experience, reach out on the Modular Forum or Discord.
