
Deploy Llama 3 on GPU with MAX

Ehsan M. Kermani

This tutorial shows you how to serve Llama 3 with an OpenAI-compatible endpoint, from local testing to production deployment on major cloud platforms. You'll learn to automate the deployment process using Infrastructure-as-Code (IaC) and optimize performance with GPU resources.

MAX provides a streamlined way to deploy large language models (LLMs) with production-ready features like GPU acceleration, automatic scaling, and monitoring capabilities. Whether you're building a prototype or preparing for production deployment, this tutorial will help you set up a robust serving infrastructure for Llama 3.

And although we're using Llama 3 in these instructions, you can swap it for one of the hundreds of other LLMs from Hugging Face by browsing our model repository.

The tutorial is organized into the following sections:

  • Local setup: Run Llama 3 locally to verify its basic functionality.
  • Cloud deployment: Deploy Llama 3 to AWS, GCP, or Azure using IaC templates and CLI commands.

Local setup

In this section, you will set up and run Llama 3 locally to understand its capabilities and validate functionality before moving to the cloud. This part doesn't require a GPU because MAX can also run Llama 3 on CPUs, but we recommend using a compatible GPU for the best performance.

1. Set up your environment

Create a Python project to install our APIs and CLI tools:

  1. Create a project folder:
    mkdir llama3-tutorial && cd llama3-tutorial
  2. Create and activate a virtual environment:
    python3 -m venv .venv/llama3-tutorial \
    && source .venv/llama3-tutorial/bin/activate
  3. Install the modular Python package:
    pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://dl.modular.com/public/nightly/python/simple/

2. Serve Llama 3 locally

Next, use the max CLI tool to start an endpoint with the Llama 3 model locally, and ensure that the model runs as expected before deploying it in the cloud.

  1. Generate a response to a prompt with the following command:

    max generate --model-path=modularai/Llama-3.1-8B-Instruct-GGUF \
    --prompt "What is the meaning of life?" \
    --max-length 250
  2. Start the model server using max serve:

    max serve --model-path modularai/Llama-3.1-8B-Instruct-GGUF

    This starts a local server with an OpenAI-compatible endpoint. Next, we'll send it an inference request.

3. Test the local endpoint

The endpoint is ready when you see this message in the terminal:

Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)

Then, you can test its functionality by sending a curl request from a new terminal:

curl -N http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "modularai/Llama-3.1-8B-Instruct-GGUF",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the World Series in 2020?"}
]
}' | jq -r '.choices[].message.content'

You should see output like this:

The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in the series 4 games to 2. This was the Dodgers' first World Series title since 1988.

To learn more about the supported REST body parameters, see our API reference for chat completion.
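
Because the endpoint is OpenAI-compatible, you can also call it from Python instead of curl. Here's a minimal sketch using the openai client package (not installed by this tutorial, so run pip install openai first); the API key is just a placeholder for local testing.

from openai import OpenAI

# Point the OpenAI client at the local MAX endpoint.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the World Series in 2020?"},
    ],
)
print(response.choices[0].message.content)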

Now that the model works locally, we'll transition to cloud deployment.

Cloud deployment paths

We will use Infrastructure-as-Code (IaC) to create, configure, and deploy Llama 3 in the cloud. The cloud deployment instructions are divided by provider: AWS, GCP, and Azure.

Cloud deployment overview

For AWS we will use CloudFormation; for GCP, Deployment Manager; and for Azure, Resource Manager. These IaC templates handle resource provisioning, networking, and security configuration. This approach simplifies deployments and ensures they are repeatable.

The key steps are:

  • Create and deploy stack/resources: Use IaC templates for each cloud provider to deploy Llama 3.
  • Test the endpoint: Retrieve the public IP address after deployment and send a request to test the Llama 3 endpoint in the cloud.

Each cloud-specific tab provides complete commands for setup, configuration, deployment, and testing.

To better understand the flow of the deployment, here is a high-level overview of the architecture:

Figure 1. Architecture diagram of the cloud stack for deploying MAX.

This architecture diagram illustrates the two-phase deployment setup for serving the Llama 3 model with MAX on cloud provider infrastructure.

The deployment process is divided into two phases:

  • Phase 1: Cloud stack creation: In this initial phase, the following infrastructure is provisioned and configured to prepare for serving requests:
    • Public IP assignment: The cloud provider assigns a public IP to the virtual machine (VM), allowing it to be accessed externally.
    • Firewall/Security group configuration: Security settings, such as firewall rules or security groups, are applied to allow traffic on port 80, so that only HTTP requests can reach the instance.
    • GPU compute instance setup: A GPU-enabled VM is created to handle model inference efficiently. This instance includes:
      • GPU drivers/runtime installation: Necessary GPU drivers and runtime libraries are installed to enable hardware acceleration for model processing.
      • Docker container initialization: A Docker container is launched on the VM, where it pulls the necessary images from the Docker Container Registry. This registry serves as a central repository for storing Docker images, making it easy to deploy and update the application.

Inside the container, MAX is set up alongside the Llama 3 model. This setup prepares the environment for serving requests but does not yet expose the endpoint to users.

  • Phase 2: Serving the user endpoint: Once the cloud stack is configured and the VM is set up, the deployment enters the second phase, where it starts serving user requests:
    • HTTP endpoint exposure: With the VM and Docker container ready, the system opens an OpenAI-compatible HTTP endpoint on port 80, allowing users to interact with the deployed Llama 3 model.
    • Request handling by MAX: When a user sends an HTTP request to the public IP, MAX processes the incoming request within the Docker container and forwards it to the Llama 3 model for inference. The model generates a response, which is then returned to the user via the endpoint.

Prerequisites

Be sure that you have the following prerequisites, as well as appropriate access and permissions for the cloud provider of your choice.

  • GPU resources: You'll need access to GPU resources in your cloud account.

    This tutorial has been tested on g5.4xlarge (A10G 24GB) on AWS, g2-standard-8 (L4 32GB) on GCP, and Standard_NV36ads_A10_v5 (A10G 24GB) on Azure.

  • A Hugging Face user access token: A valid Hugging Face token is required to access the model. To create a Hugging Face user access token, see Access Tokens. You must make your token available in your environment with the following command:

    export HF_TOKEN="<YOUR-HUGGING-FACE-HUB-TOKEN>"
  • Docker installation: Install the Docker Engine and CLI. We use a pre-configured GPU-enabled Docker container from our public repository. The container image (docker.modular.com/modular/max-nvidia-full:latest) is available on Docker Hub. For more information, see MAX container.

  • Cloud CLI tools: Before deploying, ensure that you have the respective cloud provider CLI tools installed.

Configure the AWS CLI:

aws configure

Log in to your AWS account:

aws sso login

Check the credentials via cat ~/.aws/credentials to make sure they are set up correctly. You can also provide the credentials as environment variables:

export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
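
If you'd like to confirm these prerequisites programmatically before deploying, here's an optional sketch that checks both your Hugging Face token and your AWS credentials. It assumes the huggingface_hub and boto3 packages are installed (pip install huggingface_hub boto3).

import os

import boto3
from huggingface_hub import whoami

# Verify the Hugging Face token exported as HF_TOKEN.
hf_user = whoami(token=os.environ["HF_TOKEN"])
print("Hugging Face user:", hf_user["name"])

# Verify that boto3 can resolve AWS credentials (env vars, SSO, or ~/.aws).
identity = boto3.client("sts").get_caller_identity()
print("AWS account:", identity["Account"])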

1. Create stack/deployment

In this section, we'll walk through creating a deployment stack on AWS, GCP, and Azure. Each cloud provider has its own configuration steps, detailed below, but we simplify the setup by using Infrastructure-as-Code (IaC) templates.

Start by cloning the MAX repository and navigating to the max/examples/cloud-configs/ directory, where the necessary IaC templates and configuration files are organized for each cloud provider.

git clone -b stable https://github.com/modular/modular && cd max/examples/cloud-configs

This directory includes all files required to deploy the MAX Serve setup to AWS, GCP, or Azure:

max/examples/cloud-configs/
├── aws
│   ├── max-serve-aws.yaml
│   └── notify.sh
├── azure
│   ├── max-serve-azure.json
│   └── notify.sh
└── gcp
    ├── max-serve-gcp.jinja
    └── notify.sh

With these IaC templates ready, choose your preferred cloud provider and follow the step-by-step instructions specific to each platform.

First, navigate to the AWS directory:

cd aws

Set the region in your environment:

export REGION="REGION" # example: `us-east-1`

Then, create the stack. You can explore the max-serve-aws.yaml file for AWS CloudFormation configuration information.

export STACK_NAME="max-serve-stack"

aws cloudformation create-stack --stack-name ${STACK_NAME} \
--template-body file://max-serve-aws.yaml \
--parameters \
ParameterKey=InstanceType,ParameterValue=g5.4xlarge \
ParameterKey=HuggingFaceHubToken,ParameterValue=${HF_TOKEN} \
ParameterKey=HuggingFaceRepoId,ParameterValue=modularai/Llama-3.1-8B-Instruct-GGUF \
--capabilities CAPABILITY_IAM \
--region $REGION
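
If you prefer Python to the AWS CLI, the same stack can be created with boto3. The following is a sketch, assuming boto3 is installed and that HF_TOKEN, REGION, and STACK_NAME are exported as above.

import os

import boto3

region = os.environ["REGION"]
stack_name = os.environ.get("STACK_NAME", "max-serve-stack")

cfn = boto3.client("cloudformation", region_name=region)

# Read the CloudFormation template shipped in max/examples/cloud-configs/aws.
with open("max-serve-aws.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName=stack_name,
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "InstanceType", "ParameterValue": "g5.4xlarge"},
        {"ParameterKey": "HuggingFaceHubToken", "ParameterValue": os.environ["HF_TOKEN"]},
        {"ParameterKey": "HuggingFaceRepoId", "ParameterValue": "modularai/Llama-3.1-8B-Instruct-GGUF"},
    ],
    Capabilities=["CAPABILITY_IAM"],
)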

2. Wait for resources to be ready

In this step, we'll wait for the resources to be ready. Stack and deployment creation may take some time to complete.

aws cloudformation wait stack-create-complete \
--stack-name ${STACK_NAME} \
--region ${REGION}
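
The boto3 equivalent of this wait, as a sketch:

import os

import boto3

cfn = boto3.client("cloudformation", region_name=os.environ["REGION"])
stack_name = os.environ.get("STACK_NAME", "max-serve-stack")

# Blocks until the stack reaches CREATE_COMPLETE (or raises on failure).
cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)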

3. Retrieve instance information

After the resources are deployed, you'll need to get the instance information, such as the public IP address, which we will use to test the endpoint.

INSTANCE_ID=$(aws cloudformation describe-stacks --stack-name ${STACK_NAME} \
--query "Stacks[0].Outputs[?OutputKey=='InstanceId'].OutputValue" \
--output text \
--region ${REGION})
PUBLIC_IP=$(aws ec2 describe-instances --instance-ids ${INSTANCE_ID} \
--query 'Reservations[0].Instances[0].PublicIpAddress' \
--output text \
--region ${REGION})
echo "Instance ID: ${INSTANCE_ID}"
echo "Public IP: ${PUBLIC_IP}"
aws ec2 wait instance-running --instance-ids ${INSTANCE_ID} --region ${REGION}
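
As a Python alternative, here's a boto3 sketch that mirrors the commands above: it reads the InstanceId output from the stack, waits for the instance to be running, and prints its public IP.

import os

import boto3

region = os.environ["REGION"]
stack_name = os.environ.get("STACK_NAME", "max-serve-stack")

cfn = boto3.client("cloudformation", region_name=region)
ec2 = boto3.client("ec2", region_name=region)

# Find the InstanceId output exported by the CloudFormation stack.
outputs = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["Outputs"]
instance_id = next(o["OutputValue"] for o in outputs if o["OutputKey"] == "InstanceId")

# Wait for the EC2 instance to be running, then look up its public IP.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
reservation = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]
public_ip = reservation["Instances"][0]["PublicIpAddress"]

print("Instance ID:", instance_id)
print("Public IP:", public_ip)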

4. Test the endpoint

  1. Wait until the server is ready before testing the endpoint.

    It will take some time for the stack or deployment to pull the MAX Docker image and set it up for serving. We need to wait for the Docker logs to appear and then make sure that the Docker container is running on port 8000.

    The server is ready when you see the following log:

    Server ready on http://0.0.0.0:8000

    We provide a simple script to monitor the startup progress and notify you when the server is ready.

    For AWS, you can see the logs in the AWS CloudWatch UI within the log group /aws/ec2/${STACK_NAME}-logs and log stream instance-logs.

    Alternatively, you can use the provided bash script to monitor the logs until the server is ready:

    bash notify.sh ${REGION} ${STACK_NAME} ${PUBLIC_IP}
  2. When the server is ready, use the public IP address that we obtained in the previous step to test the endpoint with the following curl request (a Python alternative follows this list):

    curl -N http://$PUBLIC_IP/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "stream": true,
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the World Series in 2020?"}
    ]
    }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
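
If you'd rather script this step, here's a Python sketch that polls the public endpoint until it responds and then sends the same chat request with the openai client. It assumes PUBLIC_IP is set in your environment and that the requests and openai packages are installed (pip install requests openai).

import os
import time

import requests
from openai import OpenAI

public_ip = os.environ["PUBLIC_IP"]
base_url = f"http://{public_ip}/v1"

# Pulling the MAX container and loading the model can take several minutes.
# Any HTTP response means the server is listening; retry for up to ~30 minutes.
for _ in range(60):
    try:
        requests.get(f"http://{public_ip}/", timeout=5)
        break
    except requests.exceptions.RequestException:
        print("Waiting for the server...")
        time.sleep(30)
else:
    raise TimeoutError("Server did not become reachable in time")

client = OpenAI(base_url=base_url, api_key="EMPTY")
response = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the World Series in 2020?"},
    ],
)
print(response.choices[0].message.content)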

5. Delete the cloud resources

Cleaning up resources to avoid unwanted costs is critical. Use the following commands to safely terminate all the resources created in this tutorial.

First, delete the stack:

aws cloudformation delete-stack --stack-name ${STACK_NAME}

Wait for the stack to be deleted:

aws cloudformation wait stack-delete-complete \
--stack-name ${STACK_NAME} \
--region ${REGION}
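
The same cleanup as a boto3 sketch:

import os

import boto3

cfn = boto3.client("cloudformation", region_name=os.environ["REGION"])
stack_name = os.environ.get("STACK_NAME", "max-serve-stack")

# Delete the stack and block until teardown finishes.
cfn.delete_stack(StackName=stack_name)
cfn.get_waiter("stack_delete_complete").wait(StackName=stack_name)
print(f"Stack {stack_name} deleted.")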

Cost estimate

When deploying Llama 3 in a cloud environment, several cost factors come into play:

Primary cost components:

  • Compute resources: GPU instances (like AWS g5.4xlarge, GCP g2-standard-8, or Azure Standard_NV36ads_A10_v5) form the bulk of the costs.
  • Network transfer: Costs associated with data ingress/egress, which can be significant for high-traffic applications.
  • Storage: Expenses for boot volumes and any additional storage requirements.
  • Additional services: Costs for logging, monitoring, and other supporting cloud services.

For detailed cost estimates specific to your use case, we recommend using your cloud provider's official pricing calculator.
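
For a rough sense of how these components add up, here's a back-of-the-envelope sketch; every rate in it is hypothetical, so substitute real prices from your provider's pricing calculator.

# All rates below are HYPOTHETICAL placeholders, not real prices.
gpu_hourly_rate = 1.50      # illustrative on-demand $/hour for a GPU instance
hours_per_month = 24 * 30
egress_gb = 100             # illustrative outbound data transfer per month
egress_rate_per_gb = 0.09   # illustrative $/GB egress
storage_gb = 200            # illustrative boot volume size
storage_rate_per_gb = 0.08  # illustrative $/GB-month

compute = gpu_hourly_rate * hours_per_month
network = egress_gb * egress_rate_per_gb
storage = storage_gb * storage_rate_per_gb

print(f"Compute: ${compute:,.2f}")
print(f"Network: ${network:,.2f}")
print(f"Storage: ${storage:,.2f}")
print(f"Total:   ${compute + network + storage:,.2f}")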

Next steps

Congratulations on successfully running MAX Pipelines locally and deploying Llama 3 to the cloud! 🎉

Now that you've mastered the essentials of setting up and deploying the Llama 3 model with MAX, here are some other topics to explore next:

To stay up to date with new releases, sign up for our newsletter and join our community. And if you're interested in becoming a design partner to get early access and give us feedback, please contact us.
