Bring your own fine-tuned model to MAX pipelines

Developer Relations

Updated: Aug 16, 2024 |

15 min read

pipeline

In the MAX 24.4 release, we have introduced native support for quantization and GGUF weight format. In this tutorial, we'll guide you through the steps to integrate your fine-tuned custom model into the MAX pipelines. More specifically, we will start with the initial configuration and then demonstrate how to download a model from the Hugging Face Hub. If the model is not already available in a supported quantized GGUF format, we'll show you how to convert it to prepare for ingestion into the MAX pipelines. Finally, we will explore how to use the quantized GGUF model via the MAX pipelines CLI.

About model customization

Model customization in machine learning typically involves modifying a pre-trained model to better suit specific tasks or datasets. One effective approach is fine-tuning, where a model trained on a large dataset is further trained (or fine-tuned) on a smaller, task-specific dataset. In this tutorial, we focus on Low Rank Adaptation (LoRA). LoRA (and its quantized variant QLoRA) allows for efficient adaptation of large models by only updating a small set of additional parameters, preserving the original model's structure by integrating LoRA layers without altering the primary architecture. For this tutorial, we are assuming the LoRA weights have been merged into the original model such as Llama3.1. Such a functionality is provided by major fine-tuning libraries such as unsloth save_pretrained_merged or using PEFT model merging APIs.

Step 1: Set up Hugging Face access

To interact with models hosted on Hugging Face, secure access is required either via SSH or an access token. Follow the instructions in the Hugging Face documentation to set up SSH. We can verify our configuration by running:

ssh -T git@hf.co
ssh -T git@hf.co

A successful setup will display Hi <USERNAME>, welcome to Hugging Face.

Step 2: Set up MAX pipelines

Next is to clone the MAX GitHub repository and navigate to the MAX pipeline:

git clone https://github.com/modular/max && cd max
cd pipelines/python
git clone https://github.com/modular/max && cd max
cd pipelines/python

Step 3: Include the `huggingface_hub` CLI

We'll use the magic CLI to create a virtual environment and install the required packages.

If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:

curl -ssL https://magic.modular.com/ | bash
curl -ssL https://magic.modular.com/ | bash

Then run the source command that's printed in your terminal.

Now install the huggingface_hub library to enable interactions with the Hugging Face Hub. This package facilitates the download, and management of models and datasets:

magic add --pypi huggingface_hub hf_transfer
magic add --pypi huggingface_hub hf_transfer

With the Hugging Face Hub CLI installed, we can proceed to the next steps of downloading and converting our model.

Step 3: Convert to GGUF format

If your model is already in the GGUF format, you can skip this conversion step and proceed directly to the next step. If not, here are the most common methods to convert a model to a quantized GGUF format suitable for deployment:

Automated conversion via Hugging Face space: We can use the gguf-my-repo space for a streamlined conversion process to convert to a supported quantized GGUF format. Remember to log in and for the sake of this tutorial, we choose the Q4_K_M quantization method.

You can see all the supported quantization encoding in the encodings module. For demonstration, we will choose mlabonne/FineLlama-3.1-8B.

After conversion, the model will be available under your HugginFace USERNAME, ready for download and deployment.

The following will download the converted GGUF model:
```
HF_HUB_ENABLE_HF_TRANSFER=1 magic run huggingface-cli download \
<USERNAME>/FineLlama-3.1-8B-Q4_K_M-GGUF \
--repo-type model \
--local-dir ./models
```
```
HF_HUB_ENABLE_HF_TRANSFER=1 magic run huggingface-cli download \
<USERNAME>/FineLlama-3.1-8B-Q4_K_M-GGUF \
--repo-type model \
--local-dir ./models
```

Manually convert via llama.cpp script: Alternatively, utilize the llama.cpp converter script to manually convert your model.

git clone https://github.com/ggerganov/llama.cpp

# If your model is available in Hugging Face <REPO-ID/MODEL-ID>.
# Ensure you replace <REPO-ID/MODEL-ID> with the appropriate
# repository or model ID from Hugging Face.
# Otherwise skip this command.
HF_HUB_ENABLE_HF_TRANSFER=1 magic run huggingface-cli download <REPO-ID/MODEL-ID> \
 --repo-type model \
 --local-dir ./models

python llama.cpp/convert_hf_to_gguf.py models
git clone https://github.com/ggerganov/llama.cpp

# If your model is available in Hugging Face <REPO-ID/MODEL-ID>.
# Ensure you replace <REPO-ID/MODEL-ID> with the appropriate
# repository or model ID from Hugging Face.
# Otherwise skip this command.
HF_HUB_ENABLE_HF_TRANSFER=1 magic run huggingface-cli download <REPO-ID/MODEL-ID> \
 --repo-type model \
 --local-dir ./models

python llama.cpp/convert_hf_to_gguf.py models

With all the requirements in place we are now ready to use our custom model in MAX pipelines.

Step 4: Run the custom model

Now that we have our fine-tuned Llama 3.1 model in GGUF format (finellama-3.1-8b-q4_k_m.gguf for demonstration), we can run it using the MAX pipelines. To do this, we will use the magic CLI with the --weight-path option:

View MAX pipelines options

Check out magic run llama3 --help for the available options

magic run llama3 --version 3.1 \
  --weight-path "./models/finellama-3.1-8b-q4_k_m.gguf" \
  --quantization-encoding "q4_k" \
  --prompt "What is the meaning of life?"
magic run llama3 --version 3.1 \
  --weight-path "./models/finellama-3.1-8b-q4_k_m.gguf" \
  --quantization-encoding "q4_k" \
  --prompt "What is the meaning of life?"

It generates the following beautiful answer:

The meaning of life is a question that has been pondered by philosophers, scientists,
and spiritual leaders for centuries. It is a question that has no definitive answer,
as it is deeply personal and subjective to each individual. However, many have
attempted to provide their own interpretations or explanations.

One interpretation of the meaning of life is that it is simply to live and experience
the world around us. This view suggests that the purpose of life is to experience all
that it has to offer, whether it be through the senses, emotions, or intellectual
pursuits. In this sense, the meaning of life is not necessarily tied to any specific goal
or achievement, but rather to the process of living itself.

Another interpretation is that the meaning of life is to find purpose and meaning in
our lives. This view suggests that we are here to seek out our own unique purpose and
to strive to achieve it. This can be achieved through various means, such as through our
work, relationships, or personal pursuits.

A third interpretation is that the meaning of life is to connect with something larger
than ourselves. This view suggests that we are here to connect with a higher power,
whether it be through religion, spirituality, or a sense of awe and wonder at the
universe. In this sense, the meaning of life is to find a sense of purpose and
connection that transcends our individual lives.

Ultimately, the meaning of life is a question that each person must answer for themselves.
It is a question that requires us to reflect on our own values, beliefs, and experiences.
As the saying goes, "Ask a flower" - the meaning of life is not something that can be
answered in words, but rather in the experience of living itself.
The meaning of life is a question that has been pondered by philosophers, scientists,
and spiritual leaders for centuries. It is a question that has no definitive answer,
as it is deeply personal and subjective to each individual. However, many have
attempted to provide their own interpretations or explanations.

One interpretation of the meaning of life is that it is simply to live and experience
the world around us. This view suggests that the purpose of life is to experience all
that it has to offer, whether it be through the senses, emotions, or intellectual
pursuits. In this sense, the meaning of life is not necessarily tied to any specific goal
or achievement, but rather to the process of living itself.

Another interpretation is that the meaning of life is to find purpose and meaning in
our lives. This view suggests that we are here to seek out our own unique purpose and
to strive to achieve it. This can be achieved through various means, such as through our
work, relationships, or personal pursuits.

A third interpretation is that the meaning of life is to connect with something larger
than ourselves. This view suggests that we are here to connect with a higher power,
whether it be through religion, spirituality, or a sense of awe and wonder at the
universe. In this sense, the meaning of life is to find a sense of purpose and
connection that transcends our individual lives.

Ultimately, the meaning of life is a question that each person must answer for themselves.
It is a question that requires us to reflect on our own values, beliefs, and experiences.
As the saying goes, "Ask a flower" - the meaning of life is not something that can be
answered in words, but rather in the experience of living itself.

Next steps

Congratulations on successfully integrating your fine-tuned Llama3.1 model into the MAX pipelines! 🎉

We have navigated through setting up secure access, downloading and converting models, and finally running your custom model in MAX pipelines. We encourage you to further customize your models via the MAX Graph API, test your pipeline and explore other MAX features including how to deploy your fine-tuned model on GPU using MAX Serve.

Here are some other topics to explore next:

Getting started with MAX Graph

Learn how to build a model graph with our Mojo API for inference with MAX Engine.

Deploy Llama3 on GPU with MAX Serve

Learn how to deploy Llama3 on GPU with MAX Serve.

About model customization​

Step 1: Set up Hugging Face access​

Step 2: Set up MAX pipelines​

Step 3: Include the huggingface_hub CLI​

Step 3: Convert to GGUF format​

Step 4: Run the custom model​

Next steps​