Embeddings

Text embeddings are rich numerical representations of text. They capture semantic meaning in a way that allows computers to compare, cluster, and search text effectively.

Use embeddings whenever you need to measure similarity between pieces of text, perform semantic search, build recommendation systems, or cluster documents. They are foundational for many modern NLP tasks.

In contemporary GenAI applications, embeddings are especially powerful in agentic workflows, including:

  • Retrieval-Augmented Generation (RAG): Embeddings make it possible to store and search large collections of documents, grounding model responses in your own data instead of relying only on a model's training knowledge.
  • Context injection for agents: Embeddings help agents decide which pieces of external knowledge (APIs, tools, or documents) are most relevant to the current query.
  • Personalization and recommendations: By embedding both user data and content, systems can deliver more tailored results.
  • Clustering and analytics: Embeddings allow grouping similar inputs for downstream tasks like summarization, deduplication, and insight extraction.
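All of these use cases reduce to comparing embedding vectors, most commonly with cosine similarity. Here is a minimal sketch in pure Python; the vectors are tiny hand-written stand-ins for illustration only (all-mpnet-base-v2, used later in this guide, actually produces 768-dimensional vectors):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for real embeddings.
query = [0.1, 0.9, 0.2, 0.0]
doc_close = [0.2, 0.8, 0.1, 0.1]  # points in a similar direction to the query
doc_far = [0.9, 0.0, 0.1, 0.8]    # points in a very different direction

print(cosine_similarity(query, doc_close))  # near 1.0 (similar)
print(cosine_similarity(query, doc_far))    # near 0.0 (dissimilar)
```

A higher score means the two texts are semantically closer, which is the signal RAG pipelines and recommenders rank on.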

Endpoint

MAX supports the v1/embeddings endpoint, which is fully compatible with the OpenAI API.

To use the endpoint, provide the ID of an embedding model along with the text to embed. The API returns numerical embeddings that capture the semantic meaning of each input. The request payload should look similar to the following:

{
  "model": "sentence-transformers/all-mpnet-base-v2",
  "input": "The food was delicious and the service was excellent."
}
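For example, once a server is running locally (see the Quickstart below), you could send this payload with curl. The URL and model name here match the Quickstart setup; adjust them for your own deployment:

```shell
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sentence-transformers/all-mpnet-base-v2",
    "input": "The food was delicious and the service was excellent."
  }'
```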

Quickstart

Serve and interact with an embedding model using an OpenAI-compatible endpoint. Specifically, we'll use MAX to serve the all-mpnet-base-v2 model, which is a powerful transformer that excels at capturing semantic relationships in text.

Set up your environment

Create a Python project to install our APIs and CLI tools:

  1. If you don't have it, install pixi:
    curl -fsSL https://pixi.sh/install.sh | sh

    Then restart your terminal for the changes to take effect.

  2. Create a project:
    pixi init embeddings-quickstart \
      -c https://conda.modular.com/max-nightly/ -c conda-forge \
      && cd embeddings-quickstart
  3. Install the modular conda package:
    pixi add modular
  4. Start the virtual environment:
    pixi shell

Serve your model

Use the max serve command to start a local model server with the all-mpnet-base-v2 model:

max serve \
  --model sentence-transformers/all-mpnet-base-v2

This starts a server running the all-mpnet-base-v2 embedding model at http://localhost:8000/v1/embeddings, an OpenAI-compatible endpoint.

The endpoint is ready when you see this message printed in your terminal:

Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)

For a complete list of max CLI commands and options, refer to the MAX CLI reference.

Interact with your model

MAX supports OpenAI's REST APIs, so you can interact with the model using either the OpenAI Python SDK or curl:

You can use OpenAI's Python client to interact with the model. First, install the OpenAI Python package:

pixi add openai

Then, create a client and make a request to the model:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Create embeddings
response = client.embeddings.create(
    model="sentence-transformers/all-mpnet-base-v2",
    input="Run an embedding model with MAX Serve!",
)

# Print the first 5 values of the embedding vector
print(response.data[0].embedding[:5])

You should receive a response similar to this (shown here as the raw JSON returned by the endpoint, with the embedding vector truncated for brevity):

{"data":[{"index":0,"embedding":[-0.06595132499933243,0.005941616836935282,0.021467769518494606,0.23037832975387573,

The embedding is a numerical representation of the input text that can be used for semantic comparisons.
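To put the embeddings to work, you can embed several texts in a single request (the endpoint accepts a list of inputs) and rank them against a query by cosine similarity. A minimal sketch, assuming the server from the previous step is still running; the helper functions are pure Python and independent of the server:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def embed(client, texts):
    """Embed a list of strings in one request; returns one vector per input."""
    response = client.embeddings.create(
        model="sentence-transformers/all-mpnet-base-v2",
        input=texts,
    )
    return [item.embedding for item in response.data]

if __name__ == "__main__":
    from openai import OpenAI  # requires: pixi add openai

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    docs = [
        "The food was delicious and the service was excellent.",
        "MAX serves embedding models behind an OpenAI-compatible endpoint.",
        "It rained all weekend.",
    ]
    query_vec = embed(client, ["How was the restaurant?"])[0]
    doc_vecs = embed(client, docs)

    # Rank documents by similarity to the query, highest first.
    ranked = sorted(
        zip(docs, (cosine_similarity(query_vec, v) for v in doc_vecs)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    for text, score in ranked:
        print(f"{score:.3f}  {text}")
```

The restaurant review should rank highest for this query, since its embedding points in the most similar direction. This is the core retrieval step behind the RAG workflow described earlier.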

For complete details on all available API endpoints and options, see the REST API documentation.

Next steps

Now that you have successfully set up MAX with an OpenAI-compatible embeddings endpoint, check out these other tutorials:
