
Deploy a text embedding model with an endpoint

Text embeddings are numerical vector representations of text that power many modern natural language processing (NLP) applications, such as semantic search and document clustering. This tutorial shows you how to run and interact with an embeddings endpoint using MAX Serve. Specifically, we'll use the all-mpnet-base-v2 model, a sentence-transformers model that maps sentences and paragraphs to 768-dimensional vectors that capture their meaning.

In this tutorial, you'll learn how to:

  • Set up a local embeddings server using the all-mpnet-base-v2 model
  • Build a smart knowledge base system using semantic similarity
  • Implement document clustering and topic-based organization
  • Create robust search functionality using embeddings

Local setup

In this section, you will set up and run the all-mpnet-base-v2 model locally using MAX Serve.

Start the embeddings server

Use the magic CLI tool to start the embeddings server locally:

  1. If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:

    curl -ssL https://magic.modular.com/ | bash

    Then run the source command that's printed in your terminal.

  2. Use magic to install our max-pipelines CLI tool:

    magic global install max-pipelines
  3. Start a local endpoint for all-mpnet-base-v2:

    max-pipelines serve --model-path=sentence-transformers/all-mpnet-base-v2

    This starts a server that runs the all-mpnet-base-v2 embeddings model and exposes an OpenAI-compatible endpoint at http://localhost:8000/v1/embeddings.

    The endpoint is ready when you see the URI printed in your terminal:

    Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
  4. Send a curl request to the endpoint:

    Let's send a curl request to see what kind of response we get back. With the server still running in your first terminal, run the following command in a second terminal:

    curl http://localhost:8000/v1/embeddings \
      -H "Content-Type: application/json" \
      -d '{
        "input": "Run an embedding model with MAX Serve!",
        "model": "sentence-transformers/all-mpnet-base-v2"
      }'

    The following is the expected output.

    {"data":[{"index":0,"embedding":[-0.06595132499933243,0.005941616836935282,0.021467769518494606,0.23037832975387573,

The output has been shortened for brevity. The embedding field holds the numerical representation of the input text (a vector of 768 floating-point numbers for all-mpnet-base-v2), which can be used for semantic comparisons.
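The same request can be sent from Python. The following minimal sketch uses only the standard library (urllib.request and json) and assumes the OpenAI-style response shape shown above: a data list whose items carry an index and an embedding. It sends a list of texts as input, which is how the knowledge base code later in this tutorial calls the endpoint. The helper names build_request, parse_embeddings, and embed are illustrative, not part of MAX Serve.

```python
import json
import urllib.request
from typing import List

# Assumed defaults, matching the server started above.
ENDPOINT = "http://localhost:8000/v1/embeddings"
MODEL = "sentence-transformers/all-mpnet-base-v2"


def build_request(texts: List[str], endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build the same POST request as the curl command, for a batch of texts."""
    body = json.dumps({"input": texts, "model": MODEL}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(payload: dict) -> List[List[float]]:
    """Return one vector per input text, ordered by each item's "index" field."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts: List[str], endpoint: str = ENDPOINT) -> List[List[float]]:
    """Hypothetical helper: send texts to the running server and return their embeddings."""
    with urllib.request.urlopen(build_request(texts, endpoint), timeout=5) as resp:
        return parse_embeddings(json.load(resp))
```

With the server running, len(embed(["Run an embedding model with MAX Serve!"])[0]) should be 768, the embedding size of all-mpnet-base-v2.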

Now that the endpoint is active and responsive, let's create an application that uses the embedding model and retrieves information.

Build a knowledge base system

Now, let's build a smart knowledge base using the all-mpnet-base-v2 model. You'll create a system that can match user queries to relevant documentation and automatically organize content into topics.

1. Install dependencies

Let's create a new Python project using magic to manage our packages.

  1. In a second terminal, run the following command:

    magic init embeddings --format pyproject && cd embeddings
  2. Add the three libraries the project needs:

    magic add numpy scikit-learn requests

NumPy handles the vector math, scikit-learn provides cosine similarity and k-means clustering for comparing and grouping embeddings, and the requests library sends HTTP requests to the embeddings endpoint.
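Before writing the knowledge base, you may want to confirm that all three libraries are importable from the project environment. The following is a small sketch using importlib.util.find_spec; check_dependencies and REQUIRED are illustrative names. It also highlights that scikit-learn is installed under one name but imported as sklearn.

```python
import importlib.util
from typing import Dict

# Package name (what `magic add` installs) -> module name (what you import).
# scikit-learn is imported as `sklearn`.
REQUIRED = {"numpy": "numpy", "scikit-learn": "sklearn", "requests": "requests"}


def check_dependencies(required: Dict[str, str] = REQUIRED) -> Dict[str, bool]:
    """Report whether each required module can be found in the current environment."""
    return {
        package: importlib.util.find_spec(module) is not None
        for package, module in required.items()
    }


if __name__ == "__main__":
    for package, found in check_dependencies().items():
        print(f"{package}: {'ok' if found else 'MISSING'}")
```

Run it with magic run python so it uses the project's environment; a MISSING entry suggests the corresponding magic add did not complete.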

2. Implement the knowledge base system

Now we will create a smart knowledge base system that can:

  • Process and store documents with their semantic embeddings
  • Search for relevant documents using natural language queries
  • Automatically organize content into topics using clustering
  • Suggest relevant topics based on user queries

The system uses embeddings from the all-mpnet-base-v2 model to understand the meaning of text, enabling semantic search and intelligent document organization.
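Semantic search comes down to ranking documents by how close their embeddings are to the query's embedding. The sketch below shows that ranking step in plain Python on made-up 3-dimensional vectors (real all-mpnet-base-v2 embeddings have 768 dimensions). The knowledge base below performs the same computation with scikit-learn's cosine_similarity and np.argsort; the cosine and top_k names here are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(
    query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (made-up values).
docs = [
    [0.9, 0.1, 0.0],  # e.g. a password document
    [0.1, 0.9, 0.0],  # e.g. a billing document
    [0.0, 0.2, 0.9],  # e.g. an installation document
]
print(top_k([0.8, 0.2, 0.0], docs, k=2))
```

Because cosine similarity ignores vector length, two texts score as similar when their embeddings point in the same direction, regardless of magnitude.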

  1. Create a new Python file called kb_system.py in the src/embeddings directory and add the following:

    import logging
    from functools import lru_cache
    from typing import Dict, List, Optional, Tuple

    import numpy as np
    import requests
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_similarity

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"


    class SmartKnowledgeBase:
        def __init__(self, endpoint: str = "http://localhost:8000/v1/embeddings"):
            self.endpoint = endpoint
            self.documents: List[str] = []
            self.doc_titles: List[str] = []
            self.embeddings: Optional[np.ndarray] = None  # shape: (n_docs, dim)
            self.clusters: Dict[int, List[int]] = {}  # topic ID -> document indices

        def _get_embedding(self, texts: List[str], max_retries: int = 3) -> np.ndarray:
            """Get embeddings for a batch of texts, retrying on failure."""
            for attempt in range(max_retries):
                try:
                    response = requests.post(
                        self.endpoint,
                        headers={"Content-Type": "application/json"},
                        json={"input": texts, "model": MODEL_NAME},
                        timeout=5,
                    )
                    response.raise_for_status()  # treat HTTP error codes as failures
                    # Order by "index" so each row lines up with its input text
                    data = sorted(response.json()["data"], key=lambda item: item["index"])
                    return np.array([item["embedding"] for item in data])
                except (requests.RequestException, KeyError, ValueError) as e:
                    if attempt == max_retries - 1:
                        raise RuntimeError(
                            f"Failed to get embeddings after {max_retries} attempts: {e}"
                        ) from e
                    logger.warning(f"Attempt {attempt + 1} failed, retrying...")

        @lru_cache(maxsize=1000)
        def _get_embedding_cached(self, text: str) -> np.ndarray:
            """Cached embedding for a single text, so repeated queries skip the server."""
            return self._get_embedding([text])[0]

        def add_document(self, title: str, content: str):
            """Add a single document with its title, then update embeddings and topics."""
            # Embed first, so a failed request leaves the knowledge base unchanged
            new_embedding = self._get_embedding([content])

            self.doc_titles.append(title)
            self.documents.append(content)
            if self.embeddings is None:
                self.embeddings = new_embedding
            else:
                self.embeddings = np.vstack([self.embeddings, new_embedding])

            # Recluster once there are enough documents
            if len(self.documents) >= 3:
                self._cluster_documents()

        def _cluster_documents(self, n_clusters: Optional[int] = None):
            """Group documents into topics with k-means."""
            if n_clusters is None:
                # Roughly one topic per five documents, with at least two topics
                n_clusters = max(2, len(self.documents) // 5)
            n_clusters = min(n_clusters, len(self.documents))

            kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
            labels = kmeans.fit_predict(self.embeddings)

            self.clusters = {
                i: np.where(labels == i)[0].tolist() for i in range(n_clusters)
            }

        def search(self, query: str, top_k: int = 3) -> List[Tuple[str, str, float]]:
            """Return the top_k (title, content, score) tuples most similar to the query."""
            if self.embeddings is None:
                return []
            query_embedding = self._get_embedding_cached(query)
            similarities = cosine_similarity([query_embedding], self.embeddings)[0]
            # Indices of the highest scores, best first
            top_indices = np.argsort(similarities)[-top_k:][::-1]
            return [
                (self.doc_titles[i], self.documents[i], float(similarities[i]))
                for i in top_indices
            ]

        def get_topic_documents(self, topic_id: int) -> List[Tuple[str, str]]:
            """Return all (title, content) pairs in a topic cluster."""
            return [
                (self.doc_titles[i], self.documents[i])
                for i in self.clusters.get(topic_id, [])
            ]

        def suggest_topics(self, query: str, top_k: int = 2) -> List[Tuple[int, float]]:
            """Rank topics by the best match between the query and any of their documents."""
            query_embedding = self._get_embedding_cached(query)
            topic_similarities = []
            for topic_id, doc_indices in self.clusters.items():
                topic_embeddings = self.embeddings[doc_indices]
                similarity = cosine_similarity([query_embedding], topic_embeddings).max()
                topic_similarities.append((topic_id, float(similarity)))
            return sorted(topic_similarities, key=lambda x: x[1], reverse=True)[:top_k]


    # Example usage
    if __name__ == "__main__":
        # Initialize the knowledge base
        kb = SmartKnowledgeBase()

        # Add technical documentation
        kb.add_document(
            "Password Reset Guide",
            "To reset your password: 1. Click 'Forgot Password' 2. Enter your email "
            "3. Follow the reset link 4. Create a new password meeting security requirements",
        )

        kb.add_document(
            "Account Security",
            "Secure your account by enabling 2FA, using a strong password, and regularly "
            "monitoring account activity. Enable login notifications for suspicious activity.",
        )

        kb.add_document(
            "Billing Overview",
            "Your billing cycle starts on the 1st of each month. View charges, update "
            "payment methods, and download invoices from the Billing Dashboard.",
        )

        kb.add_document(
            "Payment Methods",
            "We accept credit cards, PayPal, and bank transfers. Update payment methods "
            "in Billing Settings. New payment methods are verified with a $1 hold.",
        )

        kb.add_document(
            "Installation Guide",
            "Install by downloading the appropriate package for your OS. Run with admin "
            "privileges. Follow prompts to select installation directory and components.",
        )

        kb.add_document(
            "System Requirements",
            "Minimum: 8GB RAM, 2GB storage, Windows 10/macOS 11+. Recommended: 16GB RAM, "
            "4GB storage, SSD, modern multi-core processor for optimal performance.",
        )

        # Example 1: Search for password-related help
        print("\nSearching for password help:")
        results = kb.search("How do I change my password?")
        for title, content, score in results:
            print(f"\nTitle: {title}")
            print(f"Relevance: {score:.2f}")
            print(f"Content: {content[:100]}...")

        # Example 2: Get topic suggestions
        print("\nGetting topics for billing query:")
        query = "Where can I update my credit card?"
        topics = kb.suggest_topics(query)
        for topic_id, relevance in topics:
            print(f"\nTopic {topic_id} (Relevance: {relevance:.2f}):")
            for title, content in kb.get_topic_documents(topic_id):
                print(f"- {title}: {content[:50]}...")

        # Example 3: Get all documents in a topic
        print("\nAll documents in Topic 0:")
        for title, content in kb.get_topic_documents(0):
            print(f"\nTitle: {title}")
            print(f"Content: {content[:100]}...")

    The SmartKnowledgeBase class implements a document retrieval and organization system based on embeddings. You can add documents with kb.add_document(), find the documents most relevant to a user's question with kb.search(), suggest related topics with kb.suggest_topics(), and list the documents in a topic with kb.get_topic_documents().

  2. Run the script:

    With the server running in your first terminal, run the following command in the second terminal:

    magic run python -m embeddings.kb_system

    On your first run, this might take longer. The following is the expected output.

    Title: Password Reset Guide
    Relevance: 0.61
    Content: To reset your password: 1. Click 'Forgot Password' 2. Enter your email 3. Follow the reset link 4. C...

    The text has been shortened for brevity.
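The _get_embedding method retries a failed request immediately. A common refinement is to wait between attempts and double the wait each time (exponential backoff), which gives a server that is still starting up time to become ready. The following is a minimal, generic sketch of that pattern; call_with_backoff and its parameters are our own illustrative names, not part of MAX or requests.

```python
import logging
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on failure and doubling the wait after each failed attempt."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on as exc:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ... with the defaults
            logger.warning("Attempt %d failed (%s); retrying in %.1fs", attempt + 1, exc, delay)
            sleep(delay)
    raise ValueError("max_retries must be at least 1")
```

To use it in the class, you could wrap the HTTP call in a zero-argument function, for example call_with_backoff(lambda: requests.post(...).json(), retry_on=(requests.RequestException,)). The sleep parameter exists only so the delays can be tested without actually waiting.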

Conclusion

In this tutorial, you learned how to:

  • Set up and test a local embeddings server using the all-mpnet-base-v2 model
  • Build a smart knowledge base system that can process and retrieve documents based on semantic similarity
  • Implement document clustering and topic-based organization
  • Create robust search functionality using embeddings
