Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) enriches Large Language Model's (LLM's) responses with your own knowledge base data. Instead of relying only on what the model learned during training, RAG retrieves relevant passages from your data and integrates them directly into the prompt.

This guide builds a RAG pipeline using two IONOS CLOUD services:

  • IONOS Managed PostgreSQL with the vector extension (pgvector) as the vector store

  • IONOS CLOUD AI Model Hub for embedding generation and text generation

IONOS CLOUD hosts both services in German data centers to ensure your data remains within the country throughout the processing pipeline.

About this guide

This guide is intended for developers. It assumes you have basic knowledge of:

  • REST APIs and how to call them

  • SQL and PostgreSQL

  • A programming language to handle REST API endpoints (for illustration purposes, the guide uses Python and Bash scripting)

You are familiar with:

By the end of this guide, you will be able to answer customer queries using an LLM which adds data from your own PostgreSQL knowledge base to the answers.

Background

  • The IONOS CLOUD AI Model Hub API provides the LLMs and embedding models needed to implement Retrieval Augmented Generation (RAG), eliminating the need for manual hardware management.

  • IONOS CLOUD Managed PostgreSQL includes the vector extension (pgvector), which adds a vector data type and similarity operators, your vector store is a standard PostgreSQL table, managed alongside the rest of your relational data.

  • IONOS CLOUD hosts both services in German data centers to ensure your data remains within the country throughout the processing pipeline.

Before you begin

To get started, you need the following:

  • A running IONOS CLOUD Managed PostgreSQL cluster. For more information, see Set up a cluster in the DCD.

  • The vector extension is activated on your database. For more information, see Activate Extensions and run CREATE EXTENSION vector CASCADE; on your database.

  • An IONOS CLOUD API token with access to the AI Model Hub.

  • An embedding model. This guide uses BAAI/bge-m3 (1024-dimensional output). For the complete list of available models, see Embedding Models.

  • A nLLM for the final answer. For the full list, see Models.

Note: Use the Token Manager in the DCD to generate a new token if you encounter an HTTP 401 Unauthorized error.

Set up your environment

To get started, open your IDE to enter Python code. The examples use the requests, psycopg, and pgvector libraries:

Generate a header object to authenticate against the AI Model Hub REST API and open a connection to your PostgreSQL cluster:

After this step, you have a header object to call the AI Model Hub endpoints and a conn to run SQL on your PostgreSQL cluster.

Step 1: Create the vector store

Create a table for your text chunks and a vector index. The example matches the 1024-dimensional output of BAAI/bge-m3.

Note: If you choose a different embedding model, set the vector dimension to match the model's output size. For more information about each model's dimension, see Embedding Models.

Step 2: Generate embeddings and insert chunks

Split your documents into chunks. For each chunk, generate an embedding through the AI Model Hub and insert both the chunk text and its embedding into the documents table.

Chunking: Embedding a whole document as one vector loses detail. Split long texts into overlapping chunks of a few hundred tokens before embedding. A common starting point is 200–800 tokens per chunk with 10–20 percent overlap; tune this to your documents and your embedding model's context length.

Step 3: Retrieve relevant chunks

To answer a user query, embed the query with the same model and run a cosine similarity search. The pgvector <=> operator returns cosine distance: lower values indicate higher similarity.

This returns the TOP_K chunks most similar to the user query.

Step 4: Generate the final answer

Combine the retrieved chunks with the user query and send them to an LLM through the OpenAI-compatible /v1/chat/completions endpoint.

The response is a standard OpenAI-compatible chat completion. The answer text is at choices[0].message.content.

Note: The best prompt strongly depends on the LLM used. Adapt your system and user messages to improve results for the specific Large Language Model.

Summary

In this guide, you learned how to use IONOS CLOUD Managed PostgreSQL with pgvector together with the IONOS CLOUD AI Model Hub to implement Retrieval Augmented Generation. Namely, you learned how to:

  1. Prepare a PostgreSQL table as a pgvector-backed vector store.

  2. Generate embeddings through the AI Model Hub embedding API.

  3. Run cosine similarity queries to retrieve the chunks most relevant to a user question.

  4. Combine the retrieved context with an LLM to produce a final answer, all within IONOS CLOUD, with your data staying in Germany.

Last updated

Was this helpful?