Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) enriches Large Language Model's (LLM's) responses with your own knowledge base data. Instead of relying only on what the model learned during training, RAG retrieves relevant passages from your data and integrates them directly into the prompt.
This guide builds a RAG pipeline using two IONOS CLOUD services:
IONOS Managed PostgreSQL with the
vectorextension (pgvector) as the vector storeIONOS CLOUD AI Model Hub for embedding generation and text generation
IONOS CLOUD hosts both services in German data centers to ensure your data remains within the country throughout the processing pipeline.
About this guide
This guide is intended for developers. It assumes you have basic knowledge of:
REST APIs and how to call them
SQL and PostgreSQL
A programming language to handle REST API endpoints (for illustration purposes, the guide uses Python and Bash scripting)
You are familiar with:
By the end of this guide, you will be able to answer customer queries using an LLM which adds data from your own PostgreSQL knowledge base to the answers.
Background
The IONOS CLOUD AI Model Hub API provides the LLMs and embedding models needed to implement Retrieval Augmented Generation (RAG), eliminating the need for manual hardware management.
IONOS CLOUD Managed PostgreSQL includes the
vectorextension (pgvector), which adds a vector data type and similarity operators, your vector store is a standard PostgreSQL table, managed alongside the rest of your relational data.IONOS CLOUD hosts both services in German data centers to ensure your data remains within the country throughout the processing pipeline.
Plain text only: The AI Model Hub embedding API accepts plain text. If your source material is in PDF, Word, or another binary format, extract the text yourself (for example with PyMuPDF for PDFs or python-docx for Word documents) before embedding it.
Before you begin
To get started, you need the following:
A running IONOS CLOUD Managed PostgreSQL cluster. For more information, see Set up a cluster in the DCD.
The
vectorextension is activated on your database. For more information, see Activate Extensions and runCREATE EXTENSION vector CASCADE;on your database.An IONOS CLOUD API token with access to the AI Model Hub.
An embedding model. This guide uses
BAAI/bge-m3(1024-dimensional output). For the complete list of available models, see Embedding Models.A nLLM for the final answer. For the full list, see Models.
Note: Use the Token Manager in the DCD to generate a new token if you encounter an HTTP 401 Unauthorized error.
Set up your environment
To get started, open your IDE to enter Python code. The examples use the requests, psycopg, and pgvector libraries:
Generate a header object to authenticate against the AI Model Hub REST API and open a connection to your PostgreSQL cluster:
After this step, you have a header object to call the AI Model Hub endpoints and a conn to run SQL on your PostgreSQL cluster.
To get started, open a terminal and ensure that curl, jq, and psql are installed. curl is essential for communicating with the AI Model Hub REST API; jq improves the readability of JSON output and is also used to build request bodies; psql connects to your PostgreSQL cluster.
Export the API token and PostgreSQL connection variables:
Step 1: Create the vector store
Create a table for your text chunks and a vector index. The example matches the 1024-dimensional output of BAAI/bge-m3.
Note: If you choose a different embedding model, set the vector dimension to match the model's output size. For more information about each model's dimension, see Embedding Models.
Step 2: Generate embeddings and insert chunks
Split your documents into chunks. For each chunk, generate an embedding through the AI Model Hub and insert both the chunk text and its embedding into the documents table.
Chunking: Embedding a whole document as one vector loses detail. Split long texts into overlapping chunks of a few hundred tokens before embedding. A common starting point is 200–800 tokens per chunk with 10–20 percent overlap; tune this to your documents and your embedding model's context length.
Repeat for each chunk. For larger workloads, batch multiple chunks into a single /v1/embeddings request and a single INSERT ... VALUES (...), (...), (...); statement.
Step 3: Retrieve relevant chunks
To answer a user query, embed the query with the same model and run a cosine similarity search. The pgvector <=> operator returns cosine distance: lower values indicate higher similarity.
This returns the TOP_K chunks most similar to the user query.
Step 4: Generate the final answer
Combine the retrieved chunks with the user query and send them to an LLM through the OpenAI-compatible /v1/chat/completions endpoint.
The response is a standard OpenAI-compatible chat completion. The answer text is at choices[0].message.content.
Note: The best prompt strongly depends on the LLM used. Adapt your system and user messages to improve results for the specific Large Language Model.
Summary
In this guide, you learned how to use IONOS CLOUD Managed PostgreSQL with pgvector together with the IONOS CLOUD AI Model Hub to implement Retrieval Augmented Generation. Namely, you learned how to:
Prepare a PostgreSQL table as a pgvector-backed vector store.
Generate embeddings through the AI Model Hub embedding API.
Run cosine similarity queries to retrieve the chunks most relevant to a user question.
Combine the retrieved context with an LLM to produce a final answer, all within IONOS CLOUD, with your data staying in Germany.
Last updated
Was this helpful?