# Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) enriches the responses of a Large Language Model (LLM) with data from your own knowledge base. Instead of relying only on what the model learned during training, RAG retrieves relevant passages from your data and integrates them directly into the prompt.

This guide builds a RAG pipeline using two <code class="expression">space.vars.ionos\_cloud</code> services:

* **IONOS Managed PostgreSQL** with the `vector` extension (pgvector) as the vector store
* **IONOS CLOUD AI Model Hub** for embedding generation and text generation

<code class="expression">space.vars.ionos\_cloud</code> hosts both services in German data centers to ensure your data remains within the country throughout the processing pipeline.

## About this guide

This guide is intended for developers. It assumes you have basic knowledge of:

* REST APIs and how to call them
* SQL and PostgreSQL
* A programming language for calling REST API endpoints (this guide uses Python and Bash for illustration)

It also assumes you are familiar with:

* [<mark style="color:blue;">Text Generation</mark>](/cloud/ai/ai-model-hub/how-tos/text-generation.md)
* [<mark style="color:blue;">Text Embeddings</mark>](/cloud/ai/ai-model-hub/how-tos/text-embeddings.md)
* [<mark style="color:blue;">Managed PostgreSQL</mark>](/cloud/databases/postgresql.md)

By the end of this guide, you will be able to answer customer queries using an LLM that grounds its answers in data from your own PostgreSQL knowledge base.

## Background

* The <code class="expression">space.vars.ionos\_cloud\_ai\_model\_hub</code> API provides the LLMs and embedding models needed to implement Retrieval Augmented Generation (RAG), eliminating the need for manual hardware management.
* <code class="expression">space.vars.ionos\_cloud</code> Managed PostgreSQL includes the `vector` extension (pgvector), which adds a vector data type and similarity operators. Your vector store is therefore a standard PostgreSQL table, managed alongside the rest of your relational data.

{% hint style="warning" %}
**Plain text only:** The AI Model Hub embedding API accepts plain text. If your source material is in PDF, Word, or another binary format, extract the text yourself (for example with `PyMuPDF` for PDFs or `python-docx` for Word documents) before embedding it.
{% endhint %}

## Before you begin

To get started, you need the following:

* A running **IONOS CLOUD Managed PostgreSQL** cluster. For more information, see [<mark style="color:blue;">Set up a cluster in the DCD</mark>](/cloud/databases/postgresql/how-tos/setup-a-cluster-in-the-dcd.md).
* The `vector` extension activated on your database. To activate it, follow [<mark style="color:blue;">Activate Extensions</mark>](/cloud/databases/postgresql/overview/activate-extensions.md) and then run `CREATE EXTENSION vector CASCADE;` on your database.
* An **IONOS CLOUD API token** with access to the AI Model Hub.
* An **embedding model**. This guide uses `BAAI/bge-m3` (1024-dimensional output). For the complete list of available models, see [<mark style="color:blue;">Embedding Models</mark>](/cloud/ai/ai-model-hub/models/embedding-models.md).
* An **LLM** for the final answer. For the full list, see [<mark style="color:blue;">Models</mark>](/cloud/ai/ai-model-hub/models.md).

{% hint style="info" %}
**Note:** Use the [<mark style="color:blue;">Token Manager</mark>](/cloud/set-up-ionos-cloud/management/identity-access-management/token-manager.md) in the DCD to generate a new token if you encounter an `HTTP 401 Unauthorized` error.
{% endhint %}

### Set up your environment

{% tabs %}
{% tab title="Python" %}
To get started, open your IDE to enter Python code. The examples use the `requests`, `psycopg`, and `pgvector` libraries:

```bash
pip install requests "psycopg[binary]" pgvector
```

Generate a header object to authenticate against the AI Model Hub REST API and open a connection to your PostgreSQL cluster:

```python
# Python example: authentication and PostgreSQL connection
import psycopg
from pgvector.psycopg import register_vector

IONOS_API_TOKEN = "[YOUR API TOKEN HERE]"
PG_DSN = "postgresql://[USER]:[PASSWORD]@[HOST]:5432/[DATABASE]"

header = {
    "Authorization": f"Bearer {IONOS_API_TOKEN}",
    "Content-Type": "application/json",
}

conn = psycopg.connect(PG_DSN, autocommit=True)
register_vector(conn)
```

After this step, you have a `header` object to call the AI Model Hub endpoints and a `conn` to run SQL on your PostgreSQL cluster.
{% endtab %}

{% tab title="Bash" %}
To get started, open a terminal and ensure that `curl`, `jq`, and `psql` are installed. `curl` is essential for communicating with the AI Model Hub REST API; `jq` improves the readability of JSON output and is also used to build request bodies; `psql` connects to your PostgreSQL cluster.

Export the API token and PostgreSQL connection variables:

```bash
#!/bin/bash

# Quote the values so that spaces or special characters do not break the export.
export IONOS_API_TOKEN="[YOUR API TOKEN HERE]"
export PGHOST="[YOUR POSTGRESQL HOST HERE]"
export PGUSER="[YOUR POSTGRESQL USER HERE]"
export PGPASSWORD="[YOUR POSTGRESQL PASSWORD HERE]"
export PGDATABASE="[YOUR POSTGRESQL DATABASE HERE]"
```

{% endtab %}
{% endtabs %}
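
The Bash tab exports the token and the connection settings as environment variables. A Python script can read the same variables instead of hardcoding secrets in the source file. The following is a minimal sketch under that assumption: the variable names come from the Bash tab, `PGPORT` with a default of `5432` is an assumption, and the helper names `build_header` and `build_dsn` are hypothetical.

```python
import os
from urllib.parse import quote


def build_header(env=os.environ):
    """Build the AI Model Hub request header from IONOS_API_TOKEN."""
    token = env.get("IONOS_API_TOKEN")
    if not token:
        raise RuntimeError("IONOS_API_TOKEN is not set")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }


def build_dsn(env=os.environ):
    """Build a PostgreSQL connection URI from the PG* variables."""
    # Percent-encode the parts that may contain reserved characters such as "@".
    user = quote(env["PGUSER"], safe="")
    password = quote(env["PGPASSWORD"], safe="")
    database = quote(env["PGDATABASE"], safe="")
    host = env["PGHOST"]
    port = env.get("PGPORT", "5432")  # assumption: default PostgreSQL port
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

With these helpers, `header = build_header()` and `psycopg.connect(build_dsn(), autocommit=True)` could replace the hardcoded placeholders in the Python tab.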

## Step 1: Create the vector store

Create a table for your text chunks and a vector index. The example matches the 1024-dimensional output of `BAAI/bge-m3`.

{% tabs %}
{% tab title="Python" %}

```python
# Python example: create the vector store
conn.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id          bigserial    PRIMARY KEY,
        content     text         NOT NULL,
        embedding   vector(1024) NOT NULL
    );
""")
conn.execute("""
    CREATE INDEX IF NOT EXISTS documents_embedding_idx
        ON documents
        USING hnsw (embedding vector_cosine_ops);
""")
```

{% endtab %}

{% tab title="Bash" %}

```bash
#!/bin/bash

psql <<'SQL'
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS documents (
    id        bigserial    PRIMARY KEY,
    content   text         NOT NULL,
    embedding vector(1024) NOT NULL
);

CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents
    USING hnsw (embedding vector_cosine_ops);
SQL
```

{% endtab %}
{% endtabs %}

{% hint style="info" %}
**Note:** If you choose a different embedding model, set the vector dimension to match the model's output size. For more information about each model's dimension, see [<mark style="color:blue;">Embedding Models</mark>](/cloud/ai/ai-model-hub/models/embedding-models.md).
{% endhint %}
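
If the embedding length does not match the column definition, PostgreSQL rejects the insert. A small client-side check can surface the mismatch earlier and with a clearer message. The sketch below is illustrative only: `check_dimension` and `EXPECTED_DIMENSION` are hypothetical names, and 1024 is taken from the `vector(1024)` column above.

```python
EXPECTED_DIMENSION = 1024  # must match vector(1024) in the documents table


def check_dimension(embedding, expected=EXPECTED_DIMENSION):
    """Return the embedding unchanged, or raise if its length is wrong."""
    if len(embedding) != expected:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, expected {expected}"
        )
    return embedding
```

Calling `check_dimension(item["embedding"])` before each insert would stop the script as soon as the model and the table definition disagree.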

## Step 2: Generate embeddings and insert chunks

Split your documents into chunks. For each chunk, generate an embedding through the AI Model Hub and insert both the chunk text and its embedding into the `documents` table.

{% hint style="info" %}
**Chunking:** Embedding a whole document as one vector loses detail. Split long texts into overlapping chunks of a few hundred tokens before embedding. A common starting point is 200–800 tokens per chunk with 10–20 percent overlap; tune this to your documents and your embedding model's context length.
{% endhint %}
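
The chunking advice above can be sketched as a sliding window. This example is a simplification, not a reference implementation: it counts whitespace-separated words as a rough stand-in for tokens, and the function name `chunk_text` and the defaults (300 words, 45 words of overlap, which is 15 percent) are assumptions chosen for illustration.

```python
def chunk_text(text, chunk_size=300, overlap=45):
    """Split text into overlapping chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

For production use, a tokenizer that matches the embedding model gives more accurate chunk sizes than word counts.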

{% tabs %}
{% tab title="Python" %}

```python
# Python example: Generate embeddings and insert chunks
import requests

EMBEDDINGS_ENDPOINT = "https://openai.inference.de-txl.ionos.com/v1/embeddings"
EMBEDDING_MODEL = "BAAI/bge-m3"

chunks = [
    "IONOS operates data centers in Germany.",
    "The AI Model Hub provides OpenAI-compatible inference endpoints.",
    "pgvector adds a vector data type and similarity operators to PostgreSQL.",
]

response = requests.post(
    EMBEDDINGS_ENDPOINT,
    json={"model": EMBEDDING_MODEL, "input": chunks},
    headers=header,
).json()

with conn.cursor() as cur:
    for chunk, item in zip(chunks, response["data"]):
        cur.execute(
            "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
            (chunk, item["embedding"]),
        )
```

{% endtab %}

{% tab title="Bash" %}

```bash
#!/bin/bash

EMBEDDING_MODEL=BAAI/bge-m3
CHUNK="IONOS operates data centers in Germany."

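# Build the request body with jq so that quotes in the chunk are escaped correctly.
BODY=$(jq -n --arg model "${EMBEDDING_MODEL}" --arg input "${CHUNK}" \
    '{model: $model, input: [$input]}')

EMBEDDING=$(curl -s -H "Authorization: Bearer ${IONOS_API_TOKEN}" \
     -H "Content-Type: application/json" \
     -d "${BODY}" \
     https://openai.inference.de-txl.ionos.com/v1/embeddings \
     | jq -c '.data[0].embedding')

# Pass the values as psql variables; :'name' inserts them as safely quoted literals.
psql -v content="${CHUNK}" -v embedding="${EMBEDDING}" <<'SQL'
INSERT INTO documents (content, embedding) VALUES (:'content', :'embedding');
SQL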
```

Repeat for each chunk. For larger workloads, batch multiple chunks into a single `/v1/embeddings` request and a single `INSERT ... VALUES (...), (...), (...);` statement.
{% endtab %}
{% endtabs %}
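
The batching idea mentioned above can be sketched in Python as follows. This is a hedged outline, not part of the AI Model Hub API: `batched` and `embed_and_insert` are hypothetical helpers, the batch size of 32 is an arbitrary assumption, and the two callables stand in for the embedding request and the database insert so that the logic stays independent of network and database access.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def embed_and_insert(chunks, embed_batch, insert_rows, batch_size=32):
    """Embed and store chunks batch by batch; return the number stored.

    embed_batch(texts) must return one embedding per text, and
    insert_rows(rows) receives a list of (text, embedding) tuples.
    Both are supplied by the caller.
    """
    stored = 0
    for batch in batched(chunks, batch_size):
        embeddings = embed_batch(batch)
        if len(embeddings) != len(batch):
            raise ValueError("embedding count does not match chunk count")
        insert_rows(list(zip(batch, embeddings)))
        stored += len(batch)
    return stored
```

In the context of this guide, `embed_batch` could wrap the `requests.post` call to `EMBEDDINGS_ENDPOINT`, and `insert_rows` could pass the rows to `cur.executemany` with the same `INSERT` statement as in the Python tab.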

## Step 3: Retrieve relevant chunks

To answer a user query, embed the query with the same model and run a cosine similarity search. The pgvector `<=>` operator returns cosine distance: lower values indicate higher similarity.

{% tabs %}
{% tab title="Python" %}

```python
# Python example: retrieve relevant chunks
USER_QUERY = "Where does IONOS run its AI infrastructure?"
TOP_K = 3

query_embedding = requests.post(
    EMBEDDINGS_ENDPOINT,
    json={"model": EMBEDDING_MODEL, "input": USER_QUERY},
    headers=header,
).json()["data"][0]["embedding"]

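# The explicit ::vector cast lets PostgreSQL resolve the <=> operator
# when the embedding is passed as a plain Python list.
rows = conn.execute(
    "SELECT content FROM documents ORDER BY embedding <=> %s::vector LIMIT %s",
    (query_embedding, TOP_K),
).fetchall()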

relevant_chunks = [row[0] for row in rows]
```

{% endtab %}

{% tab title="Bash" %}

```bash
#!/bin/bash

USER_QUERY="Where does IONOS run its AI infrastructure?"
TOP_K=3

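# Build the request body with jq so that quotes in the user query are escaped correctly.
QUERY_BODY=$(jq -n --arg model "${EMBEDDING_MODEL}" --arg input "${USER_QUERY}" \
    '{model: $model, input: [$input]}')

QUERY_EMBEDDING=$(curl -s -H "Authorization: Bearer ${IONOS_API_TOKEN}" \
     -H "Content-Type: application/json" \
     -d "${QUERY_BODY}" \
     https://openai.inference.de-txl.ionos.com/v1/embeddings \
     | jq -c '.data[0].embedding')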

RELEVANT_CHUNKS=$(psql -At -c \
    "SELECT content FROM documents ORDER BY embedding <=> '${QUERY_EMBEDDING}' LIMIT ${TOP_K};")
```

{% endtab %}
{% endtabs %}

This returns the `TOP_K` chunks most similar to the user query.
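
To make the ordering concrete, cosine distance is 1 minus the cosine similarity of two vectors, so it ranges from 0 (same direction) to 2 (opposite direction). The following pure-Python sketch reproduces that calculation and the top-k selection on a few toy vectors. It is meant only to illustrate what the query computes; the function names are hypothetical, and in practice PostgreSQL performs this work through the HNSW index.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, candidates, k):
    """Return the k (text, vector) pairs with the smallest distance to query."""
    ranked = sorted(candidates, key=lambda item: cosine_distance(query, item[1]))
    return ranked[:k]
```

Sorting ascending by distance mirrors `ORDER BY embedding <=> ... LIMIT ...` in the SQL query.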

## Step 4: Generate the final answer

Combine the retrieved chunks with the user query and send them to an LLM through the OpenAI-compatible `/v1/chat/completions` endpoint.

{% tabs %}
{% tab title="Python" %}

```python
# Python example: generate the final answer
CHAT_ENDPOINT = "https://openai.inference.de-txl.ionos.com/v1/chat/completions"
MODEL_ID = "[YOUR MODEL ID HERE]"

context = "\n\n".join(relevant_chunks)

answer = requests.post(
    CHAT_ENDPOINT,
    json={
        "model": MODEL_ID,
        "messages": [
            {
                "role": "system",
                "content": (
                    "Answer the question using only the provided context. "
                    "If the answer is not in the context, say that you do not know."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {USER_QUERY}",
            },
        ],
        "temperature": 0.1,
        "max_tokens": 500,
    },
    headers=header,
).json()

print(answer["choices"][0]["message"]["content"])
```

{% endtab %}

{% tab title="Bash" %}

```bash
#!/bin/bash

MODEL_ID="[YOUR MODEL ID HERE]"

BODY=$(jq -n \
    --arg model "${MODEL_ID}" \
    --arg ctx "${RELEVANT_CHUNKS}" \
    --arg q "${USER_QUERY}" \
    '{
        model: $model,
        messages: [
            {
                role: "system",
                content: "Answer the question using only the provided context. If the answer is not in the context, say that you do not know."
            },
            {
                role: "user",
                content: ("Context:\n" + $ctx + "\n\nQuestion: " + $q)
            }
        ],
        temperature: 0.1,
        max_tokens: 500
    }')

curl -s -H "Authorization: Bearer ${IONOS_API_TOKEN}" \
     -H "Content-Type: application/json" \
     -d "${BODY}" \
     https://openai.inference.de-txl.ionos.com/v1/chat/completions \
     | jq -r '.choices[0].message.content'
```

{% endtab %}
{% endtabs %}

The response is a standard OpenAI-compatible chat completion. The answer text is at `choices[0].message.content`.
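
Reading `choices[0].message.content` directly raises an unhelpful `KeyError` when the request fails. A defensive accessor can report the problem more clearly. The sketch below assumes that a failed request returns a JSON body without a `choices` list, which is typical for OpenAI-compatible APIs but is not confirmed by this guide; `extract_answer` is a hypothetical helper name.

```python
def extract_answer(completion):
    """Return the answer text from a chat completion response dict."""
    choices = completion.get("choices") or []
    if not choices:
        # Assumption: error responses carry no "choices" list.
        raise ValueError(f"no choices in response: {completion}")
    content = choices[0].get("message", {}).get("content")
    if content is None:
        raise ValueError("first choice has no message content")
    return content.strip()
```

In the Python tab, `print(extract_answer(answer))` could replace the direct dictionary access.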

{% hint style="info" %}
**Note:** The most effective prompt depends heavily on the LLM you use. Adapt the system and user messages to the specific model to improve results.
{% endhint %}
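
Keeping prompt construction in one function makes it easier to experiment with different system messages per model. The following sketch is one possible arrangement, not a recommended prompt: `build_messages` and `SYSTEM_PROMPT` are hypothetical names, the system message is copied from the example above, and numbering the chunks is an added assumption that may or may not help a given model.

```python
SYSTEM_PROMPT = (
    "Answer the question using only the provided context. "
    "If the answer is not in the context, say that you do not know."
)


def build_messages(chunks, question, system_prompt=SYSTEM_PROMPT):
    """Build chat messages with numbered context chunks."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

The returned list can be passed as the `messages` value of the chat completion request, with a different `system_prompt` argument for each model you evaluate.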

## Summary

In this guide, you learned how to use **IONOS CLOUD Managed PostgreSQL with pgvector** together with the **IONOS CLOUD AI Model Hub** to implement Retrieval Augmented Generation. Specifically, you learned how to:

1. Prepare a PostgreSQL table as a pgvector-backed vector store.
2. Generate embeddings through the AI Model Hub embedding API.
3. Run cosine similarity queries to retrieve the chunks most relevant to a user question.
4. Combine the retrieved context with an LLM to produce a final answer, all within <code class="expression">space.vars.ionos\_cloud</code>, with your data staying in Germany.

