Document Embeddings

The IONOS AI Model Hub API allows you to access vector databases to persist your document collections and find semantically similar documents.

The vector database persists documents in document collections. Each document is plain text. A document collection stores not only the input text but also its embedding, a vector of numbers derived from the text. Semantically similar input texts have similar embeddings. A similarity search on a document collection finds the embeddings most similar to a given input text and returns them together with the corresponding input texts.
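The idea that similar texts have similar embeddings can be illustrated with a small offline sketch. It assumes cosine similarity as the comparison measure, which this tutorial does not confirm for the IONOS vector database, and it uses made-up three-dimensional vectors in place of real embeddings. The names cosine_similarity and most_similar are hypothetical helpers, not part of the API.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def most_similar(query_vec, doc_vecs, limit):
    """Return the names of the `limit` documents closest to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:limit]]


# Made-up vectors for illustration; real embeddings have many more dimensions.
query = [1.0, 0.0, 1.0]
documents = {
    "billing": [0.9, 0.1, 1.1],   # points in nearly the same direction as the query
    "weather": [0.0, 1.0, 0.0],   # orthogonal to the query
}

print(most_similar(query, documents, limit=1))
```

In the actual service, the embeddings are computed and compared on the server side; you only send text and receive the matching documents.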

Overview

This tutorial is intended for developers. It assumes you have basic knowledge of:

  • REST APIs and how to call them

  • A programming language to handle REST API endpoints (for illustration purposes, this tutorial uses Python and Bash scripting)

By the end of this tutorial, you'll be able to:

  • Create, delete and query a document collection in the IONOS vector database

  • Save, delete and modify documents in the document collection and

  • Answer customer queries using the document collection.

Background

  • The IONOS AI Model Hub API offers a vector database that you can use to persist text in document collections without having to manage corresponding hardware yourself.

  • Our AI Model Hub API provides all required functionality without your data being transferred out of Germany.

Before you begin

To get started, you should open your IDE to enter Python code.

  1. Install required libraries

    You need to install the modules requests and pandas in your Python environment. The leading ! runs the command from a notebook cell; omit it when you install from a terminal:

     !pip install requests
     !pip install pandas
  2. Import required libraries

    You need to import the following modules:

     import requests
     import pandas as pd
     import base64
  3. Generate header for API requests

    Next, generate a header to authenticate yourself against the REST API:

     API_TOKEN = [YOUR API TOKEN HERE]
     header = {
         "Authorization": f"Bearer {API_TOKEN}", 
         "Content-Type": "application/json"
     }

After this step, you have installed all Python modules and have one variable, header, that you can use to access our vector database.
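If you prefer not to paste the token into your source code, one option is to read it from an environment variable. The following is a minimal sketch under that assumption; the variable name IONOS_API_TOKEN and the helper names build_header and header_from_env are hypothetical choices for this example, not something the API prescribes.

```python
import os


def build_header(token):
    """Build the request header used throughout this tutorial."""
    if not token or not token.strip():
        raise ValueError("API token is empty")
    return {
        "Authorization": f"Bearer {token.strip()}",
        "Content-Type": "application/json",
    }


def header_from_env(var_name="IONOS_API_TOKEN"):
    """Read the token from an environment variable (hypothetical name)."""
    return build_header(os.environ.get(var_name, ""))
```

With the variable set in your shell, header = header_from_env() yields the same dictionary as the snippet above.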

Manage document collections

In this section, you learn how to create a document collection. In the next step, you will fill this document collection with the data from your knowledge base.

To help you check whether something went wrong, this section also shows how to:

  • List existing document collections

  • Remove document collections

  • Get meta data of a document collection

Create document collections

  1. Create a document collection

To create a document collection, you have to specify the name of the collection and a description and invoke the endpoint to generate document collections:

COLLECTION_NAME = [ YOUR COLLECTION NAME HERE ]
COLLECTION_DESCRIPTION = [ YOUR COLLECTION DESCRIPTION HERE ]
endpoint = "https://inference.de-txl.ionos.com/collections"
body = {
    "properties": {
        "name": COLLECTION_NAME,
        "description": COLLECTION_DESCRIPTION
    }
}
result = requests.post(endpoint, json=body, headers=header)

If the creation of the document collection was successful, the status code of the request is 201 and it returns a JSON document with all relevant information concerning the document collection.

  2. Extract collection id from request result

To modify the document collection you need its identifier. You can extract it using:

result.json()['id']
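The two steps above can be combined so that the identifier is only read after the status code has been checked. This is a sketch, not the official client: build_collection_body and extract_collection_id are hypothetical helpers, and the SimpleNamespace object merely stands in for the result of requests.post so that the example runs offline.

```python
from types import SimpleNamespace


def build_collection_body(name, description):
    """Build the request body expected by the collection creation endpoint."""
    return {"properties": {"name": name, "description": description}}


def extract_collection_id(response):
    """Return the collection id, or raise if the creation did not succeed."""
    if response.status_code != 201:
        raise RuntimeError(
            f"collection was not created (status code {response.status_code})"
        )
    return response.json()["id"]


# Stand-in for a successful response; "example-id" is a made-up value.
fake_result = SimpleNamespace(status_code=201, json=lambda: {"id": "example-id"})
print(extract_collection_id(fake_result))
```

In the tutorial's flow, you would pass the result of the requests.post call to extract_collection_id instead of the stand-in object.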

List existing document collections

To ensure that the previous step went as expected, you can list the existing document collections.

  1. List all existing document collections

To retrieve a list of all document collections saved by you:

endpoint = "https://inference.de-txl.ionos.com/collections"
result = requests.get(endpoint, headers=header)

This query returns a JSON document consisting of your document collections and corresponding meta information.

  2. Convert list of document collections to a pandas dataframe

You can convert this JSON document to a human readable form using:

pd.json_normalize(result.json()['items'])

The result consists of 8 attributes of which 3 are relevant for you:

  • id: The identifier of the document collection

  • properties.description: The textual description of the document collection

  • properties.documentsCount: The number of documents persisted in the document collection

If you have not created a collection yet, the field items is an empty list.
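If you only need the three attributes listed above, they can also be read without pandas. The sketch below assumes the response shape implied by the attribute names (an items list whose entries carry id and a nested properties object); summarize_collections is a hypothetical helper and the sample payload is made up.

```python
def summarize_collections(payload):
    """Return id, description and document count for each listed collection."""
    rows = []
    for item in payload.get("items", []):
        properties = item.get("properties", {})
        rows.append({
            "id": item.get("id"),
            "description": properties.get("description"),
            "documentsCount": properties.get("documentsCount"),
        })
    return rows


# Made-up payload for illustration; the values are not real API output.
sample_payload = {
    "items": [
        {
            "id": "example-id-1",
            "properties": {
                "name": "faq",
                "description": "Customer FAQ",
                "documentsCount": 3,
            },
        },
    ]
}

for row in summarize_collections(sample_payload):
    print(row)
```

An empty items list simply produces an empty result, which matches the case where no collection has been created yet.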

Remove a document collection

If the list contains document collections you no longer need, you can remove a document collection by invoking:

COLLECTION_ID = [ YOUR COLLECTION ID HERE ]
endpoint = f"https://inference.de-txl.ionos.com/collections/{COLLECTION_ID}"
requests.delete(endpoint, headers=header)

This query returns a status code which indicates whether the deletion was successful:

  • 204: Status code for successful deletion

  • 404: Status code if the collection does not exist
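The two status codes can be turned into an explicit outcome so that anything else is reported instead of silently ignored. This is a minimal sketch; collection_was_deleted is a hypothetical helper, and treating every other status code as an error is a design choice of this example rather than documented API behavior.

```python
def collection_was_deleted(status_code):
    """Map the status code of the delete request to an outcome."""
    if status_code == 204:
        return True   # the collection existed and was removed
    if status_code == 404:
        return False  # there was no collection with this id
    raise RuntimeError(f"unexpected status code {status_code}")
```

You would call it with the status_code attribute of the object returned by requests.delete.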

Get meta data for a document collection

  1. Access meta data from a document collection

If you are interested in the meta data of a collection, you can extract it by invoking:

COLLECTION_ID = [ YOUR COLLECTION ID HERE ]
endpoint = f"https://inference.de-txl.ionos.com/collections/{COLLECTION_ID}"
result = requests.get(endpoint, headers=header)
result.status_code

This query returns a status code which indicates whether the collection exists:

  • 200: Status code if the collection exists

  • 404: Status code if the collection does not exist

  2. Extract collection meta data from request result

The body of the response consists of all meta data of the document collection.

result.json()

Manage documents in document collection

In this section, you learn how to add documents to the newly created document collection. To validate your insertion, this section also shows how to

  • List the documents in the document collection,

  • Get meta data for a document,

  • Update an existing document and

  • Prune a document collection.

Add documents to document collection

To add an entry to the document collection, you need to at least specify the content, the name of the content and the contentType:

COLLECTION_ID = [ YOUR COLLECTION ID HERE ]
CONTENT = [ YOUR CONTENT HERE ]
NAME = [ YOUR NAME HERE]
content_base64 = base64.b64encode(CONTENT.encode('utf-8')).decode("utf-8")
body = { 
    "items": [{ 
        "properties": { 
            "name": NAME, 
            "contentType": "text/plain", 
            "content": content_base64
        }
    }]
}
endpoint = f"https://inference.de-txl.ionos.com/collections/{COLLECTION_ID}/documents"
requests.put(endpoint, json=body, headers=header)

Note:

You need to encode your content using base64 prior to adding it to the document collection. This is done here in line 4 of the source code.

This request returns a status code 200 if adding the document to the document collection was successful.
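The encoding step and the request body can be wrapped in small helpers. The sketch below assumes that the items list may hold more than one document per request, which the list shape suggests but this tutorial does not state explicitly. The names encode_content and build_documents_body are hypothetical.

```python
import base64


def encode_content(text):
    """Encode plain text as a base64 string, as the endpoint expects."""
    return base64.b64encode(text.encode("utf-8")).decode("utf-8")


def build_documents_body(documents):
    """Build the request body from (name, text) pairs."""
    return {
        "items": [
            {
                "properties": {
                    "name": name,
                    "contentType": "text/plain",
                    "content": encode_content(text),
                }
            }
            for name, text in documents
        ]
    }


# Made-up documents for illustration.
body = build_documents_body([
    ("greeting", "hello"),
    ("farewell", "Viele Grüße"),
])
print(body["items"][0]["properties"]["content"])
```

The resulting body can be passed as the json argument of the requests.put call shown above.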

List existing documents in document collection

To ensure that the previous step went as expected, you can list the existing documents of your document collection.

  1. List all existing documents in a document collection

To retrieve a list of all documents in the document collection saved by you:

COLLECTION_ID = [ YOUR COLLECTION ID HERE ]
endpoint_coll = f"https://inference.de-txl.ionos.com/collections/{COLLECTION_ID}"
endpoint = f"{endpoint_coll}/documents"
result = requests.get(endpoint, headers=header)

This query returns a JSON document consisting of your documents in the document collection and corresponding meta information.

  2. Convert list of documents to a pandas dataframe

You can convert this JSON document to a pandas dataframe using:

pd.json_normalize(result.json()['items'])

The result consists of 10 attributes of which 5 are relevant for you:

  • id: The identifier of the document

  • properties.content: The base64 encoded content of the document

  • properties.name: The name of the document

  • properties.description: The description of the document

  • properties.labels.number_of_tokens: The number of tokens in the document

If you have not created the collection yet, the request will return a status code 404. It will return a JSON document with the field items set to an empty list if no documents were added yet.

Get meta data for a document

  1. Access meta data from a document

If you are interested in the metadata of a document, you can extract it by invoking:

COLLECTION_ID = [ YOUR COLLECTION ID HERE ]
DOCUMENT_ID = [ YOUR DOCUMENT ID HERE ]
endpoint_coll = f"https://inference.de-txl.ionos.com/collections/{COLLECTION_ID}"
endpoint = f"{endpoint_coll}/documents/{DOCUMENT_ID}"
result = requests.get(endpoint, headers=header)
result.status_code

This query returns a status code which indicates whether the document exists:

  • 200: Status code if the document exists

  • 404: Status code if the document does not exist

  2. Extract document meta data from request result

The body of the response consists of all meta data of the document.

result.json()

Update a document

If you want to update a document, invoke:

COLLECTION_ID = [ YOUR COLLECTION ID HERE ]
DOCUMENT_ID = [ YOUR DOCUMENT ID HERE ]
CONTENT = [ YOUR CONTENT HERE ]
NAME = [ YOUR NAME HERE ]
endpoint_coll = f"https://inference.de-txl.ionos.com/collections/{COLLECTION_ID}"
endpoint = f"{endpoint_coll}/documents/{DOCUMENT_ID}"
content_base64 = base64.b64encode(CONTENT.encode('utf-8')).decode("utf-8")
body = {
    "properties": {
        "id": DOCUMENT_ID,
        "name": NAME,
        "contentType": "text/plain",
        "content": content_base64
    }
}
requests.put(endpoint, json=body, headers=header)

This replaces the document with the given id in the document collection with the payload of this request.

Prune a document collection

If you want to remove all documents from a document collection, invoke:

COLLECTION_ID = [ YOUR COLLECTION ID HERE ]
endpoint_coll = f"https://inference.de-txl.ionos.com/collections/{COLLECTION_ID}"
endpoint = f"{endpoint_coll}/documents"
requests.delete(endpoint, headers=header)

This query returns the status code 204 if pruning the document collection was successful.

Query documents in the document collection

Finally, this section shows how to use the document collection and the contained documents to answer a user query.

  1. Retrieve documents relevant to the query

To retrieve the documents relevant for answering the user query, invoke the query endpoint as follows:

COLLECTION_ID = [ YOUR COLLECTION ID HERE ]
USER_QUERY = [ USER QUERY HERE ]
NUM_OF_DOCUMENTS = [ NUMBER OF DOCUMENTS TO CONSIDER HERE ]
endpoint = f"https://inference.de-txl.ionos.com/collections/{COLLECTION_ID}/query"
body = {"query": USER_QUERY, "limit": NUM_OF_DOCUMENTS }
relevant_documents = requests.post(endpoint, json=body, headers=header)

This will return a list of the NUM_OF_DOCUMENTS most relevant documents in your document collection for answering the user query.

  2. Decode Base64 encoded documents

Now, decode the retrieved documents back to strings using:

[
    base64.b64decode(entry['document']['properties']['content']).decode()
    for entry in relevant_documents.json()['properties']['matches']
]
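Both query steps can be tried offline against a hand-built payload. The sketch below mirrors the response path used in the list comprehension above (properties, matches, document, properties, content); the helper names build_query_body and decode_matches are hypothetical, the sample payload is made up, and the check that the limit is a positive integer is an assumption of this example rather than a documented constraint.

```python
import base64


def build_query_body(user_query, limit):
    """Build the body for the query endpoint."""
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return {"query": user_query, "limit": limit}


def decode_matches(payload):
    """Decode the base64 content of every match in a query response."""
    return [
        base64.b64decode(entry["document"]["properties"]["content"]).decode("utf-8")
        for entry in payload["properties"]["matches"]
    ]


def _encoded(text):
    """Helper for the sample payload only: base64-encode plain text."""
    return base64.b64encode(text.encode("utf-8")).decode("utf-8")


# Made-up response for illustration; a real response carries more fields.
sample_response = {
    "properties": {
        "matches": [
            {"document": {"properties": {"content": _encoded("Invoices are sent monthly.")}}},
            {"document": {"properties": {"content": _encoded("Support is open on weekdays.")}}},
        ]
    }
}

for text in decode_matches(sample_response):
    print(text)
```

With a real request, you would pass relevant_documents.json() to decode_matches.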

Summary

In this tutorial you learned how to use the IONOS AI Model Hub API to conduct semantic similarity searches using our vector database.

Namely, you learned how to:

  • Create the necessary document collection in the vector database and modify it

  • Insert your documents into the document collection and modify the documents

  • Conduct semantic similarity searches using your document collection.
