1 of 1

Intelligent Document Search with AI

AI Model Hub for Free: From December 1, 2024, to June 30, 2025, IONOS is offering all foundation models of the AI Model Hub for free. Create your contract now and get your AI journey started today!

Finding relevant documents in large collections is a common challenge. Traditional keyword-based searches often fail to capture the meaning behind queries, making it difficult to retrieve relevant results when different wording is used. AI-powered semantic search addresses this issue by understanding the meaning of text rather than relying solely on keyword matches. Various applications can benefit from semantic similarity search, including:

Knowledge Management: Enhances the ability to find relevant articles in knowledge bases.
Customer Support: Improves internal and external product documentation retrieval based on customer inquiries.
Market Research: Facilitates identification of competitor analysis reports with similar themes.
Project Management: Helps locate relevant project documentation across different organizations using a descriptive query.
Compliance and Auditing: Simplifies the search for documents related to regulatory requirements and audit procedures.

Overview

In this guide, we demonstrate how to automate semantic file search using two core components of the AI Model Hub:

A Document Collection to store all files and make them easily accessible.
A Large Language Model that generates responses based on content retrieved from the vector database.

By the end of this guide, you will have a functional semantic search system that allows you to find related documents based on meaning rather than exact keywords.

Example scenario

Let's assume we work for a fictional hosting company, "The Magnificent Hoster". We have several documents stored in our system:

1998-2008_history.txt: A plain text file documenting the company’s early years.
2009-2015_history.docx: A Word document detailing events from 2009 to 2015.
2016-today_history.pdf: A PDF covering recent company history.
awards.txt: A list of awards received and the corresponding years.
milestones.txt: A concise summary of the company’s history.

The goal is to read these documents, store them in a document collection, and use a Large Language Model to answer the query: What is the history of The Magnificent Hoster?

The answer provided by the LLM to this question could be:

Here are the historical milestones of The Magnificent Hoster:

* 1998: Founded by Maximilian "Max" Thompson as a small web hosting company.
* 2002: Launched HostMaster control panel for simplified hosting management.
* 2005: Introduced HostPro premium hosting solution for businesses.
* 2010: Launched CloudHost cloud-based hosting platform for scalability.
* 2016: Completed company rebranding effort for modernization and growth.

Notice that the output is generated using content from multiple files:

The key milestones originate from milestones.txt.
Additional details, such as the rebranding efforts, are taken from 2016-today_history.pdf.

Get started with semantic file search

To follow this tutorial, ensure you have:

Python 3.8 or higher installed on your machine,
The IONOS_API_TOKEN environment variable set with your authentication token.

Download the Python code and install the packages from requirements.txt to see it in action.

We will use Llama 3.1 8B as the Large Language Model and ChromaDB as the document collection data base engine in our example. This setup is effective for most use cases, but experimenting with different models and configurations can help optimize results.

Step 1: Extract Text from PDF and Word Files

Document collections support plain text only, so content from PDFs and Word documents must first be extracted. Python provides modules to handle this conversion.

Extract text from PDFs (documents.py):

with fitz.open(file_path) as doc:
    text = ''
    num_pages = len(doc)
    for page_num in range(num_pages):
        if max_pages is not None and page_num >= max_pages:
            break
        page = doc[page_num]
        text += page.get_text()

Extract content from Word documents (documents.py):

doc = Document(file_path)
text = ''
for para in doc.paragraphs:
    text += para.text + '\n'

Make sure to have the file_path variable pointing to your documents. After execution, the text variable will contain the extracted content in plain text format.

Step 2: Creating an Empty Document Collection

We assume your documents are already stored in your file system. To make files semantically searchable, first create a document collection (collection.py):

endpoint = "https://inference.de-txl.ionos.com/collections"
body = {
    "type": "collection",
    "properties": {
        "name": collection_name,
        "description": collection_description,
        "chunking": {
            "enabled": True,
            "strategy": {
                "config": {
                    "chunk_size": CHUNK_SIZE,
                    "chunk_overlap": CHUNK_OVERLAP
                }
            }
        },
        "embedding": {
            "model": EMBEDDING_MODEL
        },
        "engine": {
            "db_type": DATA_BACKEND
        }
    }
}
response = requests.post(endpoint, headers=HEADERS, json=body)

While collection_name and collection_description help you to identify the collection in the list of collections you generated, they do not impact the quality of the results. By adapting the parameters CHUNK_SIZE, CHUNK_OVERLAP, EMBEDDING_MODEL and DATA_BACKEND, you can influence the results. To find out more, see the tutorial on document collections.

If the creation of the document collection is successful, retrieve the Collection ID from the response:

collection_id = response.json()["id"]

This Collection ID is required for document storage and querying.

Step 3: Add Files to the Collection

Using the plain text you extracted in Step 1 and the document collection created in Step 2, upload your documents into the document collection (collection.py):

endpoint = f"https://inference.de-txl.ionos.com/collections/{collection_id}/documents"
body = {
    "type": "collection",
    "items": [
        {
            "type": "document",
            "properties": {
                "name": file_name,
                "contentType": "text/plain",
                "content": base64.b64encode(text.encode("utf-8")).decode("utf-8")
            }
        }
    ]
}
response = requests.put(endpoint, headers=HEADERS, json=body)

The content of the document has to be base64 encoded. This is because the vector collection only stores encoded content.

Step 4: Retrieve Relevant Documents

After all documents have been added to the document collection, the next step is a query for relevant documents. In our Python example, the query is defined in the variable query_string, e.g., "What is the history of The Magnificent Hoster?". To query the system and receive the most relevant documents (collection.py):

endpoint = f"https://inference.de-txl.ionos.com/collections/{collection_id}/query"
body = {"query": query_string, "limit": num_documents }
relevant_documents = requests.post(endpoint, json=body, headers=HEADERS)

relevant_docs = [
    {
        'file_name': entry['document']['properties']['name'],
        'content': base64.b64decode(entry['document']['properties']['content']).decode()
    } for entry in relevant_documents.json()['properties']['matches']
]

The variable num_documents specifies how many documents are returned. When selecting this value, you need to balance providing the Large Language Model with enough relevant context to generate an accurate response while avoiding overly long or costly prompts.

When extracting the most relevant content from the query result, we retrieve both the name of the document for user reference and its content. Since the content is base64 encoded, we must decode it into human-readable form before further processing.

Step 5: Query the Collection

Once relevant documents are retrieved, use a Large Language Model to generate an answer to the initial question (collection.py):

endpoint = "https://openai.inference.de-txl.ionos.com/v1/chat/completions"
print(f"The most relevant content is in files: {[entry['file_name'] for entry in relevant_docs]}")
prompt = [
    {"role": "system", "content": """
        Please use the information specified as context to answer the question.
        Formulate you answer in one sentence and be an honest AI. Answer in a list
        of five bullet points each starting with a year and the milestone in about 20 words.
        """},
    {"role": "system", "content": "; ".join([entry['content'] for entry in relevant_docs])},
    {"role": "user", "content": query_string}
]
body = {
    "model": model_name,
    "messages": prompt,
}
response = requests.post(endpoint, json=body, headers=HEADERS)

The first print command displays the names of the files identified as most relevant for the query_string by the document collection.

The prompt consists of three components:

Instruction: This entry guides the Large Language Model on how to generate the response. Two key instructions are:
1. The answer must be based on the provided context (from the second entry).
2. The response should be formatted as five bullet points, each starting with a year followed by a milestone, with approximately 20 words per point.
Context: A concatenation of the extracted content from the document collection, which serves as the factual basis for the answer.
Query: The query_string, which contains the actual question to be answered.

In this approach, the same query_string is used both for retrieving relevant documents from the collection and for generating the final answer. However, we keep the query separate from the formatting instructions. This separation is crucial — if the query and formatting instructions were combined, the document retrieval process might return results that are semantically similar to the instructions rather than focusing solely on relevant content.

The answer displayed at the beginning of this article was generated using this approach.

Step 6: Try It Yourself!

Now, it is up to you to use the code! Follow these steps to execute the pipeline on your machine. Download the source code, install dependencies, and run:

pip install -r requirements.txt
python src/main.py --query_string "What is the history of The Magnificent Hoster?" --input_path "input"

In this command, input_path is a folder in the root directory containing all documents that will serve as the knowledge base for querying. The query_string specifies the question to be answered.

After running the main script, you might see output like this:

Collection ID: e4be2d36-4db5-41c0-9897-3610742704ca
Processing: 1998-2008_history.txt
Document '1998-2008_history.txt' added to collection.
Processing: awards.txt
Document 'awards.txt' added to collection.
Processing: milestones.txt
Document 'milestones.txt' added to collection.
Processing: 2009-2015_history.docx
Document '2009-2015_history.docx' added to collection.
Processing: 2016-today_history.pdf
Document '2016-today_history.pdf' added to collection.
The most relevant content is in files: ['milestones.txt', '1998-2008_history.txt', '2016-today_history.pdf']
Query Result:
Here are the historical milestones of The Magnificent Hoster:
* 1998: Founded by Maximilian "Max" Thompson as a small web hosting company.
* 2002: Launched HostMaster control panel for simplified hosting management.
* 2005: Introduced HostPro premium hosting solution for businesses.
* 2010: Launched CloudHost cloud-based hosting platform for scalability.
* 2016: Completed company rebranding effort for modernization and growth.
Deleted collection: e4be2d36-4db5-41c0-9897-3610742704ca

Breakdown of the output:

Collection ID: Identifies the document collection created in Step 1.
Document Processing: Shows the progress of adding documents to the collection in Step 3.
Relevant Files: Lists the most relevant documents extracted in Step 4.
Final Output: Displays the response generated in Step 5.
Cleanup: The last line confirms the document collection has been deleted to avoid long-term storage costs.

Explore further

To deepen your understanding of the solution, experiment with modifications to analyze their impact. Here are some key areas to explore:

Modify the Query String: Adjust the query_string to observe how results change. Since the document collection only contains files related to The Magnificent Hoster, queries like “the history of our company” will return the same results.
Refine the Prompt (collections.py): Modify the output instructions to control the response format. Removing bullet point instructions, for example, will likely result in a full-text response. Adjusting the number of bullet points allows you to control the level of detail.
Expand the Input Data: Adding more documents to the input folder influences the retrieved results. If you include files about multiple companies, a query for “our company’s history” may no longer return relevant information.
Experiment with the Embedding Model: When creating the document collection, try different embedding models to analyze their effect on retrieving relevant documents.

In this tutorial, you learned how to build a system for searching and retrieving information from your files using the IONOS AI Model Hub API. Specifically, you:

Created a document collection
Extracted text from PDF and Word files
Stored the extracted text in the document collection
Implemented a querying mechanism to retrieve relevant information from the document collection
Generated answers using a Large Language Model

This approach enables you to unlock valuable insights hidden within your documents, even if they are spread across different systems. You can expand on this tutorial to develop more advanced applications tailored to your needs. Experiment with different models, parameters, and query strategies to optimize performance and enhance the user experience.

Want to take it a step further? Explore how AI-generated images can enrich text-based content in our next tutorial - Enriching Text with Generated Images!

Intelligent Document Search with AI

Knowledge Management: Enhances the ability to find relevant articles in knowledge bases.
Customer Support: Improves internal and external product documentation retrieval based on customer inquiries.
Market Research: Facilitates identification of competitor analysis reports with similar themes.
Project Management: Helps locate relevant project documentation across different organizations using a descriptive query.
Compliance and Auditing: Simplifies the search for documents related to regulatory requirements and audit procedures.

Overview

In this guide, we demonstrate how to automate semantic file search using two core components of the AI Model Hub:

A Document Collection to store all files and make them easily accessible.
A Large Language Model that generates responses based on content retrieved from the vector database.

By the end of this guide, you will have a functional semantic search system that allows you to find related documents based on meaning rather than exact keywords.

Example scenario

Let's assume we work for a fictional hosting company, "The Magnificent Hoster". We have several documents stored in our system:

1998-2008_history.txt: A plain text file documenting the company’s early years.
2009-2015_history.docx: A Word document detailing events from 2009 to 2015.
2016-today_history.pdf: A PDF covering recent company history.
awards.txt: A list of awards received and the corresponding years.
milestones.txt: A concise summary of the company’s history.

The goal is to read these documents, store them in a document collection, and use a Large Language Model to answer the query: What is the history of The Magnificent Hoster?

The answer provided by the LLM to this question could be:

Here are the historical milestones of The Magnificent Hoster:

* 1998: Founded by Maximilian "Max" Thompson as a small web hosting company.
* 2002: Launched HostMaster control panel for simplified hosting management.
* 2005: Introduced HostPro premium hosting solution for businesses.
* 2010: Launched CloudHost cloud-based hosting platform for scalability.
* 2016: Completed company rebranding effort for modernization and growth.

Notice that the output is generated using content from multiple files:

The key milestones originate from milestones.txt.
Additional details, such as the rebranding efforts, are taken from 2016-today_history.pdf.

Get started with semantic file search

To follow this tutorial, ensure you have:

Python 3.8 or higher installed on your machine,
The IONOS_API_TOKEN environment variable set with your authentication token.

Download the Python code and install the packages from requirements.txt to see it in action.

311KB

ai-model-hub-semantic-file-search.zip

Step 1: Extract Text from PDF and Word Files

Document collections support plain text only, so content from PDFs and Word documents must first be extracted. Python provides modules to handle this conversion.

Extract text from PDFs (documents.py):

with fitz.open(file_path) as doc:
    text = ''
    num_pages = len(doc)
    for page_num in range(num_pages):
        if max_pages is not None and page_num >= max_pages:
            break
        page = doc[page_num]
        text += page.get_text()

Extract content from Word documents (documents.py):

doc = Document(file_path)
text = ''
for para in doc.paragraphs:
    text += para.text + '\n'

Make sure to have the file_path variable pointing to your documents. After execution, the text variable will contain the extracted content in plain text format.

Step 2: Creating an Empty Document Collection

We assume your documents are already stored in your file system. To make files semantically searchable, first create a document collection (collection.py):

endpoint = "https://inference.de-txl.ionos.com/collections"
body = {
    "type": "collection",
    "properties": {
        "name": collection_name,
        "description": collection_description,
        "chunking": {
            "enabled": True,
            "strategy": {
                "config": {
                    "chunk_size": CHUNK_SIZE,
                    "chunk_overlap": CHUNK_OVERLAP
                }
            }
        },
        "embedding": {
            "model": EMBEDDING_MODEL
        },
        "engine": {
            "db_type": DATA_BACKEND
        }
    }
}
response = requests.post(endpoint, headers=HEADERS, json=body)

If the creation of the document collection is successful, retrieve the Collection ID from the response:

collection_id = response.json()["id"]

This Collection ID is required for document storage and querying.

Step 3: Add Files to the Collection

Using the plain text you extracted in Step 1 and the document collection created in Step 2, upload your documents into the document collection (collection.py):

endpoint = f"https://inference.de-txl.ionos.com/collections/{collection_id}/documents"
body = {
    "type": "collection",
    "items": [
        {
            "type": "document",
            "properties": {
                "name": file_name,
                "contentType": "text/plain",
                "content": base64.b64encode(text.encode("utf-8")).decode("utf-8")
            }
        }
    ]
}
response = requests.put(endpoint, headers=HEADERS, json=body)

The content of the document has to be base64 encoded. This is because the vector collection only stores encoded content.

Step 4: Retrieve Relevant Documents

endpoint = f"https://inference.de-txl.ionos.com/collections/{collection_id}/query"
body = {"query": query_string, "limit": num_documents }
relevant_documents = requests.post(endpoint, json=body, headers=HEADERS)

relevant_docs = [
    {
        'file_name': entry['document']['properties']['name'],
        'content': base64.b64decode(entry['document']['properties']['content']).decode()
    } for entry in relevant_documents.json()['properties']['matches']
]

Step 5: Query the Collection

Once relevant documents are retrieved, use a Large Language Model to generate an answer to the initial question (collection.py):

endpoint = "https://openai.inference.de-txl.ionos.com/v1/chat/completions"
print(f"The most relevant content is in files: {[entry['file_name'] for entry in relevant_docs]}")
prompt = [
    {"role": "system", "content": """
        Please use the information specified as context to answer the question.
        Formulate you answer in one sentence and be an honest AI. Answer in a list
        of five bullet points each starting with a year and the milestone in about 20 words.
        """},
    {"role": "system", "content": "; ".join([entry['content'] for entry in relevant_docs])},
    {"role": "user", "content": query_string}
]
body = {
    "model": model_name,
    "messages": prompt,
}
response = requests.post(endpoint, json=body, headers=HEADERS)

The first print command displays the names of the files identified as most relevant for the query_string by the document collection.

The prompt consists of three components:

Instruction: This entry guides the Large Language Model on how to generate the response. Two key instructions are:
1. The answer must be based on the provided context (from the second entry).
2. The response should be formatted as five bullet points, each starting with a year followed by a milestone, with approximately 20 words per point.
Context: A concatenation of the extracted content from the document collection, which serves as the factual basis for the answer.
Query: The query_string, which contains the actual question to be answered.

The answer displayed at the beginning of this article was generated using this approach.

Step 6: Try It Yourself!

Now, it is up to you to use the code! Follow these steps to execute the pipeline on your machine. Download the source code, install dependencies, and run:

pip install -r requirements.txt
python src/main.py --query_string "What is the history of The Magnificent Hoster?" --input_path "input"

After running the main script, you might see output like this:

Collection ID: e4be2d36-4db5-41c0-9897-3610742704ca
Processing: 1998-2008_history.txt
Document '1998-2008_history.txt' added to collection.
Processing: awards.txt
Document 'awards.txt' added to collection.
Processing: milestones.txt
Document 'milestones.txt' added to collection.
Processing: 2009-2015_history.docx
Document '2009-2015_history.docx' added to collection.
Processing: 2016-today_history.pdf
Document '2016-today_history.pdf' added to collection.
The most relevant content is in files: ['milestones.txt', '1998-2008_history.txt', '2016-today_history.pdf']
Query Result:
Here are the historical milestones of The Magnificent Hoster:
* 1998: Founded by Maximilian "Max" Thompson as a small web hosting company.
* 2002: Launched HostMaster control panel for simplified hosting management.
* 2005: Introduced HostPro premium hosting solution for businesses.
* 2010: Launched CloudHost cloud-based hosting platform for scalability.
* 2016: Completed company rebranding effort for modernization and growth.
Deleted collection: e4be2d36-4db5-41c0-9897-3610742704ca

Breakdown of the output:

Collection ID: Identifies the document collection created in Step 1.
Document Processing: Shows the progress of adding documents to the collection in Step 3.
Relevant Files: Lists the most relevant documents extracted in Step 4.
Final Output: Displays the response generated in Step 5.
Cleanup: The last line confirms the document collection has been deleted to avoid long-term storage costs.

Explore further

To deepen your understanding of the solution, experiment with modifications to analyze their impact. Here are some key areas to explore:

Modify the Query String: Adjust the query_string to observe how results change. Since the document collection only contains files related to The Magnificent Hoster, queries like “the history of our company” will return the same results.
Refine the Prompt (collections.py): Modify the output instructions to control the response format. Removing bullet point instructions, for example, will likely result in a full-text response. Adjusting the number of bullet points allows you to control the level of detail.
Expand the Input Data: Adding more documents to the input folder influences the retrieved results. If you include files about multiple companies, a query for “our company’s history” may no longer return relevant information.
Experiment with the Embedding Model: When creating the document collection, try different embedding models to analyze their effect on retrieving relevant documents.

In this tutorial, you learned how to build a system for searching and retrieving information from your files using the IONOS AI Model Hub API. Specifically, you:

Created a document collection
Extracted text from PDF and Word files
Stored the extracted text in the document collection
Implemented a querying mechanism to retrieve relevant information from the document collection
Generated answers using a Large Language Model

Want to take it a step further? Explore how AI-generated images can enrich text-based content in our next tutorial - Enriching Text with Generated Images!