# Intelligent Document Search with AI

Finding relevant documents in large collections is a common challenge. Traditional keyword-based searches often fail to capture the meaning behind queries, making it difficult to retrieve relevant results when different wording is used. AI-powered semantic search addresses this issue by understanding the meaning of text rather than relying solely on keyword matches. Various applications can benefit from semantic similarity search, including:

* **Knowledge Management**: Enhances the ability to find relevant articles in knowledge bases.
* **Customer Support**: Improves internal and external product documentation retrieval based on customer inquiries.
* **Market Research**: Facilitates identification of competitor analysis reports with similar themes.
* **Project Management**: Helps locate relevant project documentation across different organizations using a descriptive query.
* **Compliance and Auditing**: Simplifies the search for documents related to regulatory requirements and audit procedures.

## Overview

In this guide, we demonstrate how to automate semantic file search using two core components of the AI Model Hub:

* A [<mark style="color:blue;">Document Collection</mark>](https://docs.ionos.com/sections-test/guides/ai/ai-model-hub/how-tos/document-collections) to store all files and make them easily retrievable.
* A [<mark style="color:blue;">Large Language Model</mark>](https://docs.ionos.com/sections-test/guides/ai/ai-model-hub/how-tos/text-generation) that generates responses based on content retrieved from the vector database.

By the end of this guide, you will have a functional semantic search system that allows you to find related documents based on meaning rather than exact keywords.

## Example Scenario

Let's assume we work for a fictional hosting company, **"The Magnificent Hoster"**. We have several documents stored in our system:

* **1998-2008\_history.txt**: A plain text file documenting the company’s early years.
* **2009-2015\_history.docx**: A Word document detailing events from 2009 to 2015.
* **2016-today\_history.pdf**: A PDF covering recent company history.
* **awards.txt**: A list of awards received and the corresponding years.
* **milestones.txt**: A concise summary of the company’s history.

The goal is to read these documents, store them in a document collection, and use a Large Language Model to answer the query: *What is the history of The Magnificent Hoster?*

The answer provided by the LLM to this question could be:

```bash
Here are the historical milestones of The Magnificent Hoster:

* 1998: Founded by Maximilian "Max" Thompson as a small web hosting company.
* 2002: Launched HostMaster control panel for simplified hosting management.
* 2005: Introduced HostPro premium hosting solution for businesses.
* 2010: Launched CloudHost cloud-based hosting platform for scalability.
* 2016: Completed company rebranding effort for modernization and growth.
```

Notice that the output is generated using content from multiple files:

* The key milestones originate from **milestones.txt**.
* Additional details, such as the rebranding efforts, are taken from **2016-today\_history.pdf**.

## Get Started with Semantic File Search

To follow this guide, ensure you have:

* Python 3.8 or higher installed on your machine,
* The **IONOS\_API\_TOKEN** environment variable set with your [<mark style="color:blue;">authentication token</mark>](https://docs.ionos.com/sections-test/guides/ai/ai-model-hub/how-tos/access-management).

{% tabs %}
{% tab title="Python Code" %}
Download the Python code and install the packages from requirements.txt to see it in action.

{% file src="https://1737632334-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MifAzdGvKLDTtvJP8sm%2Fuploads%2Fgit-blob-f643d1b0198fe6854bdab1f3e7ecb412300fd111%2Fai-model-hub-semantic-file-search.zip?alt=media&token=af98f7ef-a4e3-4333-8f03-efe8885bcb77" %}
{% endtab %}
{% endtabs %}

We will use **Llama 3.1 8B** as the Large Language Model and **ChromaDB** as the document collection database engine in our example. This setup works well for most use cases, but experimenting with different models and configurations can help optimize results.

### Step 1: Extract Text from PDF and Word Files

Document collections support plain text only, so content from PDFs and Word documents must first be extracted. Python provides modules to handle this conversion.

Extract text from PDFs `(documents.py)`:

```python
import fitz  # PyMuPDF (install with: pip install PyMuPDF)

with fitz.open(file_path) as doc:
    text = ''
    num_pages = len(doc)
    for page_num in range(num_pages):
        if max_pages is not None and page_num >= max_pages:
            break
        page = doc[page_num]
        text += page.get_text()
```

Extract content from Word documents `(documents.py)`:

```python
from docx import Document  # python-docx (install with: pip install python-docx)

doc = Document(file_path)
text = ''
for para in doc.paragraphs:
    text += para.text + '\n'
```

Make sure to have the **file\_path** variable pointing to your documents. After execution, the **text** variable will contain the extracted content in plain text format.
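The two snippets above can sit behind a small dispatcher that picks the right extractor based on the file extension. The helper below is a hypothetical sketch, not part of the downloaded code; the `.pdf` and `.docx` handlers would be functions built from the snippets above:

```python
from pathlib import Path

def extract_text(file_path, handlers):
    """Return the plain-text content of `file_path`.

    `handlers` maps a lowercase file extension (e.g. '.pdf', '.docx')
    to a function that takes the path and returns extracted text.
    Plain .txt files are read directly.
    """
    path = Path(file_path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        return path.read_text(encoding="utf-8")
    if suffix in handlers:
        return handlers[suffix](file_path)
    raise ValueError(f"Unsupported file type: {suffix}")
```

With a dispatcher like this, the main loop can treat all five example files uniformly.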

### Step 2: Create an Empty Document Collection

We assume your documents are already stored in your file system. To make files semantically searchable, first create a document collection `(collection.py)`:

```python
endpoint = "https://inference.de-txl.ionos.com/collections"
body = {
    "type": "collection",
    "properties": {
        "name": collection_name,
        "description": collection_description,
        "chunking": {
            "enabled": True,
            "strategy": {
                "config": {
                    "chunk_size": CHUNK_SIZE,
                    "chunk_overlap": CHUNK_OVERLAP
                }
            }
        },
        "embedding": {
            "model": EMBEDDING_MODEL
        },
        "engine": {
            "db_type": DATA_BACKEND
        }
    }
}
response = requests.post(endpoint, headers=HEADERS, json=body)
```

While **collection\_name** and **collection\_description** help you identify the collection in the list of collections you generated, they do not impact the quality of the results. By adapting the parameters **CHUNK\_SIZE**, **CHUNK\_OVERLAP**, **EMBEDDING\_MODEL**, and **DATA\_BACKEND**, you can influence the results. To learn more, see the guide on [<mark style="color:blue;">Document Collections</mark>](https://docs.ionos.com/sections-test/guides/ai/ai-model-hub/how-tos/document-collections).
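To build intuition for what **CHUNK\_SIZE** and **CHUNK\_OVERLAP** control, the sketch below splits a string into overlapping character windows. This is only an illustration: the service's actual chunking strategy may be token- or sentence-based rather than character-based.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split `text` into windows of at most `chunk_size` characters,
    where consecutive windows share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("The Magnificent Hoster was founded in 1998.", 20, 5)
```

A larger overlap reduces the chance that a relevant sentence is cut in half at a chunk boundary, at the cost of storing more redundant text in the collection.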

If the creation of the document collection is successful, retrieve the **Collection ID** from the response:

```python
collection_id = response.json()["id"]
```

This **Collection ID** is required for document storage and querying.

### Step 3: Add Files to the Collection

Using the plain text you extracted in Step 1 and the document collection created in Step 2, upload your documents into the document collection `(collection.py)`:

```python
endpoint = f"https://inference.de-txl.ionos.com/collections/{collection_id}/documents"
body = {
    "type": "collection",
    "items": [
        {
            "type": "document",
            "properties": {
                "name": file_name,
                "contentType": "text/plain",
                "content": base64.b64encode(text.encode("utf-8")).decode("utf-8")
            }
        }
    ]
}
response = requests.put(endpoint, headers=HEADERS, json=body)
```

The **content** of the document must be **base64** encoded because the document collection stores only encoded content.
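The encode/decode round trip looks like this:

```python
import base64

text = "The Magnificent Hoster was founded in 1998."
# Encode for upload: bytes -> base64 bytes -> ASCII string for the JSON body
encoded = base64.b64encode(text.encode("utf-8")).decode("utf-8")
# Decode on retrieval, as done with the query matches in Step 4
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == text
```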

### Step 4: Retrieve Relevant Documents

After all documents have been added to the document collection, the next step is to query it for relevant documents. In our Python example, the query is defined in the variable **query\_string**, e.g., "What is the history of The Magnificent Hoster?". To query the system and receive the most relevant documents `(collection.py)`:

```python
endpoint = f"https://inference.de-txl.ionos.com/collections/{collection_id}/query"
body = {"query": query_string, "limit": num_documents }
relevant_documents = requests.post(endpoint, json=body, headers=HEADERS)

relevant_docs = [
    {
        'file_name': entry['document']['properties']['name'],
        'content': base64.b64decode(entry['document']['properties']['content']).decode()
    } for entry in relevant_documents.json()['properties']['matches']
]
```

The variable **num\_documents** specifies how many documents are returned. When selecting this value, you balance providing the Large Language Model with enough relevant context to generate an accurate response while avoiding overly long or costly prompts.

When extracting the most relevant content from the query result, we retrieve both the **name** of the document for user reference and its **content**. Since the **content** is **base64** encoded, we must decode it into human-readable form before further processing.

### Step 5: Generate the Answer with the Large Language Model

Once relevant documents are retrieved, use a Large Language Model to generate an answer to the initial question `(collection.py)`:

```python
endpoint = "https://openai.inference.de-txl.ionos.com/v1/chat/completions"
print(f"The most relevant content is in files: {[entry['file_name'] for entry in relevant_docs]}")
prompt = [
    {"role": "system", "content": """
        Please use the information specified as context to answer the question.
        Formulate your answer in one sentence and be an honest AI. Answer in a list
        of five bullet points each starting with a year and the milestone in about 20 words.
        """},
    {"role": "system", "content": "; ".join([entry['content'] for entry in relevant_docs])},
    {"role": "user", "content": query_string}
]
body = {
    "model": model_name,
    "messages": prompt,
}
response = requests.post(endpoint, json=body, headers=HEADERS)
```

The **print** statement displays the names of the files that the document collection identified as most relevant for the **query\_string**.

The prompt consists of three components:

* **Instruction**: This entry guides the Large Language Model on how to generate the response. Two key instructions are:
  1. The answer must be based on the provided context (from the second entry).
  2. The response should be formatted as five bullet points, each starting with a year followed by a milestone, with approximately 20 words per point.
* **Context**: A concatenation of the extracted content from the document collection, which serves as the factual basis for the answer.
* **Query**: The query\_string, which contains the actual question to be answered.

In this approach, the same **query\_string** is used both for retrieving relevant documents from the collection and for generating the final answer. However, we keep the query separate from the formatting instructions. This separation is crucial: if the query and formatting instructions were combined, the document retrieval process might return results that are semantically similar to the instructions rather than focusing solely on relevant content.

The answer displayed at the beginning of this article was generated using this approach.

### Step 6: Try It Yourself!

Now, it is up to you to use the code! Follow these steps to execute the pipeline on your machine. Download the source code, install dependencies, and run:

```bash
pip install -r requirements.txt
python src/main.py --query_string "What is the history of The Magnificent Hoster?" --input_path "input"
```

In this command, **input\_path** is a folder in the root directory containing all documents that will serve as the knowledge base for querying. The **query\_string** specifies the question to be answered.
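A minimal sketch of how these flags could be parsed with `argparse` is shown below. The downloaded `src/main.py` is the authoritative version, so the exact argument names and defaults here are assumptions inferred from the command above:

```python
import argparse

def build_parser():
    # Hypothetical reconstruction of the CLI used by src/main.py;
    # check the downloaded code for the authoritative version.
    parser = argparse.ArgumentParser(description="Semantic file search")
    parser.add_argument("--query_string", required=True,
                        help="Question to answer from the documents")
    parser.add_argument("--input_path", default="input",
                        help="Folder containing the knowledge-base files")
    return parser
```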

After running the main script, you might see output like this:

```bash
Collection ID: e4be2d36-4db5-41c0-9897-3610742704ca
Processing: 1998-2008_history.txt
Document '1998-2008_history.txt' added to collection.
Processing: awards.txt
Document 'awards.txt' added to collection.
Processing: milestones.txt
Document 'milestones.txt' added to collection.
Processing: 2009-2015_history.docx
Document '2009-2015_history.docx' added to collection.
Processing: 2016-today_history.pdf
Document '2016-today_history.pdf' added to collection.
The most relevant content is in files: ['milestones.txt', '1998-2008_history.txt', '2016-today_history.pdf']
Query Result:
Here are the historical milestones of The Magnificent Hoster:
* 1998: Founded by Maximilian "Max" Thompson as a small web hosting company.
* 2002: Launched HostMaster control panel for simplified hosting management.
* 2005: Introduced HostPro premium hosting solution for businesses.
* 2010: Launched CloudHost cloud-based hosting platform for scalability.
* 2016: Completed company rebranding effort for modernization and growth.
Deleted collection: e4be2d36-4db5-41c0-9897-3610742704ca
```

**Breakdown of the output:**

* **Collection ID**: Identifies the document collection created in Step 2.
* **Document Processing**: Shows the progress of adding documents to the collection in Step 3.
* **Relevant Files**: Lists the most relevant documents extracted in Step 4.
* **Final Output**: Displays the response generated in Step 5.
* **Cleanup**: The last line confirms the document collection has been deleted to avoid long-term storage costs.
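The cleanup step itself is not shown in the code excerpts above. Assuming the collection resource follows the same REST pattern as the create and query endpoints, a deletion sketch could look like this (the `DELETE` route is an assumption, not confirmed by this guide; verify it against the downloaded code):

```python
BASE_URL = "https://inference.de-txl.ionos.com/collections"

def collection_endpoint(collection_id):
    # Same base URL as the create and query calls in the earlier steps.
    return f"{BASE_URL}/{collection_id}"

def delete_collection(collection_id, headers):
    """Delete the collection to avoid long-term storage costs.

    The DELETE verb on the collection resource is inferred from the
    REST pattern of the earlier steps.
    """
    import requests  # same HTTP client used throughout this guide

    response = requests.delete(collection_endpoint(collection_id), headers=headers)
    response.raise_for_status()
    return response
```

Skipping this step leaves the collection, and its storage costs, in place.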

### Explore Further

To deepen your understanding of the solution, experiment with modifications to analyze their impact. Here are some key areas to explore:

* **Modify the Query String**: Adjust the **query\_string** to observe how results change. Since the document collection only contains files related to The Magnificent Hoster, queries like “the history of our company” will return the same results.
* **Refine the Prompt (collection.py)**: Modify the output instructions to control the response format. Removing the bullet point instructions, for example, will likely result in a full-text response. Adjusting the number of bullet points allows you to control the level of detail.
* **Expand the Input Data**: Adding more documents to the input folder influences the retrieved results. If you include files about multiple companies, a query for “our company’s history” may no longer return relevant information.
* **Experiment with the Embedding Model**: When creating the document collection, try different embedding models to analyze their effect on retrieving relevant documents.

***

In this guide, you learned how to build a system for searching and retrieving information from your files using the **IONOS AI Model Hub API**. Specifically, you:

* Created a **document collection**
* Extracted text from **PDF** and **Word** files
* Stored the extracted text in the **document collection**
* Implemented a **querying mechanism** to retrieve relevant information from the **document collection**
* Generated **answers** using a **Large Language Model**

This approach enables you to unlock valuable insights hidden within your documents, even if they are spread across different systems. You can expand on this guide to develop more advanced applications tailored to your needs. Experiment with different models, parameters, and query strategies to optimize performance, and enhance the user experience.

Want to take it a step further? Explore how AI-generated images can enrich text-based content in our guide - [<mark style="color:blue;">Enriching Text with Generated Images</mark>](https://docs.ionos.com/sections-test/guides/ai/ai-model-hub/how-tos/enrich-generated-images)!
