# Data Annotation

{% hint style="info" %}
**Fair-use limits:** To ensure a reliable experience for all users, the AI Model Studio is subject to the following fair-use limits:

* **Parallel training:** Usage is limited to one active training process per user.
* **Inference capacity:** Limited to a maximum of three parallel queries per user.
* **Synthetic data generation:** Generation is limited to a maximum of 100 units per operation, with one parallel generation process allowed.

To increase the limits, contact [<mark style="color:blue;">IONOS Cloud Support</mark>](https://docs.ionos.com/cloud/support/).
{% endhint %}

The quality of your fine-tuned model begins and ends with the quality of your training data. This guide outlines the core principles of data annotation, provides task-specific specifications, and recommends powerful, free, open-source tools to help you create a world-class dataset for our fine-tuning service.

## Building the ground truth

When fine-tuning a model using the AI Model Studio for a specific task, whether classifying customer emails or detecting product anomalies, the model needs a precise, reliable "answer key". This answer key is your **annotated dataset**, often referred to as the **Ground Truth**.

{% hint style="info" %}
**Ground Truth Defined:** This is the ideal, human-verified output for every input in your training data. It is the gold standard that your fine-tuned model will strive to replicate.
{% endhint %}

AI Model Studio is a powerful engine designed to learn complex patterns from this data. However, it cannot correct human errors or resolve ambiguity. While some cases can be realized using synthetic data, **high-quality annotation is the single most critical factor** determining your model’s accuracy and effectiveness in real-world scenarios and key for successfully implementing complex use cases.

## Core principles

High-quality annotation is a structured process that relies on a clear rulebook. Before you begin labeling, you must define an **Annotation Schema**, a set of exhaustive, unambiguous guidelines.

| **Principle**                          | **Why It Matters for Your Business Case**                                                                                                                                                                             | **How to Ensure It**                                                                                                                                    |
| -------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Consistency**                        | If the same invoice or the same type of customer email receives different labels from different people, the model will fail to automate your back-office processes reliably.                                          | Train all annotators (even a single data scientist) using the same documented schema. Use the same labels for the same outcomes every time.             |
| **Clarity & Specificity**              | Ambiguous rules lead to ambiguous data. A lack of rules for edge cases, for example, a "mixed-sentiment" email, results in a model that cannot handle real-world complexity.                                          | Your schema must define every label, every boundary, and provide examples of *what to do* when something is unclear.                                    |
| **The "Assistant" Role is the Answer** | The final `assistant` message in your JSON is the model’s target behavior. It must be the **exact, desired output**, whether it’s a full conversation reply, a simple classification tag, or a structured data block. | The final labeled data must be immediately usable by your business process without further cleanup.                                                     |
| **Inter-Annotator Agreement (IAA)**    | IAA is the benchmark for your rulebook's quality. If your experts consistently disagree on labels, your guidelines are flawed, and your data quality is low.                                                          | Measure the agreement between 2-3 annotators on a small sample. If agreement is below $80%$, refine your guidelines *before* labeling the full dataset. |

## Annotation specifications by use case

All data must be formatted as **JSON Lines**, containing an array of `messages`, each with a `role` and `content`.

### Text Automation Cases

The `system` context defines the AI's job; the `user` content provides the input data; and the `assistant` content is the correct, desired output.

| **Use Case**              | **Annotation Focus**                                                                                                            | **Example** **Data**                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Dialogue / Chatbot**    | Full, natural conversation flow; setting persona in the `system` message.                                                       | `{"messages":[{"role":"system","content":"You are a helpful customer support assistant."},{"role":"user","content":"I need help with my order #12345. It hasn't arrived yet."},{"role":"assistant","content":"I'd be happy to help you track order #12345. Let me check the shipping status for you."}]}`                                                                                                                                                                                                                                                   |
| **Email Classification**  | The model's final response *is* the classification. This is often a simple string.                                              | `{"messages":[{"role":"system","content":"You are an expert email classification system. Your response must be only one of the following: 'SUPPORT_REQUEST', 'SALES_INQUIRY', or 'GENERAL_FEEDBACK'."},{"role":"user","content":"Hi, can you tell me more about the pricing for your pro plan? Thanks!"},{"role":"assistant","content":"Classification: SALES_INQUIRY"}]}`                                                                                                                                                                                  |
| **Structured Extraction** | The model is trained to extract key-value pairs, for example, from an invoice or text body, and output a predictable structure. | `{"messages":[{"role":"system","content":"You are a data extraction bot. Extract the invoice ID, total amount with currency, and due date from the provided text. Respond only with a valid JSON object."},{"role":"user","content":"Invoice: Thank you for your business. Invoice #4567 is attached. The total due is 129.99 EUR and must be paid by November 15, 2025."},{"role":"assistant","content":"{ 'invoice_id': '4567', 'amount': '129.99 EUR', 'due_date': '2025-11-15' }"}]}`                                                                   |
| **Summarization**         | The assistant role contains the human-written, high-quality summary of the user's input text.                                   | `{"messages":[{"role":"system","content":"You are a text summarization tool. Your task is to provide a concise, neutral summary of the customer's feedback."},{"role":"user","content":"The product quality was excellent—sturdy and well-made. However, the shipping took three weeks longer than promised, which was very frustrating. I would recommend the product, but warn others about the slow delivery."},{"role":"assistant","content":"The customer experienced excellent product quality but was dissatisfied with the long shipping time."}]}` |

### Multimodal (Image and Text)

For any task involving an image, such as quality control, anomaly detection, or visual question answering, the image must be integrated as a `Base64` encoded string. Additionally, the input data type must be defined for each message.

```json
{"messages": [{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant converting images to CAD code."}]}, {"role": "user", "content": [{"type": "text", "text": "Generate the CADQuery code needed to create the CAD for the provided image. Just the code, no other words."}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}]}, {"role": "assistant", "content": [{"type": "text", "text": "code"}]}]}
```

Therefore, typical use cases that require multi-modal fine-tuning look like this. Keep in mind that complex use cases may require labeling using bound boxes as used in some examples.

| **Use Case**                           | **Annotation** **Focus**                                                                                                           | **Example** **Data**                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| -------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **CAD Code Generation**                | Including a CAD image and providing the desired output CADquery code.                                                              | `{"messages": [{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant converting images to CAD code."}]}, {"role": "user", "content": [{"type": "text", "text": "Generate the CADQuery code needed to create the CAD for the provided image. Just the code, no other words."}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}]}, {"role": "assistant", "content": [{"type": "text", "text": "import cadquery as cq\n# Generating a workplane for sketch 0\nwp_sketch0 = cq.Workplane(cq.Plane(cq.Vector(-0.421875, -0.75, 0.0), cq.Vector(1.0, 0.0, 0.0),..."}]}]}`                                                                                                       |
| **Quality Control & Defect Detection** | Identifying and localizing specific manufacturing defects (e.g., scratches, cracks) on a part image. Requires a bounding box list. | `{"messages": [{"role": "system", "content": [{"type": "text", "text": "You are an AI assistant specialized in computer vision for manufacturing quality control, tasked with identifying and localizing defects."}]}, {"role": "user", "content": [{"type": "text", "text": "Analyze the high-resolution image of the metal component and output the JSON list of defects detected."}, {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJ..."}}]}, {"role": "assistant", "content": [{"type": "text", "text": "[{\"box_2d\": [150, 200, 250, 300], \"label\": \"scratch\"}, {\"box_2d\": [400, 50, 550, 150], \"label\": \"dent\"}, {\"box_2d\": [80, 800, 150, 950], \"label\": \"paint_chip\"}]"}]}]}` |
| **Predictive Maintenance**             | Analyzing a machine's thermal image to classify its state (e.g., Normal, Overheating) and identifying the critical hot spot.       | `{"messages": [{"role": "system", "content": [{"type": "text", "text": "You are an AI assistant for predictive maintenance, specialized in classifying industrial machine status from thermal images."}]}, {"role": "user", "content": [{"type": "text", "text": "Analyze the provided thermal image of the electric motor and output the overall condition status."}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}]}, {"role": "assistant", "content": [{"type": "text", "text": "STATUS: Warning (Bearing Overheating)\nMax_Temp_C: 98.5\nLocation: Rear Bearing Housing\nSeverity: High"}]}]}`                                                                                                     |

## Input format

Our fine-tuning service requires data in the **JSON Lines (.jsonl)** format, where each line is a self-contained, valid JSON object.

```json
{"messages":[{"role":"system","content":"You are a helpful customer support assistant."},{"role":"user","content":"I need help with my order #12345. It hasn't arrived yet."},{"role":"assistant","content":"I'd be happy to help you track order #12345. Let me check the shipping status for you."}]}
{"messages":[{"role":"system","content":"You are an AI trained to classify customer feedback."},{"role":"user","content":"The product quality is excellent but the shipping took too long."},{"role":"assistant","content":"Classification: MIXED\nPositive aspects: Product quality\nNegative aspects: Shipping time"}]}
{"messages":[{"role":"system","content":"You are a quality control AI. Classify the image as DEFECT or PASS."},{"role":"user","content":"Inspect the following item: <image:product_001.jpg>"},{"role":"assistant","content":"Classification: PASS"}]}
```

{% tabs %}
{% tab title="Multi-modal Example" %}
A example file for a multi-modal case based on [<mark style="color:blue;">GenCAD Code Dataset</mark>](http://huggingface.co/datasets/CADCODER/GenCAD-Code) can be downloaded here.

{% file src="<https://1737632334-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MifAzdGvKLDTtvJP8sm%2Fuploads%2Fgit-blob-26b4187ffc6162315f057f1a431cf33afd9ed3a5%2Fexample_data_genCAD_50samples.jsonl?alt=media>" %}
{% endtab %}
{% endtabs %}

## Annotation tools

We recommend the following community-driven, self-hostable tools to manage your annotation projects and export structured data.

| **Tool**         | **Focus Areas**                                             | **Why We Recommend It**                                                                                                                      | **Local/Self-Hostable?**              |
| ---------------- | ----------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------- |
| **Label Studio** | **Text (NER, Classification, Dialogue), Images**            | The most versatile platform supporting nearly all our use cases, including complex multi-modal projects. Excellent JSON export capabilities. | **Yes** (via Docker or `pip install`) |
| **Doccano**      | **Text Only (Classification, NER, Seq2Seq)**                | Simple, fast, and optimized specifically for text annotation. Ideal if your project is purely focused on back-office text automation.        | **Yes** (via Docker or `pip install`) |
| **CVAT**         | **Image & Video (Detection, Segmentation, Classification)** | Industry-standard for computer vision. Ideal for labeling your industrial quality control and predictive maintenance image datasets.         | **Yes** (via Docker)                  |

{% hint style="warning" %}
**Attention:** These tools may export your data as a standard `JSON` or `CSV` file. In this case, you will need to run a small, custom script to convert specific output format into the exact `role`/`content` **JSON Lines** structure required by our fine-tuning service. This script acts as the final quality check and conversion step in your data pipeline.
{% endhint %}
