Data Annotation

The quality of your fine-tuned model begins and ends with the quality of your training data. This guide outlines the core principles of data annotation, provides task-specific specifications, and recommends powerful, free, open-source tools to help you create a world-class dataset for our fine-tuning service.

Building the ground truth

When fine-tuning a model using the AI Model Studio for a specific task, whether classifying customer emails or detecting product anomalies, the model needs a precise, reliable "answer key". This answer key is your annotated dataset, often referred to as the Ground Truth.

Ground Truth Defined: This is the ideal, human-verified output for every input in your training data. It is the gold standard that your fine-tuned model will strive to replicate.

AI Model Studio is a powerful engine designed to learn complex patterns from this data. However, it cannot correct human errors or resolve ambiguity. While some cases can be covered with synthetic data, high-quality annotation is the single most critical factor in determining your model’s real-world accuracy and effectiveness, and it is key to successfully implementing complex use cases.

Core principles

High-quality annotation is a structured process that relies on a clear rulebook. Before you begin labeling, you must define an Annotation Schema, a set of exhaustive, unambiguous guidelines.
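To make this concrete, a schema can be as simple as a structured document that every annotator works from. The sketch below is a hypothetical example for an email-classification project; the label names, rules, and structure are illustrative placeholders, not part of our service.

```python
# Hypothetical annotation schema for an email-classification project.
# Every label name and rule here is illustrative; adapt it to your own use case.
ANNOTATION_SCHEMA = {
    "labels": {
        "SUPPORT_REQUEST": "The sender needs help with an existing order or product.",
        "SALES_INQUIRY": "The sender asks about pricing, plans, or purchasing.",
        "GENERAL_FEEDBACK": "The sender shares an opinion without requesting action.",
    },
    "edge_cases": [
        "Mixed intent (e.g. feedback plus a question): label by the part that requires action.",
        "Unclear intent: flag the item for review instead of guessing.",
    ],
    "output_format": "A single label string, exactly as spelled above.",
}
```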

| Principle | Why It Matters for Your Business Case | How to Ensure It |
| --- | --- | --- |
| Consistency | If the same invoice or the same type of customer email receives different labels from different people, the model will fail to automate your back-office processes reliably. | Train all annotators (even a single data scientist) on the same documented schema. Use the same labels for the same outcomes every time. |
| Clarity & Specificity | Ambiguous rules lead to ambiguous data. A lack of rules for edge cases (for example, a "mixed-sentiment" email) results in a model that cannot handle real-world complexity. | Your schema must define every label and every boundary, and provide examples of what to do when something is unclear. |
| The "Assistant" Role Is the Answer | The final assistant message in your JSON is the model’s target behavior. It must be the exact, desired output, whether it is a full conversation reply, a simple classification tag, or a structured data block. | The final labeled data must be immediately usable by your business process without further cleanup. |
| Inter-Annotator Agreement (IAA) | IAA is the benchmark for your rulebook's quality. If your experts consistently disagree on labels, your guidelines are flawed and your data quality is low. | Measure the agreement between 2-3 annotators on a small sample. If agreement is below 80%, refine your guidelines before labeling the full dataset (see the sketch after this table). |
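As referenced in the IAA row above, a quick way to check agreement is to compare two annotators' labels on the same items. The sketch below computes simple percent agreement in plain Python; the label lists are hypothetical examples, and you may prefer a chance-corrected metric such as Cohen's kappa for a stricter check.

```python
# Minimal percent-agreement check between two annotators on the same sample.
# The label lists are hypothetical; replace them with your own annotations.
annotator_a = ["SUPPORT_REQUEST", "SALES_INQUIRY", "GENERAL_FEEDBACK", "SALES_INQUIRY"]
annotator_b = ["SUPPORT_REQUEST", "SALES_INQUIRY", "SALES_INQUIRY", "SALES_INQUIRY"]

matches = sum(a == b for a, b in zip(annotator_a, annotator_b))
agreement = matches / len(annotator_a)
print(f"Agreement: {agreement:.0%}")  # refine your guidelines if this is below 80%
```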

Annotation specifications by use case

All data must be formatted as JSON Lines: each line contains an array of messages, and each message has a role and content.

Text Automation Cases

The system content defines the AI's job, the user content provides the input data, and the assistant content is the correct, desired output. A minimal sketch for building such records programmatically follows the table below.

| Use Case | Annotation Focus | Example Data |
| --- | --- | --- |
| Dialogue / Chatbot | Full, natural conversation flow; setting the persona in the system message. | `{"messages":[{"role":"system","content":"You are a helpful customer support assistant."},{"role":"user","content":"I need help with my order #12345. It hasn't arrived yet."},{"role":"assistant","content":"I'd be happy to help you track order #12345. Let me check the shipping status for you."}]}` |
| Email Classification | The model's final response is the classification, often a simple string. | `{"messages":[{"role":"system","content":"You are an expert email classification system. Your response must be only one of the following: 'SUPPORT_REQUEST', 'SALES_INQUIRY', or 'GENERAL_FEEDBACK'."},{"role":"user","content":"Hi, can you tell me more about the pricing for your pro plan? Thanks!"},{"role":"assistant","content":"Classification: SALES_INQUIRY"}]}` |
| Structured Extraction | The model is trained to extract key-value pairs (for example, from an invoice or text body) and output a predictable structure. | `{"messages":[{"role":"system","content":"You are a data extraction bot. Extract the invoice ID, total amount with currency, and due date from the provided text. Respond only with a valid JSON object."},{"role":"user","content":"Invoice: Thank you for your business. Invoice #4567 is attached. The total due is 129.99 EUR and must be paid by November 15, 2025."},{"role":"assistant","content":"{\"invoice_id\": \"4567\", \"amount\": \"129.99 EUR\", \"due_date\": \"2025-11-15\"}"}]}` |
| Summarization | The assistant role contains the human-written, high-quality summary of the user's input text. | `{"messages":[{"role":"system","content":"You are a text summarization tool. Your task is to provide a concise, neutral summary of the customer's feedback."},{"role":"user","content":"The product quality was excellent—sturdy and well-made. However, the shipping took three weeks longer than promised, which was very frustrating. I would recommend the product, but warn others about the slow delivery."},{"role":"assistant","content":"The customer experienced excellent product quality but was dissatisfied with the long shipping time."}]}` |

Multimodal (Image and Text)

For any task involving an image, such as quality control, anomaly detection, or visual question answering, the image must be embedded as a Base64-encoded data URL. Additionally, each content entry in a message must declare its type (for example, text or image_url).

{"messages": [{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant converting images to CAD code."}]}, {"role": "user", "content": [{"type": "text", "text": "Generate the CADQuery code needed to create the CAD for the provided image. Just the code, no other words."}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}]}, {"role": "assistant", "content": [{"type": "text", "text": "code"}]}]}

Typical use cases that require multimodal fine-tuning are shown below. Keep in mind that complex use cases may require labeling with bounding boxes, as in some of the following examples; a sketch for encoding images into this message format follows the table.

| Use Case | Annotation Focus | Example Data |
| --- | --- | --- |
| CAD Code Generation | Including a CAD image and providing the desired output CADQuery code. | `{"messages": [{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant converting images to CAD code."}]}, {"role": "user", "content": [{"type": "text", "text": "Generate the CADQuery code needed to create the CAD for the provided image. Just the code, no other words."}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}]}, {"role": "assistant", "content": [{"type": "text", "text": "import cadquery as cq\n# Generating a workplane for sketch 0\nwp_sketch0 = cq.Workplane(cq.Plane(cq.Vector(-0.421875, -0.75, 0.0), cq.Vector(1.0, 0.0, 0.0),..."}]}]}` |
| Quality Control & Defect Detection | Identifying and localizing specific manufacturing defects (e.g., scratches, cracks) on a part image. Requires a bounding box list. | `{"messages": [{"role": "system", "content": [{"type": "text", "text": "You are an AI assistant specialized in computer vision for manufacturing quality control, tasked with identifying and localizing defects."}]}, {"role": "user", "content": [{"type": "text", "text": "Analyze the high-resolution image of the metal component and output the JSON list of defects detected."}, {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJ..."}}]}, {"role": "assistant", "content": [{"type": "text", "text": "[{\"box_2d\": [150, 200, 250, 300], \"label\": \"scratch\"}, {\"box_2d\": [400, 50, 550, 150], \"label\": \"dent\"}, {\"box_2d\": [80, 800, 150, 950], \"label\": \"paint_chip\"}]"}]}]}` |
| Predictive Maintenance | Analyzing a machine's thermal image to classify its state (e.g., Normal, Overheating) and identifying the critical hot spot. | `{"messages": [{"role": "system", "content": [{"type": "text", "text": "You are an AI assistant for predictive maintenance, specialized in classifying industrial machine status from thermal images."}]}, {"role": "user", "content": [{"type": "text", "text": "Analyze the provided thermal image of the electric motor and output the overall condition status."}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}]}, {"role": "assistant", "content": [{"type": "text", "text": "STATUS: Warning (Bearing Overheating)\nMax_Temp_C: 98.5\nLocation: Rear Bearing Housing\nSeverity: High"}]}]}` |

Input format

Our fine-tuning service requires data in the JSON Lines (.jsonl) format, where each line is a self-contained, valid JSON object.

{"messages":[{"role":"system","content":"You are a helpful customer support assistant."},{"role":"user","content":"I need help with my order #12345. It hasn't arrived yet."},{"role":"assistant","content":"I'd be happy to help you track order #12345. Let me check the shipping status for you."}]}
{"messages":[{"role":"system","content":"You are an AI trained to classify customer feedback."},{"role":"user","content":"The product quality is excellent but the shipping took too long."},{"role":"assistant","content":"Classification: MIXED\nPositive aspects: Product quality\nNegative aspects: Shipping time"}]}
{"messages":[{"role":"system","content":"You are a quality control AI. Classify the image as DEFECT or PASS."},{"role":"user","content":"Inspect the following item: <image:product_001.jpg>"},{"role":"assistant","content":"Classification: PASS"}]}

An example file for a multimodal case, based on the GenCAD Code Dataset, can be downloaded here.
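Before uploading, it can help to verify that every line is a self-contained, valid JSON object with the expected messages structure. A minimal validation sketch (the file name is a placeholder) could look like this:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_jsonl(path: str) -> None:
    """Check that each line is valid JSON with a well-formed messages array."""
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            record = json.loads(line)  # raises if the line is not valid JSON
            messages = record.get("messages")
            assert isinstance(messages, list) and messages, f"line {lineno}: missing or empty messages"
            for msg in messages:
                assert msg.get("role") in VALID_ROLES, f"line {lineno}: invalid role"
                assert "content" in msg, f"line {lineno}: missing content"
            # The final assistant message is the training target.
            assert messages[-1]["role"] == "assistant", f"line {lineno}: must end with an assistant message"


validate_jsonl("training_data.jsonl")  # placeholder file name
print("All lines look valid.")
```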

Annotation tools

We recommend the following community-driven, self-hostable tools to manage your annotation projects and export structured data.

| Tool | Focus Areas | Why We Recommend It | Local/Self-Hostable? |
| --- | --- | --- | --- |
| Label Studio | Text (NER, Classification, Dialogue), Images | The most versatile platform, supporting nearly all our use cases, including complex multimodal projects. Excellent JSON export capabilities. | Yes (via Docker or pip install) |
| Doccano | Text Only (Classification, NER, Seq2Seq) | Simple, fast, and optimized specifically for text annotation. Ideal if your project is purely focused on back-office text automation. | Yes (via Docker or pip install) |
| CVAT | Image & Video (Detection, Segmentation, Classification) | Industry-standard for computer vision. Ideal for labeling your industrial quality control and predictive maintenance image datasets. | Yes (via Docker) |
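As an example of going from an annotation tool to the required JSONL, the sketch below converts a Label Studio JSON export from a simple text-classification project. The exact export structure depends on your labeling configuration, so the assumed field names ("text", "choices") and file names are placeholders to adapt.

```python
import json

SYSTEM_PROMPT = "You are an expert email classification system."  # adapt to your task

# Assumes a Label Studio JSON export of a text-classification project where each
# task stores the input under data["text"] and the label as a "choices" result.
with open("label_studio_export.json", "r", encoding="utf-8") as f:
    tasks = json.load(f)

with open("training_data.jsonl", "w", encoding="utf-8") as out:
    for task in tasks:
        text = task["data"]["text"]
        label = task["annotations"][0]["result"][0]["value"]["choices"][0]
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": text},
                {"role": "assistant", "content": f"Classification: {label}"},
            ]
        }
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
```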
