Datasets

Datasets are the foundation for training AI models. The statistical variance in the provided data is learned and represented by the resulting model. The same process is applied when fine-tuning a model. Datasets are therefore the base for everything that can be done in the AI Model Studio. The datasets consist of entries that contain both example input values as well as the expected target output.

Create a Dataset in the AI Model Studio

In general, there are three different methods for creating a dataset in the AI Model studio:

Upload a complete dataset

When uploading a complete dataset, only the datapoints provided are used for later training a model. Therefore developers have to make sure, the datasets contain enough examples to grab the variance of the use case they will be used for.

Full datasets need to be provided in JSONL format, meaning a JSON file that only consists of one single line of content which has proved itself as the dominant data format used for LLM fine-tunings.

{"messages":[{"role":"system","content":"You are a helpful customer support assistant."},{"role":"user","content":"I need help with my order #12345. It hasn't arrived yet."},{"role":"assistant","content":"I'd be happy to help you track order #12345. Let me check the shipping status for you. According to our records, your order was dispatched on Monday and is currently in transit. The expected delivery date is tomorrow by 5 PM. You should receive a tracking update via email shortly."}]}
{"messages":[{"role":"user","content":"How do I reset my password?"},{"role":"assistant","content":"To reset your password, please follow these steps:\n1. Go to the login page\n2. Click on 'Forgot Password'\n3. Enter your email address\n4. Check your email for a reset link\n5. Follow the link to create a new password\nThe link will expire in 24 hours for security reasons."}]}
{"messages":[{"role":"system","content":"You are an AI trained to classify customer feedback."},{"role":"user","content":"The product quality is excellent but the shipping took too long."},{"role":"assistant","content":"Classification: MIXED\nPositive aspects: Product quality\nNegative aspects: Shipping time\nOverall sentiment: Neutral to positive"}]}

Generate a completely synthetic dataset

The opposite of having complete datasets ready to train your AI models is having no available data at all. For this case, the AI Model Studio offers the possibility to create complete datasets from scratch using large AI models. To achieve this, the IONOS AI Model Hub is chosen to ensure full AI and data sovereignty.

Best Practice: Although no example data is required to create a synthetic dataset, the requirements and target usage of the data should be provided as detailed as possible to ensure results that improve existing language models and tailor them to specific needs. To achieve this, use synthesized data and then describe the use case, for example, the input data that will realistically be used later and the context, which includes what is expected by the AI model and how the output is supposed to look.

Limitations: Due to compute limitations, the free beta of the AI Model Studio will only allow the creation of up to 250 training examples. Downloads of the resulting datasets are not possible as of now.

Augment a Small Example Dataset

Additionally, to uploading full datasets and synthesizing data from scratch, small datasets can be augmented by creating new data points through usage of big AI models.

Best Practice: While creating new synthetic data points carries the risk of overfitting your AI model, generally speaking it’s a great way to use big AI models in an efficient manner and distill the knowledge into smaller fine-tuned ones. Make sure to still provide context and use case information like you would with a complete data synthesis. For multi-modal and/or complex in-production cases, we recommend creating a dataset using data annotation.

To augment data, initiate the data synthesis setup, as if creating a fully synthetic dataset. Then upload your real-world data to serve as the example reference set.

Privacy guidelines

GDPR compliance: AI Model Hub is designed to comply with the General Data Protection Regulation (GDPR), ensuring that we respect and uphold user rights regarding personal data. For more information, see Data Processing Agreement.

Data security

Our infrastructure is hosted in our German data centers, designed to meet stringent security standards.

Last updated

Was this helpful?