Format requirements of dataset when fine tuning another model

Brainmaniac · April 6, 2022, 10:02am

Hello, I am new to this but love it.

My teammate and I are about to try to fine-tune a text-generation model to be more “domain specific”, i.e. to speak the lingo of a certain profession.

We are from Sweden, so we thought we’d use a base model that can produce Swedish and then train it further on texts from the professional group we are aiming for.
As a base we will try: birgermoell/swedish-gpt · Hugging Face

We are using AWS SageMaker. We think we might have successfully set the training up there, BUT among the 1 million questions we have, we wonder about the structure of the dataset. Is there any specific format/structure the dataset CSV needs to have? E.g. specific headers etc.
In other words, does the dataset have to have the same structure as the original dataset the model was trained on?

lhoestq · April 7, 2022, 9:38am

Hi! If you plan to train for causal language modeling with a script similar to run_clm.py, then a simple text file is enough.
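
If your texts currently live in a CSV, here is a minimal sketch of how you could flatten them into such a plain text file, one document per line. The column name `text`, the file names, and the helper `csv_to_text_file` are all assumptions for illustration, not anything the script requires; adjust them to your data.

```python
import csv
from pathlib import Path


def csv_to_text_file(csv_path, txt_path, column="text"):
    """Write the values of one CSV column to a plain text file, one per line.

    `column` is a hypothetical header name; change it to match your CSV.
    Returns the number of non-empty lines written.
    """
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(txt_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            # Collapse newlines and repeated whitespace so each document
            # stays on a single line of the output file.
            text = " ".join((row.get(column) or "").split())
            if text:  # skip empty rows
                dst.write(text + "\n")
                written += 1
    return written


if __name__ == "__main__":
    # Tiny made-up sample so the sketch runs on its own.
    Path("sample.csv").write_text(
        'text\n"First note\ncontinues here."\n""\n"Second note."\n',
        encoding="utf-8",
    )
    count = csv_to_text_file("sample.csv", "train.txt")
    print(count, "lines written to train.txt")
```

The resulting train.txt is the kind of file you would then hand to the training script, for example through the `--train_file` argument of run_clm.py. Whether one document per line is the best split for your texts is a judgment call, since the script concatenates and re-chunks the text into fixed-size blocks anyway.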
