Dataset format requirements when fine-tuning an existing model

Hello :wave: I am new to this but love it :heart_eyes:

My teammate and I are about to try fine-tuning a text-generation model to make it more “domain specific”, i.e. to speak the lingo of a certain profession.

We are from Sweden, so we thought we’d use a base model that can already produce Swedish text and then train it further on texts from the professional group we are aiming for.
As a base we will try birgermoell/swedish-gpt from the Hugging Face Hub.

We are using AWS SageMaker. We think we might have set up the training there successfully, BUT among the 1 million questions we have, we wonder about the structure of the dataset. Is there any specific format/structure the dataset CSV needs to have? E.g. specific headers etc.
In other words, does our dataset have to follow the same structure as the dataset the model was originally trained on?

Hi! If you plan to train for causal language modeling with a script similar to run_clm.py, then a simple text file is enough.
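
If your texts currently live in a CSV, here is a minimal sketch of how you could flatten one column into such a plain text file, one example per line. The column name `text`, the function name `csv_to_text_file`, and the file names are all assumptions for illustration, so adjust them to whatever your data actually looks like. It only uses the Python standard library.

```python
import csv


def csv_to_text_file(csv_path, txt_path, text_column="text"):
    """Copy one CSV column into a plain text file, one example per line.

    `text_column` is a hypothetical header name; change it to match your CSV.
    Returns the number of non-empty examples written.
    """
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(txt_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            # Collapse line breaks and repeated whitespace so that each
            # example stays on a single line of the output file.
            text = " ".join((row.get(text_column) or "").split())
            if text:  # skip rows where the column is empty or missing
                dst.write(text + "\n")
                written += 1
    return written


# Example usage (file names are placeholders):
# n = csv_to_text_file("profession_texts.csv", "train.txt")
# print(f"wrote {n} examples")
```

The resulting text file is what you would then hand to the training script as its training data. Whether one example per line is the right granularity depends on your texts, so treat that as a starting point rather than a rule.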