Hello, I am new to this but love it.
My teammate and I are about to try fine-tuning a text-generation model to be more "domain specific", i.e. to speak the lingo of a certain profession.
We are from Sweden, so we thought we'd use a base model that can produce Swedish text and then train it further on texts from the professional group we are aiming for.
As a base we will try: birgermoell/swedish-gpt (on Hugging Face)
We are using AWS SageMaker. We think we might have successfully set up the training there, BUT among the million questions we have, we wonder about the structure of the dataset. Is there any specific format/structure the dataset CSV needs to have, e.g. specific headers?
In other words, does our dataset have to follow the same structure as the original dataset the model was trained on?
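
To make the question concrete, here is a minimal sketch of the layout we are guessing at: one passage per row under a single header. The column name `text`, the helper names `write_dataset` and `read_dataset`, and the sample sentences are all our own placeholders, not something we found documented for this model, so please correct us if the training script expects something else.

```python
import csv
import io

# Placeholder passages; the real rows would be Swedish texts from the target profession.
passages = [
    "Första exempeltexten från yrkesgruppen.",
    'Andra exempeltexten, med kommatecken och "citattecken".',
]


def write_dataset(rows, handle):
    """Write one passage per row under a single 'text' header (assumed column name)."""
    writer = csv.DictWriter(handle, fieldnames=["text"])
    writer.writeheader()
    for row in rows:
        writer.writerow({"text": row})


def read_dataset(handle):
    """Read the passages back from the assumed 'text' column."""
    return [row["text"] for row in csv.DictReader(handle)]


# Round-trip through an in-memory buffer to check that commas and quotes survive.
buffer = io.StringIO()
write_dataset(passages, buffer)
buffer.seek(0)
round_trip = read_dataset(buffer)
```

Our thinking is that a plain one-column file keeps quoting and commas inside passages safe, but we do not know whether the header has to match a name the training code looks for.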