What file type should my training data be?

TheaMT · February 20, 2023, 11:07am

Hi,

I want to fine-tune the opus-mt-ar-en model on my own dataset, but I’m not sure what file type my training data should be in. In the Hugging Face Marian tutorial (MarianMT) they just pass in lists of sentences, but I also read somewhere that I’m supposed to preprocess the data with SentencePiece first, since the Marian tokenizer takes spm files. Or is SentencePiece “built into” the Marian tokenizer? For now, my data is a CSV file. I’m a beginner, so all help is much appreciated.
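
For context, here is a minimal sketch of how a CSV could be turned into the lists of sentences that the tutorial passes in. The column names `ar` and `en`, the helper names `load_parallel_csv` and `tokenize_pairs`, and the checkpoint id `Helsinki-NLP/opus-mt-ar-en` are assumptions for illustration, not something taken from the tutorial. As far as I understand, `MarianTokenizer` loads the checkpoint's SentencePiece models itself and applies them to raw text, so a separate SentencePiece preprocessing step should not be needed.

```python
import csv


def load_parallel_csv(path, src_col="ar", tgt_col="en"):
    """Read a CSV with one sentence pair per row into two parallel lists.

    The column names are hypothetical; adjust them to match the file's header.
    """
    src_texts, tgt_texts = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            src = (row.get(src_col) or "").strip()
            tgt = (row.get(tgt_col) or "").strip()
            if src and tgt:  # skip rows where either side is empty
                src_texts.append(src)
                tgt_texts.append(tgt)
    return src_texts, tgt_texts


def tokenize_pairs(src_texts, tgt_texts, model_name="Helsinki-NLP/opus-mt-ar-en"):
    """Tokenize raw sentence pairs (requires the transformers package).

    The checkpoint id is an assumption. The tokenizer applies SentencePiece
    internally, so the inputs are plain strings.
    """
    from transformers import MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    # text_target makes the tokenizer also encode the targets as "labels".
    return tokenizer(src_texts, text_target=tgt_texts, padding=True, truncation=True)
```

If this is right, the file type itself matters little: anything that can be read into two aligned lists of strings (CSV, TSV, or two plain-text files with one sentence per line) would work the same way.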
