What file format should my training data be in?


I want to fine-tune the opus-mt-ar-en model on my own dataset, but I’m not sure what file format my training data should be in. In the Hugging Face Marian tutorial (MarianMT) they just pass in lists of sentences, but I also read somewhere that I’m supposed to preprocess the data with SentencePiece first, since the Marian tokenizer takes spm files. Or is SentencePiece “built in” to the Marian tokenizer? For now, my data is a CSV file. I’m a beginner, so any help is much appreciated.
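
For context, here is a rough sketch of how I imagine getting my CSV into the kind of sentence lists the tutorial passes in. The column names `ar` and `en` and the helper `load_parallel_csv` are just assumptions for illustration (my real file may be laid out differently), and the in-memory sample stands in for the actual file.

```python
import csv
import io


def load_parallel_csv(csv_file, src_col="ar", tgt_col="en"):
    """Read a CSV with one sentence pair per row into two parallel lists.

    src_col / tgt_col are assumed column names, not anything Marian requires.
    """
    src_texts, tgt_texts = [], []
    for row in csv.DictReader(csv_file):
        src = (row.get(src_col) or "").strip()
        tgt = (row.get(tgt_col) or "").strip()
        # Skip rows where either side is missing so the lists stay aligned.
        if src and tgt:
            src_texts.append(src)
            tgt_texts.append(tgt)
    return src_texts, tgt_texts


# Tiny in-memory stand-in for the real file; for a file on disk one would use
# open("my_data.csv", newline="", encoding="utf-8") instead (hypothetical path).
sample = io.StringIO("ar,en\nمرحبا,Hello\nشكرا,Thank you\n,Orphan row\n")
src_texts, tgt_texts = load_parallel_csv(sample)
print(len(src_texts), len(tgt_texts))
```

The two lists are what I would then try passing to the tokenizer the same way the tutorial does, if that is the right approach.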