What file type should my training data be?

Hi,

I want to fine-tune the opus-mt ar-en model on my own dataset, but I'm not sure what file type my training data should be in. In the Hugging Face Marian tutorial (MarianMT) they just pass in lists of sentences, but I also read somewhere that I'm supposed to preprocess the data with SentencePiece first, since the Marian tokenizer takes spm files. Or is SentencePiece "built in" to the Marian tokenizer? For now, my data is a CSV file. I'm a beginner, so all help is much appreciated.
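
For concreteness, here is a rough sketch of the "lists of sentences" route starting from a CSV. It assumes the file has one sentence pair per row under the column names `ar` and `en` (hypothetical, adjust to the real header) and that the checkpoint is `Helsinki-NLP/opus-mt-ar-en`. `MarianTokenizer` loads the checkpoint's own SentencePiece (`.spm`) files internally, so in this sketch raw strings are passed in directly and no separate SentencePiece step is run. The helper names `load_parallel_csv` and `tokenize_pairs` are made up for illustration.

```python
import csv
import io


def load_parallel_csv(file_obj, src_col="ar", tgt_col="en"):
    """Read a CSV with one sentence pair per row into two parallel lists.

    The column names are assumptions; change them to match the file's header.
    """
    src_texts, tgt_texts = [], []
    for row in csv.DictReader(file_obj):
        src = (row.get(src_col) or "").strip()
        tgt = (row.get(tgt_col) or "").strip()
        if src and tgt:  # skip rows where either side is missing
            src_texts.append(src)
            tgt_texts.append(tgt)
    return src_texts, tgt_texts


def tokenize_pairs(src_texts, tgt_texts, model_name="Helsinki-NLP/opus-mt-ar-en"):
    """Tokenize raw sentence lists with the checkpoint's own tokenizer.

    Needs the `transformers` and `sentencepiece` packages; imported lazily so
    the CSV helper above works without them.
    """
    from transformers import MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    # text_target tokenizes the English side as labels for fine-tuning.
    return tokenizer(src_texts, text_target=tgt_texts, padding=True, truncation=True)


# Tiny in-memory stand-in for a real file opened with
# open("data.csv", newline="", encoding="utf-8").
sample = io.StringIO("ar,en\nمرحبا,Hello\nشكرا,Thank you\n,\n")
src, tgt = load_parallel_csv(sample)
```

With a real file, the two lists returned by `load_parallel_csv` would be passed to `tokenize_pairs`, which downloads the tokenizer on first use.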