Hi,
I want to fine-tune the opus-mt-ar-en model using my own dataset, but I'm not sure what format my training data should be in. In the Hugging Face MarianMT tutorial they just pass in lists of sentences, but I also read somewhere that I'm supposed to preprocess the data with SentencePiece first, since the Marian tokenizer takes .spm files. Or is SentencePiece "built in" to the Marian tokenizer? Right now my data is a CSV file. I'm a beginner, so any help is much appreciated.
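For context, here is roughly how I'm currently turning my CSV into the plain sentence lists that the tutorial passes to the tokenizer (the column names `ar` and `en` are just what my file happens to use, and the sample rows below are made up):

```python
import csv
import io

# Hypothetical sample mirroring my CSV layout: one Arabic/English pair per row.
sample_csv = "ar,en\nمرحبا,Hello\nشكرا,Thank you\n"

src_sentences = []  # Arabic source sentences
tgt_sentences = []  # English target sentences

# In my real script I open the actual file instead of io.StringIO.
reader = csv.DictReader(io.StringIO(sample_csv))
for row in reader:
    src_sentences.append(row["ar"])
    tgt_sentences.append(row["en"])

print(src_sentences)
print(tgt_sentences)
```

Is it enough to hand lists like these straight to the tokenizer, or do I need an extra SentencePiece step before this?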