Tutorial on how to train a translation model from scratch in a Pythonic way?

Hi Hugging Face community!

I’m working on a translation model for a language pair that is not yet available on the Hub,
so training it from scratch is the only way to go. However, I haven’t seen any tutorial on training Hugging Face models from scratch. I want to use the MarianMT class, but I’m still discovering the library and don’t know how to proceed. For example, MarianTokenizer (the tokenizer of the MarianMT model) requires a SentencePiece model stored in a .spm file, but from what I’ve seen online, SentencePiece models are stored in .model files, and I’m unfamiliar with the .spm format. So I’m stuck. Is there a way to train a translation model from scratch?

Same question! I want to train a translation model from scratch to reproduce the results in

Hi, thanks for this question.
Did you manage to solve this problem?