Fine-tune a translation model or train from scratch?

Hello everyone. I hope this question hasn’t been asked here before; I looked around first.
I’m still learning HF Transformers. I tried to fine-tune a machine translation model, "Helsinki-NLP/opus-mt-en-mul", on a local dialect.
I chose en-mul because the target language doesn’t appear in the list of languages this checkpoint was trained on, nor in any other checkpoint on the Hub. So I can’t use the pretrained tokenizer, as it doesn’t know my text. I wanted to know whether my only option is training a BART or mBART from scratch, which is the conclusion I reached after some research.
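Before committing to from-scratch training, it can help to quantify how badly the pretrained vocabulary misses the dialect. A minimal sketch of that coverage check, using a plain word-level vocabulary as a stand-in for the real subword tokenizer (in practice you would tokenize with the actual checkpoint’s tokenizer and count unknown pieces; all words below are invented for illustration):

```python
def unk_rate(vocab, sentences):
    """Fraction of whitespace-split tokens missing from the vocabulary."""
    tokens = [tok for sent in sentences for tok in sent.lower().split()]
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in vocab)
    return unknown / len(tokens)

# Toy stand-ins: a tiny "pretrained" vocabulary and a dialect sample
# containing made-up out-of-vocabulary words.
pretrained_vocab = {"the", "cat", "sat", "on", "mat"}
dialect_sample = ["the zorble sat on the flim"]  # invented dialect words
print(unk_rate(pretrained_vocab, dialect_sample))  # 2 of 6 tokens unknown
```

A high unknown rate on real dialect text is the signal that the existing tokenizer can’t represent the language, which is what pushes you toward training a new tokenizer (and possibly a new model) from scratch.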

Welcome rickySaka! This is a bit outside of my expertise, but I took a shot at researching it too, and I think I’ve landed on the same conclusion you have. From this thread and this thread, my understanding is that you’ll want to train from scratch. Specifically:

In case you have plenty of new words (e.g. technical terms) or even a different language, it might make sense to start from scratch


Thanks @NimaBoscarino. I will have a look at them.


Is there any model that translates into a target language similar to yours? Cross-lingual transfer can help you leverage knowledge from a model trained on a related pair, e.g. adapting an English-Italian model to English-Spanish, even though the two languages are different.

This really depends on the amount of data you have, and you may well end up training from scratch as @NimaBoscarino suggested, but I wanted to bring up that option as well. The recent book by @lewtun et al. has a whole chapter about multilingual NER, and its concepts apply here too. The code is open source, so you might want to check the "Cross-Lingual Transfer" and "When Does Zero-Shot Transfer Make Sense?" sections of its notebook, or take a look at the book.
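One rough way to shortlist a pivot language for cross-lingual transfer is to compare character n-gram profiles of small text samples. A toy heuristic sketch (the language names and sample strings are made up for illustration; a real choice should also weigh language family, script, and available parallel data):

```python
from collections import Counter

def char_trigram_profile(text):
    """Counter of character trigrams, with light padding at the edges."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine_similarity(a, b):
    """Cosine similarity between two trigram Counters."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = lambda c: sum(v * v for v in c.values()) ** 0.5
    return dot / (norm(a) * norm(b)) if a and b else 0.0

dialect_sample = "mbote na yo"  # hypothetical dialect text
candidates = {"lingala": "mbote na bino", "french": "bonjour vous"}
scores = {
    lang: cosine_similarity(char_trigram_profile(dialect_sample),
                            char_trigram_profile(sample))
    for lang, sample in candidates.items()
}
print(max(scores, key=scores.get))  # prints "lingala"
```

With real data you would compare a few hundred sentences per candidate rather than single phrases, but the idea is the same: pick the supported language whose surface form overlaps most with your dialect.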


Thank you for your suggestions. I will have to check which language could be close to my target language. I have around 5,000 translation pairs from this dataset, so training from scratch will be my last resort, as the data is really small. I did observe interesting results with the multilingual model (en-mul), though: a validation BLEU score of 21 after 20 epochs using the model’s tokenizer. Still working on it.
Thanks a lot.

Hi @rickySaka, I am hoping to benefit from cross-lingual transfer as well: I want to fine-tune the opus-mt-am-sv (Amharic-Swedish) pretrained model with a 650K-sentence-pair Amharic-English dataset. If it works out I will post here.

Great, feel free.
I discovered that the sacreBLEU score I announced (20-21) dropped considerably to 18 two days ago with the same dataset, checkpoint (en-mul), and tokenizer. I don’t know why, though.
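A drop like that is worth double-checking against the metric itself: BLEU numbers are only comparable when tokenization and scoring are identical between runs, which is exactly what sacreBLEU standardizes. As a toy illustration of the clipped n-gram precision at BLEU’s core (not the full sacreBLEU algorithm, which adds a brevity penalty, higher-order n-grams, and standard tokenization):

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision: each candidate n-gram's count is capped
    by how often that n-gram appears in the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

# "the" appears twice in the candidate but only once in the reference,
# so only one occurrence counts: 2 matches out of 3 unigrams.
print(modified_ngram_precision("the the cat", "the cat sat"))  # 2/3
```

If the corpus, checkpoint, and tokenizer really are unchanged, the remaining suspects are usually the scoring setup (tokenization, casing, which split is evaluated) or training randomness (seed, shuffling, early stopping point).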