Fine-tune a translation model or train from scratch?

Hello everyone. I hope this question hasn’t been asked here before; I looked around first.
I’m still learning HF Transformers. I tried to fine-tune a machine translation model, "Helsinki-NLP/opus-mt-en-mul", on a local dialect.
I chose en-mul because the target language doesn’t appear in the list of languages this checkpoint was trained on, nor in any other checkpoint on the Hub. So I can’t use the pretrained tokenizer, as it doesn’t know my text. I wanted to know whether my only option is training a BART or mBART from scratch, which is the conclusion I reached after some research.
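Before committing to from-scratch training, it can help to quantify how badly the pretrained vocabulary misses the dialect. A minimal sketch of that coverage check, using a plain word-level vocabulary as a stand-in for the real subword tokenizer (in practice you would tokenize with the actual checkpoint’s tokenizer and count unknown pieces; all words below are invented for illustration):

```python
def unk_rate(vocab, sentences):
    """Fraction of whitespace-split tokens missing from the vocabulary."""
    tokens = [tok for sent in sentences for tok in sent.lower().split()]
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in vocab)
    return unknown / len(tokens)

# Toy stand-ins: a tiny "pretrained" vocabulary and a dialect sample
# containing made-up out-of-vocabulary words.
pretrained_vocab = {"the", "cat", "sat", "on", "mat"}
dialect_sample = ["the zorble sat on the flim"]  # invented dialect words
print(unk_rate(pretrained_vocab, dialect_sample))  # 2 of 6 tokens unknown
```

A high unknown rate on real dialect text is the signal that the existing tokenizer can’t represent the language, which is what pushes you toward training a new tokenizer (and possibly a new model) from scratch.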

Welcome rickySaka! This is a bit outside of my expertise, but I took a shot at researching it too, and I think I’ve landed on the same conclusion you have. From this thread and this thread, my understanding is that you’ll want to train from scratch. Specifically:

In case you have plenty of new words (e.g. technical terms) or even a different language, it might make sense to start from scratch


Thanks @NimaBoscarino. I will have a look at them.


Is there any model that translates into a target language similar to yours? Cross-lingual transfer can help you leverage knowledge from a model trained on a related pair, e.g. adapting an English-Italian model to English-Spanish, even though the two languages are different.

This really depends on the amount of data you have, and you may well end up training from scratch as @NimaBoscarino suggested, but I wanted to bring up that option as well. The recent book by @lewtun et al. has a whole chapter about multilingual NER, and its concepts apply here too. The code is open source, so you might want to check the "Cross-Lingual Transfer" and "When Does Zero-Shot Transfer Make Sense?" sections of its notebook, or take a look at the book.
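One rough way to shortlist a pivot language for cross-lingual transfer is to compare character n-gram profiles of small text samples. A toy heuristic sketch (the language names and sample strings are made up for illustration; a real choice should also weigh language family, script, and available parallel data):

```python
from collections import Counter

def char_trigram_profile(text):
    """Counter of character trigrams, with light padding at the edges."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine_similarity(a, b):
    """Cosine similarity between two trigram Counters."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = lambda c: sum(v * v for v in c.values()) ** 0.5
    return dot / (norm(a) * norm(b)) if a and b else 0.0

dialect_sample = "mbote na yo"  # hypothetical dialect text
candidates = {"lingala": "mbote na bino", "french": "bonjour vous"}
scores = {
    lang: cosine_similarity(char_trigram_profile(dialect_sample),
                            char_trigram_profile(sample))
    for lang, sample in candidates.items()
}
print(max(scores, key=scores.get))  # prints "lingala"
```

With real data you would compare a few hundred sentences per candidate rather than single phrases, but the idea is the same: pick the supported language whose surface form overlaps most with your dialect.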


Thank you for your suggestions. I will have to check which language could be close to my target language. I have around 5,000 translation pairs from this dataset, so training from scratch will be my last resort, as the data is really small. I did observe interesting results with the multilingual model (en-mul), though: a validation BLEU score of 21 after 20 epochs using the model’s tokenizer. Still working on it.
Thanks a lot.

Hi @rickySaka, I am hoping to benefit from cross-lingual transfer as well: I want to fine-tune the opus-mt-am-sv (Amharic-Swedish) pretrained model with a 650K-sentence-pair Amharic-English dataset. If it works out I will post here.

Great, feel free.
I discovered that the sacreBLEU score I announced (20-21) dropped considerably to 18 two days ago with the same dataset, checkpoint (en-mul), and tokenizer. I don’t know why, though.
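A drop like that is worth double-checking against the metric itself: BLEU numbers are only comparable when tokenization and scoring are identical between runs, which is exactly what sacreBLEU standardizes. As a toy illustration of the clipped n-gram precision at BLEU’s core (not the full sacreBLEU algorithm, which adds a brevity penalty, higher-order n-grams, and standard tokenization):

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision: each candidate n-gram's count is capped
    by how often that n-gram appears in the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

# "the" appears twice in the candidate but only once in the reference,
# so only one occurrence counts: 2 matches out of 3 unigrams.
print(modified_ngram_precision("the the cat", "the cat sat"))  # 2/3
```

If the corpus, checkpoint, and tokenizer really are unchanged, the remaining suspects are usually the scoring setup (tokenization, casing, which split is evaluated) or training randomness (seed, shuffling, early stopping point).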