Hi team,
I want to finetune a large Transformer-based MT model (e.g. NLLB-200-1.3B) on new words that are outside the model's vocabulary.
I have a correctly working finetuning script in place, and what I need now is a clearer theoretical grounding on the question of dataset preparation.
The pretrained model has a 250k-token vocabulary, and according to the NLLB paper it was pretrained on about 21B sentences.
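For illustration, this is roughly how I look at the problem at the tokenizer level (the checkpoint name and the sample word are placeholders for my actual setup): the pretrained SentencePiece vocab splits an unseen word into subwords, and one could optionally register the new words as dedicated tokens whose fresh embedding rows then have to be learned during finetuning.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint and word -- substitute the real ones.
checkpoint = "facebook/nllb-200-1.3B"
new_words = ["passe-partout"]  # in reality, the 10 new words

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# How the pretrained 250k-token SentencePiece vocab segments the new words.
for word in new_words:
    print(word, "->", tokenizer.tokenize(word))

# Optionally register them as whole tokens; the newly added embedding rows
# start untrained and still have to be learned during finetuning.
num_added = tokenizer.add_tokens(new_words)
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```

Part of my question is whether registering dedicated tokens like this even makes sense with so little data, or whether I should rely on the existing subword segmentation.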
I constructed a pilot dataset for finetuning the model on an existing language pair. It consists of 30 entries in total: 10 of them are the new words on their own, and the remaining 20 are example sentences using these words (two examples per word).
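For clarity, here is a minimal sketch of the layout I mean (the field names and the strings are placeholders, with "passe-partout" standing in for one of the real target words):

```python
from datasets import Dataset

# Hypothetical layout (field names and strings are placeholders):
# 10 bare word entries plus two example sentences per word, 30 rows in total.
records = [
    {"src": "<new word in the source language>",
     "tgt": "passe-partout"},
    {"src": "<a source sentence using that word>",
     "tgt": "the painter used a passe-partout"},
    {"src": "<another source sentence using that word>",
     "tgt": "this photograph needs a special passe-partout"},
    # ... the remaining 9 words and their example sentences
]

dataset = Dataset.from_list(records)
```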
When testing my finetuned translator, it became clear that the model tries to choose the most contextually plausible word according to all the data it was trained on in aggregate - both the pretraining and the finetuning data.
As a result, the meaning sometimes gets confused, for example (compare actual vs expected translation):
the painter used the passport - the painter used a passe-partout
you need a special passport for this photo - this photograph needs a special passe-partout
he was waving at me - he was threatening me
he likes to scratch a bottle of beer - he likes to drink a bottle of beer
they like to mess with their friends - they like to drink alcohol with friends
if you’re going to throw away clothes… - if you spoil your clothes…
this spine bugger is always breaking his toys - this harmful child is always breaking his toys
this movie is too scary, not for children - this movie is too vulgar, not for children
Intuitively, it seems that finetuning a translator on a new word requires a large number of sentences containing that word. In practice, I need to either pull the model's weights towards this word during finetuning or set the finetuning up accordingly, e.g. add more epochs and allow a little overfitting.
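Concretely, by "setting up the finetuning" I mean something along these lines (a sketch only; the output path and every hyperparameter value are invented for illustration):

```python
from transformers import Seq2SeqTrainingArguments

# All values below are made up: many passes over the tiny set, a small
# learning rate so the weights only drift a little, and a bit of
# deliberate overfitting on the new words.
training_args = Seq2SeqTrainingArguments(
    output_dir="nllb-new-words",      # placeholder path
    num_train_epochs=20,              # many epochs over only 30 examples
    learning_rate=1e-5,               # small steps to limit forgetting
    per_device_train_batch_size=4,
    warmup_ratio=0.1,
    weight_decay=0.01,
    predict_with_generate=True,
    save_strategy="no",
)
```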
Are there any recommendations on how to construct a finetuning dataset of new words and of phrases using them? How large should it be to reach a translation quality comparable with what the pretrained (non-finetuned) model shows on its “native” test datasets?
Any literature?