M2M-100 fine-tuning

I’ve been experimenting with fine-tuning m2m100, hoping that fine-tuning on one language pair (e.g. en-fr) would improve that pair without affecting the rest of the model. I followed this Hugging Face guide for m2m100, which says to set the tokenizer language for multilingual models.
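
For reference, this is the kind of setup I mean by “set the tokenizer lang” (a minimal sketch, not my exact preprocessing code — the column names and max_length are placeholders):

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Per the guide: tell the tokenizer which languages the pair uses so it
# prepends the correct language codes to source and target sequences.
tokenizer.src_lang = "en"
tokenizer.tgt_lang = "fr"

def preprocess(batch):
    # "en" / "fr" are placeholder column names, not my real dataset schema
    model_inputs = tokenizer(batch["en"], max_length=128, truncation=True)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(batch["fr"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```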

However, this did not work out: after training, other language pairs were affected. (To check, I also tested an already fine-tuned en-fr model located here: NDugar/m2m100_418M-fr · Hugging Face.)

That model suffers the same issue as mine: translating another language pair, such as Spanish to Russian, brings French grammar and words into the resulting translation, so the languages seem to have blended together.
[IMAGE1]
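
Concretely, this is roughly how I’m testing the unrelated pair on that checkpoint (standard M2M-100 inference as in the model card; the Spanish sentence is just a placeholder, and I’m assuming the tokenizer files ship with that repo):

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# the already fine-tuned en-fr checkpoint linked above
model = M2M100ForConditionalGeneration.from_pretrained("NDugar/m2m100_418M-fr")
tokenizer = M2M100Tokenizer.from_pretrained("NDugar/m2m100_418M-fr")

# Spanish source sentence (an example, not the one in the screenshot)
tokenizer.src_lang = "es"
encoded = tokenizer("La vida es como una caja de bombones.", return_tensors="pt")

# ask for Russian output by forcing the target-language token
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("ru"),
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
# on the fine-tuned checkpoint, outputs like this come back with French mixed in
```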

I am a complete newcomer to NLP and AI in general, so apologies if this is a dumb post. I have noticed, however, that in my compute_metrics function, when it is called via .evaluate() before I even train on new data, the predictions for the data passed to the trainer are not in the target language but in a mix of seemingly random languages:
[IMAGE2 BELOW POST]

[IMAGE3 BELOW POST]
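
For context, the evaluation is wired up roughly like this (a minimal sketch, not my exact notebook: the toy dataset, output_dir, and returned metric are placeholders, and the forced_bos_token_id line is flagged in a comment as something I did not originally have):

```python
import numpy as np
from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    M2M100ForConditionalGeneration,
    M2M100Tokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
tokenizer.src_lang = "en"
tokenizer.tgt_lang = "fr"

# NOTE: not from my original script -- as I understand it, without this (or
# passing forced_bos_token_id to generate()) nothing forces the decoder to
# start in French when evaluate() runs generation.
model.config.forced_bos_token_id = tokenizer.get_lang_id("fr")

# toy eval split standing in for my real data
raw_eval = Dataset.from_dict(
    {"en": ["How are you today?"], "fr": ["Comment allez-vous aujourd'hui ?"]}
)

def preprocess(batch):
    model_inputs = tokenizer(batch["en"], max_length=128, truncation=True)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(batch["fr"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_eval = raw_eval.map(preprocess, batched=True, remove_columns=["en", "fr"])

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # this is where I print/inspect the predictions shown in the screenshots;
    # the returned value is a stand-in, not my real metric
    return {"num_eval_examples": len(decoded_preds)}

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="m2m100-en-fr",       # placeholder
        predict_with_generate=True,      # so evaluate() calls generate()
        per_device_eval_batch_size=8,
    ),
    eval_dataset=tokenized_eval,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.evaluate()  # predictions reach compute_metrics before any training step
```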

Essentially, what I’m asking is: is it possible to train m2m-100 on only one language pair while preserving the weights for all other language pairs / languages not involved? It seems that even training on the one pair causes the model to produce the wrong languages, or something else is affecting it.


[IMAGE3] - https://media.discordapp.net/attachments/954340654069202966/957116541617446933/unknown.png?width=965&height=488

[IMAGE2] - https://media.discordapp.net/attachments/954340654069202966/957116757041094686/unknown.png?width=895&height=488
The predictions var in question

I have the same problem with this code: Google Colab.
We must have done the same thing…

Hi, any progress with this?

+1 facing the same issue here