M2M100 12B performs worse than 1.2B


I evaluated the out-of-the-box performance of different M2M100 versions on some custom datasets. I observed that facebook/m2m100-12B-last-ckpt and facebook/m2m100-12B-avg-5-ckpt perform much worse than facebook/m2m100_1.2B.
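For reference, the comparison was run roughly like this (a minimal sketch following the generation API from the Hugging Face model cards; the checkpoint choice, batching, and decoding settings here are simplified):

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

def load_m2m100(checkpoint="facebook/m2m100_1.2B"):
    """Load any of the Hub checkpoints above (1.2B, 12B-last-ckpt, ...)."""
    tokenizer = M2M100Tokenizer.from_pretrained(checkpoint)
    model = M2M100ForConditionalGeneration.from_pretrained(checkpoint)
    return model, tokenizer

def translate(model, tokenizer, sentences, src_lang="en", tgt_lang="de"):
    """Translate a batch of sentences with a given source/target language."""
    tokenizer.src_lang = src_lang
    encoded = tokenizer(sentences, return_tensors="pt", padding=True)
    # forced_bos_token_id is how M2M100 selects the target language
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang)
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# model, tokenizer = load_m2m100("facebook/m2m100-12B-avg-5-ckpt")
# print(translate(model, tokenizer, ["Life is like a box of chocolates."]))
```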

Do you know why this happens? Are the weights of the m2m100 12B model not yet finalized?

Thank you!

Hi, I have the same experience with the M2M100 12B, 1.2B, and 400M versions. In my opinion, 12B truly outperforms on high-resource language pairs such as DE-EN and FR-EN. However, on lower-resource languages, 12B’s performance is not significantly different from 1.2B’s. In my own experience, 1.2B actually translates best from and to Malay.

Thank you for your answer, @kinetical.

I evaluated the models on the English-to-German FLORES dataset (GitHub - facebookresearch/flores: Facebook Low Resource (FLoRes) MT Benchmark).

This is how the models perform:

  • facebook/m2m100_1.2B: 35.39 BLEU
  • facebook/m2m100-12B-avg-5-ckpt: 12.44 BLEU

So the problem exists for high-resource language pairs as well: M2M100 12B’s score is much lower than 1.2B’s.
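For anyone reproducing these numbers: corpus BLEU is modified n-gram precision combined with a brevity penalty. A minimal stdlib sketch of the metric (for real evaluations use a standard implementation such as sacrebleu; this is only to show what the scores above measure):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: geometric mean of modified n-gram precisions
    (n = 1..max_n) times a brevity penalty. No smoothing applied."""
    match = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts = Counter(ngrams(h, n))
            r_counts = Counter(ngrams(r, n))
            # "modified" precision: clip each n-gram count by the reference
            match[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(total) == 0 or min(match) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)
```

A perfect hypothesis scores 100.0, and every unmatched n-gram pulls the score down, which is why a gap like 35.39 vs 12.44 reflects a large quality difference.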

Thank you!

Hi @evroschris98, thank you for sharing your results.

Recently I had another opportunity to compare 12B and 1.2B, and I found that model capacity is the key difference between the two. For a single language pair, 1.2B came out ahead in almost every comparison against 12B. However, when I modified the code to fine-tune on several pairs and directions (a group of geographically neighboring languages), 12B really shines: it has more than enough capacity to actually “memorize” all these languages.
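To make the multi-pair setup concrete, here is a hypothetical sketch of the data mixing I mean (the pair list and dataset shapes are illustrative, not my actual code): examples from every direction in the group are pooled and shuffled into one stream, so a single model trains on all of them.

```python
import random

def mixed_stream(datasets_by_pair, seed=0):
    """Pool (src_lang, tgt_lang, src_text, tgt_text) examples from every
    translation direction and shuffle them into one training stream."""
    examples = [
        (src, tgt, s, t)
        for (src, tgt), rows in datasets_by_pair.items()
        for (s, t) in rows
    ]
    random.Random(seed).shuffle(examples)
    return examples

# Illustrative group of geographically neighboring languages:
toy_data = {
    ("ms", "en"): [("Selamat pagi.", "Good morning.")],
    ("id", "en"): [("Terima kasih.", "Thank you.")],
    ("en", "ms"): [("Good night.", "Selamat malam.")],
}
```

Each item carries its own language codes, so the tokenizer can be configured per example when the batches are built.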

Hi, @kinetical.

I’m interested to know whether fine-tuning M2M100 affected the quality of the other translation directions in your case.

I’m fine-tuning on one language pair, and that pair works well, but it breaks all the other directions.

Could you also share your fine-tuning script? Maybe I’m doing something wrong.
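For comparison, the core of my loop looks roughly like this (a sketch based on the supervised training example in the M2M100 documentation; data loading, the optimizer step, and the language pair are simplified):

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

def finetune_step(model, tokenizer, src_texts, tgt_texts):
    """One supervised step: tokenize source and target, return the LM loss."""
    batch = tokenizer(
        src_texts,
        text_target=tgt_texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    # In a real loop: loss.backward(); optimizer.step(); optimizer.zero_grad()
    return model(**batch).loss

# tokenizer = M2M100Tokenizer.from_pretrained(
#     "facebook/m2m100_1.2B", src_lang="en", tgt_lang="de"
# )
# model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_1.2B")
# loss = finetune_step(model, tokenizer, ["Hello."], ["Hallo."])
```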