Fine-tuning MT5 on XNLI

Hi there,

I am trying to fine-tune an mT5-base model and evaluate it on the Spanish portion of the XNLI dataset.
My training dataset is the NLI dataset machine-translated to Spanish by a MarianMT model, so the quality isn't the best, but I have still managed to get good results training other models on it, such as XLM-RoBERTa.
Also, given the size of the NLI dataset, I am only training on 10% of it (keeping the same proportion of labels), which is still 40,000 examples.
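In case it helps anyone reproduce the setup: the stratified 10% subsample can be sketched in plain Python like this (function name and structure are my own, not from any particular library):

```python
import random
from collections import defaultdict

def stratified_sample(examples, labels, fraction=0.1, seed=42):
    """Take a fraction of (example, label) pairs, keeping label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex, lab in zip(examples, labels):
        by_label[lab].append(ex)
    sampled = []
    for lab, group in by_label.items():
        rng.shuffle(group)
        k = int(len(group) * fraction)
        sampled.extend((ex, lab) for ex in group[:k])
    rng.shuffle(sampled)  # mix the classes back together
    return sampled
```

(If you load the data with the `datasets` library, its `train_test_split` also supports stratification.)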

The problem is that training reaches a point where the loss gets stuck and the model always predicts the same class, so I am looking for hints on how to make training effective by changing parameters, or to hear whether anyone else has run into the same problem.

I have tried both AdamW and Adafactor, with learning rates ranging from 1e-3 to 1e-5, and I always get the same result.
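For reference, this is how I am formatting the examples, in case the problem is there. Since mT5 is text-to-text, each XNLI example becomes a (source, target) string pair; the task prefix and label words below are my own choice, not an official convention:

```python
# Hypothetical text-to-text formatting for XNLI fine-tuning with mT5.
LABEL_WORDS = {0: "entailment", 1: "neutral", 2: "contradiction"}

def format_xnli(premise, hypothesis, label_id):
    """Turn one XNLI example into a (source, target) string pair."""
    source = f"xnli premise: {premise} hypothesis: {hypothesis}"
    target = LABEL_WORDS[label_id]
    return source, target
```

The loss is then the usual cross-entropy over the target tokens.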

Any help will be appreciated. Thank you very much!


I had the same issue with mT5 (both small and base) on the BoolQ dataset (~9.5k training samples) and found something that may be useful to you.

No matter what settings I used, how long I trained, or whether I oversampled the minority class in the training set, all predictions on the validation set were the same. Interestingly, this only occurred when using boolean QA data with mT5. Other tasks such as SQuAD, or switching to T5, worked just fine.

So, I looked into how the pre-training stage differs between T5 and mT5. One notable difference is that mT5 uses no supervised pre-training. Since mT5 works fine on SQuAD, I trained for one epoch on SQuAD before proceeding to train on BoolQ with the settings described in the mT5 paper. This resolved the issue for me, and accuracy now improves as expected.
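The two-stage recipe can be sketched generically like this; `fit` is a stand-in for whatever training loop you actually use (e.g. a `Seq2SeqTrainer` run), and the names are hypothetical:

```python
def two_stage_finetune(model, fit, warmup_data, target_data):
    """Intermediate-task training: fit on an easier task before the target.

    `fit(model, data, epochs)` is assumed to run your training loop
    and return the updated model; it is not a real library function.
    """
    # Stage 1: one epoch on a task mT5 already handles well (e.g. SQuAD)
    model = fit(model, warmup_data, epochs=1)
    # Stage 2: the stubborn target task (e.g. BoolQ), normal schedule
    model = fit(model, target_data, epochs=10)
    return model
```

The point is only the ordering: a short pass on a task the model can already learn seems to be enough to unstick the decoder before the real fine-tuning.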

In short: Train on some other tasks first.

I hope it helps you too!