Training loss does not drop for MT5ForSequenceClassification


I am using mT5 for sequence classification for the first time. I am doing cross-lingual NLI on XNLI with the official MT5ForSequenceClassification class, which attaches a fully connected classification head to the decoder’s output. The same code worked well for mBART and T5 (English, French and German only). After switching to mT5 (I literally changed only the model name string), the training loss never dropped. I have tried learning rates from 1e-3 to 1e-6, with no luck.

Has anyone had a similar experience? Any suggestions?