XLM-R-large not converging on PAWS-X paraphrase dataset, but mBERT does

Hey everyone,

I tried training mBERT and XLM-R-large on the PAWS-X English paraphrase detection dataset, and it looks like XLM-R-large is not converging while mBERT does. I've tried tweaking the hyperparameters for XLM-R-large, but that doesn't seem to help either.

Attaching the training stats for both models below (note: I'm evaluating every 100 steps).

I've modified the Hugging Face run_glue Colab example to reproduce this behavior here. For hyperparameters, I'm following the ones used in the XTREME paper, where they report better results for XLM-R-large than for mBERT on PAWS-X.
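For reference, this is roughly how I'm configuring the run via `TrainingArguments` — a minimal sketch, where the learning rate, batch size, and epoch count are placeholders rather than the XTREME paper's exact values; only the every-100-steps evaluation matches what I described above:

```python
from transformers import TrainingArguments

# Sketch of my fine-tuning config for xlm-roberta-large on PAWS-X (en).
# Hyperparameter values below are illustrative placeholders.
args = TrainingArguments(
    output_dir="xlmr-large-pawsx",
    learning_rate=1e-5,               # placeholder; large models often need a smaller LR
    per_device_train_batch_size=16,   # placeholder
    num_train_epochs=5,               # placeholder
    evaluation_strategy="steps",
    eval_steps=100,                   # matches the every-100-steps evaluation above
    logging_steps=100,
)
```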

Would appreciate it if someone could take a quick look and share any suggestions.


I revisited this with the latest Hugging Face release and tried it with fp16, and it seems to work now.
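Concretely, the only change was enabling mixed precision. With the `run_glue.py` script that means adding the `--fp16` flag (sketch below; the file paths and hyperparameter values are placeholders, and `--fp16` requires a CUDA GPU):

```shell
# Same run_glue.py invocation as before, with --fp16 added.
# train.csv / dev.csv stand in for the PAWS-X English splits.
python run_glue.py \
  --model_name_or_path xlm-roberta-large \
  --train_file train.csv \
  --validation_file dev.csv \
  --do_train --do_eval \
  --learning_rate 1e-5 \
  --per_device_train_batch_size 16 \
  --evaluation_strategy steps --eval_steps 100 \
  --output_dir xlmr-large-pawsx \
  --fp16
```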

I also had a similar issue with RoBERTa-large models on XNLI and PAWS. I tried both fp16 and fp32, and every time one of them worked.