XLM-R large not converging on PAWS-X paraphrase dataset, but mBERT does

Hey everyone,

I tried training mBERT and XLM-R large on the PAWS-X English paraphrase detection dataset, and it looks like XLM-R large is not converging, while mBERT does. I’ve tried tweaking the hyperparameters for XLM-R large, but that doesn’t seem to help either.

Attaching train stats for both models below (note: I’m evaluating every 100 steps).

I’ve modified the HF run_glue Colab example to reproduce this behavior here. For hyperparameters, I’m following the ones used in the XTREME paper, where they report better results for XLM-R large than for mBERT on PAWS-X.
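For concreteness, the training setup looks roughly like this (a sketch using the HF `TrainingArguments` API; the exact values are my reading of the XTREME paper’s fine-tuning settings, so treat them as assumptions rather than the precise config in the Colab):

```python
from transformers import TrainingArguments

# Hyperparameters loosely following the XTREME paper's fine-tuning setup
# (assumed values -- the paper does not pin down every field here):
training_args = TrainingArguments(
    output_dir="pawsx-en-xlmr-large",
    learning_rate=2e-5,               # large models are often unstable here;
                                      # dropping to 1e-5 or 5e-6 is a common fix
    per_device_train_batch_size=32,
    num_train_epochs=5,
    warmup_ratio=0.1,                 # warmup also helps large-model stability
    evaluation_strategy="steps",
    eval_steps=100,                   # matches the every-100-steps eval above
    logging_steps=100,
    seed=42,                          # re-running with a few seeds is worth
                                      # trying, since large models sometimes
                                      # degenerate on some seeds only
)
```

The same arguments are passed to the `Trainer` for both models, with only the model name swapped between `xlm-roberta-large` and `bert-base-multilingual-cased`.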

I’d appreciate it if someone could take a quick look and share any suggestions.