I tried training mBERT and XLM-R large on the PAWS-X English paraphrase detection dataset, and it looks like XLM-R large is not converging, while mBERT does. I've tried tweaking the hyperparameters for XLM-R large, but that doesn't seem to help either.
Attaching training stats for both models below (note: I'm evaluating every 100 steps).
I've modified the HF run_glue Colab example to reproduce this behavior here. For hyperparameters, I'm following the ones used in the XTREME paper, where they report better results for XLM-R large than for mBERT on PAWS-X.
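For context, here is a minimal sketch of the kind of run_glue invocation involved. The file paths, output directory, and flag values below are illustrative placeholders, not my exact settings or the exact XTREME hyperparameters:

```shell
# Hypothetical invocation of the HF run_glue script on English PAWS-X,
# with XLM-R large as the pretrained model. Data file paths and
# hyperparameter values are placeholders for illustration only.
python run_glue.py \
  --model_name_or_path xlm-roberta-large \
  --train_file pawsx_en_train.json \
  --validation_file pawsx_en_dev.json \
  --do_train \
  --do_eval \
  --evaluation_strategy steps \
  --eval_steps 100 \
  --learning_rate 1e-5 \
  --per_device_train_batch_size 16 \
  --num_train_epochs 3 \
  --max_seq_length 128 \
  --output_dir ./xlmr-large-pawsx
```

Swapping `--model_name_or_path` to `bert-base-multilingual-cased` with the same flags is how I compare against mBERT.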
Would appreciate it if someone could take a quick look and share any suggestions.