Can't fine-tune RobertaForQuestionAnswering on a SQuAD-like dataset?

Hi all,

I have been trying to fine-tune RoBERTa (pretrained on Polish) on a machine-translated SQuAD v2 dataset, but with no luck. The training loss is very unstable and fluctuates a lot. The model I get after fine-tuning handles impossible questions pretty well, but has a hard time with answerable questions, and overall it gets ~45 EM and ~53 F1.
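(For context, the EM/F1 numbers above are the standard SQuAD-style metrics. Here's a minimal sketch of how I understand they're computed per prediction, roughly following the official SQuAD v2 evaluation logic; the normalization details are an approximation, not the exact official script:)

```python
import collections
import re
import string

def normalize(text):
    # Standard SQuAD-style normalization: lowercase, drop punctuation,
    # drop articles (a/an/the), collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    # EM: 1 if the normalized strings are identical, else 0.
    return int(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    # Token-level F1 over the normalized answers.
    pred_toks = normalize(pred).split()
    gold_toks = normalize(gold).split()
    if not pred_toks or not gold_toks:
        # Impossible questions: both answers should be empty strings.
        return float(pred_toks == gold_toks)
    common = collections.Counter(pred_toks) & collections.Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

So an answerable question only counts toward EM on an exact (normalized) span match, while F1 gives partial credit for token overlap, which is why the F1 number sits above EM.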
Has anyone had a similar issue and managed to fix it?
Thanks.