I have been trying to train a BERT model on the TyDi QA dataset (Arabic questions only) using the HF example script.
While everything runs and the model trains without errors, I noticed that when I train the same model, on the same data, with the same hyperparameters, but using the run_squad.py script from the google/bert repo, I get much higher results (+10 exact match).
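To rule out a scoring artifact in the exact-match gap, I compared how the two evaluation paths normalize answers. Below is a minimal sketch of the English SQuAD normalization and exact-match rule (lowercase, strip punctuation, drop articles, collapse whitespace); note that for Arabic answers the article-removal step is effectively a no-op, so normalization mostly reduces to punctuation and whitespace handling. The function names here are my own, not from either script.

```python
import re
import string

def normalize_answer(s):
    """SQuAD-style normalization: lowercase, remove punctuation,
    drop English articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)  # no effect on Arabic text
    return " ".join(s.split())

def exact_match(prediction, gold):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(gold))

print(exact_match("The Eiffel Tower!", "eiffel tower"))  # → 1
```

Both the `datasets` metric and the official script apply this same normalization, which is consistent with them producing identical scores on my predictions.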
I have linked a Colab with both setups ready to run: https://colab.research.google.com/drive/1AHz4mpDBSea92MJVb-GhudGhFvO1VgJs?usp=sharing .
I checked the evaluation scripts (the one from datasets and the official SQuAD script) and they both output the same scores. Hence the problem must come from either the preprocessing or the model (which I doubt).
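Since preprocessing is my main suspect, one concrete thing to compare between the two pipelines is how long contexts are split into overlapping windows. Here is a simplified sketch of SQuAD-style doc-stride chunking; `max_seq_len` and `doc_stride` are the knobs whose effective values may differ between the two scripts (the numbers below are illustrative assumptions, not either script's verified defaults, and the real implementations also reserve room for the question and special tokens).

```python
def doc_stride_windows(tokens, max_seq_len=384, doc_stride=128):
    """Split a token list into overlapping windows, as SQuAD-style
    preprocessing does for contexts longer than max_seq_len."""
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_seq_len])
        if start + max_seq_len >= len(tokens):
            break  # last window already covers the end of the context
        start += doc_stride
    return windows

tokens = list(range(1000))        # stand-in for context token ids
windows = doc_stride_windows(tokens)
print(len(windows))               # → 6 overlapping windows
```

Dumping the number of features per example from both pipelines (and the window boundaries for a few long Arabic contexts) should quickly show whether they are producing different training instances from the same data.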