`run_qa.py` achieves much lower performance than the original BERT `run_squad.py`

I have been trying to train a BERT model on the TyDi QA dataset (Arabic questions only) using the HF example script.

While everything runs and the model trains without errors, I noticed that when I train the same model, on the same data, with the same hyperparameters, but using the `run_squad.py` script from the google/bert repo, I get much better results (+10 exact match).

I linked a Colab with both scripts ready to run: https://colab.research.google.com/drive/1AHz4mpDBSea92MJVb-GhudGhFvO1VgJs?usp=sharing

I checked the evaluation scripts (the one from `datasets` and the official SQuAD script) and they both output the same scores. Hence the problem must be in either the preprocessing or the model (which I doubt).

The script has been tested on the SQuAD and SQuAD v2 datasets and gives roughly the same results as the previous one. The cause might be the fast tokenizers' offset mappings not working properly with Arabic; that would be the first thing I would check.

Maybe you could try to replicate the steps in this notebook on your dataset to check that all the preprocessing looks correct?

I checked whether the offsets correctly map back to the original tokens and found that `start_position` and `end_position` are sometimes off by 1 or 2 tokens (around 6,000 out of 14K examples were wrong).
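The check above can be sketched roughly as follows. This is a minimal stand-in, not the actual notebook code: `char_span_to_token_span` plays the role of mapping the gold answer's character span to token indices via the `offset_mapping` a HF fast tokenizer returns with `return_offsets_mapping=True`; the whitespace "tokenizer" at the bottom is only there to exercise the logic.

```python
# Sketch of the offset sanity check: recover the answer span from the
# character offsets and compare it against the gold answer text.

def char_span_to_token_span(offsets, answer_start, answer_end):
    """Map a character-level answer span to (start_position, end_position)."""
    start_position = end_position = None
    for i, (s, e) in enumerate(offsets):
        if start_position is None and s <= answer_start < e:
            start_position = i
        if s < answer_end <= e:
            end_position = i
    return start_position, end_position

def check_example(context, answer_text, answer_start, offsets):
    """Return True if the token span decodes back to the gold answer."""
    answer_end = answer_start + len(answer_text)
    sp, ep = char_span_to_token_span(offsets, answer_start, answer_end)
    if sp is None or ep is None:
        return False
    recovered = context[offsets[sp][0]:offsets[ep][1]]
    return recovered == answer_text

# Whitespace "tokenization" of a toy context, standing in for the
# offset_mapping of a real fast tokenizer.
context = "the capital of japan is tokyo"
offsets = []
pos = 0
for tok in context.split():
    start = context.index(tok, pos)
    offsets.append((start, start + len(tok)))
    pos = start + len(tok)

print(check_example(context, "tokyo", context.index("tokyo"), offsets))  # True
print(check_example(context, "okyo", context.index("okyo"), offsets))    # False
```

In the real check the loop runs over the whole dataset, counting the examples where the recovered span differs from the gold answer.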

This might be caused by the offset mappings being off by a few characters.

I then selected a failing example and checked whether the offsets map back to the original tokens, and everything looked correct.

For now, I worked around the issue by applying some preprocessing and cleaning to the data, which reduced the error count to ~600. I retrained and got good results: a +10 EM improvement.
I suspect the errors are caused by some weird characters that the cleaning step removes.
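A cleaning step of the kind described could look like the sketch below. This is a hypothetical example, not the author's actual preprocessing: it strips zero-width and control characters, which are a plausible cause of offsets drifting by a character or two. Note that if the context is cleaned this way, the gold `answer_start` character offsets must be recomputed against the cleaned text.

```python
import unicodedata

# Zero-width characters that are invisible but still count toward
# character offsets (ZWSP, ZWNJ, ZWJ, BOM). Hypothetical list.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def clean_text(text):
    """Remove zero-width and control characters (keeping newline/tab)."""
    out = []
    for ch in text:
        if ch in ZERO_WIDTH:
            continue
        if unicodedata.category(ch) == "Cc" and ch not in "\n\t":
            continue
        out.append(ch)
    return "".join(out)

print(clean_text("Tok\u200byo\x00!"))  # "Tokyo!"
```

Dropping ZWNJ (`\u200c`) is safe for offset debugging but can be meaningful text in some scripts, so the exact character set is a judgment call.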

The remaining errors occur when the answer is part of a longer word: for example, the answer is “japan” but the context contains “Japanese”. The BERT SQuAD preprocessing handles this with an `_improve_answer_span` function: https://github.com/aub-mind/arabert/blob/master/arabert/run_squad.py#L559
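The idea behind `_improve_answer_span` can be sketched like this (a simplified version of the logic in BERT's `run_squad.py`, not a verbatim copy): when the annotated span covers a whole word ("Japanese") but the answer is a sub-word ("japan"), search inside the span for the tightest sub-span whose tokens re-join to the tokenized answer. Here `tokenize` is a stand-in for the real WordPiece tokenizer.

```python
def improve_answer_span(doc_tokens, input_start, input_end, tokenize, orig_answer_text):
    """Shrink [input_start, input_end] to the tightest sub-span matching the answer."""
    tok_answer_text = " ".join(tokenize(orig_answer_text))
    for new_start in range(input_start, input_end + 1):
        for new_end in range(input_end, new_start - 1, -1):
            text_span = " ".join(doc_tokens[new_start:new_end + 1])
            if text_span == tok_answer_text:
                return new_start, new_end
    # No better match found: keep the original span.
    return input_start, input_end

# Toy WordPiece-style tokenization where "Japanese" splits into "japan" + "##ese".
doc_tokens = ["the", "japan", "##ese", "economy"]
tokenize = lambda s: [s]  # stand-in: "japan" tokenizes to itself here
print(improve_answer_span(doc_tokens, 1, 2, tokenize, "japan"))  # (1, 1)
```

With the annotated span covering tokens 1–2 ("japan ##ese"), the search tightens it to just token 1, which matches the tokenized answer exactly.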

This has been discussed briefly here: https://github.com/huggingface/transformers/issues?q=_improve_answer_span

Here is the link to the Colab I experimented in: https://colab.research.google.com/drive/1UWwQkf_xBzkoLzPfB2GRs8OPR1Wtvz1Q#scrollTo=FU-YMtp_t0ec