How to reproduce the performance of bert-large-uncased-whole-word-masking-finetuned-squad?

I have been trying to reproduce the results of bert-large-uncased-whole-word-masking-finetuned-squad · Hugging Face. The model card reports f1 = 93.15 and exact_match = 86.91, but I am getting "f1": 43.75 and "exact": 39.02. I have been scratching my head for a few days trying to figure out why there is such a big difference in performance. I am attaching the Colab notebook here: Google Colaboratory
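For anyone comparing numbers: the f1 and exact_match figures on the model card come from the standard SQuAD v1.1 metric, which normalizes answers (lowercasing, stripping punctuation and articles) before scoring. A minimal sketch of that computation, so you can sanity-check individual predictions outside the evaluation script (function names here are illustrative, not from any particular library):

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lowercase, remove punctuation, articles, and extra whitespace (SQuAD v1 convention)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, ground_truth):
    """1.0 if the normalized strings match exactly, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))

def f1_score(prediction, ground_truth):
    """Token-overlap F1 between normalized prediction and ground truth."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Normalization makes these count as perfect matches:
print(f1_score("the Denver Broncos", "Denver Broncos"))    # 1.0
print(exact_match("Denver Broncos.", "the denver broncos"))  # 1.0
```

If individual predictions look right under this metric but the aggregate score is still low, the gap is more likely in the preprocessing/postprocessing (e.g. how long contexts are split and answers mapped back) than in the metric itself.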

What am I missing? Any help is highly appreciated.