deepset/bert-base-cased-squad2 F1/EM scores

When I evaluate this model on the SQuAD 2.0 validation set, I get much higher F1/EM scores than the ones reported in its model card. Any ideas why that is?
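
For reference, here is a minimal sketch of how per-example EM and F1 are usually computed for SQuAD-style data. It assumes the normalization used by the official SQuAD evaluation script (lowercase, strip punctuation, drop English articles, collapse whitespace) and the SQuAD 2.0 convention that an empty string means "no answer". The function names are my own, and I don't know whether the model card numbers were produced this way.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> int:
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Token-level F1 between a predicted answer and one reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # Assumed SQuAD 2.0 convention: if either side is empty ("no answer"),
    # the score is 1.0 only when both sides are empty.
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower.", "eiffel tower"))
    print(f1_score("in Paris France", "Paris"))
    print(f1_score("", ""))
```

When a question has several reference answers, the usual approach is to take the maximum score over them and then average across the dataset. Since an empty prediction on an unanswerable question scores 1.0 here, how the no-answer questions are handled can shift the totals quite a bit, so it is worth checking that both sets of numbers use the same convention.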