According to the model card for twmkn9/distilroberta-base-squad2 · Hugging Face, the exact-match and F1 scores were computed during eval on a total of 6078 instances. However, the official SQuAD2 dev set has 11873 instances (see the official website).
I searched Google for a subset with 6078 instances and found squad/data at master · elgeish/squad · GitHub, which contains exactly 6078 instances.
But even with this subset, the exact-match and F1 scores I’m getting are only around 58 and 62 using the same run_squad.py script, while the reported numbers are 70 and 74.
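For sanity-checking the numbers independently of run_squad.py, below is a minimal re-implementation of the per-example exact-match and F1 computation in the style of the official SQuAD evaluation script (lowercasing, stripping punctuation and articles, token-overlap F1). This is a sketch for cross-checking, not the exact official script:

```python
import collections
import re
import string

def normalize_answer(s: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and
    articles (a/an/the), collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> int:
    """1 if the normalized prediction equals the normalized gold answer."""
    return int(normalize_answer(pred) == normalize_answer(gold))

def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between prediction and gold answer."""
    pred_toks = normalize_answer(pred).split()
    gold_toks = normalize_answer(gold).split()
    if not pred_toks or not gold_toks:
        # SQuAD2 unanswerable case: empty string means "no answer";
        # score is 1 only if both sides are empty.
        return float(pred_toks == gold_toks)
    common = collections.Counter(pred_toks) & collections.Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

Averaging these per-example scores over the dev set (taking the max over gold answers for answerable questions) should reproduce the aggregate exact and F1 numbers, which makes it easier to pin down whether the discrepancy comes from the model or from the dataset split.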
Could someone confirm whether the published model is properly trained, and clarify which dev dataset was used for the reported numbers?