I am fine-tuning BERT-base on SQuADv1 and found that if I hold out the validation data (used for evaluation during training) from the training set, the F1 on it is consistently lower than the F1 on the test set (the split labeled ‘validation’ in the dataset). If I instead take the validation data from the test set (the ‘validation’ split of SQuADv1), I get the same or a higher F1, which is what I would expect.
The only explanation I can think of is a distribution shift between the training and validation sets of SQuADv1.
Is that the case?
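
For concreteness, the first setup described above (holding out part of the training set as validation data) can be sketched roughly as follows. This is only a minimal illustration, not the exact code used: the function name `hold_out_validation`, the 10% fraction, the seed, and the toy records standing in for SQuAD examples are all hypothetical.

```python
import random


def hold_out_validation(examples, fraction=0.1, seed=0):
    """Split a list of examples into (train, validation) parts.

    Hypothetical helper: shuffles indices with a fixed seed and holds
    out `fraction` of the examples as validation data.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(len(examples) * fraction))
    val_idx = set(indices[:n_val])
    # Preserve the original ordering inside each part.
    train = [ex for i, ex in enumerate(examples) if i not in val_idx]
    validation = [ex for i, ex in enumerate(examples) if i in val_idx]
    return train, validation


# Toy stand-ins for SQuAD training records (made-up data, not real SQuAD).
toy_examples = [
    {"id": str(i), "question": f"q{i}", "answers": [f"a{i}"]}
    for i in range(20)
]
train_part, val_part = hold_out_validation(toy_examples, fraction=0.1, seed=0)
```

The held-out part is then evaluated with the same F1 metric as the official ‘validation’ split, so the two scores can be compared directly.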
All the best,