Different evaluation results during and after training: Wav2Vec2 finetuning


With the notebook Fine-Tune Wav2Vec2 for English ASR with :hugs: Transformers, I get different results between evaluation during training (via compute_metrics) and evaluation after training, on the same test dataset. Please find the details below.

  • To save time, I select part of the dataset with

timit["train"] = timit["train"].select(range(1000))
timit["test"] = timit["test"].select(range(500))

  • At the end of training, the Trainer logs

Loading best model from ./checkpoint-2500 (score: 0.605413556098938).

This score is the validation loss; the WER that compute_metrics reported for that checkpoint is 0.439.
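For reference, WER is word-level edit distance divided by the number of reference words. A minimal sketch in plain Python (no `datasets`/`jiwer` dependency; the function name `wer` here is just illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over words
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1] / len(ref)

print(wer("a b c", "a x c"))  # 1 substitution over 3 reference words -> 1/3
```

One thing worth checking: the `wer` metric used by compute_metrics is, as far as I know, corpus-level (total edits divided by total reference words), so averaging per-utterance WERs in a `map()`-based evaluation would give a different number on the same predictions.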

  • With the loaded best model, I evaluate by mapping a prediction function over the test set with map(). This gives

Test WER: 0.386
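In case it helps to localize the gap: both code paths should reduce logits to text the same way (argmax over the vocabulary, then a CTC-style decode as `processor.batch_decode` does). A toy sketch of greedy CTC decoding, with a made-up vocabulary (`ID_TO_CHAR`, `PAD_ID`, and the ids are hypothetical; Wav2Vec2 uses its pad token as the CTC blank):

```python
import itertools

# Hypothetical toy vocab; in the notebook the processor's tokenizer holds the real one.
ID_TO_CHAR = {1: "a", 2: "b", 3: "|"}  # "|" is the word delimiter
PAD_ID = 0  # pad doubles as the CTC blank

def ctc_greedy_decode(ids):
    """Collapse repeated ids, drop the blank/pad, map the rest to characters."""
    collapsed = (k for k, _ in itertools.groupby(ids))
    chars = [ID_TO_CHAR[i] for i in collapsed if i != PAD_ID]
    return "".join(chars).replace("|", " ").strip()

print(ctc_greedy_decode([1, 1, 0, 2, 0, 3, 3, 1]))  # "ab a"
```

If the two evaluations handle padding differently (e.g., compute_metrics replaces -100 label ids with the pad token before decoding, while the `map()` path decodes raw predictions, or the batch sizes differ), the decoded strings, and hence the WER, can diverge.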

I expected the two numbers to be almost the same. If you have any ideas about what causes the gap, please leave a comment.