Using the notebook "Fine-Tune Wav2Vec2 for English ASR with Transformers", I get different WER results on the same test dataset depending on whether I evaluate during training with compute_metrics or after training. Please find the details below.
- To save time, I select part of the dataset with
timit['train'] = timit['train'].select(range(1000))
timit['test'] = timit['test'].select(range(500))
Without changing anything else, I get the following training results.
With load_best_model_at_end=True in TrainingArguments, the best model is loaded as:
Loading best model from ./checkpoint-2500 (score: 0.605413556098938).
This score is the validation loss, and the WER at that checkpoint is 0.439.
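For context, the score in that log line is a loss because, by default, load_best_model_at_end picks the checkpoint by evaluation loss unless metric_for_best_model is set. A minimal sketch of the settings that would select by WER instead is shown here as a plain dict of keyword arguments for TrainingArguments. The variable name best_model_kwargs is hypothetical, and the "wer" key assumes compute_metrics returns a dict with that key, as in the notebook.

```python
# Keyword arguments that could be passed to transformers.TrainingArguments
# alongside the notebook's existing settings (sketch, not the notebook's own code).
best_model_kwargs = dict(
    load_best_model_at_end=True,
    # Assumption: compute_metrics returns {"wer": ...}, so "wer" is a valid metric name.
    metric_for_best_model="wer",
    # Lower WER is better, so the best checkpoint is the one with the smallest value.
    greater_is_better=False,
)

# Usage (assumed, not run here): TrainingArguments(output_dir=..., **best_model_kwargs)
```

This only changes which checkpoint is reloaded at the end of training; it does not by itself explain the gap between the two WER values.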
- With the loaded model, I run the map() function to evaluate on the test set and get the following result:
Test WER: 0.386
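To rule out a difference in how the two numbers are aggregated, the WER can be recomputed by hand from the decoded predictions and reference transcripts. The sketch below is a plain-Python illustration, not the metric implementation used in the notebook. The function names are hypothetical, and it assumes whitespace-tokenized text with non-empty references. It shows that a corpus-level WER (total errors over total reference words) and a mean of per-utterance WERs can differ on the same data.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(references, predictions):
    """Total word errors divided by total reference words."""
    errors = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words = ref.split()
        errors += word_edit_distance(ref_words, hyp.split())
        total_words += len(ref_words)
    return errors / total_words


def mean_utterance_wer(references, predictions):
    """Average of per-utterance WERs (assumes no empty reference)."""
    scores = [
        word_edit_distance(ref.split(), hyp.split()) / len(ref.split())
        for ref, hyp in zip(references, predictions)
    ]
    return sum(scores) / len(scores)


# Toy data for illustration only, not taken from TIMIT.
references = ["the cat sat on the mat", "hello"]
predictions = ["the cat sat on mat", "hallo"]
print(corpus_wer(references, predictions))
print(mean_utterance_wer(references, predictions))
```

Running both functions on the same list of predictions and references from the test set would show whether the aggregation alone can account for a gap like 0.439 versus 0.386.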
I expected the two results to be almost the same. If you have any ideas about what causes the difference, please leave a comment.