Inconsistent evaluation result (WER) when fine-tuning wav2vec2 pretrained model

Hi guys. I am trying to fine-tune the pre-trained wav2vec2 model (facebook/wav2vec2-large-lv60) for the ASR task. I followed this article (written by @patrickvonplaten) to train and evaluate the model. While trainer.train() is running, the reported evaluation WER is about 18% (you can refer to here). However, when training finished and I re-evaluated the same evaluation set (following the same procedure as the guiding article), the WER came out at 40%. I am confused by this inconsistency and would like to know whether anyone has had a similar experience. Any help would be appreciated!

By the way, I trained on limited data (the first 300 samples of TIMIT's original training set) and evaluated on a different subset (samples 301 to 350 of the same training set). The compute_metrics function passed to the trainer and the final evaluation procedure are the same as in the guiding article, but let me re-post them here for your convenience:

    import numpy as np  # the snippet assumes `processor` and `wer_metric` are already defined

    def compute_metrics(pred):
        pred_logits = pred.predictions
        pred_ids = np.argmax(pred_logits, axis=-1)

        pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
        pred_str = processor.batch_decode(pred_ids)
        # we do not want to group tokens when computing the metrics
        label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

        wer = wer_metric.compute(predictions=pred_str, references=label_str)

        return {"wer": wer}
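For context on why the labels are decoded with `group_tokens=False`: the predictions come from a CTC head, so repeated characters must be collapsed and blank/pad tokens removed, whereas the labels are already plain text. A minimal sketch of that collapse step, using a toy token list standing in for the tokenizer's vocabulary (not the actual `batch_decode` implementation):

```python
import itertools

# The Wav2Vec2 CTC tokenizer uses the pad token as the CTC blank.
PAD = "<pad>"

def ctc_collapse(tokens):
    """Collapse consecutive repeats, then drop blank tokens."""
    deduped = [t for t, _ in itertools.groupby(tokens)]
    return "".join(t for t in deduped if t != PAD)

# Frame-level greedy predictions -> collapsed transcript.
frames = ["h", "h", PAD, "e", "l", PAD, "l", "l", "o"]
print(ctc_collapse(frames))  # -> "hello"
```

Decoding the label ids with grouping enabled would wrongly merge legitimate double letters in the reference text, which is why the article disables it for labels.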

    def map_to_result(batch):
        with torch.no_grad():
            input_values = torch.tensor(batch["input_values"], device="cuda").unsqueeze(0)
            logits = model(input_values).logits

        pred_ids = torch.argmax(logits, dim=-1)
        # batch_decode returns a list; take the single element of this one-sample batch
        batch["pred_str"] = processor.batch_decode(pred_ids)[0]
        batch["text"] = processor.decode(batch["labels"], group_tokens=False)
        return batch
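To rule out a difference in the metric computation itself between the two evaluations, one sanity check is to recompute WER by hand on a few prediction/reference pairs from map_to_result's output. Here is a minimal word-level WER (Levenshtein edit distance over whitespace-split words), independent of wer_metric:

```python
# Minimal word-level WER sanity check; assumes plain whitespace-tokenized
# strings, like the `pred_str` / `text` columns produced by map_to_result.

def word_error_rate(prediction: str, reference: str) -> float:
    hyp, ref = prediction.split(), reference.split()
    # dp[j] = edit distance between the hypothesis consumed so far and ref[:j]
    dp = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        prev, dp[0] = dp[0], i
        for j, r in enumerate(ref, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # drop hypothesis word
                        dp[j - 1] + 1,      # drop reference word
                        prev + (h != r))    # substitution or match
            prev = cur
    return dp[-1] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat on the mat"))  # -> 0.5
```

If the hand-computed numbers agree with wer_metric.compute on the same pairs, the discrepancy lies in the model outputs or preprocessing rather than in the metric.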