Evaluation results (metric) during training are different from the evaluation results at the end

Hi, I'm trying to train the T5-small model on the summarization task using the Seq2SeqTrainer.
(I'm currently referencing the example from transformers/examples/pytorch/summarization at v4.15.0 · huggingface/transformers · GitHub.)

While doing so, I found out that the final model performance on the validation set (best model loaded) is different from the evaluation results during training.

I set the --load_best_model_at_end argument to True, which, as I understand it, makes the trainer load the best checkpoint for the final evaluation.

However, the final performance (Rouge1, 2, L) is much higher than the evaluation results reported during training (every 500 steps).

What might be the reason?

The training script is:

    python ../run_summarization.py \
        --model_name_or_path t5-small \
        --do_train \
        --do_eval \
        --do_predict \
        --dataset_name ccdv/cnn_dailymail \
        --dataset_config "3.0.0" \
        --source_prefix "summarize: " \
        --output_dir ${OUTPUT_DIR} \
        --per_device_train_batch_size 32 \
        --per_device_eval_batch_size 64 \
        --gradient_accumulation_steps 2 \
        --max_steps ${MAX_STEPS} \
        --predict_with_generate \
        --evaluation_strategy "steps" \
        --logging_strategy "steps" \
        --save_strategy "steps" \
        --eval_steps 500 \
        --logging_steps 500 \
        --save_steps 500 \
        --load_best_model_at_end \
        --metric_for_best_model "eval_rougeLsum" \
        --save_total_limit 1 \
        --num_beams 1 \
        --dropout_rate 0.1 \
        --preprocessing_num_workers 8 \
        --learning_rate 5e-5
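For clarity, my understanding of what --load_best_model_at_end does can be sketched in plain Python: the trainer logs metric_for_best_model at each eval step and, since ROUGE is a greater-is-better metric, reloads the checkpoint with the highest score before the final evaluation. This is an illustrative sketch of the selection logic, not the actual Trainer internals; the numbers are made up.

```python
# Illustrative sketch (not the actual Trainer code): how the checkpoint
# is picked when metric_for_best_model is greater-is-better, like ROUGE.
eval_history = {
    500: {"eval_rougeLsum": 0.21},   # hypothetical logged metrics
    1000: {"eval_rougeLsum": 0.24},
    1500: {"eval_rougeLsum": 0.23},
}

def best_checkpoint_step(history, metric="eval_rougeLsum", greater_is_better=True):
    """Return the step whose logged metric value is best."""
    score = lambda step: history[step][metric]
    return max(history, key=score) if greater_is_better else min(history, key=score)

# The checkpoint from this step would be reloaded for the final evaluation.
print(best_checkpoint_step(eval_history))
```

So the final numbers should match the best intermediate eval, not exceed every one of them, which is what makes the gap puzzling.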

I also ran into the same problem. It doesn't matter which checkpoint I take, the eval results while training are different from the eval results when calling trainer.evaluate() (exactly the same dataset, in the same run where do_train and do_eval are both enabled).
Interestingly, it only happens with T5 (and not with Pegasus, for example).
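One way to narrow this down (a sketch, assuming checkpoints are saved under OUTPUT_DIR as checkpoint-&lt;step&gt;, which the step numbers may not match exactly in your run) is to re-run the same script in eval-only mode against one intermediate checkpoint, keeping the generation settings identical, and compare the numbers with that step's in-training log:

```shell
# Sketch: evaluate one saved checkpoint with the same script and the
# same generation settings (predict_with_generate, num_beams) as training.
python ../run_summarization.py \
    --model_name_or_path ${OUTPUT_DIR}/checkpoint-500 \
    --do_eval \
    --dataset_name ccdv/cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir ${OUTPUT_DIR}/eval_ckpt_500 \
    --per_device_eval_batch_size 64 \
    --predict_with_generate \
    --num_beams 1
```

If these numbers reproduce the 500-step log but differ from trainer.evaluate() in the combined run, the discrepancy lies in the final evaluation path rather than in the checkpoints themselves.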