Evaluation results (metrics) during training are different from the evaluation results at the end

jspark93 · March 6, 2022, 5:24am

Hi, I’m trying to train the T5-small model on the summarization task using the seq2seq trainer.
(I’m currently referencing the example from transformers/examples/pytorch/summarization at v4.15.0 · huggingface/transformers · GitHub)

While doing so, I found out that the final model performance on the validation set (best model loaded) is different from the evaluation results during training.

I set the argument --load_best_model_at_end as True, so from my understanding, this makes the trainer call the best checkpoint for the final evaluation.

However, the final performance (Rouge1, 2, L) is much higher than the evaluation results reported during training (every 500 steps).

What might be the reason?

The training script (a Makefile target) is:

train_summarization_cnn:
	python ../run_summarization.py \
	    --model_name_or_path t5_small \
	    --do_train \
	    --do_eval \
	    --do_predict \
	    --dataset_name ccdv/cnn_dailymail \
	    --dataset_config "3.0.0" \
	    --source_prefix "summarize: " \
	    --output_dir ${OUTPUT_DIR} \
	    --per_device_train_batch_size 32 \
	    --per_device_eval_batch_size 64 \
	    --gradient_accumulation_steps 2 \
	    --max_steps ${MAX_STEPS} \
	    --predict_with_generate \
	    --evaluation_strategy "steps" \
	    --logging_strategy "steps" \
	    --save_strategy "steps" \
	    --eval_steps 500 \
	    --logging_steps 500 \
	    --save_steps 500 \
	    --load_best_model_at_end \
	    --metric_for_best_model "eval_rougeLsum" \
	    --save_total_limit 1 \
	    --num_beams 1 \
	    --dropout_rate 0.1 \
	    --preprocessing_num_workers 8 \
	    --learning_rate 5e-5 \
	    --overwrite_output_dir

Eran · May 31, 2022, 6:23pm

I ran into the same problem. No matter which checkpoint I take, the eval results during training differ from the eval results when calling trainer.evaluate() (exactly the same dataset, in the same run, with do_train and do_eval both enabled).
Interestingly, it only happens with T5 (and not with Pegasus, for example).

per · July 13, 2022, 10:02am

Hi @Eran, were you able to figure out what is causing the problem?
I am facing the same issue…

Thanks!

Elron · July 13, 2022, 10:41am

It seems that, at least in @jspark93's case, this behavior is intentional. You can choose the number of beams separately for evaluation during training and for evaluation after training.
The number of beams for evaluation during training is set with --generation_num_beams, and the number of beams for evaluation after training is set with --num_beams. If you want the same behavior in both, it's better to set only --generation_num_beams. This is indeed confusing and not well documented.

The line in run_summarization.py that causes this behavior:

num_beams = data_args.num_beams if data_args.num_beams is not None else training_args.generation_num_beams

Additionally, it looks like --max_length and --generation_max_length can cause the same kind of discrepancy.

It is also worth mentioning that when only --max_length is set, there is no explicit max length restriction at training time, and the trainer may fall back to the value in the model config if one is set. Since different models ship with different configs, this can make the discrepancy show up for some models and not others, as @Eran experienced.
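
To make the fallback order concrete, here is a minimal sketch of how the two beam counts could end up different. It is not the actual script or trainer code: DataArgs, TrainingArgs, and both helper functions are hypothetical stand-ins, and the fallback to the model config when nothing is set is an assumption based on the behavior described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataArgs:
    # Hypothetical stand-in for the script's data arguments (--num_beams).
    num_beams: Optional[int] = None


@dataclass
class TrainingArgs:
    # Hypothetical stand-in for the training arguments (--generation_num_beams).
    generation_num_beams: Optional[int] = None


def beams_for_training_eval(training_args: TrainingArgs, config_num_beams: int) -> int:
    # Periodic evaluation during training only sees --generation_num_beams.
    # Assumption: if it is unset, the model config value is used.
    if training_args.generation_num_beams is not None:
        return training_args.generation_num_beams
    return config_num_beams


def beams_for_final_eval(
    data_args: DataArgs, training_args: TrainingArgs, config_num_beams: int
) -> int:
    # Same precedence as the quoted line: --num_beams wins over
    # --generation_num_beams for the evaluation after training.
    num_beams = (
        data_args.num_beams
        if data_args.num_beams is not None
        else training_args.generation_num_beams
    )
    # Assumption: fall back to the model config if neither flag is set.
    return num_beams if num_beams is not None else config_num_beams


# Illustrative values only: --num_beams 1 is set, --generation_num_beams is not,
# and the model config is assumed to specify 4 beams.
data_args = DataArgs(num_beams=1)
training_args = TrainingArgs(generation_num_beams=None)
during = beams_for_training_eval(training_args, config_num_beams=4)
final = beams_for_final_eval(data_args, training_args, config_num_beams=4)
print(during, final)
```

With these illustrative values the two evaluations decode with different beam counts, which on its own can shift ROUGE. If only --generation_num_beams is set (and --num_beams is left unset), both helpers resolve to the same value. The same reasoning would apply to the max length flags.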

Weiheng · September 26, 2022, 9:13am

Hi, I ran into the same issue when using EncoderDecoderModel (loaded from pretrained BERT) with Seq2SeqTrainer.

It seems that the difference comes from model.predict() vs. model.generate().
My guess is that your best result (the final performance after training) comes from model.generate(), while the evaluation results during training come from model.predict().
