Evaluation results (metrics) during training are different from the evaluation results at the end

Hi, I'm trying to train the T5-small model on the summarization task using the seq2seq trainer.
(I'm currently referencing the example from transformers/examples/pytorch/summarization at v4.15.0 · huggingface/transformers · GitHub)

While doing so, I found out that the final model performance on the validation set (best model loaded) is different from the evaluation results during training.

I set the argument --load_best_model_at_end as True, so from my understanding, this makes the trainer call the best checkpoint for the final evaluation.

However, the final performance (Rouge1, 2, L) is much higher than the evaluation results reported during training (every 500 steps).

What might be the reason?

The training script is:

    python ../run_summarization.py \
        --model_name_or_path t5_small \
        --do_train \
        --do_eval \
        --do_predict \
        --dataset_name ccdv/cnn_dailymail \
        --dataset_config "3.0.0" \
        --source_prefix "summarize: " \
        --output_dir ${OUTPUT_DIR} \
        --per_device_train_batch_size 32 \
        --per_device_eval_batch_size 64 \
        --gradient_accumulation_steps 2 \
        --max_steps ${MAX_STEPS} \
        --predict_with_generate \
        --evaluation_strategy "steps" \
        --logging_strategy "steps" \
        --save_strategy "steps" \
        --eval_steps 500 \
        --logging_steps 500 \
        --save_steps 500 \
        --load_best_model_at_end \
        --metric_for_best_model "eval_rougeLsum" \
        --save_total_limit 1 \
        --num_beams 1 \
        --dropout_rate 0.1 \
        --preprocessing_num_workers 8 \
        --learning_rate 5e-5

I also ran into the same problem. It doesn't matter which checkpoint I take: the eval results during training differ from the eval results when calling trainer.evaluate() (exactly the same dataset, in the same run, with do_train and do_eval both enabled).
Interestingly, it only happens with T5 (and not with Pegasus, for example).

Hi @Eran, were you able to figure out what is causing the problem?
I am facing the same issue…


It seems that, at least in @jspark93's case, this behavior is intentional. You can choose the number of beams separately for evaluation during training and for evaluation after training.
The number of beams for evaluation during training is set with --generation_num_beams, and the number of beams for evaluation after training is set with --num_beams. If you want the same behavior in both, it's better to set only --generation_num_beams. This is indeed very confusing and not well documented.

The line in run_summarization.py that causes this behavior:

num_beams = data_args.num_beams if data_args.num_beams is not None else training_args.generation_num_beams
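To make the effect of that line concrete, here is a minimal, self-contained sketch of how the two evaluation paths could end up with different beam counts. The function names and the `model_config_num_beams` argument are hypothetical stand-ins, not the library's API, and the in-training fallback to the model config is an assumption based on the description in this thread.

```python
from typing import Optional


def final_eval_num_beams(num_beams: Optional[int],
                         generation_num_beams: Optional[int]) -> Optional[int]:
    """Mirror of the quoted script line: --num_beams wins for the final evaluation."""
    return num_beams if num_beams is not None else generation_num_beams


def in_training_num_beams(generation_num_beams: Optional[int],
                          model_config_num_beams: int) -> int:
    """Assumed behavior: periodic evaluation never sees --num_beams and
    falls back to the model config when --generation_num_beams is unset."""
    if generation_num_beams is not None:
        return generation_num_beams
    return model_config_num_beams


# Hypothetical run A: only --num_beams 4 is passed, model config says 1 beam.
mismatch = (in_training_num_beams(None, 1), final_eval_num_beams(4, None))

# Hypothetical run B: only --generation_num_beams 4 is passed.
consistent = (in_training_num_beams(4, 1), final_eval_num_beams(None, 4))

print("only --num_beams:", mismatch)
print("only --generation_num_beams:", consistent)
```

In run A the two paths disagree, which is the kind of gap reported above; in run B both paths use the same value.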

Additionally, it looks like --max_length and --generation_max_length can also cause such a discrepancy.

Also worth mentioning: when only --max_length is set, there is no explicit max length restriction for evaluation during training, and the trainer may fall back to the model config if a value is set there. This can produce different behavior between models, as @Eran experienced.
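A small sanity check along these lines could flag a run where only the script-level flag was set. This is only a sketch: `find_unpaired_flags` is a hypothetical helper, and the flag pairs are taken from the posts above rather than verified against the script.

```python
from typing import Dict, List, Optional

# Pairs of (flag that affects only the final evaluation, flag the trainer uses
# during training), as named in this thread. Treat them as assumptions.
FLAG_PAIRS = [
    ("num_beams", "generation_num_beams"),
    ("max_length", "generation_max_length"),
]


def find_unpaired_flags(args: Dict[str, Optional[int]]) -> List[str]:
    """Return one warning per flag pair whose two values disagree."""
    warnings = []
    for final_only, trainer_flag in FLAG_PAIRS:
        final_value = args.get(final_only)
        trainer_value = args.get(trainer_flag)
        # Only complain when the final-evaluation flag is set and differs.
        if final_value is not None and final_value != trainer_value:
            warnings.append(
                f"--{final_only}={final_value} but "
                f"--{trainer_flag}={trainer_value}: in-training and final "
                "evaluation may generate differently"
            )
    return warnings


# Hypothetical argument set resembling the command in the first post.
for message in find_unpaired_flags({"num_beams": 1}):
    print(message)
```

Passing the same value for both flags of a pair (or only the generation_* one) yields no warnings.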

Hi, I ran into the same issue when using EncoderDecoderModel (loaded from pretrained BERT) with Seq2SeqTrainer.

It seems that the difference comes from model.predict() vs. model.generate().
I guess your best result (final performance after training) is from model.generate(), while the evaluation results during training are from model.predict().