Evaluation results (metric) during training are different from the evaluation results at the end

Hi, I'm trying to train the T5-small model on the summarization task using the Seq2SeqTrainer.
(I'm currently referencing the example from transformers/examples/pytorch/summarization at v4.15.0 · huggingface/transformers · GitHub.)

While doing so, I found out that the final model performance on the validation set (best model loaded) is different from the evaluation results during training.

I set the --load_best_model_at_end argument to True, which, as I understand it, makes the trainer load the best checkpoint for the final evaluation.

However, the final performance (Rouge1, 2, L) is much higher than the evaluation results reported during training (every 500 steps).

What might be the reason?

The training script is:

    python ../run_summarization.py \
        --model_name_or_path t5-small \
        --do_train \
        --do_eval \
        --do_predict \
        --dataset_name ccdv/cnn_dailymail \
        --dataset_config "3.0.0" \
        --source_prefix "summarize: " \
        --output_dir ${OUTPUT_DIR} \
        --per_device_train_batch_size 32 \
        --per_device_eval_batch_size 64 \
        --gradient_accumulation_steps 2 \
        --max_steps ${MAX_STEPS} \
        --predict_with_generate \
        --evaluation_strategy "steps" \
        --logging_strategy "steps" \
        --save_strategy "steps" \
        --eval_steps 500 \
        --logging_steps 500 \
        --save_steps 500 \
        --load_best_model_at_end \
        --metric_for_best_model "eval_rougeLsum" \
        --save_total_limit 1 \
        --num_beams 1 \
        --dropout_rate 0.1 \
        --preprocessing_num_workers 8 \
        --learning_rate 5e-5
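For clarity, my understanding of what --load_best_model_at_end does can be sketched in plain Python: the trainer logs metric_for_best_model at each eval step and, since ROUGE is a greater-is-better metric, reloads the checkpoint with the highest score before the final evaluation. This is an illustrative sketch of the selection logic, not the actual Trainer internals; the numbers are made up.

```python
# Illustrative sketch (not the actual Trainer code): how the checkpoint
# is picked when metric_for_best_model is greater-is-better, like ROUGE.
eval_history = {
    500: {"eval_rougeLsum": 0.21},   # hypothetical logged metrics
    1000: {"eval_rougeLsum": 0.24},
    1500: {"eval_rougeLsum": 0.23},
}

def best_checkpoint_step(history, metric="eval_rougeLsum", greater_is_better=True):
    """Return the step whose logged metric value is best."""
    score = lambda step: history[step][metric]
    return max(history, key=score) if greater_is_better else min(history, key=score)

# The checkpoint from this step would be reloaded for the final evaluation.
print(best_checkpoint_step(eval_history))
```

So the final numbers should match the best intermediate eval, not exceed every one of them, which is what makes the gap puzzling.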

I also ran into the same problem. It doesn't matter which checkpoint I take, the eval results while training are different from the eval results when calling trainer.evaluate() (exactly the same dataset, in the same run where do_train and do_eval are both enabled).
Interestingly, it only happens with T5 (and not with Pegasus, for example).
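One way to narrow this down (a sketch, assuming checkpoints are saved under OUTPUT_DIR as checkpoint-&lt;step&gt;, which the step numbers may not match exactly in your run) is to re-run the same script in eval-only mode against one intermediate checkpoint, keeping the generation settings identical, and compare the numbers with that step's in-training log:

```shell
# Sketch: evaluate one saved checkpoint with the same script and the
# same generation settings (predict_with_generate, num_beams) as training.
python ../run_summarization.py \
    --model_name_or_path ${OUTPUT_DIR}/checkpoint-500 \
    --do_eval \
    --dataset_name ccdv/cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir ${OUTPUT_DIR}/eval_ckpt_500 \
    --per_device_eval_batch_size 64 \
    --predict_with_generate \
    --num_beams 1
```

If these numbers reproduce the 500-step log but differ from trainer.evaluate() in the combined run, the discrepancy lies in the final evaluation path rather than in the checkpoints themselves.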