ROUGE in eval during training vs. final eval and predict


I am using the HuggingFace example to train T5 for summarisation (and some other tasks too). I noticed that the ROUGE scores from evaluation during training are a lot lower than the evaluation and prediction scores at the end of training. The scores suddenly jump up.

To try to remove anything specific to my model or data, and to keep things simple, I tried running a very small example with the following parameterisation:

--model_name_or_path t5-small --do_train True --num_train_epochs 1.0 --max_train_samples 1000 --do_eval True --evaluation_strategy steps --eval_steps 100 --max_eval_samples 100 --do_predict True --predict_with_generate True --max_predict_samples 100 --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --output_dir "d:/BrianS/models/sum_test" --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --save_steps 100 --logging_strategy steps --logging_steps 50

*** At the first evaluation I get the following results: ***

{'eval_loss': 2.1008858680725098, 'eval_rouge1': 23.2084, 'eval_rouge2': 7.5598, 'eval_rougeL': 18.7416, 'eval_rougeLsum': 20.9243, 'eval_gen_len': 19.0, 'eval_runtime': 5.5309, 'eval_samples_per_second': 18.08, 'eval_steps_per_second': 4.52, 'epoch': 0.4}

*** At the second evaluation I get: ***

{'eval_loss': 2.093972682952881, 'eval_rouge1': 23.4257, 'eval_rouge2': 8.0721, 'eval_rougeL': 19.0282, 'eval_rougeLsum': 21.3982, 'eval_gen_len': 19.0, 'eval_runtime': 5.4218, 'eval_samples_per_second': 18.444, 'eval_steps_per_second': 4.611, 'epoch': 0.8}

*** But once training is complete I get for evaluation: ***

epoch = 1.0
eval_gen_len = 60.13
eval_loss = 2.0929
eval_rouge1 = 30.7996
eval_rouge2 = 11.3268
eval_rougeL = 23.027
eval_rougeLsum = 28.309
eval_runtime = 0:00:16.26
eval_samples = 100
eval_samples_per_second = 6.148
eval_steps_per_second = 1.537

*** and for prediction: ***

predict_gen_len = 60.37
predict_loss = 2.0777
predict_rouge1 = 29.7574
predict_rouge2 = 9.4481
predict_rougeL = 21.0967
predict_rougeLsum = 26.475
predict_runtime = 0:00:16.84
predict_samples = 100
predict_samples_per_second = 5.937
predict_steps_per_second = 1.484

I noticed that gen_len was very different between the two sets of results (19.0 during training versus about 60 at the end), and that the eval run at the end took much longer than the evals during training (about 16 s versus about 5.5 s). So I debugged inside the compute_metrics function for the three evaluation runs (the two during training and the one after training) and found:

*** The target summarisation for the first sample is: ***

‘Accident happens in Santa Ynez, California, near where Crosby lives. The jogger suffered multiple fractures; his injuries are not believed to be life-threatening.’

*** prediction for first eval (0.4 epoch as above) ***

‘David Crosby hit a jogger with his car in Santa Ynez,’

*** prediction for second eval (0.8 epoch as above) ***

‘David Crosby was driving at approximately 50 mph when he struck the jogger’

*** prediction for eval after training completed ***

‘David Crosby hit a jogger with his car in Santa Ynez, California. The jogger suffered multiple fractures and was airlifted to a hospital. Crosby is known for weaving multilayered harmonies over sweet melodies.’
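For context on the gen_len figures reported earlier: a metric like this is typically the mean number of non-padding tokens per generated sequence. The following is only a minimal sketch of that idea, not the example script's actual code. The function name `mean_gen_len`, the pad token id of 0, and the token ids in the usage lines are all assumptions made up for illustration.

```python
import numpy as np

# Assumption: padding is encoded as token id 0.
PAD_TOKEN_ID = 0


def mean_gen_len(preds, pad_token_id=PAD_TOKEN_ID):
    """Mean count of non-padding tokens per row of a padded 2-D batch."""
    preds = np.asarray(preds)
    # Count the tokens in each row that are not padding.
    lengths = np.count_nonzero(preds != pad_token_id, axis=1)
    return float(np.mean(lengths))


# Made-up token ids: one sequence of 3 real tokens, one of 5.
example_batch = [
    [5, 6, 7, 0, 0],
    [8, 9, 1, 2, 3],
]
example_gen_len = mean_gen_len(example_batch)
```

A gen_len that sits at exactly the same value for every evaluation, as the 19.0 above does, would be consistent with every sequence being cut off at the same token limit.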

The prediction in the evaluation after training is of similar length to the target, so it is perhaps unsurprising that the scores are better. It looks like the generated output is being kept short to save compute for fast evaluation during training, then opened up to the full summarisation at the end? Have I missed a parameter that controls this behaviour, so that I could force full generation for evaluation during training, not just at the end? It would be good to understand this; at the moment my ROUGE scores go up like a hockey stick at the end.
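To illustrate the length effect on the first sample, here is a rough sketch that scores the short and the long prediction against the target using a simplified unigram overlap. This is an assumption-laden stand-in for ROUGE-1, not the metric the example script uses: it has no stemming and uses a naive lowercase tokeniser, so the absolute numbers will not match the reported scores. The function name `unigram_scores` is hypothetical.

```python
import re
from collections import Counter


def unigram_scores(prediction, reference):
    """Simplified ROUGE-1-style precision, recall and F1 (no stemming)."""
    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    # Clipped overlap: each reference token can be matched at most once.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Strings copied from the first sample above.
target = ("Accident happens in Santa Ynez, California, near where Crosby lives. "
          "The jogger suffered multiple fractures; his injuries are not "
          "believed to be life-threatening.")
short_pred = "David Crosby hit a jogger with his car in Santa Ynez,"
long_pred = ("David Crosby hit a jogger with his car in Santa Ynez, California. "
             "The jogger suffered multiple fractures and was airlifted to a "
             "hospital. Crosby is known for weaving multilayered harmonies "
             "over sweet melodies.")

short_scores = unigram_scores(short_pred, target)
long_scores = unigram_scores(long_pred, target)
print("short:", short_scores)
print("long: ", long_scores)
```

On this one sample the truncated prediction cannot cover the second sentence of the target at all, so its recall is capped well below that of the full-length prediction.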