Difference in results using trainer.evaluate() and pipeline inference

Hi everyone!

I’m fine-tuning flan-t5-base for text summarization and use the ROUGE metric from the evaluate library.

I track the model’s performance on the validation set during training and then, after training is finished, load the trained model and evaluate it on the test set using

trainer.evaluate(eval_dataset=dataset["test"])

The results I get are the following:

{'eval_rouge1': 50.2228, 'eval_rouge2': 47.0773, 'eval_rougeL': 50.1846, 'eval_rougeLsum': 50.1748}

The next step is loading the model into a summarization pipeline to generate summaries for the test set so I can perform error analysis:

from transformers import pipeline

summarizer = pipeline(
    task='summarization',
    model=checkpoint,
    device=0)

# collect the generated summary for every report in the test split
predictions = []
for out in summarizer(test_dataset['test']['report']):
    predictions.append(out['summary_text'])

# attach the predictions to the test split and save everything for error analysis
test_dataset = test_dataset['test'].to_pandas()
test_dataset['predictions'] = predictions
test_dataset.to_csv('summarization_results.tsv', sep='\t')

When I open this file and compute the same metrics (roughly as sketched below), the scores are much better than the ones reported by trainer.evaluate() on the exact same set:

{'rouge1': 89.1666, 'rouge2': 84.5434, 'rougeL': 89.1476, 'rougeLsum': 89.113}
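
For completeness, this is roughly how I recompute the scores from the saved file. Again a simplified sketch: the reference column name 'summary' here stands in for whatever the target column is called in the dataset.

import pandas as pd
import evaluate

rouge = evaluate.load('rouge')

results = pd.read_csv('summarization_results.tsv', sep='\t')
# 'summary' is a placeholder for the reference-text column in my dataset
scores = rouge.compute(predictions=results['predictions'].tolist(),
                       references=results['summary'].tolist(),
                       use_stemmer=True)
print({k: round(v * 100, 4) for k, v in scores.items()})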

What could cause this difference? If I’m going to report my model’s performance, which result should I mention? Thank you in advance for your help.

I’m having the exact same problem, but with classification! Any advice from the moderators would be greatly appreciated!