Hi everyone!
I'm fine-tuning flan-t5-base for text summarization and using the ROUGE metric from evaluate.
During training I track the model's performance on the validation set; after training is finished, I load the trained model and run testing on the test set with
trainer.evaluate(eval_dataset=dataset["test"])
The results I get are the following:
{'eval_rouge1': 50.2228, 'eval_rouge2': 47.0773, 'eval_rougeL': 50.1846, 'eval_rougeLsum': 50.1748}
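For reference, the metric during training/evaluation is computed roughly like this (a simplified sketch of my compute_metrics; the tokenizer and generation settings are configured elsewhere in my script):

import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # -100 marks ignored label positions; replace it before decoding
    # (`tokenizer` is the model's tokenizer, defined earlier in the script)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge.compute(predictions=decoded_preds,
                           references=decoded_labels,
                           use_stemmer=True)
    # evaluate returns scores in [0, 1]; scale to percentages
    return {k: round(v * 100, 4) for k, v in result.items()}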
Next, I load the model into a summarization pipeline to generate summaries for the test set so I can do error analysis:
from transformers import pipeline

summarizer = pipeline(
    task='summarization',
    model=checkpoint,
    device=0)

# Collect the generated summaries for every report in the test split
predictions = []
for out in summarizer(test_dataset['test']['report']):
    predictions.append(out['summary_text'])

# Save references and predictions side by side for error analysis
test_dataset = test_dataset['test'].to_pandas()
test_dataset['predictions'] = predictions
test_dataset.to_csv('summarization_results.tsv', sep='\t')
When I open this file and compute the same metrics on it, the results are much better than the ones reported by trainer.evaluate on the exact same set:
{'rouge1': 89.1666, 'rouge2': 84.5434, 'rougeL': 89.1476, 'rougeLsum': 89.113}
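For completeness, this is roughly how I recompute the scores from the saved file (a sketch; I'm assuming here that the reference summaries sit in a column called 'summary', the actual column name may differ):

import evaluate
import pandas as pd

rouge = evaluate.load("rouge")
df = pd.read_csv('summarization_results.tsv', sep='\t')
result = rouge.compute(
    predictions=df['predictions'].tolist(),
    references=df['summary'].tolist(),  # reference column name assumed for illustration
    use_stemmer=True,
)
print({k: round(v * 100, 4) for k, v in result.items()})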
What could cause this difference? And if I'm going to report my model's performance, which result should I mention? Thank you in advance for your help.