How much data is needed to fine-tune a summarization model?

Hi All,

I want to fine-tune a summarization model on a custom dataset. Are there any guidelines on how much data I would need, whether data from a different domain helps, etc.?

I am trying to summarize conversations. In most cases, these conversations involve just two people. I fine-tuned google/flan-t5-base and facebook/bart-large-cnn on about 1,000 examples; the results are good, but not as good as GPT-3.5's.
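For reference, here is roughly the setup I used for the flan-t5-base run, written as a minimal sketch: the dataset file name, the "dialogue"/"summary" field names, and the hyperparameters are placeholders, not my exact values.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# "conversations.json" is a placeholder; each record has a "dialogue"
# string (the conversation) and a reference "summary" string.
raw = load_dataset("json", data_files="conversations.json")["train"]
raw = raw.train_test_split(test_size=0.1)  # hold out 10% for validation

def preprocess(batch):
    # T5-style task prefix; tune max lengths to your conversation lengths
    model_inputs = tokenizer(
        ["summarize: " + d for d in batch["dialogue"]],
        max_length=512,
        truncation=True,
    )
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-conv-summarizer",
    learning_rate=3e-4,              # common starting point for T5 fine-tuning
    per_device_train_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```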

  1. Do I need to gather and train on more data? If I don’t have access to data for my use case, can I use data from another domain as long as it is conversational? Say, podcast transcripts?

  2. How long should I train the model? Are there any best practices for choosing the number of epochs, etc.? (I have sketched my current approach right after this list.)
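For question 2, what I am currently doing is setting an epoch cap and letting early stopping on the validation loss decide when to stop, rather than fixing the epoch count. A sketch of that, reusing the names from the setup above (values are again placeholders):

```python
from transformers import EarlyStoppingCallback

# Reusing model, tokenized, and the data collator from the sketch above.
args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-conv-summarizer",
    learning_rate=3e-4,
    per_device_train_batch_size=8,
    num_train_epochs=10,             # upper bound only; early stopping usually ends sooner
    evaluation_strategy="epoch",     # evaluate on the held-out split every epoch
                                     # (renamed eval_strategy in newer transformers versions)
    save_strategy="epoch",
    load_best_model_at_end=True,     # required for EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```

Is this a reasonable way to pick the number of epochs, or is there a better heuristic?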

I am looking to improve the performance of my model and could really use some help! I have looked online but can’t find a clear answer. I understand that in many cases you need to experiment to find what works for you, but there are so many possibilities, and as a beginner in this field I am looking for a starting point.

Thank you for your help!