Fine-tuning in multiple sequential training sessions rather than all at once

Hi, community!

My goal is to fine-tune a summarization model on a dataset of technical texts, to get a summarizer suited to academic papers.

I have been attempting to fine-tune facebook/bart-base on the arxiv-summarization dataset (both on HF) on a Kaggle kernel with a P100 GPU.

I have found that training for more than 4-5 epochs in one run is not possible because the kernel times out, and within that budget the fine-tuning does not yield much improvement either.

The strategy I came up with is to train for 3 epochs at a time and repeat this 5 times, with each session resuming from the checkpoint saved at the end of the previous one, for an effective 15 epochs of training.
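For concreteness, here is a minimal sketch of what I imagine one session looking like with `Seq2SeqTrainer`. Everything in it is a placeholder assumption on my part (the `ccdv/arxiv-summarization` hub ID, the sample counts, the paths, the hyperparameters), not a tested recipe:

```python
# A minimal sketch of one training session; paths, sizes, and hyperparameters
# are placeholders, not a tested recipe.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

raw = load_dataset("ccdv/arxiv-summarization")

def preprocess(batch):
    # BART accepts up to 1024 input tokens; target abstracts truncated to 256.
    inputs = tokenizer(batch["article"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["abstract"], max_length=256, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

train_ds = raw["train"].select(range(50_000)).map(
    preprocess, batched=True, remove_columns=raw["train"].column_names
)
val_ds = raw["validation"].select(range(2_000)).map(
    preprocess, batched=True, remove_columns=raw["validation"].column_names
)

args = Seq2SeqTrainingArguments(
    output_dir="/kaggle/working/bart-arxiv",  # checkpoints land here
    num_train_epochs=15,  # the TOTAL across all sessions, kept fixed every run
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    save_strategy="epoch",  # a checkpoint per epoch survives a kernel timeout
    save_total_limit=2,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)

trainer.train()  # session 1 starts from scratch
# Sessions 2-5: re-attach the saved checkpoint (e.g. as a Kaggle dataset) and
# resume instead, so optimizer and LR-scheduler state carry over:
# trainer.train(resume_from_checkpoint="/kaggle/input/bart-arxiv/checkpoint-XXXX")
```

My understanding is that `resume_from_checkpoint` restores the optimizer and LR-scheduler state, so keeping `num_train_epochs` at the full total of 15 in every session would let the restored scheduler continue one 15-epoch decay instead of restarting each session, which is also what question 3 below is about.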

I have some questions about this approach and would appreciate your thoughts:

  1. Is this procedure sensible/viable?
  2. Should the hyperparameters stay the same across all sessions?
  3. What about the learning rate? Should each session restart from the usual initial value, or should we account for the decay accumulated over previous sessions?
  4. Should each session use the same training data (about 50k samples), or should each session get different, mutually exclusive training data (say, 5 sessions of 25k samples each) to maximize the diversity of what the model learns from? (See the sketch after this list.)
  5. Should the validation data stay the same for each session? If so, any improvement in the ROUGE scores can be tracked accurately across sessions.
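For question 4, the mutually exclusive split I have in mind would look something like the following (again just a sketch; the 125k pool size and the seed are arbitrary choices of mine):

```python
# Hypothetical per-session shard for the "mutually exclusive data" option:
# five disjoint 25k-sample shards drawn from a fixed, deterministically
# shuffled pool. SESSION is bumped by hand at the start of each run.
from datasets import load_dataset

SESSION = 0  # 0..4

pool = (
    load_dataset("ccdv/arxiv-summarization", split="train")
    .shuffle(seed=42)        # fixed seed => same order every session,
    .select(range(125_000))  # so the five shards never overlap
)
session_train = pool.shard(num_shards=5, index=SESSION)  # ~25k samples

# Fixed validation set across sessions (question 5), so ROUGE is comparable:
val_ds = load_dataset("ccdv/arxiv-summarization", split="validation").select(
    range(2_000)
)
```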

I realize I have asked several questions, but thoughts on any would be appreciated.

Thank you!