Finetuning in multiple sequential training sessions rather than at once

Hi, community!

My goal is to fine-tune an LLM for summarization on a dataset of technical texts, so that the result is a summarizer well suited to academic material.

I have been attempting to fine-tune facebook/bart-base on the arxiv-summarization dataset (both on HF) on a Kaggle kernel with a P100 GPU.

I have found that training for more than 4-5 epochs in a single run is not possible because the kernel times out, and a run that short does not yield much improvement either.

A strategy I came up with is to train for 3 epochs at a time and repeat this process 5 times. Each session starts from the model saved at the end of the previous one, giving an effective 15 epochs of training.
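To make the bookkeeping concrete, here is a minimal sketch of the session plan I have in mind. The checkpoint directory names are placeholders, and the actual training call (a `Seq2SeqTrainer` run, in my case) would go where the `print` is:

```python
# Sketch of the session bookkeeping: each session resumes from the previous
# session's checkpoint, so 5 sessions x 3 epochs = 15 effective epochs.
# Checkpoint names below are placeholders, not my real paths.

BASE_MODEL = "facebook/bart-base"
EPOCHS_PER_SESSION = 3
NUM_SESSIONS = 5

def plan_sessions(num_sessions=NUM_SESSIONS, epochs=EPOCHS_PER_SESSION):
    """Return (start_checkpoint, end_checkpoint, cumulative_epochs) per session."""
    plan = []
    prev = BASE_MODEL
    for s in range(1, num_sessions + 1):
        out = f"bart-arxiv-session-{s}"
        plan.append((prev, out, s * epochs))
        prev = out
    return plan

for start, end, total in plan_sessions():
    # In the real kernel this line is replaced by: load `start`, fine-tune
    # for EPOCHS_PER_SESSION epochs, save to `end`.
    print(f"load {start} -> train {EPOCHS_PER_SESSION} epochs -> save {end} "
          f"(cumulative: {total} epochs)")
```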

I have some questions about this approach and would appreciate my peers' thoughts:

  1. Is this procedure sensible/viable?
  2. Should the hyperparameters stay the same across all sessions?
  3. What about the learning rate? Should each session start with the usual starting value, or should we account for the decay across sessions?
  4. Should each session use the same training data (about 50k samples)? Or should each session use different, mutually exclusive training data (say, 5 sessions of 25k samples each) to maximize the diversity of what the model sees?
  5. Should the validation data be the same for each session? If so, we can accurately track any improvement in the ROUGE scores across sessions.
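To make question 3 concrete, here is what the starting learning rate of each session would look like under a linear decay carried across all 15 epochs, versus naively restarting at the peak each session. The 5e-5 peak is just an illustrative value, not a recommendation:

```python
# Question 3 in numbers: with a linear decay from a peak LR to 0 over the
# full 15-epoch run, resuming the schedule gives each session a lower
# starting LR than restarting at the peak. All values are illustrative.

PEAK_LR = 5e-5
TOTAL_EPOCHS = 15
EPOCHS_PER_SESSION = 3
NUM_SESSIONS = 5

def linear_lr(epoch, peak=PEAK_LR, total=TOTAL_EPOCHS):
    """LR at the start of a given (0-indexed) epoch under linear decay to 0."""
    return peak * (1 - epoch / total)

# Starting LR per session if the schedule is carried across sessions...
carried = [linear_lr(s * EPOCHS_PER_SESSION) for s in range(NUM_SESSIONS)]
# ...versus restarting at the peak every session.
restarted = [PEAK_LR] * NUM_SESSIONS

for i, (c, r) in enumerate(zip(carried, restarted), start=1):
    print(f"session {i}: carried {c:.1e} vs restarted {r:.1e}")
```

(For what it's worth, `Trainer`'s `resume_from_checkpoint` restores the optimizer and scheduler state, which is the "carried" column above.)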

I realize I have asked several questions, but thoughts on any would be appreciated.

Thank you!


@adityashukzy were you ever able to successfully do this? I am having an issue with training a model that I have trained and saved successfully once before. On the second training attempt, I am getting this error:

```
Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
```
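For context, here is a minimal illustration of the kind of mismatch this error describes (a toy model, not my actual code): after reloading a saved model, it sits on the CPU while the input tensors are on the GPU, and a matrix multiply between the two fails. Moving both to the same device avoids it:

```python
# Toy illustration of the device mismatch: the model and its inputs must
# live on the same device. The Linear layer stands in for a reloaded model.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2)         # stand-in for the reloaded model
model.to(device)                      # the step that is easy to forget after reloading

batch = torch.randn(3, 4).to(device)  # inputs on the same device as the model
out = model(batch)
print(out.shape)                      # torch.Size([3, 2])
```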

Any help would be much appreciated.