My goal is to fine-tune an LLM for summarization on a dataset of technical texts, yielding a summarizer well suited to academic papers.
I have been attempting to fine-tune facebook/bart-base on the arxiv-summarization dataset (both on HF) in a Kaggle kernel with a P100 GPU.
I have found that training for more than 4-5 epochs in a single run is not possible because the kernel times out, and the fine-tuning so far has not yielded great improvements either.
The strategy I came up with is to train for 3 epochs per session and repeat this five times, with each session resuming from the model fine-tuned in the previous one, for an effective 15 epochs of training.
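To make the plan concrete, here is a toy sketch of the chaining bookkeeping. The actual fine-tuning is stubbed out; `run_session` and the checkpoint names are placeholders, not real training code:

```python
# Toy sketch of the chaining plan: each "session" fine-tunes for 3 epochs,
# starting from whatever checkpoint the previous session saved.
SESSIONS = 5
EPOCHS_PER_SESSION = 3

def run_session(start_checkpoint: str, epochs: int) -> str:
    """Stand-in for one Kaggle run: train `epochs` more epochs on top of
    `start_checkpoint` and return the name of the new checkpoint."""
    done = int(start_checkpoint.split("-")[-1]) if "-" in start_checkpoint else 0
    return f"checkpoint-{done + epochs}"

checkpoint = "base"  # facebook/bart-base on the very first run
for _ in range(SESSIONS):
    checkpoint = run_session(checkpoint, EPOCHS_PER_SESSION)

print(checkpoint)  # checkpoint-15, i.e. 15 effective epochs
```

In practice, each session would save the model (and ideally the optimizer/scheduler state) to the Kaggle working directory or a dataset, and the next session would load from there instead of from facebook/bart-base.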
I have some questions about this approach and would appreciate my peers' thoughts:
- Is this procedure sensible/viable?
- Should the hyperparameters stay the same across all sessions?
- What about the learning rate? Should each session restart from the usual initial value, or should the decay continue across sessions?
- Should each session use the same training data (about 50k samples)? Or should each session get a different, mutually exclusive slice (say, 5 sessions of 25k samples each) to maximize the diversity of what the model sees?
- Should the validation data stay the same across sessions? Keeping it fixed would let me accurately track any improvement in ROUGE scores from session to session.
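To make the learning-rate question concrete, here is a toy comparison (pure Python, made-up base LR) of the two options for a linear-decay-to-zero schedule: restarting the schedule every 3-epoch session versus running one schedule over the full 15 effective epochs:

```python
# Compare the LR seen at effective epoch 7 under the two scheduling options.
BASE_LR = 5e-5  # illustrative starting value, not a recommendation

def linear_decay(base_lr: float, epoch: int, total_epochs: int) -> float:
    """LR after `epoch` epochs of a linear decay to zero over `total_epochs`."""
    return base_lr * (1 - epoch / total_epochs)

# Option A: each session restarts the schedule with a 3-epoch horizon.
# Effective epoch 7 is the 2nd epoch (index 1) of the 3rd session.
lr_restart = linear_decay(BASE_LR, 7 % 3, 3)

# Option B: one schedule spanning all 15 effective epochs.
lr_continue = linear_decay(BASE_LR, 7, 15)

print(lr_restart, lr_continue)  # option A stays higher mid-training
```

The point is just that the two choices put the model on noticeably different LR trajectories, so it seems worth deciding deliberately rather than by default.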
I realize I have asked several questions, but thoughts on any would be appreciated.
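For the validation-tracking question, this is roughly the metric I have in mind. A minimal hand-rolled ROUGE-1 F1 is shown only for illustration; in practice I would compute it on the fixed validation set with the `rouge_score`/`evaluate` packages:

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Minimal ROUGE-1 F1: unigram overlap between prediction and reference,
    with whitespace tokenization and lowercasing. Illustration only."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped unigram overlap: each token counts at most as often as it
    # appears in the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the model summarizes papers",
                  "the model summarizes arxiv papers")
print(round(score, 3))  # 0.889
```

Logging a score like this after every session, against the same validation split, is what I meant by accurately tracking improvement.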