How to continue BERT training

Hi everyone,

I’m trying to pre-train a BERT model on a cluster of GPUs. For scheduling fairness, my process is stopped every 24h, so I can’t give BERT the whole dataset at once: if the job is stopped, resuming might be very slow (and the full dataset is also a problem for memory).

I decided to split my dataset into N chunks, each one requiring less than 24h to process. My idea was to set 1 epoch of training, train on the first chunk, then continue pretraining for 1 epoch on the second, and so on up to the Nth chunk, and then repeat the whole cycle for a total of 3 epochs.
I’m using the script “transformers/examples/pytorch/language-modeling/run_mlm.py”. The problem is that, after processing the first chunk:

  • If I simply tell the Trainer to continue pretraining, giving it the last checkpoint and the second chunk to process, it does nothing at all, probably because it considers the training I configured for the first chunk already finished.
  • If I instead load the last checkpoint as the starting model and train on the second chunk, the learning rate is reset to the initial value rather than continuing from the last value of the previous run, and I don’t know whether there are other hidden problems like this. (Both attempts are sketched below.)
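
To make the two attempts concrete, here is roughly what they correspond to with the Trainer API instead of the run_mlm.py command line (a simplified sketch, not my exact script; the checkpoint path, the chunk file and the hyperparameters are placeholders):

```python
# Simplified sketch of my two attempts; "chunk2.txt" and "out/checkpoint-XXXX"
# are placeholders for my second data chunk and the last chunk-1 checkpoint.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("text", data_files={"train": "chunk2.txt"})["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="out", num_train_epochs=1)

# Attempt 1: resume from the last chunk-1 checkpoint. The Trainer restores the
# global step from the checkpoint, presumably decides the single epoch I asked
# for is already done, and returns without training on the new chunk.
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
trainer = Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator)
trainer.train(resume_from_checkpoint="out/checkpoint-XXXX")

# Attempt 2: load the checkpoint as the starting model instead. Training runs,
# but the optimizer and LR scheduler are created from scratch, so the learning
# rate restarts from the initial value instead of continuing the decay.
model = AutoModelForMaskedLM.from_pretrained("out/checkpoint-XXXX")
trainer = Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator)
trainer.train()
```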

What can I do? I also have another question: if I split the dataset this way and manage to finish the training with my method, will the learning rate decay be different from what it would have been if I had given it the whole dataset at once?

Thanks in advance,
Irene

You will have to adjust your learning rate every time you load the model and continue training that way, as opposed to resuming from the checkpoint. The checkpoint you load will only use the dataset it started with; that’s why it didn’t work. Have you tried any other techniques to speed up training?
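
For instance, one rough way to do that (a sketch, assuming the checkpoints were written by the Trainer and therefore contain a trainer_state.json whose log_history includes the logged learning rate) is to read the last learning rate from the previous checkpoint and pass it as --learning_rate when launching run_mlm.py on the next chunk:

```python
import json
from pathlib import Path

def last_logged_lr(checkpoint_dir: str):
    """Return the most recent learning rate logged in a Trainer checkpoint.

    Assumes the checkpoint was produced by transformers.Trainer, so it contains
    a trainer_state.json whose log_history entries include "learning_rate".
    """
    state = json.loads((Path(checkpoint_dir) / "trainer_state.json").read_text())
    lrs = [log["learning_rate"] for log in state["log_history"] if "learning_rate" in log]
    return lrs[-1] if lrs else None

# Hypothetical path to the last checkpoint of the previous chunk.
print(last_logged_lr("out/checkpoint-XXXX"))
```

This doesn’t reproduce the original decay curve exactly (the new run still builds its own schedule over the new chunk), but it at least avoids restarting from the initial learning rate.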
