How to continue BERT training

Hi everyone,

I’m trying to pre-train a BERT model on a cluster of GPUs. For scheduling fairness, my process is stopped every 24h, so I can’t give BERT the whole dataset at once: if the job is stopped, resuming might be very slow (and the full dataset is also a problem for memory).

I decided to split my dataset into N chunks, each one requiring less than 24h to process. My idea was to set 1 epoch of training, train on the first chunk, then continue pretraining for 1 epoch on the second, and so on up to the Nth chunk, and then repeat the whole cycle for a total of 3 epochs.
I’m using the script “transformers/examples/pytorch/language-modeling/run_mlm.py”. The problem is that, after processing the first chunk:

  • If I simply tell the Trainer to continue pretraining, giving it the last checkpoint and the second chunk to process, it does nothing at all, probably because it considers the training I configured for the first chunk already finished.
  • If I instead load the last checkpoint as the starting model and train on the second chunk, the learning rate is reset to the initial value rather than continuing from the last value of the previous run, and I don’t know whether there are other hidden problems like this. (Both attempts are sketched below.)
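
To make the two attempts concrete, here is roughly what they correspond to with the Trainer API instead of the run_mlm.py command line (a simplified sketch, not my exact script; the checkpoint path, the chunk file and the hyperparameters are placeholders):

```python
# Simplified sketch of my two attempts; "chunk2.txt" and "out/checkpoint-XXXX"
# are placeholders for my second data chunk and the last chunk-1 checkpoint.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("text", data_files={"train": "chunk2.txt"})["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="out", num_train_epochs=1)

# Attempt 1: resume from the last chunk-1 checkpoint. The Trainer restores the
# global step from the checkpoint, presumably decides the single epoch I asked
# for is already done, and returns without training on the new chunk.
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
trainer = Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator)
trainer.train(resume_from_checkpoint="out/checkpoint-XXXX")

# Attempt 2: load the checkpoint as the starting model instead. Training runs,
# but the optimizer and LR scheduler are created from scratch, so the learning
# rate restarts from the initial value instead of continuing the decay.
model = AutoModelForMaskedLM.from_pretrained("out/checkpoint-XXXX")
trainer = Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator)
trainer.train()
```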

What can I do? I also have another question: if I split the dataset this way and manage to finish the training with my method, will the learning rate decay be different from what it would have been if I had given it the whole dataset at once?

Thanks in advance,
Irene

You will have to adjust your learning rate every time you load the model and continue training that way, as opposed to resuming from the checkpoint. The checkpoint you load will only use the dataset it started with; that’s why it didn’t work. Have you tried any other techniques to speed up training?
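
For instance, one rough way to do that (a sketch, assuming the checkpoints were written by the Trainer and therefore contain a trainer_state.json whose log_history includes the logged learning rate) is to read the last learning rate from the previous checkpoint and pass it as --learning_rate when launching run_mlm.py on the next chunk:

```python
import json
from pathlib import Path

def last_logged_lr(checkpoint_dir: str):
    """Return the most recent learning rate logged in a Trainer checkpoint.

    Assumes the checkpoint was produced by transformers.Trainer, so it contains
    a trainer_state.json whose log_history entries include "learning_rate".
    """
    state = json.loads((Path(checkpoint_dir) / "trainer_state.json").read_text())
    lrs = [log["learning_rate"] for log in state["log_history"] if "learning_rate" in log]
    return lrs[-1] if lrs else None

# Hypothetical path to the last checkpoint of the previous chunk.
print(last_logged_lr("out/checkpoint-XXXX"))
```

This doesn’t reproduce the original decay curve exactly (the new run still builds its own schedule over the new chunk), but it at least avoids restarting from the initial learning rate.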
