Hi everyone,
I'm trying to pre-train a BERT model on a cluster of GPUs. For scheduling fairness, my process is stopped every 24h, so I can't give BERT the whole dataset at once: if the job is stopped mid-run, resuming might be very slow (and the full dataset is also an issue for memory usage).
I decided to split my dataset into N chunks, each requiring less than 24h to process. My idea was to set the training to 1 epoch, train on the first chunk, then continue pre-training for 1 epoch on the second chunk, and so on until the Nth chunk, and then repeat the whole cycle for a total of 3 epochs.
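Concretely, the first run looks roughly like this (the model name, paths, and hyper-parameters below are placeholders, not my real settings):

```python
# Rough sketch of the first chunk's run (everything here is a placeholder):
# run_mlm.py is launched once per chunk, for 1 epoch each time.
import subprocess

subprocess.run([
    "python", "run_mlm.py",
    "--model_name_or_path", "bert-base-uncased",  # placeholder starting point
    "--train_file", "data/chunk_1.txt",           # first of the N chunks
    "--do_train",
    "--num_train_epochs", "1",
    "--output_dir", "bert-pretrain/chunk_1",
], check=True)

# The open question is how to launch the runs for chunk_2 ... chunk_N so that
# they really continue this training instead of starting from scratch.
```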
I'm using the script in "transformers/examples/pytorch/language_modeling/run_mlm.py". The problem is that, after processing the first chunk:
- If I simply tell the Trainer to continue pre-training, giving it the last checkpoint and the second chunk to process, it does nothing at all, probably because it considers the training I configured for the first chunk already finished.
- If I instead load the last checkpoint as the starting model and launch a new training run on the second chunk, the learning rate is reset to its initial value instead of continuing from the last value of the previous run, and I don't know what other state is lost in the same way. (Both attempts are sketched below.)
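To make the two attempts concrete, this is roughly what I ran for the second chunk (same placeholder paths as above; `checkpoint-XXXX` stands for the last checkpoint saved by the first run):

```python
# The two variants I tried for the second chunk (paths and checkpoint name are placeholders).
import subprocess

common = [
    "python", "run_mlm.py",
    "--train_file", "data/chunk_2.txt",  # second chunk
    "--do_train",
    "--num_train_epochs", "1",
]

# Variant 1: resume the Trainer from the last checkpoint of the first run.
# It exits almost immediately, I think because the trainer state stored in the
# checkpoint already says that the 1 epoch I asked for has been completed.
subprocess.run(common + [
    "--model_name_or_path", "bert-base-uncased",
    "--output_dir", "bert-pretrain/chunk_1",
    "--resume_from_checkpoint", "bert-pretrain/chunk_1/checkpoint-XXXX",
], check=True)

# Variant 2: load the last checkpoint as the starting model and train on chunk 2
# in a fresh output directory. Training runs, but the learning rate starts again
# from the initial value, presumably because only the model weights are loaded
# and not the optimizer/scheduler state.
subprocess.run(common + [
    "--model_name_or_path", "bert-pretrain/chunk_1/checkpoint-XXXX",
    "--output_dir", "bert-pretrain/chunk_2",
], check=True)
```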
What can I do? I also have another question: assuming I manage to finish the training with this method, will the learning rate decay be different from what I would get by passing the whole dataset at once?
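To show what I mean about the decay, here is a tiny sketch of my worry (the step counts are invented, and I'm assuming the Trainer's default linear schedule with no warmup):

```python
# Toy comparison of the linear LR decay: one long run vs. independent chunk runs.
# All numbers are made up; this only mimics how the schedule length is computed.
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def lr_after(steps_done, total_steps, base_lr=1e-4):
    """LR after `steps_done` steps of a linear schedule planned for `total_steps` steps."""
    param = torch.nn.Parameter(torch.zeros(1))
    opt = AdamW([param], lr=base_lr)
    sched = get_linear_schedule_with_warmup(
        opt, num_warmup_steps=0, num_training_steps=total_steps
    )
    for _ in range(steps_done):
        opt.step()
        sched.step()
    return sched.get_last_lr()[0]

# One run over the whole dataset (say 3000 steps): halfway through, the LR is ~5e-5.
print(lr_after(1500, 3000))  # 5e-05

# Chunked runs of 1000 steps each: each run plans its schedule over its own 1000
# steps only, so the LR has already decayed to 0 by the end of every chunk (and,
# with variant 2 above, restarts from 1e-4 at the beginning of the next chunk).
print(lr_after(1000, 1000))  # 0.0
```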
Thanks in advance,
Irene