I am working on BERT pre-training, and I wonder out loud whether using multiple Trainer instances is the best way to go. As you may know, BERT pre-training consists of two phases, and these phases are quite distinct: besides using different sequence lengths, they use different learning rates, batch sizes, and so on.
It seems like a natural fit to run two sequential phases with two Trainer instances, each taking its own training arguments, data loaders, collators, and so on. I would pass the model between them: once the first phase is done, a new Trainer is initialized for the second phase and the model is handed over from the first instance to the second.
For the most part, this works fine. Can you envision doing it any other way?
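For what it's worth, the hand-off pattern described above can be sketched in plain Python. `PhaseTrainer` here is a hypothetical stand-in for `transformers.Trainer` (so the sketch stays self-contained); with the real Trainer you would pass the same model object into each instance along with phase-specific `TrainingArguments` and data collators. All hyperparameter values below are illustrative.

```python
# Minimal sketch of the two-phase hand-off. PhaseTrainer is a hypothetical
# stand-in for transformers.Trainer; only the control flow is the point.

class PhaseTrainer:
    def __init__(self, model, max_seq_length, learning_rate, batch_size):
        self.model = model                   # the shared model object
        self.max_seq_length = max_seq_length
        self.learning_rate = learning_rate
        self.batch_size = batch_size

    def train(self):
        # Real code would run the training loop here; we just record the phase.
        self.model["trained_phases"].append(self.max_seq_length)
        return self.model

# A toy "model": in practice this would be a PreTrainedModel instance.
model = {"trained_phases": []}

# Phase 1: most of the steps at sequence length 128.
phase1 = PhaseTrainer(model, max_seq_length=128, learning_rate=1e-4, batch_size=256)
model = phase1.train()

# Phase 2: a fresh trainer with new hyperparameters, same model, length 512.
phase2 = PhaseTrainer(model, max_seq_length=512, learning_rate=5e-5, batch_size=64)
model = phase2.train()

print(model["trained_phases"])
```

The key point the sketch makes is that the two Trainer instances own the phase-specific configuration, while the model object is the only thing that crosses the boundary between them.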
What do you mean by "pre-training consists of two phases"? I always thought the pre-training tasks of MLM and NSP were performed jointly.
Yes, yes, they are, but for 90% of the steps the sequence length is 128, and for the remaining 10% it is 512. The reasoning is that 10% of the steps is enough to learn the positional embeddings, and using length 512 for the whole training would be a waste. Because of this change, one has to lower the batch size, adjust the learning rate, etc. So it is a two-phase pre-training process.
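To make the split concrete, the phase schedule could be written down as a small config fragment. The 128/512 sequence lengths and the 90%/10% step split are from the BERT paper; the total step count, batch sizes, and learning rates below are illustrative placeholders, not values from the paper.

```python
# Two-phase pre-training schedule. Seq lengths and the 90/10 split follow the
# BERT paper; steps, batch sizes, and learning rates are illustrative only.
TOTAL_STEPS = 1_000_000

PHASES = [
    {"max_seq_length": 128, "steps": int(0.9 * TOTAL_STEPS),
     "per_device_batch_size": 256, "learning_rate": 1e-4},
    {"max_seq_length": 512, "steps": int(0.1 * TOTAL_STEPS),
     "per_device_batch_size": 64, "learning_rate": 5e-5},
]
```

Each entry would then be used to build the `TrainingArguments` (and tokenization/collation settings) for the corresponding Trainer instance.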
Ah yes, I see it in the paper now; I had never really thought about that. It seems they did this sequentially: first training with seq_length 128 for 90% of the steps, and only then continuing with 512. So I would probably do it like that too: first train the 128 part, then continue training with a new trainer.
This sounds like a completely fine way of doing it.