I am working on BERT pre-training, and I wonder out loud whether using multiple Trainer instances is the best way to go. As you may know, BERT pre-training consists of two phases, and these phases are quite distinct: besides using different sequence lengths, they use different learning rates, batch sizes, and so on.
It seems like a natural fit to run two sequential phases with two Trainer instances, each taking its own training arguments, data loaders, collators, and so on. I would pass the model between them: once the first phase is done, a new Trainer is initialized for the second phase and the model is handed over from the first instance to the second.
For the most part, this works fine. Can you envision doing it any other way?
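For what it's worth, the hand-off pattern described above can be sketched in plain Python. `PhaseTrainer` here is a hypothetical stand-in for `transformers.Trainer` (so the sketch stays self-contained); with the real Trainer you would pass the same model object into each instance along with phase-specific `TrainingArguments` and data collators. All hyperparameter values below are illustrative.

```python
# Minimal sketch of the two-phase hand-off. PhaseTrainer is a hypothetical
# stand-in for transformers.Trainer; only the control flow is the point.

class PhaseTrainer:
    def __init__(self, model, max_seq_length, learning_rate, batch_size):
        self.model = model                   # the shared model object
        self.max_seq_length = max_seq_length
        self.learning_rate = learning_rate
        self.batch_size = batch_size

    def train(self):
        # Real code would run the training loop here; we just record the phase.
        self.model["trained_phases"].append(self.max_seq_length)
        return self.model

# A toy "model": in practice this would be a PreTrainedModel instance.
model = {"trained_phases": []}

# Phase 1: most of the steps at sequence length 128.
phase1 = PhaseTrainer(model, max_seq_length=128, learning_rate=1e-4, batch_size=256)
model = phase1.train()

# Phase 2: a fresh trainer with new hyperparameters, same model, length 512.
phase2 = PhaseTrainer(model, max_seq_length=512, learning_rate=5e-5, batch_size=64)
model = phase2.train()

print(model["trained_phases"])
```

The key point the sketch makes is that the two Trainer instances own the phase-specific configuration, while the model object is the only thing that crosses the boundary between them.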
What do you mean by "pre-training consists of two phases"? I always thought the pre-training tasks of MLM and NSP were performed jointly.
Yes, yes, they are, but for 90% of the steps the sequence length is 128, and for the remaining 10% it is 512. The reasoning is that 10% of the steps is enough to learn the positional embeddings, and using length 512 for the whole training would be a waste. Because of this change, one has to lower the batch size, adjust the learning rate, etc. So it is a two-phase pre-training process.
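To make the split concrete, the phase schedule could be written down as a small config fragment. The 128/512 sequence lengths and the 90%/10% step split are from the BERT paper; the total step count, batch sizes, and learning rates below are illustrative placeholders, not values from the paper.

```python
# Two-phase pre-training schedule. Seq lengths and the 90/10 split follow the
# BERT paper; steps, batch sizes, and learning rates are illustrative only.
TOTAL_STEPS = 1_000_000

PHASES = [
    {"max_seq_length": 128, "steps": int(0.9 * TOTAL_STEPS),
     "per_device_batch_size": 256, "learning_rate": 1e-4},
    {"max_seq_length": 512, "steps": int(0.1 * TOTAL_STEPS),
     "per_device_batch_size": 64, "learning_rate": 5e-5},
]
```

Each entry would then be used to build the `TrainingArguments` (and tokenization/collation settings) for the corresponding Trainer instance.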
Ah yes, I see it in the paper now; I had never really thought about that. It seems they did this sequentially: first training with seq_length 128 for 90% of the steps, and only then continuing with 512. So I would probably do it like that too: first train the 128 part, then continue training with a new trainer.
This sounds like a completely fine way of doing it.