Pre-trained Model Enlargement

I have a pre-trained Encoder-Decoder MLM (BART) and some spare compute budget, so I decided to enlarge it and continue pre-training by doubling the number of both encoder and decoder layers. Only the depth changes; d_model, d_ff, d_kv, and so on stay the same, so the base model's weights remain shape-compatible with the individual layers of the bigger model.
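To make the shape-compatibility point concrete, here is a minimal sketch of the enlargement step, assuming the HuggingFace transformers BART classes (the checkpoint name is just a placeholder for whatever base model is being enlarged):

```python
import copy
from transformers import BartForConditionalGeneration

# Placeholder checkpoint; in practice this would be the base model being enlarged.
base = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

big_cfg = copy.deepcopy(base.config)
big_cfg.encoder_layers *= 2   # only the layer counts change;
big_cfg.decoder_layers *= 2   # d_model, FFN dims, heads, etc. stay as in the base config
big = BartForConditionalGeneration(big_cfg)  # randomly initialized, twice as deep

# Per-layer parameter shapes match, so a base layer's weights load directly:
big.model.encoder.layers[0].load_state_dict(base.model.encoder.layers[0].state_dict())
```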

I plan to use the existing embedding, encoder, and decoder weights to initialize the bigger model.
There are three scenarios for doing that:

  1. I can copy the base model's weights into the first N encoder and decoder layers of the new, bigger model and initialize the remaining layers from scratch.
  2. I can copy the base model's weights into the odd-numbered layers (1, 3, 5, and so on) and initialize the even-numbered layers (2, 4, 6, and so on) from scratch.
  3. Instead of initializing the even-numbered layers from scratch as in the 2nd scenario, I can initialize them with the same base weights used for the odd layers, effectively duplicating each base layer into two consecutive layers (see the sketch after this list).
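For concreteness, here is a minimal sketch of how the three scenarios could be implemented on top of the HuggingFace transformers BART classes. `init_layers` and its `scheme` argument are names I'm introducing for illustration, and the sketch glosses over a few details (e.g. `layernorm_embedding`, `final_logits_bias`), so treat it as a starting point rather than exact code:

```python
import torch
from transformers import BartForConditionalGeneration

@torch.no_grad()
def init_layers(big: BartForConditionalGeneration,
                base: BartForConditionalGeneration,
                scheme: int) -> None:
    """Copy the base model's weights into the (randomly initialized) enlarged model.

    scheme 1 -- base layers fill the first N positions, the rest stay random
    scheme 2 -- base layers fill positions 1, 3, 5, ... (1-indexed),
                positions 2, 4, 6, ... stay random
    scheme 3 -- like scheme 2, but the even positions get a copy of the same
                base layer instead of random weights
    """
    # Embeddings carry over unchanged (token embeddings and the LM head are
    # tied in the standard BART implementation, so copying `shared` is enough).
    big.model.shared.load_state_dict(base.model.shared.state_dict())
    big.model.encoder.embed_positions.load_state_dict(
        base.model.encoder.embed_positions.state_dict())
    big.model.decoder.embed_positions.load_state_dict(
        base.model.decoder.embed_positions.state_dict())

    for big_layers, base_layers in [
        (big.model.encoder.layers, base.model.encoder.layers),
        (big.model.decoder.layers, base.model.decoder.layers),
    ]:
        for i, base_layer in enumerate(base_layers):        # i is 0-indexed
            if scheme == 1:
                targets = [i]                                # first N layers
            elif scheme == 2:
                targets = [2 * i]                            # layers 1, 3, 5, ... (1-indexed)
            else:
                targets = [2 * i, 2 * i + 1]                 # duplicate each base layer
            for t in targets:
                big_layers[t].load_state_dict(base_layer.state_dict())
```

With the enlarged model built as above, scenario 2 would then be `init_layers(big, base, scheme=2)`.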

Since this further pre-training will be a one-shot procedure with a limited budget, I will not have the opportunity to test the different scenarios.
Which scenario makes the most sense to you, and what are the potential upsides and downsides of each?

I decided to go with the 2nd scenario, and it worked pretty well: cross-entropy loss after the 1st epoch (5,000 steps per epoch) is 1.32, compared to more than 7 when training from scratch.