I have a pre-trained Encoder-Decoder MLM (BART) and a bit of spare compute budget. I have decided to enlarge it and continue pre-training by doubling the number of both encoder and decoder layers, while keeping d_model, d_ff, d_kv, and so on unchanged. Hence the weights of the base model remain compatible with the new architecture.
I plan to use the existing embedding, encoder, and decoder weights to initialize the bigger model.
There are three scenarios for doing this:
- I can copy the weights of the base model into the first N encoder and decoder layers of the new, bigger model and initialize the remaining layers from scratch.
- I can copy the weights of the base model into the odd-numbered layers (1, 3, 5, and so on) and initialize the even-numbered layers (2, 4, 6, and so on) from scratch.
- Instead of initializing the even-numbered layers from scratch as in the second scenario, I can copy the same base weights into them as well, so each base layer initializes a consecutive pair of layers in the new model. (A rough code sketch of all three options follows this list.)
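
For concreteness, here is a minimal sketch of how the three scenarios could be wired up with Hugging Face Transformers. It assumes `facebook/bart-base` (6 encoder and 6 decoder layers) is being doubled to 12/12; the checkpoint name, the `init_layers` helper, the scenario labels, and the output path are illustrative assumptions on my part, not a tested recipe.

```python
import copy
from transformers import BartForConditionalGeneration

# Load the base model and build a randomly initialized model with doubled depth.
base = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
cfg = copy.deepcopy(base.config)
cfg.encoder_layers *= 2   # 6 -> 12
cfg.decoder_layers *= 2   # 6 -> 12
big = BartForConditionalGeneration(cfg)

# Copy the shared pieces that exist in both models: token embeddings,
# learned positional embeddings, and the embedding layer norms.
big.model.shared.load_state_dict(base.model.shared.state_dict())
big.model.encoder.embed_positions.load_state_dict(base.model.encoder.embed_positions.state_dict())
big.model.decoder.embed_positions.load_state_dict(base.model.decoder.embed_positions.state_dict())
big.model.encoder.layernorm_embedding.load_state_dict(base.model.encoder.layernorm_embedding.state_dict())
big.model.decoder.layernorm_embedding.load_state_dict(base.model.decoder.layernorm_embedding.state_dict())

def init_layers(big_layers, base_layers, scenario):
    """Copy base layer weights into the doubled stack according to one of the
    three scenarios; layers not covered keep their random initialization."""
    n = len(base_layers)
    if scenario == "bottom":        # scenario 1: base layers fill the first n positions
        for i in range(n):
            big_layers[i].load_state_dict(base_layers[i].state_dict())
    elif scenario == "interleave":  # scenario 2: base layers fill the odd positions (1, 3, 5, ...)
        for i in range(n):
            big_layers[2 * i].load_state_dict(base_layers[i].state_dict())
    elif scenario == "duplicate":   # scenario 3: each base layer initializes a consecutive pair
        for i in range(n):
            big_layers[2 * i].load_state_dict(base_layers[i].state_dict())
            big_layers[2 * i + 1].load_state_dict(base_layers[i].state_dict())

scenario = "duplicate"  # or "bottom" / "interleave"
init_layers(big.model.encoder.layers, base.model.encoder.layers, scenario)
init_layers(big.model.decoder.layers, base.model.decoder.layers, scenario)

big.save_pretrained("bart-base-doubled-init")
```

(Note that odd layers 1, 3, 5, ... in the description correspond to 0-based indices 0, 2, 4, ... in the code.)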
Since this further pre-training will be a one-shot procedure with a limited budget, I will not have the opportunity to test different scenarios.
So, which scenario makes the most sense, and what do you think the potential upsides and downsides of each are?