Pre-trained Model Enlargement

I have a pre-trained Encoder-Decoder MLM (BART) and some spare compute budget, so I decided to enlarge it and continue pre-training by doubling the number of both encoder and decoder layers. Only the depth changes; d_model, d_ff, d_kv, and so on stay the same, so the base model's weights remain shape-compatible with the individual layers of the bigger model.
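To make the shape-compatibility point concrete, here is a minimal sketch of the enlargement step, assuming the HuggingFace transformers BART classes (the checkpoint name is just a placeholder for whatever base model is being enlarged):

```python
import copy
from transformers import BartForConditionalGeneration

# Placeholder checkpoint; in practice this would be the base model being enlarged.
base = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

big_cfg = copy.deepcopy(base.config)
big_cfg.encoder_layers *= 2   # only the layer counts change;
big_cfg.decoder_layers *= 2   # d_model, FFN dims, heads, etc. stay as in the base config
big = BartForConditionalGeneration(big_cfg)  # randomly initialized, twice as deep

# Per-layer parameter shapes match, so a base layer's weights load directly:
big.model.encoder.layers[0].load_state_dict(base.model.encoder.layers[0].state_dict())
```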

I plan to use the existing embedding, encoder, and decoder weights to initialize the bigger model.
There are three scenarios for doing that:

  1. I can copy the base model's weights into the first N encoder and decoder layers of the new, bigger model and initialize the remaining layers from scratch.
  2. I can copy the base model's weights into the odd-numbered layers (1, 3, 5, and so on) and initialize the even-numbered layers (2, 4, 6, and so on) from scratch.
  3. Instead of initializing the even-numbered layers from scratch as in the 2nd scenario, I can initialize them with the same base weights used for the odd layers, effectively duplicating each base layer into two consecutive layers (see the sketch after this list).
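For concreteness, here is a minimal sketch of how the three scenarios could be implemented on top of the HuggingFace transformers BART classes. `init_layers` and its `scheme` argument are names I'm introducing for illustration, and the sketch glosses over a few details (e.g. `layernorm_embedding`, `final_logits_bias`), so treat it as a starting point rather than exact code:

```python
import torch
from transformers import BartForConditionalGeneration

@torch.no_grad()
def init_layers(big: BartForConditionalGeneration,
                base: BartForConditionalGeneration,
                scheme: int) -> None:
    """Copy the base model's weights into the (randomly initialized) enlarged model.

    scheme 1 -- base layers fill the first N positions, the rest stay random
    scheme 2 -- base layers fill positions 1, 3, 5, ... (1-indexed),
                positions 2, 4, 6, ... stay random
    scheme 3 -- like scheme 2, but the even positions get a copy of the same
                base layer instead of random weights
    """
    # Embeddings carry over unchanged (token embeddings and the LM head are
    # tied in the standard BART implementation, so copying `shared` is enough).
    big.model.shared.load_state_dict(base.model.shared.state_dict())
    big.model.encoder.embed_positions.load_state_dict(
        base.model.encoder.embed_positions.state_dict())
    big.model.decoder.embed_positions.load_state_dict(
        base.model.decoder.embed_positions.state_dict())

    for big_layers, base_layers in [
        (big.model.encoder.layers, base.model.encoder.layers),
        (big.model.decoder.layers, base.model.decoder.layers),
    ]:
        for i, base_layer in enumerate(base_layers):        # i is 0-indexed
            if scheme == 1:
                targets = [i]                                # first N layers
            elif scheme == 2:
                targets = [2 * i]                            # layers 1, 3, 5, ... (1-indexed)
            else:
                targets = [2 * i, 2 * i + 1]                 # duplicate each base layer
            for t in targets:
                big_layers[t].load_state_dict(base_layer.state_dict())
```

With the enlarged model built as above, scenario 2 would then be `init_layers(big, base, scheme=2)`.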

Since this further pre-training will be a one-shot procedure with a limited budget, I will not have the opportunity to test the different scenarios.
Which scenario makes the most sense to you, and what are the potential upsides and downsides of each?

I decided to go with the 2nd scenario, and it worked pretty well: cross-entropy loss after the 1st epoch (5,000 steps per epoch) is 1.32, compared to more than 7 when training from scratch.