BERT model size (number of transformer blocks)


I have some general questions regarding BERT and distillation.

  1. I want to compare the performance of BERT at different model sizes (different numbers of transformer blocks). Is it necessary to do distillation? If I just train a 6-layer BERT without distillation, will the performance be poor?

  2. Do I have to pre-train from scratch every time I change the number of layers? Is it possible to just remove some layers from an existing pre-trained model and fine-tune on downstream tasks?

  3. Why does BERT have 12 blocks, and not 11 or 13, etc.? I couldn’t find any explanation.


Hi, have you seen this?

It describes and provides several smaller Bert models, including evaluations of their performance.

By the way, I am not an expert.

  1. I think the results would be OK without distillation, though a 6-layer Bert trained from scratch would likely need more training to match a 6-layer DistilBert.

  2. I expect it would be possible to just remove some layers and then fine-tune. After all, full training starts from randomly initialized weights, so I don’t suppose your cut-down model would actually be worse than that. On the other hand, I don’t know if it would be better. Certainly, if you do try it, you would want to cut off the last layers, not the first ones. If you look at deep convolutional networks for image recognition, the first layers detect simple patterns, and the later layers build those simple patterns into more complicated ones.
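Mechanically, cutting off the last layers is simple. Here is a minimal sketch, assuming PyTorch and the Hugging Face transformers library; it uses a randomly initialized model to stay self-contained, but the same surgery applies to a checkpoint loaded with `from_pretrained`:

```python
# Sketch: truncate a 12-layer Bert to its first 6 transformer blocks.
# Assumes the Hugging Face transformers library (BertConfig, BertModel).
import torch
from transformers import BertConfig, BertModel

config = BertConfig(num_hidden_layers=12)
model = BertModel(config)  # in practice: BertModel.from_pretrained(...)

keep = 6  # keep the first 6 blocks, drop the last 6
model.encoder.layer = model.encoder.layer[:keep]
model.config.num_hidden_layers = keep

# The truncated model still runs end to end:
ids = torch.randint(0, config.vocab_size, (1, 8))
out = model(input_ids=ids)
print(out.last_hidden_state.shape)  # torch.Size([1, 8, 768])
```

The slicing works because `model.encoder.layer` is an `nn.ModuleList`; after truncation you would fine-tune as usual.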

  3. I don’t know for sure why Devlin et al. chose 12 or 24 layers, but I assume they tried lots of different configurations and 12 or 24 were the best compromises. They might also have wanted to create models that were roughly as complicated (expensive to run) as some of the previous state-of-the-art models, so that they could compare like for like in their evaluations. It is also possible that even layer counts map more efficiently onto the hardware (GPU or TPU). Notice that all the newly released small Bert models have even numbers of layers.


Thank you for your detailed explanation! I have some other questions about changing the size of each layer. I hope you can help me.
My goal is to reduce the number of layers and expand the size of each layer, so that the total model size stays the same.

  1. When talking about changing the layer size, does that mean changing hidden_size? For example, [768, 768] —> [1024, 1024].

  2. I tried expanding the tensor size using view(), expand(), etc. during training. But they result in a size-mismatch error against the pre-trained model. So I assume I have to pre-train again if I change the layer size? Reducing the number of layers does not have this kind of problem.
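For reference, the error can be reproduced with plain PyTorch alone: load_state_dict() checks tensor shapes exactly, so a checkpoint saved at one hidden size cannot be loaded into a wider layer, and view()/expand() on the live tensors does not change the shapes stored in the checkpoint. A minimal sketch:

```python
# Sketch of the size mismatch: a checkpoint saved with 768-wide weights
# cannot be loaded into a 1024-wide layer.
import torch

layer = torch.nn.Linear(1024, 1024)  # new, wider layer
pretrained = {
    "weight": torch.randn(768, 768),  # shapes from the old checkpoint
    "bias": torch.randn(768),
}
try:
    layer.load_state_dict(pretrained)
    loaded = True
except RuntimeError as err:
    loaded = False  # raises "size mismatch for weight ..."
    print(err)
```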


Hello again,

I am fairly sure you are right, and you would have to train a model from scratch if you want to alter the layer size.

I believe you could increase the width of the model by using more attention heads in each block, by increasing the hidden size, or both. For example, bert-large is 24-layer, 1024-hidden, 16 heads per block, 340M parameters (bert-base uses 12 heads per block).
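As a rough sketch of trading depth for width when defining a model from scratch (assuming the Hugging Face transformers library; the 6-layer, 1024-hidden variant below is just a hypothetical example, and hidden_size must be divisible by num_attention_heads):

```python
# Sketch: a shallower-but-wider configuration with a parameter count
# roughly comparable to bert-base. Assumes the transformers library.
from transformers import BertConfig, BertModel

# bert-base-like: 12 layers, hidden 768, 12 heads
base = BertConfig(num_hidden_layers=12, hidden_size=768,
                  num_attention_heads=12, intermediate_size=3072)
# hypothetical wide variant: 6 layers, hidden 1024, 16 heads
wide = BertConfig(num_hidden_layers=6, hidden_size=1024,
                  num_attention_heads=16, intermediate_size=4096)

n_base = BertModel(base).num_parameters()
n_wide = BertModel(wide).num_parameters()
print(f"base: {n_base / 1e6:.1f}M  wide: {n_wide / 1e6:.1f}M")
```

Note that intermediate_size is set explicitly here, since it defaults to 3072 rather than scaling with hidden_size.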

I think the hidden size corresponds to the number of real numbers used to represent each token, so I think you would need to train a new embedding layer if you changed the hidden size.


OK. I will try to do that. Thank you very much!!