Obtaining BERT-base from BERT-large

prajjwal1 · September 29, 2020, 3:31am

I want to prune BERT-large so that I end up with a BERT-base-sized model, obtained fairly. I first tried random pruning of BERT-large (down to roughly 110M parameters), but it didn’t work well. L1 pruning seemed to work (at roughly 131M parameters), but that doesn’t seem fair. Pre-training seems like a big hurdle, given that there are some ambiguities about how to go about it. Please let me know if you have any suggestions for getting BERT-base fairly from BERT-large.
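For readers unfamiliar with the two strategies mentioned in this post, here is a minimal sketch of how random pruning and L1 (magnitude) pruning pick which weights to drop. The post does not say which implementation or granularity was used, so everything here is an assumption for illustration: the function names are hypothetical, the pruning is unstructured (per weight), and it works on a flat list of numbers rather than on a real BERT checkpoint.

```python
import random


def l1_prune_mask(weights, amount):
    """Return a 0/1 keep-mask that drops the `amount` fraction of weights
    with the smallest absolute value (L1 / magnitude criterion)."""
    n_prune = int(round(amount * len(weights)))
    # Indices ordered by |w|, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def random_prune_mask(weights, amount, seed=0):
    """Return a 0/1 keep-mask that drops a uniformly random `amount` fraction."""
    n_prune = int(round(amount * len(weights)))
    rng = random.Random(seed)
    pruned = set(rng.sample(range(len(weights)), n_prune))
    return [0 if i in pruned else 1 for i in range(len(weights))]


def kept_l1_mass(weights, mask):
    """Fraction of the total |w| mass that survives the mask."""
    total = sum(abs(w) for w in weights)
    kept = sum(abs(w) * m for w, m in zip(weights, mask))
    return kept / total


# Toy stand-in for one weight tensor (hypothetical values, not BERT weights).
toy_rng = random.Random(42)
toy_weights = [toy_rng.gauss(0.0, 0.02) for _ in range(1000)]

# Illustrative sparsity only: about 1 - 110M/335M of the weights removed.
amount = 0.67
print("L1 kept mass:    ", kept_l1_mass(toy_weights, l1_prune_mask(toy_weights, amount)))
print("random kept mass:", kept_l1_mass(toy_weights, random_prune_mask(toy_weights, amount)))
```

At the same sparsity, the magnitude criterion by construction keeps at least as much of the total absolute weight as any random mask, which may be part of why the two approaches behaved so differently here.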

rgwatwormhill · September 29, 2020, 3:22pm

Have you tried distilling it?

https://medium.com/huggingface/distilbert-8cf3380435b5

Why would you expect pruning to work?

(Why do you want to extract bert-base from bert-large?)

prajjwal1 · September 30, 2020, 4:32am

Distillation is a very different thing. What I want is to modify BERT-large so that it has nearly the same parameter count as BERT-base and its weight distribution matches that of BERT-base.

rgwatwormhill · October 2, 2020, 10:53am

What do you mean by “fairly”? Clearly, in order for a pruned bert-large to be effective, you need to prune those heads that are least useful. There isn’t really anything “fair” about that.

What do you mean by “the weight distribution matches that of bert-base”? I shouldn’t think that to be possible. To start with, I’m pretty sure you will need to keep at least one head per layer, so that the data can flow through the model, and bert-large has 24 layers to bert-base’s 12. Which weights are you hoping to match? Furthermore, there’s no reason to suppose that the way the weights develop in bert-large will be similar to the way the weights develop in bert-base.

Are you investigating this purely for the interest of it, or because you want to use the result?
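The size gap discussed in this thread can be checked with a little arithmetic. The sketch below counts encoder parameters analytically from the standard published configurations (BERT-base: 12 layers, hidden size 768, feed-forward size 3072; BERT-large: 24 layers, hidden size 1024, feed-forward size 4096). The function name is hypothetical, and it assumes the uncased vocabulary of 30522 tokens, 512 positions, 2 segment types, a pooler layer, a head dimension of 64, and structured head pruning that removes a head's slice from the query, key, value and output projections.

```python
def bert_param_count(hidden, layers, intermediate, heads_kept_per_layer,
                     head_dim=64, vocab=30522, max_pos=512, type_vocab=2,
                     pooler=True):
    """Analytic parameter count for a BERT encoder (no task head).

    `heads_kept_per_layer` models structured head pruning: each removed head
    takes `head_dim` columns out of Q/K/V and `head_dim` rows out of the
    attention output projection.
    """
    layer_norm = 2 * hidden  # scale + bias

    # Word, position and segment embeddings, plus the embedding LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

    attn_width = head_dim * heads_kept_per_layer
    qkv = 3 * (hidden * attn_width + attn_width)       # weights + biases
    attn_out = attn_width * hidden + hidden
    ffn = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = qkv + attn_out + layer_norm + ffn + layer_norm

    pooler_params = hidden * hidden + hidden if pooler else 0
    return embeddings + layers * per_layer + pooler_params


base = bert_param_count(hidden=768, layers=12, intermediate=3072,
                        heads_kept_per_layer=12)
large = bert_param_count(hidden=1024, layers=24, intermediate=4096,
                         heads_kept_per_layer=16)
# BERT-large with every layer pruned down to a single attention head.
large_one_head = bert_param_count(hidden=1024, layers=24, intermediate=4096,
                                  heads_kept_per_layer=1)

print(f"BERT-base:               {base:,}")
print(f"BERT-large:              {large:,}")
print(f"BERT-large, 1 head/layer: {large_one_head:,}")
```

Under these assumptions, keeping one head in each of the 24 layers still leaves the model well above the roughly 110M parameters of BERT-base, because the feed-forward blocks and embeddings are untouched. Reaching BERT-base size would therefore also need layers, feed-forward width, or hidden size to shrink, not just heads.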
