I want to prune BERT-large down to BERT-base in a fair way. Initially I tried random pruning of BERT-large (down to roughly the 110M parameter count), but it didn’t seem to work well. L1 pruning seemed to work (roughly 131M parameters), but it doesn’t seem fair. Pre-training also looks like a big hurdle, given the ambiguities in how to go about it. Please let me know if you have any suggestions for getting BERT-base fairly from BERT-large.
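For context, by “L1 pruning” I mean magnitude pruning: zero out the fraction of weights with the smallest absolute value. A minimal NumPy sketch of the idea (the function name is just illustrative, and real pruning libraries keep a mask so the zeros survive fine-tuning):

```python
import numpy as np

def l1_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of entries with the smallest |w|."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value acts as the pruning threshold
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
pruned = l1_prune(w, 0.5)
# exactly half (8 of 16) of the entries are zeroed for this seed
```

Note that this only sets weights to zero; it doesn’t shrink the tensors, which is part of why “getting BERT-base out of BERT-large” is ambiguous.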
Have you tried distilling it?
Why would you expect pruning to work?
(Why do you want to extract bert-base from bert-large?)
Distillation is a very different thing. What I want is to modify BERT-large so that it has nearly the same parameter count as BERT-base and a weight distribution that matches that of BERT-base.
What do you mean by “fairly”? Clearly, in order for a pruned bert-large to be effective, you need to prune those heads that are least useful. There isn’t really anything “fair” about that.
What do you mean by “the weight distribution matches that of bert-base”? I shouldn’t think that to be possible. To start with, I’m pretty sure you will need to keep at least one head per layer, so that the data can flow through the model, and bert-large has 24 layers to bert-base’s 12. Which weights are you hoping to match? Furthermore, there’s no reason to suppose that the way the weights develop in bert-large will be similar to the way the weights develop in bert-base.
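To make the size mismatch concrete: a back-of-the-envelope parameter count under the standard BERT configs (vocab 30522, 512 positions, 4x feed-forward width; the pooler is omitted, so these are approximations, and the helper name is just illustrative):

```python
# Approximate parameter counts for standard BERT configurations.
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, ffn_mult=4):
    # Embeddings: token + position + segment tables, plus one LayerNorm
    emb = (vocab + max_pos + 2) * hidden + 2 * hidden
    # Self-attention: Q, K, V, and output projections (weights + biases)
    attn = 4 * (hidden * hidden + hidden)
    # Feed-forward: hidden -> ffn_mult*hidden -> hidden (weights + biases)
    ffn = 2 * ffn_mult * hidden * hidden + ffn_mult * hidden + hidden
    # Two LayerNorms per layer (scale + shift each)
    ln = 2 * 2 * hidden
    return emb + layers * (attn + ffn + ln)

base = bert_param_count(layers=12, hidden=768)    # ~109M
large = bert_param_count(layers=24, hidden=1024)  # ~334M
```

Since bert-large has twice the layers and a wider hidden size (1024 vs 768), hitting ~110M by pruning alone means removing roughly two thirds of its weights, and even then the shapes won’t match bert-base’s.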
Are you investigating this purely for the interest of it, or because you want to use the result?