In your Saving PruneBERT notebook I noticed that you only save the encoder and head when comparing the effects of pruning / quantisation. For example, here you save the original dense model as follows:
import os
import torch

# Saving the original (encoder + classifier) in the standard torch.save format
dense_st = {name: param for name, param in model.state_dict().items()
            if "embedding" not in name and "pooler" not in name}
torch.save(dense_st, 'dbg/dense_squad.pt')
dense_mb_size = os.path.getsize("dbg/dense_squad.pt")  # size on disk in bytes
My question is: why are the embedding and pooler layers excluded from the size comparison between the BERT-base model and its pruned / quantised counterpart?
Naively, I would have thought that if I care about the amount of storage my model requires, then I would include all layers in the size calculation.
Hey!
The QA model actually only needs the qa-head; the pooler is just decorative (it's not even trained). The start and end of each span are predicted directly from the sequence of hidden states. This explains why I am not saving the pooler.
As for the embeddings, I'm only fine-pruning the encoder, and the embedding modules stay fixed at their pre-trained values. So I am mostly interested in comparing the compression ratio of the encoder (since the rest is fixed).
Hope that makes sense.
Hi @VictorSanh, I have a follow up question about the Saving PruneBERT notebook.
As far as I can tell, you rely on weight quantization in order to be able to use the CSR format on integer-valued weights - is this correct?
My question is whether it is possible to show the memory compression benefits of fine-pruning without quantizing the model first?
What I’d like to do is quantify the memory reduction of BERT-base vs your PruneBERT model, so that one can clearly see that X% comes from pruning, Y% from quantization and so on.
The notebook you are playing with only applies the weight quantization. It takes as input the fine-pruned (pruned during fine-tuning) model, so to see the impact of the pruning alone, simply count the number of non-zero values in the encoder. That should give you the compression rate of pruning!
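For instance, a minimal sketch of that check (using the fine-pruned SQuAD checkpoint as an example; any fine-pruned model works):

from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained(
    "huggingface/prunebert-base-uncased-6-finepruned-w-distil-squad"
)

# Count non-zero vs. total parameters, restricted to the encoder + head
total, nonzero = 0, 0
for name, param in model.state_dict().items():
    if "embedding" in name or "pooler" in name:
        continue
    total += param.numel()
    nonzero += (param != 0).sum().item()

print(f"Encoder sparsity: {1 - nonzero / total:.2%}")
print(f"Compression rate from pruning alone: {total / nonzero:.1f}x")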
Victor
Counting the number of non-zero values is a good idea to get the compression rate, but what I’d usually do to quantify the size on disk (e.g. in MB) is save the encoder’s state_dict and get the size as follows:
import torch
from pathlib import Path

# Keep only the encoder + head, as in the notebook
state_dict = {name: param for name, param in model.state_dict().items()
              if "embedding" not in name and "pooler" not in name}
tmp_path = Path("model.pt")
torch.save(state_dict, tmp_path)
# Calculate size in megabytes
size_mb = tmp_path.stat().st_size / (1024 * 1024)
Now, my understanding is that if I load a fine-pruned model as follows
model = BertForQuestionAnswering.from_pretrained("huggingface/prunebert-base-uncased-6-finepruned-w-distil-squad")
then the model is dense, so I don’t see any compression gains on disk when I save the state_dict - is this correct?
If yes, then do you know if there’s a way to save the state_dict of a fine-pruned model to disk in a way that reflects the compression gains from a sparse encoder?
Ooooh yeah sorry for the confusion.
As far as I know (I think I tried), you can use the torch.sparse tensor representation, which decomposes a sparse tensor into its CSR format (the locations of the non-zero values plus the non-zero values themselves). It should give you a compression gain in MB.
The reason I encoded the CSR format “by hand” is that sparse quantized tensors don't exist yet in PyTorch, so I had to do the quantization and the CSR format on top.
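For reference, a minimal sketch of that idea (not what the notebook does; just a plain torch.sparse conversion of the pruned encoder, with model loaded as above; to_sparse() produces the COO layout):

import os
import torch

sparse_st = {}
for name, param in model.state_dict().items():
    if "embedding" in name or "pooler" in name:
        continue
    # Only the 2-D weight matrices are pruned; keep biases / LayerNorm params dense
    sparse_st[name] = param.to_sparse() if param.dim() == 2 else param

torch.save(sparse_st, "sparse_squad.pt")
print(os.path.getsize("sparse_squad.pt") / (1024 * 1024), "MB")

Note that when reloading such a checkpoint, each sparse tensor needs to be converted back with .to_dense() before calling load_state_dict.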
Thanks for the tip about torch.sparse: from the docs it seems to use the COO format, which should also work well.
And thanks for clarifying the reason for encoding the CSR format by hand - when I find a solution to the torch > 1.5 issue, I’ll expand the text accordingly!