Pegasus Model Weights Compression/Pruning

I have trained Pegasus for conditional generation on a custom dataset and have been using it successfully. My project's aim is text summarization deployed as a Heroku app. My model weights are over 2.5 GB, and Heroku can't support that.

I trained Pegasus using the legacy directory in transformers. I was wondering whether pruning would help reduce the model size significantly while still keeping accuracy and performance intact.

I couldn't try Keras pruning due to the complexity of the trainer involved, and I do not have any idea how that works.

Any way you guys could help would be much appreciated.

hey @SteveMama, have you tried quantizing your model first? this might be a “quick win” to reduce the size of your model by 2-3x.
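to make the "quick win" concrete, here's a minimal sketch of what dynamic quantization does to size on disk, using a small stand-in model instead of pegasus (the layer sizes are made up for illustration; a real checkpoint behaves the same way, since the nn.Linear layers dominate the parameter count in transformer models):

```python
import os
import tempfile

import torch
import torch.nn as nn

# stand-in for a big seq2seq model: dynamic quantization targets the
# nn.Linear layers, which hold most of the weights in a transformer
model = nn.Sequential(*[nn.Linear(512, 512) for _ in range(8)])

# int8 dynamic quantization of all Linear layers
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_on_disk_mb(m):
    """Serialize the state_dict to a temp file and report its size in MB."""
    with tempfile.NamedTemporaryFile(suffix=".pt", delete=False) as f:
        torch.save(m.state_dict(), f.name)
    size = os.path.getsize(f.name)
    os.unlink(f.name)
    return size / 1e6

fp32_mb = size_on_disk_mb(model)
int8_mb = size_on_disk_mb(quantized)
print(f"fp32: {fp32_mb:.1f} MB, int8: {int8_mb:.1f} MB")
```

the int8 weights take roughly a quarter of the space of fp32, with a small overhead for the per-layer scale/zero-point metadata.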

if that’s not good enough, then my suggestion would be to try movement pruning which is designed for fine-tuning language models and is implemented in the nn_pruning library: GitHub - huggingface/nn_pruning: Prune a model while finetuning or training.

(i’m not sure if pegasus is supported in the library, but you can post an issue to find out!)

Thanks for getting back @lewtun.
The model is stored as a PyTorch model, so I’ve been going through the PyTorch docs. Correct me if I’m wrong, but is it possible to prune or reduce the model size post-training and then train again?

i think the answer strongly depends on the type of pruning method (structured vs unstructured), but it is possible in principle to iteratively prune a neural network - this was essentially the strategy in a now-famous paper by han et al:

now, when it comes to transformers, my current understanding is that you can prune the attention heads post-training (see e.g. here), but if you want to keep training you’ll probably need to use something like the movement pruning i linked to above.
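as a concrete sketch of the post-training route: transformers exposes a prune_heads method on every PreTrainedModel. the example below uses a tiny, randomly initialised BERT so nothing has to be downloaded (the config sizes are made up), but the same call works on a fine-tuned pegasus model:

```python
from transformers import BertConfig, BertModel

# tiny randomly initialised model - no checkpoint download needed
config = BertConfig(
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=128,
    vocab_size=100,
)
model = BertModel(config)

# drop heads 0 and 1 in layer 0, and head 3 in layer 1;
# the corresponding query/key/value/output weights are removed
model.prune_heads({0: [0, 1], 1: [3]})

# the pruned layers now carry fewer heads (and fewer parameters)
print(model.encoder.layer[0].attention.self.num_attention_heads)  # 2
print(model.encoder.layer[1].attention.self.num_attention_heads)  # 3
```

note this is structured pruning: whole heads are removed, so the saved model actually shrinks, unlike unstructured (per-weight) pruning which only zeroes values.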

my personal experience is that pruning is still quite tricky to get right and requires some care to preserve the performance of the original model. i would suggest trying quantization or optimizations with frameworks like ONNX Runtime before devoting a lot of time to pruning.

@lewtun Thanks a ton. Quantization did help me reduce the model size by 3x. However, I can’t load it as a PegasusForConditionalGeneration model since the weights and layer numbers do not match and aren’t initialized. Is there a way I can load the quantized model and avoid this problem?

@lewtun essentially, when I try to load the quantized model I saved in my drive, this warning occurs, and post-inference I find the summary generated in the form of numbers rather than readable text.
How can this be fixed?

hmm this is a bit odd :grimacing:

how did you do the quantization? i just remembered that someone already asked about quantizing pegasus here, so maybe you can check whether you can dynamically quantize the model in a similar way to how i described it there, and then try generating some outputs with the same model in memory (i.e. don’t save and reload)

if that works, then my guess is that from_pretrained doesn’t support loading quantized models (i can have a look) and you might need to do the loading in native pytorch

@lewtun I chose the dynamic quantization approach, and I don’t think from_pretrained supports loading quantized models.
I’d be happy if you could take a look. Could we hop on a Google Meet call so you can help me out?

hey @SteveMama, i had a closer look at from_pretrained and indeed it does not support loading quantized models because quantization changes the model’s state_dict (for example by introducing scale and zero_point parameters).

however, i think there is a workaround that involves the following steps:

1. quantize your model

import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# load fine-tuned model and tokenizer
model_ckpt = "google/pegasus-cnn_dailymail"
model = PegasusForConditionalGeneration.from_pretrained(model_ckpt)
# quantize model
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

2. save the quantized model’s state_dict and config to disk

# save config
quantized_model.config.save_pretrained("pegasus-quantized-config")
# save state dict
torch.save(quantized_model.state_dict(), "pegasus-quantized-state-dict.pt")

3. in your heroku app, create a dummy model using the saved config

from transformers import AutoConfig
# load config and dummy model
config = AutoConfig.from_pretrained("pegasus-quantized-config")
dummy_model = PegasusForConditionalGeneration(config)

4. quantize dummy model and load state dict

reconstructed_quantized_model = torch.quantization.quantize_dynamic(
    dummy_model, {torch.nn.Linear}, dtype=torch.qint8
)
reconstructed_quantized_model.load_state_dict(
    torch.load("pegasus-quantized-state-dict.pt")
)
from here you should be able to run reconstructed_quantized_model.generate and produce coherent outputs - let me know if you cannot.

an alternative is to use torchscript for the serialization as done in this pytorch tutorial (link), but i am not very familiar with torchscript and it seems somewhat complicated because you need to trace out the forward pass with some dummy inputs (seems to work for BERT but less sure about seq2seq models like pegasus)

hope that helps!

hey @lewtun thanks for getting back.
Near dummy_model, I see that you have linked the config file as model.config, which is essentially the config of the original/pretrained model. So, will it be possible to load the quantized model without referencing the original one?

The reason I’m asking is that, in case I want to put my model on Google Cloud and deploy it as a service, I can’t afford to productionize a 1.3 GB model. I thought that, post-quantization, I could directly upload the quantized model at half the size.

Is there a way to do this?

Pranav K

oops that was a typo! i should have written that we load the saved config:

from transformers import AutoConfig
# load config and dummy model
config = AutoConfig.from_pretrained("pegasus-quantized-config")
dummy_model = PegasusForConditionalGeneration(config)

if i am not mistaken this should allow you to load just the quantized model, but since this is pseudocode you should do some checks to see if the dummy_model is really light on disk / RAM (as I expect it should be) :slight_smile:

This works. Also, if I have to load the Pegasus tokenizer, I don’t think the quantized model can be used; I essentially have to load the pre-trained model.

Is there a solution for this?

if i am not mistaken, you could load up the default pegasus tokenizer from the google/pegasus-cnn_dailymail checkpoint (which should be light) and then load the quantized model as i described above.

if that still is not sufficient memory-wise (or simply doesn’t work), you might want to check out torchscript for the serialisation, e.g. here’s a nice guide: An empirical approach to speedup your BERT inference with ONNX/Torchscript | by Maxence Alluin | Towards Data Science

The dynamic quantization description in the PyTorch docs says it is used for situations where model execution time is dominated by loading weights from memory rather than by computing the matrix multiplications. For a live application, the model is already loaded, so the bottleneck is the matrix multiplications at inference time. Am I right, or am I missing some details here?
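for what it's worth, dynamic quantization replaces the fp32 matrix multiplications with int8 kernels, so it can also speed up the compute itself on CPU, not just the weight loading. a rough timing sketch with a made-up layer size (the absolute numbers will vary by machine and torch build):

```python
import time

import torch
import torch.nn as nn

torch.set_num_threads(1)  # keep the comparison simple and stable

fp32 = nn.Sequential(nn.Linear(1024, 1024))
# quantize_dynamic returns a copy by default, so fp32 stays fp32
int8 = torch.quantization.quantize_dynamic(
    fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(64, 1024)

def bench(m, reps=100):
    m(x)  # warm-up
    start = time.perf_counter()
    for _ in range(reps):
        m(x)
    return (time.perf_counter() - start) / reps * 1e6  # microseconds

print(f"fp32: {bench(fp32):.0f} us/call")
print(f"int8: {bench(int8):.0f} us/call")
```

there is some per-call overhead for quantizing the activations on the fly, which is why the docs frame dynamic quantization as best suited to memory-bound models; whether the int8 matmuls win for your batch sizes is worth measuring directly like this.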