Pegasus Model Weights Compression/Pruning

I have trained Pegasus for conditional generation on a custom dataset and have been using it successfully. My project's aim is text summarization deployed as a Heroku app. My model weights are over 2.5 GB, and Heroku can't support that.

I trained Pegasus with seq2seq.py in the legacy dir of transformers. I was wondering whether pruning would help reduce the model size significantly while keeping accuracy and performance intact.

I couldn't try Keras pruning due to the complexity of the trainer in seq2seq.py, and I don't have any idea how that works.

Any way you guys could help would be much appreciated.
Thanks

hey @SteveMama, have you tried quantizing your model first? this might be a “quick win” to reduce the size of your model by 2-3x.

if that’s not good enough, then my suggestion would be to try movement pruning which is designed for fine-tuning language models and is implemented in the nn_pruning library: GitHub - huggingface/nn_pruning: Prune a model while finetuning or training.

(i'm not sure if pegasus is supported in the library, but you can post an issue to find out!)


Thanks for getting back @lewtun.
Correction: the model is stored as a PyTorch model, so I've been going through the PyTorch docs. Correct me if I'm wrong, but is it possible to prune or reduce the model size post-training and then train again?

i think the answer strongly depends on the type of pruning method (structured vs unstructured), but it is possible in principle to iteratively prune a neural network - this was essentially the strategy in a now-famous paper by han et al.
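as a rough illustration of the iterative idea, here's a toy sketch with pytorch's built-in pruning utilities (not pegasus-specific, just to show the prune -> fine-tune -> prune loop):

import torch
import torch.nn.utils.prune as prune

# toy example: iterative magnitude (unstructured) pruning of a single linear layer
layer = torch.nn.Linear(128, 128)

for step in range(3):
    # zero out 20% of the remaining smallest-magnitude weights each round
    prune.l1_unstructured(layer, name="weight", amount=0.2)
    # ... fine-tune the model here before the next pruning round ...

# make the pruning permanent by removing the reparametrization
prune.remove(layer, "weight")
print(f"sparsity: {(layer.weight == 0).float().mean().item():.2%}")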

now, when it comes to transformers, my current understanding is that you can prune the attention heads post-training (see e.g. here), but if you want to keep training you'll probably need to use something like the movement pruning that i linked to above.
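for reference, the base model class in transformers exposes a prune_heads method, although support varies by architecture (BERT-style encoders implement it, and i'm not sure encoder-decoder models like pegasus do), so treat this as a sketch:

from transformers import BertModel

# sketch: head pruning on a BERT-style encoder
model = BertModel.from_pretrained("bert-base-uncased")
print(sum(p.numel() for p in model.parameters()))  # parameter count before pruning

# remove heads 0 and 2 in layer 0, and head 1 in layer 2
model.prune_heads({0: [0, 2], 2: [1]})
print(sum(p.numel() for p in model.parameters()))  # parameter count after pruning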

my personal experience is that pruning is still quite tricky to get right and requires some care to preserve the performance of the original model. i would suggest trying quantization or optimizations with frameworks like ONNX Runtime before devoting a lot of time to pruning.

@lewtun Thanks a ton. Quantization did help me reduce the model size by 3x. However, I can't load it as a PegasusForConditionalGeneration model, since the weights and layer numbers don't match and aren't initialized. Is there a way I can load the quantized model and avoid this problem?

@lewtun Essentially, when I try to load the quantized model I saved to my drive, this warning occurs, and after inference the generated summary comes out as numbers rather than readable text.
How can this be fixed?


hmm this is a bit odd :grimacing:

how did you do the quantization? i just remembered that someone already asked about quantizing pegasus here, so maybe you can check whether you can dynamically quantize the model in a similar way to how i described it there, and then try generating some outputs with the same model in memory (i.e. don't save and reload)

if that works, then my guess is that from_pretrained doesn’t support loading quantized models (i can have a look) and you might need to do the loading in native pytorch

@lewtun I chose the dynamic quantization approach, and I don't think from_pretrained supports loading quantized models.
I'd be happy if you could take a look. Could we hop on a Google Meet call so you can help me out?
Hopefully.
Thanks

hey @SteveMama, i had a closer look at from_pretrained and indeed it does not support loading quantized models because quantization changes the model’s state_dict (for example by introducing scale and zero_point parameters).

however, i think there is a workaround that involves the following steps:

1. quantize your model

import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# load fine-tuned model and tokenizer
model_ckpt = "google/pegasus-cnn_dailymail"
tokenizer = PegasusTokenizer.from_pretrained(model_ckpt)
model = PegasusForConditionalGeneration.from_pretrained(model_ckpt)
# quantize the model's linear layers to int8
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

2. save the quantized model’s state_dict and config to disk

# save config
quantized_model.config.save_pretrained("pegasus-quantized-config")
# save state dict
quantized_state_dict = quantized_model.state_dict()
torch.save(quantized_state_dict, "pegasus-quantized.pt")

3. in your heroku app, create a dummy model using the saved config

from transformers import AutoConfig, PegasusForConditionalGeneration

# load the saved config and create a randomly initialised dummy model
config = AutoConfig.from_pretrained("pegasus-quantized-config")
dummy_model = PegasusForConditionalGeneration(config)

4. quantize dummy model and load state dict

# apply the same dynamic quantization to the dummy model so the state_dicts match,
# then load the saved quantized weights from disk
reconstructed_quantized_model = torch.quantization.quantize_dynamic(
    dummy_model, {torch.nn.Linear}, dtype=torch.qint8
)
reconstructed_quantized_model.load_state_dict(torch.load("pegasus-quantized.pt"))

from here you should be able to run reconstructed_quantized_model.generate and produce coherent outputs - let me know if you cannot.
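for example, something along these lines (an untested sketch, reusing the tokenizer from step 1 - you'd load it the same way inside your app, and the input text is just a placeholder):

# quick check that the reconstructed model still produces readable text
text = "PG&E stated it scheduled the blackouts in response to forecasts for high winds."
inputs = tokenizer(text, truncation=True, return_tensors="pt")
summary_ids = reconstructed_quantized_model.generate(**inputs)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))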

an alternative is to use torchscript for the serialization as done in this pytorch tutorial (link), but i am not very familiar with torchscript and it seems somewhat complicated because you need to trace out the forward pass with some dummy inputs (seems to work for BERT but less sure about seq2seq models like pegasus)

hope that helps!


hey @lewtun thanks for getting back.
Near dummy_model, I see that you have passed the config as model.config, which essentially comes from the original/pretrained model. So, will it be possible to load the quantized model without referencing the original one?

The reason I'm asking is that, if I want to put my model on Google Cloud and deploy it as a service, I can't afford to productionize a 1.3 GB model. I thought that, post-quantization, I could directly upload the quantized model at half the size.

Is there a way to do this?

Regards,
Pranav K

oops that was a typo! i should have written that we load the saved config:

from transformers import AutoConfig, PegasusForConditionalGeneration

# load the saved config and create a randomly initialised dummy model
config = AutoConfig.from_pretrained("pegasus-quantized-config")
dummy_model = PegasusForConditionalGeneration(config)

if i am not mistaken this should allow you to load just the quantized model, but since this is pseudocode you should do some checks to see if the dummy_model is really light on disk / RAM (as I expect it should be) :slight_smile:
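e.g. a rough way to check the disk footprint (continuing from the snippets above; the filenames are just placeholders):

import os
import torch

# compare the serialized size of the float dummy model vs. the quantized one
torch.save(dummy_model.state_dict(), "dummy-fp32.pt")
torch.save(reconstructed_quantized_model.state_dict(), "pegasus-int8.pt")
print(f"fp32 state_dict: {os.path.getsize('dummy-fp32.pt') / 1e6:.0f} MB")
print(f"int8 state_dict: {os.path.getsize('pegasus-int8.pt') / 1e6:.0f} MB")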

This works. Also, if I have to load the Pegasus tokenizer, I don't think the quantized model can be used; I essentially have to load the pre-trained model.

Is there a solution for this?

if i am not mistaken, you could load up the default pegasus tokenizer with the google/pegasus-cnn_dailymail checkpoint (which should be light) and then load the quantized model as i described above.
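e.g. something like this (the tokenizer files on the hub are only a few MB, so no model weights are downloaded):

from transformers import PegasusTokenizer

# load just the tokenizer from the hub checkpoint
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-cnn_dailymail")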

if that still is not sufficient memory-wise (or simply doesn’t work), you might want to check out torchscript for the serialisation, e.g. here’s a nice guide An empirical approach to speedup your BERT inference with ONNX/Torchscript | by Maxence Alluin | Towards Data Science

The dynamic quantization description in the PyTorch docs says it is meant for situations where model execution time is dominated by loading weights from memory rather than by computing the matrix multiplications. For live applications the model is already loaded, so the bottleneck is the matrix multiplications for inference at runtime. Am I right, or am I missing some details here?

Thanks @lewtun. For me, the quantized model takes more time to run than the original version. Could you run this code? Do you also see higher times?


import torch
import torch.quantization
from transformers import T5ForConditionalGeneration, AutoConfig, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small')

# dynamic quantization only supports CPU execution, so keep everything on CPU
# for a fair comparison between the base and quantized models
device = "cpu"

base_model = T5ForConditionalGeneration.from_pretrained("t5-small")
param_count = sum(p.numel() for p in base_model.parameters())
memory = (param_count * 4) / (1024 * 1024)
print(f'memory in MB: {memory}')

base_model.save_pretrained("tmp-t5-small")

quantized_model = torch.quantization.quantize_dynamic(model=base_model,
                                                      qconfig_spec={torch.nn.Linear},
                                                      dtype=torch.qint8)

# This does NOT work:
#quantized_model.save_pretrained("tmp-t5-small-quantized")

quantized_model.config.save_pretrained("tmp-t5-small-quantized-config")  # save config
quantized_state_dict = quantized_model.state_dict()
torch.save(quantized_state_dict, "tmp-t5-small-quantized-state-dict.pt")

print('Load quantized model')
quantized_config = AutoConfig.from_pretrained("tmp-t5-small-quantized-config")
dummy_model = T5ForConditionalGeneration(quantized_config)

reconstructed_quantized_model = torch.quantization.quantize_dynamic(
    dummy_model, {torch.nn.Linear}, dtype=torch.qint8
)
reconstructed_quantized_model.load_state_dict(torch.load("tmp-t5-small-quantized-state-dict.pt"))

def eval(model, tokenizer, sentence):
    import time
    s = time.time()
    model.eval()
    test_ids = tokenizer(sentence, return_tensors="pt").to(device).input_ids
    beam_output = model.generate(test_ids)
    print(f"eval sentence: [{str(tokenizer.decode(beam_output[0], skip_special_tokens=True))}], took {(time.time()-s)}")

prompt = "summarize: From the very beginning, Regan was seen as having series potential. After the television film scored highly in the ratings, work began on the development of the series proper. Ian Kennedy Martin's idea was for the series to be mainly studio-based, with more dialogue and less action, but producer Ted Childs disagreed, and in consequence Ian Kennedy Martin parted company with the project. Childs produced it on 16mm film, a format that allowed for a much smaller film unit than videotape at that time. This made it possible to shoot almost entirely on location which helped give the series a startling degree of realism and to use film editing techniques which enabled him to give the show a heavy bias toward action sequences. The television play and the subsequent series were commissioned by Thames Television and produced by its film division Euston Films. It was originally broadcast on ITV between 2 January 1975 and 28 December 1978 at 21:00–22:00 on weekdays (usually Mondays), with repeated screenings at the same time until the early 1980s. The writers were given strict guidelines to follow: \"Each show will have an overall screen time (minus titles) of 48 minutes 40 seconds. Each film will open with a teaser of up to 3 minutes, which will be followed by the opening titles. The story will be played across three acts, each being no more than 19 minutes and no less than 8 minutes in length. Regan will appear in every episode, Carter in approximately 10 out of 13 episodes. In addition to these main characters, scripts should be based around three major speaking parts, with up to ten minor speaking parts."
print('Quantized model generate()')
eval(reconstructed_quantized_model, tokenizer, prompt)
eval(reconstructed_quantized_model, tokenizer, prompt)
print('Base model generate()')
eval(base_model, tokenizer, prompt)
eval(base_model, tokenizer, prompt)