Pegasus inference for a production use case


We have fine-tuned the distill-pegasus-cnn-16-4 summarization model on our own data, and the results look good. However, when we try to deploy it for a real-time production use case, inference takes a very long time on an ml.c5.xlarge CPU instance (around 13 seconds per document, with documents processed one after another). We also tried a g4dn.xlarge GPU instance, where inference takes around 1.7 seconds per document, but serving from a GPU is expensive. Does anyone have ideas for making inference faster at a lower cost?

I have tried ONNX Runtime and TorchScript, but neither of them supports the Pegasus model. Is there a timeline for Pegasus support? I also tried tuning num_beams; the numbers above already include that change.

Inference code:
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# One document per list entry.
src_text = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]

model_name = "sshleifer/distill-pegasus-cnn-16-4"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)

# Tokenize the batch, padding to the longest document and truncating to the model's limit.
inputs = tokenizer(src_text, truncation=True, padding="longest", return_tensors="pt").to(torch_device)
# Pass the attention mask so padded positions are ignored during generation.
translated = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], num_beams=2)
output = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in translated][0]

Thanks in advance for your help. Apologies if I am missing something obvious.



Could someone help in this regard?

Thanks in advance,

Hi Karthik,
Forgive me if you have looked into this, but have you tried model quantization?

If ONNX is not supported, then your only other bet is to decrease the size of the model or change some parameters like you’ve been trying.
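
For reference, dynamic quantization in PyTorch could look roughly like the sketch below. It swaps the `nn.Linear` layers of a model for int8 versions, which only helps on CPU. The helper name `quantize_linear_layers` is made up for this example, and I have not measured how much it speeds up Pegasus or what it does to summary quality, so treat it as a starting point to benchmark rather than a known fix.

```python
import torch
from torch import nn


def quantize_linear_layers(model: nn.Module) -> nn.Module:
    """Return a copy of `model` whose nn.Linear layers use int8 weights.

    Hypothetical helper for illustration. Dynamic quantization runs on CPU
    only, so the model must be on the CPU before calling this.
    """
    model.eval()  # quantization here is meant for inference, not training
    # quantize_dynamic returns a converted copy by default (inplace=False);
    # activations are quantized on the fly at run time.
    return torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )
```

With the model from the original post, this would be called as `quantize_linear_layers(model.to("cpu"))` before `model.generate(...)`, and the latency compared against the unquantized model on the same documents.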

Hi Karthik, one option would be to serve it with our hosted inference API, for example with our startup plan here: We would take care of the optimizations for you. If you have any questions, feel free to write an email to


Thanks for your inputs. I will check them.

@karthikgali Something is confusing about this post for me. You wrote

We have finetuned distill-pegasus-cnn-16-4

But I can see in your code that you are loading

model_name = "sshleifer/distill-pegasus-cnn-16-4"

If you fine-tuned it, wouldn’t you load your fine-tuned model, and not the model that has already been tuned by sshleifer?

Put another way, see here:

All the checkpoints are fine-tuned for summarization, besides pegasus-large, whence the other checkpoints are fine-tuned

My understanding from this is that, if we, as the newbie user, have some data we want to use with Pegasus, we should do this:

  1. Start with pegasus-large
  2. Fine tune it on our own data
  3. Use the pytorch_model.bin output from this fine tuning process to run inference on our own data.
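
As I understand it, step 3 would look something like the sketch below: point `from_pretrained` at the directory the fine-tuning run wrote to, instead of at a hub name such as sshleifer/distill-pegasus-cnn-16-4. The helper name `load_finetuned_pegasus` and the directory name in the usage note are made up for illustration, and this assumes the directory was written with `save_pretrained`, so it holds the config next to the weights.

```python
import torch
from transformers import PegasusForConditionalGeneration


def load_finetuned_pegasus(
    checkpoint_dir: str, device: str = "cpu"
) -> PegasusForConditionalGeneration:
    """Load fine-tuned Pegasus weights from a local directory.

    Hypothetical helper for illustration. `checkpoint_dir` is assumed to
    contain the config and weight files produced by the fine-tuning run.
    """
    model = PegasusForConditionalGeneration.from_pretrained(checkpoint_dir)
    # Move to the target device and switch to inference mode.
    return model.to(device).eval()
```

For example, `load_finetuned_pegasus("my_finetuned_pegasus")`, where `my_finetuned_pegasus` stands in for whatever output directory the training script used. The tokenizer would be loaded the same way if it was saved to that directory.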

Note – I am not suggesting you are doing anything wrong. I’m just working with pegasus also, and reading your message makes me think that I’m maybe getting something wrong.

Also, when fine-tuning Pegasus from pegasus-large, I am finding that the memory requirements are so extreme that an Nvidia card with 16 GB of memory is needed just to run a batch size of 1. So at this point I am thinking that my training might run better on the CPU, on a machine with a huge amount of RAM (say 512 GB), since that seems to allow a much bigger batch size, up to 64 or 128.

@karthikgali Did you end up with a decent solution for this issue? Did you try the solution suggested on StackOverflow below?

Here are some leads: