Pegasus Inference for production use case

Hi,

We have fine-tuned the distill-pegasus-cnn-16-4 summarization model on our own data and the results look good. However, when we deploy it for a real-time production use case, it takes a long time on an ml.c5.xlarge CPU instance (around 13 seconds per document in a sequence). We tried a g4dn.xlarge GPU for inference and it takes around 1.7 seconds per document, but GPU inference is a costly affair. Could anyone suggest ideas to make inference faster at lower cost?

I have tried ONNX Runtime and TorchScript, but neither of them supports the Pegasus model yet. Are there any timelines for supporting Pegasus? I have also tried tuning num_beams - the numbers above already reflect that.

Inference code:
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

src_text = [
    """PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]

model_name = "sshleifer/distill-pegasus-cnn-16-4"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and model, moving the model to GPU if one is available
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)

# Tokenize the batch and summarize with beam search
inputs = tokenizer.batch_encode_plus(src_text, truncation=True, padding="longest", return_tensors="pt").to(torch_device)
translated = model.generate(inputs["input_ids"], num_beams=2)
output = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in translated][0]
print(output)

Thanks in advance for your help. Apologies if I am missing something obvious.

Regards,
Karthik

Hi,

Could someone help in this regard?

Thanks in advance,
Karthik

Hi Karthik,
Forgive me if you have looked into this, but have you tried model quantization?

If ONNX is not supported, then your only other bet is to decrease the size of the model or change some parameters like you’ve been trying.
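
For reference, dynamic quantization of the Linear layers is close to a one-liner in PyTorch. This is just a sketch (I have not verified it on this particular checkpoint, and it mainly helps for CPU inference):

import torch
from transformers import PegasusForConditionalGeneration

model = PegasusForConditionalGeneration.from_pretrained("sshleifer/distill-pegasus-cnn-16-4")

# Swap Linear layers for dynamically-quantized int8 versions (CPU only)
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# quantized_model.generate(...) can then be used exactly like the original model;
# smaller weights and int8 matmuls can noticeably cut CPU latency.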

Hi Karthik, one option would be to serve it with our hosted inference API, for example with our startup plan here: https://huggingface.co/pricing. We would take care of the optimizations for you. If you have any questions, feel free to write an email to api-enterprise@huggingface.co.
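
In case it helps, calling the hosted API from Python is just an HTTP request. A minimal sketch (the token below is a placeholder, and in practice you would point it at your own fine-tuned model id):

import requests

API_URL = "https://api-inference.huggingface.co/models/sshleifer/distill-pegasus-cnn-16-4"
headers = {"Authorization": "Bearer YOUR_HF_API_TOKEN"}  # placeholder token

def summarize(text):
    # Summarization models typically return a list with a "summary_text" field
    response = requests.post(API_URL, headers=headers, json={"inputs": text})
    return response.json()

print(summarize("PG&E stated it scheduled the blackouts in response to forecasts for high winds."))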


Thanks for your inputs. I will check them.

@karthikgali Something is confusing about this post for me. You wrote

We have fine-tuned the distill-pegasus-cnn-16-4

But I can see in your code that you are loading

model_name = "sshleifer/distill-pegasus-cnn-16-4"

If you fine-tuned it, wouldn’t you load your fine-tuned model, and not the model that has already been tuned by sshleifer?

Put another way, see here: Pegasus

All the checkpoints are fine-tuned for summarization, besides pegasus-large, whence the other checkpoints are fine-tuned

My understanding from this is that, if we, as the newbie user, have some data we want to use with Pegasus, we should do this:

  1. Start with pegasus-large: google/pegasus-large · Hugging Face
  2. Fine tune it on our own data
  3. Use the pytorch_model.bin output from this fine-tuning process to run inference on our own data.
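
If that's the right workflow, step 3 would presumably look something like this (the local directory name below is hypothetical - just whatever your fine-tuning run saved to):

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Hypothetical output directory produced by the fine-tuning run
# (should contain pytorch_model.bin, config.json and the tokenizer files)
finetuned_dir = "./pegasus-large-finetuned"

tokenizer = PegasusTokenizer.from_pretrained(finetuned_dir)
model = PegasusForConditionalGeneration.from_pretrained(finetuned_dir)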

Note – I am not suggesting you are doing anything wrong. I'm just working with Pegasus also, and reading your message makes me think that maybe I'm getting something wrong.

Also – I am finding, when fine-tuning with pegasus-large, that the memory requirements for even a batch size of 1 are so extreme that an Nvidia card with 16 GB of memory is required just to run that batch size of 1. So at this point I am thinking that my training may run better on the CPU, using a machine with a huge amount of RAM (say 512 GB), since that seems to allow a much bigger batch size, like 64 or 128.

@karthikgali Did you end up with a decent solution for this issue? Did you try the solution suggested on StackOverflow below?

Here are some leads:

thanks