We have fine-tuned the distill-pegasus-cnn-16-4 summarization model on our own data, and the results look good. However, when we try to deploy it for a real-time production use case, inference is very slow on an ml.c5.xlarge CPU instance (around 13 seconds per document, processed sequentially). We also tried a g4dn.xlarge GPU instance, where it takes around 1.7 seconds per document, processed sequentially. GPU inference is expensive for us, though. Does anyone have ideas for making inference faster at a lower cost?
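One way to put numbers on the cost question is a rough per-document estimate. The sketch below uses the two latencies quoted above and assumes the instance is kept busy for the whole billed hour. The hourly prices are made-up placeholders, not real AWS quotes, and the helper names are hypothetical.

```python
def cost_per_document(hourly_price_usd: float, seconds_per_doc: float) -> float:
    """Cost of one document on an instance that is fully utilized."""
    return hourly_price_usd * seconds_per_doc / 3600.0


def break_even_price_ratio(cpu_seconds: float, gpu_seconds: float) -> float:
    """The GPU is cheaper per document while gpu_price / cpu_price stays below this."""
    return cpu_seconds / gpu_seconds


CPU_SECONDS = 13.0  # measured on ml.c5.xlarge (from the post)
GPU_SECONDS = 1.7   # measured on g4dn.xlarge (from the post)

# ASSUMPTION: placeholder hourly prices in USD. Replace with your real rates.
CPU_PRICE = 0.20
GPU_PRICE = 0.75

ratio = break_even_price_ratio(CPU_SECONDS, GPU_SECONDS)
cpu_cost = cost_per_document(CPU_PRICE, CPU_SECONDS)
gpu_cost = cost_per_document(GPU_PRICE, GPU_SECONDS)

print(f"break-even price ratio: {ratio:.2f}")
print(f"CPU cost per document:  ${cpu_cost:.6f}")
print(f"GPU cost per document:  ${gpu_cost:.6f}")
```

With these latencies the GPU is about 7.6 times faster, so it stays cheaper per document as long as its hourly price is less than about 7.6 times the CPU price. That only holds under sustained load; an instance that sits idle most of the hour is still billed for the full hour.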
I have tried ONNX Runtime and TorchScript, but neither of them seems to support the Pegasus model. Is there a timeline for Pegasus support? I tried adjusting num_beams as well; the numbers above were measured after doing that.
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

src_text = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]
model_name = "sshleifer/distill-pegasus-cnn-16-4"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
# Tokenize the batch, padding to the longest document
inputs = tokenizer.batch_encode_plus(src_text, truncation=True, padding="longest", return_tensors="pt").to(torch_device)
# Pass the attention mask so padded positions are ignored when batching
translated = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], num_beams=2)
output = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in translated]
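To compare settings such as num_beams fairly, it can help to time each document separately after a warm-up call and look at the median rather than a single run. The following is a minimal sketch using only the standard library. `time_per_document` and `fake_summarize` are hypothetical names, and the fake function stands in for a wrapper around the tokenize, generate, and decode steps above.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_per_document(
    summarize: Callable[[str], str], documents: List[str], warmup: int = 1
) -> Dict[str, float]:
    """Time summarize() on each document, one at a time, after warm-up calls."""
    for doc in documents[:warmup]:
        summarize(doc)  # warm-up runs are not timed
    latencies = []
    for doc in documents:
        start = time.perf_counter()
        summarize(doc)
        latencies.append(time.perf_counter() - start)
    return {
        "n": len(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


def fake_summarize(doc: str) -> str:
    """Stand-in for the real model call; sleeps to simulate work."""
    time.sleep(0.01)
    return doc[:20]


stats = time_per_document(fake_summarize, ["some document text " * 10] * 5)
print(stats)
```

When timing on a GPU, make sure the wrapper returns only after the result is actually available (decoding the generated ids to text does this), since CUDA work is launched asynchronously.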
Thanks in advance for your help. Apologies if I am missing something obvious.
Use the pytorch_model.bin output from this fine-tuning process to run inference on our own data.
Note: I am not suggesting you are doing anything wrong. I'm also working with Pegasus, and reading your message makes me think that I may be getting something wrong myself.
Also, when fine-tuning Pegasus with pegasus-large, I am finding that the memory requirements for even a batch size of 1 are so extreme that an NVIDIA card with 16GB of memory is required just to run that batch size of 1! So at this point I am thinking that my training might run better on the CPU, using a machine with a huge amount of RAM, like 512GB, as this seems to allow a much bigger batch size, up to 64 or 128.
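A back-of-the-envelope estimate suggests why a 16GB card fills up at batch size 1. The sketch below assumes roughly 568 million parameters for pegasus-large, full fp32 training, and an Adam-style optimizer that keeps two extra values per parameter. All three are assumptions, not measurements, and the function name is hypothetical. Activations are not counted and come on top.

```python
def training_state_gb(
    num_params: int,
    bytes_weights: int = 4,    # fp32 weights
    bytes_grads: int = 4,      # fp32 gradients
    bytes_optimizer: int = 8,  # ASSUMPTION: Adam-style, two fp32 moments per parameter
) -> float:
    """Memory for weights, gradients, and optimizer state, in GB (10**9 bytes)."""
    per_param = bytes_weights + bytes_grads + bytes_optimizer
    return num_params * per_param / 1e9


# ASSUMPTION: approximate parameter count for pegasus-large
PEGASUS_LARGE_PARAMS = 568_000_000

state_gb = training_state_gb(PEGASUS_LARGE_PARAMS)
print(f"weights + gradients + optimizer state: {state_gb:.2f} GB")
```

Under these assumptions about 9 GB is taken before any activations are stored, which leaves limited room on a 16GB card even for a single example. An optimizer with a smaller state would lower that figure.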