Fast CPU Inference On Pegasus-Large Finetuned Model -- Currently Impossible?

lewtun · February 28, 2021, 7:50pm

Hi @the-pale-king, looking at the link inside the SO link it seems that you need to split the Pegasus model into separate encoder / decoder blocks and then apply the graph optimizations from ONNX (their example is for T5, so presumably can be adapted to Pegasus without too much work).

What model did you use for distillation? The choice of student will indeed determine the maximum sequence length you can work with, but with 2,000 tokens I’m not sure what you can use that would be faster than Pegasus

Have you tried dynamic quantization? In PyTorch you can do this with one line of code as follows:

import torch
from torch.quantization import quantize_dynamic

model_ckpt = ...
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = (AutoModelForSequenceClassification
         .from_pretrained(model_ckpt).to("cpu"))

model_quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

which can give you a 2-3x reduction in latency (depends on the hardware, model architecture etc). I’ve never tried it for a seq2seq model, but don’t see why it shouldn’t work “out of the box”

Topic		Replies	Views
Pegasus Inference for production usecase Beginners	6	1564	February 26, 2021
Boost inference speed of T5 models up to 5X & reduce the model size by 3X 🤗Transformers	2	5601	June 8, 2023
T5 inference performance Models	5	1563	March 8, 2022
Fine-Tuning Pegasus - Model Not Training? Models	4	1737	March 14, 2021
Pegasus finetuning, should we always start with pegasus-large? Beginners	5	1671	May 3, 2024

Fast CPU Inference On Pegasus-Large Finetuned Model -- Currently Impossible?

Related topics