Fast CPU Inference On Pegasus-Large Finetuned Model -- Currently Impossible?

Hi @the-pale-king, looking at the link inside the Stack Overflow post, it seems that you need to split the Pegasus model into separate encoder / decoder blocks and then apply the ONNX graph optimizations to each part (their example is for T5, so it can presumably be adapted to Pegasus without too much work). There's a rough sketch of the encoder half below.
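
Just to make the encoder / decoder split concrete, here's a minimal, untested sketch of exporting the encoder with torch.onnx.export (the checkpoint name and output path are placeholders, and the decoder would need the same treatment plus handling of the past key/value cache, as in the T5 example):

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

class EncoderWrapper(torch.nn.Module):
    """Thin wrapper so the exported graph has plain tensor inputs / outputs."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, input_ids, attention_mask):
        # return_dict=False -> tuple output, [0] is the last hidden state
        return self.encoder(input_ids=input_ids,
                            attention_mask=attention_mask,
                            return_dict=False)[0]

model_ckpt = "google/pegasus-large"  # placeholder -- use your finetuned checkpoint
tokenizer = PegasusTokenizer.from_pretrained(model_ckpt)
model = PegasusForConditionalGeneration.from_pretrained(model_ckpt).eval()

sample = tokenizer("A document to summarize.", return_tensors="pt")

torch.onnx.export(
    EncoderWrapper(model.get_encoder()),
    (sample["input_ids"], sample["attention_mask"]),
    "pegasus_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=13,
)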

What model did you use as the student for distillation? The choice of student will indeed determine the maximum sequence length you can work with, but at 2,000 tokens I’m not sure which student you could pick that would be faster than Pegasus :grimacing:

Have you tried dynamic quantization? In PyTorch the quantization step itself is a single line of code:

import torch
from torch import nn
from torch.quantization import quantize_dynamic
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_ckpt = ...
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = (AutoModelForSequenceClassification
         .from_pretrained(model_ckpt).to("cpu"))

# Swap every nn.Linear for a dynamically quantized int8 version
model_quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

which can give you a 2-3x reduction in latency (depending on the hardware, model architecture, etc.). I’ve never tried it for a seq2seq model, but I don’t see why it shouldn’t work “out of the box” :smiley:
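
For what it’s worth, the same call should apply directly to your case; here’s an untested sketch for a seq2seq checkpoint (the model name is a placeholder for your finetuned Pegasus):

import torch
from torch import nn
from torch.quantization import quantize_dynamic
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_ckpt = "path/to/your-finetuned-pegasus"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to("cpu").eval()

# Quantize the linear layers of both encoder and decoder to int8
model_quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

inputs = tokenizer("A long document to summarize ...", return_tensors="pt",
                   truncation=True)
summary_ids = model_quantized.generate(**inputs, max_length=128)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))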