Fast CPU Inference on Pegasus-Large Fine-tuned Model – Currently Impossible?

I have a Pegasus model, fine-tuned from Pegasus Large, which works great, but CPU inference with an input about 2,000 characters long takes between 5 and 12 seconds.

My understanding is that I need either a smaller model, or a “quantized” version of my current model, to speed up CPU inference.

I have two leads, which I want to lay out clearly in this post:

  1. Export to ONNX. As far as I understand, either this is not possible currently for Pegasus, or nobody has publicly documented a successful export. The closest thing I can find is this: Pegasus ONNX format? · Issue #10042 · huggingface/transformers · GitHub – leading to this StackOverflow post: python - how to convert HuggingFace's Seq2seq models to onnx format - Stack Overflow

  2. Re-run my fine-tuning on one of the “distilled” or “student” models, shown here: Hugging Face – On a mission to solve NLP, one commit at a time.
    … I tried this, and found that these models won’t accept my input text because it is too long. I don’t understand the specifics, but conceptually it makes sense to me that a ‘distilled’ or ‘student’ model might be made smaller by reducing the number of tokens it can accept (see the quick check sketched below).
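A minimal way to sanity-check that, assuming the student checkpoint name below is one of the distilled models linked above (both names are just examples), is to compare the number of input positions each config accepts:

from transformers import AutoConfig

# Compare how many input positions each checkpoint accepts.
# Checkpoint names are examples; substitute your own fine-tuned model.
for ckpt in ("google/pegasus-large", "sshleifer/distill-pegasus-xsum-16-4"):
    cfg = AutoConfig.from_pretrained(ckpt)
    print(ckpt, "max_position_embeddings =", cfg.max_position_embeddings)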

Hi @the-pale-king, looking at the link inside the SO post, it seems that you need to split the Pegasus model into separate encoder / decoder blocks and then apply the graph optimizations from ONNX (their example is for T5, so it can presumably be adapted to Pegasus without too much work).
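Something like the following might be a starting point for the encoder half (a rough sketch, untested for Pegasus; the checkpoint name is a placeholder, and the decoder export with past key values is the fiddlier part that the T5 example covers):

import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

ckpt = "google/pegasus-large"  # placeholder; use your fine-tuned checkpoint
tokenizer = PegasusTokenizer.from_pretrained(ckpt)
# return_dict=False so the traced encoder returns plain tuples instead of ModelOutput objects
model = PegasusForConditionalGeneration.from_pretrained(ckpt, return_dict=False).eval()

sample = tokenizer("Example text used only to trace the graph.", return_tensors="pt")

torch.onnx.export(
    model.get_encoder(),                               # encoder block only
    (sample["input_ids"], sample["attention_mask"]),   # example inputs for tracing
    "pegasus_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=12,
)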

What model did you use for distillation? The choice of student will indeed determine the maximum sequence length you can work with, but with 2,000 tokens I’m not sure what you can use that would be faster than Pegasus :grimacing:

Have you tried dynamic quantization? In PyTorch you can do this with one line of code as follows:

import torch
from torch import nn
from torch.quantization import quantize_dynamic
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_ckpt = ...
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = (AutoModelForSequenceClassification
         .from_pretrained(model_ckpt).to("cpu"))

# Quantize only the nn.Linear layers to 8-bit integers
model_quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

which can give you a 2-3x reduction in latency (depending on the hardware, model architecture, etc.). I’ve never tried it for a seq2seq model, but I don’t see why it shouldn’t work “out of the box” :smiley:
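For a Pegasus checkpoint the same idea would presumably look something like this (a sketch I haven’t run; the checkpoint name is a placeholder for your fine-tuned model):

import torch
from torch import nn
from torch.quantization import quantize_dynamic
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_ckpt = "google/pegasus-large"  # placeholder for your fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to("cpu").eval()

# Quantize the linear layers; the model keeps its normal generate() API
model_quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

inputs = tokenizer("Long article text ...", return_tensors="pt", truncation=True)
summary_ids = model_quantized.generate(inputs["input_ids"], num_beams=2)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))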

Thanks, that was super-helpful. Using quantize_dynamic sped up my inference by about 2x.

Also, this is likely because I am a n00b at this, but I had previously not benchmarked the effect of reducing the number of beams.

I just ran some benchmarks with various values for num_beams:

translated = model.generate(
        inputs['input_ids'], num_beams=num_beams, repetition_penalty=2.0)

… And found that inference time drops sharply the fewer beams I use. I also ran another test to compare the resulting output for various beam values, and found that for my model, anything beyond 2 beams is a waste. I would guess this varies widely by model, but in my case, by beam 2 the model has, about 95% of the time, already reached a solution – put another way: the text generated with num_beams=3 is almost always exactly the same as the text generated with num_beams=2.
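A minimal timing loop for this kind of comparison, sketched around the snippet above and assuming tokenizer, model and inputs are already set up, looks something like:

import time

for num_beams in (1, 2, 3, 4):
    start = time.perf_counter()
    translated = model.generate(
        inputs['input_ids'], num_beams=num_beams, repetition_penalty=2.0)
    elapsed = time.perf_counter() - start
    # Print the latency and the decoded text so outputs can be compared across beam widths
    print(f"num_beams={num_beams}: {elapsed:.2f}s")
    print(tokenizer.decode(translated[0], skip_special_tokens=True))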

What I want to try next is running the inference on a computer with more CPUs. I am currently running this on an i9-10980XE CPU @ 3.00GHz, which has 18 cores: Intel® Core™ i9-10980XE Extreme Edition Processor (24.75M Cache, 3.00 GHz) Product Specifications … and I can see in glances (similar to top) that while inference is running, CPU usage sits around 1800%. I am going to try it next on a 36-core CPU and benchmark the difference.
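As an aside, PyTorch exposes its intra-op thread count, so the number of threads used for CPU ops can be read and pinned explicitly, which seems worth checking when comparing machines (a small sketch; the thread count below is just an example):

import torch

print(torch.get_num_threads())   # how many threads PyTorch is currently using for CPU ops
torch.set_num_threads(12)        # pin to a fixed count before calling model.generate()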

Just tried some more benchmarks on various processors – the speed definitely does not scale linearly with the number of CPUs – I actually found that one of my machines with 12x CPUs outperformed one with 20x CPUs, possibly because it is a couple of years newer, or has a higher clock speed. And all the inference times are pretty close – within 250ms of each other, even when comparing the 12x CPU machine with the 36x CPU one. I wonder if for some reason the speed is more related to clock speed (GHz) than to the number of CPUs.
