Fast CPU Inference on Pegasus-Large Fine-tuned Model – Currently Impossible?

I have a Pegasus model, fine-tuned from Pegasus Large, which works great, but CPU inference with an input about 2,000 characters long takes between 5 and 12 seconds.

My understanding is that I need either a smaller model, or a “quantized” version of my current model, to speed up CPU inference.

I have two leads, which I want to lay out clearly in this post:

  1. Export to ONNX. As far as I understand, either this is not possible currently for Pegasus, or nobody has publicly documented a successful export. The closest thing I can find is this: Pegasus ONNX format? · Issue #10042 · huggingface/transformers · GitHub – leading to this StackOverflow post: python - how to convert HuggingFace's Seq2seq models to onnx format - Stack Overflow

  2. Re-run my fine-tuning on one of the “distilled” or “student” models, shown here: Hugging Face – On a mission to solve NLP, one commit at a time.
    … I tried this, and found that these models won’t accept my input text because it is too long. I don’t understand the specifics, but conceptually it makes sense to me that a ‘distilled’ or ‘student’ model might be made smaller by reducing the number of tokens it can accept (see the quick check sketched below).
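A minimal way to sanity-check that, assuming the student checkpoint name below is one of the distilled models linked above (both names are just examples), is to compare the number of input positions each config accepts:

from transformers import AutoConfig

# Compare how many input positions each checkpoint accepts.
# Checkpoint names are examples; substitute your own fine-tuned model.
for ckpt in ("google/pegasus-large", "sshleifer/distill-pegasus-xsum-16-4"):
    cfg = AutoConfig.from_pretrained(ckpt)
    print(ckpt, "max_position_embeddings =", cfg.max_position_embeddings)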

Hi @the-pale-king, looking at the link inside the SO post, it seems that you need to split the Pegasus model into separate encoder / decoder blocks and then apply the graph optimizations from ONNX (their example is for T5, so it can presumably be adapted to Pegasus without too much work).
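Something like the following might be a starting point for the encoder half (a rough sketch, untested for Pegasus; the checkpoint name is a placeholder, and the decoder export with past key values is the fiddlier part that the T5 example covers):

import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

ckpt = "google/pegasus-large"  # placeholder; use your fine-tuned checkpoint
tokenizer = PegasusTokenizer.from_pretrained(ckpt)
# return_dict=False so the traced encoder returns plain tuples instead of ModelOutput objects
model = PegasusForConditionalGeneration.from_pretrained(ckpt, return_dict=False).eval()

sample = tokenizer("Example text used only to trace the graph.", return_tensors="pt")

torch.onnx.export(
    model.get_encoder(),                               # encoder block only
    (sample["input_ids"], sample["attention_mask"]),   # example inputs for tracing
    "pegasus_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=12,
)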

What model did you use for distillation? The choice of student will indeed determine the maximum sequence length you can work with, but with 2,000 tokens I’m not sure what you can use that would be faster than Pegasus :grimacing:

Have you tried dynamic quantization? In PyTorch you can do this with one line of code as follows:

import torch
from torch import nn
from torch.quantization import quantize_dynamic
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_ckpt = ...
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = (AutoModelForSequenceClassification
         .from_pretrained(model_ckpt).to("cpu"))

# Quantize only the nn.Linear layers to 8-bit integers
model_quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

which can give you a 2-3x reduction in latency (depending on the hardware, model architecture, etc.). I’ve never tried it for a seq2seq model, but I don’t see why it shouldn’t work “out of the box” :smiley:
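For a Pegasus checkpoint the same idea would presumably look something like this (a sketch I haven’t run; the checkpoint name is a placeholder for your fine-tuned model):

import torch
from torch import nn
from torch.quantization import quantize_dynamic
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_ckpt = "google/pegasus-large"  # placeholder for your fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to("cpu").eval()

# Quantize the linear layers; the model keeps its normal generate() API
model_quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

inputs = tokenizer("Long article text ...", return_tensors="pt", truncation=True)
summary_ids = model_quantized.generate(inputs["input_ids"], num_beams=2)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))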

Thanks, that was super-helpful. Using quantize_dynamic sped up my inference by about 2x.

Also, this is likely because I am a n00b at this, but I had previously not benchmarked the effect of reducing the number of beams.

I just ran some benchmarks with various values for num_beams:

translated = model.generate(
        inputs['input_ids'], num_beams=num_beams, repetition_penalty=2.0)

… And found that inference time drops sharply the fewer beams I use. I also ran another test to compare the resulting output for various beam values, and found that for my model, anything beyond 2 beams is a waste. I would guess this varies widely by model, but in my case, by beam 2 the model has, about 95% of the time, already reached a solution – put another way: the text generated with num_beams=3 is almost always exactly the same as the text generated with num_beams=2.
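A minimal timing loop for this kind of comparison, sketched around the snippet above and assuming tokenizer, model and inputs are already set up, looks something like:

import time

for num_beams in (1, 2, 3, 4):
    start = time.perf_counter()
    translated = model.generate(
        inputs['input_ids'], num_beams=num_beams, repetition_penalty=2.0)
    elapsed = time.perf_counter() - start
    # Print the latency and the decoded text so outputs can be compared across beam widths
    print(f"num_beams={num_beams}: {elapsed:.2f}s")
    print(tokenizer.decode(translated[0], skip_special_tokens=True))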

What I want to try next is running the inference on a computer with more CPUs. I am currently running this on an i9-10980XE CPU @ 3.00GHz, which has 18 cores: Intel® Core™ i9-10980XE Extreme Edition Processor (24.75M Cache, 3.00 GHz) Product Specifications … and I can see in glances (similar to top) that while inference is running, CPU usage sits around 1800%. I am going to try it next on a 36-core CPU and benchmark the difference.
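As an aside, PyTorch exposes its intra-op thread count, so the number of threads used for CPU ops can be read and pinned explicitly, which seems worth checking when comparing machines (a small sketch; the thread count below is just an example):

import torch

print(torch.get_num_threads())   # how many threads PyTorch is currently using for CPU ops
torch.set_num_threads(12)        # pin to a fixed count before calling model.generate()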

Just tried some more benchmarks on various processors – the speed definitely does not scale linearly with the number of CPUs – I actually found that one of my machines with 12x CPUs outperformed one with 20x CPUs, possibly because it is a couple of years newer, or has a higher clock speed. And all the inference times are pretty close – within 250ms of each other, even when comparing the 12x CPU machine with the 36x CPU one. I wonder if for some reason the speed is more related to clock speed (GHz) than to the number of CPUs.
