I have a Pegasus model, fine-tuned from Pegasus Large, which works great, but CPU inference on an input of about 2,000 characters takes between 5 and 12 seconds.
My understanding is that I need either a smaller model or a “quantized” version of my current model to speed up CPU inference.
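For the quantization route specifically, I believe PyTorch’s dynamic quantization can be applied to the saved model without any retraining. Here is an untested sketch; the checkpoint path is a placeholder for my fine-tuned model:

```python
# Untested sketch: dynamic int8 quantization of a fine-tuned Pegasus
# checkpoint ("./my-pegasus-checkpoint" is a placeholder path).
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

checkpoint = "./my-pegasus-checkpoint"
tokenizer = PegasusTokenizer.from_pretrained(checkpoint)
model = PegasusForConditionalGeneration.from_pretrained(checkpoint)
model.eval()

# Replace the Linear layers with int8 dynamic-quantized versions; this is
# the standard no-retraining recipe for faster transformer CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("some long input text ...", truncation=True, return_tensors="pt")
summary_ids = quantized.generate(**inputs)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

I haven’t benchmarked this, so I don’t know how much of the 5–12 seconds it would recover.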
I have two leads, which I want to lay out clearly in this post:
1. Export to ONNX. As far as I understand, either this is not currently possible for Pegasus, or nobody has publicly documented a successful export. The closest thing I can find is this: Pegasus ONNX format? · Issue #10042 · huggingface/transformers · GitHub, which leads to this StackOverflow post: python - how to convert HuggingFace's Seq2seq models to onnx format - Stack Overflow. (I’ve put a rough sketch of that post’s encoder-export idea at the bottom of this post.)
2. Re-run my fine-tuning on one of the “distilled” or “student” models shown here: Hugging Face – On a mission to solve NLP, one commit at a time.
… I tried this and found that these models won’t accept my input text because it is too long. I don’t understand the specifics, but conceptually it makes sense to me that a “distilled” or “student” model might be made smaller by reducing the number of tokens it can accept.
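In case it’s useful, this is how I’ve been checking the token limit on a candidate student model (using sshleifer/distill-pegasus-xsum-16-4 from the linked page purely as an example, and assuming these are the right fields to read):

```python
# Sketch: inspect how many tokens a candidate student checkpoint accepts.
# "sshleifer/distill-pegasus-xsum-16-4" is one model from the linked page,
# used here only as an example.
from transformers import AutoConfig, AutoTokenizer

name = "sshleifer/distill-pegasus-xsum-16-4"
tokenizer = AutoTokenizer.from_pretrained(name)
config = AutoConfig.from_pretrained(name)

print("tokenizer.model_max_length:", tokenizer.model_max_length)
print("config.max_position_embeddings:", config.max_position_embeddings)

# truncation=True clamps an over-long input to the model's limit, at the
# cost of silently dropping the tail of the document.
ids = tokenizer("my ~2000-character input text ...", truncation=True)["input_ids"]
print("token count after truncation:", len(ids))
```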
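And, going back to lead 1, here is a rough, untested sketch of the encoder-only export that the StackOverflow post describes. The decoder, with its autoregressive generation loop, would still need separate handling, which as far as I can tell is exactly where the public attempts stall:

```python
# Untested sketch: export only the Pegasus encoder to ONNX.
# "./my-pegasus-checkpoint" is a placeholder for my fine-tuned model.
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

checkpoint = "./my-pegasus-checkpoint"
tokenizer = PegasusTokenizer.from_pretrained(checkpoint)
model = PegasusForConditionalGeneration.from_pretrained(checkpoint)
model.eval()
model.config.return_dict = False  # export plain tuples, not ModelOutput objects

inputs = tokenizer("Example document to summarize.", return_tensors="pt")

# Export just the encoder, with dynamic batch and sequence axes.
torch.onnx.export(
    model.model.encoder,
    (inputs["input_ids"], inputs["attention_mask"]),
    "pegasus_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
    },
    opset_version=12,
)
```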