Improve the prediction performance of a Transformers model on CPU

Hi, I am new to Transformers. I am using some of its models for several tasks. One is summarization with the google/pegasus-xsum model: the performance is good on GPU, but on CPU it takes around 16-18 seconds. I have also started using the parrot-paraphrase library, which uses a T5 model in the backend; it too performs well on GPU, but on CPU it takes around 5-8 seconds to produce a result. Due to GPU limitations on my server, I have to optimize for CPU and bring the response time down to 2-4 seconds at most.

Here are the links to the models I am using:
Pegasus-XSUM: google/pegasus-xsum · Hugging Face
Parrot-Paraphrase: prithivida/parrot_paraphraser_on_T5 · Hugging Face

Code for the Pegasus model:

from transformers import PegasusTokenizer, PegasusForConditionalGeneration
from trained_model import ModelFactory
import os

project_root = os.path.dirname(os.path.dirname(__file__))

# Local directory holding the downloaded Pegasus tokenizer/model files
path = os.path.join(project_root, 'models/')

class Summarization:
    def __init__(self):
        # Model is loaded once when the class is instantiated (via ModelFactory)
        # and reused for every request
        self.mod = ModelFactory("summary")
        self.tokenizer = PegasusTokenizer.from_pretrained(path)

    def generate_abstractive_summary(self, text):
        """Generate a summary of the text using the Pegasus model."""
        model = self.mod.get_preferred_model("summarization")
        inputs = self.tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)
        summary_ids = model.generate(inputs['input_ids'])
        summary = [self.tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False)
                   for g in summary_ids]
        return summary[0]

Is there any way to improve the performance? Even a small improvement is appreciated…

I have not used those two specific models before; however, without a small reproducible example of your code it is impossible to provide personalised feedback. You didn’t mention what you’re doing or using the model for, but I’ll assume you’re doing inference (not training or fine-tuning) based on the run times you reported.
Ultimately, doing inference on a CPU will be much slower than on a GPU; there is no way around that (and the gap is even bigger if you compare training on a CPU vs. a GPU). Whether there is still some margin for CPU optimisation in your specific case is something the community can only comment on once you share the code.

The only thing that comes to mind right now (without knowing anything at all about your code) is to initialise the model only once and then run inference on the same instance, i.e. avoid calling something like

trained_model = AutoModel.from_pretrained(...)

at every iteration inside a for loop, as that would be very detrimental. Without seeing the code I can’t say much else.
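To illustrate what I mean, here is a rough sketch that loads the model and tokenizer once and reuses them for every input (the checkpoint name and the documents list are just placeholders):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the tokenizer and model a single time, up front
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")
model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-xsum")
model.eval()

documents = ["first article ...", "second article ..."]

# Reuse the same instances inside the loop; no from_pretrained() calls here
for text in documents:
    inputs = tokenizer([text], max_length=1024, truncation=True, return_tensors="pt")
    summary_ids = model.generate(inputs["input_ids"])
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))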

Hi, I have added the code for summarization. As you suggested, I load the model once when the class is instantiated and then reuse that object whenever I need a result from the model… but the response time is still 14-15 seconds. Is there any way to improve it with multiprocessing, threading, or any other CPU optimization?

One can export the model to ONNX, apply quantization, etc.
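For quantization, one option is PyTorch dynamic quantization, which converts the Linear layers to int8 and typically gives a smaller, faster model on CPU (the exact speed-up and any change in output quality need to be checked on your data). A minimal sketch, loading the public google/pegasus-xsum checkpoint here; swap in your local models/ path as needed:

import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
model.eval()

# Dynamically quantize the Linear layers to int8; no calibration data is needed
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
inputs = tokenizer(["some long article text ..."], max_length=1024,
                   truncation=True, return_tensors="pt")
summary_ids = quantized_model.generate(inputs["input_ids"])
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))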

This thread can help: Fast CPU Inference On Pegasus-Large Finetuned Model -- Currently Impossible? - #4 by the-pale-king
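For the ONNX route, the optimum library can export the checkpoint and run it with ONNX Runtime while keeping the usual generate() API. A rough sketch, assuming optimum[onnxruntime] is installed (the exact export arguments may differ between versions):

from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

# Export the PyTorch checkpoint to ONNX and run it with ONNX Runtime on CPU
model = ORTModelForSeq2SeqLM.from_pretrained("google/pegasus-xsum", export=True)
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")

inputs = tokenizer(["some long article text ..."], max_length=1024,
                   truncation=True, return_tensors="pt")
summary_ids = model.generate(**inputs)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))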