Optimize response time of model output

I am using Parrot Paraphrase, which is built on a transformers model, but result generation is too slow without a GPU. I have no GPU available and can only run inference on CPU. The optimization solutions I have seen all suggest training or fine-tuning the model in some particular way, but this model was already trained by someone else, so I cannot do much there. Recently I improved the performance of another model with PyTorch dynamic quantization, e.g.:

    import torch
    from torch import nn
    from torch.quantization import quantize_dynamic

    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model_quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

Can I use the same approach with the parrot library? The code it uses for model loading is:

    self.tokenizer = AutoTokenizer.from_pretrained(model_tag)
    self.model = AutoModelForSeq2SeqLM.from_pretrained(model_tag)
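For context, my understanding is that `quantize_dynamic` operates on any `nn.Module`, so it should not matter whether the module came from `AutoModelForSequenceClassification` or `AutoModelForSeq2SeqLM`. Here is a minimal self-contained sketch of the pattern, using a toy module as a stand-in for the real Parrot model (the toy module and its layer names are my own, not part of parrot):

    import torch
    from torch import nn
    from torch.quantization import quantize_dynamic

    # Stand-in for the loaded model: any nn.Module containing nn.Linear
    # layers (as transformer models do) can be quantized the same way.
    class TinySeq2Seq(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Linear(16, 16)
            self.decoder = nn.Linear(16, 16)

        def forward(self, x):
            return self.decoder(torch.relu(self.encoder(x)))

    model = TinySeq2Seq()
    # Swap every nn.Linear for a dynamically quantized int8 version.
    model_q = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    out = model_q(torch.randn(2, 16))

So in principle one could quantize the attribute after constructing the parrot object, along the lines of `parrot.model = quantize_dynamic(parrot.model, {nn.Linear}, dtype=torch.qint8)` (assuming the attribute is accessible, as the loading code above suggests).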