Optimize response time of model output

I am using Parrot Paraphrase, which is built on a transformers model, but result generation is too slow without a GPU. I have no GPU available and can only run inference on CPU. The optimization solutions I have seen all suggest training or fine-tuning the model in some particular way, but this model was already trained by someone else, so I cannot do much there. Recently I improved the performance of another model with PyTorch dynamic quantization, e.g.:

    import torch
    from torch import nn
    from torch.quantization import quantize_dynamic

    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model_quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

Can I use the same approach with the parrot library? The code it uses for model loading is:

    self.tokenizer = AutoTokenizer.from_pretrained(model_tag)
    self.model = AutoModelForSeq2SeqLM.from_pretrained(model_tag)
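For context, my understanding is that `quantize_dynamic` operates on any `nn.Module`, so it should not matter whether the module came from `AutoModelForSequenceClassification` or `AutoModelForSeq2SeqLM`. Here is a minimal self-contained sketch of the pattern, using a toy module as a stand-in for the real Parrot model (the toy module and its layer names are my own, not part of parrot):

    import torch
    from torch import nn
    from torch.quantization import quantize_dynamic

    # Stand-in for the loaded model: any nn.Module containing nn.Linear
    # layers (as transformer models do) can be quantized the same way.
    class TinySeq2Seq(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Linear(16, 16)
            self.decoder = nn.Linear(16, 16)

        def forward(self, x):
            return self.decoder(torch.relu(self.encoder(x)))

    model = TinySeq2Seq()
    # Swap every nn.Linear for a dynamically quantized int8 version.
    model_q = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    out = model_q(torch.randn(2, 16))

So in principle one could quantize the attribute after constructing the parrot object, along the lines of `parrot.model = quantize_dynamic(parrot.model, {nn.Linear}, dtype=torch.qint8)` (assuming the attribute is accessible, as the loading code above suggests).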