Hello, I would like to improve the inference time of a fine-tuned T5-base model (translation task). I am currently loading the .bin file with from_pretrained and running on a GPU. I have tried several approaches, such as ONNX and TensorRT. With max_length=1024, these approaches perform worse (sometimes by a lot).…

T5 inference performance

YannAgora March 8, 2022, 1:31pm 2

Hey,

Did you try quantization?

There is an example for the Pegasus model here. I tried it, and it performed pretty well for summarization, cutting inference time by 2x to 3x.
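For readers who have not used quantization before, here is a minimal sketch of one common variant, PyTorch dynamic quantization, which stores the weights of `nn.Linear` layers as int8. It is not the linked Pegasus example; the helper name `quantize_linear_layers` and the toy model are made up for illustration, so the snippet runs without downloading a checkpoint. Note that PyTorch dynamic quantization runs on CPU only, so it would not directly apply to the GPU setup described in the question.

```python
import torch
from torch import nn


def quantize_linear_layers(model: nn.Module) -> nn.Module:
    """Return a copy of `model` whose nn.Linear layers use int8 weights.

    Hypothetical helper for illustration; the original model is left unchanged.
    """
    model.eval()  # quantization here targets inference, not training
    return torch.quantization.quantize_dynamic(
        model,
        {nn.Linear},        # only Linear layers are converted
        dtype=torch.qint8,  # 8-bit integer weights
    )


# Toy stand-in for a real checkpoint (assumption: a fine-tuned T5 would be
# loaded with T5ForConditionalGeneration.from_pretrained(...) instead).
toy_model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
quantized = quantize_linear_layers(toy_model)

# Run a forward pass on CPU to check that the quantized copy still works.
with torch.no_grad():
    output = quantized(torch.randn(2, 64))
print(output.shape)
```

The same helper should accept a Transformers model object, since those are regular `nn.Module` instances, but the speedup and the effect on translation quality would have to be measured on your own data.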
