T5 inference performance

I would like to improve the inference time of a fine-tuned T5-base (translation task). I am currently using the .bin file (from_pretrained) and a GPU. I have tried several approaches, such as ONNX and TensorRT, but with max_length=1024 these approaches perform worse (sometimes by a lot). Are there any techniques that could help? Thank you.

Hey :wave:,

Did you try quantization?

There is an example for the Pegasus model here. I tried it and it performed pretty well for summarization, with a 2x to 3x decrease in inference time.
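For reference, a minimal sketch of PyTorch dynamic int8 quantization. A toy model stands in for the fine-tuned T5 here so the snippet is self-contained; on the real model you would load it with T5ForConditionalGeneration.from_pretrained(...) and pass it to quantize_dynamic the same way:

```python
import torch
import torch.nn as nn

# Toy stand-in for the fine-tuned T5; the quantization call is identical
# for a model loaded with T5ForConditionalGeneration.from_pretrained(...).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
model.eval()

# Replace every nn.Linear with a dynamically quantized version:
# weights are stored as int8, activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 16)
with torch.no_grad():
    y = quantized(x)  # CPU inference using int8 matmuls
print(y.shape)
```

The speedup comes from the int8 linear kernels, which dominate the compute in a transformer.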

Thanks @YannAgora. Does it also give the 2x or 3x speedup when running on GPU?

I haven’t tried on a GPU instance but I don’t see why it wouldn’t work.

@YannAgora I got this error when using the code and adding reconstructed_quantized_model.to("cuda") for GPU inference:

NotImplementedError: Could not run 'quantized::linear_dynamic' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'quantized::linear_dynamic' is only available for these backends: [CPU, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].

I saw here that GPU is not supported for dynamic quantization.
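That matches the error message: the int8 kernels behind quantized::linear_dynamic are only implemented for the CPU backend, so a dynamically quantized model has to stay on CPU. A short sketch (again with a toy Linear standing in for the quantized T5):

```python
import torch
import torch.nn as nn

# Toy stand-in for the dynamically quantized T5.
model = nn.Linear(8, 4).eval()
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(2, 8)
y = qmodel(x)  # works: dynamic quantization runs on CPU
# Moving the model to CUDA and running it is what triggers the
# NotImplementedError above, since the op has no CUDA implementation:
# qmodel.to("cuda"); qmodel(x.to("cuda"))
print(y.shape)
```

So the quantized model gives its speedup relative to CPU inference; it is not a way to accelerate the GPU path.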

Oh ok I didn’t know that :sweat_smile: