I would like to improve the inference time of a fine-tuned T5-base (translation task). I am currently loading the .bin checkpoint with from_pretrained and running on a GPU. I have tried several approaches, such as ONNX and TensorRT, but with max_length=1024 they perform worse (sometimes by a lot). Are there any techniques that could help? Thank you.
Did you try quantization?
There is an example for the Pegasus model here. I tried it and it performed pretty well for summarization, with a 2x–3x decrease in inference time.
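For context, a minimal sketch of PyTorch dynamic quantization, the technique that example uses. This is not the Pegasus code itself; it applies quantize_dynamic to a toy model (the layer sizes here are arbitrary), but the same call works on a loaded T5 model's Linear layers:

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer's Linear-heavy layers.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))

# Dynamic quantization: Linear weights are converted to int8 once,
# activations are quantized on the fly at inference time (CPU only).
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
out = quantized(x)  # int8 matmuls under the hood, runs on CPU
print(out.shape)
```

For a T5 checkpoint the equivalent would be `quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)` on the model returned by from_pretrained, keeping it on CPU.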
Thanks @YannAgora. Does the 2x–3x speedup also hold on GPU?
I haven’t tried it on a GPU instance, but I don’t see why it wouldn’t work.
@YannAgora I got this error when running the code with reconstructed_quantized_model.to("cuda") added for GPU inference:
NotImplementedError: Could not run 'quantized::linear_dynamic' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'quantized::linear_dynamic' is only available for these backends: [CPU, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].
I saw here that quantized inference on GPU is not supported.
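For anyone hitting the same error: as far as I can tell, PyTorch's dynamic-quantization kernels are only implemented for CPU engines (fbgemm on x86, qnnpack on ARM), so there is no CUDA backend for quantized::linear_dynamic. A quick way to check which engines your build supports:

```python
import torch

# The quantized engines shipped with this PyTorch build; all of them
# target CPU -- none dispatch to CUDA, hence the NotImplementedError
# when a dynamically quantized model is moved to "cuda".
print(torch.backends.quantized.supported_engines)

# The engine currently selected for quantized ops.
print(torch.backends.quantized.engine)
```

So the quantized model (and its inputs) have to stay on CPU; the 2x–3x gain is a CPU-vs-CPU comparison.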
Oh OK, I didn’t know that.