When we run a translation service with Facebook's mBART many-to-many model on CPU, it takes about 9 seconds to translate. How can we reduce the inference time further?
Hi @Vimal0703, one idea could be to try quantizing the model’s weights to a lower precision datatype. See e.g. step 2 in this guide: Dynamic Quantization — PyTorch Tutorials 1.7.1 documentation
This usually gives you a 2-3x reduction in both latency and model size.
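To make the idea concrete, here is a minimal pure-Python sketch of what storing weights in a lower-precision datatype means. It is a simplified illustration, not PyTorch's implementation: the function names `quantize_int8` and `dequantize` and the sample weights are made up for this example. The linked tutorial does the real thing on a model with `torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)`.

```python
# Illustrative only: symmetric int8 quantization of a list of float weights.
# quantize_int8 / dequantize are hypothetical helper names, not a library API.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]

# Made-up example weights.
weights = [0.42, -1.30, 0.07, 0.98, -0.55]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is then stored in 8 bits instead of 32, which is where the size reduction comes from. The price is a small rounding error per weight, so it is worth checking translation quality after quantizing.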
Thank you, it worked. After quantizing the model, translation now takes around 1-2 seconds. Is there any way to decrease the time further?
One more option to improve the speed is to use onnx_runtime, but at the moment we don't have any tool/script that will let you import/export the model. I've written a script for exporting T5 to onnx; something similar can be used for MBart as well.
Another option: we have also ported the M2M100 model, which is a SOTA many-to-many translation model. The m2m100_418M checkpoint is smaller than MBart50, so it can give a further speed-up. Here's an example of how to use M2M100: M2M100 — transformers 4.2.0 documentation
Are you able to share this script or the settings for quantizing this model? I would greatly appreciate it, as I'm also going through a similar undertaking!
@valhalla Not to be greedy, but was anyone able to provide a script for this? Otherwise I would like to work on this.