How to improve inference time of the facebook/mbart many-to-many model?

When we run a translation service with the facebook/mbart many-to-many model on CPU, it takes 9 seconds to translate. How can we reduce the inference time further?

Hi @Vimal0703, one idea could be to try quantizing the model’s weights to a lower precision datatype. See e.g. step 2 in this guide: Dynamic Quantization — PyTorch Tutorials 1.7.1 documentation

This usually gives you a 2-3x reduction in latency and model size :slight_smile:
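As a rough sketch of what that step could look like for MBart: the snippet below applies PyTorch dynamic quantization to the `nn.Linear` layers and compares average latency before and after. The checkpoint name `facebook/mbart-large-50-many-to-many-mmt`, the example sentence, the language codes, and the helper names `quantize_for_cpu`, `mean_latency` and `compare_latency` are my assumptions for illustration, not something taken from the guide.

```python
import time


def quantize_for_cpu(model):
    """Return a copy of `model` whose nn.Linear weights are stored as int8."""
    import torch  # imported lazily so the timing helper works without PyTorch

    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )


def mean_latency(fn, repeats=3):
    """Average wall-clock seconds per call of `fn` over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def compare_latency():
    """Download the checkpoint and print fp32 vs. int8 latency (hypothetical helper)."""
    from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

    # Assumed checkpoint name; swap in whichever MBart model you are serving.
    name = "facebook/mbart-large-50-many-to-many-mmt"
    tokenizer = MBart50TokenizerFast.from_pretrained(name)
    model = MBartForConditionalGeneration.from_pretrained(name).eval()
    quantized = quantize_for_cpu(model)

    tokenizer.src_lang = "en_XX"
    batch = tokenizer("The weather is nice today.", return_tensors="pt")
    # The target language code is forced as the first generated token.
    target_id = tokenizer.convert_tokens_to_ids("fr_XX")

    for label, m in (("fp32", model), ("int8", quantized)):
        secs = mean_latency(
            lambda: m.generate(**batch, forced_bos_token_id=target_id)
        )
        print(f"{label}: {secs:.2f} s per translation")
```

Calling `compare_latency()` downloads the model, so expect the first run to be slow. The actual speed-up depends on your CPU and input length, so it is worth measuring on your own sentences.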

Thank you, it worked. We were able to quantize the model, and translation now takes around 1-2 seconds. Is there any way to decrease the time further?

Hi @Vimal0703

One more option to improve the speed is to use ONNX Runtime, but at the moment we don’t have any tool or script that will let you export MBart to ONNX.

I’ve written a script for exporting T5 to ONNX; something similar could be used for MBart as well.

Another option: we have also ported the M2M100 model, which is a SOTA many-to-many translation model. The m2m100_418M checkpoint is smaller than MBart50, so it can give a further speed-up. Here’s an example of how to use M2M100: M2M100 — transformers 4.2.0 documentation
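For reference, a minimal sketch of translating with M2M100 through `transformers` could look like the following. The Hub name `facebook/m2m100_418M`, the helper names `load_m2m100` and `translate`, and the English-to-French example are assumptions for illustration; the tokenizer also needs the `sentencepiece` package installed.

```python
def load_m2m100(name="facebook/m2m100_418M"):
    """Load the model and tokenizer (assumed Hub name; downloads on first use)."""
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained(name)
    model = M2M100ForConditionalGeneration.from_pretrained(name).eval()
    return model, tokenizer


def translate(text, src_lang, tgt_lang, model, tokenizer):
    """Translate one sentence, forcing the target language as the first token."""
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang)
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

Usage would be something like `model, tokenizer = load_m2m100()` followed by `translate("Life is like a box of chocolates.", "en", "fr", model, tokenizer)`. Loading once and reusing the model across requests matters for a service, since loading is much slower than a single translation.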

Are you able to share the script or settings you used for quantizing this model? I would greatly appreciate it, as I’m going through a similar undertaking!

@valhalla Not to be greedy, but was anyone able to provide a script for this? Otherwise, I would like to work on it.