We tried to run a translation service with Facebook's mBART many-to-many model on CPU and it takes about 9 seconds per translation. How do we reduce the inference time?
Hi @Vimal0703, one idea could be to try quantizing the model’s weights to a lower precision datatype. See e.g. step 2 in this guide: Dynamic Quantization — PyTorch Tutorials 1.7.1 documentation
This usually gives you a 2-3x reduction in latency and model size
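For reference, here's a minimal sketch of what that can look like for mBART. The checkpoint name (`facebook/mbart-large-50-many-to-many-mmt`) and the language codes are assumptions on my side; the key step is `torch.quantization.quantize_dynamic` applied to the `nn.Linear` layers:

```python
import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Assumed checkpoint -- swap in whichever mBART many-to-many model you are serving.
name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(name).eval()
tokenizer = MBart50TokenizerFast.from_pretrained(name)

# Quantize the Linear layers' weights to int8; activations stay fp32 and are
# quantized dynamically at runtime, so this only helps on CPU.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Example translation (Hindi -> English; language codes are just for illustration).
tokenizer.src_lang = "hi_IN"
inputs = tokenizer("नमस्ते दुनिया", return_tensors="pt")
generated = quantized_model.generate(
    **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```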
Thank you, that worked. We were able to quantize the model and the translation time is now around 1-2 seconds, but is there any way to decrease it further?
Hi @Vimal0703
One more option to improve the speed is to use onnx_runtime, but at the moment we don't have any tool/script that lets you export MBart to onnx. I've written a script for exporting T5 to onnx; something similar can be used for MBart as well.
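As a rough illustration of the onnxruntime side, here's a minimal sketch that exports just the MBart encoder and runs it with an `InferenceSession`. The checkpoint name, input/output names, and opset are assumptions; a full seq2seq export has to handle the decoder and the generation loop the same way the T5 script does:

```python
import torch
import onnxruntime as ort
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

name = "facebook/mbart-large-50-many-to-many-mmt"  # assumed checkpoint
model = MBartForConditionalGeneration.from_pretrained(name).eval()
tokenizer = MBart50TokenizerFast.from_pretrained(name)

class EncoderWrapper(torch.nn.Module):
    """Return a plain tensor so the ONNX tracer doesn't see a dict output."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, input_ids, attention_mask):
        return self.encoder(
            input_ids=input_ids, attention_mask=attention_mask, return_dict=False
        )[0]

tokenizer.src_lang = "hi_IN"
inputs = tokenizer("नमस्ते दुनिया", return_tensors="pt")

# Export only the encoder graph with dynamic batch/sequence axes.
torch.onnx.export(
    EncoderWrapper(model.get_encoder()),
    (inputs["input_ids"], inputs["attention_mask"]),
    "mbart_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["hidden_states"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "hidden_states": {0: "batch", 1: "seq"},
    },
    opset_version=12,
)

# Run the exported graph with onnxruntime on CPU.
session = ort.InferenceSession("mbart_encoder.onnx")
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"].numpy(),
        "attention_mask": inputs["attention_mask"].numpy(),
    },
)
print(outputs[0].shape)  # (batch, seq, hidden)
```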
Another option: we have also ported the M2M100 model, which is a SOTA many-to-many translation model. The m2m100_418M checkpoint is smaller than MBart50 and can give a further speed-up. Here's an example of how to use it: M2M100
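Usage roughly follows the model card. A minimal sketch (the `facebook/m2m100_418M` checkpoint and the Hindi-to-English direction below are just for illustration):

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Translate Hindi -> English; any pair of the 100 supported languages works.
tokenizer.src_lang = "hi"
encoded = tokenizer("जीवन एक चॉकलेट बॉक्स की तरह है।", return_tensors="pt")
generated = model.generate(
    **encoded, forced_bos_token_id=tokenizer.get_lang_id("en")
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```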
Are you able to share this script or the settings for quantizing this model? I would greatly appreciate it, as I'm also going through a similar undertaking!
@valhalla Not to be greedy, but was anyone able to provide a script for this? Otherwise I would like to work on this.
@Vimal0703 @addressoic Can you please share the script? I have a similar task.