How to perform fast batch inference for NLLB model translation?

Batch inference of NLLB models with different source languages.

I need help running batch inference with NLLB models where the source language can change between examples. Pipeline inference is slow even on GPU: 10k examples in various languages (translated one at a time) took 6 hours, and batch inference of 4.5k examples in a single language took 1.6 hours.

Single GPU - Tesla T4
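
For reference, my current (slow) setup is essentially the stock Transformers translation pipeline, called once per example because the source language changes. The checkpoint and language codes here are just placeholders:

```python
from transformers import pipeline

# Placeholder checkpoint; the language codes below are also examples.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    device=0,  # single Tesla T4
)

# One call per example, since src_lang differs between examples.
result = translator("नमस्ते दुनिया", src_lang="hin_Deva", tgt_lang="eng_Latn")
print(result[0]["translation_text"])
```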

You can have a look at CTranslate2, which is a Python library for fast inference. It can convert NLLB models from Transformers. See an example here.
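
For example, the conversion can be done from Python (the output directory name and float16 quantization below are just suggestions; there is also an equivalent ct2-transformers-converter CLI):

```python
import ctranslate2

# Convert the Hugging Face checkpoint to the CTranslate2 format.
converter = ctranslate2.converters.TransformersConverter(
    "facebook/nllb-200-distilled-600M"
)
converter.convert("nllb-200-distilled-600M-ct2", quantization="float16")
```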

Disclaimer: I’m the author of CTranslate2.

@guillaumekln Thanks for the reply. Awesome repo; I looked around and found it useful for understanding the translation steps. Could you share an example that translates 50/100 examples in one go, so I can time it as well? I tried adapting the example in the repo for batching but couldn't figure it out.

Maybe this other tutorial is more helpful:
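
Until then, here is a minimal sketch of a single translate_batch call where each example carries its own source language. The model directory, language codes, and sentences are assumptions, and the decoding options are just suggestions to try:

```python
import ctranslate2
import transformers

model_dir = "nllb-200-distilled-600M-ct2"  # output of the converter above
checkpoint = "facebook/nllb-200-distilled-600M"

translator = ctranslate2.Translator(model_dir, device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint)

# Each example carries its own source language code.
examples = [
    ("Hello world!", "eng_Latn"),
    ("Bonjour le monde !", "fra_Latn"),
]
tgt_lang = "deu_Latn"

batch = []
for text, src_lang in examples:
    # Setting src_lang makes the NLLB tokenizer insert the right
    # source language token when encoding this example.
    tokenizer.src_lang = src_lang
    batch.append(tokenizer.convert_ids_to_tokens(tokenizer.encode(text)))

# The target language token is passed as a decoding prefix.
target_prefix = [[tgt_lang]] * len(batch)

results = translator.translate_batch(
    batch,
    target_prefix=target_prefix,
    max_batch_size=64,  # pack many examples into each forward pass
    beam_size=1,        # greedy search for maximum speed
)

for result in results:
    output_tokens = result.hypotheses[0][1:]  # drop the language token
    print(tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens)))
```

With all examples going through one translate_batch call, CTranslate2 handles padding and batching internally, which is where most of the speedup over per-example pipeline calls comes from.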