Is Facebook NLLB too slow?

So, I needed an English translation of some texts. First I used googletrans, and this worked fine: it completed about 7-8 translations per second. Then I tried Facebook's distilled 600M NLLB model, but it took about 10 seconds per translation. I ran both on Google Colab. Is this expected, or is something wrong?

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

# Odia -> English translation
translator_nllb = pipeline("translation", model=model, tokenizer=tokenizer, src_lang="ory_Orya", tgt_lang="eng_Latn", max_length=400)

translated_text_nllb = translator_nllb(text)[0]["translation_text"]

I’m seeing the same slowness. Is this expected for a 600M-parameter model? I get much faster responses from other autoregressive models of similar size. Can someone from HF chime in?

I have the same issue. I wonder if this could be solved somehow.

Hi,

You’re not using a GPU, which will indeed make it pretty slow. Try adding the device argument when instantiating the pipeline:

from transformers import pipeline
import torch

# Use the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = pipeline(model="facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="nld_Latn", max_length=400, device=device)

translated_text = pipe("hello world")[0]["translation_text"]
print(translated_text)

In my case, I do have a GPU and pass it just like in your example, but it’s still too slow.

Then I recommend using CTranslate2, a C++ inference engine that supports these models. Inference with NLLB is shown here.

Thanks, it’s 5 times faster.
Don’t forget to set the device:

import ctranslate2
from transformers import AutoTokenizer

translator = ctranslate2.Translator("nllb-200-distilled-600M", device="cuda")  # path to the converted model directory
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang=src_lang)  # tokenizers take no device argument; they run on CPU