Speeding up inference for MarianMT

Inference for a machine translation task using a pretrained model is very slow. Is there a way to speed up MarianMT inference, running the tokenizer and model on an NVIDIA GPU behind a Flask service?

What is the best way to feed input to the pretrained MarianMT model: a single string, a complete paragraph, or tuning the beam search?
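Regardless of the answer, batching several sentences per `generate()` call usually amortizes per-call overhead compared to translating one string at a time. A minimal sketch of a hypothetical helper (`make_batches` is not part of transformers) that splits a paragraph into sentences and groups them into fixed-size batches:

```python
# Hypothetical helper: split a paragraph into sentences, then group them
# into batches so model.generate() sees several sentences per call.
import re

def make_batches(paragraph, batch_size=8):
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', paragraph) if s.strip()]
    return [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]

batches = make_batches("Hola. Qué tal? Adiós.", batch_size=2)
print(batches)  # [['Hola.', 'Qué tal?'], ['Adiós.']]
```

Each inner list can then be tokenized and passed to the model in one call.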

Please find below the code I am using:

import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = MarianTokenizer.from_pretrained(model_name)
print(tokenizer.supported_language_codes)
model = MarianMTModel.from_pretrained(model_name).to(torch_device)
# src_text is a list of source-language sentences
translated = model.generate(**tokenizer.prepare_translation_batch(src_text).to(torch_device))
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
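Two generic PyTorch inference habits that often help here are putting the model in eval mode and disabling gradient tracking (recent versions of `generate()` already disable gradients internally, but `eval()` and `no_grad()` matter for any explicit forward passes). A sketch, using a small `nn.Linear` as a stand-in for the MT model so it runs without downloading weights:

```python
# Sketch of generic PyTorch inference speedups: eval mode + torch.no_grad().
# A tiny nn.Linear stands in for MarianMTModel here.
import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = nn.Linear(8, 8).to(device).eval()  # stand-in for the MT model

x = torch.randn(4, 8, device=device)
with torch.no_grad():  # no autograd graph: less memory, faster forward pass
    out = model(x)

print(out.requires_grad)  # False: no gradient bookkeeping was done
```

The same `with torch.no_grad():` wrapper can surround the `model.generate(...)` call above.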

Hi,
Were you able to speed up the inference?

Hi,

You can consider using the CTranslate2 library, which can convert and run MarianMT models efficiently (up to 6x faster than Transformers on an NVIDIA Tesla T4). See a usage example here.
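For reference, a sketch of the workflow based on CTranslate2's Transformers integration (the output directory name is an assumption; the model must be converted once before use, and this requires the `ctranslate2` package and a GPU or CPU build):

```python
# Sketch of CTranslate2 usage for a MarianMT model. Convert once beforehand:
#   pip install ctranslate2
#   ct2-transformers-converter --model Helsinki-NLP/opus-mt-ROMANCE-en \
#       --output_dir opus-mt-ROMANCE-en-ct2
import ctranslate2
import transformers

translator = ctranslate2.Translator("opus-mt-ROMANCE-en-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ROMANCE-en")

# CTranslate2 works on token strings rather than raw text.
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hola, ¿cómo estás?"))
results = translator.translate_batch([source])
target = results[0].hypotheses[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
```

`translate_batch` accepts a list of tokenized sentences, so several inputs can be translated in one call.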

Disclaimer: I’m the author of CTranslate2.


Hi @guillaumekln,
I see that in your example for BART you pass a tuple. Is there an option for the batch size, or a parameter that controls the number of sentences/tokens passed per call?