Speeding up inference for MarianMT

Inference for a machine translation task using a pretrained MarianMT model is very slow. Is there a way to speed it up when the tokenizer and model run on an NVIDIA GPU behind a Flask service?

What is the best way to feed input to the pretrained MarianMT model: a single string, a complete paragraph, or beam search over batched sentences?

Please find below the code I am using:

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-ROMANCE-en"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).to(torch_device)

# src_text is a list of source-language sentences.
batch = tokenizer(src_text, return_tensors="pt", padding=True).to(torch_device)
translated = model.generate(**batch)
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
```
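A few generic speedups that apply to a snippet like the one above: batch sentences together with padding, wrap generation in `torch.no_grad()`, and lower the beam width. A minimal sketch (the batch size and `num_beams` values are illustrative, and `chunk`/`translate` are hypothetical helper names, not part of any library):

```python
import torch

def chunk(items, size):
    # Split a list of sentences into fixed-size batches.
    return [items[i:i + size] for i in range(0, len(items), size)]

def translate(src_text, model, tokenizer, device, batch_size=16, num_beams=2):
    # Batch sentences, pad them together, and disable autograd for inference.
    model.eval()
    out = []
    with torch.no_grad():
        for batch in chunk(src_text, batch_size):
            inputs = tokenizer(batch, return_tensors="pt", padding=True).to(device)
            generated = model.generate(**inputs, num_beams=num_beams)
            out.extend(tokenizer.decode(t, skip_special_tokens=True) for t in generated)
    return out
```

Smaller batches waste GPU parallelism, while very large batches pad every sentence to the length of the longest one in the batch, so some tuning of `batch_size` is usually needed.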

Were you able to speed up the inference?


You can consider using the CTranslate2 library, which can convert and run MarianMT models efficiently (up to 6x faster than Transformers on an NVIDIA Tesla T4). See a usage example here.
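For reference, a minimal sketch of that workflow (assuming the `ct2-transformers-converter` CLI and the `translate_batch` API; the output directory name and the `beam_size`/`max_batch_size` values are illustrative, and `translate_with_ct2` is a hypothetical wrapper):

```python
def translate_with_ct2(src_text,
                       model_dir="opus-mt-romance-en-ct2",
                       tokenizer_name="Helsinki-NLP/opus-mt-ROMANCE-en"):
    # Convert the checkpoint once beforehand, from the shell:
    #   ct2-transformers-converter --model Helsinki-NLP/opus-mt-ROMANCE-en \
    #       --output_dir opus-mt-romance-en-ct2
    import ctranslate2
    import transformers

    translator = ctranslate2.Translator(model_dir, device="cuda")
    tokenizer = transformers.MarianTokenizer.from_pretrained(tokenizer_name)

    # CTranslate2 consumes lists of token strings, not token ids.
    source = [tokenizer.convert_ids_to_tokens(tokenizer.encode(s)) for s in src_text]
    results = translator.translate_batch(source, beam_size=2, max_batch_size=32)
    return [
        tokenizer.decode(tokenizer.convert_tokens_to_ids(r.hypotheses[0]),
                         skip_special_tokens=True)
        for r in results
    ]
```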

Disclaimer: I’m the author of CTranslate2.


Hi @guillaumekln,
I see that in your example for BART, you pass a tuple. Is there an option to control the batch size? Or is there a parameter for the number of sentences/tokens that can be passed at once?