Inference for a machine translation task using a pretrained model is very slow. Is there a way to speed up inference with MarianMT, running the tokenizer and model on an NVIDIA GPU, integrated with a Flask service?
What is the best way to feed input to the pretrained MarianMT model: a single string, a complete paragraph, or sentences decoded with beam search?
Please find below the code I am using:
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = MarianTokenizer.from_pretrained(model_name)
print(tokenizer.supported_language_codes)
model = MarianMTModel.from_pretrained(model_name).to(torch_device)
# src_text is a list of source-language strings
translated = model.generate(**tokenizer.prepare_translation_batch(src_text).to(torch_device))
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
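A minimal sketch of the setup described above (batching sentences on the GPU inside a Flask route, with beam search via num_beams); the endpoint name, request format, and beam width are assumptions, not part of the original post:

import torch
from flask import Flask, jsonify, request
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).to(torch_device).eval()

app = Flask(__name__)

@app.route('/translate', methods=['POST'])  # hypothetical endpoint
def translate():
    # Expect a JSON body like {"sentences": ["...", "..."]}; splitting a
    # paragraph into sentences beforehand keeps sequences short and padding small.
    sentences = request.get_json()['sentences']
    with torch.no_grad():  # no gradients needed at inference time
        batch = tokenizer(sentences, return_tensors='pt', padding=True).to(torch_device)
        generated = model.generate(**batch, num_beams=4)  # beam search; fewer beams = faster
    translations = [tokenizer.decode(t, skip_special_tokens=True) for t in generated]
    return jsonify({'translations': translations})

Batching several sentences into one padded call to generate, keeping the model in eval mode, and wrapping inference in torch.no_grad() generally help more than switching between a single string and a full paragraph.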
Hi,
Were you able to speed up the inference?
Hi,
You can consider using the CTranslate2 library, which can convert and run MarianMT models efficiently (up to 6x faster than Transformers on an NVIDIA Tesla T4). See a usage example here.
Disclaimer: I’m the author of CTranslate2.
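For reference, a minimal sketch of the convert-then-translate flow, based on the public CTranslate2 documentation; the output directory name and the example sentence are placeholders:

# One-time conversion of the Hugging Face checkpoint (shell command):
#   ct2-transformers-converter --model Helsinki-NLP/opus-mt-ROMANCE-en --output_dir opus-mt-romance-en-ct2

import ctranslate2
import transformers

# Load the converted model on the GPU and keep the original tokenizer for SentencePiece.
translator = ctranslate2.Translator('opus-mt-romance-en-ct2', device='cuda')
tokenizer = transformers.MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-ROMANCE-en')

# CTranslate2 expects token strings rather than token ids.
source = tokenizer.convert_ids_to_tokens(tokenizer.encode('Hola, ¿cómo estás?'))
results = translator.translate_batch([source])
target = results[0].hypotheses[0]

print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))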
Hi @guillaumekln,
I see that in your example for Bart, you pass a tuple. Is there an option to control its size, or is there a parameter for the number of sentences/tokens that can be passed at once?
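In case it helps while waiting for an answer: according to the CTranslate2 documentation, translate_batch exposes batching controls directly. A hedged sketch with illustrative values, where token_batches stands in for a list of token lists:

import ctranslate2

translator = ctranslate2.Translator('opus-mt-romance-en-ct2', device='cuda')  # placeholder path
results = translator.translate_batch(
    token_batches,          # one token list per sentence
    max_batch_size=32,      # split the input into sub-batches of at most 32 examples
    batch_type='examples',  # or 'tokens' to cap sub-batches by token count instead
    beam_size=4,            # beam width used during decoding
)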
Hello.
I tried to convert a fine-tuned MarianMT Transformers model with CTranslate2 and, while the conversion works fine, the output is markedly different from the Transformers pipeline. I wonder why, since in both cases all the information is within the model folder and no additional parameters are supplied to either.
Any hints? Perhaps there are default MarianMT Transformers configuration parameters that are not explicitly listed in the config.json file?
Thank you.
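One possible explanation, offered as an assumption rather than a confirmed answer: the decoding defaults differ between the two libraries (Transformers' generate picks up num_beams, max_length, etc. from the model configuration, while CTranslate2's translate_batch has its own defaults), so forcing the same decoding settings on both sides is a quick way to check whether the difference comes from decoding rather than from the converted weights. A sketch, reusing model, batch, translator, and source from the earlier snippets:

# Transformers side: make the decoding settings explicit.
generated = model.generate(
    **batch,
    num_beams=4,
    length_penalty=1.0,
    max_length=512,
)

# CTranslate2 side: use the matching options of translate_batch.
results = translator.translate_batch(
    [source],
    beam_size=4,
    length_penalty=1.0,
    max_decoding_length=512,
)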