Hello, I want to use MBart for translation tasks, but the translation is too slow: it takes over a minute for 2000 characters, when I would need a translation in a few seconds. Is there any way to increase the speed of this model? Threading, sharding, etc.?
my current code:
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
import time

article = "Hello world."

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

time1 = time.time()
tokenizer.src_lang = "en_XX"
encoded_hi = tokenizer(article, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
print(time.time() - time1)
For 2 words, the translation takes 2.5 seconds. I can't use a GPU.
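For context, the CPU-only speedups I've read about are pinning the intra-op thread count and dynamic int8 quantization. A minimal sketch of what I mean, assuming plain PyTorch on CPU (`torch.set_num_threads` and `torch.quantization.quantize_dynamic` are the real PyTorch APIs; the tiny `nn.Sequential` here is just a stand-in for MBart, which is mostly `Linear` layers):

```python
import torch
import torch.nn as nn

# Toy stand-in for the MBart model; quantize_dynamic works the same way
# on any nn.Module containing nn.Linear layers.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))

# Control intra-op parallelism (threading) explicitly on CPU.
torch.set_num_threads(4)

# Convert Linear weights to int8; activations are quantized on the fly at
# inference time. This usually shrinks the model and speeds up CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 64)
out = quantized(x)
print(out.shape)  # torch.Size([1, 8])
```

Would applying this to the MBart checkpoint above be the right direction, or is there a better approach?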