I am currently working with the mBART-50 many-to-one model for translation. The model takes a very long time to generate translations. Is this normal, and how can I optimize it?
I tried it on both CPU and GPU, but it is slow either way:
from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
Inference time in seconds for model.generate(**input, max_length=max_length), where input is a tokenized string with 1024 tokens:

| max_length | 8 CPUs | 1 GPU |
It takes this long for just a single string, and batching the inputs does not make it any faster. Any idea what's wrong, or how I can optimize this?
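In case it matters, this is roughly how I measure the times above. The translate stand-in below is a dummy placeholder for the actual model.generate call, just so the measurement snippet runs on its own:

```python
import time

def time_generate(generate_fn, n_runs=3):
    """Call generate_fn n_runs times and return the average wall-clock seconds."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate_fn()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# Dummy stand-in for: model.generate(**input, max_length=max_length)
def translate():
    time.sleep(0.01)  # pretend the model is working

avg = time_generate(translate)
print(f"average generation time: {avg:.3f} s")
```

In the real measurement I pass a closure over model.generate instead of the dummy, so the tokenization cost is excluded and only generation is timed.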
Thank you!