MBART50 .generate() is very slow


I am currently working with the MBART50 many-to-one model for translation. The model takes a really long time to generate a translation. Is this normal? How can I optimize it?

I tried on both CPU and GPU, but it remains slow in both cases:

from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")

Inference times in seconds for `model.generate(**inputs, max_length=max_length)`, where `inputs` is a tokenized string of 1024 tokens:

| max_length | 8 CPUs | 1 GPU |
|-----------:|-------:|------:|
| 200        | ~38 s  | ~4 s  |
| 512        | ~105 s | ~11 s |
| 750        | ~160 s | ~16 s |
| 1024       | ~237 s | ~22 s |

It takes this long for a single string… and batching does not make it faster :confused:. Any idea what's wrong, or how to optimize it?

Thank you !