Hello,
I am currently working with the MBART50 many-to-one model for translation. The model takes a very long time to generate a translation. Is this normal, and how can I optimize it?
I tried on both CPU and GPU, but both remain slow:
```python
from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
```
Inference time in seconds for `model.generate(**input, max_length=max_length)`, where `input` is a tokenized string of 1024 tokens:
| max_length | 8 CPUs | 1 GPU |
|---|---|---|
| 200 | ~38s | ~4s |
| 512 | ~105s | ~11s |
| 750 | ~160s | ~16s |
| 1024 | ~237s | ~22s |
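For reference, the timings above come from code along these lines (a minimal sketch; the tokenizer class, `src_lang`, and the source text are assumptions, since I only pasted the model line above):

```python
import time

import torch
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

# Assumptions not shown in my snippet above: the tokenizer class and
# src_lang (adjust to the actual source language), and the source text.
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-one-mmt", src_lang="fr_XX"
)
model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-one-mmt"
)
model.eval()

text = "..."  # placeholder: a source string that tokenizes to ~1024 tokens
input = tokenizer(text, return_tensors="pt")

max_length = 512  # varied from 200 to 1024 in the table above
start = time.time()
with torch.no_grad():  # no gradients needed for inference
    output = model.generate(**input, max_length=max_length)
print(f"generate: {time.time() - start:.1f}s")
print(tokenizer.batch_decode(output, skip_special_tokens=True))
```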
It takes this long for just one string… Batching does not make it faster either. Any idea what's wrong, or how to optimize?
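For completeness, the batched version looks roughly like this (again a sketch, reusing `tokenizer` and `model` from the snippet above):

```python
# Batched variant: several source strings, padded to the longest in the batch.
texts = ["...", "...", "..."]  # placeholders for the actual source strings
batch = tokenizer(texts, return_tensors="pt", padding=True)

start = time.time()
with torch.no_grad():
    outputs = model.generate(**batch, max_length=512)
print(f"batched generate: {time.time() - start:.1f}s")
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```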
Thank you!