mBART for translation is truncating the result

I am using mBART (specifically “mrm8488/mbart-large-finetuned-opus-es-en-translation”) for translation and the model seems to be truncating the output. Below is the code and the result. Has anyone used this model successfully? Can you see an error in my code? Any suggestions on how I might get a better translation with this model?

<< Original Text: "Esta investigación presenta un análisis del gasto público federal asignado a los hogares en condición de marginación con enfoque asistencial. Se sustenta en que éste debe orientarse a facilitar la inversión y el impulso a los procesos de trabajo productivo, generadores de crecimiento y empleo. Para esto se presenta una propuesta de evaluación cuantitativa basada en el modelo de contabilidad social que formula el Sistema de Cuentas Nacionales, en su revisión de 1993 y actualizada con la misma perspectiva en 2008. Los resultados se analizan con el modelo de multiplicador keynesiano.">>

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name1 = "mrm8488/mbart-large-finetuned-opus-es-en-translation"
tokenizer1 = AutoTokenizer.from_pretrained(model_name1)
model1 = AutoModelForSeq2SeqLM.from_pretrained(model_name1)

input_ids1 = tokenizer1(text, return_tensors="pt").input_ids
outputs1 = model1.generate(input_ids1, num_return_sequences=4, num_beams=6, do_sample=True, early_stopping=True)
print(tokenizer1.decode(outputs1[0]))

<s> This Research presents a three-year review of the federal public expenditure model, the same as the nationally-allotment model,</s>

Hi @Buckeyes2019, looking at the docs for MBart, it seems that you need to prepare the data in a special format and define decoder_start_token_id.

I’m not sure whether that will solve your truncation issue (can you set max_length in model.generate to a larger value?). Also, looking at your example, the model seems to be “summarising” the input text instead of translating it, which seems odd to me.
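
For the truncation part, here’s a minimal sketch of what I mean (512 is an arbitrary cap, not a value from your setup):

outputs1 = model1.generate(
    input_ids1,
    num_beams=6,
    early_stopping=True,
    max_length=512,  # the default max_length can cut long translations short
)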

max_length cannot be set to a number greater than 32, which is the batch length

Hi @infinitejoy,

As @lewtun said, MBart needs the data in a special format. You should pass src_lang when initializing the tokenizer or set the tokenizer.src_lang attribute, so the tokenizer adds the correct language token to the encoded text. MBart also expects the target language id as the decoder_start_token_id, so you need to pass that argument to generate. Here’s a simple code snippet:

tokenizer = MBartTokenizer.from_pretrained(..., src_lang="es_XX")
model.generate(..., decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"])
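
Putting it together for your checkpoint, a minimal sketch (assuming text holds the Spanish paragraph above; max_length=512 is just an example value):

from transformers import MBartTokenizer, AutoModelForSeq2SeqLM

model_name = "mrm8488/mbart-large-finetuned-opus-es-en-translation"
tokenizer = MBartTokenizer.from_pretrained(model_name, src_lang="es_XX")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],  # start decoding with the target language token
    num_beams=6,
    max_length=512,  # raise the generation cap so long inputs aren't truncated
    early_stopping=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))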

Also, I’m not sure what you mean by

max_length cannot be set to a number greater than 32, which is the batch length

You can set max_length in generate to any value you want.

Also, there are other models on the Hub that can do es to en translation, like mBART-50 or M2M100.
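
For example, with M2M100 (a minimal sketch using the facebook/m2m100_418M checkpoint, again with text as the Spanish paragraph):

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "es"  # Spanish source
encoded = tokenizer(text, return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("en"),  # force English output
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])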
