mBART for translation is truncating the result

I am using mBART (specifically “mrm8488/mbart-large-finetuned-opus-es-en-translation”) for translation and the model seems to be truncating the output. Below is the code and the result. Has anyone used this model successfully? Can you see an error in my code? Any suggestions on how I might get a better translation with this model?

<< Original Text: "Esta investigación presenta un análisis del gasto público federal asignado a los hogares en condición de marginación con enfoque asistencial. Se sustenta en que éste debe orientarse a facilitar la inversión y el impulso a los procesos de trabajo productivo, generadores de crecimiento y empleo. Para esto se presenta una propuesta de evaluación cuantitativa basada en el modelo de contabilidad social que formula el Sistema de Cuentas Nacionales, en su revisión de 1993 y actualizada con la misma perspectiva en 2008. Los resultados se analizan con el modelo de multiplicador keynesiano.">>

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name1 = "mrm8488/mbart-large-finetuned-opus-es-en-translation"
tokenizer1 = AutoTokenizer.from_pretrained(model_name1)
model1 = AutoModelForSeq2SeqLM.from_pretrained(model_name1)

input_ids1 = tokenizer1(text, return_tensors="pt").input_ids
outputs1 = model1.generate(input_ids1, num_return_sequences=4, num_beams=6, do_sample=True, early_stopping=True)
print(tokenizer1.decode(outputs1[0]))

<s> This Research presents a three-year review of the federal public expenditure model, the same as the nationally-allotment model,</s>

Hi @Buckeyes2019, looking at the docs for MBart, it seems that you need to prepare the data in a special format and define decoder_start_token_id.

I’m not sure whether that will solve your truncation issue (can you set max_length in model.generate to a larger value?). Also, looking at your example, the model seems to be “summarising” the input text instead of translating it, which seems odd to me.
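
For the truncation part, here’s a minimal sketch of what I mean (512 is an arbitrary cap, not a value from your setup):

outputs1 = model1.generate(
    input_ids1,
    num_beams=6,
    early_stopping=True,
    max_length=512,  # the default max_length can cut long translations short
)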

max_length cannot be set to a number greater than 32, which is the batch length

Hi @infinitejoy,

As @lewtun said, MBart needs the data in a special format. You should pass src_lang when initializing the tokenizer or set the tokenizer.src_lang attribute, so the tokenizer adds the correct language token to the encoded text. MBart also expects the target language id as the decoder_start_token_id, so you need to pass that argument to generate. Here’s a simple code snippet:

tokenizer = MBartTokenizer.from_pretrained(..., src_lang="es_XX")
model.generate(..., decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"])
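
Putting it together for your checkpoint, a minimal sketch (assuming text holds the Spanish paragraph above; max_length=512 is just an example value):

from transformers import MBartTokenizer, AutoModelForSeq2SeqLM

model_name = "mrm8488/mbart-large-finetuned-opus-es-en-translation"
tokenizer = MBartTokenizer.from_pretrained(model_name, src_lang="es_XX")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],  # start decoding with the target language token
    num_beams=6,
    max_length=512,  # raise the generation cap so long inputs aren't truncated
    early_stopping=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))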

Also, I’m not sure what you mean by

max_length cannot be set to a number greater than 32, which is the batch length

You can set max_length in generate to any value you want.

Also, there are other models on the Hub that can do es to en translation, like mBART-50 or M2M100.
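
For example, with M2M100 (a minimal sketch using the facebook/m2m100_418M checkpoint, again with text as the Spanish paragraph):

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "es"  # Spanish source
encoded = tokenizer(text, return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("en"),  # force English output
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])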
