Help with finetuning mBART on an unseen language

Hi everyone,
I wanted to know how to finetune mBART on a summarization task in a language other than English. Also, how can we finetune mBART on a translation task where one of the languages is not in the list of language codes that mBART was trained on?
Appreciate any help!! Thank you.

Hi @LaibaMehnaz

DISCLAIMER: I haven’t tried this myself, and as Sam found in his experiments, mBART doesn’t always give good results.

For mBART, the source sequence ends with the source lang id and the target sequence starts with the target lang id. So for summarization you can pass the same lang id as both the source and the target language, and then finetune it the same way you would finetune any other seq2seq model.
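
As a rough illustration of that layout, here is a minimal sketch with toy token ids. The ids, the `LANG_IDS` table and both helper names are made up for illustration; they are not real mBART vocabulary values or library functions. It assumes each sequence is stored as the text tokens followed by `</s>` and the language id, and that the decoder input starting with the target lang id is derived later from the labels.

```python
# Toy illustration only: these ids are invented, not real mBART vocabulary ids.
EOS_ID = 2
LANG_IDS = {"en_XX": 9001, "hi_IN": 9002}  # hypothetical lookup table


def add_lang_id(token_ids, lang):
    """Return the tokens followed by </s> and the language id."""
    return list(token_ids) + [EOS_ID, LANG_IDS[lang]]


def build_summarization_pair(article_ids, summary_ids, lang):
    """Build one training example using the same language code on both sides."""
    return {
        "input_ids": add_lang_id(article_ids, lang),
        "labels": add_lang_id(summary_ids, lang),
    }


pair = build_summarization_pair([10, 11], [12], "hi_IN")
print(pair)
```

For translation, the only change in this sketch would be passing different language codes for the source and the target.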

For translation, if the language is not present, you can try without using any lang id in the sequences.

Hi @valhalla,
I did try using mBart without any lang id, but it gives me this error:

self.cur_lang_code = self.lang_code_to_id[src_lang]
KeyError: ''

Also, when I use the same language code on both sides, the generations are in a totally different script.

For this, you’ll need to tokenize the input and output sequences without using the prepare_seq2seq_batch method, or override prepare_seq2seq_batch and modify it so that it does not use the lang id.
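
One way to do the first option might look like the sketch below. It is untested against mBART itself and rests on an assumption: that calling the tokenizer with `add_special_tokens=False` skips the `</s>` and lang id suffix, so only `</s>` needs to be added back. The helper name `encode_without_lang_id` is hypothetical, and padding and attention masks are left out.

```python
def encode_without_lang_id(tokenizer, texts, max_length=512):
    """Tokenize each text without special tokens, then append only </s>.

    `tokenizer` is expected to behave like a Hugging Face tokenizer:
    callable with `add_special_tokens`, and exposing `eos_token_id`.
    """
    batch = []
    for text in texts:
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        ids = list(ids)[: max_length - 1]  # leave room for </s>
        batch.append(ids + [tokenizer.eos_token_id])
    return batch


# Possible usage (assumes the checkpoint can be downloaded):
# from transformers import MBartTokenizer
# tok = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
# input_ids = encode_without_lang_id(tok, ["some source text"])
# labels = encode_without_lang_id(tok, ["some target text"])
```

The same helper is applied to both sides, so neither the inputs nor the labels carry a language code.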

I modified the tokenizer to not use the lang id as you suggested, but I still get the same problem. ROUGE is 0.0, as the generations are in another script.

Also, could this be because I am using tiny-mbart and not mbart-large-cc25? I was trying out tiny-mbart due to memory constraints.

Hi @LaibaMehnaz
tiny-mbart is just meant for testing; it’s a randomly initialised model.

You could try to create a smaller student using the make_student.py script.

Oh, thanks a lot. I will proceed this way and let you know. Thanks again:)

Also, how many encoder and decoder layers would you suggest?

Hard to say, it depends on the problem. But you could start with the same number of encoder layers as the full model and 6 decoder layers; distilbart-12-6 performs really well on summarization.

Alright, thank you so much.

Hi, I am also interested in this topic and I am trying to add mBART functionality to another library, but I have encountered a strange error: https://huggingface.co/transformers/model_doc/mbart.html states that prepare_seq2seq_batch should give me a dict with these keys: [input_ids, attention_mask, decoder_input_ids, decoder_attention_mask], but it actually gives me [input_ids, attention_mask, labels]. I am a bit confused :)
Is it a bug or am I doing something wrong?

Hi @Zhylkaaa

The doc is incorrect. prepare_seq2seq_batch returns [input_ids, attention_mask, labels], and that’s not a bug.

cc. @sshleifer

Hi @valhalla
thanks for your response, but how am I supposed to create the decoder inputs? There is a difference in the lang_id position.
Should I use something like:
[lang_id] + prepare_seq2seq_batch(decoder_input)['input_ids'][:-1] + padding if required?
Or should I just modify prepare_seq2seq_batch throwing away lang_id for summarisation task? (I am not sure about this modification because my intuition tells me that lang_id is something like language conditioned [CLS] token, or my intuition is wrong again😁?)
Thanks!

I have read that I can tag @sshleifer for summarisation and BART problems/questions. Sorry if I am wrong.

You can keep the lang id for summarisation; just pass the same lang id as both src_lang and tgt_lang to the prepare_seq2seq_batch method.

Thank you @valhalla,
but what about decoder_input_ids? I don’t receive this value when I use prepare_seq2seq_batch.

finetune.py and finetune_trainer.py will make the right decoder_input_ids; you won’t need to pass them :)

Thanks, I’ve been digging through the source code and found that the forward method actually generates decoder_input_ids from labels through shift_tokens_right. Thank you for the help, and sorry for being annoying, I should have checked the source code first :)
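
For readers wondering what that shift does, here is a minimal pure-Python sketch of the idea, not the library’s code (which works on tensors). It assumes the labels are right-padded and end with `</s>` followed by the lang id; the token ids are toy values. The last non-pad token of each row is wrapped around to the front, so the decoder input starts with the lang id.

```python
def shift_tokens_right(labels, pad_token_id):
    """Shift each row one position right, wrapping its last non-pad token to the front."""
    shifted = []
    for row in labels:
        n_real = sum(1 for tok in row if tok != pad_token_id)
        last_real = row[n_real - 1]  # the lang id under the stated assumption
        shifted.append([last_real] + list(row[:-1]))
    return shifted


PAD_ID = 1
labels = [
    [12, 13, 2, 9002],  # tokens, </s>, lang id (toy ids)
    [14, 2, 9002, PAD_ID],  # shorter row, right-padded
]
decoder_input_ids = shift_tokens_right(labels, PAD_ID)
print(decoder_input_ids)
```

This is why the manual `[lang_id] + ...[:-1]` construction asked about earlier is not needed when labels are passed to the model.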
