Help with finetuning mBART on an unseen language

LaibaMehnaz · October 13, 2020, 3:25pm

Hi everyone,
I wanted to know how to finetune mBART on a summarization task in a language other than English. Also, how can we finetune mBART on a translation task where one of the languages is not in the list of language codes that mBART was trained on?
Appreciate any help!! Thank you.

valhalla · October 13, 2020, 3:57pm

Hi @LaibaMehnaz

DISCLAIMER: I haven’t tried this myself, and as Sam found in his experiments, mBART doesn’t always give good results.

For mBART, the source seq ends with the src lang id and the tgt seq starts with the tgt lang id. So for summarization you can pass the same lang id for both source and tgt lang, and then finetune it the same way you finetune any other seq2seq model.

For translation, if the lang is not present, you can try without using any lang id in the sequences.
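
The layout described above can be sketched with toy token ids instead of a real tokenizer. The ids and the helper names `build_source` and `build_decoder_input` are hypothetical, chosen only for illustration; a real mBART tokenizer assigns its own ids.

```python
# Toy ids for illustration only (assumptions, not real mBART vocabulary ids).
EOS = 2
LANG_IDS = {"en_XX": 9001, "hi_IN": 9002}


def build_source(token_ids, src_lang):
    """Source layout: tokens, then </s>, then the source language id."""
    return list(token_ids) + [EOS, LANG_IDS[src_lang]]


def build_decoder_input(token_ids, tgt_lang):
    """Decoder-side layout: target language id first, then tokens, then </s>."""
    return [LANG_IDS[tgt_lang]] + list(token_ids) + [EOS]


# Summarization within one language: the same code is used on both sides.
article = [101, 102, 103, 104]
summary = [101, 104]
src = build_source(article, "hi_IN")
dec = build_decoder_input(summary, "hi_IN")
```

For translation between two supported languages, the only change would be passing different codes to the two helpers.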

LaibaMehnaz · October 14, 2020, 2:01pm

Hi @valhalla,
I did try using mBart without any lang id, but it gives me this error:

self.cur_lang_code = self.lang_code_to_id[src_lang]
KeyError: ''

LaibaMehnaz · October 14, 2020, 4:01pm

Also, when I use the same language code on both sides, the generations are in a totally different script.

valhalla · October 16, 2020, 5:19pm

For this, you’ll need to tokenize the input and output seq without using the prepare_seq2seq_batch method, or override prepare_seq2seq_batch and modify it to not use the lang id.
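
One way to tokenize without the language id might look like the sketch below. It assumes `tokenizer` is a Hugging Face tokenizer (for example an `MBartTokenizer`) exposing `encode`, `eos_token_id` and `pad_token_id`; the helper name `encode_without_lang_id` is made up for this example. Whether a model finetuned this way performs well is not established in this thread.

```python
def encode_without_lang_id(tokenizer, texts, max_length=128):
    """Encode texts as `tokens </s>` with no language id, right-padded to equal length."""
    rows = []
    for text in texts:
        # add_special_tokens=False skips the suffix the tokenizer would normally append.
        ids = tokenizer.encode(text, add_special_tokens=False)[: max_length - 1]
        rows.append(ids + [tokenizer.eos_token_id])

    width = max(len(row) for row in rows)
    input_ids = [row + [tokenizer.pad_token_id] * (width - len(row)) for row in rows]
    attention_mask = [[1] * len(row) + [0] * (width - len(row)) for row in rows]
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

The same helper could be applied to the target texts to build labels; the returned lists would still need converting to tensors before being passed to the model.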

LaibaMehnaz · October 19, 2020, 1:02pm

I modified the tokenizer to not use the lang id as you suggested, but the problem persists. ROUGE is 0.0, as the generations are in another script.

LaibaMehnaz · October 20, 2020, 6:16am

Also, could this be because I am using tiny-mbart and not mbart-large-cc25? I was trying out tiny-mbart due to memory constraints.

valhalla · October 20, 2020, 6:45am

hi @LaibaMehnaz
tiny-mbart is just meant for testing; it’s a randomly initialised model.

You could try to create a smaller student using the make_student.py script.

LaibaMehnaz · October 20, 2020, 7:55am

Oh, thanks a lot. I will proceed this way and let you know. Thanks again:)

LaibaMehnaz · October 20, 2020, 8:15am

Also, how many encoder and decoder layers would you suggest?

valhalla · October 20, 2020, 11:38am

Hard to say, it depends on the problem. But you could start with the same number of encoder layers and 6 decoder layers; distilbart-12-6 performs really well on summarization.
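
To make the "12 encoder layers, 6 decoder layers" suggestion concrete, here is a rough sketch of choosing which teacher layers a student could copy. The evenly spaced selection and the helper name `pick_layers` are assumptions for illustration; make_student.py has its own layer selection, which may differ.

```python
def pick_layers(n_teacher, n_student):
    """Return evenly spaced teacher layer indices for a student with n_student layers."""
    if not 1 <= n_student <= n_teacher:
        raise ValueError("n_student must be between 1 and n_teacher")
    if n_student == 1:
        return [0]
    # Always keep the first and last teacher layers, spread the rest in between.
    return [round(i * (n_teacher - 1) / (n_student - 1)) for i in range(n_student)]


# mbart-large-cc25 has 12 encoder and 12 decoder layers.
encoder_layers = pick_layers(12, 12)  # keep the full encoder
decoder_layers = pick_layers(12, 6)   # halve the decoder
```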

LaibaMehnaz · October 20, 2020, 11:54am

Alright, thank you so much.

Zhylkaaa · October 23, 2020, 12:15am

Hi, I am also interested in this topic and I am trying to add mBart functionality to another library, but I have encountered a strange error: https://huggingface.co/transformers/model_doc/mbart.html states that prepare_seq2seq_batch should give me a dict with these keys: [input_ids, attention_mask, decoder_input_ids, decoder_attention_mask], but it actually gives me [input_ids, attention_mask, labels]. I am a bit confused.
Is it a bug, or am I doing something wrong?

valhalla · October 24, 2020, 7:28am

Hi @Zhylkaaa

The doc is incorrect. prepare_seq2seq_batch returns [input_ids, attention_mask, labels], and it’s not a bug.

cc. @sshleifer

Zhylkaaa · October 24, 2020, 7:33pm

Hi @valhalla
thanks for your response, but how am I supposed to create the decoder inputs? There is a difference in the lang_id position.
Should I use something like:
[lang_id] + prepare_seq2seq_batch(decoder_input)['input_ids'][:-1] + padding if required?
Or should I just modify prepare_seq2seq_batch throwing away lang_id for summarisation task? (I am not sure about this modification because my intuition tells me that lang_id is something like language conditioned [CLS] token, or my intuition is wrong again😁?)
Thanks!

Zhylkaaa · October 27, 2020, 3:42am

I have read that I can tag @sshleifer for summarisation and BART problems/questions. Sorry if I am wrong.

valhalla · October 29, 2020, 5:27pm

You can keep the lang id for summarisation: pass the same lang id as src_lang and tgt_lang to the prepare_seq2seq_batch method.

Zhylkaaa · October 29, 2020, 9:05pm

Thank you @valhalla,
but what about decoder_input_ids? I don’t receive this value when I use prepare_seq2seq_batch.

valhalla · October 30, 2020, 5:45am

finetune.py and finetune_trainer.py will make the right decoder_input_ids, so you won’t need to pass them.

Zhylkaaa · October 30, 2020, 8:41pm

Thanks. I’ve been digging through the source code and found that the forward method generates decoder_input_ids from labels through shift_tokens_right. Thank you for the help, and sorry for being annoying; I should have checked the source code first.
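
The shifting step mentioned above can be approximated in plain Python. This is a sketch of the idea only: the library function operates on tensors, and the token ids here are toy values. It assumes right-padded rows with at least one non-pad token each.

```python
def shift_tokens_right(input_ids, pad_token_id):
    """Shift each row one position right, wrapping its last non-pad token to the front."""
    shifted = []
    for row in input_ids:
        n_real = sum(1 for tok in row if tok != pad_token_id)
        last_token = row[n_real - 1]  # for mBART labels this is the language id
        shifted.append([last_token] + row[:-1])
    return shifted


PAD, EOS, LANG = 1, 2, 9002  # toy ids, not real vocabulary ids
labels = [
    [11, 12, 13, EOS, LANG],
    [21, 22, EOS, LANG, PAD],
]
decoder_input_ids = shift_tokens_right(labels, PAD)
```

Because labels end with the language id, the wrap moves it to the first decoder position, which accounts for the difference in lang_id position asked about earlier in the thread.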
