I want to train an mBART model from scratch for a new language pair, for unsupervised translation. I have monolingual data in both languages. Specifically, how do I prepare the data for this?
Currently I start with code like this:

```python
from transformers import MBartTokenizer, MBartConfig, MBartModel

# my own tokenizer, trained with Google SentencePiece
tokenizer = MBartTokenizer.from_pretrained('./tokenizer_de_hsb.model')

# the src and tgt language codes are dummies here
batch = tokenizer.prepare_seq2seq_batch(src_texts=src_txts, src_lang="en_XX",
                                        return_tensors="pt")

config = MBartConfig()
model = MBartModel(config)
model(input_ids=batch['input_ids'], decoder_input_ids=batch['labels'])  # forward pass
```
These are the doubts I have:
- For pre-training mBART, what should `input_ids` and `decoder_input_ids` be in the forward pass? Is there a function that generates the noised input with masked tokens?
- Is the right approach to combine the source and target language data and train once on the combined data?
- Is there a sample code for this?
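To make the first doubt concrete, here is a minimal sketch of the kind of noising I assume the encoder input needs. Everything here is my own guess: the helper name `infill_noise` is made up, and I use uniform span lengths for simplicity, whereas the BART/mBART papers describe masking about 35% of the tokens with span lengths drawn from Poisson(λ=3.5).

```python
import random

def infill_noise(token_ids, mask_token_id, mask_ratio=0.35, max_span=5):
    """Rough stand-in for text-infilling noise: collapse random spans of
    tokens into single <mask> tokens until roughly mask_ratio of the input
    has been masked. (The paper samples span lengths from Poisson(3.5);
    here they are uniform in [1, max_span] to keep the sketch short.)"""
    out, i, masked = [], 0, 0
    budget = int(round(len(token_ids) * mask_ratio))
    while i < len(token_ids):
        if masked < budget and random.random() < mask_ratio:
            # the whole span becomes a single <mask> token
            span = min(random.randint(1, max_span), len(token_ids) - i)
            out.append(mask_token_id)
            masked += span
            i += span
        else:
            out.append(token_ids[i])
            i += 1
    return out
```

If I understand the denoising objective correctly, the noised sequence would then be the encoder `input_ids`, while the decoder sees the original, un-noised sequence shifted right. Is that right, and is there a ready-made function in the library for this?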