How to train an MBart model from scratch for a new language pair?

I want to train an mBART model from scratch for unsupervised translation on a new language pair. I have monolingual data for both languages. Specifically, how should I prepare the data?

Currently I start with the following code:

from transformers import MBartConfig, MBartModel, MBartTokenizer

# My own tokenizer, trained with Google SentencePiece
tokenizer = MBartTokenizer.from_pretrained('./tokenizer_de_hsb.model')
# The src and tgt language codes are dummy values here
batch = tokenizer.prepare_seq2seq_batch(src_texts=src_txts, src_lang="en_XX",
                                        tgt_texts=tgt_txts, tgt_lang="ro_RO",
                                        return_tensors="pt")
config = MBartConfig()
model = MBartModel(config)
model(input_ids=batch['input_ids'], decoder_input_ids=batch['labels'])  # forward pass

I have the following questions:

  • For pre-training mBART, what should input_ids and decoder_input_ids be in the forward pass? Is there a function that generates the input with masked tokens?
  • Is the right approach to combine the source- and target-language data and train once on the combined data?
  • Is there any sample code for this?
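
To make the first question concrete, here is a rough sketch of the kind of noising function I have in mind. It is based on my reading of the text-infilling scheme in the mBART paper (roughly 35% of tokens masked, span lengths drawn from a Poisson distribution with λ = 3.5, each span replaced by a single mask token). The names infill_mask and sample_poisson are my own, not library functions, and I clamp spans to at least one token as a simplification. I am not sure this matches what the model expects, which is part of what I am asking.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def infill_mask(tokens, mask_token="<mask>", mask_ratio=0.35, lam=3.5, seed=None):
    """Replace random spans of `tokens` with a single `mask_token` each.

    Hypothetical helper: the defaults follow my understanding of the paper,
    not any library implementation.
    """
    rng = random.Random(seed)
    n = len(tokens)
    budget = int(round(n * mask_ratio))  # number of tokens still to mask
    masked = [False] * n
    attempts = 0
    while budget > 0 and attempts < 10 * n:
        attempts += 1
        # Simplification: clamp span length to [1, remaining budget].
        span = min(max(sample_poisson(lam, rng), 1), budget)
        start = rng.randrange(0, n - span + 1)
        if any(masked[start:start + span]):
            continue  # overlaps an existing span, try again
        for i in range(start, start + span):
            masked[i] = True
        budget -= span

    # Collapse each run of masked positions into one mask token.
    noisy = []
    for i, tok in enumerate(tokens):
        if not masked[i]:
            noisy.append(tok)
        elif i == 0 or not masked[i - 1]:
            noisy.append(mask_token)
    return noisy


if __name__ == "__main__":
    sentence = "das ist ein kleiner Beispielsatz für das Maskieren".split()
    print(infill_mask(sentence, seed=0))
```

My assumption is that the noised sequence would go in as input_ids and the original, unmasked sequence would be the target, but I would like confirmation of that and of how decoder_input_ids should be derived from it.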