I want to train an mBART model from scratch for a new language pair, for unsupervised translation. I have monolingual data in both languages. Specifically, how do I prepare the data for this?
Currently I start with code like this:

```python
from transformers import MBartTokenizer, MBartConfig, MBartModel

# my own tokenizer, trained with Google SentencePiece
tokenizer = MBartTokenizer.from_pretrained('./tokenizer_de_hsb.model')

# the src and tgt language codes are dummies here
batch = tokenizer.prepare_seq2seq_batch(src_texts=src_txts, src_lang="en_XX",
                                        return_tensors="pt")

config = MBartConfig()
model = MBartModel(config)
model(input_ids=batch['input_ids'], decoder_input_ids=batch['labels'])  # forward pass
```
These are the doubts I have:
- For pre-training mBART, what should `input_ids` and `decoder_input_ids` be in the forward pass? Is there a function that generates the noised input with masked tokens?
- Is the right approach to combine the source and target language data and train once on the combined data?
- Is there a sample code for this?
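To make the first doubt concrete, here is a minimal sketch of the kind of noising I assume the encoder input needs. Everything here is my own guess: the helper name `infill_noise` is made up, and I use uniform span lengths for simplicity, whereas the BART/mBART papers describe masking about 35% of the tokens with span lengths drawn from Poisson(λ=3.5).

```python
import random

def infill_noise(token_ids, mask_token_id, mask_ratio=0.35, max_span=5):
    """Rough stand-in for text-infilling noise: collapse random spans of
    tokens into single <mask> tokens until roughly mask_ratio of the input
    has been masked. (The paper samples span lengths from Poisson(3.5);
    here they are uniform in [1, max_span] to keep the sketch short.)"""
    out, i, masked = [], 0, 0
    budget = int(round(len(token_ids) * mask_ratio))
    while i < len(token_ids):
        if masked < budget and random.random() < mask_ratio:
            # the whole span becomes a single <mask> token
            span = min(random.randint(1, max_span), len(token_ids) - i)
            out.append(mask_token_id)
            masked += span
            i += span
        else:
            out.append(token_ids[i])
            i += 1
    return out
```

If I understand the denoising objective correctly, the noised sequence would then be the encoder `input_ids`, while the decoder sees the original, un-noised sequence shifted right. Is that right, and is there a ready-made function in the library for this?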