Pre-Training MBART/MBART50 from Scratch in HuggingFace

I have my own custom dataset on which I need to pre-train an MBART/MBART50-type architecture from scratch. I have already tried the tutorial (given here), but it is difficult to extend the same approach to MBART/MBART50. The doubts that I have:

  1. I need a custom-trained tokenizer. The tokenizer used in the MBART/MBART50 architecture is a multilingual BPE, but following the guide here for training a BPE tokenizer does not work for me (while training, the following error is produced: `MBart.from_pretrained() got an additional argument 'labels'`). A rough sketch of what I am attempting is shown after this list.

  2. How do we train with a multilingual dataset, or pass one to LineByLineTextDataset? I couldn't find any reference on how to do this. (See the second sketch after this list for the kind of pipeline I have in mind.)

  3. Since my downstream task is to use adapters with the model, training with HuggingFace is more or less a given. I can't use other libraries that lack built-in (or have only partial) support for adapters.
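
For reference, here is roughly what I am attempting for point 1. This is only a sketch: the corpus path, vocabulary size, and other settings are placeholders, and I am not sure whether handing a custom SentencePiece model to MBartTokenizer like this is the intended usage.

```python
import sentencepiece as spm
from transformers import MBartTokenizer

# Train a SentencePiece BPE model on the raw multilingual corpus
# (hypothetical path: one sentence per line, all languages concatenated).
spm.SentencePieceTrainer.train(
    input="data/all_languages.txt",
    model_prefix="mbart_spm",
    vocab_size=32000,
    model_type="bpe",
)

# Wrap the trained model in the MBART tokenizer class, which adds the
# language codes (en_XX, hi_IN, ...) as special tokens.
tokenizer = MBartTokenizer(vocab_file="mbart_spm.model")
```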
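
For point 2, the kind of pipeline I have in mind uses the `datasets` library rather than LineByLineTextDataset (which, as far as I can tell, is deprecated in favour of `datasets`). The file names and language codes are made up, and `tokenizer` is the one from the previous sketch:

```python
from datasets import load_dataset, concatenate_datasets

# Hypothetical monolingual files, one sentence per line.
raw = load_dataset(
    "text",
    data_files={"en": "data/train.en", "hi": "data/train.hi"},
)

def tokenize_with_lang(examples, lang_code):
    # src_lang controls which language code the MBART tokenizer appends.
    tokenizer.src_lang = lang_code
    return tokenizer(examples["text"], truncation=True, max_length=512)

parts = []
for split, code in [("en", "en_XX"), ("hi", "hi_IN")]:
    parts.append(
        raw[split].map(
            lambda ex, c=code: tokenize_with_lang(ex, c),
            batched=True,
            remove_columns=["text"],
        )
    )

# One shuffled multilingual training set.
train_dataset = concatenate_datasets(parts).shuffle(seed=42)
```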

Any help/reference would be really helpful. I can clarify more if this conversation is continued!
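
For context, the overall setup I am aiming for looks roughly like the sketch below. While writing this up I noticed that the base MBartModel's forward() does not accept a `labels` argument (only MBartForConditionalGeneration's does), which may be the cause of the error in point 1, so the sketch uses the latter. All config values and training arguments are placeholders, and `tokenizer` / `train_dataset` are the objects from the sketches above.

```python
from transformers import (
    MBartConfig,
    MBartForConditionalGeneration,
    Trainer,
    TrainingArguments,
)

# A randomly initialised model, built from a config rather than from_pretrained().
config = MBartConfig(
    vocab_size=len(tokenizer),
    d_model=512,        # placeholder sizes, not the published mBART ones
    encoder_layers=6,
    decoder_layers=6,
)
model = MBartForConditionalGeneration(config)

training_args = TrainingArguments(
    output_dir="mbart-from-scratch",
    per_device_train_batch_size=8,
    num_train_epochs=1,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=collator,  # a denoising collator; building one is part of what I am unsure about
)
trainer.train()
```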

Hey, I’m working on something similar. Can you provide details on how you went about it? For mBART, the encoder inputs need to be masked, but the decoder inputs are not masked. I’m not sure how I can achieve that using the HuggingFace Trainer.
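
The closest I have come up with so far is a custom data collator that corrupts only the encoder inputs and keeps the clean token ids as labels; as far as I understand, MBartForConditionalGeneration then builds the decoder inputs internally by shifting the labels right, so the decoder side stays unmasked. Note this is only token-level masking, not the span infilling plus sentence permutation of the real mBART objective, and I have not verified it:

```python
import torch
from dataclasses import dataclass
from transformers import PreTrainedTokenizerBase

@dataclass
class MBartDenoisingCollator:
    """Sketch: mask tokens on the encoder side only."""
    tokenizer: PreTrainedTokenizerBase
    mask_prob: float = 0.35

    def __call__(self, examples):
        batch = self.tokenizer.pad(examples, return_tensors="pt")
        input_ids = batch["input_ids"]

        # Labels are the clean, unmasked ids; padding is ignored by the loss.
        labels = input_ids.clone()
        labels[labels == self.tokenizer.pad_token_id] = -100

        # Never mask special tokens (language codes, </s>, pad).
        special = torch.tensor(
            [
                self.tokenizer.get_special_tokens_mask(
                    ids, already_has_special_tokens=True
                )
                for ids in input_ids.tolist()
            ],
            dtype=torch.bool,
        )
        probs = torch.full(input_ids.shape, self.mask_prob)
        probs.masked_fill_(special, 0.0)
        to_mask = torch.bernoulli(probs).bool()

        # Corrupt the encoder input; the decoder side stays clean.
        batch["input_ids"] = input_ids.masked_fill(
            to_mask, self.tokenizer.mask_token_id
        )
        batch["labels"] = labels
        return batch

collator = MBartDenoisingCollator(tokenizer=tokenizer)
```

Passing this as `data_collator` to the Trainer should, I think, give the masked-encoder / clean-decoder split, but I would appreciate confirmation from anyone who has done this.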