I have my own custom dataset on which I need to pre-train an MBART/MBART50-type architecture from scratch. I have already tried the tutorial (given here), but it is difficult to extend the same concept to MBART/MBART50. The doubts that I have:
- I need a custom-trained tokenizer. The tokenizer used in the MBART/MBART50 architecture is a multilingual BPE tokenizer, but following the guide here for training a BPE tokenizer does not work for me; while training, the following error is produced: `MBart.from_pretrained() got an additional argument 'labels'`. (A sketch of my tokenizer-training attempt is given after this list.)
- How do I train with a multilingual dataset, or pass it to `LineByLineTextDataset`? I couldn't find any reference on how to do this. (See the second sketch after this list for what I'm currently imagining.)
- Since my further task is to use adapters with the model, training with Hugging Face is pretty much a given; I can't use other libraries that don't have built-in adapter support (or only partial support).
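For context, here is roughly what I am attempting for the tokenizer. This is only a sketch of my attempt, assuming a SentencePiece-based BPE model like MBART's; the file names, vocabulary size, and special tokens are placeholders for my actual setup:

```python
# Rough sketch of my tokenizer-training attempt (placeholders, not a working recipe).
# MBART's tokenizer is SentencePiece-based BPE, so I train with sentencepiece here.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.all.txt",        # all languages concatenated, one sentence per line
    model_prefix="mbart-custom",  # writes mbart-custom.model / mbart-custom.vocab
    vocab_size=32000,
    model_type="bpe",
    user_defined_symbols=["<mask>"],  # MBART's denoising objective needs a mask token
)

# Wrap the trained SentencePiece model so it can be used with transformers.
from transformers import MBartTokenizer

tokenizer = MBartTokenizer(vocab_file="mbart-custom.model")
```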
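And for the multilingual data, this is what I am currently imagining; again only a sketch with placeholder paths, and I don't know whether simply merging all languages into one file is the right approach:

```python
# Sketch of how I imagine passing multilingual data (placeholder paths).
# train.all.txt is train.en.txt + train.hi.txt merged, one sentence per line;
# I don't know how the language codes (en_XX, hi_IN, ...) should be handled here.
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,        # the custom tokenizer trained above
    file_path="train.all.txt",
    block_size=128,
)
```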
Any help or reference would be really, really helpful. I can clarify more if this conversation continues!