I have my own custom dataset on which I need to pre-train an MBART/MBART50-type architecture from scratch. I have already tried the tutorial (given here), but it is difficult to extend the same concept to MBART/MBART50. The doubts that I have:
- I need a custom-trained tokenizer. The tokenizer used in the MBART/MBART50 architecture is a multilingual BPE tokenizer, but following the guide here for training a BPE tokenizer does not work for me; while training, the following error is produced: `MBart.from_pretrained() got an additional argument 'labels'`. (A sketch of my tokenizer-training attempt is given after this list.)
- How do I train with a multilingual dataset, or pass it to `LineByLineTextDataset`? I couldn't find any reference on how to do this. (See the second sketch after this list for what I'm currently imagining.)
- Since my further task is to use adapters with the model, training with Hugging Face is pretty much a given; I can't use other libraries that don't have built-in adapter support (or only partial support).
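For context, here is roughly what I am attempting for the tokenizer. This is only a sketch of my attempt, assuming a SentencePiece-based BPE model like MBART's; the file names, vocabulary size, and special tokens are placeholders for my actual setup:

```python
# Rough sketch of my tokenizer-training attempt (placeholders, not a working recipe).
# MBART's tokenizer is SentencePiece-based BPE, so I train with sentencepiece here.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.all.txt",        # all languages concatenated, one sentence per line
    model_prefix="mbart-custom",  # writes mbart-custom.model / mbart-custom.vocab
    vocab_size=32000,
    model_type="bpe",
    user_defined_symbols=["<mask>"],  # MBART's denoising objective needs a mask token
)

# Wrap the trained SentencePiece model so it can be used with transformers.
from transformers import MBartTokenizer

tokenizer = MBartTokenizer(vocab_file="mbart-custom.model")
```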
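And for the multilingual data, this is what I am currently imagining; again only a sketch with placeholder paths, and I don't know whether simply merging all languages into one file is the right approach:

```python
# Sketch of how I imagine passing multilingual data (placeholder paths).
# train.all.txt is train.en.txt + train.hi.txt merged, one sentence per line;
# I don't know how the language codes (en_XX, hi_IN, ...) should be handled here.
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,        # the custom tokenizer trained above
    file_path="train.all.txt",
    block_size=128,
)
```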
Any help or reference would be really, really helpful. I can clarify more if this conversation continues!