How to prepare data for mBART50 multilingual (many-to-many) fine-tuning?

Hi all,

I’m trying to fine-tune facebook/mbart-large-50-many-to-many-mmt for multilingual translation across several language pairs at once, such as en→ro, fr→en, and de→fr.

I’ve checked the documentation here: transformers/examples/pytorch/translation/README.md at main · huggingface/transformers · GitHub
I’ve reviewed a previous GitHub issue here: Issues finetuning MBART 50 many to many · Issue #10835 · huggingface/transformers · GitHub

There, it was mentioned that the example script doesn’t support multilingual fine-tuning directly, and that users should prepare the data with the appropriate src_lang / tgt_lang codes and manage forced_bos_token_id dynamically.

But I can’t find a working example or notebook that shows:

  • How to structure a dataset with mixed language pairs
  • How to tokenize dynamically with different src_lang / tgt_lang
  • How to set forced_bos_token_id per sample when using run_translation.py or Seq2SeqTrainer

It would be really helpful to have guidance or an example script for this — even just a dataset schema and preprocessing function would be great.
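To make the question concrete, here is the kind of schema and preprocessing I have been experimenting with. This is only a sketch: the JSONL field names (src_lang, tgt_lang, source, target) and the preprocess function are my own guesses, not anything run_translation.py understands out of the box, and text_target needs a reasonably recent transformers version.

```python
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)

# One JSONL record per example, mixing language pairs freely, e.g.:
# {"src_lang": "en_XX", "tgt_lang": "ro_RO", "source": "Hello.", "target": "Salut."}
# {"src_lang": "de_DE", "tgt_lang": "fr_XX", "source": "Hallo.", "target": "Bonjour."}

def preprocess(example, max_length=128):
    # Reset the language codes on the single tokenizer instance per example,
    # so the correct language token is prepended to the source and the target.
    tokenizer.src_lang = example["src_lang"]
    tokenizer.tgt_lang = example["tgt_lang"]
    return tokenizer(
        example["source"],
        text_target=example["target"],
        max_length=max_length,
        truncation=True,
    )

# dataset = dataset.map(preprocess)  # per-example (batched=False), so each
#                                    # sample keeps its own language pair
```

If I understand the mBART-50 target format correctly, the labels produced this way already start with the target language code, so training itself may be fine; my remaining question is mainly how to vary forced_bos_token_id per sample when generating (e.g. with predict_with_generate), since it is a single value per generate() call.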

Thanks in advance!


https://stackoverflow.com/questions/76191862/how-can-i-fine-tune-mbart-50-for-machine-translation-in-the-transformers-python
I found a sample, but it seems like the only way to switch between language pairs would be to keep multiple tokenizer setups in use at the same time…
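That said, as far as I can tell you don’t need a separate tokenizer per language pair: a single MBart50TokenizerFast can be switched by resetting its src_lang, and the target language is chosen at generation time via forced_bos_token_id. A minimal sketch along the lines of the model card:

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

def translate(text, src_lang, tgt_lang):
    tokenizer.src_lang = src_lang  # e.g. "en_XX"
    inputs = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **inputs,
        # the first generated token must be the target language code
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],  # e.g. "ro_RO"
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate("Hello, how are you?", "en_XX", "ro_RO"))
```

For evaluation during fine-tuning, one workaround could be to group examples by target language and generate group by group, since forced_bos_token_id cannot vary within a single generate() call.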

There seem to be several possible approaches, such as overriding __getitem__ or passing pre-tokenized data, but I haven’t found a complete example of either.
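Here is a rough sketch of the __getitem__ idea (the class and field names are mine, not from the example script): each sample is tokenized on the fly with its own language pair, so one DataLoader (and hence Seq2SeqTrainer with DataCollatorForSeq2Seq) can mix pairs freely.

```python
from torch.utils.data import Dataset
from transformers import MBart50TokenizerFast

class MixedPairDataset(Dataset):
    """Tokenizes each record with its own src_lang / tgt_lang on the fly."""

    def __init__(self, records, tokenizer, max_length=128):
        # records: list of dicts like
        # {"src_lang": "en_XX", "tgt_lang": "ro_RO", "source": "...", "target": "..."}
        self.records = records
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        r = self.records[idx]
        self.tokenizer.src_lang = r["src_lang"]
        self.tokenizer.tgt_lang = r["tgt_lang"]
        enc = self.tokenizer(
            r["source"],
            text_target=r["target"],
            max_length=self.max_length,
            truncation=True,
        )
        # enc contains input_ids, attention_mask and labels;
        # DataCollatorForSeq2Seq can pad these per batch.
        return enc

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)
records = [
    {"src_lang": "en_XX", "tgt_lang": "ro_RO", "source": "Hello.", "target": "Salut."},
    {"src_lang": "de_DE", "tgt_lang": "fr_XX", "source": "Hallo.", "target": "Bonjour."},
]
ds = MixedPairDataset(records, tokenizer)
print(ds[0]["labels"][:2])  # should start with the ro_RO language code
```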