How to prepare data for mBART50 multilingual (many-to-many) fine-tuning?

Hi all,

I’m trying to fine-tune facebook/mbart-large-50-many-to-many-mmt for multilingual translation across several language pairs at once, such as en→ro, fr→en, and de→fr.

I’ve checked the documentation here: transformers/examples/pytorch/translation/README.md at main · huggingface/transformers · GitHub
I’ve reviewed a previous GitHub issue here: Issues finetuning MBART 50 many to many · Issue #10835 · huggingface/transformers · GitHub

There, it was mentioned that the example script doesn’t support multilingual fine-tuning directly, and that users should prepare the data with the appropriate src_lang / tgt_lang codes and manage forced_bos_token_id dynamically.

But I can’t find a working example or notebook that shows:

  • How to structure a dataset with mixed language pairs
  • How to tokenize dynamically with different src_lang / tgt_lang
  • How to set forced_bos_token_id per sample when using run_translation.py or Seq2SeqTrainer

It would be really helpful to have guidance or an example script for this — even just a dataset schema and preprocessing function would be great.
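To make the question concrete, here is the kind of schema and preprocessing I have been experimenting with. This is only a sketch: the JSONL field names (src_lang, tgt_lang, source, target) and the preprocess function are my own guesses, not anything run_translation.py understands out of the box, and text_target needs a reasonably recent transformers version.

```python
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)

# One JSONL record per example, mixing language pairs freely, e.g.:
# {"src_lang": "en_XX", "tgt_lang": "ro_RO", "source": "Hello.", "target": "Salut."}
# {"src_lang": "de_DE", "tgt_lang": "fr_XX", "source": "Hallo.", "target": "Bonjour."}

def preprocess(example, max_length=128):
    # Reset the language codes on the single tokenizer instance per example,
    # so the correct language token is prepended to the source and the target.
    tokenizer.src_lang = example["src_lang"]
    tokenizer.tgt_lang = example["tgt_lang"]
    return tokenizer(
        example["source"],
        text_target=example["target"],
        max_length=max_length,
        truncation=True,
    )

# dataset = dataset.map(preprocess)  # per-example (batched=False), so each
#                                    # sample keeps its own language pair
```

If I understand the mBART-50 target format correctly, the labels produced this way already start with the target language code, so training itself may be fine; my remaining question is mainly how to vary forced_bos_token_id per sample when generating (e.g. with predict_with_generate), since it is a single value per generate() call.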

Thanks in advance!


https://stackoverflow.com/questions/76191862/how-can-i-fine-tune-mbart-50-for-machine-translation-in-the-transformers-python
I found a sample, but it seems like the only way to switch between language pairs would be to keep multiple tokenizer setups in use at the same time…
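That said, as far as I can tell you don’t need a separate tokenizer per language pair: a single MBart50TokenizerFast can be switched by resetting its src_lang, and the target language is chosen at generation time via forced_bos_token_id. A minimal sketch along the lines of the model card:

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

def translate(text, src_lang, tgt_lang):
    tokenizer.src_lang = src_lang  # e.g. "en_XX"
    inputs = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **inputs,
        # the first generated token must be the target language code
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],  # e.g. "ro_RO"
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate("Hello, how are you?", "en_XX", "ro_RO"))
```

For evaluation during fine-tuning, one workaround could be to group examples by target language and generate group by group, since forced_bos_token_id cannot vary within a single generate() call.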

There seem to be several possible approaches, such as overriding __getitem__ or passing pre-tokenized data, but I haven’t found a complete example of either.
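Here is a rough sketch of the __getitem__ idea (the class and field names are mine, not from the example script): each sample is tokenized on the fly with its own language pair, so one DataLoader (and hence Seq2SeqTrainer with DataCollatorForSeq2Seq) can mix pairs freely.

```python
from torch.utils.data import Dataset
from transformers import MBart50TokenizerFast

class MixedPairDataset(Dataset):
    """Tokenizes each record with its own src_lang / tgt_lang on the fly."""

    def __init__(self, records, tokenizer, max_length=128):
        # records: list of dicts like
        # {"src_lang": "en_XX", "tgt_lang": "ro_RO", "source": "...", "target": "..."}
        self.records = records
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        r = self.records[idx]
        self.tokenizer.src_lang = r["src_lang"]
        self.tokenizer.tgt_lang = r["tgt_lang"]
        enc = self.tokenizer(
            r["source"],
            text_target=r["target"],
            max_length=self.max_length,
            truncation=True,
        )
        # enc contains input_ids, attention_mask and labels;
        # DataCollatorForSeq2Seq can pad these per batch.
        return enc

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)
records = [
    {"src_lang": "en_XX", "tgt_lang": "ro_RO", "source": "Hello.", "target": "Salut."},
    {"src_lang": "de_DE", "tgt_lang": "fr_XX", "source": "Hallo.", "target": "Bonjour."},
]
ds = MixedPairDataset(records, tokenizer)
print(ds[0]["labels"][:2])  # should start with the ro_RO language code
```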