Hi all,
I’m trying to fine-tune facebook/mbart-large-50-many-to-many-mmt for multilingual translation across several language pairs, such as en→ro, fr→en, de→fr, etc.
I’ve checked the documentation here: transformers/examples/pytorch/translation/README.md at main · huggingface/transformers · GitHub
I’ve reviewed a previous GitHub issue here: Issues finetuning MBART 50 many to many · Issue #10835 · huggingface/transformers · GitHub
There, it was mentioned that the example script wouldn’t support multilingual fine-tuning directly, and that users should prepare data with appropriate src_lang / tgt_lang tags and manage forced_bos_token_id dynamically.
But I can’t find a working example or notebook that shows:
- How to structure a dataset with mixed language pairs
- How to tokenize dynamically with different src_lang / tgt_lang
- How to set forced_bos_token_id per sample when using run_translation.py or Seq2SeqTrainer
It would be really helpful to have guidance or an example script for this — even just a dataset schema and preprocessing function would be great.
Thanks in advance!
https://stackoverflow.com/questions/76191862/how-can-i-fine-tune-mbart-50-for-machine-translation-in-the-transformers-python
I found a sample, but it seems the only way it switches between language pairs is by using multiple tokenizers at the same time…
There seem to be several possible approaches, such as overriding __getitem__ or passing pre-tokenized data, but…
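Something along these lines might work with a single tokenizer, switching src_lang / tgt_lang per example inside __getitem__ (a rough, untested sketch; the field names are just placeholders):

```python
# Rough sketch: one MBart50TokenizerFast shared across language pairs,
# with src_lang / tgt_lang switched per example inside __getitem__.
from torch.utils.data import Dataset
from transformers import MBart50TokenizerFast

class MixedPairDataset(Dataset):
    def __init__(self, examples, tokenizer, max_length=128):
        # examples: list of dicts like
        # {"src_text": "...", "tgt_text": "...", "src_lang": "en_XX", "tgt_lang": "ro_RO"}
        self.examples = examples
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        # Switch the language codes before tokenizing this single example.
        self.tokenizer.src_lang = ex["src_lang"]
        self.tokenizer.tgt_lang = ex["tgt_lang"]
        return self.tokenizer(
            ex["src_text"],
            text_target=ex["tgt_text"],   # produces "labels" starting with the target lang code
            max_length=self.max_length,
            truncation=True,
        )

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
train_data = MixedPairDataset(
    [
        {"src_text": "Hello", "tgt_text": "Salut", "src_lang": "en_XX", "tgt_lang": "ro_RO"},
        {"src_text": "Bonjour", "tgt_text": "Hello", "src_lang": "fr_XX", "tgt_lang": "en_XX"},
    ],
    tokenizer,
)
```

Padding can then be left to DataCollatorForSeq2Seq at batch time, since each example comes back unpadded.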
I know this is a late response, but I have been doing multilingual fine-tuning without issue (mostly). As long as the first token of the target sentence is the correct output lang_id, the teacher forcing will include it after the first decoding step. The model does not have to correctly predict the output lang_id; it only has to correctly predict the output sequence given the correct lang_id.
As far as multiple languages in one batch goes, I’ll update here if my code works. I made a custom torch da…
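As a quick check that the language id really is the first target token (assuming the documented mBART-50 target format `[lang_code] X [eos]`):

```python
# Sanity check: with the mBART-50 tokenizer, the labels should start with the
# target language code, which is what the teacher-forcing argument relies on.
from transformers import MBart50TokenizerFast

tok = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tok.src_lang = "en_XX"
tok.tgt_lang = "ro_RO"

enc = tok("Hello", text_target="Salut")
print(enc["labels"][0] == tok.convert_tokens_to_ids("ro_RO"))  # expected: True
```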
# 🚀 Feature request
Allow mBART and M2M100 to be easily fine-tuned with multiple target languages in the fine-tuning dataset, probably by allowing forced_bos_token_id to be provided in the training dataset.
## Motivation
A number of multilingual models already exist in huggingface transformers (m2m100, mBART, mt5). These can do translation to and from multiple languages. However, at the moment, m2m100 and mBART can only easily be fine-tuned using training data with a single target language.
This is because mBART and M2M100 require a "forced beginning of sequence token" [`forced_bos_token_id`](https://huggingface.co/docs/transformers/main_classes/model#transformers.generation_utils.GenerationMixin.generate.forced_bos_token_id) to be set to indicate the target language. Because it is set on the model rather than per example, there is no obvious way to use different target languages within the same training run.
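For context, the usual single-target-language pattern looks roughly like this (following the model card usage; the target language id is fixed once per `generate()` call rather than per example):

```python
# Typical single-target usage: forced_bos_token_id is supplied once per
# generate() call, so mixed target languages cannot be expressed per sample.
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

tokenizer.src_lang = "en_XX"
encoded = tokenizer("Hello world", return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("ro_RO"),  # one target language per call
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```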
This has been asked about by multiple people independently in the discussion forums with no response yet. I've found two at this point (and I would have been a third):
- https://discuss.huggingface.co/t/m2m-model-finetuning-on-multiple-language-pairs/13203
- https://discuss.huggingface.co/t/how-to-force-bos-token-id-for-each-example-individually-in-mbart/8712
## Your contribution
I don't feel confident enough in my Python or transformers expertise to contribute a pull request. However, it feels to me like this code should live in the trainer rather than in the model. For my own project I believe I have worked around it as follows ([Sample Colab notebook for the code below](https://colab.research.google.com/drive/11Wml-dOasQTuUYtk7dwKuU6B7bnQhwAq?usp=sharing)):
- adding `forced_bos_token_id` as a column in my dataset, with one entry for each training example
- subclassing Seq2SeqTrainer and making the following changes:
  - Overriding `prediction_step(...)` with a copy of the original code, adding code to read forced_bos_token_id from the inputs and pass it to generation: `generated_tokens = self.model.generate(**generation_inputs, **gen_kwargs, forced_bos_token_id = forced_bos_token_id)`
  - Overriding `_remove_unused_columns()` to be a no-op, so the forced_bos_token_id column doesn't get removed (it isn't in the model's forward signature, since it isn't a model parameter)
  - Overriding `compute_loss()` to do an `inputs.pop("forced_bos_token_id")`, preventing this unexpected input from breaking the forward step
This seems to run, and I hope it is working, but I could easily have made a simple mistake. All the copying and pasting of code also makes this very fragile, which is why it would be nicer to have it in the transformers library.
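A rough, untested sketch of that subclass might look like the following; it assumes a reasonably recent transformers release, a per-example `forced_bos_token_id` column in the dataset, and evaluation batches grouped by target language. Instead of copying `prediction_step` wholesale, this variant sets the id on the model's generation config before delegating to the parent method:

```python
# Rough sketch of the workaround described above (not the exact notebook code).
from transformers import Seq2SeqTrainer

class MultiTargetSeq2SeqTrainer(Seq2SeqTrainer):
    def _remove_unused_columns(self, dataset, description=None):
        # No-op so the forced_bos_token_id column is not stripped from the dataset.
        return dataset

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # The model's forward() does not accept this column, so drop it here.
        inputs = dict(inputs)
        inputs.pop("forced_bos_token_id", None)
        return super().compute_loss(model, inputs, return_outputs=return_outputs, **kwargs)

    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None, **gen_kwargs):
        inputs = dict(inputs)
        forced = inputs.pop("forced_bos_token_id", None)
        if forced is not None:
            # Set per batch; generate() falls back to the model's generation config
            # when forced_bos_token_id is not passed explicitly. This assumes every
            # example in the batch shares the same target language.
            self.model.generation_config.forced_bos_token_id = int(forced[0])
        return super().prediction_step(
            model, inputs, prediction_loss_only, ignore_keys=ignore_keys, **gen_kwargs
        )
```

Since `generate()` only takes a single `forced_bos_token_id` per call, grouping evaluation examples by target language (or evaluating one language pair at a time) is still needed for the generation-based metrics.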