Translation - MBART, translation with identical source and target language, for text normalization

desothier2 · July 14, 2021, 1:06pm

Hi,

This is rather a general question about translation, and I am aware that I don’t exactly follow your guidelines, so I am sorry for that.
(We were able to run the examples mentioned in your README, great tool!)

We are trying to frame text normalization for Dutch as a ‘translation’ task.

So, is it possible to define the source and target language as the same language, for instance

--source_lang nl_XX \
--target_lang nl_XX \

{"translation": {"nl_XX": "liefst geen energie vandaag . waar is m'n oplaadstation ?", "nl_XX": "liefst geen energie vandaag . waar is mijn oplaadstation ?"}}

or

--source_lang source \
--target_lang target \

{"translation": {"source": "liefst geen energie vandaag . waar is m'n oplaadstation ?", "target": "liefst geen energie vandaag . waar is mijn oplaadstation ?"}}

Environment info

--model_name_or_path facebook/mbart-large-50-many-to-many-mmt

Thanks for your answer!

sgugger · July 14, 2021, 3:11pm

You should change the data processing in the script, as source_lang and target_lang are also used to set the languages on multilingual tokenizers (if you’re using mBart, for instance, the second approach wouldn’t work, since "source" and "target" are not valid language codes).
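
One way to read this advice is to decouple the keys used in the data file from the language codes given to the tokenizer. The following is only a minimal sketch under stated assumptions, not the script's actual code: the function name `preprocess_function`, the constants, and the `max_length` default are illustrative, and the data is assumed to use the keys "source" and "target" as in the second example above. It also assumes a recent transformers version in which the tokenizer call accepts `text_target`; older versions used a different mechanism for encoding targets.

```python
# Keys used in the JSON lines file. These are column names, NOT language codes.
SOURCE_KEY = "source"
TARGET_KEY = "target"
# mBART-50 language code used for both sides of the normalization task.
LANG_CODE = "nl_XX"


def preprocess_function(examples, tokenizer, max_length=128):
    """Tokenize a batch whose "translation" dicts use SOURCE_KEY / TARGET_KEY."""
    inputs = [pair[SOURCE_KEY] for pair in examples["translation"]]
    targets = [pair[TARGET_KEY] for pair in examples["translation"]]

    # The language codes are set on the tokenizer, independently of the data keys.
    tokenizer.src_lang = LANG_CODE
    tokenizer.tgt_lang = LANG_CODE

    # text_target asks the tokenizer to encode the targets as labels.
    return tokenizer(
        inputs,
        text_target=targets,
        max_length=max_length,
        truncation=True,
    )
```

With a real tokenizer loaded from facebook/mbart-large-50-many-to-many-mmt, a function like this could be passed to `datasets.Dataset.map` with `batched=True`, for example through a lambda that supplies the tokenizer.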

desothier2 · July 14, 2021, 7:03pm

Hi,

thanks for your answer.
So as I understand it, using MBart with identical source and target language is not feasible with the current architecture?

Thanks

sgugger · July 14, 2021, 7:08pm

It is, but you need to slightly change the way the data is processed in the script, since you can’t have a dictionary with the same key twice.
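
The duplicate-key problem can be seen with the standard library alone. This is a small illustration, with shortened example sentences, of what happens when a record with two "nl_XX" keys is parsed by Python's `json` module; the "source" and "target" keys in the second record are just one possible choice of distinct names.

```python
import json

# One line of the data file as in the first attempt: both sides use the key "nl_XX".
line = """{"translation": {"nl_XX": "waar is m'n oplaadstation ?", "nl_XX": "waar is mijn oplaadstation ?"}}"""

record = json.loads(line)
# Python's json module keeps only the last value for a repeated key,
# so the source sentence is silently lost.
print(record["translation"])

# Distinct keys keep both sentences.
fixed = json.loads(
    """{"translation": {"source": "waar is m'n oplaadstation ?", "target": "waar is mijn oplaadstation ?"}}"""
)
print(fixed["translation"])
```

The first record ends up with a single entry holding only the normalized sentence, which is why the data keys need to differ even when both sides are Dutch.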
