Translation - MBART, translation with identical source and target language, for text normalization

Hi,

This is more of a general question about translation, and I am aware that I don’t exactly follow your guidelines, so I am sorry for that.
(We were able to run the examples mentioned in your README, great tool!)

We are trying to frame text normalization for Dutch as a ‘translation’ task.

So, is it possible to set the source and target language to the same language, for instance:

--source_lang nl_XX \
--target_lang nl_XX \

{"translation": {"nl_XX": "liefst geen energie vandaag . waar is m'n oplaadstation ?", "nl_XX": "liefst geen energie vandaag . waar is mijn oplaadstation ?"}}

or

--source_lang source \
--target_lang target \

{"translation": {"source": "liefst geen energie vandaag . waar is m'n oplaadstation ?", "target": "liefst geen energie vandaag . waar is mijn oplaadstation ?"}}

Environment info

--model_name_or_path facebook/mbart-large-50-many-to-many-mmt

Thanks for your answer!

You should change the data processing in the script, because source_lang and target_lang are also used to set the languages on multilingual tokenizers (if you’re using mBART, for instance, this approach wouldn’t work as-is).
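
As a rough illustration of that kind of change, here is a minimal sketch of a preprocessing function that keeps the data keys separate from the language code. It is not the script’s actual code: the function name `preprocess`, the keys "source" and "target", and the `max_length` default are hypothetical, and it assumes a tokenizer that exposes `src_lang` / `tgt_lang` attributes and accepts `text_target`, as the mBART-50 tokenizers in recent transformers versions do.

```python
def preprocess(examples, tokenizer, lang_code="nl_XX",
               source_key="source", target_key="target", max_length=128):
    """Tokenize normalization pairs stored under two distinct keys.

    `source_key` and `target_key` are hypothetical field names; the same
    language code is given to the tokenizer for both sides.
    """
    # The language code goes to the tokenizer, not into the data keys.
    tokenizer.src_lang = lang_code
    tokenizer.tgt_lang = lang_code

    # In a batched map, examples["translation"] is a list of dicts.
    inputs = [pair[source_key] for pair in examples["translation"]]
    targets = [pair[target_key] for pair in examples["translation"]]

    # `text_target` asks the tokenizer to encode the targets as labels.
    return tokenizer(inputs, text_target=targets,
                     max_length=max_length, truncation=True)
```

With a datasets object this would typically be applied with something like `raw_datasets.map(lambda batch: preprocess(batch, tokenizer), batched=True)`, with the tokenizer loaded from the mBART checkpoint.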

Hi,

thanks for your answer.
So as I understand it, using MBart with identical source and target language is not feasible with the current architecture?

Thanks

It is, but you need to slightly change the way the data is processed in the script, since you can’t have a dictionary with the same key twice.
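
The duplicate-key problem can be seen with a small sketch using only Python’s standard json module. The record below is a shortened version of the first example in the question; the variable names are just for illustration.

```python
import json

# One record in the format from the question, with "nl_XX" used for both sides.
line = (
    '{"translation": {"nl_XX": "waar is m\'n oplaadstation ?", '
    '"nl_XX": "waar is mijn oplaadstation ?"}}'
)

record = json.loads(line)

# json.loads keeps only the last value for a repeated key,
# so the unnormalized source sentence is silently dropped.
pair = record["translation"]
print(len(pair))      # 1
print(pair["nl_XX"])  # waar is mijn oplaadstation ?
```

This is why the two sentences need distinct keys in the data file, with the language code passed to the tokenizer separately.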