[mBART] What is the correct data format for fine-tuning on a new language? (e.g., Toki Pona)

I’m facing a fundamental challenge while trying to fine-tune facebook/mbart-large-50-many-to-many-mmt to add support for a new language, Toki Pona (tok). I’ve seen several conflicting discussions and examples, and I’m hoping to find the definitive “best practice”.

My core confusion is about the data format and the placement of language ID tokens.

I’ve seen valhalla’s advice in the forums, which suggests that for mBART, the format should be:

  • Source: Sentence text… [SRC_LANG_ID]

  • Target: [TGT_LANG_ID] Sentence text…

However, most examples for the Trainer framework (like the official run_translation.py script) seem to use a preprocess_function where language IDs are NOT part of the text, but are set via tokenizer.src_lang and tokenizer.tgt_lang before tokenization.
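For concreteness, here is my understanding of the two layouts in plain Python (no transformers needed; the token strings are illustrative). As far as I can tell from the docs, the original mBART (cc25) tokenizer appends the source code to the source, while the mBART-50 tokenizers prefix the language code on both sides, but this is exactly the point I'm unsure about:

```python
EOS = "</s>"

def mbart25_layout(src_tokens, tgt_tokens, src_code, tgt_code):
    # Original mBART (cc25), as I read the docs:
    # source = "X </s> [src_code]", labels = "Y </s> [tgt_code]"
    # (the target code ends up at the decoder start after shifting)
    source = src_tokens + [EOS, src_code]
    labels = tgt_tokens + [EOS, tgt_code]
    return source, labels

def mbart50_layout(src_tokens, tgt_tokens, src_code, tgt_code):
    # mBART-50, as I read the docs: the language code is a PREFIX
    # on both source and target: "[code] X </s>"
    source = [src_code] + src_tokens + [EOS]
    labels = [tgt_code] + tgt_tokens + [EOS]
    return source, labels

src, lab = mbart50_layout(["I", "feel", "good."], ["mi", "pilin", "pona."],
                          "en_XX", "tok_XX")
# src -> ['en_XX', 'I', 'feel', 'good.', '</s>']
```

If that reading is right, valhalla's advice matches the cc25 layout, not the mBART-50 checkpoints, which would explain part of the confusion.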

This leads to my central question:

When fine-tuning mBART on a new, unsupported language (e.g., tok_XX) within the transformers Trainer framework, what is the correct and officially recommended way to format the training data?

Specifically:

  1. Should the language IDs (en_XX, zh_CN, tok_XX…) be physically present in the input/target strings of my dataset?

  2. If so, where should they be placed? At the beginning of the source string (like en_XX I feel good.)? Or at the end (like I feel good. en_XX)?

  3. If not, and I should rely on tokenizer.src_lang and tokenizer.tgt_lang, how can I make this work in a multi-directional (X ↔ tok) training setup where the source and target languages change for every sample within a batch? The official run_translation.py seems designed only for a single language pair.
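To illustrate question 3, this is the kind of per-example preprocessing I have been attempting. The field names (src_text, src_lang, …) are my own dataset schema, not anything official, and I'm assuming the tokenizer's text_target argument handles the label side:

```python
def preprocess(example, tokenizer, max_len=128):
    # Set the translation direction per example, since it changes
    # from sample to sample in a mixed X <-> tok dataset.
    tokenizer.src_lang = example["src_lang"]   # e.g. "en_XX" or "tok_XX"
    tokenizer.tgt_lang = example["tgt_lang"]
    return tokenizer(
        example["src_text"],
        text_target=example["tgt_text"],
        max_length=max_len,
        truncation=True,
    )
```

Mapped with batched=False this at least runs, but I don't know whether mutating tokenizer.src_lang per example like this is actually the intended pattern, or whether one is supposed to group the dataset by direction instead.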

I have tried several approaches based on these conflicting ideas, including adding tok_XX as a special token. My experiments have resulted in bizarre “language switching” failures (e.g., tok → zh producing German), which strongly suggests I am fundamentally misunderstanding how mBART’s language control mechanism is supposed to work in a fine-tuning context for a new language.
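Concretely, this is roughly how I added tok_XX, written as a helper so the hedging is explicit: attributes like lang_code_to_id seem to vary across transformers versions, so I guard for them, and I can't tell whether this is the sanctioned way to register a new language code:

```python
def add_language_code(tokenizer, model, new_code="tok_XX"):
    # Add the new code as a special token so it is never split into
    # subword pieces, then grow the model's embedding matrix to cover
    # the new vocabulary entry.
    tokenizer.add_tokens([new_code], special_tokens=True)
    model.resize_token_embeddings(len(tokenizer))
    new_id = tokenizer.convert_tokens_to_ids(new_code)
    # Some tokenizer versions keep an explicit lang_code_to_id mapping
    # that the src_lang / tgt_lang setters consult; register the new
    # code there if the attribute exists.
    if hasattr(tokenizer, "lang_code_to_id"):
        tokenizer.lang_code_to_id[new_code] = new_id
    return new_id
```

Even with this in place I still see the language-switching failures described above, which is why I suspect the problem is in my data format rather than in the token registration.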

Could anyone who has successfully navigated this, or any of the core developers, shed some light on the canonical data formatting required to make this work?

Thank you so much for your time and expertise.


Adding a new language to the model seems complicated… but I found a way that might work.


Thanks a lot for your long but very detailed solution! This will help a lot!
