NLLB tokenizer multiple target/source languages within a training batch

From the docs, one way to load the nllb tokenizer is:
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-1.3B", use_auth_token=False, src_lang="eng_Latn", tgt_lang="hin_Deva")

But this way, the tgt_lang that gets assigned is hin_Deva for all labels. Is there a way to set the tgt_lang code when calling the tokenizer, so that one can have multiple target languages within a batch while training? e.g.

# loading tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-1.3B", use_auth_token=False, src_lang="eng_Latn")
model_inputs = tokenizer(inputs, text_target=targets, target_langs=target_langs, max_length=max_length, truncation=True)

# where target_langs = ['hin_Deva', 'tur_Latn', ....]
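One alternative I have been wondering about (not sure it is intended usage, and train_pairs below is just a made-up example) is re-setting tokenizer.tgt_lang before each call and tokenizing example by example, then letting DataCollatorForSeq2Seq pad the batch:

from transformers import AutoTokenizer

# the tgt_lang given at load time is only a default; it gets replaced per example below
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-1.3B", src_lang="eng_Latn", tgt_lang="hin_Deva"
)

# hypothetical (source text, target text, target language) triples
train_pairs = [
    ("Hi", "नमस्ते", "hin_Deva"),
    ("Hi", "Merhaba", "tur_Latn"),
]

features = []
for src_text, tgt_text, tgt_lang in train_pairs:
    tokenizer.tgt_lang = tgt_lang  # changes the language code used for the labels
    enc = tokenizer(src_text, text_target=tgt_text, max_length=128, truncation=True)
    features.append(enc)

# features can then be padded into batches by DataCollatorForSeq2Seq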

Hi, does anyone have a solution to this?

To my understanding, one way to prepare the training data for finetuning is to feed the model the same sentence pair twice but flipped, so that the model learns to translate in both directions.

Eg (LHS is the Input & RHS is the Target):

{"eng_Latn": "Hi", "zho_Hans": "你好"}
{"zho_Hans": "你好", "eng_Latn": "Hi"}

But I can’t seem to find a way to do this with the Trainer API.
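For the data preparation side at least, duplicating and flipping the pairs can be sketched with datasets; the flat src_lang / tgt_lang / src_text / tgt_text columns are just the layout I happen to use:

from datasets import Dataset, concatenate_datasets

pairs = Dataset.from_dict({
    "src_lang": ["eng_Latn"],
    "tgt_lang": ["zho_Hans"],
    "src_text": ["Hi"],
    "tgt_text": ["你好"],
})

def flip(example):
    # swap source and target so the model also sees the reverse direction
    return {
        "src_lang": example["tgt_lang"],
        "tgt_lang": example["src_lang"],
        "src_text": example["tgt_text"],
        "tgt_text": example["src_text"],
    }

bidirectional = concatenate_datasets([pairs, pairs.map(flip)])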

Not sure, but I found a workaround. I created a preprocessing function that writes the tokenizer.lang_code_to_id[ex["tgt_lang"]] value into the first label token as part of the tokenization process.

def preprocess_function_helper(tokenizer, context_length=3):
    def preprocess_function(examples, max_length=128):
        # ... build inputs, targets, batch_size and the per-example language
        # token ids, e.g. lang_tokens = [tokenizer.lang_code_to_id[lang]
        #                                for lang in examples["tgt_lang"]] ...
        model_inputs = tokenizer(
            inputs, text_target=targets, max_length=max_length, truncation=True
        )

        # Overwrite the first token id in each model_inputs['labels'] with the
        # language token of that example's target language
        for i in range(batch_size):
            model_inputs['labels'][i][0] = lang_tokens[i]
        # ... any remaining per-example processing ...
        return model_inputs

    return preprocess_function

dataset = dataset.map(
        preprocess_function_helper(tokenizer=tokenizer, context_length=context_length),
        batched=True,
        remove_columns=["translation"],
        load_from_cache_file=False,
    )

The resulting dataset then gets passed to the model for training. Hope that helps.
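In case it is useful, here is a fuller, self-contained sketch of the same idea; the src_lang / tgt_lang / src_text / tgt_text column names (and the placeholder tgt_lang used when loading the tokenizer) are my own assumptions, not necessarily what was used above:

from transformers import AutoTokenizer

# the tgt_lang passed here is only a placeholder; it gets overwritten per example
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-1.3B", src_lang="eng_Latn", tgt_lang="hin_Deva"
)

def preprocess_function(examples, max_length=128):
    inputs = examples["src_text"]
    targets = examples["tgt_text"]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    # per-example target-language token ids (convert_tokens_to_ids also works
    # where lang_code_to_id is not available)
    lang_tokens = [tokenizer.convert_tokens_to_ids(lang) for lang in examples["tgt_lang"]]
    # overwrite the first label token (the fixed placeholder language code) with
    # this example's language code; assumes the non-legacy NLLB format where the
    # language code is the first label token
    for i in range(len(lang_tokens)):
        model_inputs['labels'][i][0] = lang_tokens[i]
    return model_inputs

Whether the language code really sits at position 0 depends on the tokenizer's legacy_behaviour setting, so it is worth decoding one example's labels to check before training.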

Hey, thanks for sharing!
I think that helps a lot to ensure that the correct lang_id is attached to the correct input sequence. However, when it comes to training, how will the model know when to switch the output language token (in your case, between eng_Latn and hin_Deva, or in my example, between eng_Latn and zho_Hans)?

The reason I ask is that I know that during inference, after the model has been trained, we need to specify forced_bos_token_id to generate output.
So I am not sure whether there is a need, or a way, to specify forced_bos_token_id while training.
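For reference, this is the inference-time usage I mean, roughly as in the NLLB docs (the sentence and target language are just examples):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-1.3B", src_lang="eng_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-1.3B")

inputs = tokenizer("Hi", return_tensors="pt")
generated = model.generate(
    **inputs,
    # force the decoder to emit the target-language token first
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("zho_Hans"),
    max_length=32,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))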

Edit: after reading a related issue posted on Hugging Face's GitHub, I just realised that multilingual finetuning/training is not supported and is unlikely to be.