NLLB tokenizer multiple target/source languages within a training batch

From the docs, one way to load the NLLB tokenizer is:
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-1.3B", use_auth_token=False, src_lang="eng_Latn", tgt_lang="hin_Deva")

But this way, the tgt_lang assigned to the labels is hin_Deva for every example. Is there a way to set the tgt_lang code when calling the tokenizer, so that a training batch can contain multiple target languages? E.g.

# loading tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-1.3B", use_auth_token=False, src_lang="eng_Latn")
model_inputs = tokenizer(inputs, text_target=targets, target_langs=target_langs, max_length=max_length, truncation=True)

# where target_langs = ['hin_Deva', 'tur_Latn', ....]

Hi, does anyone have a solution to this?

To my understanding, one way to prepare the training data for finetuning is to feed the model the same sentence pair twice, but flipped, so that the model learns to translate in both directions.

E.g. (LHS is the input & RHS is the target):

{"eng_Latn": "Hi", "zho_Hans": "你好"}
{"zho_Hans": "你好", "eng_Latn": "Hi"}

But I can’t seem to find a way to do this with the Trainer API.

Not sure, but I found a workaround. I created a preprocess function that writes the tokenizer.lang_code_to_id[ex["tgt_lang"]] value into the first label position as part of the tokenization process.

def preprocess_function_helper(tokenizer, context_length=3):
    def preprocess_function(examples, max_length=128):
        # ... build inputs, targets and the per-example lang_tokens here ...
        model_inputs = tokenizer(
            inputs, text_target=targets, max_length=max_length, truncation=True
        )

        # Overwrite the first token id in each model_inputs['labels'] with the
        # corresponding target lang code id from lang_tokens
        for i in range(len(model_inputs['labels'])):
            model_inputs['labels'][i][0] = lang_tokens[i]
        # ...
        return model_inputs

    return preprocess_function

dataset = dataset.map(
    preprocess_function_helper(tokenizer=tokenizer, context_length=context_length),
    batched=True,
    remove_columns=["translation"],
    load_from_cache_file=False,
)

which then gets passed into the model. Hope that helps.
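
For anyone wondering what the elided parts might look like, here is a rough sketch, assuming each row carries the "translation" dict from the dataset plus a "tgt_lang" column, with English as the fixed source language (these assumptions are mine, not the original code):

# Hypothetical helper for the elided steps above: build inputs, targets and the
# per-example target lang-code ids. Column names are assumptions, not the thread's code.
def build_batch_fields(examples, tokenizer):
    inputs = [tr["eng_Latn"] for tr in examples["translation"]]
    targets = [tr[lang] for tr, lang in zip(examples["translation"], examples["tgt_lang"])]
    # tokenizer.lang_code_to_id[lang] works on older transformers releases; the lang
    # codes are also ordinary tokens in the vocab, so convert_tokens_to_ids is an
    # alternative lookup.
    lang_tokens = [tokenizer.convert_tokens_to_ids(lang) for lang in examples["tgt_lang"]]
    return inputs, targets, lang_tokens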

Hey, thanks for sharing!
I think that helps a lot to ensure that the correct lang_id is attached to the correct input sequence. However, when it comes to training, how will the model know when to switch the output language token (in your case, between eng_Latn & hin_Deva; in my example, between eng_Latn & zho_Hans)?

The reason I ask is that during inference, after the model has been trained, we need to specify forced_bos_token_id to generate output.
Thus I am not sure whether there is a need/way to specify forced_bos_token_id while training.
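
For reference, inference with an explicit target language usually looks something like the sketch below (the checkpoint and language codes are just examples):

# Minimal inference sketch: forced_bos_token_id pins the output language at
# generation time. Checkpoint and language codes are examples only.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

inputs = tokenizer("Hi", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("zho_Hans"),
    max_new_tokens=30,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))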

Edit: after reading a related issue posted on Hugging Face's GitHub, I just realised that multilingual finetuning/training is not supported and is unlikely to be.

I know this is a late response, but I have been doing multilingual finetuning without issue (mostly). As long as the first token of the target sentence is the correct output lang_id, teacher forcing will include it after the first decoding step. The model does not have to correctly predict the output lang_id; it only has to correctly predict the output sequence given the correct lang_id.
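
A quick way to see this (my own snippet, assuming a recent transformers version in which the tokenizer already prepends the target lang code to the labels):

# Sanity check: the first label token should be the target language code, which is
# exactly the token teacher forcing then feeds to the decoder during training.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="zho_Hans"
)
batch = tokenizer("Hi", text_target="你好", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(batch["labels"][0].tolist()))
# expected: ['zho_Hans', ..., '</s>']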

As far as multiple languages in one batch go, I'll update here if my code works. I made a custom torch Dataset that prepends the lang_ids and tokenizes the sequences in the __getitem__ function. This way the sequences are properly tokenized before the dataloader batches them. I am still working on my training loop code for this functionality, so I am unsure what might go wrong.
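
Roughly, the idea is something like the sketch below (field names are illustrative; here I let the tokenizer insert the lang codes by setting src_lang / tgt_lang per example instead of prepending the ids by hand):

# Sketch of a per-example tokenizing dataset so one batch can mix language pairs.
# Pad later with a collator such as DataCollatorForSeq2Seq.
from torch.utils.data import Dataset


class MultilingualTranslationDataset(Dataset):
    def __init__(self, examples, tokenizer, max_length=128):
        # examples: list of dicts like
        # {"src_lang": "eng_Latn", "src_text": "Hi", "tgt_lang": "zho_Hans", "tgt_text": "你好"}
        self.examples = examples
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        # Set the language pair for this example so the tokenizer prepends the
        # matching lang codes to the input ids and the labels.
        self.tokenizer.src_lang = ex["src_lang"]
        self.tokenizer.tgt_lang = ex["tgt_lang"]
        return self.tokenizer(
            ex["src_text"],
            text_target=ex["tgt_text"],
            max_length=self.max_length,
            truncation=True,
        )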


I was just reminded of this and I realized I never followed up to say that it worked. I wasn’t able to get it to converge while training on more than one language at a time, but that may have been due to other factors. The model allowed it with no issue, so it could definitely work for batch inference.
