NLLB tokenizer multiple target/source languages within a training batch

From the docs, one way to load the NLLB tokenizer is:
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-1.3B", use_auth_token=False, src_lang="eng_Latn", tgt_lang="hin_Deva")

But this way, the tgt_lang assigned to the labels is hin_Deva for every example. Is there a way to set the tgt_lang code when calling the tokenizer, so that a training batch can contain multiple target languages? E.g.

# loading tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-1.3B", use_auth_token=False, src_lang="eng_Latn")
model_inputs = tokenizer(inputs, text_target=targets, target_langs=target_langs, max_length=max_length, truncation=True)

# where target_langs = ['hin_Deva', 'tur_Latn', ....]

Hi, does anyone have a solution to this?

To my understanding, one way to prepare the training data for finetuning is to feed the model the same sentence pair twice, but flipped, so that the model learns to translate in both directions.

E.g. (LHS is the input & RHS is the target):

{"eng_Latn": "Hi", "zho_Hans": "你好"}
{"zho_Hans": "你好", "eng_Latn": "Hi"}

But I can’t seem to find a way to do this with the Trainer API.

Not sure, but I found a workaround. I created a preprocess function that writes the tokenizer.lang_code_to_id[ex["tgt_lang"]] value into the first label position as part of the tokenization process.

def preprocess_function_helper(tokenizer, context_length=3):
    def preprocess_function(examples, max_length=128):
        # ... build inputs, targets and the per-example lang_tokens here ...
        model_inputs = tokenizer(
            inputs, text_target=targets, max_length=max_length, truncation=True
        )

        # Overwrite the first token id in each model_inputs['labels'] with the
        # corresponding target lang code id from lang_tokens
        for i in range(len(model_inputs['labels'])):
            model_inputs['labels'][i][0] = lang_tokens[i]
        # ...
        return model_inputs

    return preprocess_function

dataset = dataset.map(
    preprocess_function_helper(tokenizer=tokenizer, context_length=context_length),
    batched=True,
    remove_columns=["translation"],
    load_from_cache_file=False,
)

which then gets passed into the model. Hope that helps.
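
For anyone wondering what the elided parts might look like, here is a rough sketch, assuming each row carries the "translation" dict from the dataset plus a "tgt_lang" column, with English as the fixed source language (these assumptions are mine, not the original code):

# Hypothetical helper for the elided steps above: build inputs, targets and the
# per-example target lang-code ids. Column names are assumptions, not the thread's code.
def build_batch_fields(examples, tokenizer):
    inputs = [tr["eng_Latn"] for tr in examples["translation"]]
    targets = [tr[lang] for tr, lang in zip(examples["translation"], examples["tgt_lang"])]
    # tokenizer.lang_code_to_id[lang] works on older transformers releases; the lang
    # codes are also ordinary tokens in the vocab, so convert_tokens_to_ids is an
    # alternative lookup.
    lang_tokens = [tokenizer.convert_tokens_to_ids(lang) for lang in examples["tgt_lang"]]
    return inputs, targets, lang_tokens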

Hey, thanks for sharing!
I think that helps a lot to ensure that the correct lang_id is attached to the correct input sequence. However, when it comes to training, how will the model know when to switch the output language token (in your case, between eng_Latn & hin_Deva; in my example, between eng_Latn & zho_Hans)?

The reason I ask is that during inference, after the model has been trained, we need to specify forced_bos_token_id to generate output.
Thus I am not sure whether there is a need/way to specify forced_bos_token_id while training.
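
For reference, inference with an explicit target language usually looks something like the sketch below (the checkpoint and language codes are just examples):

# Minimal inference sketch: forced_bos_token_id pins the output language at
# generation time. Checkpoint and language codes are examples only.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

inputs = tokenizer("Hi", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("zho_Hans"),
    max_new_tokens=30,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))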

Edit: after reading a related issue posted on Hugging Face's GitHub, I just realised that multilingual finetuning/training is not supported and is unlikely to be.

I know this is a late response, but I have been doing multilingual finetuning without issue (mostly). As long as the first token of the target sentence is the correct output lang_id, teacher forcing will include it after the first decoding step. The model does not have to correctly predict the output lang_id; it only has to correctly predict the output sequence given the correct lang_id.
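
A quick way to see this (my own snippet, assuming a recent transformers version in which the tokenizer already prepends the target lang code to the labels):

# Sanity check: the first label token should be the target language code, which is
# exactly the token teacher forcing then feeds to the decoder during training.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="zho_Hans"
)
batch = tokenizer("Hi", text_target="你好", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(batch["labels"][0].tolist()))
# expected: ['zho_Hans', ..., '</s>']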

As far as multiple languages in one batch go, I'll update here if my code works. I made a custom torch Dataset that prepends the lang_ids and tokenizes the sequences in the __getitem__ function. This way the sequences are properly tokenized before the dataloader batches them. I am still working on my training loop code for this functionality, so I am unsure what might go wrong.
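
Roughly, the idea is something like the sketch below (field names are illustrative; here I let the tokenizer insert the lang codes by setting src_lang / tgt_lang per example instead of prepending the ids by hand):

# Sketch of a per-example tokenizing dataset so one batch can mix language pairs.
# Pad later with a collator such as DataCollatorForSeq2Seq.
from torch.utils.data import Dataset


class MultilingualTranslationDataset(Dataset):
    def __init__(self, examples, tokenizer, max_length=128):
        # examples: list of dicts like
        # {"src_lang": "eng_Latn", "src_text": "Hi", "tgt_lang": "zho_Hans", "tgt_text": "你好"}
        self.examples = examples
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        # Set the language pair for this example so the tokenizer prepends the
        # matching lang codes to the input ids and the labels.
        self.tokenizer.src_lang = ex["src_lang"]
        self.tokenizer.tgt_lang = ex["tgt_lang"]
        return self.tokenizer(
            ex["src_text"],
            text_target=ex["tgt_text"],
            max_length=self.max_length,
            truncation=True,
        )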


I was just reminded of this and I realized I never followed up to say that it worked. I wasn’t able to get it to converge while training on more than one language at a time, but that may have been due to other factors. The model allowed it with no issue, so it could definitely work for batch inference.
