From the docs, one way to load the NLLB tokenizer is:

```python
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-1.3B",
    use_auth_token=False,
    src_lang="eng_Latn",
    tgt_lang="hin_Deva",
)
```
But this way, the tgt_lang that gets assigned is hin_Deva for all labels. Is there a way to set the tgt_lang code when calling the tokenizer itself, so that a single training batch can contain multiple target languages?
To my understanding, one way to prepare the training data for finetuning is to feed the model each sentence pair twice, flipped, so that the model learns to translate in both directions.
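The pair-flipping idea above can be sketched in plain Python. This is a minimal illustration, assuming each raw example is a dict with `eng_Latn` and `zho_Hans` text fields (the field names and the `make_bidirectional` helper are mine, not from the original post):

```python
def make_bidirectional(examples):
    """Duplicate each sentence pair in both translation directions."""
    pairs = []
    for ex in examples:
        # forward direction: English -> Chinese
        pairs.append({"src": ex["eng_Latn"], "tgt": ex["zho_Hans"],
                      "src_lang": "eng_Latn", "tgt_lang": "zho_Hans"})
        # flipped direction: Chinese -> English
        pairs.append({"src": ex["zho_Hans"], "tgt": ex["eng_Latn"],
                      "src_lang": "zho_Hans", "tgt_lang": "eng_Latn"})
    return pairs
```

Each original example thus yields two training examples with opposite `src_lang`/`tgt_lang` codes.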
Not sure, but I found a workaround. I created a preprocess function that sets the first token id of each label sequence to the per-example tokenizer.lang_code_to_id[ex["tgt_lang"]] value as part of the tokenization process.
```python
def preprocess_function_helper(tokenizer, context_length=3):
    def preprocess_function(examples, max_length=128):
        # ...... (build `inputs` and `targets` from `examples`)
        model_inputs = tokenizer(
            inputs, text_target=targets, max_length=max_length, truncation=True
        )
        # Per-example target-language token ids
        lang_tokens = [tokenizer.lang_code_to_id[l] for l in examples["tgt_lang"]]
        # Update first token id in each model_inputs['labels'] with lang_tokens
        for i in range(len(model_inputs["labels"])):
            model_inputs["labels"][i][0] = lang_tokens[i]
        # .....
        return model_inputs

    return preprocess_function


dataset = dataset.map(
    preprocess_function_helper(tokenizer=tokenizer, context_length=context_length),
    batched=True,
    remove_columns=["translation"],
    load_from_cache_file=False,
)
```
The processed dataset then gets passed into the model. Hope that helps.
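The label-patching step can also be shown in isolation with dummy token ids. This is a self-contained sketch; the `patch_label_lang_ids` helper and the ids are illustrative, not from the real NLLB vocabulary:

```python
def patch_label_lang_ids(labels, lang_tokens):
    """Overwrite the first token id of each label sequence with the
    per-example target-language id, mirroring the preprocess step."""
    for i in range(len(labels)):
        labels[i][0] = lang_tokens[i]
    return labels
```

Given label sequences that all start with the same language id, this swaps in a different target-language id per example, which is what allows mixed-direction batches.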
Hey thanks for sharing!
I think that helps a lot to ensure that the correct lang_id is attached to the correct input sequence. However, when it comes to training, how will the model know when to switch the output language token id (in your case, between eng_Latn and hin_Deva, or in my example, between eng_Latn and zho_Hans)?
The reason I ask is that during inference, after the model has been trained, we need to specify forced_bos_token_id to generate output.
Thus I am not sure if there is a need/way to specify the forced_bos_token_id while training.
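For intuition, here is a toy illustration (not the actual transformers implementation) of what forced_bos_token_id does at generation time: the first token produced after the decoder start token is forced to the target-language code, regardless of the model's scores. The `force_bos` helper is my own sketch:

```python
def force_bos(step, candidate_token, forced_bos_token_id):
    """Return the token actually emitted at a given decoding step.

    step 0 is the first token after decoder_start_token_id; at that step
    the candidate chosen by the model is replaced by the forced id.
    """
    if step == 0:
        return forced_bos_token_id
    return candidate_token
```

During training there is no such forcing step: the language code is simply the first token of the labels, which is exactly what the workaround above rewrites.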
Edit: I just realised that multilingual finetuning/training is not supported and is unlikely to be supported, after reading a related issue posted on Hugging Face's GitHub.