Fine-tuning an NLLB model for a new language

CmdCody · April 3, 2024, 9:59pm

Hi, I’m new to huggingface, transformers and NLP in general. I found this article how to fine tune an NLLB model for a new language, which I followed and actually got some decent results.
However, the post used transformers version 4.33 where adding a new language token to the NLLB tokenizer was a bit hacky. Since then this PR was implemented which (I think) allows adding a new language to the tokenizer simply by doing this:

tokenizer = NllbTokenizer.from_pretrained('facebook/nllb-200-distilled-600M', 
        additional_special_tokens = FAIRSEQ_LANGUAGE_CODES + [new_language_code])

This worked fine so far and I have successfully fine-tuned the model again with transformers 4.38 on a parallel dataset of Northern Frisian and German sentences. Everything worked just as with the previous version, except now translation to Northern Frisian only works from German. When trying to translate e.g. an English sentence to Northern Frisian it just gets translated to German instead. In the old version translating English to Frisian worked perfectly fine.

I also noticed that the <mask> token isn’t the last one in the tokenizer, even though the code in the NllbTokenizer really looks like it should be. In the old version, part of adding the new language tag was to also move the mask token into the last spot again.

So the question is, am I missing something? Do I need to do more to the tokenizer (or the model) to correctly add the new language tag? Or is there something wrong with the NllbTokenizer?

cointegrated · June 18, 2024, 7:23pm

Do I need to do more to the tokenizer (or the model) to correctly add the new language tag? Or is there something wrong with the NllbTokenizer?

In my post, I used transformers==4.33, and I mentioned that my code for updating the tokenizer works only with this version. With version 4.34 there was an update that made my code no longer valid, and with version 4.38 came another update that totally changed how special tokens were handled in the NllbTokenizer. So its code now is different from what my Medium post relied upon.

If you want to follow the recipe from this post exactly, I strongly recommend sticking to transformers==4.33.
With the version 4.38 (and hopefully, multiple versions into the future), there is no longer need to call fix_tokenizer after each initialization. However, the encoder and decoder for special tokens are now much more stubborn, and I haven’t found any good way to shift the token ids in them but to fully delete and recreate them with this code.

CmdCody · June 18, 2024, 10:11pm

Thanks for the reply! I realize that your original code is tailored to transformers==4.33 but I was hoping to update to 4.38 and later versions at some point.

Tbh, I’m still not sure what exactly went wrong when I tried adding a new language token to the tokenizer in 4.38 since at some point it looked very much like I got everything in the tokenizer right, including moving the mask token to last position. But I guess that there are some more implementation details that I missed.

In any case, thanks a lot for the new code, I will try creating the updated tokenizer that way some time soon.

Tom9358 · December 30, 2024, 11:58am

Hi Cody, did you ever manage to get this to work?

CmdCody · December 31, 2024, 12:50pm

Unfortunately no. I tried to delete and recreate the token IDs as suggested, which initially led to a different ordering of the language token IDs, but even after fixing it I basically ended up with the same situation as before. I also dug around in the tokenizer a bit, but I don’t really understand the root cause of the issue.

jcuenod · January 7, 2025, 7:44am

Have you tried tokenizer.add_special_tokens({"additional_special_tokens": [lang_code]}, replace_additional_special_tokens=False)?

CmdCody · January 12, 2025, 2:25pm

I’m not sure what the difference is to loading the tokenizer with additional_special_tokens , but that does seem to work, thanks a lot!

system · January 13, 2025, 2:26am

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.

Topic		Replies	Views
How to add new language to NLLB tokenizer in Huggingface? 🤗Transformers	2	1913	September 30, 2023
Finetune a pretrained huggingface translation model on a new language pair Models	1	1031	January 12, 2024
Select Source and Target Langauge in multi-language translation models 🤗Transformers	1	371	August 14, 2024
LM finetuning on domain specific unlabelled data Beginners	6	4663	April 21, 2021
How to "further pretrain" a tokenizer (do I need to do so?) 🤗Tokenizers	5	4384	February 20, 2022

Fine-tuning an NLLB model for a new language

Related topics