Fine-tuning an NLLB model for a new language

Hi, I’m new to Hugging Face, Transformers, and NLP in general. I found this article on how to fine-tune an NLLB model for a new language, which I followed, and I actually got some decent results.
However, the post used transformers version 4.33, where adding a new language token to the NLLB tokenizer was a bit hacky. Since then, this PR has been implemented, which (I think) allows adding a new language to the tokenizer simply by doing this:

from transformers import NllbTokenizer
from transformers.models.nllb.tokenization_nllb import FAIRSEQ_LANGUAGE_CODES
tokenizer = NllbTokenizer.from_pretrained('facebook/nllb-200-distilled-600M',
        additional_special_tokens=FAIRSEQ_LANGUAGE_CODES + [new_language_code])

This worked fine so far, and I successfully fine-tuned the model again with transformers 4.38 on a parallel dataset of Northern Frisian and German sentences. Everything worked just as with the previous version, except that translation into Northern Frisian now only works from German. When I try to translate, for example, an English sentence into Northern Frisian, it just gets translated into German instead. In the old version, translating English to Frisian worked perfectly fine.
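
For context, translation is invoked roughly like this (a minimal sketch; the checkpoint path and the 'frr_Latn' code are placeholders for my fine-tuned model directory and the new language tag):

from transformers import AutoModelForSeq2SeqLM, NllbTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('path/to/finetuned-nllb')
tokenizer = NllbTokenizer.from_pretrained('path/to/finetuned-nllb')

tokenizer.src_lang = 'eng_Latn'  # language tag of the input sentence
inputs = tokenizer('Good morning!', return_tensors='pt')
generated = model.generate(
    **inputs,
    # force the decoder to start with the target language tag
    forced_bos_token_id=tokenizer.convert_tokens_to_ids('frr_Latn'),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))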

I also noticed that the <mask> token isn’t the last one in the tokenizer’s vocabulary, even though the code in the NllbTokenizer really looks like it should be. In the old version, part of adding the new language tag was also moving the mask token back into the last spot.
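
A quick way to check this (a small sketch, reusing the tokenizer created above):

print(len(tokenizer))                                       # total vocabulary size
print(tokenizer.convert_tokens_to_ids('<mask>'))            # id currently assigned to <mask>
print(tokenizer.convert_ids_to_tokens(len(tokenizer) - 1))  # token occupying the last id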

So the question is, am I missing something? Do I need to do more to the tokenizer (or the model) to correctly add the new language tag? Or is there something wrong with the NllbTokenizer?


Do I need to do more to the tokenizer (or the model) to correctly add the new language tag? Or is there something wrong with the NllbTokenizer?

In my post, I used transformers==4.33, and I mentioned that my code for updating the tokenizer works only with that version. With version 4.34, there was an update that made my code no longer valid, and with version 4.38 came another update that completely changed how special tokens are handled in the NllbTokenizer, so its code is now different from what my Medium post relied on.

If you want to follow the recipe from that post exactly, I strongly recommend sticking to transformers==4.33.
With version 4.38 (and, hopefully, multiple versions into the future), there is no longer any need to call fix_tokenizer after each initialization. However, the encoder and decoder for special tokens are now much more stubborn, and I haven’t found any good way to shift the token ids in them other than fully deleting and recreating them with this code.
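
To be concrete, these are the maps I mean (just inspecting them here, for a tokenizer created as in the question above; the actual deleting and recreating is in the linked code):

print(tokenizer.added_tokens_encoder)  # {token string: id} for the added special tokens
print(tokenizer.added_tokens_decoder)  # {id: AddedToken}; these ids cannot simply be shifted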


Thanks for the reply! I realize that your original code is tailored to transformers==4.33, but I was hoping to move to 4.38 and later versions at some point.

To be honest, I’m still not sure what exactly went wrong when I tried adding a new language token to the tokenizer in 4.38, since at some point it looked very much like I had gotten everything in the tokenizer right, including moving the mask token to the last position. But I guess there are some implementation details that I missed.

In any case, thanks a lot for the new code; I’ll try creating the updated tokenizer that way sometime soon.