I know this is a late response, but I have been doing multilingual finetuning without issue (mostly). As long as the first token of the target sentence is the correct output lang_id, teacher forcing will feed it to the decoder from the first decoding step onward. The model does not have to correctly predict the output lang_id; it only has to correctly predict the output sequence given the correct lang_id.
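For concreteness, here is a minimal sketch of what that looks like, assuming an mBART-50-style setup where labels begin with the language code and decoding starts from `</s>`; the token ids below are made up purely for illustration:

```python
import torch

# Hypothetical ids for illustration: pretend 250004 is the target lang_id
# and 2 is </s>; the real values come from your tokenizer.
lang_id, eos_id = 250004, 2
labels = torch.tensor([[lang_id, 571, 8922, 34, eos_id]])  # lang_id + sentence + </s>

# Teacher forcing: the decoder input at step t is the *gold* token at step t-1,
# so shifting the labels right puts the correct lang_id into the decoder input
# at position 1, regardless of what the model would have predicted there.
decoder_start_id = eos_id  # mBART-50-style models start decoding from </s>
decoder_input_ids = torch.cat(
    [torch.tensor([[decoder_start_id]]), labels[:, :-1]], dim=-1
)
# decoder_input_ids: [</s>, lang_id, 571, 8922, 34]
# labels:            [lang_id, 571, 8922, 34, </s>]
# The model is only scored on predicting each label given the gold prefix,
# and that prefix already contains the correct lang_id after the first step.
```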
As far as multiple languages in one batch go, I'll update here if my code works. I made a custom torch Dataset that prepends the lang_ids and tokenizes the sequences in the __getitem__ function, so each sequence is fully tokenized before the dataloader batches them (rough sketch below). I am still working on the training loop code for this, so I am unsure what might go wrong.
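In case it helps, here is a rough sketch of that kind of dataset (not my exact code), assuming an mBART-50 tokenizer and a recent transformers version that supports the `text_target` argument; the structure of `pairs` and its field names are just placeholders:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import MBart50TokenizerFast

class MultilingualPairs(Dataset):
    """Each item of `pairs` is assumed to look like
    {"src": "...", "tgt": "...", "src_lang": "en_XX", "tgt_lang": "ro_RO"}."""

    def __init__(self, pairs, tokenizer, max_len=128):
        self.pairs = pairs
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        ex = self.pairs[idx]
        # Setting the language codes per example makes the tokenizer attach the
        # right lang_ids, so each item is fully prepared before batching and
        # different language pairs can end up in the same batch.
        self.tokenizer.src_lang = ex["src_lang"]
        self.tokenizer.tgt_lang = ex["tgt_lang"]
        enc = self.tokenizer(
            ex["src"],
            text_target=ex["tgt"],
            max_length=self.max_len,
            padding="max_length",   # fixed length so the default collate can stack tensors
            truncation=True,
            return_tensors="pt",
        )
        # NOTE: in the training loop you would normally also replace pad ids
        # in enc["labels"] with -100 so they are ignored by the loss.
        return {k: v.squeeze(0) for k, v in enc.items()}

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
pairs = [{"src": "Hello", "tgt": "Salut", "src_lang": "en_XX", "tgt_lang": "ro_RO"}]
loader = DataLoader(MultilingualPairs(pairs, tokenizer), batch_size=2)
```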