NLLB tokenizer multiple target/source languages within a training batch

Hi, does anyone have a solution to this?

To my understanding, one way to prepare the training data for fine-tuning is to feed the model each sentence pair twice, but flipped, so that the model learns to translate in both directions.

E.g. (LHS is the input & RHS is the target):

{"eng_Latn": "Hi", "zho_Hans": "你好"}
{"zho_Hans": "你好", "eng_Latn": "Hi"}

But I can't seem to find a way to do this with the Trainer API, since the NLLB tokenizer's `src_lang`/`tgt_lang` appear to be set once rather than per example, and a batch can contain both directions.
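
The closest I've gotten is tokenizing one example at a time and pointing the tokenizer at that example's direction before encoding, then letting `DataCollatorForSeq2Seq` pad the batch. I'm not sure this is the intended approach; the checkpoint and hyperparameters below are just placeholders, and `dataset` is the one from the sketch above:

```python
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def preprocess(example):
    # Switch the tokenizer to this example's language direction
    # before encoding source and target.
    tokenizer.src_lang = example["src_lang"]
    tokenizer.tgt_lang = example["tgt_lang"]
    return tokenizer(example["src_text"],
                     text_target=example["tgt_text"],
                     truncation=True, max_length=128)

# batched=False (the default) so the language codes are set per example.
tokenized = dataset.map(preprocess, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="nllb-bidirectional",
                                  per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

As far as I understand, the language codes end up as tokens inside `input_ids` and `labels`, so a training batch mixing both directions should be fine; it's only at generation time that `forced_bos_token_id` pins a single target language. Is this correct, or is there a cleaner way to do it with the Trainer API?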