I have parallel corpora in 3 languages, A, B, and C, and I want to train/fine-tune a translation model for an N-to-N translation task (translating between all possible pairs of languages).
When looking at guides on Hugging Face or elsewhere on the web, all I can find are one-to-one fine-tuning guides; i.e., guides that show how to fine-tune for a single fixed source and fixed target language.
Looking at the documentation of M2M100, it says that each source and target text is prefixed with its language token. I understand that during training this can be beneficial for N-to-N training.
I want N-to-N training to run in a way that a single batch can contain any combination of directions: one forward pass could contain examples translating A → C, C → A, and B → C, for example, and the next batch could look very different. During training, prefixing the text with the language token would work, and I could perform N-to-N training, roughly as in the sketch below.
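For the training side this is roughly what I have in mind (just a sketch; the column names src_text / tgt_text / src_lang / tgt_lang and the checkpoint facebook/m2m100_418M are my own choices, not from any guide):

```python
from transformers import M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

def preprocess(example):
    # every example carries its own direction, so a batch can mix A->C, C->A, B->C, ...
    tokenizer.src_lang = example["src_lang"]   # e.g. the code of language B
    tokenizer.tgt_lang = example["tgt_lang"]   # e.g. the code of language A
    # input_ids become  [src_lang_token] <src text> </s>
    # labels become     [tgt_lang_token] <tgt text> </s>
    return tokenizer(example["src_text"], text_target=example["tgt_text"], truncation=True)

# applied per example, e.g. dataset.map(preprocess)
```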
But my main issue is evaluation with an eval dataset.
Ideally, when training the model we use Seq2SeqTrainer (see the setup sketch below), and one of its options is predict_with_generate, which should ideally be set to true since it gives better metric results during evaluation and closely mimics how the model will run at inference time. But with predict_with_generate the trainer calls model.generate, and generation always starts with decoder_start_token_id as the first decoder token (which can be some start_token or end_token, it doesn't really matter).
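For reference, the setup I mean is roughly this (train_ds / eval_ds stand for the tokenized splits from above; everything else is the standard recipe):

```python
from transformers import (
    DataCollatorForSeq2Seq,
    M2M100ForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

args = Seq2SeqTrainingArguments(
    output_dir="m2m100-n-to-n",
    predict_with_generate=True,   # eval calls model.generate instead of only teacher forcing
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
```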
For example, let's assume my eval batch has size 1 and contains a B → A example. The trainer will feed B_lang_token B end_token to the encoder and will start generating from the decoder with the initial token start_token. Given that in this eval example I want the model to generate a sentence in language A, how can I enforce that assumption? In theory, during eval my model doesn't know which language I intend to translate to, so it can either do B → A as desired, or it can suddenly do B → C.
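To make the problem concrete, this is effectively what evaluation boils down to for that one example (using de → en purely as a stand-in for B → A):

```python
# stand-in for the B -> A eval example
tokenizer.src_lang = "de"                              # "B"
encoded = tokenizer("Ein Beispielsatz.", return_tensors="pt")

# generation starts from decoder_start_token_id; nothing in `encoded` says the
# target should be "A", so the first generated token (the target language token)
# is entirely up to the model -- it may pick A, or just as well C
pred = model.generate(**encoded)
print(tokenizer.batch_decode(pred, skip_special_tokens=True))
```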
Now, I know you can provide a GenerationConfig to the training args and set forced_bos_token_id so that the model is guaranteed to start generating not just from start_token but from start_token A_lang_token, which would work well for our 1-example batch (see the sketch below). But a fixed forced_bos_token_id does not change between batches: the 2nd eval batch could actually be A → C, and the model would still be prompted with start_token A_lang_token, basically attempting an A → A translation. Or, with an eval batch size of 2 containing B → A and B → C, prompting the model to start generating from start_token A_lang_token for both examples is also wrong.
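In other words something like this, which pins one target language for the entire eval run (again a sketch, with en standing in for A):

```python
from transformers import GenerationConfig, Seq2SeqTrainingArguments

gen_config = GenerationConfig.from_pretrained("facebook/m2m100_418M")
gen_config.forced_bos_token_id = tokenizer.get_lang_id("en")   # always force "A"

args = Seq2SeqTrainingArguments(
    output_dir="m2m100-n-to-n",
    predict_with_generate=True,
    generation_config=gen_config,   # applied to every eval batch, whatever its direction
)
```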
I would also add that putting decoder_input_ids into the batch wouldn't work either, since in prediction_step the trainer drops them before generating. And if we provide only partial decoder_input_ids, it causes an issue when the evaluation loss is computed at outputs = model(**inputs), since the decoder_input_ids shape doesn't match the labels shape.
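A minimal illustration of that failure (toy token ids; model and tokenizer as in the snippets above):

```python
import torch

batch = {
    "input_ids": torch.tensor([[tokenizer.get_lang_id("de"), 9, 10, 11, tokenizer.eos_token_id]]),
    "attention_mask": torch.ones(1, 5, dtype=torch.long),
    # full reference translation, shape (1, 5)
    "labels": torch.tensor([[tokenizer.get_lang_id("en"), 7, 8, 9, tokenizer.eos_token_id]]),
    # only the generation prefix, shape (1, 2)
    "decoder_input_ids": torch.tensor(
        [[model.config.decoder_start_token_id, tokenizer.get_lang_id("en")]]
    ),
}

# the logits cover 2 positions while the labels cover 5, so the loss
# computation inside the model raises a shape-mismatch error
outputs = model(**batch)
```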
I tried looking all over the web and read the original papers of mBART, M2M100, NLLB, and more, but none of them explains how to do N-to-N evaluation properly.
Prompting big LLMs for help always results in very ugly, patched-together code that reimplements many methods of the HF trainer and the collators just to handle this dynamic situation, rewriting an entire method only to add 3 more lines and change 2 existing ones.
I wanted to know if there is a cleaner, simpler way that lets me use the existing Hugging Face components without subclassing and rewriting the entire training logic just to add one simple piece of functionality.
Thank you