I know this is a late response, but I have been doing multilingual finetuning without issue (mostly). As long as the first token of the target sentence is the correct output lang_id, teacher forcing will feed it to the decoder from the first decoding step onward. The model does not have to correctly predict the output lang_id; it only has to correctly predict the output sequence given the correct lang_id.
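For concreteness, here is a minimal sketch of what that looks like, assuming an mBART-50-style setup where labels begin with the language code and decoding starts from `</s>`; the token ids below are made up purely for illustration:

```python
import torch

# Hypothetical ids for illustration: pretend 250004 is the target lang_id
# and 2 is </s>; the real values come from your tokenizer.
lang_id, eos_id = 250004, 2
labels = torch.tensor([[lang_id, 571, 8922, 34, eos_id]])  # lang_id + sentence + </s>

# Teacher forcing: the decoder input at step t is the *gold* token at step t-1,
# so shifting the labels right puts the correct lang_id into the decoder input
# at position 1, regardless of what the model would have predicted there.
decoder_start_id = eos_id  # mBART-50-style models start decoding from </s>
decoder_input_ids = torch.cat(
    [torch.tensor([[decoder_start_id]]), labels[:, :-1]], dim=-1
)
# decoder_input_ids: [</s>, lang_id, 571, 8922, 34]
# labels:            [lang_id, 571, 8922, 34, </s>]
# The model is only scored on predicting each label given the gold prefix,
# and that prefix already contains the correct lang_id after the first step.
```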
As far as multiple languages in one batch go, I'll update here if my code works. I made a custom torch Dataset that prepends the lang_ids and tokenizes the sequences in the __getitem__ function, so each sequence is fully tokenized before the dataloader batches them (rough sketch below). I am still working on the training loop code for this, so I am unsure what might go wrong.
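In case it helps, here is a rough sketch of that kind of dataset (not my exact code), assuming an mBART-50 tokenizer and a recent transformers version that supports the `text_target` argument; the structure of `pairs` and its field names are just placeholders:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import MBart50TokenizerFast

class MultilingualPairs(Dataset):
    """Each item of `pairs` is assumed to look like
    {"src": "...", "tgt": "...", "src_lang": "en_XX", "tgt_lang": "ro_RO"}."""

    def __init__(self, pairs, tokenizer, max_len=128):
        self.pairs = pairs
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        ex = self.pairs[idx]
        # Setting the language codes per example makes the tokenizer attach the
        # right lang_ids, so each item is fully prepared before batching and
        # different language pairs can end up in the same batch.
        self.tokenizer.src_lang = ex["src_lang"]
        self.tokenizer.tgt_lang = ex["tgt_lang"]
        enc = self.tokenizer(
            ex["src"],
            text_target=ex["tgt"],
            max_length=self.max_len,
            padding="max_length",   # fixed length so the default collate can stack tensors
            truncation=True,
            return_tensors="pt",
        )
        # NOTE: in the training loop you would normally also replace pad ids
        # in enc["labels"] with -100 so they are ignored by the loss.
        return {k: v.squeeze(0) for k, v in enc.items()}

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
pairs = [{"src": "Hello", "tgt": "Salut", "src_lang": "en_XX", "tgt_lang": "ro_RO"}]
loader = DataLoader(MultilingualPairs(pairs, tokenizer), batch_size=2)
```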