How to train Translation Language Modeling (TLM) with transformers/examples/language-modeling/run_mlm.py?

Hello, I have a question and would appreciate your help. I want to train with the Translation Language Modeling (TLM) objective from XLM (paper: Cross-lingual Language Model Pretraining). TLM is very similar to Masked Language Modeling (MLM); the only difference is the form of the input data. If I want to use run_mlm.py to train with the TLM objective, can I just change how the training data is composed, without modifying the source code of transformers/examples/language-modeling/run_mlm.py? Is this feasible? :grimacing: :yum:

For example, for masked language modeling (MLM), each row of my training data is a sentence in a single language, as shown below:

( Row 1 ) polonium 's isotopes tend to decay with alpha or beta decay ( en ) .
( Row 2 ) 231 and penetrated the armour of the Panzer IV behind it ( en ) .
( Row 3 ) die Isotope von Polonium neigen dazu , mit dem Alpha- oder Beta-Zerfall zu zerfallen ( de ) .
( Row 4 ) 231 und durchbrach die Rüstung des Panzers IV hinter ihm ( de ) .
…

For translation language modeling (TLM), my training data is built by combining two parallel corpora: the sentences above are spliced together in parallel pairs, joined by the separator [/s] [/s], as shown below:

( Row 1 ) polonium 's isotopes tend to decay with alpha or beta decay ( en ) . [/s] [/s] die Isotope von Polonium neigen dazu , mit dem Alpha- oder Beta-Zerfall zu zerfallen ( de ) .
( Row 2 ) 231 and penetrated the armour of the Panzer IV behind it ( en ) . [/s] [/s] 231 und durchbrach die Rüstung des Panzers IV hinter ihm ( de ) .
…
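
For reference, here is a minimal sketch of how such a combined file could be built from two aligned files. This is my own data-prep helper, not part of run_mlm.py; the file names train.en, train.de, and train.tlm.txt are hypothetical, and the [/s] [/s] separator is the one from the examples above.

```python
# Minimal sketch of building the TLM-style training file (not part of run_mlm.py).
# "train.en", "train.de", and "train.tlm.txt" are hypothetical file names;
# "[/s] [/s]" is the separator used in the examples in this post.
SEPARATOR = " [/s] [/s] "

with open("train.en", encoding="utf-8") as f_en, \
        open("train.de", encoding="utf-8") as f_de, \
        open("train.tlm.txt", "w", encoding="utf-8") as f_out:
    for en_line, de_line in zip(f_en, f_de):
        en_line, de_line = en_line.strip(), de_line.strip()
        if not en_line or not de_line:
            continue  # skip empty or misaligned rows
        # One row per parallel pair: English sentence, separator, German sentence.
        f_out.write(en_line + SEPARATOR + de_line + "\n")
```

The resulting train.tlm.txt could then be passed to run_mlm.py as --train_file (with --line_by_line so each pair stays one training example), but I am not sure whether the separator will be treated the way the XLM paper intends, which is the point of my question.
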

If I only change the training data into such a combination of parallel corpora before running transformers/examples/language-modeling/run_mlm.py, will that achieve the effect of training with the TLM objective?

Looking forward to your help, thank you very much! :wink:

I have the same confusion. Could you tell me if you solved this issue?

I think it’s almost the same. The only gap should be in how the random masking is applied: if the mask lands on the special separator tokens, it would have some effect. But I think this could even be a positive effect, as the model could learn to distinguish the differences between the languages.
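
If you want to check this concretely, here is a minimal sketch (my own test, not part of run_mlm.py; the tokenizer name xlm-roberta-base and the example row are just illustrative choices) that runs one concatenated row through DataCollatorForLanguageModeling and prints which positions get selected for masking. Note that if the separator string in your file (e.g. [/s]) is not a special token of your tokenizer, it is tokenized like ordinary text and can be selected for masking like any other word piece.

```python
# Minimal sketch to inspect which positions the MLM collator selects for masking.
# "xlm-roberta-base" and the example row are only illustrative choices.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

row = ("polonium 's isotopes tend to decay with alpha or beta decay ( en ) . "
       "[/s] [/s] die Isotope von Polonium neigen dazu , mit dem Alpha- oder "
       "Beta-Zerfall zu zerfallen ( de ) .")

encoding = tokenizer(row, return_special_tokens_mask=True)
batch = collator([encoding])  # applies the random 15% masking

# Positions with labels != -100 are the ones selected for the MLM objective.
masked = (batch["labels"][0] != -100).nonzero(as_tuple=True)[0]
print("tokens after masking:",
      tokenizer.convert_ids_to_tokens(batch["input_ids"][0].tolist()))
print("selected positions:", masked.tolist())
print("original tokens at those positions:",
      tokenizer.convert_ids_to_tokens(batch["labels"][0][masked].tolist()))
```

Running this a few times should show whether the subwords of the separator ever appear among the selected positions for your tokenizer.
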