Hello, I have a question I'd like your help with. I want to train the Translation Language Modeling (TLM) objective from XLM (paper: Cross-lingual Language Model Pretraining). TLM is very similar to Masked Language Modeling (MLM); the only difference is the form of the input data. If I want to use run_mlm.py to train the TLM objective, can I just change the composition of the training data without modifying the source code of transformers/examples/language-modeling/run_mlm.py? Is this feasible?
For example, for masked language modeling (MLM), each row of my training data is a sentence in a single language, as shown below:
( Row 1 ) polonium 's isotopes tend to decay with alpha or beta decay ( en ) .
( Row 2 ) 231 and penetrated the armour of the Panzer IV behind it ( en ) .
( Row 3 ) die Isotope von Polonium neigen dazu , mit dem Alpha- oder Beta-Zerfall zu zerfallen ( de ) .
( Row 4 ) 231 und durchbrach die Rüstung des Panzers IV hinter ihm ( de ) .
…
For translation language modeling (TLM), my training data is a combination of two parallel corpora (i.e., the rows above are spliced together in pairs, with </s> </s> as the separator), as shown below:
( Row 1 ) polonium 's isotopes tend to decay with alpha or beta decay ( en ) . </s> </s> die Isotope von Polonium neigen dazu , mit dem Alpha- oder Beta-Zerfall zu zerfallen ( de ) .
( Row 2 ) 231 and penetrated the armour of the Panzer IV behind it ( en ) . </s> </s> 231 und durchbrach die Rüstung des Panzers IV hinter ihm ( de ) .
…
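To illustrate the pairing step, here is a minimal sketch of how such a TLM training file could be built from two parallel corpora. The function name `build_tlm_rows` and the in-memory lists are my own illustration, not part of run_mlm.py; the `</s>` separator is assumed to match the sequence separator of an XLM-style tokenizer.

```python
# Assumption: the tokenizer's sequence separator is </s>, so two of them
# mark the boundary between the source and target sentence of a pair.
SEP = "</s> </s>"

def build_tlm_rows(src_lines, tgt_lines):
    """Join each parallel sentence pair into a single TLM training row."""
    assert len(src_lines) == len(tgt_lines), "corpora must be parallel"
    return [f"{s.strip()} {SEP} {t.strip()}"
            for s, t in zip(src_lines, tgt_lines)]

en = ["polonium 's isotopes tend to decay with alpha or beta decay ( en ) ."]
de = ["die Isotope von Polonium neigen dazu , mit dem Alpha- oder "
      "Beta-Zerfall zu zerfallen ( de ) ."]

for row in build_tlm_rows(en, de):
    print(row)
```

Each resulting row can then be written to a plain-text file and passed to run_mlm.py as a single training example (e.g. with line-by-line processing), so that masking is applied across both halves of the pair.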
If I only change the training data into this paired parallel-corpus format before running transformers/examples/language-modeling/run_mlm.py, will that achieve the effect of training the TLM objective?
Looking forward to your help, thank you very much!