How big are the differences between transformer implementations?

Hello, does it make sense to train and compare different HF Transformers implementations like BERT and RoBERTa with the same training routine? The implementations seem very similar, and the main differences lie in the training routines described in the original papers.
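To make the question concrete, here is a minimal sketch of what I mean by "the same training routine" (assuming the Hugging Face transformers library and the public checkpoints "bert-base-uncased" and "roberta-base"; the function name train_mlm and the corpus variable are just placeholders for illustration):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling

def train_mlm(model_name, texts, epochs=1, lr=5e-5):
    # Only the checkpoint name changes; the routine itself is identical.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    encodings = [tokenizer(t) for t in texts]
    for _ in range(epochs):
        batch = collator(encodings)      # random 15% masking; labels are -100 elsewhere
        loss = model(**batch).loss       # standard masked-LM cross entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model

# Identical routine, two different architectures (corpus is a placeholder list of strings):
# train_mlm("bert-base-uncased", corpus)
# train_mlm("roberta-base", corpus)
```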

E.g., when I train ElectraForMaskedLM, how does that differ from training BertForMaskedLM? They use the same tokenizer, and the change in pre-training objective is not part of the ElectraForMaskedLM class (correct me if I am wrong).
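As far as I can tell, both classes expose the same forward interface and return the same kind of masked-LM loss. A minimal sketch (assuming the checkpoints "bert-base-uncased" and "google/electra-small-generator", which share the same WordPiece vocabulary):

```python
import torch
from transformers import AutoTokenizer, BertForMaskedLM, ElectraForMaskedLM

for name, cls in [("bert-base-uncased", BertForMaskedLM),
                  ("google/electra-small-generator", ElectraForMaskedLM)]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = cls.from_pretrained(name)

    inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
    labels = torch.full_like(inputs.input_ids, -100)          # -100 = ignored by the loss
    mask_positions = inputs.input_ids == tokenizer.mask_token_id
    labels[mask_positions] = tokenizer.convert_tokens_to_ids("paris")

    # Plain masked-LM cross entropy for both classes; nothing here is
    # specific to ELECTRA's replaced-token-detection pre-training.
    outputs = model(**inputs, labels=labels)
    print(name, float(outputs.loss))
```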

The documentation for ElectraForMaskedLM notes: "Even though both the discriminator and generator may be loaded into this model, the generator is the only model of the two to have been trained for the masked language modeling task."
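My understanding of that note, sketched in code (assuming the "google/electra-small-*" checkpoints): the generator checkpoint pairs with ElectraForMaskedLM, while the discriminator checkpoint pairs with ElectraForPreTraining, the replaced-token-detection head.

```python
from transformers import ElectraForMaskedLM, ElectraForPreTraining

# Generator: trained with the masked language modeling objective.
generator = ElectraForMaskedLM.from_pretrained("google/electra-small-generator")

# Discriminator: trained to detect replaced tokens, not to predict masked ones.
discriminator = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")
```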