Hello, does it make sense to train and compare different Hugging Face Transformers implementations, such as BERT and RoBERTa, with the same training routine? The model implementations seem very similar, and the main differences are the training routines described in the original papers.
For example, when I train ElectraForMaskedLM, how does it differ from training BertForMaskedLM? Both use the same tokenizer, and the change in pre-training objective (replaced-token detection) is not part of the class ElectraForMaskedLM (correct me if I am wrong)?
Even though both the discriminator and the generator may be loaded into ElectraForMaskedLM, the generator is the only one of the two that was actually trained on the masked language modeling task.
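To illustrate the point of the question: BertForMaskedLM and ElectraForMaskedLM do expose the same masked-LM training interface, so the same routine can drive both. Here is a minimal sketch using tiny, randomly initialized configs (my own hypothetical sizes, not the published checkpoints) rather than pretrained weights, so nothing is downloaded:

```python
# Sketch: both MLM classes accept the same inputs and return the same
# MaskedLMOutput (loss + per-token vocabulary logits).
# The config sizes below are arbitrary toy values for illustration.
import torch
from transformers import (
    BertConfig, BertForMaskedLM,
    ElectraConfig, ElectraForMaskedLM,
)

vocab_size = 100
bert = BertForMaskedLM(BertConfig(
    vocab_size=vocab_size, hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64))
electra = ElectraForMaskedLM(ElectraConfig(
    vocab_size=vocab_size, embedding_size=32, hidden_size=32,
    num_hidden_layers=2, num_attention_heads=2, intermediate_size=64))

# Dummy batch: 2 sequences of 8 token ids; labels as in standard MLM.
input_ids = torch.randint(0, vocab_size, (2, 8))
labels = input_ids.clone()

for model in (bert, electra):
    out = model(input_ids=input_ids, labels=labels)
    # Identical output structure for both architectures:
    print(type(out).__name__, tuple(out.logits.shape))
```

In a real comparison you would swap in the pretrained checkpoints (e.g. `bert-base-uncased` and the ELECTRA generator) and a `DataCollatorForLanguageModeling` for masking, but the training loop itself would not need to change between the two models.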