Difference between RoBERTa and BERT for pretraining

I wanted to pre-train a BERT model on my own dataset, and while following this how-to-train blog post I came across RoBERTa.
After reading up on the differences, I don't see anything that really determines which model I should choose.
The MLM data collator already masks dynamically, byte-level BPE vs. WordPiece shouldn't make much of a difference, batch size and number of epochs can easily be adjusted, and the model architecture is essentially identical.
So if I can just use BertConfig and ignore the NSP task, why should I choose RobertaConfig in the script? Am I missing something?
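
To make it concrete, here is roughly what I have in mind, a minimal sketch (the tokenizer path and hyperparameters are placeholders, not values from the blog post): using the BERT classes with a masked-LM-only head and the standard collator for dynamic masking, so NSP never enters the picture.

```python
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
)

# Placeholder path to a tokenizer trained on my own corpus.
tokenizer = BertTokenizerFast.from_pretrained("./my-wordpiece-tokenizer")

# Illustrative config; sizes are assumptions for a small model.
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=12,
)

# BertForMaskedLM only has the MLM head, so no NSP objective is trained.
model = BertForMaskedLM(config)

# The collator re-masks tokens every time a batch is built, i.e. dynamic masking.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```

As far as I can tell, swapping BertConfig/BertForMaskedLM for RobertaConfig/RobertaForMaskedLM here changes almost nothing, which is exactly why I'm asking.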