Difference between RoBERTa and BERT for pretraining

I wanted to pre-train a BERT model on my own dataset, and while following this how-to-train blog post I came across RoBERTa.
After reading up on the differences, I don't see anything that really determines which model I should choose.
The MLM data collator already masks dynamically, byte-level BPE vs. WordPiece shouldn't make much of a difference, batch size and number of epochs can easily be adjusted, and the model architecture is essentially identical.
So if I can just use BertConfig and ignore the NSP task, why should I choose RobertaConfig in the script? Am I missing something?
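
To make it concrete, here is roughly what I have in mind, a minimal sketch (the tokenizer path and hyperparameters are placeholders, not values from the blog post): using the BERT classes with a masked-LM-only head and the standard collator for dynamic masking, so NSP never enters the picture.

```python
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
)

# Placeholder path to a tokenizer trained on my own corpus.
tokenizer = BertTokenizerFast.from_pretrained("./my-wordpiece-tokenizer")

# Illustrative config; sizes are assumptions for a small model.
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=12,
)

# BertForMaskedLM only has the MLM head, so no NSP objective is trained.
model = BertForMaskedLM(config)

# The collator re-masks tokens every time a batch is built, i.e. dynamic masking.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```

As far as I can tell, swapping BertConfig/BertForMaskedLM for RobertaConfig/RobertaForMaskedLM here changes almost nothing, which is exactly why I'm asking.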