Framework for Continual Pretraining

I am running continual (domain-adaptive) pretraining of bert-base-uncased on domain data, with the goal of improving accuracy on downstream domain-specific tasks. So far I am not seeing any improvement on those tasks from the domain-adapted model, even after fine-tuning, so I want to confirm that I'm using Transformers correctly. Specifically:

  1. Should I use BertForMaskedLM or BertForPreTraining as my starting model?
  2. Should I train a tokenizer from scratch on the domain data or use a preexisting one?
  3. Is having ~1M text records (each around a paragraph in length) roughly enough for continual pretraining?
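For context, here is a minimal sketch of the masked-LM objective I believe continual pretraining uses. To keep it self-contained I instantiate a tiny randomly initialized `BertConfig` instead of downloading weights; in my actual run I would load `BertForMaskedLM.from_pretrained("bert-base-uncased")` and train with `Trainer` plus `DataCollatorForLanguageModeling` (the hyperparameters below are placeholders, not my real config):

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Tiny config so this runs offline; for real continual pretraining use:
#   model = BertForMaskedLM.from_pretrained("bert-base-uncased")
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertForMaskedLM(config)

# Fake batch of token ids. MLM labels use -100 at unmasked positions,
# which the cross-entropy loss ignores; only masked positions are scored.
input_ids = torch.randint(0, config.vocab_size, (2, 8))
labels = input_ids.clone()
labels[:, :6] = -100  # pretend only the last two positions were masked

out = model(input_ids=input_ids, labels=labels)
print(out.loss)                 # scalar MLM loss over the masked positions
print(tuple(out.logits.shape))  # (batch, seq_len, vocab_size)
```

Is this the right objective and model class for my use case, or should the NSP head from `BertForPreTraining` also be involved?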