Pretraining Models from Scratch vs Further Training


I want to pretrain an Arabic BERT model on domain-specific data for one particular task: classifying citizen reviews of government services into the relevant government sectors. From what I have read, domain-specific models tend to outperform general-purpose ones on in-domain tasks. My plan is therefore to pretrain the model on freely available Arabic newspaper articles that cover the same sectors as the government services (education, healthcare, etc.). I know these articles are not a close match for the target domain, but they are the only suitable data I have found. Because I am limited in time and computational resources, I plan to pretrain on only about 20K articles. The target dataset contains about 2K citizen reviews written in Modern Standard Arabic.

So, I have several questions concerning this project:

  1. Would it be beneficial to pretrain an Arabic BERT model from scratch on this small dataset of 20K articles, or is that too little data for my problem?

  2. Would it be better to further pretrain an existing Arabic BERT model, that is, start from its pretrained weights and continue pretraining on the 20K articles? I am afraid this will cause the model to forget what it learned previously. I am also concerned that mixing general and domain-specific knowledge might hurt performance on the target dataset of citizen reviews.

  3. Whichever of the two approaches I choose, should I pretrain the model on unlabeled data (unsupervised learning), or is it better to train it on labeled data so that it is useful for text classification?

  4. After pretraining, should I use feature extraction (freeze the model and train only a classifier on its outputs) or fine-tuning (update the model's weights as well) on the target dataset of citizen reviews?
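
To make question 3 concrete, here is a minimal sketch of how I understand BERT-style masked language modeling to create its own training labels from unlabeled text. It is only an illustration under several assumptions: the function name `mask_tokens`, the toy vocabulary, and the whitespace tokenization are all hypothetical (a real BERT model uses a WordPiece tokenizer and works on token IDs), and the 15% masking rate with the 80/10/10 replacement split is the scheme described for the original BERT.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Build (inputs, labels) for masked language modeling.

    Hypothetical helper for illustration only. Each token is selected with
    probability `mask_prob`. A selected token is replaced by [MASK] 80% of
    the time, by a random vocabulary token 10% of the time, and left
    unchanged 10% of the time. The label is the original token at selected
    positions and None elsewhere (positions that are ignored by the loss).
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model has to predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK_TOKEN)
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)  # not selected, so no prediction target
    return inputs, labels


if __name__ == "__main__":
    # Toy example: an unlabeled sentence split on whitespace (an assumption;
    # a real tokenizer would produce subword units instead).
    sentence = "the new hospital improved healthcare services in the city"
    toy_vocab = ["school", "hospital", "ministry", "services", "city"]
    masked, targets = mask_tokens(sentence.split(), toy_vocab, seed=0)
    print(masked)
    print(targets)
```

The point of the sketch is that the 20K newspaper articles would need no sector labels for this stage, since the prediction targets come from the text itself. The sector labels on the 2K citizen reviews would only be used in the later classification step asked about in question 4.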