Domain-specific pretraining with BERT models vs. smaller model architectures

I have around 4K target examples of Arabic citizen reviews of government services, and I want to apply transfer learning to improve performance on the target task, which is classifying the reviews into the relevant government sectors (education, healthcare, etc.). I plan to use a source dataset of 30-40K Arabic newspaper articles that cover the same sectors as the government ones, including education and healthcare. I know these articles are not very specific to the target domain, but they are the only suitable data available.

My plan is to compare fine-tuning an Arabic BERT model on the target data against pretraining a model from scratch on the source data and then fine-tuning it on the target data.
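
To make the comparison concrete, the direct fine-tuning baseline I have in mind looks roughly like the sketch below (Hugging Face Transformers; the checkpoint name, CSV path, and number of sectors are placeholders for my actual setup):

```python
# Sketch of the direct fine-tuning baseline: Arabic BERT -> sector classifier.
# The checkpoint name, CSV path, and number of sectors are placeholders.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

model_name = "aubmindlab/bert-base-arabertv2"  # placeholder Arabic BERT checkpoint
num_sectors = 6                                # e.g. education, healthcare, ...

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_sectors
)

# Reviews stored as a CSV with a "text" column and an integer "label" column.
reviews = load_dataset("csv", data_files={"train": "reviews_train.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

reviews = reviews.map(tokenize, batched=True)

Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="arabert-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=reviews["train"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),  # dynamic padding per batch
).train()
```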

So my question is: should I apply Task-Adaptive Pretraining (TAPT), i.e. further pretrain the Arabic BERT model on the task-related data (the 30-40K newspaper articles) and then fine-tune it on the target data, and compare that against fine-tuning the Arabic BERT model directly on the target data without any further pretraining?
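
By further pretraining I mean continuing the masked-language-model objective on the articles and then fine-tuning the resulting checkpoint exactly as in the sketch above, roughly like this (again, the checkpoint name, file path, and hyperparameters are placeholders):

```python
# Sketch of the TAPT step: continue masked-language-model pretraining of the
# same Arabic BERT checkpoint on the 30-40K newspaper articles. The saved
# checkpoint would then be fine-tuned on the reviews as in the previous sketch.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "aubmindlab/bert-base-arabertv2"  # placeholder Arabic BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
mlm_model = AutoModelForMaskedLM.from_pretrained(model_name)

# Newspaper articles stored one per line in a plain-text file.
articles = load_dataset("text", data_files={"train": "newspaper_articles.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

articles = articles.map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style masking: 15% of tokens are masked for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

Trainer(
    model=mlm_model,
    args=TrainingArguments(
        output_dir="arabert-tapt",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        save_strategy="epoch",
    ),
    train_dataset=articles["train"],
    data_collator=collator,
).train()
```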

Or is it better to pretrain a smaller model architecture (an LSTM, or maybe a lite BERT variant) from scratch on the source data, to account for its small size, and then fine-tune that model on the target data of citizen reviews?
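
For the lite-BERT version of this from-scratch option, what I picture is initializing a reduced BERT configuration with random weights and pretraining it on the same article corpus, something like the sketch below; it reuses `tokenizer`, `articles`, and `collator` from the TAPT sketch above, and all the sizes are just illustrative guesses for a 30-40K-article corpus:

```python
# Sketch of pretraining a much smaller BERT-style model from scratch on the
# source articles (weights randomly initialized; only the pretrained tokenizer
# is reused). Assumes `tokenizer`, `articles`, and `collator` from the TAPT
# sketch above; all sizes below are illustrative.
from transformers import BertConfig, BertForMaskedLM, Trainer, TrainingArguments

small_config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,          # vs. 768 in BERT-base
    num_hidden_layers=4,      # vs. 12 in BERT-base
    num_attention_heads=4,
    intermediate_size=1024,
)
small_model = BertForMaskedLM(small_config)  # no pretrained weights

Trainer(
    model=small_model,
    args=TrainingArguments(
        output_dir="small-bert-scratch",
        num_train_epochs=10,             # a small model on a small corpus may need more epochs
        per_device_train_batch_size=32,
    ),
    train_dataset=articles["train"],
    data_collator=collator,
).train()
```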

Also, is it valid to compare direct fine-tuning with pretraining from scratch using different model architectures? Or should I use the same architecture for the comparison to be a fair experiment?