Suitable Data for Task Adaptive Pretraining (TAPT)

I want to pretrain an Arabic BERT model on domain-specific data to adapt it to a specific problem: classifying citizen reviews about government services into the relevant government sectors. My plan is to pretrain the model on freely available Arabic newspaper articles that cover the same sectors as the government ones, including education, healthcare, etc. I know these articles are not very specific to the target domain, but they are the only suitable data available.

I want to apply Task Adaptive Pretraining (TAPT), that is, to pretrain on task-specific data. What confuses me is this: should I apply TAPT by further pretraining on the newspaper articles, or should I use more task-specific data drawn from the same distribution as the target data (that is, collect more citizen reviews about government services for pretraining)?

What exactly is meant by task-specific data? Also, if I pretrain the Arabic BERT model from scratch on the newspaper articles, can I refer to it as Domain Specific Pretraining (DSPT)? Or should DSPT be applied with data that is domain-specific but can serve multiple tasks?

P.S. The newspaper data consists of only about 40-50K articles, and the target dataset contains about 2-4K citizen reviews written in Modern Standard Arabic.
