Continue pre-training of Greek BERT with a domain-specific dataset [clarified]


I want to further pre-train Greek BERT from the library on a domain-specific, unannotated dataset using the MLM task. The downstream task is Named Entity Recognition, for which I have a second, annotated dataset that I will then use to fine-tune the model. I found that the library also provides scripts for this. In the example, RoBERTa is further trained on wikitext-2-raw-v1. As I saw here, the dataset is formatted as:

“text”: “” The gold dollar or gold one @-@ dollar …

However, when I downloaded the dataset from the link provided on that site, I saw that the texts simply follow one another, separated by titles enclosed in = signs.

My question is: what format should the dataset that I pass to the script for further pre-training of BERT have, and how should it be provided as train and dev sets?
If there is any source on this, it would be very helpful.
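
To make the question concrete, here is a minimal sketch of what I have in mind, not a confirmed recipe. It assumes that each non-empty line of the raw file (titles in = signs included) becomes one "text" record, as the dataset preview seems to suggest, and that the last part of the corpus is held out as the dev set. The function names, the 10% dev fraction and the file names are my own hypothetical choices; whether the script accepts files in this form is exactly what I am unsure about.

```python
import json


def lines_to_records(raw_text):
    """Turn raw text into one {"text": ...} record per non-empty line."""
    records = []
    for line in raw_text.splitlines():
        stripped = line.strip()
        if stripped:  # blank lines only separate paragraphs, so skip them
            records.append({"text": stripped})
    return records


def split_train_dev(records, dev_fraction=0.1):
    """Hold out the final part of the corpus as dev (assumed strategy).

    The split is contiguous, so lines of the same document stay together
    as far as possible. At least one record always goes to dev.
    """
    n_dev = max(1, int(len(records) * dev_fraction))
    return records[:-n_dev], records[-n_dev:]


def write_jsonl(records, path):
    """Write one JSON object per line, keeping Greek characters readable."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

With my own corpus I would then call something like `train, dev = split_train_dev(lines_to_records(raw_text))`, followed by `write_jsonl(train, "train.json")` and `write_jsonl(dev, "dev.json")`, and point the script at those two files.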

P.S. BERT was pre-trained on two tasks, MLM and NSP. Since my downstream task is sequence labeling, I thought that I should continue pre-training with the MLM task only.