Continue pre-training of Greek BERT with domain specific dataset [clarified]

hf4nlp · February 26, 2021, 1:52pm

Hello,

I want to further pre-train the library's Greek BERT on a domain-specific dataset using the MLM task; this dataset is not annotated. The downstream task is Named Entity Recognition, for which I have a second, annotated dataset that I will use to fine-tune the model afterwards. I found that the library provides scripts for this as well. In the example, RoBERTa is further trained on wikitext-2-raw-v1. As I saw here, the dataset is formatted as:

{
"text": "\" The gold dollar or gold one @-@ dollar …"
}

However, when I downloaded the dataset from the link provided on that site, I saw that the texts simply follow one another, separated by titles enclosed in = signs.

My question is: what format should the dataset I provide to the script for further pre-training of BERT have, and how should the data be provided as train and dev sets?
If there is any documentation or example on this, a pointer would be very helpful.
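
To make the question concrete, here is a minimal sketch of what I imagine the preparation could look like. It assumes the script accepts plain-text files with one passage per line, which is my guess and not something I have confirmed. The names `split_corpus` and `write_lines`, the output file names, the 10% dev fraction, and the sample passages are all hypothetical.

```python
import random
from pathlib import Path


def split_corpus(passages, dev_fraction=0.1, seed=13):
    """Shuffle passages and split them into (train, dev) lists."""
    # Drop empty passages and collapse internal whitespace/newlines,
    # so that every passage fits on exactly one line of the output file.
    cleaned = [" ".join(p.split()) for p in passages if p.strip()]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(cleaned)
    # Keep at least one passage for dev (hypothetical choice).
    n_dev = max(1, int(len(cleaned) * dev_fraction))
    return cleaned[n_dev:], cleaned[:n_dev]


def write_lines(path, lines):
    """Write one passage per line, UTF-8 encoded."""
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Placeholder passages; the real ones would come from the domain corpus.
    passages = [
        "First domain passage.",
        "Second passage,\nspread over two lines.",
        "Third domain passage.",
    ]
    train, dev = split_corpus(passages)
    out_dir = Path("mlm_data")  # hypothetical output directory
    out_dir.mkdir(exist_ok=True)
    write_lines(out_dir / "train.txt", train)
    write_lines(out_dir / "dev.txt", dev)
```

If this is roughly right, the two resulting files would be what I pass to the script as the training and validation data.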

P.S. BERT was pre-trained on two tasks, MLM and NSP. Since my downstream task is sequence labeling, I thought I should continue the pre-training with the MLM task only.
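
For reference, by MLM-only training I mean that the sole objective is recovering masked tokens. Below is a toy sketch of BERT-style masking, assuming the usual recipe from the original BERT setup: about 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. The function `mask_tokens` and the toy vocabulary are hypothetical, and this works on string tokens only for illustration; it is not the library's implementation.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mlm_probability=0.15, seed=0):
    """Return (inputs, labels) for a toy MLM example.

    labels[i] is the original token where a prediction is required,
    and None where the position is ignored by the loss.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            labels.append(tok)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)  # position not selected, no loss here
            inputs.append(tok)
    return inputs, labels


if __name__ == "__main__":
    toy_vocab = ["gold", "dollar", "coin", "mint", "one"]  # hypothetical vocabulary
    sentence = "the gold dollar was a coin struck by the mint".split()
    masked, targets = mask_tokens(sentence, toy_vocab)
    print(masked)
    print(targets)
```

There is no sentence-pair objective anywhere in this setup, which is why I think it fits a sequence labeling downstream task.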
