Continue pre-training of Greek BERT with a domain-specific dataset

Hello,

I want to further pre-train the library's Greek BERT on a domain-specific dataset with the MLM task to improve results. The downstream task for BERT will be sequence classification. I found that the library also provides scripts for this. In the example, RoBERTa is further trained on wikitext-2-raw-v1. As I saw here, the dataset is formatted as:

{
  "text": "\" The gold dollar or gold one @-@ dollar …"
}

However, when I downloaded the dataset from the link provided on that site, I saw that the texts simply follow one another, separated by titles enclosed in = signs.

My question is: what format should the dataset I will use to further pre-train BERT have, and how should it be provided as train and dev sets?
If there is any source, it would be very helpful.

P.S. BERT was pre-trained on two tasks, MLM and NSP. Since my downstream task is sequence labeling, I thought I should continue the pre-training with just the MLM task.

You should fine-tune the model on whatever task you want to perform. If you'd like to use it for sequence classification, then that's what you should train it on, i.e. exchange the head for a sequence-classification one.
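In code, "exchanging the head" just means loading the checkpoint into a model class with a classification head on top. A minimal sketch, assuming the Greek BERT checkpoint id below and a placeholder number of labels:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "nlpaueb/bert-base-greek-uncased-v1"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The encoder weights come from the checkpoint; the classification head
# on top is freshly initialized and learned during fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
```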

This should be of help: BERT — transformers 4.3.0 documentation

This may help you understand how to format your input: BERT Fine-Tuning Tutorial with PyTorch · Chris McCormick (the tokenizer does most of the heavy lifting for you)
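By "heavy lifting" I mean that the tokenizer adds the special tokens, truncates/pads, and builds the attention mask for you. Something like this (checkpoint id assumed):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")  # assumed checkpoint id
enc = tokenizer(
    "Ένα παράδειγμα πρότασης.",  # an example sentence
    padding="max_length",
    truncation=True,
    max_length=32,
    return_tensors="pt",
)
print(enc["input_ids"].shape, enc["attention_mask"].shape)  # both torch.Size([1, 32])
```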

@neuralpat thank you for your answer, but I am afraid that my post was not clear.

I have an annotated dataset to fine-tune BERT for sequence labeling (Named Entity Recognition), and a much larger dataset that is not annotated but comes from the same domain. I want to continue the pre-training of BERT on this in-domain dataset, to see whether it helps BERT perform better on NER.

Maybe I have to rewrite my OP.

You’re right, I did misunderstand your OP.

On the page I linked you can see how to run BERT with any supported head. There is a head for pre-training, which you can use to continue pre-training BERT: BERT — transformers 4.3.0 documentation

If my understanding is correct, this should enable you to do exactly what you want (train on both MLM and NSP).
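For illustration, loading that combined pre-training head looks roughly like this (the Greek BERT checkpoint id below is assumed; adjust it to the model you are using). The output carries both the MLM scores and the NSP scores:

```python
from transformers import AutoTokenizer, BertForPreTraining

model_name = "nlpaueb/bert-base-greek-uncased-v1"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BertForPreTraining.from_pretrained(model_name)

# Encode a sentence pair, since NSP expects a sentence A / sentence B input.
inputs = tokenizer("Πρόταση Α.", "Πρόταση Β.", return_tensors="pt")
outputs = model(**inputs)

print(outputs.prediction_logits.shape)        # MLM scores: (1, seq_len, vocab_size)
print(outputs.seq_relationship_logits.shape)  # NSP scores: (1, 2)
```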

Thank you @neuralpat, but is it really that simple? I mean, does the MLM head implement the BERT pre-training algorithm, i.e. mask 15% of the input tokens and so on, and continue the pre-training of the model?

I believe that I have to use the script run_mlm.py to continue with the pre-training.

Also, since my downstream task will be Named Entity Recognition, should I also continue the pre-training of BERT on the NSP task, or just on MLM?

Yes, I believe so. I could be wrong, I’m kinda new to this myself, but I don’t see anything indicating otherwise. It being “too easy” is generally a feeling I get when using huggingface. It feels like cheating :wink:

You can test all the things you’re thinking about: just MLM, MLM and NSP, just fine-tuning… What works best is usually only found out empirically.
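On your masking question: as far as I can tell, in run_mlm.py it is the data collator rather than the model head that applies the dynamic 15% masking, so you don't have to implement any of that yourself. A minimal sketch (checkpoint id assumed):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")  # assumed checkpoint id
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # the standard BERT masking rate
)

batch = collator([tokenizer("Ένα παράδειγμα κειμένου.")])
# Masked positions keep their original token id in batch["labels"];
# every other position is set to -100 and ignored by the loss.
print(batch["input_ids"])
print(batch["labels"])
```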


You are quite right. I will post again either way, since my OP was confusing in the first place.

Thanks.

As mentioned by @neuralpat in this comment, it’s up to you whether to continue pre-training on MLM, NSP, or both. In fact, some approaches like RoBERTa questioned the necessity of the NSP objective during pre-training.
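For reference, MLM-only continued pre-training with the Trainer API (roughly what run_mlm.py does under the hood) could look like the sketch below. It assumes plain-text train.txt / dev.txt files with one document per line and the usual Greek BERT checkpoint id; once training finishes, you would fine-tune the saved checkpoint for NER with a token-classification head.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "nlpaueb/bert-base-greek-uncased-v1"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One document per line; each line becomes one example under the "text" column.
raw = load_dataset("text", data_files={"train": "train.txt", "validation": "dev.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking at the standard 15% rate, applied on the fly per batch.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="greek-bert-domain-mlm",
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
)

trainer.train()
trainer.save_model("greek-bert-domain-mlm")  # fine-tune this checkpoint for NER afterwards
```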