I want to further pre-train the Greek BERT model from the library on a domain-specific dataset with the MLM task to improve results. The downstream task will be sequence classification. I found that the library also provides scripts for that. In the example, RoBERTa is further trained on wikitext-2-raw-v1. As I saw here, the dataset is formatted as:
“text”: “” The gold dollar or gold one @-@ dollar …
However, when I downloaded the dataset from the link provided on that site, I saw that the texts follow one another, separated by titles enclosed in = signs.
My question is, what format should the dataset on which I will further pre-train BERT have, and how should it be provided as train and dev sets?
If there is any source, it would be very helpful.
p.s. BERT was pre-trained on two tasks, MLM and NSP. Since my downstream task is Sequence Labeling, I thought I should continue the pre-training with just the MLM task.
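On the format question, here is a minimal sketch of how the plain-text files could be prepared. I'm assuming `run_mlm.py` from the transformers examples, which accepts raw `.txt` files via `--train_file` / `--validation_file`; by default it concatenates the text and chunks it into `max_seq_length` blocks, and with `--line_by_line` it treats each line as one sample instead. The file names, the 90/10 split, and the placeholder passages are my own choices, not requirements:

```python
import random

# Placeholder in-domain passages; in practice, one document/passage per line.
passages = [f"Domain-specific passage number {i}." for i in range(10)]

# A simple 90/10 train/dev split (assumed ratio, adjust as needed).
random.Random(42).shuffle(passages)
split = int(0.9 * len(passages))

with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(passages[:split]) + "\n")
with open("dev.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(passages[split:]) + "\n")

# Assumed invocation (flags taken from the examples README; the model id
# is the published Greek BERT checkpoint):
# python run_mlm.py \
#   --model_name_or_path nlpaueb/bert-base-greek-uncased-v1 \
#   --train_file train.txt --validation_file dev.txt \
#   --do_train --do_eval --output_dir ./greek-bert-domain
```

The titles wrapped in `=` that you saw in wikitext are just part of that dataset's raw text; your own corpus does not need them.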
You should fine-tune the model on whatever task you want to perform. If you'd like to use it for sequence classification, then that's what you should train it on, i.e., swap the head for a sequence-classification one.
@neuralpat thank you for your answer, but I am afraid that my post was not clear.
I have an annotated dataset to fine-tune BERT for Sequence Labeling (Named Entity Recognition), and a much larger dataset that is not annotated but is from the same domain. I want to continue the pre-training of BERT on the unannotated in-domain dataset, to see whether it helps BERT perform better on NER.
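For what it's worth, the two-stage pipeline I have in mind would roughly look like this, assuming the example scripts from transformers (`run_mlm.py` for the domain-adaptive pre-training, `run_ner.py` from the token-classification examples for the NER fine-tune); all file paths and output directories here are placeholders:

```shell
# Stage 1: continue MLM pre-training on the unlabeled in-domain corpus.
python run_mlm.py \
  --model_name_or_path nlpaueb/bert-base-greek-uncased-v1 \
  --train_file domain_corpus_train.txt \
  --validation_file domain_corpus_dev.txt \
  --do_train --do_eval \
  --output_dir ./greek-bert-domain

# Stage 2: fine-tune the adapted checkpoint on the annotated NER data.
python run_ner.py \
  --model_name_or_path ./greek-bert-domain \
  --train_file ner_train.json \
  --validation_file ner_dev.json \
  --do_train --do_eval \
  --output_dir ./greek-bert-domain-ner
```

The key point is that stage 2 loads the checkpoint written by stage 1 instead of the original hub model.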
Thank you @neuralpat, but is it really that simple? I mean, does the MLM head implement the BERT pre-training algorithm (masking 15% of the input tokens, etc.) and continue the pre-training of the model?
I believe that I have to use the script run_mlm.py to continue with the pre-training.
Also, since my downstream task will be Named Entity Recognition, should I continue the pre-training of BERT on the NSP task as well, or just on MLM?
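On the masking question: as far as I understand it, the masking is not done by the MLM head itself but by the data collator that `run_mlm.py` uses (`DataCollatorForLanguageModeling` with `mlm_probability=0.15`); the head only predicts the original tokens at the corrupted positions. Here is a pure-Python sketch of BERT's masking recipe (select 15% of the non-special tokens; of those, 80% become `[MASK]`, 10% a random token, 10% are left unchanged). The toy vocabulary and token strings are mine, for illustration only:

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "mat", "dog"]  # toy vocabulary for random swaps

def mask_tokens(tokens, special=("[CLS]", "[SEP]"), p=0.15, rng=random):
    """Return (corrupted tokens, {position: original token} labels)."""
    # Pick 15% of the non-special positions as prediction targets.
    candidates = [i for i, t in enumerate(tokens) if t not in special]
    n_mask = max(1, round(p * len(candidates)))
    targets = rng.sample(candidates, n_mask)
    out, labels = list(tokens), {}
    for i in targets:
        labels[i] = tokens[i]      # the model must predict the original token
        r = rng.random()
        if r < 0.8:                # 80%: replace with [MASK]
            out[i] = MASK
        elif r < 0.9:              # 10%: replace with a random token
            out[i] = rng.choice(VOCAB)
        # remaining 10%: keep the original token unchanged
    return out, labels

tokens = "[CLS] the cat sat on the mat [SEP]".split()
corrupted, labels = mask_tokens(tokens, rng=random.Random(0))
```

In the real script you wouldn't write this yourself; the collator handles it (including dynamic re-masking each epoch) when you pass your tokenized dataset to the `Trainer`.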
Yes, I believe so. I could be wrong, I'm kinda new to this myself, but I don't see anything indicating otherwise. It being "too easy" is generally a feeling I get when using huggingface. It feels like cheating.
You can test all the things you're thinking about. Just MLM, MLM and NSP, just fine-tuning… What works best is usually only found out empirically.
As mentioned by @neuralpat in this comment, it's up to you whether to continue pretraining on MLM, NSP, or both. In fact, some approaches like RoBERTa questioned the necessity of the NSP objective during pretraining.