Continue pre-training of Greek BERT with domain specific dataset

hf4nlp · February 25, 2021, 12:52pm

Hello,

I want to further pre-train the library's Greek BERT on a domain-specific dataset with the MLM task to improve results. The downstream task of BERT will be sequence classification. I found that the library also provides scripts for that. In the example, RoBERTa is further trained on wikitext-2-raw-v1. As I saw here, the dataset is formatted as:

{
"text": "\" The gold dollar or gold one @-@ dollar …"
}

However, when I downloaded the dataset from the link provided on that site, I saw that the texts in the dataset come one after the other, separated by titles enclosed in = signs.

My question is: what format should the dataset for further pre-training BERT have, and how should it be provided as train and dev sets?
If there is any source on this, it would be very helpful.

P.S. BERT was pre-trained on two tasks, MLM and NSP. Since my downstream task is sequence labeling, I thought that I should continue the pre-training with just the MLM task.

neuralpat · February 26, 2021, 6:09am

You should fine-tune the model on whatever task you want to perform. If you’d like to use it for sequence classification, then that’s what you should train it on, i.e. exchange the head for a sequence-classification one.

This should be of help: BERT — transformers 4.3.0 documentation

This may help you understand how to format your input: BERT Fine-Tuning Tutorial with PyTorch · Chris McCormick (the tokenizer does most of the heavy lifting for you)

hf4nlp · February 26, 2021, 7:19am

@neuralpat thank you for your answer, but I am afraid that my post was not clear.

I have an annotated dataset to fine-tune BERT for sequence labeling (Named Entity Recognition), and a much larger dataset that is not annotated but comes from the same domain. I want to continue the pre-training of BERT on this unannotated in-domain dataset, to see whether it will help BERT perform better in NER.

Maybe I have to rewrite my OP.

neuralpat · February 26, 2021, 12:02pm

You’re right, I did misunderstand your OP.

On the page I linked you can see how to run BERT with any supported head. There is a head for pre-training, which you can use to continue pre-training BERT: BERT — transformers 4.3.0 documentation

If my understanding is correct, this should enable you to do exactly what you want (train both MLM and NSP).

hf4nlp · February 26, 2021, 12:35pm

Thank you @neuralpat, but is it really that simple? I mean, does the MLM head implement the BERT pre-training algorithm, mask 15% of the input tokens, etc., and continue the pre-training of the model?

I believe that I have to use the script run_mlm.py to continue with the pre-training.

Also, since my downstream task will be Named Entity Recognition, should I also continue the pre-training of BERT on the NSP task or just on MLM?

neuralpat · February 26, 2021, 1:20pm

Yes, I believe so. I could be wrong, I’m kinda new to this myself, but I don’t see anything indicating otherwise. It being “too easy” is generally a feeling I get when using Hugging Face. It feels like cheating.

You can test all the things you’re thinking about: just MLM, MLM and NSP, just fine-tuning… What works best is usually only found out empirically.
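On the masking question raised above: as far as I know, the MLM head itself only scores tokens and computes a loss against the labels it is given. The masking happens earlier, when batches are built (run_mlm.py relies on DataCollatorForLanguageModeling and its mlm_probability setting for this). The sketch below is not the library's code. It is a plain-Python illustration of the masking recipe from the BERT paper, where about 15% of tokens are selected, and of those 80% become [MASK], 10% become a random token and 10% stay unchanged. The function name mask_tokens, the toy vocabulary and the example tokens are all made up for illustration.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = ("[CLS]", "[SEP]", "[PAD]")


def mask_tokens(tokens, vocab, mlm_probability=0.15, rng=None):
    """Apply BERT-style masking to a list of string tokens.

    Returns (masked_tokens, labels). labels[i] is the original token at
    positions the model must predict, and None everywhere else.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Special tokens are never selected for prediction.
        if token in SPECIAL_TOKENS:
            continue
        # Select roughly mlm_probability of the remaining tokens.
        if rng.random() >= mlm_probability:
            continue
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN          # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


if __name__ == "__main__":
    # Hypothetical, already-tokenized input and a toy vocabulary.
    example = ["[CLS]", "the", "patient", "was", "given", "aspirin", "[SEP]"]
    toy_vocab = ["the", "patient", "was", "given", "aspirin", "doctor"]
    print(mask_tokens(example, toy_vocab, rng=random.Random(0)))
```

Because the selection is random, the masked positions change every time a sequence is processed, which is also why masking at batch time (rather than once, in advance) gives the model different views of the same text across epochs.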

hf4nlp · February 26, 2021, 1:35pm

You are quite right. I will post again either way, since my OP was confusing in the first place.

Thanks.

kaankork · August 6, 2021, 1:44pm

As mentioned by @neuralpat in this comment, it’s up to you whether to continue pretraining on MLM, NSP or both. In fact, some approaches like RoBERTa questioned the necessity of the NSP objective during pretraining.
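On the original question about data format: for MLM-only pre-training no sentence pairs are needed, so plain text is enough. To my knowledge run_mlm.py accepts plain text files for training and validation, so one workable layout is one document (or paragraph) per line, split into a train file and a dev file. The sketch below shows one way to produce that split. The helper name write_mlm_split, the file names train.txt and dev.txt, the 10% dev fraction and the sample documents are all assumptions made for illustration, not something the script requires.

```python
import random
import tempfile
from pathlib import Path


def write_mlm_split(documents, out_dir, dev_fraction=0.1, seed=13):
    """Write train.txt and dev.txt with one document per line.

    Whitespace (including newlines) inside a document is collapsed to
    single spaces so that every document occupies exactly one line.
    """
    docs = [" ".join(doc.split()) for doc in documents]
    docs = [doc for doc in docs if doc]  # drop empty documents
    if len(docs) < 2:
        raise ValueError("need at least two non-empty documents to split")

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(docs)
    n_dev = max(1, int(len(docs) * dev_fraction))

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    train_path = out / "train.txt"
    dev_path = out / "dev.txt"
    dev_path.write_text("\n".join(docs[:n_dev]) + "\n", encoding="utf-8")
    train_path.write_text("\n".join(docs[n_dev:]) + "\n", encoding="utf-8")
    return train_path, dev_path


if __name__ == "__main__":
    # Hypothetical unannotated in-domain documents.
    sample = ["First domain document.\nSecond line.", "Another document.",
              "A third one.", "   "]
    with tempfile.TemporaryDirectory() as tmp:
        train_file, dev_file = write_mlm_split(sample, tmp, dev_fraction=0.34)
        print(train_file.read_text(encoding="utf-8"))
        print(dev_file.read_text(encoding="utf-8"))
```

The two files can then be passed to run_mlm.py through its --train_file and --validation_file arguments, with --model_name_or_path pointing at the Greek BERT checkpoint. The script also has a --line_by_line option that treats each line as a separate sequence instead of concatenating the text into fixed-size chunks. Which of the two works better for a given corpus is something to check empirically.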
