Continue pre-training Greek BERT with a domain-specific dataset

Hello,

I want to further pre-train Greek BERT on a domain-specific dataset, and the library provides scripts for this. There is also a BERT model, BertForPreTraining, which has a head for masked language modeling and a head for next sentence prediction.

Can this model be used for continued pre-training as well?
If it can, should I use the script or the model?

Hi,

Yes. The script is only for masked language modeling (MLM), so you would have to modify it if you also want to perform next sentence prediction (NSP).

But what you could do is the following:

  1. First use the run_mlm.py script to continue pre-training Greek BERT on your domain-specific dataset with the masked language modeling objective.
  2. Define a BertForPreTraining model (which includes both the masked language modeling head and the next sentence prediction head), load in the weights of the model you trained in step 1, and then train it on the next sentence prediction task (a minimal sketch of step 2 follows below).
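
A minimal sketch of step 2 could look like the following (the checkpoint path and the toy sentence pair are placeholders; a real run would use properly masked inputs and a sentence-pair dataset):

import torch
from transformers import AutoTokenizer, BertForPreTraining

# Hypothetical path: the --output_dir you used with run_mlm.py in step 1.
checkpoint = "/tmp/test-mlm"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Loading an MLM checkpoint into BertForPreTraining keeps the encoder and
# MLM head weights; the NSP head is newly initialized (expect a warning).
model = BertForPreTraining.from_pretrained(checkpoint)

# One toy example: a sentence pair with an NSP label.
encoding = tokenizer("Πρώτη πρόταση.", "Δεύτερη πρόταση.", return_tensors="pt")
outputs = model(
    **encoding,
    labels=encoding["input_ids"],          # MLM labels; in real training, mask inputs and set unmasked positions to -100
    next_sentence_label=torch.tensor([0]), # 0 = sentence B actually follows sentence A
)
outputs.loss.backward()  # combined MLM + NSP loss; wrap this in a proper training loop or Trainer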

@nielsr thank you for your reply.

So if I understand correctly, you suggest preferring the script for the MLM task. My downstream task is NER, and I have a second (smaller) annotated dataset to subsequently fine-tune the model. Since my downstream task is NER, I don’t think I need to also pre-train the model on the NSP task.
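
For context, my plan for the subsequent NER fine-tuning is roughly the sketch below (the checkpoint path and the tag set are placeholders for my actual setup):

from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical values: the run_mlm.py output dir and my NER tag set.
checkpoint = "/tmp/test-mlm"
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The token classification head is newly initialized and gets trained on the
# smaller annotated NER dataset (e.g. with Trainer or the run_ner.py script).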

p.s. Could you briefly explain why you prefer the script over the BertForPreTraining model for MLM?

If you only want to perform MLM, then you don’t need BertForPreTraining; you only need BertForMaskedLM. The script is very easy to use: you only need to specify your text files and it runs!

In your case, this will look something like:

python run_mlm.py \
    --model_name_or_path nlpaueb/bert-base-greek-uncased-v1 \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm

Of course, a script is a bit of a black box in the sense that you don’t know exactly how the training happens under the hood, but it’s much faster than writing a training script or notebook yourself.
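
To give some intuition about what happens inside, a simplified manual version could look roughly like this (file names and hyperparameters are placeholders, not the script’s exact code):

from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "nlpaueb/bert-base-greek-uncased-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical file names: plain-text files, one text per line.
raw = load_dataset("text", data_files={"train": "train.txt", "validation": "val.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# The collator applies BERT-style dynamic masking (15% of tokens by default).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="/tmp/test-mlm"),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
)
trainer.train()
trainer.evaluate()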

You are right @nielsr, I tested the script and it works.

I saw that the script uses AutoModelForMaskedLM, as you suggested, and I assume it implements the masking procedure that Devlin et al. used to train BERT. The script also takes parameters, which makes it less of a black box.
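
As a quick check of that assumption, the masking applied by DataCollatorForLanguageModeling (which I believe the script uses under the hood) can be inspected directly; it follows the BERT recipe of selecting about 15% of tokens and replacing most of them with [MASK]:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

batch = collator([tokenizer("Ένα παράδειγμα πρότασης για τη μάσκα.")])
print(tokenizer.decode(batch["input_ids"][0]))  # some tokens replaced by [MASK]
print(batch["labels"][0])                       # -100 everywhere except the selected positions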

You have been very helpful.


Hey hf4nlp, I’m currently stuck with a problem similar to yours. I want to further pre-train a BERT model with domain-specific data (cooking domain) and then fine-tune it on a specific downstream task. Is there any chance you could post a link to your GitHub repository (if there is one)?


Hello @PaschiSt, sorry for my late reply.

Unfortunately, I don’t have a repo for that, and I didn’t manage to collect the data I had planned for continuing the pre-training.

But I remember that, as I mentioned in an earlier comment, the script ran successfully. So if you follow the detailed description on the script’s page and give the input data to the script in the expected format (if I remember correctly, one text per line), you won’t have a problem with the pre-training.
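
For example, a minimal train file in that format can be created like this (the contents are just placeholders):

# run_mlm.py expects plain-text files with one text (sentence or short document) per line.
texts = [
    "Πρώτο κείμενο από τον τομέα σας.",
    "Δεύτερο κείμενο από τον τομέα σας.",
]
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(texts) + "\n")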

Hope this helps somehow.