Continue pre-training Greek BERT with a domain-specific dataset


I want to further pre-train Greek BERT on a domain-specific dataset, and the library provides scripts for this. There is also a BERT model class, BertForPreTraining, which has one head for masked language modeling and one for next sentence prediction.

Can this model be used for continuing pre-training as well?
If it can, should I use the script or the model?


Yes. The script only covers masked language modeling (MLM), so you would have to modify it if you also want to perform next sentence prediction (NSP).

But what you could do is the following:

  1. First use the script to continue pre-training Greek BERT on your domain-specific dataset for masked language modeling.
  2. Define a BertForPreTraining model (which includes both the masked language modeling head and a next sentence prediction head, i.e. a binary sequence classification head), load in the weights of the model that you trained in step 1, and then train on the next sentence prediction task.
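
A minimal sketch of the weight hand-over in step 2, under stated assumptions: a tiny, randomly initialised BertConfig stands in for the real model so that the example runs offline, and a temporary directory stands in for the output directory of step 1. In practice you would pass your own step 1 output directory to from_pretrained. The NSP head is not part of a masked-LM checkpoint, so it starts from random weights and still has to be trained.

```python
import tempfile

import torch
from transformers import BertConfig, BertForMaskedLM, BertForPreTraining

# Assumption: a tiny random model replaces Greek BERT so the sketch runs offline.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)

with tempfile.TemporaryDirectory() as step1_dir:
    # Stand-in for step 1: a masked-LM model saved to disk.
    BertForMaskedLM(config).save_pretrained(step1_dir)
    # Step 2: load the same weights into a model that has both heads.
    # The NSP head (and the pooler) are missing from the checkpoint,
    # so they are newly initialised here.
    model = BertForPreTraining.from_pretrained(step1_dir)

model.eval()
input_ids = torch.randint(0, config.vocab_size, (2, 8))
with torch.no_grad():
    outputs = model(input_ids=input_ids)

mlm_logits_shape = tuple(outputs.prediction_logits.shape)       # (batch, seq_len, vocab)
nsp_logits_shape = tuple(outputs.seq_relationship_logits.shape)  # (batch, 2)
```

Training on NSP would additionally require sentence-pair inputs with a next_sentence_label, which this sketch does not build.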

@nielsr thank you for your reply.

So, if I understand correctly, you suggest using the script for the MLM task. My downstream task is NER, and I have a second (smaller) annotated dataset to fine-tune the model on afterwards. Given that, I don’t think I need to pre-train the model on the NSP task as well.

P.S. Could you briefly explain why the script is preferable to the BertForPreTraining model for MLM?

If you only want to perform MLM, then you don’t need to use BertForPreTraining, you only need BertForMaskedLM. The script is very easy to use as you only need to specify your text files and it runs!

In your case, this will look something like:

python \
    --model_name_or_path nlpaueb/bert-base-greek-uncased-v1 \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm

Of course, a script is a bit of a black box, in the sense that you don’t see exactly how training happens, but it’s much faster than writing a script or notebook yourself.

You are right @nielsr, I tested the script and it works.

I saw that the script uses AutoModelForMaskedLM, as you suggested, and I assume that it implements the algorithm that Devlin et al. used to train BERT. The script also takes parameters, which makes it less of a black box.
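
For reference, here is a rough sketch of the masking rule described in the BERT paper, which DataCollatorForLanguageModeling also applies by default: about 15% of the tokens are selected as prediction targets, and of those 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged. The function name, its parameters, and the token ids below are made up for illustration; this is plain Python, not the library’s implementation.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, seed=None):
    """Return (inputs, labels) for MLM; labels are -100 where no loss is computed."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Special tokens such as [CLS] and [SEP] are never selected.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


# Hypothetical ids: 101/102 play the role of [CLS]/[SEP], 103 of [MASK].
example_ids = [101] + list(range(1000, 1200)) + [102]
masked_inputs, mlm_labels = mask_tokens(
    example_ids, mask_id=103, vocab_size=30522, special_ids={101, 102}, seed=0
)
```

One difference from the original BERT setup worth keeping in mind: the data collator draws a fresh mask each time a batch is built, and the script trains on MLM only, without NSP.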

You have been very helpful.


Hey hf4nlp, I’m currently stuck with a problem similar to yours. I want to further pre-train a BERT model with domain-specific data (cooking-domain), and then fine-tune it to do a specific downstream task. Is there a chance that you post a link to your Github repository (if there is one)?


Hello @PaschiSt, sorry for my late reply.

Unfortunately, I don’t have a repo for that, and I didn’t manage to collect the data I had planned to use for continuing the pre-training.

But I remember that, as I mentioned in my previous comment, the script ran successfully. So, if you follow the detailed description on the scripts’ page and give the input data to the script accordingly (one text per line, if I remember correctly), you shouldn’t have a problem with the pre-training.
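
A small sketch of preparing such files, assuming the one-text-per-line layout mentioned above. The helper name, the file names, and the 10% validation split are illustrative choices, not something the script requires. As far as I know, the transformers MLM example script (run_mlm.py) also has a --line_by_line option that treats every line as a separate sequence instead of concatenating the texts, but check the current script before relying on that.

```python
import random
import tempfile
from pathlib import Path


def write_line_per_text(texts, out_dir, validation_fraction=0.1, seed=0):
    """Write train.txt and validation.txt with exactly one text per line."""
    # Collapse internal newlines and whitespace so each text stays on one line;
    # drop texts that are empty.
    cleaned = [" ".join(t.split()) for t in texts if t.strip()]
    random.Random(seed).shuffle(cleaned)
    n_val = max(1, int(len(cleaned) * validation_fraction))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    train_path = out_dir / "train.txt"
    val_path = out_dir / "validation.txt"
    val_path.write_text("\n".join(cleaned[:n_val]) + "\n", encoding="utf-8")
    train_path.write_text("\n".join(cleaned[n_val:]) + "\n", encoding="utf-8")
    return train_path, val_path


# Usage with made-up texts: one contains a line break, one is empty.
sample_texts = [f"domain text number {i}" for i in range(8)]
sample_texts += ["a text\nwith a line break", "   "]

with tempfile.TemporaryDirectory() as tmp:
    train_path, val_path = write_line_per_text(sample_texts, tmp)
    train_lines = train_path.read_text(encoding="utf-8").splitlines()
    val_lines = val_path.read_text(encoding="utf-8").splitlines()
```

The two resulting paths are what you would pass as --train_file and --validation_file.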

Hope that this helped somehow.

Hi @nielsr, @hf4nlp, this link no longer works: scripts

Is this the new link: transformers/ at main · huggingface/transformers · GitHub?

I would appreciate a response. Thanks in advance.

Yes that’s correct!

Thanks for the reply @nielsr.
I am actually trying to get the embedding of a sentence. Before using the BERT model directly, I want to fine-tune it on my specific domain and then get the embeddings (pooler_output from AutoModel).
This was my plan:

  1. Fine-tune a masked language model on the specific domain
  2. Load this fine-tuned model into AutoModel, and then get the embeddings

However, the embeddings I got from the second step don’t seem to be correct.
As a simple check, for a particular sentence I computed the cosine similarity between the embedding from step 2 alone and the embedding from step 2 preceded by step 1. The similarity was low, which I don’t think it should be, since I only fine-tuned on a small dataset (500 data points).

Code -

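import torch
import datasets
from datasets import load_dataset
from torch.nn import CosineSimilarity
from transformers import (
    AutoModel,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumption: the checkpoint name was not included in the post; substitute your own.
checkpoint = "bert-base-uncased"
bert_tokenizer = AutoTokenizer.from_pretrained(checkpoint)

#step 1: fine-tune with masked language modeling
dataset = load_dataset("csv", data_files='.fintech_inputs_n500.csv', split=datasets.Split.TRAIN)
# The call around the tokenizer was garbled in the post; a batched map is the likely original.
tok_oup = dataset.map(lambda x: bert_tokenizer(x['text'], padding='max_length'), batched=True)
tok_oup = tok_oup.remove_columns('Unnamed: 0')
tok_oup.set_format("torch", columns=["input_ids", 'token_type_ids', 'attention_mask'])

data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments("test-trainer3")

# The Trainer call was cut off in the post; the arguments below are a reconstruction.
trainer = Trainer(
    model=AutoModelForMaskedLM.from_pretrained(checkpoint),
    args=training_args,
    train_dataset=tok_oup,
    data_collator=data_collator,
)
trainer.train()
trainer.save_model()  # saves to the output directory, "test-trainer3"

inputs = ["no you say that if i make a late payment there is no late fee"]
inputs = bert_tokenizer(inputs, padding='max_length', return_tensors='pt')

#step 2 (reconstructed): load the fine-tuned weights into AutoModel
with torch.no_grad():
    Masked_auto_pooler = AutoModel.from_pretrained("test-trainer3")(**inputs).pooler_output

#step2 , if no step1 (reconstructed): embeddings from the original checkpoint
with torch.no_grad():
    auto_pooler = AutoModel.from_pretrained(checkpoint)(**inputs).pooler_output

cos = CosineSimilarity(dim=0, eps=1e-6)
cos(auto_pooler[0], Masked_auto_pooler[0])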

Can you suggest an improvement to this approach, or another approach for my task?
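
One possible cause worth checking, offered as a hypothesis rather than a confirmed diagnosis: the masked-LM variant of BERT is built without the pooling layer, so a checkpoint saved from it contains no pooler weights. When such a checkpoint is loaded into AutoModel, the pooler is newly initialised (transformers normally logs a warning listing the newly initialised weights), and pooler_output is then not comparable with that of the original checkpoint. A common alternative is to average last_hidden_state over the non-padding tokens. The sketch below shows that pooling on a toy tensor; the function name is illustrative.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over real (non-padding) tokens only."""
    # (batch, seq_len) -> (batch, seq_len, 1) so it broadcasts over the hidden size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # guard against all-padding rows
    return summed / counts


# Toy input: one sentence, three positions, hidden size 2; the last position is padding.
toy_hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(toy_hidden, toy_mask)
```

With a real model you would pass outputs.last_hidden_state and the tokenizer’s attention_mask, and compare the pooled vectors of the two models instead of their pooler_output.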

Does this script pre-train BERT from scratch, or does it perform continued pre-training when given a model name such as “bert-base-uncased”? I am a little confused about that. Thank you!