Continue pre-training Greek BERT with a domain-specific dataset

Hello,

I want to further pre-train Greek BERT on a domain-specific dataset, and the library provides scripts for this. There is also a BERT model, BertForPreTraining, which has a head for masked language modeling and a head for next sentence prediction.

Can this model be used for continuing pre-training as well?
If it can, should I use the script or the model?

Hi,

Yes, the script only performs masked language modeling (MLM), so you would have to modify it if you also want to perform next sentence prediction.

But what you could do is the following:

  1. First use the run_mlm.py script to continue pre-training Greek BERT on your domain-specific dataset with the masked language modeling objective.
  2. Define a BertForPreTraining model (which includes both the masked language modeling head and a sequence classification head for NSP), load in the weights of the model that you trained in step 1, and then train on the next sentence prediction task (see the sketch below).
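
For step 2, a minimal sketch of the weight loading (the checkpoint path is just a placeholder for the output directory of step 1; you would still need to prepare sentence-pair data and a training loop for NSP):

from transformers import BertForPreTraining

# Load the MLM-trained weights from step 1 into a model that has both the MLM
# head and the NSP (sequence classification) head; the NSP head will be freshly
# initialized and can then be trained on the next sentence prediction task.
# "path/to/step1-output-dir" is a placeholder for your own checkpoint path.
model = BertForPreTraining.from_pretrained("path/to/step1-output-dir")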

@nielsr thank you for your reply.

So if I get it right, you suggest preferring the script for the MLM task. My downstream task is NER, and I have a second (smaller) annotated dataset to subsequently fine-tune the model on. Since my downstream task is NER, I don’t think I also need to pre-train the model on the NSP task.

P.S. Could you briefly explain why you would prefer the script over the BertForPreTraining model for MLM?

If you only want to perform MLM, then you don’t need BertForPreTraining; you only need BertForMaskedLM. The script is very easy to use: you only need to specify your text files and it runs!

In your case, this will look something like:

python run_mlm.py \
    --model_name_or_path nlpaueb/bert-base-greek-uncased-v1 \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm

Of course, a script is a bit of a black box in the sense that you don’t know exactly how training happens, but it’s much faster than writing a script or notebook yourself.

You are right @nielsr, I tested the script and it works.

I saw that the script uses, as you suggested, AutoModelForMaskedLM, and I assume it implements the masked language modeling procedure that Devlin et al. used to train BERT. The script also takes parameters, which makes it less of a black box.
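
For illustration, here is a minimal sketch of the masking I believe the script applies under the hood (assuming it uses DataCollatorForLanguageModeling with the default 15% masking probability, as in the original BERT pre-training objective):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Tokenize a toy sentence and let the collator randomly mask ~15% of its tokens.
encoding = tokenizer("This is a toy domain-specific sentence.", return_tensors="pt")
batch = collator([{k: v[0] for k, v in encoding.items()}])
print(batch["input_ids"])  # some tokens replaced by [MASK]
print(batch["labels"])     # original ids at masked positions, -100 elsewhere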

You have been very helpful.


Hey hf4nlp, I’m currently stuck with a problem similar to yours. I want to further pre-train a BERT model with domain-specific data (cooking domain), and then fine-tune it on a specific downstream task. Is there a chance you could post a link to your GitHub repository (if there is one)?


Hello @PaschiSt, sorry for my late reply.

Unfortunately, I don’t have a repo for that, and I didn’t manage to collect the data I had planned for continuing the pre-training.

But I remember that, as I mentioned in the previous comment, the script ran successfully. So, if you follow the detailed description on the scripts’ page and give the script the input data in the expected format (if I remember correctly, one text per line), you won’t have a problem with the pre-training.
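
For example, preparing such a file could look something like this (a minimal sketch; the file name and texts are just placeholders):

# Write one text per line into a plain-text file that can then be passed
# to run_mlm.py via --train_file (file name and contents are placeholders).
texts = [
    "First domain-specific document on its own line.",
    "Second domain-specific document on the next line.",
]
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(texts) + "\n")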

Hope this helps somehow.

Hi @nielsr, @hf4nlp, this link doesn’t exist anymore - scripts

Is this the new link now - transformers/run_mlm.py at main · huggingface/transformers · GitHub?

I appreciate your response, Thanks in advance.

Yes that’s correct!

Thanks for the reply @nielsr
I am actually looking to get the embedding of a sentence, and before directly using the BERT model, I want to fine-tune it on my specific domain and then get the embeddings (pooler_output from AutoModel).
This was my plan:

  1. Fine-tune the masked language model on the specific domain
  2. Load this fine-tuned model into AutoModel, and then get the embeddings

However, the embeddings I got from the second step don’t seem to be correct.
(I did a simple check: for a particular sentence, I computed the cosine similarity between the pooler output of standalone step 2 and that of step 2 preceded by step 1. This similarity was low, which I think it shouldn’t be, as I only fine-tuned on a small dataset (500 data points).)

Code -

# Step 1: fine-tune a masked language model on the domain data
import pandas as pd
import datasets
from datasets import load_dataset
from transformers import AutoModelForMaskedLM, AutoTokenizer

bert_maskedML = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

df = pd.read_csv('/kaggle/input/inputs-n500/Regression_inputs_n500.csv')  # loaded but not used below
dataset = load_dataset("csv", data_files='.fintech_inputs_n500.csv', split=datasets.Split.TRAIN)

# Tokenize the 'text' column and keep only the tensors the model needs
tok_oup = dataset.map(lambda x: bert_tokenizer(x['text'], padding='max_length'), batched=True)
tok_oup = tok_oup.remove_columns('text')
tok_oup = tok_oup.remove_columns('Unnamed: 0')
tok_oup.set_format("torch", columns=["input_ids", "token_type_ids", "attention_mask"])

# Randomly mask 15% of the tokens, as in the original BERT pre-training
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_tokenizer, mlm=True, mlm_probability=0.15
)

from transformers import Trainer, TrainingArguments
training_args = TrainingArguments("test-trainer3")

trainer = Trainer(
    bert_maskedML,
    training_args,
    data_collator=data_collator,
    train_dataset=tok_oup,
)

trainer.train()
trainer.save_model('./MaskedLM')

# Step 2: load the fine-tuned weights into a bare encoder to get embeddings
from transformers import AutoModel
bertMasked_auto = AutoModel.from_pretrained('/kaggle/working/MaskedLM')

# Step 2, if no step 1: the original pre-trained encoder
bert_auto = AutoModel.from_pretrained('bert-base-uncased')

# Comparison: cosine similarity between the two pooler outputs for one sentence
inputs = ["no you say that if i make a late payment there is no late fee"]
inputs = bert_tokenizer(inputs, padding='max_length', return_tensors='pt')

bert_masked_predctn = bertMasked_auto(**inputs)
bert_auto_predctn = bert_auto(**inputs)

from torch.nn import CosineSimilarity
cos = CosineSimilarity(dim=0, eps=1e-6)
auto_pooler = bert_auto_predctn['pooler_output']
Masked_auto_pooler = bert_masked_predctn['pooler_output']
cos(auto_pooler[0], Masked_auto_pooler[0])

Can you suggest improvements to this approach, or any other approach for my task?

Does this script pre-train BERT from scratch, or does it perform continued pre-training when given the model name “bert-base-uncased”? I am a little confused about that. Thank you!