I’m new to this. I’m trying to fine-tune a BERT masked language model (bert-base-uncased) on a target domain. Unfortunately, the results are not good.
Before fine-tuning, the pre-trained model fills the mask of a sentence with words in line with human expectations.
E.g. Wikipedia is a free online [MASK], created and edited by volunteers around the world.
The most probable predictions are encyclopedia (score: 0.650) and resource (score: 0.087).
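For context, this is roughly how I get the baseline predictions, using the same fill-mask pipeline I use later for the fine-tuned model (a minimal sketch):

from transformers import pipeline

# Baseline check against the stock pre-trained checkpoint
baseline = pipeline('fill-mask', model='bert-base-uncased')
mask = baseline.tokenizer.mask_token
print(baseline(f'Wikipedia is a free online {mask}, created and edited by volunteers around the world'))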
After fine-tuning, the predictions are completely wrong; often stopwords are predicted as the top results.
E.g. Wikipedia is a free online [MASK], created and edited by volunteers around the world.
The most probable predictions are the (score: 0.052) and be (score: 0.033).
I experimented with different numbers of epochs (from 1 to 10) and different dataset sizes (from a few MB to a few GB), but I got the same issue. What am I doing wrong? I’m using the following code; I hope you can help me.
from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and configuration for bert-base-uncased
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
config = AutoConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
model = AutoModelForMaskedLM.from_config(config)  # BertForMaskedLM.from_pretrained(path)
from transformers import LineByLineTextDataset

# One training example per line of the corpus, truncated to block_size tokens
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="data/english/corpora.txt",
                                block_size=512)
from transformers import DataCollatorForLanguageModeling

# Randomly mask 15% of tokens for the MLM objective
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir="output/models/english",
                                  overwrite_output_dir=True,
                                  num_train_epochs=5,
                                  per_device_train_batch_size=8,
                                  save_steps=22222222,  # very large, so no intermediate checkpoints are written
                                  save_total_limit=2)
trainer = Trainer(model=model, args=training_args, data_collator=data_collator, train_dataset=dataset)
trainer.train()
trainer.save_model("output/models/english")
from transformers import pipeline
# Initialize MLM pipeline
mlm = pipeline('fill-mask', model="output/models/english", tokenizer="output/models/english")
# Get mask token
mask = mlm.tokenizer.mask_token
# Get result for particular masked phrase
phrase = f'Wikipedia is a free online {mask}, created and edited by volunteers around the world'
result = mlm(phrase)
# Print result
print(result)
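For reference, what gets printed is the standard fill-mask output, a list of dicts sorted by score. Roughly (other fields elided; scores are the ones from the run quoted above):

# [{'sequence': 'wikipedia is a free online the, created and edited by volunteers around the world', 'score': 0.052, ...},
#  {'sequence': 'wikipedia is a free online be, created and edited by volunteers around the world', 'score': 0.033, ...},
#  ...]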