nes
September 28, 2020, 11:53am
Hello,
I have used a pre-trained BERT model using Hugging Transformers for a project. I would like to know how to “fine-tune” the BERT for Masked Language Modeling for a task like spelling correction. The links “https://github.com/huggingface/transformers/tree/master/examples/lm_finetuning ” and “https://github.com/huggingface/transformers/blob/master/examples/lm_finetuning/pregenerate_training_data.py ” are not found which seemed to be of great resource. As well as I would also like to know the dataset (like what kind of inputs and labels are to be given to the model) format that BERTForMaskedLM requires to be trained on. I would be grateful if anyone could help me in this regard.
Thanks,
Nes
VP1
January 22, 2021, 10:20am
It seems the "lm_finetuning" script is no longer available.
There is this instead:
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, CTRL, BERT, RoBERTa, XLNet).
GPT, GPT-2 and CTRL are fine-tuned using a causal language modeling (CLM) loss. BERT and RoBERTa are fine-tuned
using a masked language modeling (MLM) loss. XLNet is fine-tuned using a permutation language modeling (PLM) loss.
"""
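If you do not want to run that script directly, roughly the same MLM fine-tuning flow can be reproduced with the Trainer API. Below is a minimal sketch (my own example, not taken from the script), assuming a recent transformers version; "bert-base-chinese" and the placeholder sentences are just examples, and DataCollatorForLanguageModeling does the random masking for you:

import torch
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# Plain, unmasked sentences; replace with your own training text.
texts = ["sentence one", "sentence two"]
encodings = tokenizer(texts, truncation=True, padding=True, max_length=32)

class TextDataset(torch.utils.data.Dataset):
    """Wraps the tokenized sentences so the Trainer can iterate over them."""
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings["input_ids"])
    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}

# The collator copies input_ids into labels and randomly masks ~15% of the
# input tokens, so you only need to supply plain text.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="bert-mlm-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=TextDataset(encodings),
    data_collator=collator,
)
trainer.train()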
The training text can look like this:
traindata = {
    "text": [
        "我们有一个愉快的星期天",   # "We had a pleasant Sunday"
        "我们明天去吃饭?",          # "Shall we go eat tomorrow?"
        "我们有英语课程",           # "We have an English class"
        "我们明天出去玩吧",         # "Let's go out tomorrow"
        "哈哈哈,今天是星期二",      # "Haha, today is Tuesday"
    ]
}
When building the MLM training data, use the masked sentences as the model inputs and the original sentences as the labels.
For example:
input = "我们[MASK]天出去玩吧"   # the masked position is chosen at random
label = "我们明天出去玩吧"        # "Let's go out tomorrow"
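In terms of tensors, BertForMaskedLM takes the masked input_ids plus a labels tensor holding the original token ids, with every position you do not want scored set to -100. Here is a minimal sketch of that (my own example, assuming a recent transformers version; "bert-base-chinese" is again just an example checkpoint):

import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

sentence = "我们明天出去玩吧"   # "Let's go out tomorrow"
enc = tokenizer(sentence, return_tensors="pt")
input_ids = enc["input_ids"]

# Labels start as a copy of the original (unmasked) token ids.
labels = input_ids.clone()

# Pick one random non-special position and replace it with [MASK] in the input.
candidates = torch.arange(1, input_ids.size(1) - 1)   # skip [CLS] and [SEP]
mask_pos = candidates[torch.randint(len(candidates), (1,))]
input_ids[0, mask_pos] = tokenizer.mask_token_id

# Positions that were NOT masked are set to -100 so the loss ignores them.
labels[input_ids != tokenizer.mask_token_id] = -100

outputs = model(
    input_ids=input_ids,
    attention_mask=enc["attention_mask"],
    labels=labels,
)
print(outputs.loss)   # MLM loss computed only on the masked position

During training you would repeat this masking on the fly for each batch (which is exactly what DataCollatorForLanguageModeling automates), and for spelling correction you could instead replace the chosen positions with misspelled tokens while keeping the correct sentence as the label.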