Fine-tune a Masked Language Model on a custom dataset

Hi, I am new to BERT. I want to fine-tune on a plain text corpus (only text, no labels or other information) so I can use the hidden states for word/document embeddings.

I think the masked language model is the right objective for this, but the only example I found is this demo https://huggingface.co/blog/how-to-train, which is neither general nor very clear.

The official examples are here, but none of them covers an MLM model: https://huggingface.co/transformers/master/custom_datasets.html Does anyone have a clear example for this case?

To be specific, I don’t know how to pass the true labels to the model in the MLM task for the loss and gradient update. Also, when using the Trainer, where should I pass the --mlm flag?

Thank you in advance

Hi @smalltoken, what is the issue with https://huggingface.co/blog/how-to-train?
This colab should help you. It walks you through:

  1. How to train a tokenizer from scratch
  2. How to create a RobertaModel using the config
  3. How to use DataCollatorForLanguageModeling, which handles the masking
  4. How to train using the Trainer (a minimal sketch of steps 3 and 4 follows below)
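For concreteness, here is a minimal sketch of steps 3 and 4. The checkpoint name "roberta-base", the file name train.txt, and the hyperparameters are just placeholders (not the exact values from the colab); adapt them to your setup. The collator builds the MLM labels itself, so there is no separate --mlm flag to pass when you use the Trainer this way.

```python
from transformers import (
    RobertaTokenizerFast,
    RobertaForMaskedLM,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Each non-empty line of train.txt becomes one training example.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="train.txt",
    block_size=128,
)

# The collator masks ~15% of the tokens and builds the MLM labels,
# so no manual label handling is needed.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

training_args = TrainingArguments(
    output_dir="./mlm-out",
    num_train_epochs=1,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
```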

Thanks for your kind reply.
I put the code and the error message in this open issue on GitHub: https://github.com/huggingface/transformers/issues/6616 My problem is mainly the ‘index out of range in self’ error, which happens even though I already set a very low max_length in the tokenizer. I was also confused when I tried to train with native PyTorch before, because I didn’t know how to handle the masking and pass the labels to the model. But if DataCollatorForLanguageModeling and the Trainer work well, that doesn’t matter.

Could you help check whether there is any mistake in my code? Thank you!

Sure. What is ./bert-large-cased in the code, is it a pre-trained BERT or did you create it yourself? If you created it, can you post the config?

Also, DataCollatorForLanguageModeling handles the masking and builds the labels for you.
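To illustrate (a rough sketch; "bert-base-cased" is just an example checkpoint, a local folder works the same way), the collator takes plain token ids and returns both the masked input_ids and the labels the model's loss expects:

```python
import torch
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# One example = one tensor of token ids.
examples = [
    torch.tensor(
        tokenizer.encode("a sentence from the corpus", truncation=True, max_length=32)
    )
]
batch = collator(examples)

# batch["input_ids"]: some tokens replaced by [MASK] (or a random/kept token)
# batch["labels"]:    original ids at the masked positions, -100 everywhere else,
#                     so the loss is computed only on the masked tokens.
print(batch["input_ids"])
print(batch["labels"])
```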

Because of a proxy error at my company, I manually downloaded the pre-trained model from s3.amazonaws.com (so maybe I missed some files?). ./bert-large-cased is a folder containing the following items:

  1. bert-large-cased-tf_model.h5
  2. config.json
  3. pytorch_model.bin
  4. vocab.txt

Here is what is inside the config:

  • attention_probs_dropout_prob: 0.1
  • directionality: "bidi"
  • hidden_act: "gelu"
  • hidden_dropout_prob: 0.1
  • hidden_size: 1024
  • initializer_range: 0.02
  • intermediate_size: 4096
  • max_position_embeddings: 512
  • num_attention_heads: 16
  • num_hidden_layers: 24
  • pooler_fc_size: 768
  • pooler_num_attention_heads: 12
  • pooler_num_fc_layers: 3
  • pooler_size_per_head: 128
  • pooler_type: "first_token_transform"
  • type_vocab_size: 2
  • vocab_size: 28996
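Given the vocab_size and max_position_embeddings above, here is a minimal sketch of how I load such a local folder (not my exact code, which is in the GitHub issue), plus a quick sanity check that the tokenizer and the config agree, since an id that falls outside the embedding tables is one possible source of the "index out of range in self" error:

```python
from transformers import BertTokenizer, BertForMaskedLM

# Load from the local folder: vocab.txt for the tokenizer,
# config.json + pytorch_model.bin for the model.
tokenizer = BertTokenizer.from_pretrained("./bert-large-cased")
model = BertForMaskedLM.from_pretrained("./bert-large-cased")

# Both should report 28996; inputs must stay within 512 positions.
print(len(tokenizer), model.config.vocab_size)
print(model.config.max_position_embeddings)

enc = tokenizer(
    "some text from my corpus", truncation=True, max_length=128, return_tensors="pt"
)
out = model(**enc)
```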