Fine-tune a Masked Language Model on a custom dataset

Hi, I am new to BERT. I want to fine-tune on a plain text corpus (only text, no labels or other information) so I can use the hidden states for word/document embeddings.

I think the masked language model is the right objective for this, but the only example I found is this demo https://huggingface.co/blog/how-to-train, which is neither general nor very clear.

The official examples are here, but none of them covers an MLM model: https://huggingface.co/transformers/master/custom_datasets.html Does anyone have a clear example for this case?

To be specific, I don’t know how to pass the true labels to the model in the MLM task for the loss and gradient update. Also, when using the Trainer, where should I pass the --mlm flag?

Thank you in advance

Hi @smalltoken, what is the issue with https://huggingface.co/blog/how-to-train?
This colab should help you. It walks you through:

  1. How to train a tokenizer from scratch
  2. How to create a RobertaModel using the config
  3. How to use DataCollatorForLanguageModeling, which handles the masking
  4. How to train using the Trainer (a minimal sketch of steps 3 and 4 follows below)
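For concreteness, here is a minimal sketch of steps 3 and 4. The checkpoint name "roberta-base", the file name train.txt, and the hyperparameters are just placeholders (not the exact values from the colab); adapt them to your setup. The collator builds the MLM labels itself, so there is no separate --mlm flag to pass when you use the Trainer this way.

```python
from transformers import (
    RobertaTokenizerFast,
    RobertaForMaskedLM,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Each non-empty line of train.txt becomes one training example.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="train.txt",
    block_size=128,
)

# The collator masks ~15% of the tokens and builds the MLM labels,
# so no manual label handling is needed.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

training_args = TrainingArguments(
    output_dir="./mlm-out",
    num_train_epochs=1,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
```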

Thanks for your kind reply.
I put the code and the error message in this open issue on GitHub: https://github.com/huggingface/transformers/issues/6616 My problem is mainly the ‘index out of range in self’ error, which happens even though I already set a very low max_length in the tokenizer. I was also confused when I tried to train with native PyTorch before, because I didn’t know how to handle the masking and pass the labels to the model. But if DataCollatorForLanguageModeling and the Trainer work well, that doesn’t matter.

Could you help check whether there is any mistake in my code? Thank you!

Sure. What is ./bert-large-cased in the code, is it a pre-trained BERT or did you create it yourself? If you created it, can you post the config?

Also, DataCollatorForLanguageModeling handles the masking and builds the labels for you.
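To illustrate (a rough sketch; "bert-base-cased" is just an example checkpoint, a local folder works the same way), the collator takes plain token ids and returns both the masked input_ids and the labels the model's loss expects:

```python
import torch
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# One example = one tensor of token ids.
examples = [
    torch.tensor(
        tokenizer.encode("a sentence from the corpus", truncation=True, max_length=32)
    )
]
batch = collator(examples)

# batch["input_ids"]: some tokens replaced by [MASK] (or a random/kept token)
# batch["labels"]:    original ids at the masked positions, -100 everywhere else,
#                     so the loss is computed only on the masked tokens.
print(batch["input_ids"])
print(batch["labels"])
```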

Because of a proxy error at my company, I manually downloaded the pre-trained model from s3.amazonaws.com (so maybe I missed some files?). ./bert-large-cased is a folder containing the following items:

  1. bert-large-cased-tf_model.h5
  2. config.json
  3. pytorch_model.bin
  4. vocab.txt

Here is what is inside the config:

  • attention_probs_dropout_prob: 0.1
  • directionality: "bidi"
  • hidden_act: "gelu"
  • hidden_dropout_prob: 0.1
  • hidden_size: 1024
  • initializer_range: 0.02
  • intermediate_size: 4096
  • max_position_embeddings: 512
  • num_attention_heads: 16
  • num_hidden_layers: 24
  • pooler_fc_size: 768
  • pooler_num_attention_heads: 12
  • pooler_num_fc_layers: 3
  • pooler_size_per_head: 128
  • pooler_type: "first_token_transform"
  • type_vocab_size: 2
  • vocab_size: 28996
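Given the vocab_size and max_position_embeddings above, here is a minimal sketch of how I load such a local folder (not my exact code, which is in the GitHub issue), plus a quick sanity check that the tokenizer and the config agree, since an id that falls outside the embedding tables is one possible source of the "index out of range in self" error:

```python
from transformers import BertTokenizer, BertForMaskedLM

# Load from the local folder: vocab.txt for the tokenizer,
# config.json + pytorch_model.bin for the model.
tokenizer = BertTokenizer.from_pretrained("./bert-large-cased")
model = BertForMaskedLM.from_pretrained("./bert-large-cased")

# Both should report 28996; inputs must stay within 512 positions.
print(len(tokenizer), model.config.vocab_size)
print(model.config.max_position_embeddings)

enc = tokenizer(
    "some text from my corpus", truncation=True, max_length=128, return_tensors="pt"
)
out = model(**enc)
```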