How do I add new tokens to an existing model for masked language modelling?


I have followed this tutorial from GitHub on masked language modelling: notebooks/language_modeling.ipynb at master · huggingface/notebooks · GitHub

But I am wondering: how do I modify the code below for the masked language modelling task, and where in my code do I place it?

In the tutorial, this line of code is used:

from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

This is the code I need to modify for MLM:

# Let's see how to increase the vocabulary of the BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
print('We have added', num_added_toks, 'tokens')
model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer.

First of all, I guess you want to use BertForMaskedLM instead of BertModel. The other parts should work AFAIK.

@BramVanroy Yes, I want to use BertForMaskedLM

I have modified the code, but I am getting this error:

NameError: name 'BertTokenizer' is not defined