Hi,
I have followed this tutorial from GitHub on masked language modelling: notebooks/language_modeling.ipynb at master · huggingface/notebooks · GitHub
But I am wondering: how do I modify the code below for the masked language modelling task, and where in my code should it go?
In the tutorial, this line of code is used:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
This is the code I need to modify so that it works for MLM:
from transformers import BertModel, BertTokenizer

# Let's see how to increase the vocabulary of the BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
print('We have added', num_added_toks, 'tokens')
# Notice: resize_token_embeddings expects the full size of the new vocabulary,
# i.e. the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))
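My guess is that I just swap BertModel for the masked-LM class from the tutorial and keep the resize step, roughly like the sketch below (just my attempt, not verified; 'new_tok1' and 'my_new-tok2' are placeholder tokens from the docs example):

from transformers import AutoModelForMaskedLM, BertTokenizer

model_checkpoint = 'bert-base-uncased'

# Load the tokenizer and a model with a masked-LM head from the same checkpoint
tokenizer = BertTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

# Add the new tokens, then grow the embedding matrix to match the new vocab size
num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
print('We have added', num_added_toks, 'tokens')
model.resize_token_embeddings(len(tokenizer))

Would this go right after the model is loaded in the notebook, i.e. before tokenizing the dataset and setting up the Trainer?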