How to test masked language model after training it?

Hi,

I have followed this tutorial and trained my masked language model: notebooks/language_modeling.ipynb at master · huggingface/notebooks · GitHub

I have saved the model using the code below:

trainer.save_model("my_model")

But the notebook does not seem to include any code for testing the model, so I am unsure how to do this.

I now want to use my model to fill in a masked sentence, something like this:

The [MASK] of France is Paris

Thanks!


You can load it in a pipeline by using the folder where you saved it:

from transformers import pipeline

mask_filler = pipeline("fill-mask", model="my_model")
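
Then you can run it on your example sentence, for instance (a rough sketch, assuming your tokenizer's mask token is [MASK]; if it isn't, put tokenizer.mask_token in the sentence instead):

preds = mask_filler("The [MASK] of France is Paris.")
# each prediction is a dict with the filled-in sequence, the predicted token and its score
for pred in preds:
    print(pred["sequence"], pred["token_str"], pred["score"])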

Thanks. What about the tokenizer? Does that need to be saved somewhere?

It should be saved in the same folder, which will be the case if you passed it to the trainer.
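
If it isn't there, you can also write it to that folder yourself, for example (assuming your tokenizer object is still available as tokenizer):

tokenizer.save_pretrained("my_model")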

Okay, I tried the code you suggested, but I get this error:

OSError: Can't load tokenizer for '/content/my_model'. Make sure that:

- '/content/my_model' is a correct model identifier listed on 'https://huggingface.co/models'

- or '/content/my_model' is the correct path to a directory containing relevant tokenizer files

That means you did not save your tokenizer in that folder (which also means you did not pass it when creating your Trainer).

Oh… Well, this is my trainer:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["test"],
    data_collator=data_collator,
)

I thought the data_collator contained the tokenizer?

Like, here, for example:

from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

hey @anon58275033 it’s true that the data collator uses a tokenizer to perform the collation, but you need to provide the tokenizer argument explicitly to the trainer:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer
)

as described in the docs:

The tokenizer used to preprocess the data. If provided, will be used to automatically pad the inputs to the maximum length when batching inputs, and it will be saved along with the model to make it easier to rerun an interrupted training or reuse the fine-tuned model.

hth!

Hi @lewtun. Thanks very much for that - it works now. I will check out the docs for some further information.

Also, I am trying to add special tokens to my model using the code found here: Tokenizer — transformers 2.11.0 documentation

Here is the code I am using:

# Let's see how to increase the vocabulary of Bert model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

num_added_toks = tokenizer.add_tokens(['🥵', '👏'])
print('We have added', num_added_toks, 'tokens')
model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer

The error:

NameError                                 Traceback (most recent call last)

<ipython-input-27-203dc3e7172a> in <module>()
      1 # Let's see how to increase the vocabulary of Bert model and tokenizer
----> 2 tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
      3 model = BertModel.from_pretrained('bert-base-uncased')
      4 
      5 num_added_toks = tokenizer.add_tokens(['🥵', '👏'])

NameError: name 'BertTokenizer' is not defined

hey @anon58275033 what version of transformers are you using? does the problem go away if you upgrade to the latest version?