Okay, I tried that code you suggested, but I get this error:
OSError: Can't load tokenizer for '/content/my_model'. Make sure that:
- '/content/my_model' is a correct model identifier listed on 'https://huggingface.co/models'
- or '/content/my_model' is the correct path to a directory containing relevant tokenizer files
Hey @anon58275033, it's true that the data collator uses a tokenizer to perform the collation, but you need to provide the tokenizer argument explicitly to the trainer:

"The tokenizer used to preprocess the data. If provided, will be used to automatically pad the inputs to the maximum length when batching inputs, and it will be saved along the model to make it easier to rerun an interrupted training or reuse the fine-tuned model."
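In practice that means wiring the tokenizer into the `Trainer` yourself. A minimal sketch (the `model`, `training_args`, and `train_dataset` names here are placeholders for whatever you already have):

```python
from transformers import Trainer

trainer = Trainer(
    model=model,                  # your model
    args=training_args,           # your TrainingArguments
    train_dataset=train_dataset,  # your tokenized dataset
    tokenizer=tokenizer,          # pass it explicitly so it is saved with the model
)
trainer.train()
```

With `tokenizer=` set, saving the model also writes the tokenizer files into the output directory, which is exactly what the `OSError` above says is missing from `/content/my_model`.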
# Let's see how to increase the vocabulary of Bert model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
num_added_toks = tokenizer.add_tokens(['🥵', '👏'])
print('We have added', num_added_toks, 'tokens')
model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expects the full size of the new vocabulary, i.e. the length of the tokenizer
The error:
NameError Traceback (most recent call last)
<ipython-input-27-203dc3e7172a> in <module>()
1 # Let's see how to increase the vocabulary of Bert model and tokenizer
----> 2 tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
3 model = BertModel.from_pretrained('bert-base-uncased')
4
5 num_added_toks = tokenizer.add_tokens(['🥵', '👏'])
NameError: name 'BertTokenizer' is not defined