How to return word replacements when returning masked word predictions?

youssefav · September 17, 2020, 8:44pm

Hey guys, I’m trying to use masked word prediction to measure how probable a word is in a certain context.

The problem is that transformers replaces target words that are not in the vocabulary. I need to know which words were replaced and by what, as data rather than in a warning, so that I can say “This replaced word got a probability of X”. Right now I can’t do that.

Recovering the mapping afterwards with a nearest fuzzy match or something similar seems very messy.

For example:

from transformers import pipeline

nlp = pipeline("fill-mask", model="roberta-base")
# The pipeline returns a single list of result dicts, so assign it to one name.
res = nlp(
    f"{nlp.tokenizer.mask_token} talk about the rules of the game first.",
    targets=[" Will", " Wille", " Wil", " We'll"],
)

Gives me:

The specified target token ` Wille` does not exist in the model vocabulary. Replacing with `ĠWil`.
The specified target token ` We'll` does not exist in the model vocabulary. Replacing with `ĠWe`.

And a result of:

[{'sequence': '<s> We talk about the rules of the game first.</s>', 'score': 9.70363453234313e-06, 'token': 166, 'token_str': 'ĠWe'},
 {'sequence': '<s> Will talk about the rules of the game first.</s>', 'score': 1.4700815142987267e-07, 'token': 2290, 'token_str': 'ĠWill'},
 {'sequence': '<s> Wil talk about the rules of the game first.</s>', 'score': 4.419403731859006e-11, 'token': 3884, 'token_str': 'ĠWil'},
 {'sequence': '<s> Wil talk about the rules of the game first.</s>', 'score': 4.419403731859006e-11, 'token': 3884, 'token_str': 'ĠWil'}]

However… what I really want is at least something like this coming out along with the output:
{"We'll":"ĠWe", "Wille":"ĠWil"}

So that later I can link that to the original results, and do something like:

results.get("We'll") and get the score of its replacement `ĠWe`: 'score': 9.70363453234313e-06
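One way the mapping could be built without touching the logs is sketched below. It assumes the pipeline falls back to the first sub-token the tokenizer produces for each target, which matches the warnings above (` Wille` becomes `ĠWil`, ` We'll` becomes `ĠWe`) but is not confirmed here. The helper names `build_replacement_map` and `scores_by_target` are made up for illustration and are not part of transformers.

```python
from typing import Callable, Dict, List, Optional


def build_replacement_map(
    tokenize: Callable[[str], List[str]], targets: List[str]
) -> Dict[str, str]:
    """Map each target string to the first sub-token produced for it.

    Assumption: the fill-mask pipeline uses that first sub-token as the
    replacement when a target is not a single vocabulary entry.
    """
    mapping = {}
    for target in targets:
        pieces = tokenize(target)
        if pieces:  # skip targets that tokenize to nothing
            mapping[target] = pieces[0]
    return mapping


def scores_by_target(
    results: List[dict], mapping: Dict[str, str]
) -> Dict[str, Optional[float]]:
    """Look up each original target's score through its replacement token."""
    by_token = {r["token_str"]: r["score"] for r in results}
    # Targets whose token is absent from the results map to None.
    return {target: by_token.get(token) for target, token in mapping.items()}
```

With the pipeline from the example, this would be called as `build_replacement_map(nlp.tokenizer.tokenize, targets)`, and the resulting dict passed to `scores_by_target` together with the pipeline output. The lookup relies on `token_str` holding the raw token (such as `ĠWe`), as in the output shown above.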

Is there an existing feature for this? If not, is it possible for me to get the logs as output so that I can parse them to get this result? I have no idea how to store the logs in a variable and searching online reveals no answers so far.
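On the question of storing logs in a variable: Python's standard `logging` module allows a custom handler to be attached to a logger, so the warning messages can be collected in a list and parsed. The sketch below is only that, a sketch. The names `ListHandler`, `capture_warnings` and `parse_replacements` are hypothetical, and the regular expression is written against the exact warning text shown above, so it would break if the wording changes between versions.

```python
import logging
import re
from typing import Callable, Dict, List, Tuple


class ListHandler(logging.Handler):
    """Logging handler that stores formatted messages in a list."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


# Pattern matching the warning text quoted in this post.
REPLACED = re.compile(
    r"The specified target token `(.+?)` does not exist in the model "
    r"vocabulary\. Replacing with `(.+?)`\."
)


def capture_warnings(logger_name: str, func: Callable[[], object]) -> Tuple[object, List[str]]:
    """Run func() while collecting messages logged under logger_name."""
    logger = logging.getLogger(logger_name)
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        result = func()
    finally:
        logger.removeHandler(handler)  # always detach the handler again
    return result, handler.messages


def parse_replacements(messages: List[str]) -> Dict[str, str]:
    """Extract {original target: replacement token} from warning messages."""
    found = {}
    for message in messages:
        match = REPLACED.search(message)
        if match:
            found[match.group(1)] = match.group(2)
    return found
```

Assuming the warnings are emitted by a logger under the `transformers` namespace, usage would look like `res, msgs = capture_warnings("transformers", lambda: nlp(text, targets=targets))` followed by `parse_replacements(msgs)`. Parsing log text is fragile, so this is a fallback rather than a clean solution.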
