Hey guys, I’m trying to use masked word prediction to measure how probable a word is in a certain context.
The problem is that transformers replaces target words that are not in the vocabulary. I need to know which words were replaced, as data rather than only in a warning, so that I can say “This replaced word got a probability of X”. Right now I can’t do that.
It seems very messy to go for the nearest fuzzy match or something.
For example:
from transformers import pipeline
nlp = pipeline("fill-mask", model="roberta-base")
# The pipeline returns a single list of result dicts, so it is assigned to one name.
res = nlp(f"{nlp.tokenizer.mask_token} talk about the rules of the game first.",
          targets=[' Will', ' Wille', ' Wil', ' We\'ll'])
Gives me:
The specified target token ` Wille` does not exist in the model vocabulary. Replacing with `ĠWil`.
The specified target token ` We'll` does not exist in the model vocabulary. Replacing with `ĠWe`.
And a result of:
[{'sequence': '<s> We talk about the rules of the game first.</s>', 'score': 9.70363453234313e-06, 'token': 166, 'token_str': 'ĠWe'},
{'sequence': '<s> Will talk about the rules of the game first.</s>', 'score': 1.4700815142987267e-07, 'token': 2290, 'token_str': 'ĠWill'},
{'sequence': '<s> Wil talk about the rules of the game first.</s>', 'score': 4.419403731859006e-11, 'token': 3884, 'token_str': 'ĠWil'},
{'sequence': '<s> Wil talk about the rules of the game first.</s>', 'score': 4.419403731859006e-11, 'token': 3884, 'token_str': 'ĠWil'}]
However… what I really want is at least something like this coming out along with the output:
{"We'll":"ĠWe", "Wille":"ĠWil"}
So that later I can link that to the original results, and do something like:
results.get("We'll")
and get: 'score': 9.70363453234313e-06
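To make that lookup concrete, here is a rough sketch of the linking step, assuming the replacement mapping were available as a plain dict. The helper name `scores_by_target` is made up for illustration and is not part of transformers; the result dicts are abbreviated copies of the output above.

```python
def scores_by_target(results, replacements):
    """Map each original target word to the score of the token that replaced it."""
    # Index the pipeline output by token string; repeated entries carry the same score.
    score_of = {r["token_str"]: r["score"] for r in results}
    # Targets whose replacement token is missing from the results map to None.
    return {target: score_of.get(token) for target, token in replacements.items()}


# Abbreviated copy of the pipeline output shown above.
results = [
    {"score": 9.70363453234313e-06, "token": 166, "token_str": "ĠWe"},
    {"score": 1.4700815142987267e-07, "token": 2290, "token_str": "ĠWill"},
    {"score": 4.419403731859006e-11, "token": 3884, "token_str": "ĠWil"},
    {"score": 4.419403731859006e-11, "token": 3884, "token_str": "ĠWil"},
]

# The mapping I would like to get back (hypothetical, not produced by the pipeline).
replacements = {"We'll": "ĠWe", "Wille": "ĠWil"}

linked = scores_by_target(results, replacements)
print(linked["We'll"])  # score of the ĠWe entry
```

With something like this, `linked.get("We'll")` would return the score of the token that actually stood in for the original word.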
Is there an existing feature for this? If not, is it possible for me to get the logs as output so that I can parse them to get this result? I have no idea how to store the logs in a variable and searching online reveals no answers so far.
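One possible route for the log question, sketched with only the standard `logging` module: attach a custom handler and parse the warning text. This assumes the warning goes through a logger under the `"transformers"` name and keeps the exact wording quoted above, neither of which is guaranteed across versions. `ReplacementCollector` is a hypothetical helper, and the two `logger.warning` calls only stand in for the real pipeline call.

```python
import logging
import re

# Pattern for the warning text quoted above; the wording may differ between versions.
PATTERN = re.compile(
    r"The specified target token `(.+?)` does not exist in the model vocabulary\. "
    r"Replacing with `(.+?)`"
)


class ReplacementCollector(logging.Handler):
    """Hypothetical handler that stores 'original -> replacement' pairs from warnings."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.replacements = {}

    def emit(self, record):
        match = PATTERN.search(record.getMessage())
        if match:
            original, replaced = match.groups()
            # Drop the leading space of the target so keys look like "We'll".
            self.replacements[original.strip()] = replaced


# Assumption: the warning is emitted by a logger in the "transformers" namespace.
logger = logging.getLogger("transformers")
collector = ReplacementCollector()
logger.addHandler(collector)

# Stand-ins for the pipeline call; the real nlp(...) call would go here instead.
logger.warning("The specified target token ` Wille` does not exist in the model vocabulary. Replacing with `ĠWil`.")
logger.warning("The specified target token ` We'll` does not exist in the model vocabulary. Replacing with `ĠWe`.")

logger.removeHandler(collector)
print(collector.replacements)
```

Parsing log text is brittle, since any change in the message wording breaks the pattern, so this is more of a workaround than a proper feature.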