Retrieving whole words with fill-mask pipeline


I recently discovered the fill-mask pipeline from Hugging Face, and while playing with it I found that it has trouble handling words that are not in the model’s vocabulary.

For example, given the sentence “The internal analysis indicates that the company has reached a [MASK] level.”, I would like to know which of these words [‘good’, ‘virtuous’, ‘obedient’] is the most probable according to the bert-large-cased-whole-word-masking model.

The pipeline does not score virtuous or obedient as whole words, because neither exists in the vocabulary as a single token. Instead, the scores go to the first subword token it recognizes for each word, v and o, which is not useful.

So the question remains: how can I get prediction scores for the whole word instead of for individual subword tokens?

I’m not sure, but you could average the probabilities of each word’s subword tokens and then compare those averages across the candidate words.
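A minimal sketch of that averaging idea, not a definitive implementation. The function names `average_probability`, `rank_candidates`, and `subword_probabilities` are hypothetical helpers, not part of the fill-mask pipeline. The sketch assumes one [MASK] per subword piece, with every piece's probability read from a single forward pass. That only approximates a whole-word score, since the pieces are not conditioned on each other.

```python
from typing import Dict, List, Sequence, Tuple


def average_probability(token_probs: Sequence[float]) -> float:
    """Arithmetic mean of the per-subword probabilities of one word."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return sum(token_probs) / len(token_probs)


def rank_candidates(
    probs_by_word: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Sort candidate words by their averaged score, highest first."""
    scores = {w: average_probability(p) for w, p in probs_by_word.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def subword_probabilities(
    sentence: str,
    word: str,
    model_name: str = "bert-large-cased-whole-word-masking",
) -> List[float]:
    """Hypothetical helper: probability of each subword piece of `word`.

    Assumption: the single mask token in `sentence` is expanded to one
    mask per subword piece, and all pieces are scored in one pass.
    """
    # Imported lazily so the pure-Python helpers above work without torch.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()

    # Split the candidate word into vocabulary pieces (no [CLS]/[SEP]).
    piece_ids = tokenizer(word, add_special_tokens=False)["input_ids"]

    # Replace the one mask with as many masks as the word has pieces.
    masks = " ".join([tokenizer.mask_token] * len(piece_ids))
    text = sentence.replace(tokenizer.mask_token, masks, 1)

    inputs = tokenizer(text, return_tensors="pt")
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    mask_positions = is_mask.nonzero(as_tuple=True)[0]

    with torch.no_grad():
        logits = model(**inputs).logits[0]

    # Softmax over the vocabulary at each mask position.
    probs = logits[mask_positions].softmax(dim=-1)
    return [probs[i, tid].item() for i, tid in enumerate(piece_ids)]
```

To compare the three candidates, call `subword_probabilities(sentence, w)` for each word and pass the resulting dictionary to `rank_candidates`. Loading the model inside the helper keeps the sketch short; in practice you would load it once and reuse it.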