Hi,
I’ve recently discovered the power of the fill-mask pipeline from Hugging Face, and while playing with it, I found that it has trouble handling out-of-vocabulary words.
For example, in the sentence, “The internal analysis indicates that the company has reached a [MASK] level.”, I would like to know which one of these words [‘good’, ‘virtuous’, ‘obedient’] is the most probable according to the bert-large-cased-whole-word-masking model.
The model refuses to score the words virtuous and obedient because they do not exist in the vocabulary as single tokens. Instead, the scores are assigned to the first recognized subtoken of each word (v and o), which is not useful.
So the question remains: how can I get prediction scores for whole words rather than for individual subword tokens?
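One workaround I can sketch (this is an approximation, not a built-in feature of the pipeline): expand the single [MASK] into one mask per subtoken of each candidate word, run the model once, and sum the log-probabilities of the candidate's subtokens at the mask positions. The model name below (bert-base-cased, chosen to keep the example light; swap in bert-large-cased-whole-word-masking for the original setup) and the length normalization are my own choices, and filling all masks in a single pass only approximates the joint probability of the subtokens:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "bert-base-cased"  # illustration; use bert-large-cased-whole-word-masking in practice

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)
model.eval()

def whole_word_score(sentence: str, word: str) -> float:
    """Score a (possibly multi-token) candidate for a single [MASK] slot.

    The slot is expanded to as many [MASK] tokens as the candidate has
    subtokens; the subtoken log-probabilities at those positions are
    summed and length-normalized so words of different lengths compare.
    """
    sub_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    masks = " ".join([tokenizer.mask_token] * len(sub_ids))
    enc = tokenizer(sentence.replace("[MASK]", masks), return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**enc).logits[0]
    log_probs = torch.log_softmax(logits, dim=-1)
    total = sum(log_probs[p, t].item() for p, t in zip(mask_pos, sub_ids))
    return total / len(sub_ids)

sentence = "The internal analysis indicates that the company has reached a [MASK] level."
scores = {w: whole_word_score(sentence, w) for w in ["good", "virtuous", "obedient"]}
print(scores)
```

Note that all masks are predicted independently in one forward pass, so multi-token words are slightly disadvantaged; predicting the subtokens left-to-right and re-running the model after filling each one would be a closer (but slower) approximation.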