How does FillMaskPipeline work with Subword-Tokenization?

harnisph · March 17, 2022, 6:43pm

I have a general question about how FillMaskPipeline handles different types of tokenization. How can a sentence like “The capital of France is [MASK]” be completed with Paris, assuming the subword tokenizer (e.g. the one used for BERT) splits Paris into two tokens? The model is only trained to predict single tokens, so is there a beam-search mechanism or something similar to handle multiple tokens per [MASK], or am I missing something?

I already tried to work this out from the Hugging Face implementation source code, but I could not find the lines where this happens.

Thank you in advance!

harnisph · April 6, 2022, 4:18pm

After reading more of the literature and comparing the FillMaskPipeline against my own trivial implementation, I found that the common approach only predicts a single token per [MASK], and that this is a known limitation. Strategies for multi-token prediction exist, but they introduce an additional layer of complexity and design choices.
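
To make one such strategy concrete, here is a minimal sketch of greedy left-to-right filling: the word is represented by several [MASK] tokens, and the masks are filled one at a time with the most probable token. It is only an illustration, not what FillMaskPipeline does. The names `greedy_fill`, `merge_wordpieces` and `toy_predict` are hypothetical, `toy_predict` is a hand-written lookup table standing in for a real masked language model, and the split of Paris into `par` + `##is` is assumed purely for the example.

```python
from typing import Callable, Dict, List, Tuple

MASK = "[MASK]"

# Hypothetical interface: given a token list that still contains a mask,
# return {candidate_token: probability} for the FIRST mask position.
PredictFn = Callable[[List[str]], Dict[str, float]]


def greedy_fill(tokens: List[str], predict_fn: PredictFn) -> Tuple[List[str], float]:
    """Fill every mask left to right, taking the most probable token each step."""
    tokens = list(tokens)  # work on a copy, leave the caller's list untouched
    score = 1.0            # product of the probabilities of the chosen tokens
    while MASK in tokens:
        pos = tokens.index(MASK)          # leftmost remaining mask
        probs = predict_fn(tokens)
        best = max(probs, key=probs.get)  # greedy choice, no beam search
        tokens[pos] = best
        score *= probs[best]
    return tokens, score


def merge_wordpieces(tokens: List[str]) -> str:
    """Join WordPiece-style tokens: a '##' prefix continues the previous word."""
    words: List[str] = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)


def toy_predict(tokens: List[str]) -> Dict[str, float]:
    """Stand-in for a real model: made-up probabilities keyed on the previous token."""
    prev = tokens[tokens.index(MASK) - 1]
    table = {
        "is": {"par": 0.7, "rome": 0.3},
        "par": {"##is": 0.9, "##ty": 0.1},
    }
    return table[prev]


# Assumes the target word needs two subword tokens, hence two masks.
masked = "the capital of france is [MASK] [MASK]".split()
filled, score = greedy_fill(masked, toy_predict)
print(merge_wordpieces(filled), round(score, 2))
```

Even in this toy form the design choices show up: the number of masks has to be fixed in advance, the fill order is arbitrary, and a greedy choice at the first mask can rule out a better overall sequence, which is where beam search would come in.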
