I have a general conceptual question about the FillMaskPipeline and how it interacts with sub-word tokenization. For a sentence like “The capital of France is [MASK]”, how does the pipeline manage to predict Paris, assuming the sub-word tokenizer (e.g. the one used for BERT) would split Paris into two tokens? The model is only trained to predict a single token per [MASK], so is there a beam-search mechanism or something similar to handle multiple tokens per [MASK], or am I missing something else?
I already tried to make sense of the Hugging Face implementation source code, but I could not really find the lines where this happens.
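To make the premise of the question concrete, here is a toy sketch of greedy longest-match-first splitting in the style of WordPiece. The vocabulary `toy_vocab` and the function name `wordpiece_tokenize` are made up for illustration; a real BERT vocabulary may well contain Paris as a single token, in which case the problem does not arise for that particular word.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one lowercase word into sub-word pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary in which "paris" is NOT a single token.
toy_vocab = {"the", "capital", "london", "par", "##is"}

print(wordpiece_tokenize("paris", toy_vocab))   # ['par', '##is']
print(wordpiece_tokenize("london", toy_vocab))  # ['london']
```

With such a vocabulary, a single [MASK] can at best be filled with `par` or `##is`, never with the full word.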
After reading more of the literature and comparing the FillMaskPipeline against my own trivial implementation, I found that the common approach only predicts a single token per [MASK], and that this is a known limitation. Strategies for multi-token prediction do exist, but they introduce an additional layer of complexity and design choices.
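As an illustration of one such strategy, here is a minimal sketch that places several [MASK] tokens in a row and fills them one at a time, most confident position first. This is not what the FillMaskPipeline does; it is just one possible approach. The `Scorer` interface, `fill_greedy`, `merge_wordpieces` and `toy_scorer` are all hypothetical names, and `toy_scorer` returns hard-coded probabilities as a stand-in for a real masked language model.

```python
from typing import Callable, Dict, List

MASK = "[MASK]"

# Assumed interface: (tokens, masked position) -> {candidate token: probability}
Scorer = Callable[[List[str], int], Dict[str, float]]


def fill_greedy(tokens: List[str], scorer: Scorer) -> List[str]:
    """Fill every [MASK], one per round, picking the most confident position."""
    tokens = list(tokens)
    while MASK in tokens:
        best = None  # (probability, position, token)
        for pos, tok in enumerate(tokens):
            if tok != MASK:
                continue
            cand, prob = max(scorer(tokens, pos).items(), key=lambda kv: kv[1])
            if best is None or prob > best[0]:
                best = (prob, pos, cand)
        _, pos, cand = best
        tokens[pos] = cand  # commit, then re-score the remaining masks
    return tokens


def merge_wordpieces(tokens: List[str]) -> str:
    """Glue '##' continuation pieces back onto the preceding piece."""
    words: List[str] = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)


def toy_scorer(tokens: List[str], pos: int) -> Dict[str, float]:
    """Hard-coded stand-in for a masked LM (illustration only)."""
    prev = tokens[pos - 1] if pos > 0 else ""
    nxt = tokens[pos + 1] if pos + 1 < len(tokens) else ""
    if prev == "par":
        return {"##is": 0.9, "##k": 0.1}
    if nxt == MASK:
        return {"par": 0.6, "lon": 0.4}
    return {"##is": 0.55, "##don": 0.45}


sentence = ["the", "capital", "of", "france", "is", MASK, MASK]
filled = fill_greedy(sentence, toy_scorer)
print(merge_wordpieces(filled))  # the capital of france is paris
```

Even this small sketch shows where the design choices come in: the number of masks has to be fixed in advance, the fill order (most confident first here, left to right would be another option) affects the result, and candidates of different lengths are not directly comparable without some form of score normalisation.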