Best way to mask a multi-token word when using `.*ForMaskedLM` models

This is something of interest to me too!

This might be of help: [It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners](https://arxiv.org/abs/2009.07118)
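
For concreteness, here is a minimal sketch of one common approach (not from the thread; the checkpoint name `bert-base-uncased` and the example word are my assumptions): replace the word with as many mask tokens as it has sub-tokens, then fill the masks iteratively, committing the most confident prediction at each step, similar in spirit to the decoding strategy discussed in the paper above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumption: any *ForMaskedLM checkpoint should work here;
# "bert-base-uncased" is just an illustrative choice.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

def fill_masks_iteratively(text: str) -> str:
    """Fill every mask token in `text`, committing the most confident one per pass."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    mask_id = tokenizer.mask_token_id
    while (input_ids == mask_id).any():
        with torch.no_grad():
            logits = model(input_ids=input_ids).logits
        # Positions of the masks still left to fill
        mask_positions = (input_ids[0] == mask_id).nonzero(as_tuple=True)[0]
        probs = logits[0, mask_positions].softmax(dim=-1)
        top_probs, top_ids = probs.max(dim=-1)
        best = top_probs.argmax()  # most confident remaining mask
        input_ids[0, mask_positions[best]] = top_ids[best]
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Use as many mask tokens as the target word has sub-tokens.
# Assumption: "chrysanthemum" splits into several word pieces under this vocab.
word = "chrysanthemum"
n_pieces = len(tokenizer.tokenize(word))
prompt = "My favourite flower is " + " ".join([tokenizer.mask_token] * n_pieces) + "."
print(fill_masks_iteratively(prompt))
```

One design choice worth noting: committing the most confident position first, rather than filling left to right, lets each predicted token condition the remaining predictions, which tends to matter for multi-piece words. The model may of course fill the masks with something other than the original word; this sketch only shows the masking and decoding mechanics.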