Why are vocabulary words missing from my trained MLM's top-k predictions?

Hi,

I trained a tokenizer and a language model from scratch using a vocabulary size of 400,000.

Now, when I use the trained model for masked-token prediction and request top_k=400000 (i.e., the full vocabulary), some of the words from my vocab.txt don't appear in the results.

For example, the word “rain” is in my vocab.txt, but it never shows up among the top 400,000 predictions. Why is this?
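
For reference, here is roughly how I'm querying the model (a minimal sketch; the checkpoint path and the masked sentence are placeholders for my actual setup):

```python
from transformers import pipeline

# Load the tokenizer and model trained from scratch
# (path is a placeholder for my actual checkpoint directory)
fill_mask = pipeline(
    "fill-mask",
    model="./my-mlm-checkpoint",
    tokenizer="./my-mlm-checkpoint",
)

# Ask for scores over the entire 400,000-token vocabulary
predictions = fill_mask("The clouds gathered just before the [MASK] started.", top_k=400000)

# Check whether "rain" appears anywhere in the returned tokens
tokens = [p["token_str"].strip() for p in predictions]
print("rain" in tokens)
```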

Does it have something to do with the MLM masking probability of 0.15 used during training?

I am a bit confused.