Why are tokens missing in my trained MLM model?

anon58275033 · September 1, 2021, 5:42pm

Hi,

I trained a tokenizer and a language model from scratch using a vocabulary size of 400,000.

Now, when I use my trained model for MLM and request predictions with top_k=400000, some of the words in my vocab.txt are missing from the results.

For example, the word “rain” is in my vocab.txt, but when I use my trained model for MLM, “rain” is not among the top 400,000 predictions. Why is this?

Does it have something to do with the MLM masking probability of 0.15?

I am a bit confused.
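
One way to pin down exactly which entries are missing is to diff the contents of vocab.txt against the token strings returned for the masked position. The following is only a minimal sketch, assuming the predictions are already available as a plain list of strings; the function name `find_missing_tokens` and the toy `vocab` and `predicted` lists are hypothetical stand-ins, not the real files or model output.

```python
def find_missing_tokens(vocab_entries, predicted_tokens):
    """Return the vocab entries that never appear among the predicted tokens.

    The result keeps the order of `vocab_entries`. Surrounding whitespace is
    stripped on both sides before comparing.
    """
    predicted = {tok.strip() for tok in predicted_tokens}
    return [tok for tok in vocab_entries if tok.strip() not in predicted]


# Toy stand-ins (assumptions): in practice `vocab` would be read line by line
# from vocab.txt, and `predicted` would hold the token strings returned for
# the masked position with top_k set to the vocabulary size.
vocab = ["[PAD]", "[MASK]", "rain", "sun", "##ing"]
predicted = ["sun", "##ing", "[PAD]", "sun"]

missing = find_missing_tokens(vocab, predicted)
print("missing from predictions:", missing)

# Compare how many distinct strings came back with how many were returned.
print(len(set(predicted)), "distinct strings out of", len(predicted), "predictions")
```

If the number of distinct strings is smaller than the number of predictions returned, several predictions are collapsing to the same string, which may be one thing worth ruling out before looking at the training setup.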
