BERT LM pretraining: training loss goes to 0 at a masking probability of 0.999

Hi, I am using the Trainer class to perform masked language modeling with a pretrained BERT checkpoint (fine-tuning on my own domain). I'm using the official run_language_modeling.py script from
https://github.com/huggingface/transformers/tree/master/examples/language-modeling. The only change is a different dataset implementation, since the default ones load the full data into memory. __getitem__ works the same, returning one token-id-converted sequence at a time, so this should not matter.
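Roughly, the custom dataset does something like this (a simplified sketch, not the exact class; the linecache approach and the name LazyLineByLineDataset are just illustrative):

import linecache

import torch
from torch.utils.data import Dataset


class LazyLineByLineDataset(Dataset):
    """One example per line, read lazily instead of loading the whole file into memory."""

    def __init__(self, tokenizer, file_path, block_size):
        self.tokenizer = tokenizer
        self.file_path = file_path
        self.block_size = block_size
        # Count lines once so __len__ is cheap later.
        with open(file_path, encoding="utf-8") as f:
            self.num_lines = sum(1 for _ in f)

    def __len__(self):
        return self.num_lines

    def __getitem__(self, i):
        # linecache is 1-indexed; only this one line is read from disk.
        line = linecache.getline(self.file_path, i + 1).strip()
        ids = self.tokenizer(line, truncation=True, max_length=self.block_size)["input_ids"]
        # Same return type as the stock LineByLineTextDataset: a 1-D tensor of token ids.
        return torch.tensor(ids, dtype=torch.long)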
Even when setting the mlm probability to something ridiculous like 0.999, this is what I observe:

[training loss plot]

The loss becomes zero rather quickly (the full dataset would be 40k training steps).

This is the command I'm using, running on Google Cloud:

python3 xla_spawn.py --num_cores=8 train_mlm.py \
--output_dir=real_runs/6 \
--model_type=bert \
--model_name_or_path=Rostlab/prot_bert \
--do_train \
--train_data_file=/home/preprocessed_allorgs_alllengths.txt \
--mlm \
--line_by_line \
--block_size 512 \
--max_steps 30000 \
--out_of_core  \
--logging_steps 20 \
--learning_rate 0.00001 \
--per_device_train_batch_size 20 \
--lazy \
--run_name high_mlm_prob \
--save_steps 2000 \
--warmup_steps 2666 \
--weight_decay 0.01 \
--mlm_probability=0.9999

(train_mlm.py is run_language_modeling.py with a custom dataset class; --lazy is an argument for that dataset. The rest is the default script, I didn't change anything about it.)
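For reference, --mlm_probability goes straight into DataCollatorForLanguageModeling, which selects that fraction of non-special tokens as prediction targets; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. A standalone sketch of what a batch looks like at such an extreme rate (bert-base-cased is used here purely for illustration, not the actual checkpoint):

import torch
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.999
)

ids = torch.tensor(tokenizer("The quick brown fox jumps over the lazy dog")["input_ids"])
batch = collator([ids])

# Nearly every non-special token should come back as [MASK] (or occasionally a random token);
# labels are -100 everywhere except the positions selected for prediction.
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0].tolist()))
print(batch["labels"][0])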

Am I missing something with regard to the setup? This just seems wrong in general, not an issue with the dataset.

Maybe the model is almost perfect already (?)
What is mLm probability, and is it supposed to have a capital L?

Is your data very similar to the data originally used to pre-train BERT?

What value of loss do you get if you try to fine-tune using the data originally used to pre-train BERT?

Maybe there are a few differences, but BERT has encountered enough examples of each difference to learn the patterns within the first 150 steps.

Turned out it was a tokenization issue. The tokenizer for the model checkpoint I was using needs a do_lower_case=False flag, which was of course missing when taking the provided script as-is. Without it, every residue gets lowercased out of the vocabulary, so the model only ever trained on [CLS, UNK, SEP] sequences; I didn't notice because, with TPUs, I had no intuition for training speed.
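For anyone running into the same thing: a quick sanity check like this (my own snippet, not part of the script; the sequence is just a made-up example) would have caught it before burning TPU hours:

from transformers import AutoTokenizer

# ProtBert expects space-separated, upper-case amino acids.
seq = "M K T A Y I A K Q R"

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
tokens = tokenizer.tokenize(seq)
print(tokens)

# If the tokenizer is mis-configured (e.g. lowercasing the residues out of the vocab),
# almost everything comes back as [UNK] and the MLM loss collapses to ~0.
assert tokens.count(tokenizer.unk_token) < len(tokens) // 2, "mostly [UNK] -- check tokenizer flags"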