Bert LM pretraining: training loss goes to 0 at masking probability of 0.999

fteufel · October 13, 2020, 5:36pm

Hi, I am using the Trainer class to perform masked language modeling with a pretrained Bert checkpoint (fine-tuning on my own domain). I’m using the official run_language_modeling.py script from https://github.com/huggingface/transformers/tree/master/examples/language-modeling. I merely changed it to use a different dataset implementation, as the default ones load the full data into memory. But __getitem__ works the same, returning one sequence at a time, already converted to token ids, so this should not matter.
Even when setting the mlm probability to something ridiculous like 0.999, this is what I observe:

The loss becomes zero rather quickly (full dataset would be 40k training steps)

This is the command I’m using, running on Google Cloud.

python3 xla_spawn.py --num_cores=8 train_mlm.py \
--output_dir=real_runs/6 \
--model_type=bert \
--model_name_or_path=Rostlab/prot_bert \
--do_train \
--train_data_file=/home/preprocessed_allorgs_alllengths.txt \
--mlm \
--line_by_line \
--block_size 512 \
--max_steps 30000 \
--out_of_core \
--logging_steps 20 \
--learning_rate 0.00001 \
--per_device_train_batch_size 20 \
--lazy \
--run_name high_mlm_prob \
--save_steps 2000 \
--warmup_steps 2666 \
--weight_decay 0.01 \
--mlm_probability=0.9999

(train_mlm.py is run_language_modeling.py with a custom dataset class, --lazy is an arg for that dataset. The rest is default, didn’t change anything about it.)
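
For readers unsure what --mlm_probability controls, here is a minimal sketch of BERT-style masking, which as far as I understand is what the script's data collator does. The function name mask_tokens, its arguments, and the toy ids are made up for illustration and are not the library's API. Each non-special token is selected with probability mlm_probability; selected tokens become prediction targets and are replaced by [MASK] 80% of the time, by a random token 10% of the time, and left unchanged otherwise.

```python
import random
from typing import List, Sequence, Set, Tuple

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(
    ids: Sequence[int],
    special_ids: Set[int],
    mask_id: int,
    vocab_size: int,
    mlm_probability: float,
    rng: random.Random,
) -> Tuple[List[int], List[int]]:
    """Illustrative BERT-style masking (hypothetical helper, not the library code)."""
    inputs = list(ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for pos, tok in enumerate(ids):
        # Special tokens ([CLS], [SEP], ...) are never selected for prediction.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[pos] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[pos] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


# Toy ids: 2 = [CLS], 3 = [SEP], 4 = [MASK]; all assumed values.
inputs, labels = mask_tokens(
    [2, 10, 11, 12, 3], {2, 3}, mask_id=4, vocab_size=30,
    mlm_probability=1.0, rng=random.Random(0),
)
print(labels)  # [-100, 10, 11, 12, -100]
```

With a probability close to 1, almost every real token becomes a target, so a loss near zero means the model can predict nearly all of them with almost no context.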

Am I missing something with regards to the setup? This just seems wrong in general, and not an issue with the dataset.

rgwatwormhill · October 31, 2020, 4:52pm

Maybe the model is almost perfect already (?)
What is mLm probability, and is it supposed to have a capital L?

Is your data very similar to the data originally used to pre-train BERT?

What value of loss do you get if you try to fine-tune using the data originally used to pre-train BERT?

Maybe there are a few differences, but BERT has encountered enough examples of each difference to learn the patterns within the first 150 steps.

fteufel · October 31, 2020, 5:19pm

Turned out it was a tokenization issue. The tokenizer for the model checkpoint I was using needed the do_lower_case=False flag, which is of course missing when running the provided script as it is. I didn’t realize that it was only training on [CLS, UNK, SEP] sequences, because with TPUs I had no intuition for the expected training speed.
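
A quick sanity check that would have caught this early is to measure how many tokens come out as [UNK] before training. The sketch below uses a toy whitespace tokenizer and a made-up vocabulary (toy_encode, unk_fraction, TOY_VOCAB and the ids are all hypothetical), assuming an upper-case-only vocabulary like the one implied by the fix above.

```python
from typing import Dict, Iterable, List, Sequence


def toy_encode(sequence: str, vocab: Dict[str, int], unk_id: int, lower_case: bool) -> List[int]:
    """Whitespace-split toy tokenizer; a stand-in for a real tokenizer."""
    text = sequence.lower() if lower_case else sequence
    return [vocab.get(tok, unk_id) for tok in text.split()]


def unk_fraction(batch_ids: Iterable[Sequence[int]], unk_id: int, special_ids: Iterable[int] = ()) -> float:
    """Fraction of non-special tokens that were mapped to the unknown token."""
    special = set(special_ids) - {unk_id}  # never filter out UNK itself
    total = 0
    unknown = 0
    for ids in batch_ids:
        for tok in ids:
            if tok in special:
                continue
            total += 1
            unknown += tok == unk_id
    return unknown / total if total else 0.0


# Assumed toy vocabulary containing only upper-case symbols.
TOY_VOCAB = {"[UNK]": 1, "M": 5, "K": 6, "V": 7, "L": 8}
sample = ["M K V L", "L L K M"]

lowered = [toy_encode(s, TOY_VOCAB, unk_id=1, lower_case=True) for s in sample]
cased = [toy_encode(s, TOY_VOCAB, unk_id=1, lower_case=False) for s in sample]

print(unk_fraction(lowered, unk_id=1))  # 1.0: every token is unknown
print(unk_fraction(cased, unk_id=1))    # 0.0
```

With a real Hugging Face tokenizer, the same function can be fed a few encoded lines together with tokenizer.unk_token_id and tokenizer.all_special_ids. If the fraction is close to 1, the model is only being asked to predict [UNK], which is trivially learnable and would explain a loss of zero regardless of the masking probability.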
