The loss plateau of pretraining BERT using run_mlm.py

Hi, I am trying to pretrain a BERT model from scratch on BookCorpusOpen and Wikipedia using run_mlm.py. One strange thing I noticed is that the loss curve always shows two separate drops. As the figure below shows (the blue curve), the loss first drops quickly and then sits on a plateau; after a while, it drops again and converges. (Neither of the two runs shown was trained to convergence; I show them because they are in the same figure and easy to compare. I do train other runs to convergence.) I expected a normal curve to drop smoothly. I have tested different hyper-parameter settings, such as the learning rate, batch size, warm-up, and dataset, and they all show the loss plateau, whether long or short. Finally, I found that pre-processing the dataset with the line_by_line option mitigates the plateau, as the green curve shows. The curve is still not smooth, but it matches what I expected much more closely. Does the line_by_line option really have that much influence? I wonder whether this is a common problem or whether something is wrong with my hyper-parameters.
[Figure: training loss curves; the blue curve (default preprocessing) plateaus before dropping again, the green curve (line_by_line) does not.]
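
For context, this is roughly what the two preprocessing modes in run_mlm.py do (a simplified sketch; the real script maps these steps over the dataset with datasets.map, and the toy texts here are only illustrative):

from itertools import chain
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
max_seq_length = 512
texts = ["One line of the corpus.", "Another line of the corpus."]

# Default mode: tokenize everything, concatenate, and cut into fixed-size
# blocks, so each training example is a dense 512-token sequence that
# usually spans sentence boundaries.
def group_texts(examples):
    concatenated = list(chain(*examples["input_ids"]))
    total_length = (len(concatenated) // max_seq_length) * max_seq_length
    return {
        "input_ids": [
            concatenated[i : i + max_seq_length]
            for i in range(0, total_length, max_seq_length)
        ]
    }

blocks = group_texts(tokenizer(texts))  # empty for this toy corpus, which is
                                        # shorter than one 512-token block

# --line_by_line mode: each line is its own example, truncated and padded
# independently, so the model mostly sees short sequences.
line_by_line = tokenizer(
    texts, truncation=True, max_length=max_seq_length, padding="max_length"
)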

The model has 6 layers, and the hyper-parameters I used are:

python src/run_mlm.py \
    --model_type bert \
    --tokenizer_name bert-base-uncased \
    --config_name model/bert_layer6_512/config.json \
    --do_train \
    --seed 42 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 16 \
    --learning_rate 1e-4 \
    --num_train_epochs 25 \
    --warmup_steps 8000 \
    --line_by_line  # this option affects the curve significantly
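
(For reference: per_device_train_batch_size 16 with gradient_accumulation_steps 16 gives an effective batch size of 16 × 16 = 256 sequences per device per optimizer step.)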

I ran into the exact same problem. I thought the initial learning rate was too large, so I decreased it to 2.6e-4, but the loss still plateaus around 6. Did you find the reason for this?

I also tried several hyper-parameter settings and have no idea of the exact reason either. But one thing I noticed is that the dataset influences the length of the plateau. If you use a small dataset like wikitext, or process a large dataset with line_by_line (an option in the run_mlm.py script), only a short plateau occurs 🤔.

I also ran into the exact same problem, and I likewise found that using the line_by_line argument resolves it.

I inspected the outputs of the data collator, and both methods seem to work properly, so the plateau is probably not the result of a bug. Maybe line_by_line means the model sees shorter sequences, which helps it learn faster, similar to https://aclanthology.org/2021.ranlp-1.112.pdf?
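
For anyone who wants to reproduce the check, this is roughly what I did (a minimal sketch; the toy sentences are just placeholders for real dataset rows):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Tokenize a few lines the way --line_by_line would.
texts = ["A short sentence.", "Another line from the corpus."]
features = [dict(tokenizer(t)) for t in texts]

batch = collator(features)
print(batch["input_ids"].shape)                 # (batch, padded sequence length)
print(tokenizer.decode(batch["input_ids"][0]))  # eyeball the [MASK] placement
print(batch["labels"][0])                       # -100 except at masked positions

In both modes the masked batches looked correct, which is why I suspect the sequence-length difference rather than a bug.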