Hi, I am trying to pretrain a BERT model from scratch on BookCorpusOpen and Wikipedia with run_mlm.py. One strange thing I have noticed is that the loss curve always shows two separate drops. As the figure below shows (the blue curve), the loss first drops quickly and reaches a plateau; after a while it drops again and converges. (The two curves shown are not trained to convergence; I picked them only because they are in the same figure and easy to compare. The runs I train to completion do converge.) I expected a normal loss curve to drop smoothly.

I have tested different hyper-parameter settings, such as the learning rate, batch size, warm-up, and dataset, and they all show the loss plateau, whether long or short. Finally, I found that pre-processing the dataset with the “line_by_line” parameter mitigates the plateau, as the green curve shows. The curve is still not perfectly smooth, but it matches what I expected much more closely. Does the “line_by_line” parameter really have such a big influence? I wonder whether this is a common issue or a problem with my hyper-parameters.
The model has 6 layers, and the hyper-parameters I used are:
python src/run_mlm.py \
--model_type bert \
--tokenizer_name bert-base-uncased \
--config_name model/bert_layer6_512/config.json \
--do_train \
--seed 42 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 16 \
--learning_rate 1e-4 \
--num_train_epochs 25 \
--warmup_steps 8000 \
--line_by_line  # this parameter affects the loss curve significantly
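For context, this is my (possibly wrong) understanding of what “line_by_line” changes in the preprocessing. It is only a rough sketch, not the actual run_mlm.py code; the toy sentences and the tiny block size are just to make the difference visible:

# Rough sketch of the two preprocessing modes (simplified, not the real run_mlm.py code).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
block_size = 16  # tiny value just for this toy example; normally --max_seq_length

lines = [
    "The quick brown fox jumps over the lazy dog.",
    "BERT is pretrained with masked language modeling.",
]

# With --line_by_line: each line becomes its own example, padded/truncated to block_size.
line_by_line_examples = tokenizer(
    lines, padding="max_length", truncation=True, max_length=block_size
)

# Without --line_by_line: tokenize everything, concatenate, then cut into
# fixed-size blocks, so one example can span sentence/document boundaries.
tokenized = tokenizer(lines)
concatenated = sum(tokenized["input_ids"], [])
total_length = (len(concatenated) // block_size) * block_size
grouped_examples = [
    concatenated[i : i + block_size] for i in range(0, total_length, block_size)
]

print(len(line_by_line_examples["input_ids"]), len(grouped_examples))

If my understanding is right, the default mode packs text from different sentences and documents into the same fixed-length example, while line_by_line keeps each line separate, which is the only difference between my two runs.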