Validation loss lower than training loss when further pretraining BERT?

Hi, I am currently further pretraining my own BERT model for the cooking domain. I chose the bert-base-uncased model as a starting point and use the run_mlm.py script to further pretrain it and adapt it to the cooking data. The data consists of approx. 2 million recipe instructions from the recipeNLG dataset, with 5% used as validation data.
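For reference, here is a minimal sketch of how such a line-by-line file could be produced; the file name full_dataset.csv and the "directions" column (stored as a stringified list of steps) are assumptions about the recipeNLG CSV, not my exact preprocessing:

import ast
import pandas as pd

# Hypothetical preprocessing: write one recipe instruction per line.
# File name and column name are assumptions about the recipeNLG release.
df = pd.read_csv("datasets/recipeNLG/full_dataset.csv")
with open("datasets/recipeNLG/recipeNLG_instructions.txt", "w", encoding="utf-8") as f:
    for raw in df["directions"]:
        for step in ast.literal_eval(raw):  # each row holds a list of instruction steps
            f.write(step.strip() + "\n")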
I use the following arguments for training:

!python run_mlm.py \
--model_name_or_path=bert-base-uncased \
--output_dir=CookBERT/further_pretraining/model_output \
--do_train \
--do_eval \
--validation_split_percentage=5 \
--train_file=datasets/recipeNLG/recipeNLG_instructions.txt \
--per_device_train_batch_size=16 \
--per_device_eval_batch_size=16 \
--gradient_accumulation_steps=2 \
--learning_rate=2e-5 \
--num_train_epochs=3 \
--save_total_limit=10 \
--save_strategy=steps \
--save_steps=1000 \
--line_by_line \
--max_seq_length=256 \
--evaluation_strategy=steps \
--eval_steps=1000

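For reference, with these arguments the effective batch size is 16 * 2 = 32 (per-device batch size times gradient accumulation steps, assuming a single GPU), so 3 epochs over roughly 1.9 million training examples comes out to roughly 180k optimizer steps. A quick back-of-the-envelope check:

# Rough step count implied by the arguments above (single GPU assumed).
train_examples = int(2_000_000 * 0.95)        # ~5% held out for validation
effective_batch = 16 * 2                      # per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = train_examples // effective_batch
print(effective_batch, steps_per_epoch, steps_per_epoch * 3)  # 32, 59375, 178125
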
The training process works fine, but I am just curious whether there is a good explanation for why the validation loss is lower than the training loss?
(screenshot: training vs. validation loss curves)

Any ideas are welcome :smiley:


Hi, I would say that it can be explained by the dropout layers present in the network.
Dropout is active only at training time.
During training, at each step dropout deactivates some units in the forward pass. This helps avoid overfitting, but it reduces the performance measured at that step.
During evaluation, on the other hand, dropout does not deactivate anything, so you are running the complete network, which can perform better even for the same model state.
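If you want to check this directly, here is a minimal sketch that scores the same masked batch once with the model in train() mode (dropout active) and once in eval() mode (dropout disabled). The checkpoint and the example sentences are just placeholders, and the exact numbers will vary with the random masking, but the eval-mode loss is typically lower:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

# Placeholder checkpoint; any BERT MLM checkpoint behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

texts = ["Preheat the oven to 180 degrees.", "Stir the sauce until it thickens."]
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
batch = collator([tokenizer(t) for t in texts])  # one fixed masked batch, reused below

model.train()                                    # dropout active, as during a training step
with torch.no_grad():
    loss_with_dropout = model(**batch).loss.item()

model.eval()                                     # dropout disabled, as during evaluation
with torch.no_grad():
    loss_without_dropout = model(**batch).loss.item()

print("with dropout:", loss_with_dropout, "without dropout:", loss_without_dropout)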
