Validation loss lower than training loss when further pretraining BERT?

Hi, I am currently further pretraining my own BERT model for the cooking domain. I chose the bert-base-uncased model as a starting point and use the run_mlm.py script to further pretrain it and adapt it to the cooking data. The data consists of approx. 2 million recipe instructions from the recipeNLG dataset, with 5% used as validation data.
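For reference, here is a minimal sketch of how such a line-by-line file could be produced; the file name full_dataset.csv and the "directions" column (stored as a stringified list of steps) are assumptions about the recipeNLG CSV, not my exact preprocessing:

import ast
import pandas as pd

# Hypothetical preprocessing: write one recipe instruction per line.
# File name and column name are assumptions about the recipeNLG release.
df = pd.read_csv("datasets/recipeNLG/full_dataset.csv")
with open("datasets/recipeNLG/recipeNLG_instructions.txt", "w", encoding="utf-8") as f:
    for raw in df["directions"]:
        for step in ast.literal_eval(raw):  # each row holds a list of instruction steps
            f.write(step.strip() + "\n")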
I use the following arguments for training:

!python run_mlm.py \
--model_name_or_path=bert-base-uncased \
--output_dir=CookBERT/further_pretraining/model_output \
--do_train \
--do_eval \
--validation_split_percentage=5 \
--train_file=datasets/recipeNLG/recipeNLG_instructions.txt \
--per_device_train_batch_size=16 \
--per_device_eval_batch_size=16 \
--gradient_accumulation_steps=2 \
--learning_rate=2e-5 \
--num_train_epochs=3 \
--save_total_limit=10 \
--save_strategy=steps \
--save_steps=1000 \
--line_by_line \
--max_seq_length=256 \
--evaluation_strategy=steps \
--eval_steps=1000

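For reference, with these arguments the effective batch size is 16 * 2 = 32 (per-device batch size times gradient accumulation steps, assuming a single GPU), so 3 epochs over roughly 1.9 million training examples comes out to roughly 180k optimizer steps. A quick back-of-the-envelope check:

# Rough step count implied by the arguments above (single GPU assumed).
train_examples = int(2_000_000 * 0.95)        # ~5% held out for validation
effective_batch = 16 * 2                      # per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = train_examples // effective_batch
print(effective_batch, steps_per_epoch, steps_per_epoch * 3)  # 32, 59375, 178125
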
The training process works fine, but I am just curious whether there is a good explanation for why the validation loss is lower than the training loss?
(screenshot: training vs. validation loss curves)

Any ideas are welcome :smiley:


Hi, I would say that it can be explained by the dropout layers present in the network.
Dropout is active only at training time.
During training, at each step dropout deactivates some units in the forward pass. This helps avoid overfitting, but it reduces the performance measured at that step.
During evaluation, on the other hand, dropout does not deactivate anything, so you are running the complete network, which can perform better even for the same model state.
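If you want to check this directly, here is a minimal sketch that scores the same masked batch once with the model in train() mode (dropout active) and once in eval() mode (dropout disabled). The checkpoint and the example sentences are just placeholders, and the exact numbers will vary with the random masking, but the eval-mode loss is typically lower:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

# Placeholder checkpoint; any BERT MLM checkpoint behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

texts = ["Preheat the oven to 180 degrees.", "Stir the sauce until it thickens."]
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
batch = collator([tokenizer(t) for t in texts])  # one fixed masked batch, reused below

model.train()                                    # dropout active, as during a training step
with torch.no_grad():
    loss_with_dropout = model(**batch).loss.item()

model.eval()                                     # dropout disabled, as during evaluation
with torch.no_grad():
    loss_without_dropout = model(**batch).loss.item()

print("with dropout:", loss_with_dropout, "without dropout:", loss_without_dropout)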
