run_mlm.py: Why does the eval_loss at the last epoch differ from the --do_eval eval_loss?

I’m using the example command line supplied with run_mlm.py (v4.4.1), but I’ve enabled more frequent logging as well as evaluation at the end of each epoch:

python run_mlm.py --model_name_or_path roberta-base \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--output_dir test-wikitest \
--do_train \
--do_eval \
--evaluation_strategy epoch \
--logging_steps 10 

By default, this is configured to train over 3 epochs. When I run this to completion, I find that the eval_loss calculated at the end of the last epoch is reported as:

[INFO|trainer.py:1775] 2021-03-18 18:16:19,116 >> ***** Running Evaluation *****
[INFO|trainer.py:1776] 2021-03-18 18:16:19,116 >>   Num examples = 496
[INFO|trainer.py:1777] 2021-03-18 18:16:19,116 >>   Batch size = 8
{'eval_loss': 1.2713477611541748, 'eval_runtime': 27.0394, 'eval_samples_per_second': 18.344, 'epoch': 3.0}

Immediately after this, the final evaluation (triggered by --do_eval) runs, but the eval_loss it reports is different:

[INFO|trainer.py:1775] 2021-03-18 18:16:47,957 >> ***** Running Evaluation *****
[INFO|trainer.py:1776] 2021-03-18 18:16:47,957 >>   Num examples = 496
[INFO|trainer.py:1777] 2021-03-18 18:16:47,957 >>   Batch size = 8
[INFO|trainer_pt_utils.py:650] 2021-03-18 18:17:15,018 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:17:15,018 >>   epoch                     =     3.0
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:17:15,018 >>   eval_loss                 =  1.2644
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:17:15,018 >>   eval_mem_cpu_alloc_delta  =     0MB
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:17:15,018 >>   eval_mem_cpu_peaked_delta =     0MB
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:17:15,018 >>   eval_mem_gpu_alloc_delta  =     0MB
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:17:15,018 >>   eval_mem_gpu_peaked_delta =  1584MB
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:17:15,018 >>   eval_runtime              = 26.8687
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:17:15,018 >>   eval_samples              =     496
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:17:15,018 >>   eval_samples_per_second   =   18.46
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:17:15,019 >>   perplexity                =  3.5409

I expected the two eval_loss values to be identical, since no training happens between the two back-to-back evaluations and the validation set is the same.

Further, when I invoke run_mlm.py with the trained model checkpoint and do only an evaluation (no training at all):

python run_mlm.py --model_name_or_path test-wikitest \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--output_dir test-wikitest-evalonly \
--do_eval 

I get yet another value for eval_loss:

[INFO|trainer.py:1775] 2021-03-18 18:46:12,713 >> ***** Running Evaluation *****
[INFO|trainer.py:1776] 2021-03-18 18:46:12,713 >>   Num examples = 496
[INFO|trainer.py:1777] 2021-03-18 18:46:12,714 >>   Batch size = 8
[INFO|trainer_pt_utils.py:650] 2021-03-18 18:46:35,544 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:46:35,545 >>   eval_loss                 =   1.287
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:46:35,545 >>   eval_mem_cpu_alloc_delta  =     0MB
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:46:35,545 >>   eval_mem_cpu_peaked_delta =     0MB
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:46:35,545 >>   eval_mem_gpu_alloc_delta  =     0MB
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:46:35,545 >>   eval_mem_gpu_peaked_delta =  1584MB
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:46:35,545 >>   eval_runtime              = 22.6793
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:46:35,545 >>   eval_samples              =     496
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:46:35,545 >>   eval_samples_per_second   =   21.87
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:46:35,545 >>   init_mem_cpu_alloc_delta  =     0MB
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:46:35,545 >>   init_mem_cpu_peaked_delta =     0MB
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:46:35,545 >>   init_mem_gpu_alloc_delta  =   476MB
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:46:35,545 >>   init_mem_gpu_peaked_delta =     0MB
[INFO|trainer_pt_utils.py:655] 2021-03-18 18:46:35,545 >>   perplexity                =  3.6218

Can anyone tell me what the source of the non-determinism is?

Thanks in advance!

The masking is done randomly: the data collator (DataCollatorForLanguageModeling) re-draws the masked positions every time it builds a batch, so each evaluation pass sees a different set of masked tokens. That's why you get those slightly different losses.
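
For anyone curious, here's a minimal sketch that makes the dynamic masking visible (assuming transformers is installed and roberta-base can be downloaded). Calling the collator twice on the same example generally yields different masked positions, while fixing the torch RNG state reproduces them, which points at the collator's random sampling as the source of the non-determinism:

import torch
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# One tokenized example; the input to the collator is identical both times.
example = tokenizer("The quick brown fox jumps over the lazy dog.")
features = [{"input_ids": example["input_ids"]}]

# Two back-to-back calls: the positions replaced by tokenizer.mask_token_id
# generally differ, just like two back-to-back evaluations in run_mlm.py.
batch1 = collator(features)
batch2 = collator(features)
print(batch1["input_ids"])
print(batch2["input_ids"])

# Fixing the RNG state before each call reproduces the same mask,
# confirming the randomness comes from the collator's mask sampling.
torch.manual_seed(0)
batch3 = collator(features)
torch.manual_seed(0)
batch4 = collator(features)
print(torch.equal(batch3["input_ids"], batch4["input_ids"]))  # True

Since the masked tokens differ between passes, the per-token losses are computed on different prediction targets each time, and the aggregate eval_loss wobbles accordingly even though the model weights and validation set are unchanged.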

D’oh! Yes, you are right of course. Thanks!