Why do the F1 and accuracy scores vary when I run the run_glue.py script from Hugging Face's Transformers library for the BERT-base model on the MRPC task, depending on how many GPUs I use?

The expected results from examples/pytorch/text-classification/README.md are:

MRPC F1/Accuracy 88.85/84.07

But when I run it on 4 V100 GPUs, I get these results:

***** eval metrics *****
  epoch                   =        5.0
  eval_accuracy           =     0.7966
  eval_combined_score     =     0.8251
  eval_f1                 =     0.8536
  eval_loss               =     0.4435
  eval_runtime            = 0:00:01.42
  eval_samples            =        408
  eval_samples_per_second =    287.176
  eval_steps_per_second   =       9.15
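
For reference, the run looked roughly like the command below. The exact invocation is not shown above, so the hyperparameters here are assumptions based on the example README (except for the epoch count, which matches the `epoch = 5.0` in the log). Note that `--per_device_train_batch_size` applies per GPU, so the effective training batch size on 4 GPUs is four times that of a single-GPU run.

```bash
# Assumed invocation -- the exact command used is not shown in the post.
# Hyperparameters follow examples/pytorch/text-classification/README.md,
# launched on 4 GPUs via torchrun.
torchrun --nproc_per_node=4 run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name mrpc \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 5 \
  --output_dir /tmp/mrpc_output/
```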