Why do the F1 and accuracy scores change with the number of GPUs when I run the run_glue.py script from Hugging Face's Transformers library to fine-tune BERT-base on the MRPC task? (The quoted baseline, the F1 metric, and the 408 eval samples all correspond to MRPC, not MNLI.)
From `examples/pytorch/text-classification/README.md`, the reported single-GPU baseline is:

```
MRPC  F1/Accuracy  88.85/84.07
```
But when I run the same script on 4 V100 GPUs, I get the following results:
```
***** eval metrics *****
  epoch                   =        5.0
  eval_accuracy           =     0.7966
  eval_combined_score     =     0.8251
  eval_f1                 =     0.8536
  eval_loss               =     0.4435
  eval_runtime            = 0:00:01.42
  eval_samples            =        408
  eval_samples_per_second =    287.176
  eval_steps_per_second   =       9.15
```
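My current guess (an assumption on my part, not something the README states about this run) is that under data-parallel training the Trainer multiplies the per-device batch size by the number of processes, so the effective batch size per optimizer step, and hence the optimization trajectory, differs between the two runs. A minimal sketch, assuming the README's per-device batch size of 32 (`effective_batch_size` is an illustrative helper, not a Transformers function):

```python
# Effective train batch size per optimizer step under data-parallel training:
# each of the N processes contributes its own per-device batch.
def effective_batch_size(per_device: int, num_gpus: int, grad_accum: int = 1) -> int:
    return per_device * num_gpus * grad_accum

# Assuming per_device_train_batch_size=32 as in the README example:
print(effective_batch_size(32, 1))  # single-GPU run -> 32
print(effective_batch_size(32, 4))  # 4 x V100 run   -> 128
```

If that is what is happening, the 4-GPU run effectively trains with a 4x larger batch while keeping the same learning rate, which could account for the lower F1/accuracy. Is that the expected behavior, and should I rescale the learning rate or the per-device batch size to reproduce the README numbers?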