Why do the F1 and accuracy scores vary when I run the run_glue.py script from Hugging Face's Transformers library for the BERT-base model on the MRPC task, depending on how many GPUs I use?

The expected results from examples/pytorch/text-classification/README.md are:

MRPC F1/Accuracy 88.85/84.07

But when I run it on 4 V100 GPUs, I get these results:

***** eval metrics *****
  epoch                   =        5.0
  eval_accuracy           =     0.7966
  eval_combined_score     =     0.8251
  eval_f1                 =     0.8536
  eval_loss               =     0.4435
  eval_runtime            = 0:00:01.42
  eval_samples            =        408
  eval_samples_per_second =    287.176
  eval_steps_per_second   =       9.15
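
For reference, the run looked roughly like the command below. The exact invocation is not shown above, so the hyperparameters here are assumptions based on the example README (except for the epoch count, which matches the `epoch = 5.0` in the log). Note that `--per_device_train_batch_size` applies per GPU, so the effective training batch size on 4 GPUs is four times that of a single-GPU run.

```bash
# Assumed invocation -- the exact command used is not shown in the post.
# Hyperparameters follow examples/pytorch/text-classification/README.md,
# launched on 4 GPUs via torchrun.
torchrun --nproc_per_node=4 run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name mrpc \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 5 \
  --output_dir /tmp/mrpc_output/
```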