Getting different results on different hardware

I have been observing that transformers quite often produces qualitatively different outcomes on different hardware, despite using the same seed and the same configuration.

Shouldn’t identical code with a fixed seed and identical configuration (batch_size, etc.) produce identical results no matter the hardware?
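
For concreteness, this is roughly the kind of seeding I have in mind (a sketch; `seed_everything` is just my name for it, and I believe `transformers.set_seed` does something similar for the RNG part):

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    # Seed every RNG the training loop touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels; this makes reruns repeatable
    # on the *same* GPU, but not necessarily bit-identical across GPU models.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```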

For example, here @sshleifer gets a BLEU score of 27.65 on his PR branch, whereas I get 27.84. The only difference is hardware.

Another example: we have been battling to find hparams that will make CI happy with the pl_glue_run.py test - I was getting acc/f1 of 1.0 on my hardware, but CI was getting 0.5, despite multiple attempts to improve it. Two attempts were made (1, 2), but the test currently still only checks acc > 0.25, so really we are just testing that the example runs.

I have seen a lot of other examples; these are just two recent ones I could easily point to.

Perhaps some of you have practical insights into how this can be improved.

Thank you.

In my experience, different GPUs will often give different results in basic operations, starting around the 6th digit after the decimal point. That may be enough to explain the difference?
Then maybe the number of GPUs used / the total batch size?
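
To illustrate (a toy sketch, not specific to any GPU): float32 addition is not associative, so two mathematically identical reductions over the same data that merely run in a different order can already disagree in the low digits, which is exactly the kind of reordering different GPU kernels do internally:

```python
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000, dtype=torch.float32)

# Two mathematically identical sums, computed in different orders.
a = x.sum().item()
b = sum(chunk.sum() for chunk in x.chunk(13)).item()

print(a, b)  # typically equal only up to ~6-7 significant digits
```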

Ah, that’s a great insight, @sgugger - thank you!

I know you mentioned this is based on your experience, but perhaps you could recommend a good document to read about this?

This should only happen across different GPUs, right? I.e., different CPUs won’t have this difference?

Batch size is identical.

Indeed, a different number of GPUs would make such an impact - but it should be easy to restrict the run to, say, one GPU for comparison purposes.
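
For example, something like this (or the equivalent `CUDA_VISIBLE_DEVICES=0 python run.py` on the command line) should take the multi-GPU reduction order out of the comparison:

```python
import os

# Must be set before torch initializes CUDA, or it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

assert torch.cuda.device_count() == 1  # only one GPU is visible now
```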

The intention is to try to replicate the CI environment, by purchasing an identical GPU, so that the test results shown locally would also be the same on CI.

Not sure what GPU setup the CI uses - maybe @sshleifer knows (for the slow tests only; the normal tests don’t use any GPU). I know I got different results from the examples in the docs, for instance (with the TensorFlow outputs that show lots of digits), but I don’t have any document in mind; it’s just from personal experience.

The GitHub Actions slow CI uses a V100 that we rent from AWS. I even know how to ssh into it!
