Getting different results on different hardware

I have been observing that transformers quite often produces qualitatively different outcomes on different hardware, despite using the same seed and the same configuration.

Shouldn’t identical code with a fixed seed and identical configuration (batch_size, etc.) produce identical results no matter the hardware?
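
For concreteness, this is roughly the kind of seeding I have in mind (a sketch; `seed_everything` is just my name for it, and I believe `transformers.set_seed` does something similar for the RNG part):

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    # Seed every RNG the training loop touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels; this makes reruns repeatable
    # on the *same* GPU, but not necessarily bit-identical across GPU models.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```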

For example, here @sshleifer gets a BLEU score of 27.65 on his PR branch, whereas I get 27.84. The only difference is hardware.

Another example: we have been battling to find hparams that will make CI happy with the pl_glue_run.py test - I was getting acc/f1 of 1.0 on my hardware, but CI was getting 0.5, despite multiple attempts to improve it. Two attempts were made (1, 2), but the test currently still only checks acc > 0.25, so really we are just testing that the example runs.

I have seen a lot of other examples; these are just two recent ones I could easily point to.

Perhaps some of you have practical insights into how this can be improved.

Thank you.

In my experience, different GPUs will often give different results in basic operations, starting around the 6th digit after the decimal point. That may be enough to explain the difference?
Then maybe the number of GPUs used / the total batch size?
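
To illustrate (a toy sketch, not specific to any GPU): float32 addition is not associative, so two mathematically identical reductions over the same data that merely run in a different order can already disagree in the low digits, which is exactly the kind of reordering different GPU kernels do internally:

```python
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000, dtype=torch.float32)

# Two mathematically identical sums, computed in different orders.
a = x.sum().item()
b = sum(chunk.sum() for chunk in x.chunk(13)).item()

print(a, b)  # typically equal only up to ~6-7 significant digits
```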

Ah, that’s a great insight, @sgugger - thank you!

I know you mentioned this is based on your experience, but perhaps you could recommend a good document to read about this?

This should only happen across different GPUs, right? I.e., different CPUs won’t have this difference?

Batch size is identical.

Indeed, a different number of GPUs would make such an impact - but it should be easy to restrict the run to, say, one GPU for comparison purposes.
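
For example, something like this (or the equivalent `CUDA_VISIBLE_DEVICES=0 python run.py` on the command line) should take the multi-GPU reduction order out of the comparison:

```python
import os

# Must be set before torch initializes CUDA, or it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

assert torch.cuda.device_count() == 1  # only one GPU is visible now
```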

The intention is to try to replicate the CI environment, by purchasing an identical GPU, so that the test results shown locally would also be the same on CI.

Not sure what GPU setup the CI uses - maybe @sshleifer knows (for the slow tests only; the normal tests don’t use any GPU). I know I got different results from the examples in the docs, for instance (with the TensorFlow outputs that show lots of digits), but I don’t have any document in mind; it’s just from personal experience.

The GitHub Actions slow CI uses a V100 that we rent from AWS. I even know how to ssh into it!
