Multi-GPU is slower than single GPU when running examples

I have a machine with three RTX 3090s and have been using accelerate with lm_eval to speed up inference, with sensible results.

I wrote some custom training scripts using accelerate but noticed roughly a 3x slowdown compared to the single-GPU case. While debugging, I tried nlp_example.py and saw a significant slowdown between the single-GPU and multi-GPU runs there too.

I found this topic describing a similar slowdown, but since I do see performance gains with accelerate and lm_eval, I doubt it is a CUDA/PyTorch version incompatibility.

For reference, when running nlp_example.py on one GPU I get 44 seconds total for three epochs, while the multi-GPU case takes 4 minutes.
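Not part of my original debugging, but a minimal sketch of how one might confirm the bottleneck is inter-GPU communication rather than compute: time a bare all_reduce across the GPUs. The script name and timings are illustrative; it assumes PyTorch with the NCCL backend and launching via accelerate launch or torchrun so the process-group environment variables are set.

```python
# allreduce_bench.py (hypothetical name) - probe inter-GPU communication speed.
# Launch with e.g.: accelerate launch --num_processes 3 allreduce_bench.py
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# A reasonably large fp32 tensor (~256 MB) so transfer time dominates latency.
x = torch.randn(64 * 1024 * 1024, device="cuda")

# Warm up NCCL before timing.
for _ in range(5):
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.time() - start) / iters

if dist.get_rank() == 0:
    gb = x.numel() * x.element_size() / 1e9
    print(f"all_reduce of {gb:.2f} GB took {elapsed * 1000:.1f} ms per call")
dist.destroy_process_group()
```

If a single all_reduce of this size takes hundreds of milliseconds, the interconnect is the problem, not the training code.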

I’m happy to provide any other information (package versions and CUDA versions) if that is needed.

If anyone else runs into this issue: the fix turned out to be a BIOS-level change to remove the communication overhead. Specifically, changing the link speed on the PCIe ports from Gen 1 to Gen 4 resulted in the expected speedups when fine-tuning on multiple GPUs!
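For anyone who wants to verify the link speed from software before digging into the BIOS, here is a hedged sketch using the pynvml bindings (assumes the nvidia-ml-py package is installed). Note that idle GPUs may report a lower current generation because of power saving, so compare against the maximum or check while the GPUs are under load.

```python
# Hypothetical check of PCIe link generation and width via NVML (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    cur_gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
    max_gen = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(handle)
    cur_width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
    max_width = pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle)
    # Idle GPUs can downshift the link for power saving, so current < max is not
    # necessarily a problem unless it persists under load.
    print(f"GPU {i}: PCIe gen {cur_gen}/{max_gen}, width x{cur_width}/x{max_width}")
pynvml.nvmlShutdown()
```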
