I have a machine with three RTX 3090s and have been using accelerate with lm_eval to speed up inference, with sensible results.
I wrote some custom training scripts using accelerate but noticed roughly a 3x slowdown vs. the single-GPU case. While debugging, I decided to try nlp_example.py and I'm seeing a significant slowdown between the single-GPU and multi-GPU cases there too.
I've found this topic describing a similar slowdown, but since I do see performance gains with accelerate and lm_eval, I doubt it is a CUDA/PyTorch version incompatibility.
For reference, running nlp_example.py on one GPU takes about 44 seconds total for three epochs, while the multi-GPU case takes about 4 minutes.
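For reproducibility, these are roughly the commands I'm comparing; the exact flags (e.g. `--num_processes`) are my best guess at a standard setup, not necessarily what matters for the slowdown:

```shell
# single-GPU baseline: restrict visibility to one card and run the script directly
CUDA_VISIBLE_DEVICES=0 python nlp_example.py

# multi-GPU case: launch one process per 3090 via accelerate
accelerate launch --multi_gpu --num_processes 3 nlp_example.py
```

(`accelerate config` was run beforehand; I can share the generated config file as well.)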
I'm happy to provide any other information (package versions, CUDA version, etc.) if needed.