Why is Trainer only using 1 GPU (not 4)?

The Transformers Trainer is only using 1 of my 4 available GPUs. Why is that?
Hi, I've set CUDA_VISIBLE_DEVICES=0,1,2,3 and torch.cuda.device_count() reports 4. But when I run my Trainer, nvtop shows that only GPU 0 is computing anything. I would expect all 4 GPU usage bars in the screenshot below to be pegged, but devices 1-3 sit at 0% usage:

[nvtop screenshot: GPU 0 busy, GPUs 1-3 at 0% utilization]
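For reference, this is roughly the check I'm doing (a minimal sketch; in my actual setup the environment variable is exported in the shell before launch):

```python
import os

# Exported in my shell before starting Python; shown here for completeness.
# It has to be set before torch initializes CUDA to have any effect.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3")

import torch

print(torch.cuda.device_count())  # prints 4 for me
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```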

I even tried manually setting trainer.args._n_gpu = 4 (rather than leaving it at whatever the Trainer had set), but it had no effect.
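Concretely, that attempt looked like the following (a minimal sketch using a fresh TrainingArguments; in the notebook I poked the same field on the already-built trainer via trainer.args, and _n_gpu is a private attribute, so I knew this was a hack):

```python
from transformers import TrainingArguments

args = TrainingArguments(output_dir="out")  # "out" is just a placeholder dir
print(args.n_gpu)  # what the Trainer machinery thinks it has to work with

args._n_gpu = 4    # the manual override I tried; nvtop utilization unchanged
print(args.n_gpu)  # the public property appears to read this field back
```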

Someone will ask to see the full code for what I’m doing, which is understandable. It is a clone of this Kaggle notebook: Music Genre Classification with Wav2Vec2 | Kaggle

PS - In searching for answers, I notice people asking how to limit the number of GPUs to 1, but I'm trying to do the opposite: use all the GPUs! Or if using all visible GPUs is supposed to be the default, why isn't that happening here?

Does the Hugging Face Trainer (or the models themselves) not automatically invoke DistributedDataParallel? If that omission is the issue, fair enough; I just don't see any usage of "parallel" or "accelerate" anywhere in the notebook's code. So perhaps I need to add that manually, something like the sketch below?
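If I do need to set it up myself, my understanding (an assumption on my part, not anything stated in the notebook) is that DDP wants one process per GPU, launched with something like torchrun --nproc_per_node=4 train.py or accelerate launch --num_processes 4 train.py, where train.py is a hypothetical script export of the notebook. Here's a small check I'd drop into the script to see whether a distributed run is actually active:

```python
import os

import torch.distributed as dist

# RANK / WORLD_SIZE / LOCAL_RANK are the standard environment variables that
# launchers such as torchrun set; under a plain `python train.py` run they
# are absent, so this doubles as a "did my launcher work" check.
print("RANK =", os.environ.get("RANK"))
print("WORLD_SIZE =", os.environ.get("WORLD_SIZE"))
print("distributed initialized:", dist.is_available() and dist.is_initialized())
```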

I also tried taking a look at the accelerate example notebook for SimpleNLP, but it crashes with SIGSEGV for me.