The Transformers Trainer only uses 1 of my 4 available GPUs. Why is that?
Hi, I’ve set `CUDA_VISIBLE_DEVICES=0,1,2,3`, and `torch.cuda.device_count()` reports 4. But when I run my Trainer, nvtop shows that only GPU 0 is doing any computation. I’d expect all 4 GPU usage bars in the screenshot below to be maxed out, but devices 1–3 sit at 0%:
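Concretely, this is the kind of check I mean (just a minimal sketch of what I'm doing, not the full script):

```python
import os

# Expose all four GPUs before anything touches CUDA
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

import torch

print(torch.cuda.device_count())   # -> 4, so PyTorch clearly sees all of them
```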
I even tried manually setting `trainer.args._n_gpu = 4` (rather than leaving it at its default), but it had no effect.
I know someone will ask to see the full code, which is fair. It’s a clone of this Kaggle notebook: Music Genre Classification with Wav2Vec2 | Kaggle
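Simplified, the relevant part of my script looks like this (the model and dataset setup follow the notebook; the names and argument values here are just illustrative):

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-genre",       # illustrative path
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,                         # Wav2Vec2 classifier built as in the notebook
    args=training_args,
    train_dataset=train_dataset,         # prepared as in the notebook
    eval_dataset=eval_dataset,
)

trainer.args._n_gpu = 4                  # the manual override mentioned above; no effect
trainer.train()                          # nvtop: only GPU 0 shows any load
```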
PS: While searching for answers, I mostly found people asking how to limit the number of GPUs to 1, but I’m trying to do the opposite and use all of them. Or is using all GPUs supposed to be the default?
Do the HuggingFace models not invoke DistributedDataParallel automatically? If that’s the missing piece, that would be understandable. I don’t see “parallel” or “accelerate” used anywhere in the code, so perhaps I need to add that manually (see the quick check below)?
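In case it’s relevant, a quick diagnostic from inside the script (run with a plain `python train.py`; the script name is mine) suggests no distributed process group ever gets created:

```python
import os
import torch
import torch.distributed as dist

print("WORLD_SIZE env:", os.environ.get("WORLD_SIZE"))   # None when launched as a single plain process
print("dist initialized:", dist.is_initialized())        # False
print("visible GPUs:", torch.cuda.device_count())        # 4
```

So maybe the missing piece is launching the script with something like `torchrun` or `accelerate launch` instead of plain `python`? I haven’t tried that yet.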