How to restrict training to one GPU if multiple are available?

I have multiple GPUs available in my environment, but I am only trying to train on one GPU.

It looks like the default setting local_rank=-1 will turn off distributed training.

However, I'm a bit confused by the latest version of the code.

If local_rank is -1, then I would expect n_gpu to be one, but it's being set to torch.cuda.device_count(), while the device is being set to cuda:0.
And if local_rank is anything else, n_gpu is being set to one. I was thinking maybe the meaning of local_rank had changed, but looking at the main training code, it doesn't look like it.
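
For reference, here is a minimal sketch of the device/n_gpu selection logic described above (a paraphrase, not the exact transformers source; the function name setup_device is illustrative):

import torch

def setup_device(local_rank: int):
    if local_rank == -1:
        # Non-distributed run: use the first visible GPU (or fall back to CPU),
        # but count all visible GPUs so the model can be wrapped in DataParallel.
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        n_gpu = torch.cuda.device_count()
    else:
        # Distributed run: each process is pinned to exactly one GPU.
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
        n_gpu = 1
    return device, n_gpu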


You can use the CUDA_VISIBLE_DEVICES environment variable to control which GPUs are visible to the command you run. For instance:

# Only make GPUs #0 and #1 visible to the python script
CUDA_VISIBLE_DEVICES=0,1 python train.py <args>
# Only make GPU #3 visible to the script
CUDA_VISIBLE_DEVICES=3 python train.py <args>
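
The same effect can be achieved from inside the script itself, as long as the variable is set before CUDA is initialized (a sketch, assuming no CUDA call has been made yet):

import os
# Must be set before torch initializes CUDA (i.e. before any CUDA call)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # now reports 1 even on a multi-GPU machine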

Do you have any suggestions for the case when setting CUDA_VISIBLE_DEVICES is not an option?

UPD: Setting trainer.args._n_gpu = 1 worked in my case, but it seems wrong to reassign a property, especially an underscore-prefixed one.
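
For context, the workaround looks roughly like this (_n_gpu is an internal TrainingArguments attribute and may change between versions; model and dataset are assumed to be defined elsewhere):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir="out")
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
trainer.args._n_gpu = 1   # force single-GPU training even if more GPUs are visible
trainer.train()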


Same problem here. I upgraded my transformers package, and suddenly the Trainer started running on multiple GPUs without being asked, even on GPUs that were occupied by other processes, and then hit an OOM error.

This worked perfectly for me and was exactly what I was looking for. It needs to match the GPU ID specified in:

device = torch.device("cuda:0")  # to use GPU ID 0 only
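
Note that the ID refers to the GPUs that remain visible to the process, so when launching with CUDA_VISIBLE_DEVICES=3 the single visible GPU is re-indexed as cuda:0 inside the process (a sketch; model is assumed to be defined earlier in the script):

# Launched as: CUDA_VISIBLE_DEVICES=3 python train.py
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)  # the only visible GPU, i.e. physical GPU #3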

Setting CUDA_VISIBLE_DEVICES=0 did not work for me. It seems to get lost trying to find and match a device ID between 0 and n_gpus, and the error message suggests reporting the bug to PyTorch.