I have multiple GPUs available in my environment, but I am only trying to train on one GPU.
It looks like the default setting local_rank=-1 will turn off distributed training.
However, I'm a bit confused by the latest version of the code.
If local_rank == -1, then I would expect n_gpu to be one, but it's being set to torch.cuda.device_count(), while the device is being set to cuda:0.
And if local_rank is anything else, n_gpu is being set to one. I was thinking maybe the meaning of local_rank had changed, but looking at the main training code, it doesn't look like it.
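To make sure I'm reading it right, here is roughly what I understand the setup to be doing (a paraphrased sketch of the logic I described above, not the exact code from the repo):

import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

if args.local_rank == -1:
    # Non-distributed: the single process sees every visible GPU,
    # and the device defaults to cuda:0
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    n_gpu = torch.cuda.device_count()
else:
    # Distributed: each process is pinned to its own GPU, so n_gpu is 1
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)
    n_gpu = 1
    torch.distributed.init_process_group(backend="nccl")

print(device, n_gpu)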
You can use the CUDA_VISIBLE_DEVICES environment variable to indicate which GPUs should be visible to the command you run. For instance:
# Only make GPUs #0 and #1 visible to the python script
CUDA_VISIBLE_DEVICES=0,1 python train.py <args>
# Only make GPU #3 visible to the script
CUDA_VISIBLE_DEVICES=3 python train.py <args>
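Inside the script, the visible GPUs are re-indexed from zero, so with CUDA_VISIBLE_DEVICES=3 PyTorch sees a single device and the local_rank=-1 branch ends up with n_gpu == 1 anyway. A quick way to check (a small illustrative snippet, not part of the training script):

# Run as: CUDA_VISIBLE_DEVICES=3 python check_gpus.py
import torch

print(torch.cuda.device_count())    # 1 -- only the masked-in GPU is visible
print(torch.cuda.current_device())  # 0 -- it is re-indexed as cuda:0 inside this process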