I have multiple GPUs available in my environment, but I am only trying to train on one GPU.
It looks like the default setting local_rank=-1 will turn off distributed training.
However, I'm a bit confused by the latest version of the code:
def _setup_devices(self) -> Tuple["torch.device", int]:
    logger.info("PyTorch: setting up devices")
    if self.no_cuda:
        device = torch.device("cpu")
        n_gpu = 0
    elif is_torch_tpu_available():
        device = xm.xla_device()
        n_gpu = 0
    elif self.local_rank == -1:
        # if n_gpu is > 1 we'll use nn.DataParallel.
        # If you only want to use a specific subset of GPUs use `CUDA_VISIBLE_DEVICES=0`
        # Explicitly set CUDA to the first (index 0) CUDA device, otherwise `set_device` will
        # trigger an error that a device index is missing. Index 0 takes into account the
        # GPUs available in the environment, so `CUDA_VISIBLE_DEVICES=1,2` with `cuda:0`
        # will use the first GPU in that env, i.e. GPU#1
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        n_gpu = torch.cuda.device_count()
    else:
        # Here, we'll use torch.distributed.
        torch.distributed.init_process_group(backend="nccl")
        device = torch.device("cuda", self.local_rank)
        n_gpu = 1
If local_rank == -1, I imagined that n_gpu would be one, but it's being set to torch.cuda.device_count(), while the device is being set to cuda:0.
And if local_rank is anything else, n_gpu is being set to one. I was thinking maybe the meaning of local_rank had changed, but looking at the main training code, it doesn't look like it:
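To check my reading of that branch, here is a minimal standalone sketch of the same selection logic (plain Python, no torch; the function name and signature are mine, not from the library):

```python
def pick_device(local_rank, cuda_available, n_visible_gpus):
    """Mirror the quoted _setup_devices branches (CPU/TPU cases simplified)."""
    if not cuda_available:
        return "cpu", 0
    if local_rank == -1:
        # Single-process mode: the device is cuda:0, but n_gpu counts every
        # visible GPU so the Trainer can wrap the model in nn.DataParallel.
        return "cuda:0", n_visible_gpus
    # torch.distributed mode: one process per GPU, so n_gpu is always 1.
    return f"cuda:{local_rank}", 1

print(pick_device(local_rank=-1, cuda_available=True, n_visible_gpus=4))
# -> ('cuda:0', 4): one device string, but DataParallel over all four GPUs
```

So n_gpu = device_count() in the local_rank == -1 branch is what pulls every visible GPU into training, even though only cuda:0 is named.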
def _get_train_sampler(self) -> Optional[torch.utils.data.sampler.Sampler]:
    if isinstance(self.train_dataset, torch.utils.data.IterableDataset):
        return None
    elif self.args.local_rank == -1:
        return RandomSampler(self.train_dataset)
    else:
        return DistributedSampler(self.train_dataset)
def get_train_dataloader(self) -> DataLoader:
    """
    Returns the training :class:`~torch.utils.data.DataLoader`.

    Will use no sampler if :obj:`self.train_dataset` is a :obj:`torch.utils.data.IterableDataset`, a random sampler
    (adapted to distributed training if necessary) otherwise.
    """
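So the sampler choice keys off the same local_rank flag as the device setup; a toy mirror of that decision (the strings stand in for the real sampler classes, purely for illustration):

```python
def choose_sampler(is_iterable_dataset, local_rank):
    """Mirror of the quoted _get_train_sampler logic."""
    if is_iterable_dataset:
        # An IterableDataset yields items in its own order; no sampler applies.
        return None
    # local_rank == -1 means single-process training -> plain random sampling;
    # otherwise each distributed worker needs its own DistributedSampler shard.
    return "RandomSampler" if local_rank == -1 else "DistributedSampler"
```

In other words, local_rank still distinguishes "no torch.distributed" (-1) from "this process's rank" (>= 0), same as before.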
You can use the CUDA_VISIBLE_DEVICES environment variable to control which GPUs are visible to the command you run. For instance:
# Only make GPUs #0 and #1 visible to the python script
CUDA_VISIBLE_DEVICES=0,1 python train.py <args>
# Only make GPU #3 visible to the script
CUDA_VISIBLE_DEVICES=3 python train.py <args>
Do you have any suggestions for the case when setting CUDA_VISIBLE_DEVICES is not an option?
UPD: This worked in my case:
trainer.args._n_gpu = 1
but it seems wrong to reassign a property, especially a private one.
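If reassigning the private attribute feels fragile, one alternative is to set CUDA_VISIBLE_DEVICES from inside the script itself, before torch initializes CUDA. This is only a sketch: the import order is the important part, and it assumes nothing in the process has touched CUDA yet.

```python
import os

# Must run before torch (or anything that creates a CUDA context) is imported:
# the driver reads CUDA_VISIBLE_DEVICES once, when the context is created,
# so setting it later in the process has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# import torch  # only import torch after the variable is set
```

With only one GPU visible, torch.cuda.device_count() returns 1 and the local_rank == -1 branch naturally sets n_gpu = 1, without touching trainer.args._n_gpu.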