How to restrict training to one GPU if multiple are available?

I have multiple GPUs available in my environment, but I am only trying to train on one GPU.

It looks like the default setting local_rank=-1 will turn off distributed training.

However, I'm a bit confused by the latest version of the code.

If local_rank is -1, then I would expect n_gpu to be one, but it's being set to torch.cuda.device_count(), while the device is being set to cuda:0.
And if local_rank is anything else, n_gpu is being set to one. I was thinking maybe the meaning of local_rank had changed, but looking at the main training code, it doesn't look like it.
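
For reference, here is a minimal sketch of the device/n_gpu selection logic described above (a paraphrase, not the exact transformers source; the function name setup_device is illustrative):

import torch

def setup_device(local_rank: int):
    if local_rank == -1:
        # Non-distributed run: use the first visible GPU (or fall back to CPU),
        # but count all visible GPUs so the model can be wrapped in DataParallel.
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        n_gpu = torch.cuda.device_count()
    else:
        # Distributed run: each process is pinned to exactly one GPU.
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
        n_gpu = 1
    return device, n_gpu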


You can use the CUDA_VISIBLE_DEVICES environment variable to control which GPUs are visible to the command you run. For instance:

# Only make GPUs #0 and #1 visible to the python script
CUDA_VISIBLE_DEVICES=0,1 python train.py <args>
# Only make GPU #3 visible to the script
CUDA_VISIBLE_DEVICES=3 python train.py <args>
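
The same effect can be achieved from inside the script itself, as long as the variable is set before CUDA is initialized (a sketch, assuming no CUDA call has been made yet):

import os
# Must be set before torch initializes CUDA (i.e. before any CUDA call)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # now reports 1 even on a multi-GPU machine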

Do you have any suggestions for the case when setting CUDA_VISIBLE_DEVICES is not an option?

UPD: Setting trainer.args._n_gpu = 1 worked in my case, but it seems wrong to reassign a property, especially an underscore-prefixed one.
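
For context, the workaround looks roughly like this (_n_gpu is an internal TrainingArguments attribute and may change between versions; model and dataset are assumed to be defined elsewhere):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir="out")
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
trainer.args._n_gpu = 1   # force single-GPU training even if more GPUs are visible
trainer.train()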


Same problem here. I upgraded my transformers package, and suddenly the Trainer started running on multiple GPUs without being asked, even on GPUs that were occupied by other processes, and then hit an OOM error.

This worked perfectly for me and was exactly what I was looking for. It needs to match the GPU ID specified in:

device = torch.device("cuda:0")  # to use GPU ID 0 only
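
Note that the ID refers to the GPUs that remain visible to the process, so when launching with CUDA_VISIBLE_DEVICES=3 the single visible GPU is re-indexed as cuda:0 inside the process (a sketch; model is assumed to be defined earlier in the script):

# Launched as: CUDA_VISIBLE_DEVICES=3 python train.py
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)  # the only visible GPU, i.e. physical GPU #3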

Setting CUDA_VISIBLE_DEVICES=0 did not work for me. It seems to get lost trying to find and match a device ID between 0 and n_gpus, and the error message suggests reporting the bug to PyTorch.