Multi-GPU training

It seems that the Hugging Face implementation still uses nn.DataParallel for single-node multi-GPU training.
The PyTorch documentation clearly states: "It is recommended to use DistributedDataParallel instead of DataParallel to do multi-GPU training, even if there is only a single node." Could you please clarify whether my understanding is correct, and whether your training supports DistributedDataParallel for one node with multiple GPUs?

Both are supported by the Hugging Face Trainer. You just have to use the PyTorch launcher to use DistributedDataParallel; see an example here.

How do I find the example?

See here. For example:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)
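To make this concrete, below is a minimal sketch of what such a training script could look like with the Trainer. The model name, the IMDB subset, and the script's structure are placeholders chosen for illustration, not something prescribed by the thread. The launcher injects a --local_rank argument; forwarding it to TrainingArguments makes the Trainer set up DistributedDataParallel, while running the same script with plain python falls back to DataParallel on a multi-GPU machine.

# Hypothetical minimal training script (YOUR_TRAINING_SCRIPT.py in the command above).
# Launch with: python -m torch.distributed.launch --nproc_per_node=NUM_GPUS this_script.py
import argparse

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to the script; default -1 means "no DDP".
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

model_name = "distilbert-base-uncased"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Small IMDB subset just to keep the example quick; swap in your own dataset.
dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=128
    ),
    batched=True,
)

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    # -1 -> single GPU / DataParallel; >= 0 -> DistributedDataParallel via the launcher.
    local_rank=args.local_rank,
)

Trainer(model=model, args=training_args, train_dataset=dataset).train()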