Trainer is not using multiple GPUs in the DP setup

I’m trying to launch custom model training through the Trainer API in a single-node, multi-GPU setup. I use a subclassed Trainer that overrides the evaluation_loop() function.
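
For context, the subclass is shaped roughly like this (heavily simplified sketch; `MyTrainer` is just a placeholder name and the real evaluation_loop override does more than delegate to the parent):

```python
from transformers import Trainer

class MyTrainer(Trainer):
    # Simplified: the real override adds custom logic around the stock loop.
    def evaluation_loop(self, dataloader, description, prediction_loss_only=None,
                        ignore_keys=None, metric_key_prefix="eval"):
        output = super().evaluation_loop(
            dataloader,
            description,
            prediction_loss_only=prediction_loss_only,
            ignore_keys=ignore_keys,
            metric_key_prefix=metric_key_prefix,
        )
        return output
```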

My training script sees all the available GPUs through the torch.cuda API; however, I observe no speedup when launching the script with a plain python command. As far as I can tell from the source, the Trainer wraps the model in nn.DataParallel in this case, so it should work, but it doesn’t.
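
This is roughly how I verify GPU visibility (a minimal sketch; to my understanding, `n_gpu > 1` is what makes the Trainer wrap the model in nn.DataParallel when launched as plain `python`):

```python
import torch
from transformers import TrainingArguments

print(torch.cuda.device_count())  # reports all GPUs on the node

# n_gpu is the value the Trainer consults before wrapping the model
# in torch.nn.DataParallel for a plain single-process launch.
args = TrainingArguments(output_dir="out", per_device_train_batch_size=8)
print(args.n_gpu)
```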

I do see the speedup in the DDP scenario when launching through torchrun; however, the RAM usage scales up as well, and I cannot use all the GPUs because I need to load the data directly into memory – fetching from files is too slow. Switching to other libraries such as accelerate is also undesirable because of the effort of re-implementing the Trainer API features I rely on.

Is there anything that can help me get the DP+Trainer setup to work?

UPD:
I suspect that as the number of GPUs grows, the Trainer doesn’t split the gradient accumulation steps between the individual GPUs, so the actual batch size is acc_steps * n_gpus * per_device_batch_size. However, I’d expect the DDP setup to behave the same way, so that alone doesn’t explain why only DDP shows a speedup.
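
To make the arithmetic explicit (example numbers, only illustrating the assumption above):

```python
# Example values, just to spell out the effective-batch-size assumption.
per_device_batch_size = 8
n_gpus = 4
acc_steps = 2

# What I assume one optimizer step effectively sees in both DP and DDP:
effective_batch_size = per_device_batch_size * n_gpus * acc_steps
print(effective_batch_size)  # 64
```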