Trainer is not using multiple GPUs in the DP setup

I’m trying to launch custom model training through the Trainer API in a single-node, multi-GPU setup. I use a subclassed Trainer that overrides the evaluation_loop() function.
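
For context, the subclass is shaped roughly like this (heavily simplified sketch; `MyTrainer` is just a placeholder name and the real evaluation_loop override does more than delegate to the parent):

```python
from transformers import Trainer

class MyTrainer(Trainer):
    # Simplified: the real override adds custom logic around the stock loop.
    def evaluation_loop(self, dataloader, description, prediction_loss_only=None,
                        ignore_keys=None, metric_key_prefix="eval"):
        output = super().evaluation_loop(
            dataloader,
            description,
            prediction_loss_only=prediction_loss_only,
            ignore_keys=ignore_keys,
            metric_key_prefix=metric_key_prefix,
        )
        return output
```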

My training script sees all the available GPUs through the torch.cuda API; however, I observe no speedup when launching the script with a plain python command. As far as I can tell from the source, the Trainer wraps the model in nn.DataParallel in this case, so it should work, but it doesn’t.
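
This is roughly how I verify GPU visibility (a minimal sketch; to my understanding, `n_gpu > 1` is what makes the Trainer wrap the model in nn.DataParallel when launched as plain `python`):

```python
import torch
from transformers import TrainingArguments

print(torch.cuda.device_count())  # reports all GPUs on the node

# n_gpu is the value the Trainer consults before wrapping the model
# in torch.nn.DataParallel for a plain single-process launch.
args = TrainingArguments(output_dir="out", per_device_train_batch_size=8)
print(args.n_gpu)
```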

I do see the speedup in the DDP scenario when launching through torchrun; however, the RAM usage scales up as well, and I cannot use all the GPUs because I need to load the data directly into memory – fetching from files is too slow. Switching to other libraries such as accelerate is also undesirable because of the effort of re-implementing the Trainer API features I rely on.

Is there anything that can help me get the DP+Trainer setup to work?

UPD:
I suspect that as the number of GPUs grows, the Trainer doesn’t split the gradient accumulation steps between the individual GPUs, so the actual batch size is acc_steps * n_gpus * per_device_batch_size. However, I’d expect the DDP setup to behave the same way, so that alone doesn’t explain why only DDP shows a speedup.
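
To make the arithmetic explicit (example numbers, only illustrating the assumption above):

```python
# Example values, just to spell out the effective-batch-size assumption.
per_device_batch_size = 8
n_gpus = 4
acc_steps = 2

# What I assume one optimizer step effectively sees in both DP and DDP:
effective_batch_size = per_device_batch_size * n_gpus * acc_steps
print(effective_batch_size)  # 64
```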