I’m trying to launch custom model training through the Trainer API in a single-node, multi-GPU setup. I use a subclassed Trainer that modifies some of the default behavior.
My training script sees all the available GPUs through torch.cuda calls; however, I observe no speedup when launching the script with an ordinary python command. As far as I can tell from the code, the Trainer wraps the model in nn.DataParallel in that case, so multi-GPU should work, but it doesn’t.
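For context, here is a minimal sketch of the wrapping behavior I mean (this is my paraphrase of what happens under a plain python launch, not the exact transformers internals):

```python
import torch
import torch.nn as nn

# Sketch: when more than one GPU is visible and no distributed launcher
# is used, the model ends up wrapped in nn.DataParallel.
model = nn.Linear(16, 4)

n_gpu = torch.cuda.device_count()
if n_gpu > 1:
    model = nn.DataParallel(model)

# DataParallel replicates the module onto every GPU at each forward pass,
# scatters the input batch along dim 0, and gathers outputs on GPU 0.
x = torch.randn(8, 16)
out = model(x)
print(out.shape)  # torch.Size([8, 4]) regardless of the number of GPUs
```

The per-step replicate/scatter/gather in DataParallel is also, as I understand it, a common reason it shows little speedup in practice.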
I do see a speedup in the DDP scenario, when launching through torchrun; however, RAM usage scales up as well, and I cannot use all the GPUs because I need to load the data directly into memory (fetching from files is too slow). Switching to other libraries such as accelerate is also undesirable because of the effort of re-implementing the Trainer API features I need.
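One workaround I have been considering for the RAM scaling: instead of loading the data into each DDP process’s heap, memory-map a single on-disk array so the OS page cache backs all processes at once. A hedged sketch (file path and array shape are illustrative assumptions, not my actual data):

```python
import os
import tempfile
import numpy as np

# Illustrative dataset file; in a real DDP run every torchrun process
# would open the same pre-existing file.
path = os.path.join(tempfile.mkdtemp(), "features.npy")
np.save(path, np.random.rand(1000, 64).astype(np.float32))

# mmap_mode="r" maps the file instead of copying it into process memory;
# the physical pages are shared across all processes that map the file.
features = np.load(path, mmap_mode="r")

# Indexing materializes only the rows actually touched.
sample = np.array(features[42])
print(features.shape, sample.shape)
```

I don’t know whether this fits every data format, but for dense arrays it keeps per-process resident memory roughly flat as the number of GPUs grows.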
Is there anything that can help me make the DP + Trainer setup work?
I suspect that with an increasing number of GPUs the Trainer doesn’t split accumulation steps between individual GPUs, so the actual batch size is acc_steps * n_gpus * per_device_batch_size. However, I also suspect that the DDP setup must behave the same way, which doesn’t explain the speedup.
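To make the suspicion above concrete, the effective batch size under my reading would scale like this (the numbers are illustrative assumptions, not my real settings):

```python
# Suspected effective batch size: every optimizer step consumes
# acc_steps micro-batches on each of n_gpus devices.
per_device_batch_size = 8
n_gpus = 4
acc_steps = 2

effective_batch = acc_steps * n_gpus * per_device_batch_size
print(effective_batch)  # 64
```

So going from 1 to 4 GPUs at fixed per_device_batch_size and acc_steps would quadruple the effective batch, which changes the optimization dynamics even if wall-clock time per step is unchanged.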