I’m trying to launch custom model training through the Trainer API in a single-node, multi-GPU setup. I use a subclassed Trainer that modifies some of the default behavior.
My training script sees all the available GPUs through torch.cuda calls; however, I observe no speedup when launching the script with an ordinary python command. As far as I can tell from the code, the Trainer wraps the model in nn.DataParallel in that case, so multi-GPU should work, but it doesn’t.
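For context, here is a minimal sketch of the wrapping behavior I mean (this is my paraphrase of what happens under a plain python launch, not the exact transformers internals):

```python
import torch
import torch.nn as nn

# Sketch: when more than one GPU is visible and no distributed launcher
# is used, the model ends up wrapped in nn.DataParallel.
model = nn.Linear(16, 4)

n_gpu = torch.cuda.device_count()
if n_gpu > 1:
    model = nn.DataParallel(model)

# DataParallel replicates the module onto every GPU at each forward pass,
# scatters the input batch along dim 0, and gathers outputs on GPU 0.
x = torch.randn(8, 16)
out = model(x)
print(out.shape)  # torch.Size([8, 4]) regardless of the number of GPUs
```

The per-step replicate/scatter/gather in DataParallel is also, as I understand it, a common reason it shows little speedup in practice.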
I do see a speedup in the DDP scenario, when launching through torchrun; however, RAM usage scales up as well, and I cannot use all the GPUs because I need to load the data directly into memory (fetching from files is too slow). Switching to other libraries such as accelerate is also undesirable because of the effort of re-implementing the Trainer API features I need.
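One workaround I have been considering for the RAM scaling: instead of loading the data into each DDP process’s heap, memory-map a single on-disk array so the OS page cache backs all processes at once. A hedged sketch (file path and array shape are illustrative assumptions, not my actual data):

```python
import os
import tempfile
import numpy as np

# Illustrative dataset file; in a real DDP run every torchrun process
# would open the same pre-existing file.
path = os.path.join(tempfile.mkdtemp(), "features.npy")
np.save(path, np.random.rand(1000, 64).astype(np.float32))

# mmap_mode="r" maps the file instead of copying it into process memory;
# the physical pages are shared across all processes that map the file.
features = np.load(path, mmap_mode="r")

# Indexing materializes only the rows actually touched.
sample = np.array(features[42])
print(features.shape, sample.shape)
```

I don’t know whether this fits every data format, but for dense arrays it keeps per-process resident memory roughly flat as the number of GPUs grows.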
Is there anything that can help me make the DP + Trainer setup work?
I suspect that with an increasing number of GPUs the Trainer doesn’t split accumulation steps between individual GPUs, so the actual batch size is acc_steps * n_gpus * per_device_batch_size. However, I also suspect that the DDP setup must behave the same way, which doesn’t explain the speedup.
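To make the suspicion above concrete, the effective batch size under my reading would scale like this (the numbers are illustrative assumptions, not my real settings):

```python
# Suspected effective batch size: every optimizer step consumes
# acc_steps micro-batches on each of n_gpus devices.
per_device_batch_size = 8
n_gpus = 4
acc_steps = 2

effective_batch = acc_steps * n_gpus * per_device_batch_size
print(effective_batch)  # 64
```

So going from 1 to 4 GPUs at fixed per_device_batch_size and acc_steps would quadruple the effective batch, which changes the optimization dynamics even if wall-clock time per step is unchanged.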