Torch.distributed.launch question

Hello,
I’m trying a 4-GPU training run, using this command:

python -m torch.distributed.launch --nproc_per_node=4 run_translation.py --other_args

With that command, per_device_train_batch_size=3 gives a CUDA out-of-memory error, while per_device_train_batch_size=2 does not.

But if I train on a single GPU with per_device_train_batch_size=3, I get no memory error.

The batch size is this small because each GPU has 15 GB of memory and my inputs are 1024 tokens long.

I also tried these lines, but nothing changed:

torch.cuda.set_device(rank)  # pin this process to its own GPU
torch.cuda.empty_cache()     # release cached memory blocks held by the allocator
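
For reference, my understanding is that torch.cuda.set_device only helps if each process calls it with its own local rank before doing any CUDA work; otherwise every process can also create a context on GPU 0 and use extra memory there. Here is a minimal sketch of that ordering for a manual training loop (assuming the local rank is exposed through the LOCAL_RANK environment variable, which torchrun sets and torch.distributed.launch sets when started with --use_env; run_translation.py’s Trainer already handles this internally):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ["LOCAL_RANK"])  # one process per GPU

# Pin this process to its own GPU before any CUDA allocation,
# so it does not also create a context on GPU 0.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder for the real model
model = DDP(model, device_ids=[local_rank], output_device=local_rank)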

Does distributed training use more GPU memory?
I’m only using the command above; is there anything else I should add?

Thanks

Does distributed training use more GPU memory?

Yes, it uses slightly more memory on GPU 0, which holds the buffers used to synchronize gradients across processes. So if you were already very close to the memory limit, you may need to lower the per-device batch size slightly. Note that the effective batch size is multiplied by 4 when running on 4 GPUs, so you are still training with a larger overall batch size.
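
As a quick sanity check on those numbers (a sketch only, not output from run_translation.py; gradient_accumulation_steps is the Trainer argument for accumulating gradients over several steps and is another way to grow the effective batch without using more memory):

per_device_train_batch_size = 2
nproc_per_node = 4               # GPUs / processes launched
gradient_accumulation_steps = 1

effective_batch_size = (
    per_device_train_batch_size * nproc_per_node * gradient_accumulation_steps
)
print(effective_batch_size)  # 8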

Yes, an effective batch size of 2 * 4 = 8. Thanks for the answer!
