Torch.distributed.launch question

Hello,
I’m trying a 4-GPU training run, using this command:

python -m torch.distributed.launch --nproc_per_node=4 run_translation.py --other_args

With that command, per_device_train_batch_size=3 gives a CUDA out-of-memory error, while per_device_train_batch_size=2 does not.

But if I train on a single GPU with per_device_train_batch_size=3, I get no memory error.

The batch size is this small because each GPU has 15 GB of memory and my inputs are 1024 tokens long.

I also tried these lines, but nothing changed:

torch.cuda.set_device(rank)  # pin this process to its own GPU
torch.cuda.empty_cache()     # release cached memory blocks held by the allocator
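
For reference, my understanding is that torch.cuda.set_device only helps if each process calls it with its own local rank before doing any CUDA work; otherwise every process can also create a context on GPU 0 and use extra memory there. Here is a minimal sketch of that ordering for a manual training loop (assuming the local rank is exposed through the LOCAL_RANK environment variable, which torchrun sets and torch.distributed.launch sets when started with --use_env; run_translation.py’s Trainer already handles this internally):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ["LOCAL_RANK"])  # one process per GPU

# Pin this process to its own GPU before any CUDA allocation,
# so it does not also create a context on GPU 0.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder for the real model
model = DDP(model, device_ids=[local_rank], output_device=local_rank)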

Does distributed training use more GPU memory?
I’m only using the command above; is there anything else I should add?

Thanks

Does distributed training use more GPU memory?

Yes, it uses slightly more memory on GPU 0, which holds the buffers used to synchronize gradients across processes. So if you were already very close to the memory limit, you may need to lower the per-device batch size slightly. Note that the effective batch size is multiplied by 4 when running on 4 GPUs, so you are still training with a larger overall batch size.
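
As a quick sanity check on those numbers (a sketch only, not output from run_translation.py; gradient_accumulation_steps is the Trainer argument for accumulating gradients over several steps and is another way to grow the effective batch without using more memory):

per_device_train_batch_size = 2
nproc_per_node = 4               # GPUs / processes launched
gradient_accumulation_steps = 1

effective_batch_size = (
    per_device_train_batch_size * nproc_per_node * gradient_accumulation_steps
)
print(effective_batch_size)  # 8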

Yes, an effective batch size of 2 * 4 = 8. Thanks for the answer!
