Clarifying multi-GPU memory usage

Am I reading this thread (Training using multiple GPUs) correctly? I interpret that to mean:

Training a model with batch size 16 on one GPU is equivalent to training the same model with a per-GPU batch size of 4 on 4 GPUs

Is that correct? And does it differ between DataParallel and DistributedDataParallel modes?

It is correct. The difference between DataParallel and DistributedDataParallel in your example is:

  • in DataParallel mode you set the batch size to 16 for your data loader: a single process loads the full batch, and nn.DataParallel scatters it in chunks of 4 to the 4 GPUs inside the forward pass.
  • in DistributedDataParallel mode you set the batch size to 4 for your data loaders: each GPU runs its own process with its own data loader, so every process loads only its per-GPU share of the batch (see the sketch after this list).

Either way each GPU sees 4 samples per step, for an effective batch size of 16.
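Here is a minimal sketch of the two data-loader setups, using your numbers (16 total, 4 GPUs). The dataset is a toy placeholder, and reading the rank from the RANK environment variable is just an illustration; in a real DDP run each worker process gets its rank from the launcher and builds only its own loader:

```python
import os

import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Toy stand-in for the real dataset in the thread (hypothetical).
dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

GLOBAL_BATCH = 16  # the single-GPU batch size from the question
NUM_GPUS = 4

# DataParallel: one process, one DataLoader.
# The loader yields the full batch of 16; nn.DataParallel then scatters
# chunks of 4 along dim 0, one chunk per GPU.
dp_loader = DataLoader(dataset, batch_size=GLOBAL_BATCH, shuffle=True)

# DistributedDataParallel: one process (and one DataLoader) per GPU.
# Each process loads only its per-GPU batch of 4, and DistributedSampler
# gives each rank a disjoint shard, so 4 ranks x batch 4 = 16 samples per step.
rank = int(os.environ.get("RANK", "0"))  # set by the launcher in a real DDP run
sampler = DistributedSampler(dataset, num_replicas=NUM_GPUS, rank=rank, shuffle=True)
ddp_loader = DataLoader(dataset, batch_size=GLOBAL_BATCH // NUM_GPUS, sampler=sampler)
```

The DistributedSampler is what keeps the ranks from loading the same samples: it shards the dataset across processes, whereas in DataParallel the splitting happens after loading, inside the model's forward call.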