Clarifying multi-GPU memory usage

Am I reading this thread (Training using multiple GPUs) correctly? I interpret that to mean:

Training a model with batch size 16 on one GPU is equivalent to training the same model with a per-GPU batch size of 4 on 4 GPUs

Is that correct? And does it differ between DataParallel and DistributedDataParallel modes?

It is correct. The difference between DataParallel and DistributedDataParallel in your example is:

  • in DataParallel mode you set the batch size to 16 for your data loader: a single process loads the full batch, and nn.DataParallel scatters it in chunks of 4 to the 4 GPUs inside the forward pass.
  • in DistributedDataParallel mode you set the batch size to 4 for your data loaders: each GPU runs its own process with its own data loader, so every process loads only its per-GPU share of the batch (see the sketch after this list).

Either way each GPU sees 4 samples per step, for an effective batch size of 16.
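Here is a minimal sketch of the two data-loader setups, using your numbers (16 total, 4 GPUs). The dataset is a toy placeholder, and reading the rank from the RANK environment variable is just an illustration; in a real DDP run each worker process gets its rank from the launcher and builds only its own loader:

```python
import os

import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Toy stand-in for the real dataset in the thread (hypothetical).
dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

GLOBAL_BATCH = 16  # the single-GPU batch size from the question
NUM_GPUS = 4

# DataParallel: one process, one DataLoader.
# The loader yields the full batch of 16; nn.DataParallel then scatters
# chunks of 4 along dim 0, one chunk per GPU.
dp_loader = DataLoader(dataset, batch_size=GLOBAL_BATCH, shuffle=True)

# DistributedDataParallel: one process (and one DataLoader) per GPU.
# Each process loads only its per-GPU batch of 4, and DistributedSampler
# gives each rank a disjoint shard, so 4 ranks x batch 4 = 16 samples per step.
rank = int(os.environ.get("RANK", "0"))  # set by the launcher in a real DDP run
sampler = DistributedSampler(dataset, num_replicas=NUM_GPUS, rank=rank, shuffle=True)
ddp_loader = DataLoader(dataset, batch_size=GLOBAL_BATCH // NUM_GPUS, sampler=sampler)
```

The DistributedSampler is what keeps the ranks from loading the same samples: it shards the dataset across processes, whereas in DataParallel the splitting happens after loading, inside the model's forward call.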