Am I reading this thread (Training using multiple GPUs) correctly? I interpret that to mean:
Training a model with a batch size of 16 on one GPU is equivalent to training the same model with a batch size of 4 on 4 GPUs.
Is that correct? And does it differ between DataParallel and DistributedDataParallel modes?
sgugger
That is correct. The difference between DataParallel and DistributedDataParallel (in your example) is:
- in DataParallel mode you set the batch size to 16 for your data loader (a single process; each batch is split across the GPUs);
- in DistributedDataParallel mode you set the batch size to 4 for each data loader (one process per GPU, each with its own loader).
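For a concrete picture, here is a minimal PyTorch sketch of the two setups, assuming 4 GPUs. The linear model and random dataset are placeholders, and the DDP part assumes the script is launched with `torchrun` so that `LOCAL_RANK` is set; it illustrates the batch-size convention rather than a full training loop.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DataParallel, DistributedDataParallel
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Placeholder dataset: 1024 random samples with 10 features and binary labels.
dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

# --- DataParallel: a single process, so the DataLoader gets the *total*
# batch size of 16; DataParallel then splits each batch across the 4 GPUs.
dp_model = DataParallel(torch.nn.Linear(10, 2).cuda())
dp_loader = DataLoader(dataset, batch_size=16, shuffle=True)

# --- DistributedDataParallel: one process per GPU, so each process builds
# its own DataLoader with the *per-GPU* batch size of 4
# (4 GPUs x 4 = 16 samples per optimizer step, matching the runs above).
def ddp_setup():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    ddp_model = DistributedDataParallel(
        torch.nn.Linear(10, 2).cuda(local_rank), device_ids=[local_rank]
    )
    # DistributedSampler hands each rank a distinct shard of the dataset.
    ddp_loader = DataLoader(
        dataset, batch_size=4, sampler=DistributedSampler(dataset)
    )
    return ddp_model, ddp_loader
```

The equivalence holds because DDP averages gradients across the 4 processes, so each optimizer step still reflects 16 samples, just like the single-GPU or DataParallel run.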