Am I reading this thread (Training using multiple GPUs) correctly? I interpret that to mean:
Training a model with batch size 16 on one GPU is equivalent to running a model with batch size 4 on 4 GPUs
Is that correct? And does it differ between DataParallel and DistributedDataParallel modes?
sgugger
It is correct. The difference between DataParallel and DistributedDataParallel is (in your current example):
- in DataParallel mode you have to set the batch size to 16 for your data loaders.
- in DistributedDataParallel mode you have to set the batch size to 4 for your data loaders.
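To make the two settings concrete, here is a minimal PyTorch sketch of the data-loader side only (model wrapping and training loop omitted). The 4-GPU count, the toy TensorDataset, and the hard-coded sampler rank are assumptions for illustration, not code from this thread.

```python
# Minimal sketch, assuming 4 GPUs and a toy dataset; only the batch size
# (and the sampler) changes between the two modes.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

# DataParallel: a single process drives all 4 GPUs, so the DataLoader uses the
# *global* batch size of 16; each forward pass splits it into chunks of 4 per GPU.
dp_loader = DataLoader(dataset, batch_size=16, shuffle=True)

# DistributedDataParallel: one process per GPU, so each process builds its own
# DataLoader with the *per-GPU* batch size of 4 (4 processes x 4 = 16 overall).
# In real training you would launch with torchrun and let DistributedSampler
# read num_replicas/rank from the process group; they are hard-coded here only
# so the snippet runs standalone.
ddp_sampler = DistributedSampler(dataset, num_replicas=4, rank=0)
ddp_loader = DataLoader(dataset, batch_size=4, sampler=ddp_sampler)
```

Either way the effective (global) batch size per optimizer step is 16; what changes is whether the number you pass to the DataLoader is the global size (DataParallel) or the per-GPU size (DistributedDataParallel).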