Am I reading this thread (Training using multiple GPUs) correctly? I interpret that to mean:
Training a model with a batch size of 16 on one GPU is equivalent to training the same model with a per-GPU batch size of 4 on 4 GPUs.
Is that correct? And does it differ between DataParallel and DistributedDataParallel modes?
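For context, here is a minimal sketch of the two setups as I understand them (the model, tensor sizes, and DataLoader names are placeholders I made up, not from the thread):

```python
import torch
import torch.nn as nn

# Placeholder model, just for illustration
model = nn.Linear(10, 2)

# --- DataParallel: single process, single DataLoader ---
# I pass a batch of 16 to the wrapped model; DataParallel scatters it
# across the 4 GPUs, so each GPU runs forward/backward on 4 samples.
dp_model = nn.DataParallel(model.cuda(), device_ids=[0, 1, 2, 3])
inputs = torch.randn(16, 10).cuda()   # global batch size 16
outputs = dp_model(inputs)            # roughly 4 samples per GPU

# --- DistributedDataParallel: one process per GPU ---
# Each of the 4 processes builds its own DataLoader with batch_size=4,
# so the effective global batch is again 4 * 4 = 16. Sketch only; the
# process-group init and torchrun launch are omitted.
#
# dist.init_process_group("nccl")
# local_rank = int(os.environ["LOCAL_RANK"])
# ddp_model = nn.parallel.DistributedDataParallel(
#     model.cuda(local_rank), device_ids=[local_rank])
# loader = DataLoader(dataset, batch_size=4,
#                     sampler=DistributedSampler(dataset))
```

So in both cases each GPU ends up processing 4 samples per step; I'm asking whether that makes them equivalent to a single-GPU batch of 16.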