Extra memory used on GPU 0 when training with DDP

Hi All,
I’m trying to fine-tune a Whisper model on a custom dataset using a multi-GPU machine with 4 V100 GPUs. When I run on a single GPU (i.e. with the env variable CUDA_VISIBLE_DEVICES=0) and a batch size of 16, the model trains as expected. However, when I train on all 4 GPUs with a batch size of 16 per GPU, I get an OOM error. Reducing the batch size to 8 per GPU still gives the OOM error. With a batch size of 4 per GPU the run no longer fails, but GPU 0 is loaded with ~14GB while the other 3 cards use only 8GB each, and training is even slower than on a single GPU with batch size 16. I’m not sure what I’m doing wrong here, but I assumed that if I can run batch size 16 on a single GPU, I should be able to do the same on each of the 4 cards, i.e. a total batch size of 4*16.
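To make the expectation concrete, here is a minimal sketch of the batch-size arithmetic and of per-process GPU pinning. It assumes the job is launched with torchrun, which sets a LOCAL_RANK environment variable for each process; the helper names (local_rank, global_batch_size, pin_to_local_gpu) are hypothetical and are not taken from my actual training script.

```python
import os


def local_rank(env=None):
    """Return this process's GPU index from LOCAL_RANK (assumed set by torchrun)."""
    env = os.environ if env is None else env
    # Fall back to 0 when the variable is absent, e.g. in a single-GPU run.
    return int(env.get("LOCAL_RANK", 0))


def global_batch_size(per_gpu_batch, world_size, grad_accum_steps=1):
    """Total number of samples consumed per optimizer step across all GPUs."""
    return per_gpu_batch * world_size * grad_accum_steps


def pin_to_local_gpu():
    """Bind the current process to its own GPU before any CUDA work happens."""
    import torch  # imported lazily so the helpers above work without PyTorch

    rank = local_rank()
    torch.cuda.set_device(rank)
    return torch.device("cuda", rank)


if __name__ == "__main__":
    # 4 GPUs with 16 samples each is the total of 4*16 mentioned above.
    print(global_batch_size(per_gpu_batch=16, world_size=4))
```

If every process ends up allocating on the default device instead of its own, that could be one way GPU 0 fills up faster than the others, but I have not confirmed that this is what is happening in my case.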

I would really appreciate your help.
Thanks.