Loading extra memory in GPU 0 using DDP

AvivSham · June 18, 2023, 2:33pm

Hi All,
I’m trying to fine-tune a Whisper model on a custom dataset using a multi-GPU machine. Specifically, my machine has 4 V100 GPUs. When running on a single GPU (i.e. setting the env variable CUDA_VISIBLE_DEVICES=0) with a batch size of 16, the model trains as expected. However, when training on all 4 GPUs with a batch size of 16 per GPU, I get an OOM error. The OOM error still appears when I reduce the batch size to 8 per GPU. With a batch size of 4 per GPU the run no longer crashes, but GPU 0 is loaded with ~14 GB while the other 3 cards use only 8 GB each. With a batch size of 4 per GPU, training is also slower than on a single GPU with a batch size of 16. I’m not sure what I’m doing wrong here, but I assume that if I was able to run a batch size of 16 on a single GPU, I should be able to do the same on each of the 4 cards, i.e. a total batch size of 4*16.

I would appreciate your help a lot.
Thanks.
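
One thing that may be worth checking, since the training script is not shown: with DDP, each process is normally pinned to its own GPU, and extra memory on GPU 0 is often a sign that every rank also allocates something there (for example, a checkpoint loaded without `map_location`, or a process that never selects its device). The sketch below is only an illustration under that assumption, not a diagnosis of this setup. It reads the `LOCAL_RANK` and `WORLD_SIZE` variables that `torchrun` sets and derives the device each rank should use and the resulting total batch size. The helper name `ddp_plan` is hypothetical.

```python
import os


def ddp_plan(per_gpu_batch, env=None):
    """Return the device for this rank and the effective global batch size.

    Assumes the job is launched with torchrun, which sets LOCAL_RANK and
    WORLD_SIZE for every process. Falls back to a single-process setup
    when those variables are missing.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return {
        "local_rank": local_rank,
        "device": f"cuda:{local_rank}",
        # Each rank processes its own batch, so the total scales with ranks.
        "global_batch": per_gpu_batch * world_size,
    }


if __name__ == "__main__":
    plan = ddp_plan(per_gpu_batch=16)
    print(plan)
    # In the training script, each rank would then (assumed usage):
    #   torch.cuda.set_device(plan["local_rank"])
    #   state = torch.load(path, map_location=plan["device"])
    # so that no rank implicitly allocates memory on cuda:0.
```

With 4 processes and 16 samples per GPU, every rank reports a global batch of 64, which matches the 4*16 expectation in the post. If the per-rank devices printed here are all `cuda:0`, that would point to a launch or device-selection problem rather than a batch-size limit.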
