More GPUs = lower performance?

I’ve been prototyping my model training code on my local machine (2x RTX 3090 GPUs), and I’m now trying to migrate it to the university HPC cluster for a full training run. What’s confusing me is that training on the cluster node (which has 4x RTX 8000s) is reporting estimated epoch times that are far longer than what I was seeing locally (same dataset and batch size).

On my local machine, one epoch is projected to take ~84 hours:
```
49/586086 [00:28<84:02:02, 1.94it/s]
```

On the HPC, it’s predicting 455 hours(!):
```
76/293043 [07:13<455:24:38, 5.60s/it]
```

(note the different units: it/s vs s/it)
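For reference, redoing the arithmetic with just the numbers from the two progress bars:

```python
# Sanity check on the two ETAs, using only the numbers from the progress bars.
local_steps, local_it_per_s = 586_086, 1.94   # 2x RTX 3090: steps/epoch, it/s
hpc_steps, hpc_s_per_it = 293_043, 5.60       # 4x RTX 8000: steps/epoch, s/it

local_epoch_h = local_steps / local_it_per_s / 3600
hpc_epoch_h = hpc_steps * hpc_s_per_it / 3600
print(f"local: ~{local_epoch_h:.0f} h/epoch, HPC: ~{hpc_epoch_h:.0f} h/epoch")

# The HPC run has exactly half the steps per epoch (presumably the same
# dataset split across twice as many GPUs), yet each epoch still takes
# roughly 5-6x longer in wall-clock time:
print(f"slowdown: {hpc_epoch_h / local_epoch_h:.1f}x")
```

So even accounting for the halved step count, the cluster is roughly 5-6x slower overall.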

I’ve checked with nvidia-smi and all four GPUs show 100% utilization. The dataset is stored on a local disk in both cases, so I’m running out of ideas about what could be going on…
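
One thing I could still try is splitting each step’s wall time into data loading vs GPU work, to completely rule out the input pipeline. Something along these lines (the model and dataset below are dummy placeholders, not my actual code):

```python
# Rough split of wall time into "waiting on the DataLoader" vs "GPU step".
# Dummy model/dataset as placeholders -- swap in the real ones.
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

device = "cuda"
model = torch.nn.Linear(512, 512).to(device)          # placeholder model
dataset = TensorDataset(torch.randn(8192, 512))       # placeholder data
loader = DataLoader(dataset, batch_size=64, num_workers=4)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

load_t = step_t = 0.0
t_prev = time.time()
for (batch,) in loader:
    t_loaded = time.time()
    load_t += t_loaded - t_prev                       # time spent waiting on data
    batch = batch.to(device, non_blocking=True)
    loss = model(batch).pow(2).mean()                 # placeholder loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    torch.cuda.synchronize()                          # make the GPU timing honest
    t_prev = time.time()
    step_t += t_prev - t_loaded                       # time spent in the GPU step

print(f"data loading: {load_t:.1f}s, GPU compute: {step_t:.1f}s")
```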

I’ve looked into this more and I think it’s a performance bug related to excessive GPU-GPU communication: https://github.com/huggingface/transformers/issues/9371
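
If it is communication-bound, a quick way to compare the two machines might be a raw GPU-to-GPU copy benchmark, something like the sketch below (a generic micro-benchmark for illustration, not taken from the linked issue):

```python
# Crude GPU0 -> GPU1 copy bandwidth test, to compare the interconnect on the
# local box (2x 3090) with the cluster node (4x RTX 8000).
import time
import torch

assert torch.cuda.device_count() >= 2, "needs at least two visible GPUs"

x = torch.randn(64 * 1024 * 1024, device="cuda:0")    # 256 MiB of float32
dst = torch.empty_like(x, device="cuda:1")

dst.copy_(x)                                          # warm-up transfer
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)

n_iters = 20
t0 = time.time()
for _ in range(n_iters):
    dst.copy_(x)
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
elapsed = time.time() - t0

gb_moved = n_iters * x.numel() * x.element_size() / 1e9
print(f"cuda:0 -> cuda:1: {gb_moved / elapsed:.1f} GB/s")
```

If the cluster node reports a much lower figure than the 3090 box, that would at least support the communication explanation.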
