More GPUs = lower performance?

I’ve been prototyping my model training code on my local machine (2x RTX 3090 GPUs), and I’m now trying to migrate it to the university HPC cluster for a full training run. What’s confusing me is that training on the cluster node (which has 4x RTX 8000s) is reporting estimated epoch times that are far longer than what I was seeing locally (same dataset and batch size).

On my local machine, one epoch is projected to take ~84 hours:
```
49/586086 [00:28<84:02:02, 1.94it/s]
```

On the HPC, it’s predicting 455 hours(!):
```
76/293043 [07:13<455:24:38, 5.60s/it]
```

(note the different units: it/s vs s/it)
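For reference, redoing the arithmetic with just the numbers from the two progress bars:

```python
# Sanity check on the two ETAs, using only the numbers from the progress bars.
local_steps, local_it_per_s = 586_086, 1.94   # 2x RTX 3090: steps/epoch, it/s
hpc_steps, hpc_s_per_it = 293_043, 5.60       # 4x RTX 8000: steps/epoch, s/it

local_epoch_h = local_steps / local_it_per_s / 3600
hpc_epoch_h = hpc_steps * hpc_s_per_it / 3600
print(f"local: ~{local_epoch_h:.0f} h/epoch, HPC: ~{hpc_epoch_h:.0f} h/epoch")

# The HPC run has exactly half the steps per epoch (presumably the same
# dataset split across twice as many GPUs), yet each epoch still takes
# roughly 5-6x longer in wall-clock time:
print(f"slowdown: {hpc_epoch_h / local_epoch_h:.1f}x")
```

So even accounting for the halved step count, the cluster is roughly 5-6x slower overall.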

I’ve checked with nvidia-smi and all four GPUs show 100% utilization. The dataset is stored on a local disk in both cases, so I’m running out of ideas about what could be going on…
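
One thing I could still try is splitting each step’s wall time into data loading vs GPU work, to completely rule out the input pipeline. Something along these lines (the model and dataset below are dummy placeholders, not my actual code):

```python
# Rough split of wall time into "waiting on the DataLoader" vs "GPU step".
# Dummy model/dataset as placeholders -- swap in the real ones.
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

device = "cuda"
model = torch.nn.Linear(512, 512).to(device)          # placeholder model
dataset = TensorDataset(torch.randn(8192, 512))       # placeholder data
loader = DataLoader(dataset, batch_size=64, num_workers=4)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

load_t = step_t = 0.0
t_prev = time.time()
for (batch,) in loader:
    t_loaded = time.time()
    load_t += t_loaded - t_prev                       # time spent waiting on data
    batch = batch.to(device, non_blocking=True)
    loss = model(batch).pow(2).mean()                 # placeholder loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    torch.cuda.synchronize()                          # make the GPU timing honest
    t_prev = time.time()
    step_t += t_prev - t_loaded                       # time spent in the GPU step

print(f"data loading: {load_t:.1f}s, GPU compute: {step_t:.1f}s")
```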

I’ve looked into this more and I think it’s a performance bug related to excessive GPU-GPU communication: https://github.com/huggingface/transformers/issues/9371
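
If it is communication-bound, a quick way to compare the two machines might be a raw GPU-to-GPU copy benchmark, something like the sketch below (a generic micro-benchmark for illustration, not taken from the linked issue):

```python
# Crude GPU0 -> GPU1 copy bandwidth test, to compare the interconnect on the
# local box (2x 3090) with the cluster node (4x RTX 8000).
import time
import torch

assert torch.cuda.device_count() >= 2, "needs at least two visible GPUs"

x = torch.randn(64 * 1024 * 1024, device="cuda:0")    # 256 MiB of float32
dst = torch.empty_like(x, device="cuda:1")

dst.copy_(x)                                          # warm-up transfer
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)

n_iters = 20
t0 = time.time()
for _ in range(n_iters):
    dst.copy_(x)
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
elapsed = time.time() - t0

gb_moved = n_iters * x.numel() * x.element_size() / 1e9
print(f"cuda:0 -> cuda:1: {gb_moved / elapsed:.1f} GB/s")
```

If the cluster node reports a much lower figure than the 3090 box, that would at least support the communication explanation.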
