All GPUs at 100% except GPU0 at 0%?

I’m using Transformers with torchrun (DDP) to train across 7 GPU nodes. It’s a LLaMA model that I’m doing LoRA fine-tuning on.
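For reference, this is roughly the shape of my launch command; the script name, per-node process count, and rendezvous endpoint below are placeholders, not my exact values:

```bash
torchrun \
  --nnodes 7 \
  --nproc_per_node 8 \
  --rdzv_backend c10d \
  --rdzv_endpoint "$MASTER_ADDR:29500" \
  train_lora.py  # hypothetical script name; runs the Trainer on a PEFT/LoRA-wrapped model
```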

I have a GPU utilization chart that looks like this.

Basically, after a few hours every GPU goes to 100% except one, which drops to 0%. Training progress stops completely, and eventually the process times out and crashes, so I have to resume from the last saved checkpoint. The logs show a ChildFailedError. Has anyone else seen this error? Why would it be happening, and how can I resolve it? Sometimes it happens after an hour, sometimes after 20 hours.
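In case it helps with diagnosis, these are the standard PyTorch/NCCL debug knobs I’m planning to enable on the next run; that they’ll surface the stuck collective is my assumption, not something the error message pointed me to:

```bash
export NCCL_DEBUG=INFO                    # per-rank NCCL logs, shows which collective each rank enters
export TORCH_DISTRIBUTED_DEBUG=DETAIL     # PyTorch-side logging of collective order/shape mismatches
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1  # abort the process group on a timed-out collective instead of hanging
```

I’m also considering raising `ddp_timeout` in `TrainingArguments` so a slow rank isn’t killed before I can attach to it and grab a stack trace.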
