I’m using transformers with torchrun and DDP for training on a 7-node GPU cluster. It’s a LLaMA model that I’m doing LoRA fine-tuning on.
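For context, my setup looks roughly like this. It’s a minimal sketch rather than my exact script: the model name, LoRA hyperparameters, batch sizes, GPUs per node, and the `load_my_tokenized_dataset()` helper are placeholders, not my real values.

```python
# Launched on each node with something like (GPUs per node is a placeholder):
#   torchrun --nnodes=7 --nproc_per_node=8 --node_rank=$NODE_RANK \
#            --master_addr=$MASTER_ADDR --master_port=29500 train.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA adapter config (values here are illustrative, not my exact settings)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

train_dataset = load_my_tokenized_dataset()  # hypothetical helper; real data loading omitted

training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    save_steps=500,
    bf16=True,
    ddp_find_unused_parameters=False,  # frozen base weights with LoRA, so no unused-param search
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# After a crash I restart the same command and resume from the last checkpoint in output_dir
trainer.train(resume_from_checkpoint=True)
```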
I have a GPU utilization chart that looks like this.
Basically, after a few hours all GPUs go to 100% utilization while one of them drops to 0%. Training progress stops completely, and eventually the process times out and crashes, so I have to resume from the last saved checkpoint. I see a ChildFailedError in the logs. Has anyone else seen this error? Why would this be happening, and how can I resolve it? Sometimes it happens after an hour, sometimes after 20 hours.
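In case it’s useful, here is a sketch of the extra diagnostics I understand can be turned on to get more detail on the hang. The env-var choices and the 2-hour timeout are assumptions on my part based on the PyTorch/NCCL docs, not my confirmed configuration, and this assumes torchrun has already set the usual rendezvous environment variables.

```python
import os
from datetime import timedelta
import torch.distributed as dist

# Verbose NCCL logging so the stuck collective shows up in the logs
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Raise an error on a failed/stuck collective instead of hanging silently
# (newer PyTorch versions call this TORCH_NCCL_ASYNC_ERROR_HANDLING)
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

# Longer collective timeout so a slow rank produces a clear error rather than
# an opaque watchdog kill; 2 hours is an arbitrary placeholder value.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```

If anyone has seen this pattern (one rank idle at 0% while the others spin at 100% until the NCCL watchdog times out), I’d appreciate pointers on what to look at first.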