I’m using transformers with torchrun and DDP for training on a 7-node GPU cluster. It’s a LLaMA model that I’m doing LoRA fine-tuning on.
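For context, my setup looks roughly like this. It’s a minimal sketch rather than my exact script: the model name, LoRA hyperparameters, batch sizes, GPUs per node, and the `load_my_tokenized_dataset()` helper are placeholders, not my real values.

```python
# Launched on each node with something like (GPUs per node is a placeholder):
#   torchrun --nnodes=7 --nproc_per_node=8 --node_rank=$NODE_RANK \
#            --master_addr=$MASTER_ADDR --master_port=29500 train.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA adapter config (values here are illustrative, not my exact settings)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

train_dataset = load_my_tokenized_dataset()  # hypothetical helper; real data loading omitted

training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    save_steps=500,
    bf16=True,
    ddp_find_unused_parameters=False,  # frozen base weights with LoRA, so no unused-param search
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# After a crash I restart the same command and resume from the last checkpoint in output_dir
trainer.train(resume_from_checkpoint=True)
```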
I have a GPU utilization chart that looks like this.
Basically, after a few hours all GPUs go to 100% utilization while one of them drops to 0%. Training progress stops completely, and eventually the process times out and crashes, so I have to resume from the last saved checkpoint. I see a ChildFailedError in the logs. Has anyone else seen this error? Why would this be happening, and how can I resolve it? Sometimes it happens after an hour, sometimes after 20 hours.
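In case it’s useful, here is a sketch of the extra diagnostics I understand can be turned on to get more detail on the hang. The env-var choices and the 2-hour timeout are assumptions on my part based on the PyTorch/NCCL docs, not my confirmed configuration, and this assumes torchrun has already set the usual rendezvous environment variables.

```python
import os
from datetime import timedelta
import torch.distributed as dist

# Verbose NCCL logging so the stuck collective shows up in the logs
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Raise an error on a failed/stuck collective instead of hanging silently
# (newer PyTorch versions call this TORCH_NCCL_ASYNC_ERROR_HANDLING)
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

# Longer collective timeout so a slow rank produces a clear error rather than
# an opaque watchdog kill; 2 hours is an arbitrary placeholder value.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```

If anyone has seen this pattern (one rank idle at 0% while the others spin at 100% until the NCCL watchdog times out), I’d appreciate pointers on what to look at first.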