torch.distributed.elastic.multiprocessing.errors.ChildFailedError

IdoAmit198 · March 26, 2023, 6:35am

Hey guys, I'm glad to announce that I solved the issue on my side.
As can be seen, I use multiple GPUs, and they have sufficient memory for this use case.
However, my issue was caused by insufficient CPU memory (host RAM). That's why my runs crashed without any trace of the reason.
Once I allocated enough CPU memory (in my case, I increased it from 32 GB to 96+ GB), the runs stopped crashing.
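
For anyone who wants to check whether host RAM is the bottleneck before launching, here is a minimal sketch that reads `/proc/meminfo` using only the standard library. It assumes a Linux machine; the helper names (`parse_meminfo`, `kib_to_gib`, `host_memory_report`) are made up for illustration and are not part of PyTorch or Accelerate. Inside a container or a cluster job, the actual limit may come from cgroups or the scheduler rather than from what this file reports, so treat the numbers as a rough hint.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers.

    Most fields are reported in kiB; a few (e.g. HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def kib_to_gib(kib):
    """Convert kibibytes to gibibytes."""
    return kib / (1024 ** 2)


def host_memory_report(meminfo_path="/proc/meminfo"):
    """Return (total_gib, available_gib), or None if the file is missing (non-Linux)."""
    if not os.path.exists(meminfo_path):
        return None
    with open(meminfo_path) as f:
        info = parse_meminfo(f.read())
    if "MemTotal" not in info or "MemAvailable" not in info:
        return None
    return kib_to_gib(info["MemTotal"]), kib_to_gib(info["MemAvailable"])


if __name__ == "__main__":
    report = host_memory_report()
    if report is None:
        print("Could not read host memory information on this system.")
    else:
        total, available = report
        print(f"Host RAM: {total:.1f} GiB total, {available:.1f} GiB available")
```

Running this right before the training command (and again from a second shell while the model is loading) should show whether available host memory drops toward zero just before the processes die.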

If your CPU memory allocation is fixed and you cannot allocate more, I'm sure you can try solutions such as compressed models, DeepSpeed optimization levels, and more.

Good luck to future readers.
