torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Hey guys, I’m glad to announce I solved the issue on my side.
As can be seen, I use multiple GPUs, which have sufficient memory for the use case.
HOWEVER! My issue was not enough CPU memory (host RAM). That’s why my runs crashed without any trace of the reason.
Once I allocated enough CPU memory (in my case I increased it from 32 GB to 96+ GB), the crashes stopped.
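
For anyone who wants to check whether they are hitting the same thing, here is a minimal sketch of a host-memory logger you could call at a few points in your training script (for example before and after loading the model). This is not something from my original setup: the function names (`parse_meminfo`, `available_gib`, `log_host_memory`) are made up for illustration, and it assumes Linux with a readable `/proc/meminfo`. Keep in mind that inside a container or a scheduler job with a memory limit, `/proc/meminfo` may report the whole machine rather than your limit.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_gib(info):
    """Return the MemAvailable field converted to GiB, or None if it is missing."""
    kib = info.get("MemAvailable")
    return None if kib is None else kib / (1024 ** 2)


def log_host_memory(tag, path="/proc/meminfo"):
    """Print the available host RAM with a tag; returns the value in GiB or None."""
    if not os.path.exists(path):
        # Hypothetical fallback: on non-Linux systems there is nothing to read.
        print(f"[{tag}] {path} not found; skipping host memory check", flush=True)
        return None
    with open(path) as f:
        gib = available_gib(parse_meminfo(f.read()))
    print(f"[{tag}] host RAM available: {gib} GiB", flush=True)
    return gib


if __name__ == "__main__":
    log_host_memory("startup")
```

If the available number drops toward zero right before a worker dies, host RAM is a likely suspect. With several processes per node, each one may load its own copy of the model and data into CPU memory, so the total can be much higher than a single-process run suggests.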

If your CPU memory allocation is fixed and you cannot allocate more, I’m sure you can try other solutions such as compressed models, DeepSpeed optimization levels, and more.

Good luck to future readers.
