torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Yeah, my issue got resolved after I set the Docker memory limit and specified the device IDs properly with CUDA_VISIBLE_DEVICES=…
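
In case it helps anyone else, here is a rough sketch of the device ID part done from Python instead of the shell. The helper name `set_visible_devices` and the IDs `[0, 1]` are made up for illustration and are not the values from my setup. The variable has to be set before CUDA is initialized in the process, so this should run before any CUDA call.

```python
import os


def set_visible_devices(device_ids):
    """Hypothetical helper: expose only the given GPU indices to this process."""
    # CUDA_VISIBLE_DEVICES takes a comma-separated list of device indices.
    value = ",".join(str(i) for i in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Illustrative IDs only; replace with the GPUs that actually exist on your machine.
visible = set_visible_devices([0, 1])
print(f"CUDA_VISIBLE_DEVICES={visible}")
```

On the Docker side, memory is usually set with `docker run` options such as `--memory` or `--shm-size`; which values you need depends on your container and workload.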
