torch.distributed.elastic.multiprocessing.errors.ChildFailedError

AnustupOCR · July 26, 2023, 8:04am

Hi, I have been facing the same issue. In my case, I am fine-tuning a TrOCR model to work on other languages, and I have swapped out the encoder and decoder.

The code runs perfectly fine on a single GPU, but when I try to train the model on multiple GPUs with the Accelerate library, I get the exact same error.

The error occurs randomly, mostly after validation ends and the next epoch starts. I have not been able to figure out the exact reason.

I am running the program in a Docker container and I checked the stats; memory (RAM) should not be an issue, as 1.8 TB is free in my case.
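Free RAM on the host does not necessarily match what the container is allowed to use, so the container's own cgroup memory limit may be worth checking as well. Below is a minimal sketch of such a check, meant to be run inside the container. It assumes the standard Linux cgroup v2 and v1 file locations; the helper names `parse_limit` and `container_memory_limit` are made up for illustration and are not part of Accelerate or PyTorch.

```python
from pathlib import Path
from typing import Optional

# Standard cgroup locations (assumption: a Linux container with cgroups mounted here).
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
)


def parse_limit(raw: str) -> Optional[int]:
    """Parse a cgroup limit value; cgroup v2 writes 'max' when there is no limit."""
    raw = raw.strip()
    if raw == "max":
        return None
    return int(raw)


def container_memory_limit() -> Optional[int]:
    """Return the memory limit in bytes, or None if unlimited or not readable."""
    for path in CGROUP_LIMIT_FILES:
        try:
            return parse_limit(path.read_text())
        except (OSError, ValueError):
            continue  # file missing or unreadable: try the next location
    return None


if __name__ == "__main__":
    limit = container_memory_limit()
    if limit is None:
        print("No cgroup memory limit found (unlimited or not readable).")
    else:
        # Note: cgroup v1 reports "unlimited" as a very large number instead of 'max'.
        print(f"Container memory limit: {limit / 2**30:.1f} GiB")
```

If the reported limit is much smaller than the host's free memory, the container limit rather than the host RAM would be the number to compare against.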

I am receiving the exact same error, with the same exit code: -9.
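For reference, a negative exit code follows the Python subprocess convention: `-N` means the child process was terminated by signal `N`, so `-9` corresponds to SIGKILL on Linux. A small sketch that decodes such a code is shown here; `describe_exit_code` is a hypothetical helper name, not something provided by torch or Accelerate.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Turn a process exit code into a short human-readable description."""
    if exitcode < 0:
        try:
            # Negative codes mean "killed by signal -exitcode".
            name = signal.Signals(-exitcode).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {-exitcode}"
        return f"terminated by signal {name}"
    return f"exited with status {exitcode}"


if __name__ == "__main__":
    print(describe_exit_code(-9))
```

SIGKILL cannot be caught by the process itself, which would explain why no Python traceback appears before the `ChildFailedError`. It only says that something outside the process killed it, not what or why.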

Please help me with any possible solutions.
