torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Hi, I have been facing the same issue. In my case, I am fine-tuning a TrOCR model to work on other languages, and I have swapped out the encoder and decoder.

The code runs perfectly fine on a single GPU, but when I tried to train the model across multiple GPUs with the accelerate library, I hit the exact same error.

The error occurs randomly, mostly after validation ends and the next epoch starts. I have not been able to figure out the exact reason.

I am running the program in a Docker container and I checked the stats; memory (RAM) should not be an issue, as 1.8 TB is free in my case.
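Free RAM on the machine and the limit applied to the container can differ, so it may be worth checking the limit from inside the container as well. Below is a minimal sketch of one way to do that, assuming a Linux container that exposes the standard cgroup files. The helper names `parse_limit` and `container_memory_limit` are made up for illustration, and this is not necessarily how the figure above was obtained.

```python
from pathlib import Path
from typing import Optional

# Standard cgroup locations for the memory limit (cgroup v2 first, then v1).
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),
)


def parse_limit(raw: str) -> Optional[int]:
    """Parse a cgroup limit value; cgroup v2 writes 'max' when there is no limit."""
    raw = raw.strip()
    if not raw or raw == "max":
        return None
    return int(raw)


def container_memory_limit() -> Optional[int]:
    """Return the memory limit in bytes, or None if no limit could be read."""
    for path in CGROUP_LIMIT_FILES:
        try:
            return parse_limit(path.read_text())
        except (OSError, ValueError):
            continue  # file missing or unreadable, try the next location
    return None


if __name__ == "__main__":
    limit = container_memory_limit()
    if limit is None:
        print("No cgroup memory limit found")
    else:
        # Note: cgroup v1 reports a very large number when the limit is unset.
        print(f"cgroup memory limit: {limit / 2**30:.1f} GiB")
```

If the printed limit is much smaller than the free RAM on the host, the container could still run out of memory even though the machine has plenty.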

I am receiving the exact same error with the same exitcode: -9.
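For anyone reading the exit code: as far as I understand, a negative exit code follows the Python convention that the worker process was terminated by a signal with that number, so -9 would correspond to signal 9. Here is a small sketch to decode it, assuming Linux; `describe_exit_code` is just an illustrative helper, not part of torch or accelerate.

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Describe a worker exit code; negative values are treated as signal numbers."""
    if exit_code < 0:
        signal_number = -exit_code
        try:
            name = signal.Signals(signal_number).name
        except ValueError:
            name = "unknown signal"  # number not defined on this platform
        return f"killed by signal {signal_number} ({name})"
    return f"exited with status {exit_code}"


if __name__ == "__main__":
    print(describe_exit_code(-9))  # on Linux: killed by signal 9 (SIGKILL)
```

SIGKILL cannot be caught by the process itself, which would explain why there is no Python traceback from the failing worker. On Linux it is often sent by the kernel out-of-memory killer, although something else sending the signal cannot be ruled out from the exit code alone.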

Please help me with any possible solutions.