Hi, I have been facing the same issue. In my case, I am fine-tuning a TrOCR model to work on other languages, and I have swapped out the encoder and decoder.
The code runs perfectly fine on a single GPU, but when I tried to train the model on multiple GPUs with the accelerate library, I hit the exact same error.
The error occurs randomly, mostly after validation ends and the next epoch starts. I have not been able to figure out the exact reason.
I am running the program in a Docker container and I checked the stats: memory (RAM) should not be an issue, as 1.8 TB is free in my case.
I am receiving the exact same error with the same exit code: -9.
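For context on that exit code: as far as I understand, Python's `subprocess` and `multiprocessing` modules report a process killed by a signal with the negated signal number, so -9 would correspond to SIGKILL. Below is a minimal sketch of that interpretation. The helper name `describe_exit_code` is my own and is not part of accelerate or PyTorch, and I am assuming the launcher follows the same convention.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a Python-style process exit code into a readable label.

    Assumption: a negative code means "terminated by signal number -code",
    which is the convention used by subprocess and multiprocessing.
    """
    if code < 0:
        # signal.Signals maps a signal number to its symbolic name.
        return signal.Signals(-code).name
    return f"exited normally with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(-9))
```

Since SIGKILL cannot be caught or handled by the process that receives it, the training script itself gets no chance to log anything when this happens, which may be why I see no traceback.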
Please help me with any possible solutions.