Training fails on multiple GPUs with CUDA runtime errors

I am fine-tuning a GPT2LMHeadModel. The code runs on a single GPU, but on multiple GPUs it fails with the error below. I have used multiple GPUs with this code base before; the error started appearing after I changed the data format.

File "/project/src/trainer.py", line 71, in train
pre_trainer.train()
File "/project/envs/lib/python3.9/site-packages/transformers/trainer.py", line 1409, in train
return inner_training_loop(
File "/project/envs/lib/python3.9/site-packages/transformers/trainer.py", line 1651, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/project/envs/lib/python3.9/site-packages/transformers/trainer.py", line 2363, in training_step
loss.backward()
File "project/envs/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/project/envs/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: unique_by_key: failed to synchronize: cudaErrorECCUncorrectable: uncorrectable ECC error encountered

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
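
To get the fuller trace that the hint above mentions, the variable has to be set before the training process starts. Here is a minimal sketch of one way to do that from Python. The helper name `enable_debug_env` is made up for illustration. Adding `CUDA_LAUNCH_BLOCKING=1` is my own assumption rather than something the error message asks for: it makes CUDA kernel launches synchronous, which may help the error surface closer to the call that triggers it.

```python
import os


def enable_debug_env(env=None):
    """Return a copy of the environment with debugging variables set.

    Hypothetical helper, not part of Hydra, transformers, or PyTorch.
    """
    # Copy so the caller's mapping (or os.environ) is not modified in place.
    env = dict(os.environ if env is None else env)
    # Requested by the Hydra hint in the error output.
    env["HYDRA_FULL_ERROR"] = "1"
    # Assumption: synchronous kernel launches for easier debugging.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching the training script, or the same two variables can be exported in the shell before running it.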