Training fails on multiple GPUs with CUDA runtime errors

adibm · September 30, 2022, 2:46pm

I am fine-tuning a GPT2LMHeadModel. The code runs fine on a single GPU, but on multiple GPUs it fails with the error below. I have trained on multiple GPUs with this code base before; the error only appeared after I made some changes to the data format.

  File "/project/src/trainer.py", line 71, in train
    pre_trainer.train()
  File "/project/envs/lib/python3.9/site-packages/transformers/trainer.py", line 1409, in train
    return inner_training_loop(
  File "/project/envs/lib/python3.9/site-packages/transformers/trainer.py", line 1651, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/project/envs/lib/python3.9/site-packages/transformers/trainer.py", line 2363, in training_step
    loss.backward()
  File "project/envs/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/project/envs/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: unique_by_key: failed to synchronize: cudaErrorECCUncorrectable: uncorrectable ECC error encountered
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
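
The last line of the traceback reports cudaErrorECCUncorrectable, which CUDA raises when GPU memory hits an ECC error it cannot correct. That points toward a fault on one of the devices rather than toward the data format, although the traceback alone does not prove it. A minimal sketch for checking each GPU's uncorrected ECC counter is shown below. It assumes nvidia-smi is on the PATH and that the installed driver supports the `ecc.errors.uncorrected.volatile.total` query field; the helper names (`parse_ecc_report`, `suspect_gpus`, `query_ecc_counts`) are hypothetical and not part of any library.

```python
import shutil
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu`.
# Assumption: the installed driver exposes the ECC counter below.
QUERY = "index,name,ecc.errors.uncorrected.volatile.total"


def parse_ecc_report(csv_text):
    """Parse `--format=csv,noheader` output into (index, name, count) tuples.

    count is None when the value is not a number, for example "[N/A]"
    on GPUs without ECC memory.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        raw = parts[-1]
        name = ", ".join(parts[1:-1])
        count = int(raw) if raw.isdigit() else None
        rows.append((int(parts[0]), name, count))
    return rows


def suspect_gpus(rows):
    """Return the indices of GPUs with a nonzero uncorrected ECC count."""
    return [index for index, _name, count in rows if count]


def query_ecc_counts():
    """Run nvidia-smi and return parsed rows, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_ecc_report(result.stdout)


if __name__ == "__main__":
    rows = query_ecc_counts()
    for index, name, count in rows:
        print(f"GPU {index} ({name}): uncorrected ECC errors = {count}")
    print("Suspect GPUs:", suspect_gpus(rows))
```

If one device reports a nonzero count, one way to confirm it is to hide that device with the CUDA_VISIBLE_DEVICES environment variable and rerun the multi-GPU job on the remaining ones. A single-GPU run that happens to use a healthy device would also explain why the problem only shows up with multiple GPUs.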
