Hi,
I am getting an "Expected float" error on one of my models when training with transformers. The error is raised at line 199, `Variable._execution_engine.run_backward(...)`, in the torch/autograd package.
Below is the complete stack trace of the error.
File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2745, in Trainer.training_step(self, model, inputs)
2743 else:
2744 logger.info(f"loss.dtype={loss.dtype} , loss={loss}\n.")
→ 2745 self.accelerator.backward(loss)
2747 return loss.detach() / self.args.gradient_accumulation_steps
File /opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py:1910, in Accelerator.backward(self, loss, **kwargs)
1908 print(f"acc: loss.dtype={loss.dtype} , loss={loss}
.“)
1909 logger.info(f"acc: loss.dtype={loss.dtype} , loss={loss}
.”)
→ 1910 loss.backward(**kwargs)
File /opt/conda/lib/python3.10/site-packages/torch/_tensor.py:489, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
479 return handle_torch_function(
480 Tensor.backward,
481 (self,),
(…)
486 inputs=inputs,
487 )
488 print(f"dtype={self.dtype} self={self} gradient={gradient}
.")
→ 489 torch.autograd.backward(
490 self, gradient, retain_graph, create_graph, inputs=inputs
491 )
File /opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py:199, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
197 print(f"tensors={tensors} , 0_dtype={tensors[0].dtype}
.“)
198 print(f"grad_tensors_={grad_tensors_} , 0_dtype={grad_tensors_[0].dtype}
.”)
→ 199 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
200 tensors, grad_tensors, retain_graph, create_graph, inputs,
201 allow_unreachable=True, accumulate_grad=True)
I did some debugging and logged the tensor values at each call in the stack trace:
loss.dtype=torch.float32 , loss=0.006293008103966713 (in trainer.py)
acc: loss.dtype=torch.float32 , loss=0.006293008103966713 (in accelerator.py)
dtype=torch.float32 self=0.006293008103966713 gradient=None (in torch/_tensor.py)
The two lines below are from the last frame, in torch/autograd/__init__.py:
tensors=(tensor(0.0063, device='cuda:0', grad_fn=<DivBackward0>),) , 0_dtype=torch.float32
grad_tensors_=(tensor(1., device='cuda:0'),) , 0_dtype=torch.float32
The line numbers above may be slightly off because I added print/logging statements before the function calls (roughly like the snippet below).
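For reference, the kind of logging I added looks roughly like this, shown here as a self-contained sketch on a dummy loss rather than inside the actual Trainer/Accelerate code paths:

```python
import torch

# Sketch of the dtype logging I inserted along the call chain
# (trainer.py, accelerator.py, torch/_tensor.py, torch/autograd/__init__.py),
# applied to a dummy loss here instead of the real model's loss.
x = torch.randn(4, requires_grad=True)
loss = (x ** 2).mean()
print(f"loss.dtype={loss.dtype} , loss={loss}")  # prints torch.float32, as in my logs above
loss.backward()
```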
As you can see, the tensor values are always float, yet I am getting a type error, which puzzles me. Which variable is the engine expecting to be float but finding to be long?
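In case it helps narrow things down, is something like the check below the right thing to run? It is only a minimal sketch (the helper name and the placeholder model are mine, not from transformers or accelerate) that scans a model's parameters and buffers for anything that is not floating point:

```python
import torch.nn as nn

def audit_dtypes(model: nn.Module) -> None:
    """Print every parameter or buffer whose dtype is not floating point."""
    for name, p in model.named_parameters():
        if not p.dtype.is_floating_point:
            print(f"non-float parameter: {name} dtype={p.dtype}")
    for name, b in model.named_buffers():
        if not b.dtype.is_floating_point:
            print(f"non-float buffer: {name} dtype={b.dtype}")

# Placeholder model just to make the sketch runnable; in my case this would be
# the model I pass to the Trainer.
audit_dtypes(nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2)))
```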
Thanks in advance for the help.