Loss becomes nan

swtb · April 12, 2024, 8:54am

Hello,

I am trying to train XCLIP from scratch using my own dataset of ~1500 videos that are between 32 and 512 frames long. I aim to train a few models using frame sampling of 32, 64 and 128 and a patch size of 14.

I have a single A100 (though I am considering using up to 8; I haven't implemented any distributed training yet) and have set up FP16, gradient accumulation, and gradient checkpointing.

My model does train, though the loss stays steady at 0.26 and eventually becomes "nan".

It's clear that this model isn't learning much. How do I diagnose this? Is it my data, my parameters, or my config?

Here are some errors I get when using anomaly detection:

RuntimeError: Function 'SoftmaxBackward0' returned nan values in its 0th output.

RuntimeError: Function 'MeanBackward1' returned nan values in its 0th output.
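
Anomaly detection reports the backward function that returned NaN, but the non-finite values can originate earlier, in the forward pass. Below is a minimal sketch of one way to narrow this down, assuming a plain PyTorch `nn.Module`: register forward hooks and record which submodules emit NaN or Inf. The names `find_nonfinite_modules` and `FullyMaskedScores` are hypothetical and are not part of XCLIP or Transformers. The toy module only reproduces one well-known NaN source (a softmax over a row that is entirely `-inf`), which may or may not be what is happening in this training run.

```python
import torch
from torch import nn


def find_nonfinite_modules(model: nn.Module, *inputs):
    """Run one forward pass and list modules whose tensor output has NaN/Inf.

    Names are recorded in the order the modules finish their forward call,
    so the first entry is the earliest place a non-finite value was seen.
    """
    bad, handles = [], []

    def make_hook(name):
        def hook(module, args, output):
            # Only plain tensor outputs are checked in this sketch; tuple or
            # dict-like outputs would need to be unpacked first.
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                bad.append(name or "<root>")
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        for handle in handles:
            handle.remove()  # always detach the hooks again
    return bad


class FullyMaskedScores(nn.Module):
    """Toy stand-in (hypothetical): every attention score is masked out."""

    def __init__(self):
        super().__init__()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, scores):
        masked = torch.full_like(scores, float("-inf"))
        return self.softmax(masked)  # softmax over all -inf yields NaN


bad_modules = find_nonfinite_modules(FullyMaskedScores(), torch.zeros(2, 3))
print(bad_modules)
```

Running the same helper on a single real batch, taken just before the loss turns to NaN, would show whether the forward pass is already non-finite or whether the problem only appears in the backward pass.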

I have seen a few topics posted with similar issues, but the recommended solutions are either not possible or have not worked in my case. I have tried gradient clipping, but it seems to just delay the issue. I have also tried training without FP16, which again just delays the onset of the problem.
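
For reference, here is a hedged sketch of how FP16, gradient accumulation, and gradient clipping are usually combined in a hand-written PyTorch loop. The actual training code is not shown in this post, so everything here is an assumption: `train_with_clipping`, the MSE loss, and the toy model are placeholders. The point it illustrates is that, with a `GradScaler`, gradients should be unscaled before clipping. Otherwise the clipping threshold is applied to the scaled gradients.

```python
import contextlib

import torch
from torch import nn


def train_with_clipping(model, optimizer, batches, accum_steps=4, max_norm=1.0):
    """Accumulate gradients over `accum_steps` batches, then clip and step.

    Mixed precision is only enabled when the model lives on a CUDA device;
    on CPU the scaler is disabled and becomes a pass-through.
    """
    device = next(model.parameters()).device
    use_amp = device.type == "cuda"
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    optimizer_steps = 0

    optimizer.zero_grad(set_to_none=True)
    for step, (x, y) in enumerate(batches, start=1):
        x, y = x.to(device), y.to(device)
        ctx = (
            torch.autocast("cuda", dtype=torch.float16)
            if use_amp
            else contextlib.nullcontext()
        )
        with ctx:
            # Placeholder loss; the real objective would go here.
            loss = nn.functional.mse_loss(model(x), y)

        # Divide so the accumulated gradient is an average over micro-batches.
        scaler.scale(loss / accum_steps).backward()

        if step % accum_steps == 0:
            scaler.unscale_(optimizer)  # bring grads back to true scale
            nn.utils.clip_grad_norm_(model.parameters(), max_norm)
            scaler.step(optimizer)      # skipped internally if grads are inf/NaN
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
            optimizer_steps += 1
    return optimizer_steps


# Toy usage with random data (not the XCLIP setup).
toy_model = nn.Linear(3, 1)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.01)
toy_batches = [(torch.randn(8, 3), torch.randn(8, 1)) for _ in range(8)]
n_steps = train_with_clipping(toy_model, toy_optimizer, toy_batches)
print(n_steps)
```

If clipping is already done this way and NaNs still appear without FP16, the cause is more likely in the data or the loss computation than in the mixed-precision setup.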
