Loss becomes NaN

Hello,

I am trying to train XCLIP from scratch using my own dataset of ~1500 videos that are between 32 and 512 frames long. I aim to train a few models using frame sampling of 32, 64 and 128 and a patch size of 14.
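To make "frame sampling of 32, 64 and 128" concrete, here is a minimal sketch of uniform index sampling in plain Python. The sampling strategy and the helper name `sample_frame_indices` are assumptions for illustration, not necessarily what the training pipeline does. Note that a clip shorter than the requested sample count (for example a 32-frame video sampled at 128) ends up with repeated frames under this scheme.

```python
from typing import List


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick `num_samples` frame indices spread uniformly over a clip.

    The clip is split into `num_samples` equal segments and the frame at the
    centre of each segment is taken. If the clip has fewer frames than
    `num_samples`, some indices are repeated.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    segment = num_frames / num_samples
    # Clamp to the last valid index to guard against rounding at the end.
    return [min(num_frames - 1, int((i + 0.5) * segment)) for i in range(num_samples)]


if __name__ == "__main__":
    long_clip = sample_frame_indices(512, 32)
    short_clip = sample_frame_indices(32, 128)
    print(len(long_clip), long_clip[:4])
    print(len(short_clip), len(set(short_clip)))
```

Checking how many distinct frames each clip actually contributes at 64 and 128 samples is one quick way to see whether the shorter videos are mostly duplicated frames.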

I have a single A100 (I am considering using up to 8, but haven't implemented any distributed training yet) and have set up FP16, gradient accumulation, and gradient checkpointing.

My model does train, but the loss stays steady at 0.26 and eventually becomes NaN.

It's clear that this model isn't learning much. How do I diagnose this? Is it my data, my parameters, or my config?

Here are some errors I get when using anomaly detection:

RuntimeError: Function 'SoftmaxBackward0' returned nan values in its 0th output.
RuntimeError: Function 'MeanBackward1' returned nan values in its 0th output.
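Anomaly detection reports where NaN shows up in the backward pass. A complementary check is to find the first module whose forward output is already non-finite. Below is a minimal sketch of that idea using PyTorch forward hooks. `register_nonfinite_hooks` is a hypothetical helper name, and the tiny `nn.Sequential` model is only a stand-in for the real network.

```python
import torch
import torch.nn as nn


def register_nonfinite_hooks(model: nn.Module, found: list) -> list:
    """Attach forward hooks that record the names of modules whose output
    contains NaN or Inf. Returns the hook handles so they can be removed."""
    handles = []
    for name, module in model.named_modules():
        def hook(mod, inputs, output, name=name):
            # Only plain tensors and tuples/lists of tensors are inspected here;
            # dict-like outputs would need extra handling.
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    found.append(name)
                    break
        handles.append(module.register_forward_hook(hook))
    return handles


if __name__ == "__main__":
    model = nn.Sequential(nn.Linear(4, 4), nn.Softmax(dim=-1))  # stand-in model
    found = []
    handles = register_nonfinite_hooks(model, found)
    model(torch.full((2, 4), float("inf")))  # deliberately bad input
    print("first non-finite module:", found[0] if found else None)
    for h in handles:
        h.remove()
```

Hooks fire in execution order, so the first entry in `found` is the earliest module to produce a bad value. If nothing is flagged in the forward pass but the backward pass still returns NaN, the inputs to the softmax (for example very large or fully masked logits) would be the next thing to inspect.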

I have seen a few topics posted with similar issues, but the recommended solutions are either not possible in my case or have not worked. I have tried gradient clipping, but it seems to just delay the issue. I have also tried training without FP16, which again seems to just delay the onset of the problem.