12% into epoch training loss drops to 0.0

I noticed a similar problem, but I was running on a much smaller model and dataset, so I was able to keep training for a while after the loss dropped to 0.0 (which happened very quickly).

In my experiments, this happened because there were NaN values in the model weights (and, oddly, in some parts of the loss as well). HF handles this by first outputting a sequence of zeros for the loss, and after a while it turns into NaN. You should be able to see this if you are tracking your gradients.
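
For reference, here's a minimal sketch of how you could scan for this (assuming a standard PyTorch model; `check_for_nans` is just a hypothetical helper name, not part of HF/TRL):

```python
import torch

def check_for_nans(model: torch.nn.Module) -> None:
    """Print any parameters or gradients containing NaN/Inf values.

    Call this after loss.backward() each step to catch the first
    point at which the weights or gradients go bad.
    """
    for name, param in model.named_parameters():
        # Check the weights themselves
        if torch.isnan(param).any() or torch.isinf(param).any():
            print(f"NaN/Inf in weights: {name}")
        # Check the gradients, if they exist for this parameter
        if param.grad is not None and (
            torch.isnan(param.grad).any() or torch.isinf(param.grad).any()
        ):
            print(f"NaN/Inf in gradients: {name}")
```

You could also turn on `torch.autograd.set_detect_anomaly(True)` to get a traceback at the op that first produces the NaN, though it slows training down a lot.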

Curious if you have made any progress on this?

See my issue here: TRL SFT super prone to nan when using data collator
