Training stops/crashes with no stack trace

I am running GPT-1 and GPT-2 models in a fairly standard setup, but sometimes training stops/crashes with no stack trace; the process is simply killed. I checked system metrics as well, and neither CPU nor GPU utilization went above ~80%. I tried setting a custom `logging_dir` in `TrainingArguments`, but nothing useful shows up there either. I have also tried `transformers.logging.set_verbosity_info()`.

I also log to wandb, and the run status there says "crashed", but I can't find the reason why.

Is there a way to get a trace of why it crashed?
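Not an authoritative answer, but one thing worth trying: Python's standard-library `faulthandler` module can dump a traceback when the interpreter dies on a fatal signal (segfault, abort, etc.), which sometimes explains an otherwise silent crash. A minimal sketch, to be placed at the top of the training script:

```python
import faulthandler
import sys

# Dump Python tracebacks for all threads on fatal signals
# (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
faulthandler.enable(file=sys.stderr)

print(faulthandler.is_enabled())  # → True
```

Caveat: if the process is killed with SIGKILL (for example by the kernel OOM killer, a common cause of training processes dying with no trace), no handler can run. In that case it may help to check `dmesg` or `journalctl -k` on the machine after a silent kill and look for "Out of memory: Killed process" messages.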

Have you found a solution? I am facing the same issue.