Training stops/crashes with no trace

I am training GPT-1 and GPT-2 models with a fairly standard setup. Sometimes the training stops or crashes, but there is no stack trace; the process is simply killed. I checked system metrics as well, and neither CPU nor GPU utilization went beyond ~80%. I tried setting a custom logging_dir in TrainingArguments, but nothing useful gets logged there either. I have also tried transformers.logging.set_verbosity_info().
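For reference, here is roughly what my training code looks like. The model name, hyperparameters, and the dummy dataset below are placeholders just to make the sketch self-contained, not my real data:

```python
import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Print INFO-level library messages to the console.
transformers.logging.set_verbosity_info()

model_name = "gpt2"  # same behaviour with "openai-gpt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny dummy dataset so the sketch runs; my real data is loaded elsewhere.
train_dataset = [tokenizer("some example text") for _ in range(128)]
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="out",
    logging_dir="logs",   # custom logging_dir, but nothing useful shows up there
    logging_steps=10,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    report_to="wandb",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=collator,
)
trainer.train()
```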

I also log to wandb, and the run status there says "crashed", but I can't find the reason why.

Is there a way I can get a trace of why it crashed?

Have you found a solution? I am facing the same issue.

My training also stops without any errors or logs… Did you solve this problem?

Are you running on Linux? When Linux starts running low on RAM, it will kill a process outright instead of having malloc() return NULL to indicate the condition. It's called the OOM killer if you want to look into it more.
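To check this, look at the kernel log right after a kill (`dmesg` usually shows an "Out of memory: Killed process …" line), and/or log host RAM from inside the training loop. Below is a rough sketch of a Trainer callback that does the latter; it assumes the `psutil` package is installed, and the callback name and logging interval are just made up for illustration:

```python
import psutil
from transformers import TrainerCallback


class HostMemoryCallback(TrainerCallback):
    """Print host RAM usage every few steps so a creeping OOM is visible in the console."""

    def __init__(self, every_n_steps: int = 10):
        self.every_n_steps = every_n_steps

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % self.every_n_steps == 0:
            mem = psutil.virtual_memory()
            print(
                f"step {state.global_step}: "
                f"RAM used {mem.percent:.1f}%, "
                f"available {mem.available / 1e9:.2f} GB"
            )


# Pass it to the Trainer, e.g.:
# trainer = Trainer(..., callbacks=[HostMemoryCallback()])
```

If the last printed value is climbing toward 100% right before the process dies, the OOM killer is almost certainly the culprit.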

Has anyone found a solution? I am facing the same problem while training Flan-T5-xl on my personal laptop. The training starts, but stops before the first iteration finishes, without any errors or warnings.