I am running GPT-1 and GPT-2 models in a pretty standard training setup. Sometimes the training stops/crashes, but there is no stack trace; the process is just killed. I checked system metrics as well: neither CPU nor GPU utilization went beyond ~80%. I tried a custom logging_dir in TrainingArguments, but nothing useful gets logged there either, and I have also tried calling transformers.logging.set_verbosity_info().
I also log to wandb, and there the run status says "crashed", but I can't find the reason why.
Is there a way I can get a trace of why it crashed?
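For reference, a minimal sketch of the logging setup I described (paths like `./logs` are placeholders, not my exact script):

```python
import transformers
from transformers import TrainingArguments

# What I have tried so far: raise the library log level and point the
# Trainer at a custom logging directory.
transformers.logging.set_verbosity_info()

training_args = TrainingArguments(
    output_dir="./out",    # placeholder path
    logging_dir="./logs",  # custom logging_dir, but nothing useful lands here
    logging_steps=10,
)
```

Even with this, there is no trace of any kind when the run dies.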
Are you running on Linux? When Linux starts running low on RAM, it will kill a process outright instead of having malloc() return NULL to indicate the condition. It's called the OOM killer, if you want to look into it more.
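You can confirm it after the fact with `dmesg -T | grep -i 'killed process'` or `journalctl -k | grep -i oom`. To watch it happen live, here's a minimal sketch of a Trainer callback (assumes Linux and the psutil package; `MemoryMonitorCallback` is just a name I made up):

```python
import psutil
from transformers import TrainerCallback


class MemoryMonitorCallback(TrainerCallback):
    """Print process RSS and system-wide available RAM at each logging
    step. A steady climb toward zero available memory right before the
    process disappears is the OOM killer's signature."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        rss_gb = psutil.Process().memory_info().rss / 1e9
        avail_gb = psutil.virtual_memory().available / 1e9
        print(
            f"[step {state.global_step}] rss={rss_gb:.2f} GB, "
            f"system available={avail_gb:.2f} GB",
            flush=True,
        )


# Usage: trainer.add_callback(MemoryMonitorCallback())
```

If available RAM bottoms out right where the run dies, shrink the batch size, stream the dataset instead of loading it fully, or add swap.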
Has anyone found a solution? I'm facing the same problem while training Flan-T5-xl on my personal laptop. Training starts, but it stops before the first iteration finishes, without any errors or warnings.