I am training GPT1 and GPT2 models with a pretty standard setup. Sometimes the training stops or crashes without any stack trace; the process is simply killed. I checked system metrics as well: neither CPU nor GPU utilization went beyond ~80%. I tried custom
TrainingArguments, but that doesn't log anything either. I have also tried setting
I also log to wandb, and the run status says
crashed, but I can't find the reason why.
Is there a way I can get a trace of why it crashed?
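
For context, here is a minimal sketch of the kind of tracing I have in mind, using Python's standard `faulthandler` module. The helper name `enable_crash_tracing` and the file name `crash_trace.log` are made up for illustration. It assumes the process dies from a fault or signal that Python can still react to (for example a segfault or SIGTERM); if the kernel sends SIGKILL, as the Linux OOM killer does, nothing in the process can catch it and the kernel log would be the place to look instead.

```python
import faulthandler
import signal


def enable_crash_tracing(log_path="crash_trace.log"):
    """Hypothetical helper: write Python tracebacks to a file on fatal signals.

    Returns the open file object. The caller must keep a reference to it,
    because faulthandler writes to the underlying file descriptor.
    """
    log_file = open(log_path, "w")

    # Dump the tracebacks of all threads on SIGSEGV, SIGFPE, SIGABRT,
    # SIGBUS and SIGILL.
    faulthandler.enable(file=log_file, all_threads=True)

    # faulthandler.register is not available on Windows, so guard it.
    # chain=True runs the previous SIGTERM handler after the dump, so the
    # process still terminates as it normally would.
    if hasattr(faulthandler, "register"):
        faulthandler.register(
            signal.SIGTERM, file=log_file, all_threads=True, chain=True
        )

    return log_file
```

The idea would be to call `trace_file = enable_crash_tracing()` once at the top of the training script, before starting training, and then check `crash_trace.log` after the next crash. If that file stays empty, it would suggest the process was killed from outside with a signal that cannot be handled.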