I am training GPT-1 and GPT-2 models in a fairly standard way. Sometimes the training stops/crashes with no stack trace; the process is simply killed. I checked system metrics as well: neither CPU nor GPU utilization went above ~80%. I tried setting a custom `logging_dir` in `TrainingArguments`, but nothing useful gets logged there either. I have also tried `transformers.logging.set_verbosity_info()`.
I also log to wandb, and the run status there says `crashed`, but I can't find the reason why. Is there a way I can get a trace of why it crashed?
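For reference, this is roughly how the run is set up. It's a sketch, not my exact script: the model name, paths, dataset variable, and hyperparameters below are placeholders.

```python
import transformers
from transformers import (
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

# Surface transformers' own INFO-level messages on stdout.
transformers.logging.set_verbosity_info()

model = GPT2LMHeadModel.from_pretrained("gpt2")       # placeholder checkpoint
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

training_args = TrainingArguments(
    output_dir="./output",            # placeholder path
    logging_dir="./logs",             # the custom logging_dir I mentioned
    logging_steps=10,
    report_to="wandb",                # the run whose status shows "crashed"
    num_train_epochs=3,               # placeholder hyperparameters
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,      # my tokenized dataset, defined elsewhere
)
trainer.train()                       # this is the call that dies mid-run
```

Even with `logging_dir` set and verbosity at INFO, the process terminates without writing any traceback.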