Training stops/crashes with no trace

spiralarchitect · June 18, 2021, 7:08pm

I am training GPT1 and GPT2 models with a pretty standard setup. Sometimes the training stops or crashes with no stack trace; the process is simply killed. I checked system metrics as well: neither CPU nor GPU utilization went beyond ~80%. I tried a custom logging_dir in TrainingArguments, but nothing gets logged there either. I have also tried setting transformers.logging.set_verbosity_info().

I also log to wandb, and the run status says crashed, but I can’t find the reason why.

Is there a way to get a trace of why it crashed?

hfuser · July 28, 2022, 9:53am

Have you found a solution? I am facing the same issue.

kaane0202 · April 20, 2023, 12:57pm

My training also stops without any errors or logs… Did you solve this problem?

cardboardaardvark · April 21, 2023, 9:20am

Are you running on Linux? When Linux starts running low on RAM it will kill a process instead of having malloc() return NULL to indicate the condition. It’s called the OOM killer if you want to look into it more.
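
A process killed this way receives SIGKILL, so Python never gets the chance to print a traceback; the only record is usually in the kernel log. Below is a minimal sketch of how one might check for that, assuming a Linux machine where `dmesg` is available and readable (on some systems it needs root). The helper names `find_oom_kills` and `read_kernel_log` are made up for illustration, and the regex only covers the common "Killed process <pid> (<name>)" wording, which can vary between kernel versions.

```python
import re
import subprocess

# Matches the kernel's OOM-killer victim line, e.g.
# "Out of memory: Killed process 4242 (python) total-vm:..."
# Assumption: wording may differ on some kernel versions.
OOM_PATTERN = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log: str):
    """Return (pid, process_name) pairs for each OOM-killer victim in the text."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_PATTERN.finditer(kernel_log)]


def read_kernel_log() -> str:
    """Read the kernel ring buffer via dmesg; returns '' if dmesg cannot be run."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except OSError:
        # dmesg is missing or not executable on this system
        return ""
    return result.stdout


if __name__ == "__main__":
    kills = find_oom_kills(read_kernel_log())
    if not kills:
        print("No OOM kills found (or the kernel log is not readable).")
    for pid, name in kills:
        print(f"OOM killer terminated pid {pid} ({name})")
```

If the PID of your training process shows up here, the crash was almost certainly the machine running out of RAM rather than anything in the training code itself.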

devvanshhh · November 15, 2023, 9:11am

Has anyone found a solution? I face the same problem while training Flan-T5-xl on my personal laptop. The training starts, but it stops before the first iteration is done, without any errors or warnings.
