Trainer using Checkpoint makes TPU crash

Hi,

I am training a BERT model from scratch on Google Cloud, and it is working.
But when I resume training from a saved checkpoint (passed as “model_name_or_path”), the TPU crashes with a SIGKILL error (a memory issue, I guess).

I don’t understand this behavior, since if I rerun it without the checkpoint it works with no problems.
Resuming from the checkpoint only works if I use a single core (num_cores=1), which is not convenient for me since training takes much longer.
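
For reference, a stripped-down sketch of what the resume looks like on my side (the checkpoint path, dataset and hyperparameters below are simplified placeholders, and the real run is launched on 8 TPU cores):

```python
# Simplified sketch of resuming a from-scratch BERT run with Trainer.
# The checkpoint directory and the dataset are placeholders for my real setup.
from transformers import AutoModelForMaskedLM, AutoTokenizer, Trainer, TrainingArguments

checkpoint_dir = "./bert-from-scratch/checkpoint-50000"  # hypothetical path

tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForMaskedLM.from_pretrained(checkpoint_dir)  # "model_name_or_path"

training_args = TrainingArguments(
    output_dir="./bert-from-scratch",
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # tokenized dataset, prepared elsewhere
)

# Resuming from the checkpoint also restores the optimizer/scheduler state.
trainer.train(resume_from_checkpoint=checkpoint_dir)
```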

Anyone have an idea to help me?

Thanks

You might need more RAM to be able to resume from a checkpoint. The core of the issue is that the optimizer state is loaded into CPU memory in each TPU process before being transferred to the XLA device (it sadly can’t be loaded directly onto the XLA device), and since you have 8 processes loading it, it ends up in CPU memory 8 times.
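
Schematically (this is not the exact Trainer code, just an illustration of the pattern), each spawned process does something like this when resuming:

```python
# Illustration only -- not the actual Trainer internals.
# Each of the 8 spawned TPU processes deserializes the optimizer state
# into host (CPU) memory before it can be moved to its XLA device,
# so the state sits in RAM 8 times at once.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    # torch.load cannot target the XLA device directly, so the tensors
    # land on CPU first -- once per process.
    optimizer_state = torch.load(
        "checkpoint-50000/optimizer.pt", map_location="cpu"  # hypothetical path
    )
    # Only after this CPU copy exists can optimizer.load_state_dict(...)
    # move the tensors onto `device`.

if __name__ == "__main__":
    xmp.spawn(_mp_fn, nprocs=8)
```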

Thanks a lot @sgugger. I am not very familiar with these issues. I am currently on a 30GB-memory VM instance; how can I figure out how much memory I should upgrade to?

It depends on the model you are using. With the default Adam optimizer, the optimizer state is twice the size of the model, and that amount needs to be multiplied by 8 (one copy per TPU process).
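
If you want a quick back-of-the-envelope number, you can check the size of optimizer.pt in your checkpoint and multiply by the number of processes (the path below is just an example):

```python
# Rough estimate of the extra host RAM needed to resume on 8 TPU cores.
# The checkpoint path is an example; adjust it to your own run.
import os

num_processes = 8
optimizer_state_gb = os.path.getsize("checkpoint-50000/optimizer.pt") / 1e9

extra_ram_gb = num_processes * optimizer_state_gb
print(f"optimizer state: {optimizer_state_gb:.2f} GB, "
      f"~{extra_ram_gb:.1f} GB of CPU RAM needed just to load it {num_processes} times")
```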

Ok, I see. My optimizer.pt is 1.43 GB, so I need at least ~12 GB more RAM.
Thanks again
