I am training a BERT model from scratch on Google Cloud, and it works.
But when I resume training from a saved checkpoint using “model_name_or_path”, the TPU run crashes with a SIGKILL error (a memory issue, I guess).
I don’t understand this behavior, since if I rerun it without the checkpoint it works with no problems.
Resuming from the checkpoint works only if I use a single core (num_cores=1), which is not practical for me because training takes much longer.
You might need more RAM to be able to resume from a checkpoint. The core of the issue is that the optimizer state is loaded on the CPU before being transferred to the XLA device (sadly, it can’t be loaded directly on the XLA device), and since you have 8 processes each loading it, it ends up loaded 8 times in CPU memory.
Thanks a lot @sgugger. I am not very familiar with these issues. I am currently on a VM instance with 30GB of memory; how can I work out how much memory I should upgrade to?
It depends on the model you are using. With the default Adam, the optimizer state is twice the size of the model (Adam keeps two values per parameter), and since each of the 8 processes loads its own copy, that needs to be multiplied by 8.
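
To make that arithmetic concrete, here is a rough sketch of the estimate. The function name, the 4 bytes per value (fp32), and the 110M parameter count (roughly BERT-base) are assumptions for illustration, so plug in your own model's numbers. It only counts the optimizer state, not the model weights or anything else each process holds in RAM, so treat the result as a lower bound.

```python
def optimizer_state_bytes(num_params, bytes_per_value=4,
                          states_per_param=2, num_processes=8):
    """Rough CPU RAM needed to hold the optimizer state while resuming.

    Assumptions (not taken from any library):
      - bytes_per_value=4: state stored in fp32
      - states_per_param=2: Adam keeps two values per parameter
      - num_processes=8: one process per TPU core, each loading its own copy
    """
    per_process = num_params * bytes_per_value * states_per_param
    return per_process * num_processes


GIB = 1024 ** 3

# Assumption: about 110M parameters, roughly the size of BERT-base.
num_params = 110_000_000
total = optimizer_state_bytes(num_params)
print(f"Optimizer state across 8 processes: {total / GIB:.1f} GiB")
```

If you have the model in memory, you can get the real parameter count with `sum(p.numel() for p in model.parameters())` instead of the guessed value.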