You might need more host RAM to be able to resume from a checkpoint. The core of the issue is that the optimizer state is first loaded on CPU before being transferred to the XLA device (sadly, it can't be loaded directly on the XLA device), and since you have 8 processes each loading it, it ends up in CPU memory 8 times.
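To get a feel for how much host RAM that is, here is a rough back-of-the-envelope sketch. It assumes an Adam-style optimizer that keeps two fp32 buffers per parameter; the function name and the 1.5B-parameter model size are made up for illustration, so plug in your own numbers.

```python
def peak_host_ram_bytes(num_params, num_processes=8,
                        buffers_per_param=2, bytes_per_value=4):
    """Rough host RAM needed if every process holds a full CPU copy
    of the optimizer state at the same time.

    Assumptions (not taken from the thread): Adam-style state with two
    buffers per parameter, stored as 4-byte floats.
    """
    per_process = num_params * buffers_per_param * bytes_per_value
    return per_process * num_processes


# Hypothetical model size, purely for illustration.
estimate = peak_host_ram_bytes(1_500_000_000)
print(f"~{estimate / 2**30:.1f} GiB of host RAM for optimizer state alone")
```

This only counts the optimizer state; the model weights and whatever else each process keeps on CPU come on top of it.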