I was several days in on finetuning the medium model when my machine shut down due to a power failure. How do I restart from the last saved checkpoint? I was using the following command:
To resume a previous training run, you need both the saved model state and the saved optimizer state. (The optimizer state takes up a surprisingly large amount of space: about half the size of the model.)
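As a rough illustration of saving and restoring both pieces together, here is a minimal sketch that assumes PyTorch. The question does not say which framework or training script was used, so treat this as an assumption. The helper names `save_checkpoint` and `load_checkpoint`, the dictionary keys, and the tiny `nn.Linear` model are all hypothetical stand-ins.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(path, model, optimizer, step):
    # Store the model weights, the optimizer state, and the step count in one file.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": step,
        },
        path,
    )


def load_checkpoint(path, model, optimizer):
    # Restore both state dicts in place and return the step to resume from.
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint["step"]


# Tiny stand-in model (hypothetical); a real run would use the fine-tuned model.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Take one optimizer step so that the optimizer actually has state to save.
model(torch.randn(3, 4)).sum().backward()
optimizer.step()

checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
save_checkpoint(checkpoint_path, model, optimizer, step=7)

# Simulate a restart: build fresh objects, then load the saved state into them.
resumed_model = nn.Linear(4, 2)
resumed_optimizer = torch.optim.Adam(resumed_model.parameters(), lr=1e-3)
resumed_step = load_checkpoint(checkpoint_path, resumed_model, resumed_optimizer)
```

If the training script only wrote out the model weights, the `optimizer` entry will simply not exist, which is the situation where a fresh run is needed.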
If you don't have the optimizer state dict, you can still load the saved model from the model checkpoint, but you will need to start a new training run. You will probably need to estimate how far along the first run was and what learning rate it had reached, so that the new run can pick up from roughly that point.
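One way to make that estimate is sketched below. It assumes a linear warmup followed by a linear decay to zero, which may not match the schedule the original run used, and every number in it (seconds per step, peak learning rate, warmup and total steps) is a made-up placeholder to be replaced with values from your own logs and command line.

```python
def estimate_step(elapsed_seconds, seconds_per_step):
    # Approximate number of optimizer steps completed before the crash.
    return int(elapsed_seconds // seconds_per_step)


def estimate_lr(step, peak_lr, warmup_steps, total_steps):
    # Assumed schedule: linear warmup to peak_lr, then linear decay to zero.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Placeholder values: three days of training at 2.5 seconds per step.
steps_done = estimate_step(elapsed_seconds=3 * 24 * 3600, seconds_per_step=2.5)
lr_reached = estimate_lr(steps_done, peak_lr=1e-4, warmup_steps=1000, total_steps=200000)
print(steps_done, lr_reached)
```

The new run can then be started from the saved weights with a learning rate close to `lr_reached`, ideally with a short warmup of its own, since the optimizer's running statistics start from scratch.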