Restarting gpt-2 finetuning after power failure

[I am assuming that gpt-2 saving works in the same way as BERT saving. I am not an expert.]

Hi @pgfeldman

Did you save the optimizer state dictionary?

To restart a previous training run, you need both the saved model state and the saved optimizer state. (The optimizer state takes up a surprisingly large amount of space: with Adam-style optimizers, which keep two running averages per parameter, it is roughly twice the size of the model weights.)
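Here is a minimal sketch of saving and restoring both pieces in plain PyTorch. The tiny `nn.Linear` model, the checkpoint path, and the dictionary keys are stand-ins I made up for illustration; I am assuming your GPT-2 model and optimizer are ordinary PyTorch objects that expose `state_dict()` and `load_state_dict()`.

```python
import os
import tempfile

import torch
from torch import nn

# Tiny stand-in model; a real run would use the GPT-2 model object instead.
model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One dummy training step so the optimizer has some state to save.
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Save both state dicts, plus the step count, in one checkpoint file.
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")  # hypothetical path
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": 1},
    ckpt_path,
)

# After the restart: rebuild the same objects, then load the saved state.
resumed_model = nn.Linear(4, 2)
resumed_optimizer = torch.optim.AdamW(resumed_model.parameters(), lr=5e-5)
checkpoint = torch.load(ckpt_path)
resumed_model.load_state_dict(checkpoint["model"])
resumed_optimizer.load_state_dict(checkpoint["optimizer"])
start_step = checkpoint["step"]  # continue the training loop from here
```

If you also use a learning-rate scheduler, its state can be saved and restored in the same way through its own `state_dict()`.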

If you haven’t got the optimizer state-dict, you can still load the saved model from the model checkpoint, but you will need to start a new training run. You will probably need to estimate how far along the first run was, and what learning rate it had reached.
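As a rough sketch of that estimate: if the first run used linear warmup followed by linear decay to zero (an assumption on my part, so check your own training settings), the learning rate at a given step can be worked out directly. The function name and all the numbers below are made up for illustration.

```python
def linear_schedule_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: ramp linearly from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical values: the run crashed around step 6000 of a planned 10000.
lr_at_crash = linear_schedule_lr(
    step=6000, peak_lr=5e-5, warmup_steps=500, total_steps=10000
)
print(lr_at_crash)
```

You could then start the new run with a peak learning rate close to that value and little or no warmup, so the restarted run roughly picks up where the old schedule left off.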

This thread might help: