Restarting gpt-2 finetuning after power failure

Hi All,

I was several days into finetuning the medium model when my machine shut down due to a power failure. How do I restart from the last saved checkpoint? I was using the following command:

python --output_dir=output --model_type=gpt2 --model_name_or_path=D:\Development\models\gpt2-medium --do_train --train_data_file=./twitter_train.txt --do_eval --eval_data_file=./twitter_test.txt --per_device_train_batch_size=1 --overwrite_output_dir


[I am assuming that gpt-2 saving works in the same way as BERT saving. I am not an expert.]

Hi @pgfeldman

Did you save the optimizer state dictionary?

To resume a previous training run, you need both the saved model state and the saved state of the optimizer. (The optimizer state takes up a surprisingly large amount of space. With Adam-style optimizers, which keep two running averages for every parameter, it is typically around twice the size of the model weights.)
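A quick way to check what you have on disk is sketched below. It assumes the run wrote `checkpoint-<step>` folders into the output directory, each with an `optimizer.pt` file next to the model weights, which is the layout I would expect from the Hugging Face training scripts; if your version names things differently, adjust accordingly. The helper `find_latest_checkpoint` is a made-up name for illustration, not a library function.

```python
import os
import re

# Hypothetical helper (not part of any library). Assumes checkpoints live in
# folders named "checkpoint-<step>" and that the optimizer state, if it was
# saved, sits in a file called "optimizer.pt" inside that folder.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir):
    """Return (path, step, has_optimizer) for the newest checkpoint, or None."""
    if not os.path.isdir(output_dir):
        return None

    best = None  # (path, step) of the highest-numbered checkpoint seen so far
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            step = int(match.group(1))
            if best is None or step > best[1]:
                best = (path, step)

    if best is None:
        return None

    path, step = best
    has_optimizer = os.path.isfile(os.path.join(path, "optimizer.pt"))
    return path, step, has_optimizer
```

Calling `find_latest_checkpoint("output")` returns `None` if nothing was saved. Otherwise it returns the folder, the step it was saved at, and whether an optimizer file is present. The step number is also useful if you end up having to start a fresh run from the saved weights.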

If you haven’t got the optimizer state dict, you can still load the saved model from the model checkpoint, but you will need to start a new training run. You will probably need to estimate how far along the first run was and what learning rate it had reached.
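For the learning rate estimate, here is a rough sketch. It assumes the first run used a linear schedule (optional linear warmup, then linear decay to zero), which is a common default but worth checking against your own settings. The function name `estimate_linear_lr` and the default values `base_lr=5e-5` and `warmup_steps=0` are placeholders of mine, not values taken from your run.

```python
def estimate_linear_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Estimate the learning rate at `step` under linear warmup + linear decay.

    All arguments are assumptions about the interrupted run: `total_steps` is
    the planned number of optimizer steps, `base_lr` the peak learning rate.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")

    # Clamp the step into the valid range of the schedule.
    step = min(max(step, 0), total_steps)

    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)

    # Decay phase: fall linearly from base_lr to 0 at total_steps.
    remaining = total_steps - step
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

If, say, the last checkpoint was at step 400,000 of a planned 1,000,000, then `estimate_linear_lr(400_000, 1_000_000)` gives a starting learning rate for the new run. It is only an approximation, since the new optimizer starts without the running averages the old one had built up.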

This thread might help: