Restarting gpt-2 finetuning after power failure

pgfeldman · November 8, 2020, 1:49pm

Hi All,

I was several days in on finetuning the medium model when my machine shut down due to a power failure. How do I restart from the last saved checkpoint? I was using the following command:

python run_language_modeling.py --output_dir=output --model_type=gpt2 --model_name_or_path=D:\Development\models\gpt2-medium --do_train --train_data_file=./twitter_train.txt --do_eval --eval_data_file=./twitter_test.txt --per_device_train_batch_size=1 --overwrite_output_dir

Thanks!

rgwatwormhill · November 10, 2020, 11:49am

[I am assuming that gpt-2 saving works in the same way as BERT saving. I am not an expert.]

Hi @pgfeldman

Did you save the optimizer state-dictionary?

In order to restart a previous training run, you need to have both the saved model state and the state of the optimizer’s parameters. (The optimizer state takes up a surprisingly large amount of disk space - with an Adam-type optimizer, which keeps two running averages per parameter, roughly twice the size of the model weights).
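To check what actually got saved, something like the sketch below might help. It assumes the script wrote periodic `checkpoint-<step>` folders into the output directory and that the optimizer state sits in a file called `optimizer.pt` inside each one. Those names are my assumption about how the Trainer-based scripts lay things out, so check them against your own `output` folder. The function name `latest_checkpoint` is made up for illustration.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir, optimizer_file="optimizer.pt"):
    """Find the highest-numbered checkpoint-<step> folder in output_dir.

    Returns a tuple (step, path, has_optimizer_state), or None if the
    directory is missing or contains no checkpoint folders.
    The folder pattern and optimizer file name are assumptions.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None

    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for child in root.iterdir():
        match = pattern.match(child.name)
        if match and child.is_dir():
            candidates.append((int(match.group(1)), child))

    if not candidates:
        return None

    # Pick the checkpoint with the largest step number.
    step, path = max(candidates, key=lambda item: item[0])
    has_optimizer_state = (path / optimizer_file).is_file()
    return step, path, has_optimizer_state


if __name__ == "__main__":
    # "output" matches the --output_dir used in the original command.
    print(latest_checkpoint("output"))
```

If the last element of the result is `True`, the optimizer state was saved alongside the model and a proper restart should be possible. If it is `False`, only the model weights are available.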

If you haven’t got the optimizer state-dict, then you can still load the saved model from the model checkpoint, but you will need to start a new training run. You will probably need to estimate how far along the first run was, and what learning rate it had reached.
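For the learning rate estimate, here is a rough sketch. It assumes the first run used a linear schedule (optional linear warmup, then linear decay to zero), which I believe is what these scripts use unless told otherwise, but I have not checked. The function name and all the numbers in the example are hypothetical.

```python
def linear_schedule_lr(step, total_steps, base_lr, warmup_steps=0):
    """Estimate the learning rate at a given step of a linear schedule.

    Assumes linear warmup from 0 to base_lr over warmup_steps, followed by
    linear decay from base_lr to 0 at total_steps.
    """
    if step < warmup_steps:
        # Warmup phase: ramp up proportionally to the step count.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: proportional to the steps still remaining.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    # Hypothetical numbers: a run planned for 10000 steps that died at step 5000.
    print(linear_schedule_lr(step=5000, total_steps=10000, base_lr=5e-5))
```

The result is only an estimate of where the old schedule had got to. You would use it as the starting learning rate of the new run and shorten the new run by the number of steps already done.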

This thread might help:

