Warning occurred when trying to load a checkpoint to continue training

When I tried to load the trainer from a specific checkpoint (which was generated during a previous training run):

trainer.train("checkpoint-100")

The model did continue to train from the given checkpoint, but I also encountered this warning:

UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler.

warnings.warn(SAVE_STATE_WARNING, UserWarning)

Inside the “checkpoint-100” directory, there are 5 files: config.json, optimizer.pt, pytorch_model.bin, scheduler.pt, training_args.bin
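For reference, the optimizer and scheduler pieces of such a checkpoint can be restored by hand with plain PyTorch. Below is a minimal, self-contained sketch of that round trip; the file names mirror the checkpoint directory above, but the tiny `model`, `optimizer`, and `scheduler` objects (and the hyperparameters) are stand-ins, not what the Trainer actually builds:

```python
import os
import tempfile

import torch

# Stand-in model/optimizer/scheduler (assumption: in real use these come
# from your training setup, not these toy values).
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

ckpt_dir = tempfile.mkdtemp()

# Save the same pieces the Trainer writes into checkpoint-100/.
torch.save(model.state_dict(), os.path.join(ckpt_dir, "pytorch_model.bin"))
torch.save(optimizer.state_dict(), os.path.join(ckpt_dir, "optimizer.pt"))
torch.save(scheduler.state_dict(), os.path.join(ckpt_dir, "scheduler.pt"))

# Restore: load the optimizer state *together with* the scheduler state,
# which is exactly what the warning asks for.
model.load_state_dict(torch.load(os.path.join(ckpt_dir, "pytorch_model.bin")))
optimizer.load_state_dict(torch.load(os.path.join(ckpt_dir, "optimizer.pt")))
scheduler.load_state_dict(torch.load(os.path.join(ckpt_dir, "scheduler.pt")))
```

Note that on the PyTorch versions from this era, calling `scheduler.state_dict()` or `scheduler.load_state_dict()` emits this warning unconditionally, even when the optimizer state is handled correctly alongside it.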

Update:
The model loss reset to a higher value after loading the checkpoint with the warning.
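A loss that resets after resuming usually means the optimizer/scheduler state was not actually restored, so the run effectively starts from a fresh learning-rate schedule. A quick sanity check is to verify that the schedule survives a save/load round trip; this sketch uses plain PyTorch objects (toy hyperparameters, not the real training config) and saves both states in one dict:

```python
import io

import torch

model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

# Advance a few steps so the scheduler has real state to lose.
for _ in range(4):
    optimizer.step()
    scheduler.step()

# Save optimizer and scheduler state together, as the warning suggests.
state = {"optimizer": optimizer.state_dict(), "scheduler": scheduler.state_dict()}
buffer = io.BytesIO()
torch.save(state, buffer)
buffer.seek(0)

# Fresh objects, as if the process had restarted from the checkpoint.
model2 = torch.nn.Linear(2, 2)
optimizer2 = torch.optim.SGD(model2.parameters(), lr=0.1)
scheduler2 = torch.optim.lr_scheduler.StepLR(optimizer2, step_size=2, gamma=0.5)

loaded = torch.load(buffer)
optimizer2.load_state_dict(loaded["optimizer"])
scheduler2.load_state_dict(loaded["scheduler"])

# The restored schedule continues where it left off instead of resetting:
# last_epoch is back to 4 and the learning rate has been halved twice.
print(scheduler2.last_epoch, optimizer2.param_groups[0]["lr"])
```

If the restored learning rate matches the pre-save one but the loss still jumps, the regression is elsewhere (e.g. data ordering or model weights), not in the scheduler state.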

Hi, I’m facing a similar issue. How did you solve it?

It appeared to be a bug in my own code. I did manage to fix it, but I no longer remember how.

On Thu, Sep 17, 2020 at 09:23, Alex via Hugging Face Forums <hellohellohello@discoursemail.com> wrote:

I have this problem too. I am running finetune_t5.sh (or the corresponding BART script) as provided in transformers/examples/seq2seq. The warning appears even when running the example script unmodified:

From transformers/examples/seq2seq, run ./finetune_bart_tiny.sh and observe the following output, with the warning about the scheduler:

cnn_tiny.tgz 100%[================================================>] 22.59K --.-KB/s in 0.08s

2020-10-11 22:53:22 (299 KB/s) - ‘cnn_tiny.tgz’ saved [23131/23131]

x cnn_tiny/
x cnn_tiny/train.target
x cnn_tiny/train.source
x cnn_tiny/val.source
x cnn_tiny/val.target
x cnn_tiny/test.source
x cnn_tiny/test.target

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
Epoch 1: 100%|█████████████████████████████████████████████████████| 4/4 [00:48<00:00, 12.22s/it, loss=10.838, v_num=6]/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:200: UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler.
warnings.warn(SAVE_STATE_WARNING, UserWarning)
Epoch 1: 100%|█████████████████████████████████████████████████████| 4/4 [00:59<00:00, 14.81s/it, loss=10.838, v_num=6]
(cryptic)


Bug filed: https://github.com/huggingface/transformers/issues/7765