Warning occurred when trying to load a checkpoint to continue training

When I tried to load the trainer from a specific checkpoint (which was generated during a previous training run):

trainer.train("checkpoint-100")

The model did continue to train from the given checkpoint, but I also encountered this warning:

UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler.

warnings.warn(SAVE_STATE_WARNING, UserWarning)

Inside the “checkpoint-100” directory, there are 5 files: config.json, optimizer.pt, pytorch_model.bin, scheduler.pt, training_args.bin
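For reference, the optimizer and scheduler pieces of such a checkpoint can be restored by hand with plain PyTorch. Below is a minimal, self-contained sketch of that round trip; the file names mirror the checkpoint directory above, but the tiny `model`, `optimizer`, and `scheduler` objects (and the hyperparameters) are stand-ins, not what the Trainer actually builds:

```python
import os
import tempfile

import torch

# Stand-in model/optimizer/scheduler (assumption: in real use these come
# from your training setup, not these toy values).
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

ckpt_dir = tempfile.mkdtemp()

# Save the same pieces the Trainer writes into checkpoint-100/.
torch.save(model.state_dict(), os.path.join(ckpt_dir, "pytorch_model.bin"))
torch.save(optimizer.state_dict(), os.path.join(ckpt_dir, "optimizer.pt"))
torch.save(scheduler.state_dict(), os.path.join(ckpt_dir, "scheduler.pt"))

# Restore: load the optimizer state *together with* the scheduler state,
# which is exactly what the warning asks for.
model.load_state_dict(torch.load(os.path.join(ckpt_dir, "pytorch_model.bin")))
optimizer.load_state_dict(torch.load(os.path.join(ckpt_dir, "optimizer.pt")))
scheduler.load_state_dict(torch.load(os.path.join(ckpt_dir, "scheduler.pt")))
```

Note that on the PyTorch versions from this era, calling `scheduler.state_dict()` or `scheduler.load_state_dict()` emits this warning unconditionally, even when the optimizer state is handled correctly alongside it.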

Update:
The model loss reset to a higher value after loading the checkpoint with the warning.
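A loss that resets after resuming usually means the optimizer/scheduler state was not actually restored, so the run effectively starts from a fresh learning-rate schedule. A quick sanity check is to verify that the schedule survives a save/load round trip; this sketch uses plain PyTorch objects (toy hyperparameters, not the real training config) and saves both states in one dict:

```python
import io

import torch

model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

# Advance a few steps so the scheduler has real state to lose.
for _ in range(4):
    optimizer.step()
    scheduler.step()

# Save optimizer and scheduler state together, as the warning suggests.
state = {"optimizer": optimizer.state_dict(), "scheduler": scheduler.state_dict()}
buffer = io.BytesIO()
torch.save(state, buffer)
buffer.seek(0)

# Fresh objects, as if the process had restarted from the checkpoint.
model2 = torch.nn.Linear(2, 2)
optimizer2 = torch.optim.SGD(model2.parameters(), lr=0.1)
scheduler2 = torch.optim.lr_scheduler.StepLR(optimizer2, step_size=2, gamma=0.5)

loaded = torch.load(buffer)
optimizer2.load_state_dict(loaded["optimizer"])
scheduler2.load_state_dict(loaded["scheduler"])

# The restored schedule continues where it left off instead of resetting:
# last_epoch is back to 4 and the learning rate has been halved twice.
print(scheduler2.last_epoch, optimizer2.param_groups[0]["lr"])
```

If the restored learning rate matches the pre-save one but the loss still jumps, the regression is elsewhere (e.g. data ordering or model weights), not in the scheduler state.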

Hi, I’m facing a similar issue. How did you solve it?

It appeared to be a bug in my own code. I did manage to fix it, but I no longer remember how.

On Thu, Sep 17, 2020 at 09:23, Alex via Hugging Face Forums <hellohellohello@discoursemail.com> wrote:

I have this problem too. I am running finetune_t5.sh (or the corresponding BART script) as provided in transformers/examples/seq2seq. The warning appears even when running the example script unmodified:

From transformers/examples/seq2seq, run ./finetune_bart_tiny.sh and observe the following output, with the warning about the scheduler:

cnn_tiny.tgz 100%[================================================>] 22.59K --.-KB/s in 0.08s

2020-10-11 22:53:22 (299 KB/s) - ‘cnn_tiny.tgz’ saved [23131/23131]

x cnn_tiny/
x cnn_tiny/train.target
x cnn_tiny/train.source
x cnn_tiny/val.source
x cnn_tiny/val.target
x cnn_tiny/test.source
x cnn_tiny/test.target

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
Epoch 1: 100%|█████████████████████████████████████████████████████| 4/4 [00:48<00:00, 12.22s/it, loss=10.838, v_num=6]/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:200: UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler.
warnings.warn(SAVE_STATE_WARNING, UserWarning)
Epoch 1: 100%|█████████████████████████████████████████████████████| 4/4 [00:59<00:00, 14.81s/it, loss=10.838, v_num=6]
(cryptic)


Bug filed: https://github.com/huggingface/transformers/issues/7765