Loading model from checkpoint after error in training

Let’s say I am fine-tuning a model and an error is encountered during training, so training stops. Let’s also say that, using Trainer, I have it configured to save checkpoints along the way. How would I go about loading the model from the last checkpoint saved before the error?

For reference, here is the configuration of my Trainer object:

TRAINER ARGS
args: TrainingArguments(
output_dir='models/textgen/out', 
overwrite_output_dir=False, 
do_train='True', 
do_eval=False, 
do_predict=False, 
evaluate_during_training=False, 
per_device_train_batch_size=8, 
per_device_eval_batch_size=8, 
per_gpu_train_batch_size=None, 
per_gpu_eval_batch_size=None, 
gradient_accumulation_steps=1, 
learning_rate=5e-05, 
weight_decay=0.0, 
adam_epsilon=1e-08, 
max_grad_norm=1.0, 
num_train_epochs=3.0, 
max_steps=-1, 
warmup_steps=0, 
logging_dir='models/textgen/logs', 
logging_first_step=False, 
logging_steps=500, 
save_steps=500, 
save_total_limit=None, 
no_cuda=False, 
seed=42, 
fp16=False, 
fp16_opt_level='O1', 
local_rank=-1, 
tpu_num_cores=None, 
tpu_metrics_debug=False, 
debug=False, 
dataloader_drop_last=False, 
eval_steps=1000, 
past_index=-1)

data_collator: <function sd_data_collator at 0x7ffaba8f8e18>
train_dataset: <custom_dataset.SDAbstractsDataset object at 0x7ffa18c8c400>
eval_dataset: None
compute_metrics: None
prediction_loss_only: False
optimizers: None
tb_writer: <torch.utils.tensorboard.writer.SummaryWriter object at 0x7ff9f79e45c0>

The checkpoint should be saved in a directory that will allow you to go model = XXXModel.from_pretrained(that_directory).
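For example, with the arguments above (output_dir='models/textgen/out', save_steps=500), a minimal sketch would look like the following; the checkpoint number is hypothetical, so use the highest-numbered checkpoint-* folder that actually exists in your output directory:

from transformers import AutoModel  # or the specific class you fine-tuned, e.g. AutoModelForCausalLM

# hypothetical: the last checkpoint folder written before the crash
last_checkpoint = "models/textgen/out/checkpoint-5500"
model = AutoModel.from_pretrained(last_checkpoint)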


Hi, I have a question.
I tried to load weights from a checkpoint like below.

config = AutoConfig.from_pretrained("./saved/checkpoint-480000")
model = RobertaForMaskedLM(config=config)

Is this the right way?
It seems training speed is slower than before, and the training process crashed after some steps…

anaconda3/envs/pytorch/lib/python3.7/site-packages/transformers/trainer.py:263: FutureWarning: Passing `prediction_loss_only` as a keyword argument is deprecated and won't be possible in a future version. Use `args.prediction_loss_only` instead. Setting `args.prediction_loss_only=True
  FutureWarning,
  0%|          | 0/2755530 [00:00<?, ?it/s] anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
  0%|          | 10000/2755530 [10:53:37<2855:04:31,  3.74s/it] anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
  1%|          | 20000/2755530 [21:44:42<2934:49:34,  3.86s/it] anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
  1%|          | 30000/2755530 [32:35:52<2922:14:07,  3.86s/it] anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
  1%|          | 32292/2755530 [35:05:09<3263:20:29,  4.31s/it]

I could not find what went wrong, but the process was gone…

BTW, I started training with transformers version 3.1.0, then stopped it.
I upgraded transformers to 3.4.0 and restarted training, because with the older version I could not even start training from the checkpoint.

Could you give me hints for debugging?

Thanks in advance.

No, this will create a model with the same configuration as the one you saved, but with freshly initialized weights rather than the trained ones. You should use

model = RobertaForMaskedLM.from_pretrained("./saved/checkpoint-480000")
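
If you also want to resume the optimizer, scheduler, and step counter rather than just the model weights, you can point Trainer.train at the checkpoint as well. A minimal sketch, assuming training_args, data_collator, and dataset are the objects you already built; the keyword argument depends on your transformers version:

from transformers import RobertaForMaskedLM, Trainer

model = RobertaForMaskedLM.from_pretrained("./saved/checkpoint-480000")
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# transformers 4.x and later
trainer.train(resume_from_checkpoint="./saved/checkpoint-480000")
# 3.x releases used a different keyword for the same thing:
# trainer.train(model_path="./saved/checkpoint-480000")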

If we use just the directory as it was saved without specifying which checkpoint:

model = RobertaForMaskedLM.from_pretrained("./saved/")

what is the model that is used when calling the model() function?

In my case, I have the arguments:

training_args = TrainingArguments(
    output_dir='./saved',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    logging_steps=3000,
    save_steps=3000,
    save_total_limit=2,
    seed=1,
    fp16=True
)

The trainer setting:

trainer = Trainer(
    model=some_roberta_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)

And running:

trainer.train()

trainer.save_model('./saved')

After this, the ./saved folder contains config.json, training_args.bin, and pytorch_model.bin files, plus two checkpoint subfolders. But each of these checkpoint folders also contains a config.json, training_args.bin, and pytorch_model.bin.

When I load the folder:

new_roberta = AutoModel.from_pretrained('./saved')

Which one is the model that is used in:

new_roberta(**token_output)

Are the config.json, training_args.bin, and pytorch_model.bin in the main folder the same as the corresponding ones in any of the checkpoint subfolders?

Thanks!


I’m not sure I completely understand your question, but AutoModel.from_pretrained does not look in subfolders, so if you pass it "./saved/" it will look for the model and tokenizer files directly in that folder.
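
If you do want the weights from one of the checkpoint subfolders rather than the files at the top level, you can point from_pretrained at that subfolder directly (the step number below is just an illustration):

new_roberta = AutoModel.from_pretrained('./saved/checkpoint-6000')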

Thanks a lot for your reply.

According to my training args in my previous message (save_total_limit=2, load_best_model_at_end=False), after running trainer.save_model('./saved'), is the model saved directly into the ./saved folder (not in the checkpoint subfolders) the model obtained at the end of the training process (i.e., the model from the last update step)?

This is the last model, yes.
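
If you want to check that yourself, one quick, unofficial sanity check is to compare the state dict saved at the top level with the one in a checkpoint subfolder (the checkpoint name below is hypothetical):

import torch

final = torch.load('./saved/pytorch_model.bin', map_location='cpu')
ckpt = torch.load('./saved/checkpoint-6000/pytorch_model.bin', map_location='cpu')
# True only if the final save and this checkpoint contain identical weights
print(final.keys() == ckpt.keys() and all(torch.equal(final[k], ckpt[k]) for k in final))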


I have a similar query. I was fine-tuning a model and it looked like it would finish in some 10-12 hours, but after roughly 8 hours the training/fine-tuning stopped due to a network issue (or some other issue), having created 3 model checkpoints by that point. If I restart the training/fine-tuning, can I use the already created/saved model checkpoints and continue from there, completing the fine-tuning in the remaining 2-3 hours, or do I need to restart the whole process from 0?
How do I use model checkpoints that were saved intermediately while fine-tuning?
@sgugger @aclifton314