Loading model from checkpoint after error in training

Let’s say I am fine-tuning a model and an error is encountered during training, so training stops. Let’s also say that, using Trainer, I have it configured to save checkpoints along the way. How would I go about loading the model from the last checkpoint saved before the error?

For reference, here is the configuration of my Trainer object:

TRAINER ARGS
args: TrainingArguments(
output_dir='models/textgen/out', 
overwrite_output_dir=False, 
do_train='True', 
do_eval=False, 
do_predict=False, 
evaluate_during_training=False, 
per_device_train_batch_size=8, 
per_device_eval_batch_size=8, 
per_gpu_train_batch_size=None, 
per_gpu_eval_batch_size=None, 
gradient_accumulation_steps=1, 
learning_rate=5e-05, 
weight_decay=0.0, 
adam_epsilon=1e-08, 
max_grad_norm=1.0, 
num_train_epochs=3.0, 
max_steps=-1, 
warmup_steps=0, 
logging_dir='models/textgen/logs', 
logging_first_step=False, 
logging_steps=500, 
save_steps=500, 
save_total_limit=None, 
no_cuda=False, 
seed=42, 
fp16=False, 
fp16_opt_level='O1', 
local_rank=-1, 
tpu_num_cores=None, 
tpu_metrics_debug=False, 
debug=False, 
dataloader_drop_last=False, 
eval_steps=1000, 
past_index=-1)

data_collator: <function sd_data_collator at 0x7ffaba8f8e18>
train_dataset: <custom_dataset.SDAbstractsDataset object at 0x7ffa18c8c400>
eval_dataset: None
compute_metrics: None
prediction_loss_only: False
optimizers: None
tb_writer: <torch.utils.tensorboard.writer.SummaryWriter object at 0x7ff9f79e45c0>

The checkpoint should be saved in a directory that will allow you to go model = XXXModel.from_pretrained(that_directory).
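For example, with the arguments above (output_dir='models/textgen/out', save_steps=500), a minimal sketch would look like the following; the checkpoint number is hypothetical, so use the highest-numbered checkpoint-* folder that actually exists in your output directory:

from transformers import AutoModel  # or the specific class you fine-tuned, e.g. AutoModelForCausalLM

# hypothetical: the last checkpoint folder written before the crash
last_checkpoint = "models/textgen/out/checkpoint-5500"
model = AutoModel.from_pretrained(last_checkpoint)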


Hi, I have a question.
I tried to load weights from a checkpoint like below.

config = AutoConfig.from_pretrained("./saved/checkpoint-480000")
model = RobertaForMaskedLM(config=config)

Is this the right way?
It seems training speed is slower than before, and the training process crashed after some steps…

anaconda3/envs/pytorch/lib/python3.7/site-packages/transformers/trainer.py:263: FutureWarning: Passing `prediction_loss_only` as a keyword argument is deprecated and won't be possible in a future version. Use `args.prediction_loss_only` instead. Setting `args.prediction_loss_only=True
  FutureWarning,
  0%|          | 0/2755530 [00:00<?, ?it/s] anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
  0%|          | 10000/2755530 [10:53:37<2855:04:31,  3.74s/it] anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
  1%|          | 20000/2755530 [21:44:42<2934:49:34,  3.86s/it] anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
  1%|          | 30000/2755530 [32:35:52<2922:14:07,  3.86s/it] anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
  1%|          | 32292/2755530 [35:05:09<3263:20:29,  4.31s/it]

I could not find what went wrong, but the process was gone…

BTW, I started training with transformers version 3.1.0, then stopped it.
I upgraded transformers to 3.4.0 and restarted training, because with the older version I could not even start training from the checkpoint.

Could you give me hints for debugging?

Thanks in advance.

No, this will create a model with the same configuration as the one you saved, but with freshly initialized weights rather than the trained ones. You should use

model = RobertaForMaskedLM.from_pretrained("./saved/checkpoint-480000")
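
If you also want to resume the optimizer, scheduler, and step counter rather than just the model weights, you can point Trainer.train at the checkpoint as well. A minimal sketch, assuming training_args, data_collator, and dataset are the objects you already built; the keyword argument depends on your transformers version:

from transformers import RobertaForMaskedLM, Trainer

model = RobertaForMaskedLM.from_pretrained("./saved/checkpoint-480000")
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# transformers 4.x and later
trainer.train(resume_from_checkpoint="./saved/checkpoint-480000")
# 3.x releases used a different keyword for the same thing:
# trainer.train(model_path="./saved/checkpoint-480000")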

If we use just the directory as it was saved without specifying which checkpoint:

model = RobertaForMaskedLM.from_pretrained("./saved/")

what is the model that is used when calling the model() function?

In my case, I have the arguments:

training_args = TrainingArguments(
    output_dir='./saved',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    logging_steps=3000,
    save_steps=3000,
    save_total_limit=2,
    seed=1,
    fp16=True
)

The trainer setting:

trainer = Trainer(
    model=some_roberta_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)

And running:

trainer.train()

trainer.save_model('./saved')

After this, the ./saved folder contains config.json, training_args.bin, and pytorch_model.bin files, plus two checkpoint subfolders. But each of these checkpoint folders also contains a config.json, training_args.bin, and pytorch_model.bin.

When I load the folder:

new_roberta = AutoModel.from_pretrained('./saved')

Which one is the model that is used in:

new_roberta(**token_output)

Are the config.json, training_args.bin, and pytorch_model.bin in the main folder the same as the corresponding ones in any of the checkpoint subfolders?

Thanks!


I’m not sure I completely understand your question, but AutoModel.from_pretrained does not look in subfolders, so if you pass it "./saved/" it will look for the model and tokenizer files directly in that folder.
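
If you do want the weights from one of the checkpoint subfolders rather than the files at the top level, you can point from_pretrained at that subfolder directly (the step number below is just an illustration):

new_roberta = AutoModel.from_pretrained('./saved/checkpoint-6000')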

Thanks a lot for your reply.

According to my training args in my previous message (save_total_limit=2, load_best_model_at_end=False), after running trainer.save_model('./saved'), is the model saved directly into the ./saved folder (not in the checkpoint subfolders) the model obtained at the end of the training process (i.e., the model from the last update step)?

This is the last model, yes.
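
If you want to check that yourself, one quick, unofficial sanity check is to compare the state dict saved at the top level with the one in a checkpoint subfolder (the checkpoint name below is hypothetical):

import torch

final = torch.load('./saved/pytorch_model.bin', map_location='cpu')
ckpt = torch.load('./saved/checkpoint-6000/pytorch_model.bin', map_location='cpu')
# True only if the final save and this checkpoint contain identical weights
print(final.keys() == ckpt.keys() and all(torch.equal(final[k], ckpt[k]) for k in final))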


I have a similar query. I was fine-tuning a model and it looked like it would finish in some 10-12 hours, but after roughly 8 hours the training/fine-tuning stopped due to a network issue (or some other issue), having created 3 model checkpoints by that point. If I restart the training/fine-tuning, can I use the already created/saved model checkpoints and continue from there, completing the fine-tuning in the remaining 2-3 hours, or do I need to restart the whole process from 0?
How do I use model checkpoints that were saved intermediately while fine-tuning?
@sgugger @aclifton314