Loading model from checkpoint after error in training

Let’s say I am fine-tuning a model and an error is encountered during training, stopping the run. Let’s also say that, using Trainer, I have configured it to save checkpoints along the way. How would I go about loading the model from the last checkpoint saved before the error?

For reference, here is the configuration of my Trainer object:

TRAINER ARGS
args: TrainingArguments(
output_dir='models/textgen/out', 
overwrite_output_dir=False, 
do_train=True, 
do_eval=False, 
do_predict=False, 
evaluate_during_training=False, 
per_device_train_batch_size=8, 
per_device_eval_batch_size=8, 
per_gpu_train_batch_size=None, 
per_gpu_eval_batch_size=None, 
gradient_accumulation_steps=1, 
learning_rate=5e-05, 
weight_decay=0.0, 
adam_epsilon=1e-08, 
max_grad_norm=1.0, 
num_train_epochs=3.0, 
max_steps=-1, 
warmup_steps=0, 
logging_dir='models/textgen/logs', 
logging_first_step=False, 
logging_steps=500, 
save_steps=500, 
save_total_limit=None, 
no_cuda=False, 
seed=42, 
fp16=False, 
fp16_opt_level='O1', 
local_rank=-1, 
tpu_num_cores=None, 
tpu_metrics_debug=False, 
debug=False, 
dataloader_drop_last=False, 
eval_steps=1000, 
past_index=-1)

data_collator: <function sd_data_collator at 0x7ffaba8f8e18>
train_dataset: <custom_dataset.SDAbstractsDataset object at 0x7ffa18c8c400>
eval_dataset: None
compute_metrics: None
prediction_loss_only: False
optimizers: None
tb_writer: <torch.utils.tensorboard.writer.SummaryWriter object at 0x7ff9f79e45c0>

The checkpoint should be saved in a directory that will allow you to go model = XXXModel.from_pretrained(that_directory).
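
With the arguments above (output_dir='models/textgen/out', save_steps=500), checkpoints land in subdirectories named checkpoint-500, checkpoint-1000, and so on. A minimal sketch for reloading the most recent one (the model class GPT2LMHeadModel here is an assumption; substitute whichever class you fine-tuned):

import os
from transformers import GPT2LMHeadModel  # assumption: replace with your actual model class

output_dir = 'models/textgen/out'
# Trainer names checkpoint folders 'checkpoint-<global_step>'
checkpoints = [d for d in os.listdir(output_dir) if d.startswith('checkpoint-')]
latest = max(checkpoints, key=lambda d: int(d.split('-')[-1]))
model = GPT2LMHeadModel.from_pretrained(os.path.join(output_dir, latest))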


Hi, I have a question.
I tried to load weights from a checkpoint like below.

from transformers import AutoConfig, RobertaForMaskedLM

config = AutoConfig.from_pretrained("./saved/checkpoint-480000")
model = RobertaForMaskedLM(config=config)

Is this the right way?
Training seems slower than before, and the training process crashed after some steps…

anaconda3/envs/pytorch/lib/python3.7/site-packages/transformers/trainer.py:263: FutureWarning: Passing `prediction_loss_only` as a keyword argument is deprecated and won't be possible in a future version. Use `args.prediction_loss_only` instead. Setting `args.prediction_loss_only=True`
  FutureWarning,
  0%|          | 0/2755530 [00:00<?, ?it/s] anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
  [the same UserWarning repeats at steps 10000, 20000, and 30000]
  1%|          | 32292/2755530 [35:05:09<3263:20:29,  4.31s/it]

I could not find what went wrong, but the process was gone…

BTW, I started training with transformers version 3.1.0, then stopped it. I upgraded transformers to 3.4.0 and restarted training, because on 3.1.0 I could not even start training from the checkpoint.

Could you give me hints for debugging?

Thanks in advance.

No, this only builds a model with the same configuration as the one you saved; its weights are randomly initialized, not loaded from the checkpoint. You should use

model = RobertaForMaskedLM.from_pretrained("./saved/checkpoint-480000")
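
Note that from_pretrained only restores the model weights. The checkpoint directory also contains optimizer.pt and scheduler.pt, and in the transformers 3.x releases you could ask Trainer to restore those too (and skip the steps already completed) by passing the checkpoint directory to train. A minimal sketch, assuming you rebuild the Trainer with the same args, collator, and dataset as before:

from transformers import RobertaForMaskedLM, Trainer

model = RobertaForMaskedLM.from_pretrained("./saved/checkpoint-480000")
# training_args, data_collator, train_dataset are placeholders for your own objects
trainer = Trainer(model=model, args=training_args, data_collator=data_collator, train_dataset=train_dataset)
trainer.train(model_path="./saved/checkpoint-480000")  # also restores optimizer/scheduler state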