Hi, I have a question.
I tried to load weights from a checkpoint as follows:
from transformers import AutoConfig, RobertaForMaskedLM

config = AutoConfig.from_pretrained("./saved/checkpoint-480000")
model = RobertaForMaskedLM(config=config)
Is this the right way?
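For reference, here is a minimal sketch of the difference between a config-only init and `from_pretrained`, using a hypothetical tiny config (not my real model) so it runs self-contained:

```python
import tempfile

import torch
from transformers import RobertaConfig, RobertaForMaskedLM

# Hypothetical tiny config, just for illustration (not my real model).
config = RobertaConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                       num_attention_heads=2, intermediate_size=64)
model = RobertaForMaskedLM(config)

with tempfile.TemporaryDirectory() as ckpt:
    model.save_pretrained(ckpt)  # writes config.json plus the weight file

    # Config-only init: same architecture, but freshly initialized weights.
    random_model = RobertaForMaskedLM(RobertaConfig.from_pretrained(ckpt))

    # from_pretrained: same architecture AND the saved weights.
    loaded_model = RobertaForMaskedLM.from_pretrained(ckpt)

    saved = model.roberta.embeddings.word_embeddings.weight
    print(torch.equal(loaded_model.roberta.embeddings.word_embeddings.weight, saved))  # True
    print(torch.equal(random_model.roberta.embeddings.word_embeddings.weight, saved))  # fresh random init
```

If my understanding is right, the config-only constructor only recreates the architecture with random weights, so `from_pretrained` would be needed to actually continue from checkpoint-480000.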
It seems training speed is slower than before, and the training process crashed after some steps…
anaconda3/envs/pytorch/lib/python3.7/site-packages/transformers/trainer.py:263: FutureWarning: Passing `prediction_loss_only` as a keyword argument is deprecated and won't be possible in a future version. Use `args.prediction_loss_only` instead. Setting `args.prediction_loss_only=True
FutureWarning,
0%| | 0/2755530 [00:00<?, ?it/s] anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
0%| | 10000/2755530 [10:53:37<2855:04:31, 3.74s/it] anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
1%| | 20000/2755530 [21:44:42<2934:49:34, 3.86s/it] anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
1%| | 30000/2755530 [32:35:52<2922:14:07, 3.86s/it] anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
1%| | 32292/2755530 [35:05:09<3263:20:29, 4.31s/it]
I could not find what went wrong, but the process was gone…
BTW, I started training with transformers version 3.1.0 and then stopped it.
I upgraded transformers to 3.4.0 and restarted training, because with 3.1.0 I could not even resume training from the checkpoint.
Could you give me hints for debugging?
Thanks in advance.