Continuing model training takes seconds in next round

amisoccassiopeia · May 5, 2022, 9:38am

Hi there,

I’m currently working with the huggingface framework to train a binary classifier. I saved the newly trained model and wanted to use the checkpoint for incremental learning (what I basically want to do is to retrain my model with new data). I use

model = AutoModelForSequenceClassification.from_pretrained('path_to_model/checkpoint-500', num_labels=2)

to load the model and

trainer.train(resume_from_checkpoint=True)

to train it. But regardless of the size of the new labeled data, training finishes extremely fast (I use a normal CPU, not a GPU, so hardware cannot explain the speed). It should take a few minutes (I have a small dataset for the trial run), but it seems to finish within seconds. I read through many GitHub issues, but people there report bad results or slow performance after reloading their models, not an increase in speed like this (Saving and reloading DistilBertForTokenClassification fine-tuned model · Issue #8272 · huggingface/transformers · GitHub).
(I tried it both in a Jupyter notebook and a Python script but keep observing the same issue)

Here’s the output:

Loading model from *model_path*.
The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: __index_level_0__, window_text, document_id. If __index_level_0__, window_text, document_id are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
*path*: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 167
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 22
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 45
  Continuing training from global step 500

  0%|          | 0/22 [00:00<?, ?it/s]

Training completed. Do not forget to share your model on huggingface.co/models =)

I am grateful for any ideas or recommendations. Thank you!

BramVanroy · May 5, 2022, 2:52pm

It seems to me that you set the trainer to run for X steps/epochs (Num Epochs = 2), but that the checkpoint contains a state much beyond that (epoch 45/global step 500). So the script does not need to train because it thinks it has already trained 2 epochs.
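
The numbers in the log can be checked by hand. The following is a simplified sketch of the step arithmetic, not the Trainer's actual code; the variable names are illustrative, and the values are taken from the output above.

```python
import math

# Values copied from the training log above.
num_examples = 167
batch_size = 16
grad_accum_steps = 1
num_epochs = 2
checkpoint_global_step = 500

# Optimizer steps per epoch and for the whole planned run.
steps_per_epoch = math.ceil(num_examples / batch_size) // grad_accum_steps
total_steps = steps_per_epoch * num_epochs

# Epoch counter implied by the global step stored in the checkpoint.
epochs_already_done = checkpoint_global_step // steps_per_epoch

# Epochs that would still have to run in the resumed job.
remaining_epochs = list(range(epochs_already_done, num_epochs))

print(steps_per_epoch)      # 11
print(total_steps)          # 22
print(epochs_already_done)  # 45
print(remaining_epochs)     # []
```

The results match "Total optimization steps = 22" and "Continuing training from epoch 45" in the log: the restored counters are already past the end of the new 2-epoch run, so no epoch is left to execute.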

I think the problem is that you don’t need this line:

trainer.train(resume_from_checkpoint=True)

As per the documentation, what this does is load the last checkpoint from the output_dir and also continue with the same states, including the progress in terms of steps/epochs.

If a bool and equals True, load the last checkpoint in args.output_dir as saved by a previous instance of Trainer. If present, training will resume from the model/optimizer/scheduler states loaded here.

However, in your case you want to start again completely, but from the weights of 'path_to_model/checkpoint-500'. So you can just drop the resume_from_checkpoint argument.
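
A minimal sketch of that approach follows. It assumes train_dataset is an already tokenized dataset containing only the new data; the function name continue_from_weights and the output_dir value are hypothetical, and the epoch and batch-size settings simply mirror the log above.

```python
def continue_from_weights(checkpoint_dir, train_dataset, output_dir):
    """Start a fresh training run that reuses only the saved model weights."""
    # Imported inside the function so the sketch can be loaded
    # even where transformers is not installed.
    from transformers import (
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    # from_pretrained restores the weights only; the optimizer, scheduler,
    # and step/epoch counters stored in the checkpoint are not loaded.
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint_dir, num_labels=2
    )

    args = TrainingArguments(
        output_dir=output_dir,  # hypothetical: a new directory for this round
        num_train_epochs=2,
        per_device_train_batch_size=16,
    )

    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    # No resume_from_checkpoint here, so the step counter starts at 0.
    trainer.train()
    return trainer
```

It would be called as continue_from_weights('path_to_model/checkpoint-500', train_dataset, 'new_output_dir'). Writing to a separate output directory keeps the checkpoints of the new round apart from those of the first run.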

amisoccassiopeia · May 6, 2022, 3:12am

Thanks so much! That makes absolute sense and is very helpful. Just a few follow-up questions:

If I drop resume_from_checkpoint=True, wouldn’t that go against my intended retraining pipeline? My understanding now is as follows: max_steps is 753 (as saved in trainer_state.json). Because I only saved and loaded checkpoint 500, the model does not perform the expected training when loading the checkpoint, since it has already been trained for more steps. So I would theoretically need to save the very last step to be able to continue with the training (I checked the documentation but couldn’t find this option as a specific argument). Also, if I enable load_best_model_at_end, this would not work in my suggested pipeline, as it would not necessarily load the last model but the best one. (It doesn’t feel like the intended workflow, but would it be an option in my case to reset/remove trainer_state.json before continuing with the training?)
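
For reference, the counters mentioned above can be read directly from the trainer_state.json file inside a checkpoint directory. This is a rough sketch that assumes the file contains global_step and max_steps keys; the helper names are hypothetical, and steps_left is only an estimate of what a resumed run would still do.

```python
import json
from pathlib import Path


def read_trainer_state(checkpoint_dir):
    """Return (global_step, max_steps) stored in a checkpoint's trainer_state.json."""
    state_file = Path(checkpoint_dir) / "trainer_state.json"
    state = json.loads(state_file.read_text())
    return state["global_step"], state["max_steps"]


def steps_left(checkpoint_dir, planned_total_steps):
    """Estimate how many steps a resumed run would still have to perform."""
    global_step, _ = read_trainer_state(checkpoint_dir)
    # If the checkpoint is already past the planned total, nothing remains.
    return max(planned_total_steps - global_step, 0)
```

With the values from this thread (global step 500 in the checkpoint, 22 planned steps in the new run), steps_left returns 0, which is consistent with training ending immediately.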

My feeling is that my expected workflow doesn’t match the current workflow that huggingface suggests/expects and I wonder if you have an idea if there is a better way to do the retraining/incremental learning?

Thanks so much!

Rammohan12 · June 1, 2023, 6:20am

Hi @amisoccassiopeia, I was wondering whether you were able to find a solution to this problem. I am facing a similar issue, so please share if you made any progress on it.
