When training a model with something like:
from transformers import EncoderDecoderModel, Seq2SeqTrainer, Seq2SeqTrainingArguments

# (tokenizer, batch_size, train_data and val_data are defined earlier)
multibert = EncoderDecoderModel.from_pretrained("super-seq2seq-model")
# set training arguments - these params are not really tuned, feel free to change
training_args = Seq2SeqTrainingArguments(
    output_dir="path/to/mymodel/",
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    logging_steps=500,  # set to 1000 for full training
    save_steps=500,
    eval_steps=500,  # set to 8000 for full training
    warmup_steps=2000,
    max_steps=16,  # delete for full training
    overwrite_output_dir=True,
    save_total_limit=3,
    fp16=True,
)
# instantiate trainer
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)
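# start training
trainer.train()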
Training then creates model checkpoints in the output directory, e.g.
$ ls path/to/mymodel/
checkpoint-4500 checkpoint-5000 checkpoint-5500
And when that training run ends, I reload the latest checkpoint and try to continue training:
multibert = EncoderDecoderModel.from_pretrained("path/to/mymodel/checkpoint-5500")
training_args = Seq2SeqTrainingArguments(
    output_dir="path/to/mymodel/",
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    logging_steps=500,  # set to 1000 for full training
    save_steps=500,
    eval_steps=500,  # set to 8000 for full training
    warmup_steps=2000,
    max_steps=16,  # delete for full training
    overwrite_output_dir=True,
    save_total_limit=5,
    fp16=True,
)
# instantiate trainer
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)
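# continue training from the reloaded checkpoint weights
trainer.train()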
When checkpoints are saved in this second (continued) training round, the step counter resets and starts from 500 again, i.e.
$ ls path/to/mymodel/
checkpoint-500 checkpoint-5000 checkpoint-5500
Is that the expected checkpoint-saving behavior?
I could save the second run to a different output directory, but is there some way / argument in Seq2SeqTrainingArguments
to tell the trainer to continue the checkpoint counter from where the first run stopped, i.e. from 5500 + 500?
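That is, after the second round I would want the output directory to look something like this (hypothetical desired listing, continuing from 5500):
$ ls path/to/mymodel/
checkpoint-5000 checkpoint-5500 checkpoint-6000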