Trainer: Save Checkpoint After Each Epoch

agemagician · October 20, 2020, 5:12pm

I am trying to fine-tune a model using the PyTorch Trainer; however, I couldn’t find an option to save a checkpoint after each epoch's validation.

I could only find “save_steps”, which saves a checkpoint every fixed number of steps, but I validate the model at the end of each epoch, and I want to store the checkpoint at that point.

Any ideas?

vblagoje · October 20, 2020, 8:43pm

Perhaps you could use the Trainer callback mechanism and register a handler for on_epoch_end.
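
A minimal sketch of what such a handler might look like, assuming a transformers release that exposes TrainerCallback and TrainerControl. The class name SaveEveryEpochCallback is made up for illustration, and the sketch relies on the Trainer checking control.should_save right after the epoch-end callbacks run, which may differ between versions:

```python
from transformers import TrainerCallback, TrainerControl, TrainerState


class SaveEveryEpochCallback(TrainerCallback):
    """Hypothetical callback: ask the Trainer to save at the end of every epoch."""

    def on_epoch_end(self, args, state, control, **kwargs):
        # Setting this flag requests a checkpoint; the Trainer itself performs
        # the save (assumption: it reads the flag right after this event).
        control.should_save = True
        return control
```

It would be registered with something like Trainer(..., callbacks=[SaveEveryEpochCallback()]), where the remaining Trainer arguments are whatever your setup already uses.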

sgugger · October 20, 2020, 9:19pm

If you set the option load_best_model_at_end to True, the saves will be done at each evaluation (and the Trainer will reload the best model found during the fine-tuning).
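
As a rough sketch of that configuration: the settings below assume a more recent transformers release that has a save_strategy option, and the output directory and metric are placeholders, not values from this thread. The evaluation strategy argument has been named evaluation_strategy in some releases and eval_strategy in others, so the sketch checks which one exists instead of assuming either:

```python
import dataclasses

from transformers import TrainingArguments

# Pick whichever name this installed version uses for the evaluation strategy.
field_names = {f.name for f in dataclasses.fields(TrainingArguments)}
eval_key = "eval_strategy" if "eval_strategy" in field_names else "evaluation_strategy"

training_kwargs = {
    "output_dir": "checkpoints",          # placeholder path
    eval_key: "epoch",                    # evaluate at the end of each epoch
    "save_strategy": "epoch",             # save on the same schedule as evaluation
    "load_best_model_at_end": True,       # reload the best checkpoint when training ends
    "metric_for_best_model": "eval_loss", # placeholder metric
    "greater_is_better": False,           # lower loss is better
}


def make_training_args():
    """Build the arguments object (needs a working PyTorch install)."""
    return TrainingArguments(**training_kwargs)
```

The result of make_training_args() would then be passed to the Trainer as args.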

agemagician · October 21, 2020, 10:03am

Thanks for the tip.

agemagician · October 21, 2020, 10:04am

Thanks a lot @sgugger.
This is exactly what I am looking for.

nnhwin · November 24, 2023, 7:25am

@sgugger Hello, I would like to train on data that arrives batch by batch from Kafka and save a checkpoint after each batch. When the next batch arrives from Kafka, I want to reload the previous checkpoint into the model and then train again on the newly arrived data. I do not want a separate, independent checkpoint for each batch; each round should build on the previous checkpoint and continue training with the new data. Only after all the data has been received would I like to pick the best model (from the latest checkpoint) and save it as an ONNX file. Is it possible to do it that way? Thank you in advance; I look forward to your advice.
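
One possible starting point for this kind of workflow, not confirmed anywhere in this thread, is to locate the most recent checkpoint folder before each new round of training. The sketch below assumes checkpoints are written as checkpoint-<step> subfolders of the output directory and uses get_last_checkpoint from transformers.trainer_utils; the helper name latest_checkpoint and the temporary folders are only for illustration:

```python
import os
import tempfile

from transformers.trainer_utils import get_last_checkpoint


def latest_checkpoint(output_dir):
    """Return the newest checkpoint-<step> folder, or None if there is none."""
    if not os.path.isdir(output_dir):
        return None
    return get_last_checkpoint(output_dir)


# Demo with empty stand-in folders instead of real checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    for step in (10, 200, 30):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    found_name = os.path.basename(latest_checkpoint(tmp))

print(found_name)
```

The returned path could then be passed to the model's from_pretrained to continue from the saved weights on the new batch, or to trainer.train(resume_from_checkpoint=...) if the optimizer and scheduler state should be restored as well. Which of the two suits streaming data better is a judgment call.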
