Trainer: Save Checkpoint After Each Epoch

I am trying to fine-tune a model using the PyTorch Trainer, but I couldn't find an option to save a checkpoint after the validation run at the end of each epoch.

I could only find "save_steps", which saves a checkpoint after a fixed number of steps. However, I validate the model at the end of each epoch, and I want to store the checkpoint at that point.

Any ideas?

Perhaps you could use the Trainer callback mechanism and register a handler for on_epoch_end.
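A minimal sketch of that callback idea, assuming the `transformers` library is installed. `SaveEveryEpochCallback` is a hypothetical name chosen for this example; `TrainerCallback`, `TrainerState`, `TrainerControl` and the `should_save` flag come from the library's callback API. The handler simply asks the Trainer to write a checkpoint whenever an epoch ends.

```python
from transformers import TrainerCallback, TrainerControl, TrainerState


class SaveEveryEpochCallback(TrainerCallback):
    """Hypothetical callback: request a checkpoint at the end of every epoch."""

    def on_epoch_end(self, args, state, control, **kwargs):
        # The Trainer reads this flag after the epoch-end event fires.
        control.should_save = True
        return control


# Standalone check of the handler, without running a real training loop.
control = SaveEveryEpochCallback().on_epoch_end(
    args=None, state=TrainerState(), control=TrainerControl()
)
print(control.should_save)
```

To use it, pass `callbacks=[SaveEveryEpochCallback()]` when constructing the `Trainer`.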

If you set the option load_best_model_at_end to True, the saves will be done at each evaluation (and the Trainer will reload the best model found during the fine-tuning).
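A rough sketch of what that configuration could look like, assuming a `transformers` release that has the `save_strategy` argument. Recent releases, as far as I know, expect the save and evaluation strategies to match when `load_best_model_at_end` is set, so both are set to "epoch" here. The helper names and the `checkpoints` output directory are hypothetical, and the name of the evaluation option differs between releases, so the code looks it up rather than hard-coding it.

```python
import dataclasses

from transformers import TrainingArguments


def epoch_checkpoint_kwargs():
    """Keyword arguments that evaluate and save once per epoch."""
    # Assumption: the evaluation option is called `eval_strategy` in newer
    # releases and `evaluation_strategy` in older ones, so look it up.
    names = {f.name for f in dataclasses.fields(TrainingArguments)}
    eval_key = "eval_strategy" if "eval_strategy" in names else "evaluation_strategy"
    return {
        eval_key: "epoch",
        "save_strategy": "epoch",
        "load_best_model_at_end": True,
    }


def make_training_args(output_dir="checkpoints"):
    # `output_dir` is a placeholder path chosen for this example.
    return TrainingArguments(output_dir=output_dir, **epoch_checkpoint_kwargs())


print(epoch_checkpoint_kwargs())
```

The object returned by `make_training_args()` would then be passed to the `Trainer` as `args=`.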

Thanks for the tip.

Thanks a lot @sgugger.
This is exactly what I am looking for.

@sgugger Hello, I would like to train on data that arrives batch by batch from Kafka and save a checkpoint after each batch. When the next batch arrives from Kafka, I want to reload the previous checkpoint into the model and continue training on the newly arrived data. I do not want a separate, independent checkpoint for each batch; each round should build on the previous checkpoint and train again with the new data. Only after all the data has been received would I like to pick the best model (from the latest checkpoint) and save it as an ONNX file. Is it possible to do it that way? Thank you in advance; I look forward to your advice.