Training models in smaller chunks and then continuing training

I have limited compute hours: I can only train for 3 hours at a time and then have to restart from the point where the run stopped. Could you tell me how I can split training for max_steps=X into smaller chunks, for instance max_steps=X/1000, while still getting the same results as one uninterrupted run?
I am using evaluation_strategy = steps.

  1. How can I save the current model in addition to the best model at each saving step?
  2. When restarting, how can I skip the steps that are already done?

@sgugger thanks

Hello @julia, welcome to the forum! I think you have created two topics for the same purpose, I will answer here.

If I understand correctly, you are trying to save a checkpoint every time you run an evaluation. This can be done with the script by setting the save_steps parameter to the same value as eval_steps.

For example, if you want to evaluate and save a checkpoint every 1k steps, you call:

python --evaluation_strategy steps --eval_steps 1000 --save_steps 1000

Hope this helps :slight_smile:

Thanks, but this was not my question. My question is: once I have created the checkpoints, how can I continue training from them so that the model reaches the same performance as continuous training?

I see. In that case you just need to set the --model_name_or_path parameter to the folder of your last checkpoint. For example, if you trained for 3k steps, the folder checkpoint-3000 contains the model at step 3k, saved locally. Also keep in mind how many checkpoints you want to keep locally, which you can limit with --save_total_limit.
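To pick the last checkpoint folder automatically between restarts, something like the following might help. It is only a sketch using the standard library, assuming checkpoints are saved as checkpoint-<step> folders directly inside the output directory (as in the checkpoint-3000 example); the function name find_last_checkpoint is a hypothetical helper, not part of any library.

```python
import re
from pathlib import Path
from typing import Optional

# Matches folder names such as "checkpoint-3000" and captures the step number.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir: str) -> Optional[str]:
    """Return the path of the highest-numbered checkpoint-<step> folder, or None.

    Hypothetical helper: assumes checkpoints sit directly inside output_dir.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None

    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        # Skip plain files and unrelated folders.
        if match is None or not child.is_dir():
            continue
        # Compare step numbers as integers, not strings:
        # as strings, "checkpoint-500" would sort after "checkpoint-3000".
        step = int(match.group(1))
        if step > best_step:
            best_step, best_path = step, child

    return str(best_path) if best_path is not None else None
```

The returned path can then be passed as --model_name_or_path on the next run; when no checkpoint exists yet, the function returns None and you would start from the original model instead.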

Thanks, I did this, but I am still not getting the same results as full training. Is there anything I am missing? Thanks

The model is loaded at the beginning, and the optimizers as well, but it still does not work.
I am also using eval_strategy=steps.
Could anyone help and confirm how to make training resume correctly from past checkpoints? Thanks
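One possible cause worth checking: a resumed run only matches an uninterrupted one if everything that affects the updates is restored, not just the weights. That includes the optimizer state and, in a real run, the learning-rate schedule position and the data order. The toy below is not the Trainer; it is a plain-Python sketch with a made-up objective f(w) = w**2 and a hypothetical sgd_momentum function, showing that restoring the momentum buffer reproduces the continuous run exactly, while restarting from the weights alone does not.

```python
def sgd_momentum(w, velocity, steps, lr=0.1, beta=0.9):
    """Minimise the toy objective f(w) = w**2 with SGD plus momentum.

    Returns the final weight and the final momentum buffer, i.e. the
    "model" and the "optimizer state" of this toy setup.
    """
    for _ in range(steps):
        grad = 2.0 * w                       # derivative of w**2
        velocity = beta * velocity + grad    # update the momentum buffer
        w = w - lr * velocity                # apply the parameter update
    return w, velocity


# One uninterrupted run of 20 steps.
w_full, _ = sgd_momentum(5.0, 0.0, steps=20)

# The same 20 steps split into two chunks of 10.
w_mid, v_mid = sgd_momentum(5.0, 0.0, steps=10)

# Resume with weights AND optimizer state restored.
w_resumed, _ = sgd_momentum(w_mid, v_mid, steps=10)

# Resume with weights only; the momentum buffer is reset to zero.
w_weights_only, _ = sgd_momentum(w_mid, 0.0, steps=10)

print(w_full == w_resumed)        # True
print(w_full == w_weights_only)   # False
```

In the Trainer itself, the checkpoint-<step> folders normally store optimizer and scheduler state next to the model, and as far as I know trainer.train(resume_from_checkpoint=...) is the argument meant to restore them and skip the steps already done. Please check this against the documentation for your installed version.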