Training models in smaller chunks and then continuing training

Hi,
I am working under a limited compute-hour budget: I can only train for 3 hours at a time and then have to restart from where the run stopped. I am using finetune_trainer.py. Could you tell me how I can split a run of max_steps X into smaller chunks (e.g. max_steps = X/1000) while still getting the same results as a single continuous run?
I am using evaluation_strategy = steps

  1. How can I save the current model, in addition to the best model, at each saving step?
  2. When restarting, how can I skip the steps that are already done?

@sgugger thanks

Hello @julia, welcome to the forum! I think you have created two topics for the same question, so I will answer here.

If I understand correctly, you want to save a checkpoint every time you run an evaluation. With the finetune_trainer.py script, you can do this by setting save_steps to the same value as eval_steps.

For example, if you want to evaluate and save a checkpoint every 1k steps, you would call

python finetune_trainer.py --evaluation_strategy steps --eval_steps 1000 --save_steps 1000

Hope this helps :slight_smile:
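
For reference, here is a minimal sketch of the same settings expressed with the generic TrainingArguments API (finetune_trainer.py builds its arguments from the command line, so this is only an illustration; the output directory name and max_steps value are placeholders):

from transformers import TrainingArguments

# Sketch only: save a checkpoint at the same interval as evaluation.
args = TrainingArguments(
    output_dir="output",          # checkpoint-1000, checkpoint-2000, ... are written here
    evaluation_strategy="steps",  # evaluate on a step schedule rather than per epoch
    eval_steps=1000,              # run evaluation every 1k steps
    save_steps=1000,              # save a checkpoint at the same 1k-step interval
    save_total_limit=3,           # keep only the 3 most recent checkpoints on disk
    max_steps=6000,               # total optimizer steps for this run (placeholder)
)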

Hi,
thanks, but this was not my question. My question is: once I have the checkpoints, how can I continue training from them so that the model reaches the same performance as continuous training?

I see. In that case you just need to point the --model_name_or_path parameter to the folder of your last checkpoint: e.g. if you trained for 3k steps, the locally saved folder checkpoint-3000 contains the model at step 3k. Also keep in mind how many checkpoints you want to keep locally, which is controlled by --save_total_limit.
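
For example, a restart could look roughly like this (a sketch only: "output" is a placeholder for the --output_dir you used in the first run, and the data/task arguments that finetune_trainer.py also requires are omitted):

python finetune_trainer.py --model_name_or_path output/checkpoint-3000 --output_dir output --evaluation_strategy steps --eval_steps 1000 --save_steps 1000 --max_steps 6000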

Hi,
thanks, I did this but I am still not getting the same results as full training with finetune_trainer.py. Is there anything I am missing? Thanks

I am using finetune_trainer.py.
The model is loaded at the beginning, and the optimizers as well, but it still does not work.
I am also using eval_strategy=steps.
Could anyone help and confirm how to make finetune_trainer.py resume training from past checkpoints? Thanks