Hi
I have limited compute hours: I can only train for about 3 hours at a time and then need to restart from where the run stopped. I am using finetune_trainer.py. Could you tell me how I can split a run of max_steps X into smaller chunks, e.g. max_steps=X/1000, while still getting the same results as a single continuous run?
I am using evaluation_strategy = steps
Also, how can I save the current model, in addition to the best model, at each saving step?
Hello @julia, welcome to the forum! I think you have created two topics for the same purpose, so I will answer here.
If I understand correctly, you are trying to save a checkpoint every time you do an evaluation. This can be done with the finetune_trainer.py script by setting the save_steps parameter to the same value as eval_steps.
For example, if you want to evaluate and save a checkpoint every 1k steps, you set both eval_steps and save_steps to 1000 and call the script.
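Something along these lines should work; this is only a sketch, where t5-small, path/to/data and output are placeholders for your own model and paths, and the exact argument set depends on your transformers version:

    # evaluate and save a checkpoint every 1000 steps (placeholder model and paths)
    python finetune_trainer.py \
      --model_name_or_path t5-small \
      --data_dir path/to/data \
      --output_dir output \
      --do_train --do_eval \
      --evaluation_strategy steps \
      --eval_steps 1000 \
      --save_steps 1000 \
      --max_steps 100000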
Hi,
Thanks, but this was not my question. My question is: once I have created the checkpoints, how can I continue training from them so that the model reaches the same performance as continuous training?
I see. In that case you just need to change the --model_name_or_path parameter to the folder of your last checkpoint: e.g. if you trained for 3k steps, the locally saved folder checkpoint-3000 contains the model at step 3k. Also keep in mind how many checkpoints you want to keep locally, which you control with --save_total_limit.
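For illustration, a resumed run could look roughly like this; the paths are again placeholders, and the remaining arguments should match your original run so that the chunks add up to the same schedule:

    # continue training from the weights saved at step 3000 (placeholder paths)
    python finetune_trainer.py \
      --model_name_or_path output/checkpoint-3000 \
      --data_dir path/to/data \
      --output_dir output_continued \
      --do_train --do_eval \
      --evaluation_strategy steps \
      --eval_steps 1000 \
      --save_steps 1000 \
      --max_steps 100000 \
      --save_total_limit 5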
I am using finetune_trainer.py. The model is loaded at the beginning, and the optimizers as well, but it still does not work: training does not continue as if it had never been interrupted. I am also using evaluation_strategy=steps.
Could anyone help and confirm how to make finetune_trainer.py resume training from past checkpoints? Thanks.