Hi @sgugger , How do I get the last iteration step number in order to save the
trainer.save_model()
with the corresponding filename. Is there a way to get the total number of steps done during training from Trainer class ?
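The step counter the question asks about is kept on the trainer's state as `trainer.state.global_step`. A minimal sketch of the naming pattern, with `Trainer`/`TrainerState` mocked by plain dataclasses so it runs without an actual training job (the value 1500 is illustrative):

```python
from dataclasses import dataclass, field

# Hedged sketch: after trainer.train() finishes, the step counter lives in
# trainer.state.global_step. Trainer and TrainerState are mocked here so the
# pattern is visible without running a real training job.
@dataclass
class TrainerState:
    global_step: int = 0

@dataclass
class FakeTrainer:
    state: TrainerState = field(default_factory=TrainerState)

trainer = FakeTrainer(TrainerState(global_step=1500))
save_dir = f"my_model_step_{trainer.state.global_step}"
# With a real Trainer you would then call: trainer.save_model(save_dir)
```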
Hey cramraj8, I think that if you use the following in the training config
save_total_limit=2
save_strategy="no"
then the best and the latest models will be saved. You can compare the checkpoint numbers of these two models; the larger number is the latest iteration.
Alternatively, if you also use load_best_model_at_end=True
in the config, then after training trainer.state.best_model_checkpoint
gives you the best checkpoint path, and from that you can infer that the other checkpoint directory contains the latest model.
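The "compare the checkpoint numbers" step can be sketched as a small helper. This is a hypothetical function, not part of the Trainer API; checkpoint directories are named `checkpoint-<step>`, and the listing is passed in as a list so the logic is easy to test (in practice you would pass `os.listdir(output_dir)`):

```python
import re

# Hypothetical sketch: pick the latest "checkpoint-<step>" directory
# by extracting the step number from each name and taking the maximum.
def latest_checkpoint(names):
    steps = [
        int(m.group(1))
        for n in names
        if (m := re.fullmatch(r"checkpoint-(\d+)", n))
    ]
    return f"checkpoint-{max(steps)}" if steps else None
```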
This is not exact, but if you use save_strategy=steps and save_steps=NUMBER, the step number in the name of the last saved checkpoint (checkpoint-NUMBER) approximates the total number of optimization steps done during training, up to the granularity of save_steps.
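For reference, the exact step count can be derived from the dataset size rather than inferred from save_steps. A minimal sketch with illustrative numbers (single device, no distributed training assumed):

```python
import math

# Hedged sketch: how the total number of optimization steps relates to the
# training arguments. All values below are illustrative.
num_examples = 10_000
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_train_epochs = 3

# One optimizer step consumes batch_size * accumulation_steps examples.
steps_per_epoch = math.ceil(
    num_examples / (per_device_train_batch_size * gradient_accumulation_steps)
)
total_steps = steps_per_epoch * num_train_epochs
```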
I was running into the same issue. However, according to the current documentation (Trainer), with those parameter settings only the final model will be used rather than the best one:
save_total_limit (`int`, optional): If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir. When load_best_model_at_end
is enabled, the "best" checkpoint according to metric_for_best_model
will always be retained in addition to the most recent ones. For example, for save_total_limit=5
and load_best_model_at_end
, the four last checkpoints will always be retained alongside the best model. When save_total_limit=1
and load_best_model_at_end
, it is possible that two checkpoints are saved: the last one and the best one (if they are different).
For load_best_model_at_end
, the documentation furthermore includes:
When set to True
, the parameter save_strategy
needs to be the same as evaluation_strategy
, and in the case it is "steps", save_steps
must be a round multiple of eval_steps
.
So from this I gather that the following combination should be used to save both the best and the most recent checkpoint, and in the end use the best performing one:
save_total_limit = 2
load_best_model_at_end = True
save_strategy = "epoch" or "steps"
evaluation_strategy = "epoch" or "steps"
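That combination could be sketched as keyword arguments for TrainingArguments. The concrete values (output_dir, save_steps=500, eval_steps=500) are illustrative, and the constructor call is commented out so the snippet stands alone:

```python
# Hedged sketch: the argument combination described above, as kwargs for
# transformers.TrainingArguments. Values are illustrative assumptions.
training_kwargs = dict(
    output_dir="out",
    save_total_limit=2,           # keep the most recent checkpoints plus the best
    load_best_model_at_end=True,  # reload the best checkpoint after training
    save_strategy="steps",        # must match evaluation_strategy
    evaluation_strategy="steps",
    save_steps=500,               # must be a round multiple of eval_steps
    eval_steps=500,
)
# args = TrainingArguments(**training_kwargs)
```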
See my training arguments:
CUDA_VISIBLE_DEVICES=3 python run_translation.py --model_name_or_path castorini/afriteva_v2_large --do_train --do_eval --source_lang unyo --target_lang dcyo --source_prefix "<unyo2dcyo>: " --train_file /mnt/disk/makindele/data_prep_eng/data_prep_eng/output_data/menyo_train.json --validation_file /mnt/disk/makindele/data_prep_eng/data_prep_eng/output_data/dev.json --test_file /mnt/disk/makindele/data_prep_eng/data_prep_eng/output_data/test.json --output_dir afriteva_v2_large_unyo_dcyo --max_source_length 512 --max_target_length 512 --per_device_train_batch_size=2 --per_device_eval_batch_size=2 --gradient_accumulation_steps=12 --num_train_epochs 3 --overwrite_output_dir --predict_with_generate --save_steps 10000 --num_beams 10 --do_predict
What should I do to get it? I need the output file.
@DeleMike There is nothing wrong, but you should definitely save your model to a subdirectory to avoid mixing up files.
A model is made up of the config.json
file, which describes the architecture of the model, and the model.safetensors
file, which contains the weights. See also here (the name of the second file there is different).
Depending on how you saved your model, you get more files containing further information about the training, the tokenizer etc., which might or not be useful depending on your intended successive tasks. You can then load the model by calling, for instance
your_model = AutoModel.from_pretrained(path_to_dir)
Aside: what I do not know (and I would like it if someone could comment on this) is what difference it makes to save the model with the following three methods, and which of them is recommended (assuming you are doing fine-tuning with a trainer
and you have chosen load_best_model_at_end = True
in the training arguments):
trainer.model.save_pretrained(path_to_dir)
trainer.save_model(path_to_dir)
trainer.state.best_model_checkpoint
Depending on the method you choose you get more or less files saved to path_to_dir
or in the trainer.state.best_model_checkpoint
folder.
@nnml Thank you for the response!
I was looking for the file, and I saw from this issue that there is no longer a pytorch_model.bin
file; it has been replaced by model.safetensors
. The logs after training were misleading.
I understand that if I set save_total_limit=2, it will save the best and the last models. But I saw that it didn't save the best model. For example, I have the following results from 3 epochs:
Best checkpoint according to the Trainer API:
{'eval_loss': 0.4162479341030121, 'eval_accuracy': 0.81, 'eval_precision': 0.803921568627451, 'eval_recall': 0.82, 'eval_f1': 0.80998099809981, 'eval_runtime': 4.1178, 'eval_samples_per_second': 48.57, 'eval_steps_per_second': 6.071, 'epoch': 3.0}
But this checkpoint showed the best result on the eval dataset:
{'eval_loss': 0.46666741371154785, 'eval_accuracy': 0.84, 'eval_precision': 0.84, 'eval_recall': 0.84, 'eval_f1': 0.8399999999999999, 'eval_runtime': 4.0709, 'eval_samples_per_second': 49.129, 'eval_steps_per_second': 6.141, 'epoch': 6.0}
And the last epoch's result on the evaluation dataset:
{'eval_loss': 0.8557769656181335, 'eval_accuracy': 0.82, 'eval_precision': 0.7962962962962963, 'eval_recall': 0.86, 'eval_f1': 0.8197115384615384, 'eval_runtime': 4.0955, 'eval_samples_per_second': 48.834, 'eval_steps_per_second': 6.104, 'epoch': 14.93}
I didn't understand why it saves the 1st one instead of the 2nd one. @sgugger
"Is this loss?"
- Unknown Wise Man
Your trainer is selecting the best model by the lowest evaluation loss.
Consider setting metric_for_best_model=metric_name
in your TrainingArguments
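A minimal sketch of that suggestion. The metric name "f1" is an assumption: it must match a key returned by your compute_metrics function, and the constructor call is commented out so the snippet stands alone:

```python
# Hedged sketch: select the best checkpoint by a metric instead of eval loss.
# "f1" is an assumption; it must match a key produced by compute_metrics.
best_model_kwargs = dict(
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,   # higher is better for F1; eval loss would use False
)
# args = TrainingArguments(output_dir="out", **best_model_kwargs)
```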
This is helpful, thanks
Wait, won't this just not save the best model?