Save only best model in Trainer

Hi @sgugger, how do I get the last iteration's step number, so that I can call

trainer.save_model()

with a corresponding filename? Is there a way to get the total number of steps done during training from the Trainer class?

Hey @cramraj8, I think that if you use the following in the training config

save_total_limit=2
save_strategy="no"

then the best and the latest models will be saved. You can compare the checkpoint numbers of these two models; the largest number is essentially the latest iteration.

Alternatively, if you also set load_best_model_at_end=True in the config and then read trainer.state.best_model_checkpoint after training, you get the best checkpoint number, and from that you can infer that the other output directory contains the latest model.
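To make the "compare the checkpoint numbers" idea concrete, here is a minimal sketch: the helper name `latest_checkpoint` is my own, and the directory names follow the Trainer's `checkpoint-<step>` convention; the best checkpoint itself would come from `trainer.state.best_model_checkpoint`.

```python
import re

def latest_checkpoint(dirs):
    """Return the checkpoint directory with the highest step number."""
    ckpts = [d for d in dirs if re.fullmatch(r"checkpoint-\d+", d)]
    return max(ckpts, key=lambda d: int(d.split("-")[1]))

# Example: the two checkpoints kept when save_total_limit=2
print(latest_checkpoint(["checkpoint-500", "checkpoint-1000"]))  # checkpoint-1000
```

In practice you would pass `os.listdir(output_dir)` to this helper, then compare the result against `trainer.state.best_model_checkpoint` to tell the latest apart from the best.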

This is not exact, but if you use save_strategy="steps" and save_steps=NUMBER, it seems that the total number of steps done during training is approximately the number defined in save_steps multiplied by the batch size defined in per_device_train_batch_size.

I was running into the same issue. However, according to the current documentation (Trainer), with those parameter settings only the final model will be used rather than the best one:

save_total_limit (`int`, optional) — If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir. When load_best_model_at_end is enabled, the "best" checkpoint according to metric_for_best_model will always be retained in addition to the most recent ones. For example, for save_total_limit=5 and load_best_model_at_end, the four last checkpoints will always be retained alongside the best model. When save_total_limit=1 and load_best_model_at_end, it is possible that two checkpoints are saved: the last one and the best one (if they are different).

For load_best_model_at_end, the documentation furthermore includes:

When set to True, the parameters save_strategy needs to be the same as evaluation_strategy, and in the case it is "steps", save_steps must be a round multiple of eval_steps.

So from this I gather that the following combination should be used to save both the best and the most recent checkpoint, and in the end use the best performing one:

save_total_limit = 2
load_best_model_at_end = True
save_strategy = "epoch" or "steps"
evaluation_strategy = "epoch" or "steps"
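Putting that combination into an actual TrainingArguments object might look like the sketch below. The parameter names are from the transformers TrainingArguments API; "output" is a placeholder directory, and older/newer library versions may spell the evaluation argument `evaluation_strategy` or `eval_strategy`.

```python
from transformers import TrainingArguments

# Sketch: keep the most recent checkpoint(s) plus the best one,
# and reload the best checkpoint at the end of training.
args = TrainingArguments(
    output_dir="output",              # placeholder
    evaluation_strategy="epoch",      # evaluate once per epoch
    save_strategy="epoch",            # must match the evaluation strategy
    save_total_limit=2,               # limit how many checkpoints are kept
    load_best_model_at_end=True,      # best checkpoint is always retained
)
```

With this, after `trainer.train()` the model held by the trainer is the best one, and its path is recorded in `trainer.state.best_model_checkpoint`.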


Please, all: I cannot find the final model output. What flag do I have to pass to ensure I have the output in this directory?

See my training arguments:

CUDA_VISIBLE_DEVICES=3 python run_translation.py \
  --model_name_or_path castorini/afriteva_v2_large \
  --do_train --do_eval --do_predict \
  --source_lang unyo --target_lang dcyo \
  --source_prefix "<unyo2dcyo>: " \
  --train_file /mnt/disk/makindele/data_prep_eng/data_prep_eng/output_data/menyo_train.json \
  --validation_file /mnt/disk/makindele/data_prep_eng/data_prep_eng/output_data/dev.json \
  --test_file /mnt/disk/makindele/data_prep_eng/data_prep_eng/output_data/test.json \
  --output_dir afriteva_v2_large_unyo_dcyo \
  --max_source_length 512 --max_target_length 512 \
  --per_device_train_batch_size=2 --per_device_eval_batch_size=2 \
  --gradient_accumulation_steps=12 \
  --num_train_epochs 3 \
  --overwrite_output_dir \
  --predict_with_generate \
  --save_steps 10000 \
  --num_beams 10

What should I do to get it? I need the output file.

@DeleMike There is nothing wrong, but you should definitely save your model to a subdirectory to avoid mixing up files.

A model is made up of the config.json file, which describes the architecture of the model, and the model.safetensors file, which contains the weights. See also here (the name of the second file is different).

Depending on how you saved your model, you get more files containing further information about the training, the tokenizer, etc., which may or may not be useful depending on your subsequent tasks. You can then load the model by calling, for instance,

your_model = AutoModel.from_pretrained(path_to_dir)

Aside: what I do not know (and I would appreciate a comment on this) is what difference it makes to save the model with the following three methods, and which of them is recommended (assuming you are fine-tuning with a Trainer and have set load_best_model_at_end = True in the training arguments):

trainer.model.save_pretrained(path_to_dir)
trainer.save_model(path_to_dir)
trainer.state.best_model_checkpoint

Depending on the method you choose, you get more or fewer files saved to path_to_dir or to the trainer.state.best_model_checkpoint folder.

@nnml Thank you for the response!

I was looking for the file and I saw from this issue that there is no longer a pytorch_model.bin file; it has been replaced by model.safetensors. The logs after training were misleading.

I understand that if I set save_total_limit=2, it will save the best and the last models. But I saw that it didn't save the best model. For example, I have the following results from three evaluations.
The best checkpoint according to the Trainer API:

{'eval_loss': 0.4162479341030121, 'eval_accuracy': 0.81, 'eval_precision': 0.803921568627451, 'eval_recall': 0.82, 'eval_f1': 0.80998099809981, 'eval_runtime': 4.1178, 'eval_samples_per_second': 48.57, 'eval_steps_per_second': 6.071, 'epoch': 3.0}  

But this checkpoint showed the best result on the eval dataset:

{'eval_loss': 0.46666741371154785, 'eval_accuracy': 0.84, 'eval_precision': 0.84, 'eval_recall': 0.84, 'eval_f1': 0.8399999999999999, 'eval_runtime': 4.0709, 'eval_samples_per_second': 49.129, 'eval_steps_per_second': 6.141, 'epoch': 6.0} 

The last epoch's result on the evaluation dataset:

{'eval_loss': 0.8557769656181335, 'eval_accuracy': 0.82, 'eval_precision': 0.7962962962962963, 'eval_recall': 0.86, 'eval_f1': 0.8197115384615384, 'eval_runtime': 4.0955, 'eval_samples_per_second': 48.834, 'eval_steps_per_second': 6.104, 'epoch': 14.93}   

I don't understand why it saved the 1st one instead of the 2nd one. @sgugger

“Is this loss?”
-Unknown Wise Man

Your Trainer is using the lowest evaluation loss to select the best model.
Consider setting metric_for_best_model=metric_name in your TrainingArguments.
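As a sketch of that suggestion (parameter names from the transformers TrainingArguments API; "output" is a placeholder, and "eval_f1" assumes your compute_metrics function returns an "f1" key):

```python
from transformers import TrainingArguments

# Sketch: select the best checkpoint by F1 instead of eval loss.
args = TrainingArguments(
    output_dir="output",              # placeholder
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_f1",  # assumes compute_metrics returns "f1"
    greater_is_better=True,           # higher F1 is better (for loss: False)
)
```

With the default (no metric_for_best_model), "best" means lowest eval_loss, which explains why the epoch-3.0 checkpoint above was chosen over the one with higher accuracy/F1.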


@nielsr is there a way to save checkpoint for every 5th epoch?

Yes, see Trainer
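One way to do it (a sketch, not the only option): the save_strategy settings alone don't express "every 5th epoch", but a custom TrainerCallback can toggle saving at epoch boundaries. The class name SaveEveryFifthEpoch and the helper is_fifth_epoch are my own; the callback hook on_epoch_end and the control.should_save flag are from the transformers callback API. The import is wrapped so the sketch is readable even without transformers installed.

```python
try:
    from transformers import TrainerCallback
except ImportError:  # allow reading/running the sketch without transformers
    class TrainerCallback:  # minimal stand-in for the real base class
        pass

def is_fifth_epoch(epoch: float) -> bool:
    """True when the (1-based) epoch counter is a multiple of 5."""
    return int(round(epoch)) % 5 == 0

class SaveEveryFifthEpoch(TrainerCallback):
    def on_epoch_end(self, args, state, control, **kwargs):
        # state.epoch is a float; request a save only on multiples of 5
        control.should_save = is_fifth_epoch(state.epoch)
        return control
```

You would then pass `callbacks=[SaveEveryFifthEpoch()]` when constructing the Trainer, with save_strategy="epoch" so the control flag is consulted at each epoch boundary.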

This is helpful, thanks