Questions about default checkpointing behavior (train v. val)

Hello,

I had a few questions about how Hugging Face's checkpointing behavior changes depending on the arguments passed to the Trainer.

In the documentation, I noticed that by default:

  1. Because evaluation_strategy defaults to "no", evaluation is never run during training. Reference: Trainer — transformers 4.7.0 documentation.
  2. Because metric_for_best_model defaults to None, the metric falls back to "loss", which is the same as "eval_loss". Same reference as above.

My questions were:

  1. If I run a model with default arguments, does the model checkpointing automatically save and load the best checkpoint based on validation loss/the eval_dataloader?

  2. Do the losses displayed in trainer_state.json correspond to val or train loss?

  3. Is there an easy way to plot the train and val losses that doesn't involve overriding the default model behavior or going through an external visualization library like Comet?

Additionally, I tried manually overwriting the metric_for_best_model as follows:
training_args.metric_for_best_model = "eval_loss"

I do this before the training arguments are passed to Trainer initialization.
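
Concretely, my setup looks roughly like this (a minimal sketch; model, train_dataset and eval_dataset stand in for my actual objects):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir="./results")

# Override the metric before the arguments are passed to the Trainer
training_args.metric_for_best_model = "eval_loss"

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)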

If I do this, 1) is this a correct way to enforce validation-based checkpointing? and 2) in this situation, what are the losses displayed in trainer_state.json?

Thank you for your help! I appreciate it.

Hi there, here are the answers:

  1. No. By default, checkpointing only saves the model so you can resume training later if something goes wrong; there is no best-model loading logic unless you use load_best_model_at_end. You will then need to set an eval_strategy and a save_strategy that match (either epoch or steps); see the sketch at the end of this reply.

  2. It's the accumulated training loss since the beginning of training.

  3. No, there is not inside Trainer. We integrate with most reporting tools (TensorBoard, Weights & Biases, Comet ML, etc.) for this reason.

For your last questions: setting metric_for_best_model is not enough, you also need to set load_best_model_at_end to True. The losses displayed in trainer_state.json will still be the training losses.
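
As a minimal sketch of that setup (the output directory and the model/dataset names are placeholders, and depending on your transformers version the argument may be spelled eval_strategy instead of evaluation_strategy):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",             # placeholder path
    evaluation_strategy="epoch",        # run evaluation every epoch...
    save_strategy="epoch",              # ...and checkpoint on the same schedule
    load_best_model_at_end=True,        # reload the best checkpoint at the end of training
    metric_for_best_model="eval_loss",  # "best" = lowest validation loss
    greater_is_better=False,            # lower eval_loss is better
)

trainer = Trainer(
    model=model,                  # placeholders for your model and datasets
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()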


Thank you for your help!

If I have load_best_model_at_end activated when I train my model, then whenever I use BertForMaskedLM.from_pretrained(), am I guaranteed to automatically recover my best checkpoint?

If you use the from_pretrained method, you will get the model associated with the folder/model identifier you pass. This class has no knowledge of the best checkpoint.
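
In other words, if you want from_pretrained to give you the best checkpoint, you have to save that model yourself after training and point from_pretrained at that folder. A minimal sketch, assuming the trainer above was created with load_best_model_at_end=True (the path is a placeholder):

from transformers import BertForMaskedLM

# After trainer.train(), load_best_model_at_end=True means the best weights
# are loaded back into trainer.model, so saving now preserves the best checkpoint.
trainer.save_model("./best_model")  # placeholder path

# Later, pass that same folder to from_pretrained to recover the best model.
model = BertForMaskedLM.from_pretrained("./best_model")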

Hi, is it possible to save the model based on the best training loss without needing to pass eval_strategy?