Questions about default checkpointing behavior (train v. val)


I had a few questions about how Huggingface checkpoint behavior changes depending on the arguments to the Trainer.

In the documentation, I noticed that by default:

  1. because evaluation_strategy is ‘no’, evaluation is never run during training. Reference was here: Trainer — transformers 4.7.0 documentation
  2. because metric_for_best_model is by default ‘None’, the metric defaults to “loss”, which is the same as “eval_loss”. The same reference as above.

My questions were:

  1. If I run a model with default arguments, does the model checkpointing automatically save and load the best checkpoint based on validation loss/the eval_dataloader?

  2. Do the losses displayed in trainer_state.json correspond to val or train loss?

  3. Is there an easy way to plot the train and val losses that doesn’t involve overriding the default model behavior or going through an external visualization library like comet?

Additionally, I tried manually overwriting the metric_for_best_model as follows:
training_args.metric_for_best_model = "eval_loss"

I do this before the training arguments are passed to Trainer initialization.

If I do this, 1) is this a correct way to enforce validation-based checkpointing? and 2) in this situation, what are the losses displayed in trainer_state.json?

Thank you for your help! I appreciate it.

Hi there, here are the answers:

  1. No. By default, checkpointing only saves the model so you can resume training later if something goes wrong; there is no best-model loading logic unless you set load_best_model_at_end=True. You will also need to set an eval_strategy and a save_strategy that match (either "epoch" or "steps").

  2. It’s the training loss, accumulated since the beginning of training.

  3. No, there is not inside Trainer. We integrate with most reporting tools (TensorBoard, Weights & Biases, CometML, etc.) for exactly this reason.

For your last questions, setting metric_for_best_model is not enough; you also need to set load_best_model_at_end to True. The losses displayed in trainer_state.json will still be the training losses.
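Putting the answer above together, a minimal setup might look like the sketch below. The output directory, model, and datasets are placeholders you would supply yourself, and the argument name is `evaluation_strategy` in the transformers 4.x versions referenced here (later releases renamed it to `eval_strategy`):

```python
from transformers import Trainer, TrainingArguments

# Placeholder output directory; evaluation and save strategies must match
# for load_best_model_at_end to work.
training_args = TrainingArguments(
    output_dir="my-model",
    evaluation_strategy="steps",     # evaluate every eval_steps
    save_strategy="steps",           # must match evaluation_strategy
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,     # reload the best checkpoint when training ends
    metric_for_best_model="eval_loss",
    greater_is_better=False,         # lower eval_loss is better
)

trainer = Trainer(
    model=model,                     # assumed defined elsewhere
    args=training_args,
    train_dataset=train_dataset,     # assumed defined elsewhere
    eval_dataset=eval_dataset,       # assumed defined elsewhere
)
```

With this configuration, Trainer tracks eval_loss at each evaluation and, at the end of training, loads the checkpoint with the lowest value back into trainer.model.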

Thank you for your help!

If I have load_best_model_at_end activated when I train my model, whenever I use the from_pretrained method afterward, am I guaranteed to automatically recover my best checkpoint?

If you use the from_pretrained method you will get the model associated with the folder/model identifier you pass. This class has no knowledge of the best checkpoint.
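To illustrate the point above: with load_best_model_at_end=True, the best weights are already in the trainer's model when training ends, so you have to save them yourself before from_pretrained can find them. The paths below are placeholders:

```python
from transformers import AutoModelForSequenceClassification

# After trainer.train() finishes with load_best_model_at_end=True,
# trainer.model holds the best checkpoint's weights. Save them to a
# folder of your choice ("my-model/best" is a placeholder):
# trainer.save_model("my-model/best")

# from_pretrained simply loads whatever folder or model identifier you
# pass; it has no notion of which checkpoint was "best".
model = AutoModelForSequenceClassification.from_pretrained("my-model/best")
```

In other words, from_pretrained is a plain loading mechanism; the "best checkpoint" bookkeeping lives entirely inside Trainer.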