Questions about default checkpointing behavior (train v. val)


I had a few questions about how Huggingface checkpoint behavior changes depending on the arguments to the Trainer.

In the documentation, I noticed that by default:

  1. because evaluation_strategy is ‘no’, evaluation is never run during training. Reference was here: Trainer — transformers 4.7.0 documentation
  2. because metric_for_best_model is by default ‘None’, the metric defaults to “loss”, which is the same as “eval_loss”. The same reference as above.

My questions were:

  1. If I run a model with default arguments, does the model checkpointing automatically save and load the best checkpoint based on validation loss/the eval_dataloader?

  2. Do the losses displayed in trainer_state.json correspond to val or train loss?

  3. Is there an easy way to plot the train and val losses that doesn’t involve overriding the default model behavior or going through an external visualization library like comet?

Additionally, I tried manually overwriting the metric_for_best_model as follows:
training_args.metric_for_best_model = "eval_loss"

I do this before the training arguments are passed to Trainer initialization.

If I do this, 1) is this a correct way to enforce validation-based checkpointing? and 2) in this situation, what are the losses displayed in trainer_state.json?

Thank you for your help! I appreciate it.

Hi there, here are the answers:

  1. No. By default, checkpointing only saves the model so you can resume training later if something goes wrong; there is no best-model loading logic unless you set load_best_model_at_end=True. You will also need to set an eval_strategy and a save_strategy that match (either "epoch" or "steps").

  2. It’s the training loss, accumulated since the beginning of training.

  3. No, there is not inside Trainer. We integrate with most reporting tools (TensorBoard, Weights & Biases, CometML, etc.) for exactly this reason.

For your last questions, setting metric_for_best_model is not enough; you also need to set load_best_model_at_end to True. The losses displayed in trainer_state.json will still be the training losses.
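Putting the answer above together, a minimal setup might look like the sketch below. The output directory, model, and datasets are placeholders you would supply yourself, and the argument name is `evaluation_strategy` in the transformers 4.x versions referenced here (later releases renamed it to `eval_strategy`):

```python
from transformers import Trainer, TrainingArguments

# Placeholder output directory; evaluation and save strategies must match
# for load_best_model_at_end to work.
training_args = TrainingArguments(
    output_dir="my-model",
    evaluation_strategy="steps",     # evaluate every eval_steps
    save_strategy="steps",           # must match evaluation_strategy
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,     # reload the best checkpoint when training ends
    metric_for_best_model="eval_loss",
    greater_is_better=False,         # lower eval_loss is better
)

trainer = Trainer(
    model=model,                     # assumed defined elsewhere
    args=training_args,
    train_dataset=train_dataset,     # assumed defined elsewhere
    eval_dataset=eval_dataset,       # assumed defined elsewhere
)
```

With this configuration, Trainer tracks eval_loss at each evaluation and, at the end of training, loads the checkpoint with the lowest value back into trainer.model.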

Thank you for your help!

If I have load_best_model_at_end activated when I train my model, whenever I use the from_pretrained method afterward, am I guaranteed to automatically recover my best checkpoint?

If you use the from_pretrained method you will get the model associated with the folder/model identifier you pass. This class has no knowledge of the best checkpoint.
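To illustrate the point above: with load_best_model_at_end=True, the best weights are already in the trainer's model when training ends, so you have to save them yourself before from_pretrained can find them. The paths below are placeholders:

```python
from transformers import AutoModelForSequenceClassification

# After trainer.train() finishes with load_best_model_at_end=True,
# trainer.model holds the best checkpoint's weights. Save them to a
# folder of your choice ("my-model/best" is a placeholder):
# trainer.save_model("my-model/best")

# from_pretrained simply loads whatever folder or model identifier you
# pass; it has no notion of which checkpoint was "best".
model = AutoModelForSequenceClassification.from_pretrained("my-model/best")
```

In other words, from_pretrained is a plain loading mechanism; the "best checkpoint" bookkeeping lives entirely inside Trainer.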