I had a few questions about how the Hugging Face Trainer's checkpointing behavior changes depending on the arguments passed to it.
In the documentation, I noticed that by default:
- because evaluation_strategy defaults to "no", evaluation is never run during training. Reference: Trainer — transformers 4.7.0 documentation
- because metric_for_best_model defaults to None, the metric used falls back to "loss", which is the same as "eval_loss". Same reference as above.
My questions were:
If I train a model with the default arguments, does checkpointing automatically save and load the best checkpoint based on validation loss (i.e., the eval_dataloader)?
Do the losses displayed in trainer_state.json correspond to the validation loss or the training loss?
Is there an easy way to plot the train and validation losses that doesn’t involve overriding default Trainer behavior or going through an external visualization library like Comet?
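To clarify what I mean by doing it by hand: the sketch below is roughly what I've been resorting to. The checkpoint path is a placeholder, and I'm assuming (from inspecting my own trainer_state.json) that train entries in log_history carry a "loss" key while eval entries carry "eval_loss" — please correct me if that assumption is wrong.

```python
import json

def split_losses(log_history):
    """Separate train and eval loss curves from a Trainer log_history list.

    Assumes train log entries carry a "loss" key and eval entries carry
    "eval_loss". Returns two lists of (step, loss) pairs.
    """
    train = [(e["step"], e["loss"]) for e in log_history if "loss" in e]
    evals = [(e["step"], e["eval_loss"]) for e in log_history if "eval_loss" in e]
    return train, evals

# Usage (path is a placeholder for one of my checkpoint directories):
# state = json.load(open("checkpoint-500/trainer_state.json"))
# train, evals = split_losses(state["log_history"])
# ...then plot each list of (step, loss) pairs with matplotlib.
```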
Additionally, I tried manually overriding metric_for_best_model before the training arguments are passed to the Trainer:
training_args.metric_for_best_model = "eval_loss"
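For context, here is roughly the full setup I'm working with. The output directory and step counts are placeholders, and I'm not certain the evaluation_strategy / save_strategy / load_best_model_at_end combination below is the minimal one needed — that uncertainty is part of my question.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",              # placeholder
    evaluation_strategy="steps",   # run eval during training (default is "no")
    eval_steps=500,                # placeholder interval
    save_strategy="steps",
    save_steps=500,
    load_best_model_at_end=True,   # reload the best checkpoint after training
)
# The override in question, applied before Trainer(...) is constructed:
training_args.metric_for_best_model = "eval_loss"
```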
If I do this: 1) is it a correct way to enforce validation-based checkpointing, and 2) in this situation, what losses are displayed in trainer_state.json?
Thank you for your help! I appreciate it.