Hi! I have a use case where I would like evaluation to happen at the beginning of training (before a training step has been taken) in addition to every n steps. I can easily get the latter using evaluation_strategy and eval_steps, but not sure how to get the former.
There is an argument, logging_first_step, that sounds like it should do exactly what I need:
logging_first_step (bool, optional, defaults to False) — Whether to log and evaluate the first global_step or not.
But providing this argument does not lead to evaluation at the beginning of training like I would expect (tested with the run_summarization.py script). Does anyone have an idea how to get this behaviour from the HF Trainer?
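For concreteness, the every-n-steps part corresponds to settings along these lines (I actually pass them as command-line flags to the script; the values here are placeholders):

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="out",             # placeholder
    evaluation_strategy="steps",  # evaluate every eval_steps
    eval_steps=500,               # placeholder value
    logging_first_step=True,      # logs the first step, but does not trigger an evaluation at step 0
)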
Thanks for the quick response! I guess this solution works, but the results do not end up in the log_history field of the trainer_state.json file, which is how I am tracking performance over time. So, two questions:
Is there a better way to track evaluation metrics over time (using the provided example scripts like run_summarization.py) than the log_history field of trainer_state.json? (See the sketch below for how I currently read that field.)
Might adding this functionality (evaluating before a training step has been taken) be a good idea? It seems like a pretty ubiquitous use case: you often want to plot performance over time, and knowing how the model performs (whether randomly initialized or pre-trained) before the first training step is useful. I would be happy to take a crack at this if I could get some advice on where to implement it.
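For context, here is roughly how I read the metrics out of trainer_state.json at the moment (the checkpoint path is just a placeholder):

import json

# the Trainer writes trainer_state.json into each checkpoint directory
with open("output_dir/checkpoint-500/trainer_state.json") as f:
    state = json.load(f)

# log_history is a list of dicts; evaluation entries carry keys like "eval_loss"
eval_entries = [e for e in state["log_history"] if "eval_loss" in e]
for e in eval_entries:
    print(e["step"], e["eval_loss"])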
Thanks for the post. One small issue with this approach: by the time the evaluation runs, one training step has already been taken, so it is not the same as evaluating the model without any training.
Doesn’t that get solved using on_step_begin? Works for me for NLLB finetuning with Seq2SeqTrainingArguments. The only caveat seems to be that train loss doesn’t exist at this point so the wandb plots are offset.
from transformers import TrainerCallback

class EvaluateFirstStepCallback(TrainerCallback):
    def on_step_begin(self, args, state, control, **kwargs):
        # request an evaluation at this point in training
        if state.global_step == 1:
            control.should_evaluate = True

trainer.add_callback(EvaluateFirstStepCallback())
on_step_begin(): called when step % args.gradient_accumulation_steps == 0, before all other operations in the iteration except restoring the random state; at that point global_step has not yet been incremented and the model parameters have not yet been updated.
on_step_end(): called after all operations in the training iteration, once global_step has been incremented; by the time global_step == 1 (with a gradient accumulation step of 1), the model parameters have already been updated once.
Therefore, the check needs to happen in on_step_begin() at global_step == 0 to ensure that the model parameters are evaluated before they have been updated.
The corresponding code is below:
from transformers import TrainerCallback

class EvaluateFirstStepCallback(TrainerCallback):
    def on_step_begin(self, args, state, control, **kwargs):
        # global_step is still 0 here, i.e. no parameters have been updated yet
        if state.global_step == 0:
            control.should_evaluate = True

trainer.add_callback(EvaluateFirstStepCallback())
I wish they’d fix this in the actual code. It’s so annoying that the flag for logging delay doesn’t work as expected. Wasted 10-15 mins finding this solution.
Now (2024) you can pass the parameter eval_on_start to your TrainingArguments object to make the model evaluate before any training step is taken.
This parameter was originally introduced under the now-deprecated name "sanity_evaluation", as introduced here, and later renamed.
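With that option the callback above shouldn't be needed any more; a minimal sketch (eval_strategy is just the newer name for evaluation_strategy, and the other values are placeholders):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",        # placeholder
    eval_strategy="steps",   # evaluate every eval_steps, as before
    eval_steps=500,          # placeholder value
    eval_on_start=True,      # run one evaluation before the first training step
)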