How to evaluate before first training step?

Hi! I have a use case where I would like evaluation to happen at the beginning of training (before a training step has been taken) in addition to every n steps. I can easily get the latter using evaluation_strategy and eval_steps, but I am not sure how to get the former.

There is an argument, logging_first_step, that sounds like it should do exactly what I need:

logging_first_step (bool, optional, defaults to False) — Whether to log and evaluate the first global_step or not.

But providing this argument does not lead to evaluation at the beginning of training like I would expect (tested with the script). Does anyone have an idea how to get this behaviour from the HF Trainer?


You can just add a line trainer.evaluate() before the call to train.
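As a minimal sketch of that suggestion: the helper name evaluate_then_train is hypothetical, and it only assumes that the object passed in exposes evaluate() and train() the way a Trainer does. evaluate() returns a dictionary of metrics, which you can keep around yourself.

```python
def evaluate_then_train(trainer):
    """Evaluate the untouched model first, then start training.

    `trainer` is assumed to behave like a transformers Trainer, i.e. it has
    an evaluate() method returning a dict of metrics and a train() method.
    """
    # Metrics for the model before any optimizer step has been taken.
    initial_metrics = trainer.evaluate()
    # Regular training, including any periodic evaluation configured.
    train_output = trainer.train()
    return initial_metrics, train_output
```

The returned initial_metrics can then be stored or plotted alongside the metrics collected during training.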


Thanks for the quick response! This solution works, but the results do not end up in the log_history field of the trainer_state.json file, which is how I am tracking performance over time. So I guess two questions:

  • Is there a better way to track evaluation metrics over time (using the provided example scripts) than the log_history field of trainer_state.json?
  • Might adding this functionality (evaluating during training before a training step has been taken) be a good idea? Seems like a pretty ubiquitous use case to me as you may want to plot performance over time and knowing the performance of the model (either randomly initialized or pre-trained) before the first train step is useful. Would be happy to take a crack at this if I could get some advice as to where to implement it.
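For reference, a rough sketch of pulling evaluation metrics out of log_history. It assumes the state file is JSON with a top-level "log_history" list whose evaluation entries carry a "step" key and a metric key such as "eval_loss"; the helper names and the path argument are hypothetical.

```python
import json
from pathlib import Path


def eval_history(log_history, metric="eval_loss"):
    """Return (step, value) pairs for the entries that contain `metric`."""
    return [(entry["step"], entry[metric]) for entry in log_history if metric in entry]


def load_eval_history(state_path, metric="eval_loss"):
    """Read a saved state file and extract the history of one metric."""
    state = json.loads(Path(state_path).read_text())
    return eval_history(state["log_history"], metric)
```

Entries that only contain training loss are skipped, so the result can be plotted directly as metric versus step.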

An alternative approach, which may or may not solve the problem, is to use a callback, like so:

class EvaluateFirstStepCallback(TrainerCallback):
    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step == 1:
            control.should_evaluate = True


It does seem like it might be nice to have as a built-in TrainingArguments, given it mirrors logging_first_step pretty closely.


Thanks for the post. One small issue here is that, done this way, one training step will already have been taken, which is not the same as evaluating the model without any training :thinking:

Doesn’t that get solved using on_step_begin? Works for me for NLLB finetuning with Seq2SeqTrainingArguments. The only caveat seems to be that train loss doesn’t exist at this point so the wandb plots are offset.

class EvaluateFirstStepCallback(TrainerCallback):
    def on_step_begin(self, args, state, control, **kwargs):
        if state.global_step == 1:
            control.should_evaluate = True


The code snippets provided by @frankier and @zouharvi both seem to have minor errors.

According to

  • on_step_begin(): called only when step % args.gradient_accumulation_steps == 0, before all operations of the iteration except for the random state. At this point global_step has not yet been updated, and neither have the model parameters.
  • on_step_end(): called after all operations in each training iteration, once global_step has been updated. When global_step == 1 (with gradient accumulation steps set to 1), the model parameters have already been updated once.

Therefore, the check in on_step_begin() has to be global_step == 0 to ensure that the model is evaluated before its parameters have been updated.
The corresponding code is below:

class EvaluateFirstStepCallback(TrainerCallback):
    def on_step_begin(self, args, state, control, **kwargs):
        if state.global_step == 0:
            control.should_evaluate = True


I wish they’d fix this in the actual code. It’s so annoying that the flag for logging delay doesn’t work as expected. Wasted 10-15 mins finding this solution.


I have to say the progress bar also steps to “1” even with the code you provided.


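The reason is that evaluation happens after the training step, i.e. should_evaluate is only checked once the step has finished, so setting it in on_step_begin still triggers evaluation after the first update.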
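To make the ordering concrete, here is a toy loop. It is not the actual Trainer implementation; it only mimics the order described in this thread, assuming gradient accumulation of 1. ToyState, ToyControl and run_toy_loop are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ToyState:
    global_step: int = 0


@dataclass
class ToyControl:
    should_evaluate: bool = False


def run_toy_loop(num_steps):
    """Return (event, global_step) tuples in the order they happen."""
    state, control, events = ToyState(), ToyControl(), []
    for _ in range(num_steps):
        # "on_step_begin": global_step has not been incremented yet.
        if state.global_step == 0:
            control.should_evaluate = True
        # The parameter update happens, then global_step is incremented.
        state.global_step += 1
        events.append(("update", state.global_step))
        # The flag is only inspected here, after the update.
        if control.should_evaluate:
            events.append(("evaluate", state.global_step))
            control.should_evaluate = False
    return events
```

In this simplified model, the flag raised at global_step == 0 is honoured only after the first update, which matches the progress bar showing “1” when the evaluation runs.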

Hey everybody!

Now (2024) we can pass eval_on_start=True when creating the TrainingArguments object to make the model evaluate before undergoing any training steps :wink:

This new parameter was renamed from a deprecated one, “sanity_evaluation”, as introduced here.

I hope this update will find all of you well!
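A minimal sketch of how the arguments might be assembled, assuming a recent transformers release in which eval_on_start is available and the strategy argument is named eval_strategy (older releases use evaluation_strategy, as in the first post). The helper name eval_on_start_kwargs and the values are hypothetical.

```python
def eval_on_start_kwargs(output_dir, eval_steps):
    """Build keyword arguments for TrainingArguments (names assumed, see above)."""
    return {
        "output_dir": output_dir,
        "eval_strategy": "steps",  # evaluate every `eval_steps` steps
        "eval_steps": eval_steps,
        "eval_on_start": True,     # also evaluate before the first training step
    }


kwargs = eval_on_start_kwargs("out", 500)

# With transformers installed, the dictionary would be used like this:
# from transformers import TrainingArguments
# training_args = TrainingArguments(**kwargs)
```

Keeping the settings in a plain dictionary makes it easy to swap the strategy key if your installed version still expects evaluation_strategy.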