How does generation work with `compute_metrics`?

I fine-tuned a model, and my validation metrics are an order of magnitude higher than the metrics on the test set. I know this is quite possible, but such a large difference seems extreme to me. I've also noticed that generation is very sensitive to the generation parameters (e.g. `repetition_penalty`, `min_length`, `max_length`).
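For illustration, my manual prediction on the test set looks roughly like this (the checkpoint name, input list, and parameter values below are placeholders, not my exact settings):

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Placeholder checkpoint and generation settings -- the point is only
    # that every generation parameter is set explicitly in this call.
    tokenizer = AutoTokenizer.from_pretrained("my-finetuned-checkpoint")
    model = AutoModelForSeq2SeqLM.from_pretrained("my-finetuned-checkpoint")

    test_texts = ["example input from my test set"]  # placeholder
    inputs = tokenizer(test_texts, return_tensors="pt", padding=True, truncation=True)

    output_ids = model.generate(
        **inputs,
        max_length=128,
        min_length=10,
        repetition_penalty=1.2,
        num_beams=4,
    )
    predictions = tokenizer.batch_decode(output_ids, skip_special_tokens=True)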

So I'm trying to understand how exactly prediction occurs on the validation set. The only mention of `compute_metrics` in the `Trainer` source code is here:

        # later use `self.model is self.model_wrapped` to check if it's wrapped or not
        self.model_wrapped = model
        self.model = model

        self.compute_metrics = compute_metrics

and then it appears in `evaluate()`:

        eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
        output = eval_loop(
            eval_dataloader,
            description="Evaluation",
            # No point gathering the predictions if there are no metrics, otherwise we defer to
            # self.args.prediction_loss_only
            prediction_loss_only=True if self.compute_metrics is None else None,
            ignore_keys=ignore_keys,
            metric_key_prefix=metric_key_prefix,
        )
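If I understand correctly, `evaluation_loop` calls `prediction_step` on every batch, and in the `Seq2SeqTrainer` case (with `predict_with_generate=True`) that is where `generate()` actually gets called, with any parameter I don't pass explicitly falling back to the model's generation config. A simplified sketch of my mental model per batch, not the actual library source:

    # Simplified mental model of what one prediction step does when
    # predict_with_generate=True -- NOT the real Seq2SeqTrainer code.
    def prediction_step_sketch(model, batch, gen_kwargs):
        # Any generation argument missing from gen_kwargs falls back to
        # the model's generation config, which is exactly the kind of
        # hidden difference I'm worried about between eval and test runs.
        generated_ids = model.generate(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            **gen_kwargs,  # e.g. max_length, num_beams, repetition_penalty
        )
        return generated_ids

Is that roughly what happens, or does the plain `Trainer` evaluation loop never call `generate()` at all and just compute metrics from the teacher-forced logits?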

I'm having trouble understanding how the prediction actually happens during evaluation. I'd like to verify that I'm using exactly the same generation parameters when I predict on the test set, so I can investigate the difference in metrics.
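Concretely, would something like the following guarantee that `evaluate()` and `predict()` run with the same generation settings? (This assumes `Seq2SeqTrainer`; `model`, `tokenizer`, `val_dataset`, `test_dataset`, and `compute_metrics` are placeholders for my own objects.) As far as I can tell, `generation_max_length` and `generation_num_beams` can be pinned in `Seq2SeqTrainingArguments` in recent versions, while parameters like `repetition_penalty` still come from the model's generation config.

    from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

    # Pin the generation settings once so both evaluate() and predict()
    # use the same values instead of silently falling back to defaults.
    args = Seq2SeqTrainingArguments(
        output_dir="out",
        predict_with_generate=True,
        generation_max_length=128,  # placeholder value
        generation_num_beams=4,     # placeholder value
    )

    trainer = Seq2SeqTrainer(
        model=model,                      # my fine-tuned model (placeholder)
        args=args,
        eval_dataset=val_dataset,         # placeholder
        tokenizer=tokenizer,              # placeholder
        compute_metrics=compute_metrics,  # same metric function as in training
    )

    val_metrics = trainer.evaluate()
    test_output = trainer.predict(test_dataset)  # should use the same generation settings
    print(val_metrics)
    print(test_output.metrics)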
