I fine-tuned a model and my validation metrics are an order of magnitude higher than the metrics on the test set. I know this is possible in principle, but such a large gap seems extreme to me. I’ve also noticed that generation is very sensitive to the generation parameters (e.g. repetition_penalty, min_length, max_length).
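For context, this is roughly what I do on the test set (a simplified sketch, not my exact script; the checkpoint name, test_texts, and all parameter values are placeholders, and I’m using a seq2seq-style setup here just to illustrate):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# placeholders for my fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained("my-finetuned-checkpoint")
model = AutoModelForSeq2SeqLM.from_pretrained("my-finetuned-checkpoint")
model.eval()

# test_texts: list of raw test inputs (placeholder)
inputs = tokenizer(test_texts, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    generated = model.generate(
        **inputs,
        max_length=128,           # placeholder value
        min_length=10,            # placeholder value
        repetition_penalty=1.2,   # placeholder value
        num_beams=4,              # placeholder value
    )

predictions = tokenizer.batch_decode(generated, skip_special_tokens=True)

Changing any of these generation arguments moves the test metrics noticeably, which is why I want to know exactly which settings are used during validation.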
So I’m trying to understand how exactly prediction happens on the validation set. The only mention of compute_metrics I could find in the Trainer source code is here:
# later use `self.model is self.model_wrapped` to check if it's wrapped or not
self.model_wrapped = model
self.model = model
self.compute_metrics = compute_metrics
and then it appears in `evaluate()`:
eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
output = eval_loop(
    eval_dataloader,
    description="Evaluation",
    # No point gathering the predictions if there are no metrics, otherwise we defer to
    # self.args.prediction_loss_only
    prediction_loss_only=True if self.compute_metrics is None else None,
    ignore_keys=ignore_keys,
    metric_key_prefix=metric_key_prefix,
)
I’m having trouble understanding how the prediction actually happens inside this loop. I’d like to verify that I’m using exactly the same generation parameters when I predict on the test set, so I can investigate the difference in metrics.
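If I understand the API correctly, something like the sketch below would push both validation and test prediction through the same generation settings. This assumes Seq2SeqTrainer with predict_with_generate=True; the model, tokenizer, datasets, metric, and parameter values are placeholders (model and tokenizer as in the snippet above). Is this the right way to guarantee identical generation parameters for both splits?

import numpy as np
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    # labels use -100 for ignored positions; restore pad tokens before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return {"my_metric": 0.0}  # placeholder for my actual metric

args = Seq2SeqTrainingArguments(
    output_dir="out",
    predict_with_generate=True,  # evaluation goes through model.generate()
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    eval_dataset=val_ds,    # placeholder
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# pass the same generation settings to both calls so the metrics are comparable
val_metrics = trainer.evaluate(max_length=128, num_beams=4)
test_output = trainer.predict(test_ds, metric_key_prefix="test", max_length=128, num_beams=4)
print(val_metrics, test_output.metrics)

(I’ve seen that more recent versions also expose generation_max_length / generation_num_beams on Seq2SeqTrainingArguments, but I’m not sure which of these actually control the validation-time generation, or where repetition_penalty and min_length come from in that path.)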