Different generations during test time and validation time

I am fine-tuning a model on a summarization task, converting a domain-specific language (DSL) into a natural-language summary, and I get outputs of very different quality in the validation and test phases. For example, when I pass the same input during the validation and testing phases, I get two very different results:

Validation phase output:

I used the tititanic dataset and anding only those records where the passenger’ a parents.

Test phase output:

I used the Titanic dataset, retaining only those records where the passenger had two children.

As you can see, the quality of these outputs is vastly different. To be clear, by the validation phase I mean the prediction text obtained via a compute_metrics function during training; by test time I mean outputs generated by calling model.generate() after the training loop has finished, using either the final model or any of its intermediate checkpoints. A minimal sketch of the two paths follows below.
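To make the comparison concrete, here is a minimal sketch of the two code paths I am describing, assuming a Seq2SeqTrainer setup; the checkpoint name, dataset wiring, and generation settings are placeholders rather than my exact configuration.

```python
# A minimal sketch of the two generation paths, assuming a Seq2SeqTrainer
# setup; the checkpoint name, lengths, and beam settings are placeholders.
import numpy as np
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "t5-base"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# --- Validation path: prediction text arrives in compute_metrics ---
def compute_metrics(eval_pred):
    preds, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Labels are padded with -100, which must be replaced before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # ... compute ROUGE/BLEU on decoded_preds vs. decoded_labels ...
    return {}

training_args = Seq2SeqTrainingArguments(
    output_dir="out",
    predict_with_generate=True,   # eval predictions come from generate()
    generation_max_length=128,    # placeholder generation settings
    generation_num_beams=4,
)
# trainer = Seq2SeqTrainer(model=model, args=training_args,
#                          train_dataset=..., eval_dataset=...,
#                          compute_metrics=compute_metrics)

# --- Test path: calling generate() directly after training ---
inputs = tokenizer("summarize: <DSL input>", return_tensors="pt")
output_ids = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```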

Expected behavior

I want to understand what is going on here and why the results differ so drastically between the two phases. Finally, it would be helpful if someone could point out how to make the quality of these generations consistent.