Different generations during test time and validation time

karths324 · August 9, 2023, 2:13pm

I am trying to fine-tune a model on a summarization task that converts a domain-specific language into a summary, and I get outputs of different quality in the validation and test phases. For example, when I pass in the same input during both phases, I get two very different results:

Validation Phase output:

I used the tititanic dataset and anding only those records where the passenger’ a parents.

Test phase output:

I used the Titanic dataset, retaining only those records where the passenger had two children.

As you can see, the quality of these outputs is vastly different. To be clear, by the validation phase I mean the prediction text obtained via a compute_metrics function during training. By test time, I mean outputs generated with model.generate() after the training loop is complete, using either the final model or any of its intermediate checkpoints.
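One possible difference between the two paths, stated as an assumption because the training code is not shown here: the predictions that reach compute_metrics may come from a teacher-forced forward pass (argmax over logits, with the reference summary fed as decoder input) unless the trainer is configured to call generate during evaluation, whereas model.generate() decodes autoregressively from the model's own previous tokens. The toy sketch below illustrates only that distinction. The lookup-table "model", the vocabulary, and the function names are all hypothetical and are not part of the Transformers API.

```python
from typing import Dict, List

# Hypothetical toy "model": maps the previous token to the predicted next token.
# It contains one deliberate mistake ("a" -> "c" instead of "a" -> "b").
NEXT_TOKEN: Dict[str, str] = {
    "<bos>": "a",
    "a": "c",      # model error: the reference continues with "b"
    "b": "c",
    "c": "<eos>",
}

REFERENCE: List[str] = ["a", "b", "c", "<eos>"]


def teacher_forced_argmax(reference: List[str]) -> List[str]:
    """Predict every position from the *reference* prefix (one forward pass)."""
    decoder_inputs = ["<bos>"] + reference[:-1]
    return [NEXT_TOKEN[tok] for tok in decoder_inputs]


def autoregressive_generate(max_len: int = 10) -> List[str]:
    """Predict every position from the model's *own* previous output."""
    output: List[str] = []
    prev = "<bos>"
    for _ in range(max_len):
        nxt = NEXT_TOKEN[prev]
        output.append(nxt)
        if nxt == "<eos>":
            break
        prev = nxt
    return output


validation_like = teacher_forced_argmax(REFERENCE)
test_like = autoregressive_generate()
print("teacher-forced:", validation_like)
print("generated:     ", test_like)
```

In this toy, the teacher-forced sequence mixes correct and incorrect tokens position by position, so it is not a sequence the model would ever produce when decoding freely. If this is what is happening in the real setup, comparing the decoding path and generation settings used during evaluation with those passed to model.generate() would be the first thing to check.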

Expected behavior

I want to understand what is going on here and why the results differ so much between the two phases. It would also be helpful if someone could point out how to make the quality of these generations consistent.
