Generation / Inference

Hello!

I'm fine-tuning Llama-2-7b (the base model) on a single text-generation task. During training, after each epoch, I call the `generate` function on 10 held-out examples (not in the training set) to track how learning is progressing.
The generations seem to improve with each epoch, but once training is over and I load my model for inference, those same 10 examples give completely different results. Why?

I'm using the same prompt format in training and in inference:

Instruction: {}

Input: {}

Response:
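For reference, here is a minimal sketch of how the prompt string is built (the helper name `build_prompt` is just for illustration; the field labels come from the template above):

```python
def build_prompt(instruction: str, input_text: str) -> str:
    # The exact spacing and newlines must match between training and
    # inference: even a stray space after a colon changes the token sequence.
    return (
        f"Instruction: {instruction}\n\n"
        f"Input: {input_text}\n\n"
        f"Response:"
    )

prompt = build_prompt("Summarize the text.", "Some example input.")
```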

Am I saving my model incorrectly? Is my inference code wrong?
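To rule out the save/load step itself, a sanity check I could run is a save-and-reload round trip with greedy decoding (`do_sample=False`), which should reproduce identical outputs token for token. The sketch below (assumptions: a tiny randomly initialized GPT-2 stands in for the fine-tuned Llama-2-7b, and hard-coded token IDs stand in for a real tokenized prompt) shows the idea:

```python
# Sketch: check that a saved model reproduces the exact same greedy
# generations after reloading. A tiny randomly initialized GPT-2 is used
# here as a stand-in for the fine-tuned Llama-2-7b checkpoint.
import tempfile

import torch
from transformers import GPT2Config, GPT2LMHeadModel

torch.manual_seed(0)
config = GPT2Config(n_layer=2, n_head=2, n_embd=64, vocab_size=100, n_positions=64)
model = GPT2LMHeadModel(config).eval()

prompt_ids = torch.tensor([[1, 2, 3, 4]])  # stand-in for a tokenized prompt
# Greedy decoding removes run-to-run sampling randomness.
before = model.generate(prompt_ids, do_sample=False, max_new_tokens=8, pad_token_id=0)

with tempfile.TemporaryDirectory() as tmp:
    model.save_pretrained(tmp)
    reloaded = GPT2LMHeadModel.from_pretrained(tmp).eval()

after = reloaded.generate(prompt_ids, do_sample=False, max_new_tokens=8, pad_token_id=0)
# If saving/loading is correct, the two outputs match token for token.
assert torch.equal(before, after)
```

If the round trip matches but real inference still differs, the discrepancy likely comes from something else, e.g. sampling settings (`do_sample`, `temperature`, `top_p`) that differ between the in-training `generate` calls and the inference script.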