Why is the accuracy of a fine-tuned model lower when evaluated after loading from disk than during training?

I am fine-tuning a transformer model and evaluating it at the end of each epoch during training. The best model is the one with the highest evaluation accuracy across all epochs. Once training is complete and the best model has been saved to disk, I load it and try to reproduce that validation accuracy. I cannot reproduce the exact validation accuracy reported during training: I get a 3% to 4% drop in accuracy on the same evaluation data.

(To reproduce the accuracy, I call the same evaluation function and pass it the model and the dataset. Nothing else changes for this re-evaluation.)
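
The workflow described above can be sketched roughly as follows. This is an illustrative, framework-agnostic stand-in and not the actual training code: the toy model (a dict holding two weights), the functions `train_one_epoch`, `evaluate`, and `train_and_save_best`, the file name `best_model.pkl`, and the tiny dataset are all hypothetical. It only shows the expectation behind the question: the checkpoint written at the best epoch, once reloaded and passed to the same evaluation function with the same data, should give the same accuracy.

```python
import os
import pickle
import tempfile


def evaluate(model, dataset):
    """Hypothetical evaluation function: fraction of correctly classified examples."""
    correct = sum(1 for x, y in dataset if (model["w"] * x + model["b"] > 0) == y)
    return correct / len(dataset)


def train_one_epoch(model, epoch):
    """Stand-in for a real training step; it only nudges the weights in place."""
    model["w"] += 0.5 - 0.3 * epoch
    model["b"] -= 0.1


def train_and_save_best(dataset, epochs, path):
    """Evaluate after every epoch and write the best-scoring weights to disk."""
    model = {"w": -1.0, "b": 0.0}
    best_acc = -1.0
    for epoch in range(epochs):
        train_one_epoch(model, epoch)
        acc = evaluate(model, dataset)
        if acc > best_acc:
            best_acc = acc
            # Serialize the weights as they are at this epoch, so later
            # epochs cannot overwrite the saved best checkpoint.
            with open(path, "wb") as f:
                pickle.dump(model, f)
    return best_acc


# Toy evaluation data: (feature, label) pairs.
dataset = [(-2.0, False), (-1.0, False), (0.5, True), (1.0, True), (2.0, True)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "best_model.pkl")
    train_acc = train_and_save_best(dataset, epochs=5, path=path)

    # Reload the saved checkpoint and re-run the same evaluation function.
    with open(path, "rb") as f:
        reloaded = pickle.load(f)
    reloaded_acc = evaluate(reloaded, dataset)

print("best accuracy during training:", train_acc)
print("accuracy after reloading:     ", reloaded_acc)
```

In this toy setting the two numbers match exactly, because evaluation is deterministic and the file holds the weights from the best epoch. A gap like the one I am seeing would suggest that at least one of those two conditions does not hold in the real pipeline.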