Llama2-7b-hf model not reproducible across runs

I am fine-tuning a Llama2-7b-hf model on my custom dataset. However, the train and eval losses are different every time I re-run training with the Hugging Face Trainer. I set the seed prior to model training using the set_seed function and also passed the seed as an argument to the Trainer.
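For reference, the seeding part looks roughly like this (the seed value 42 is arbitrary). `set_seed` seeds Python's `random`, NumPy, and torch in one call, so two draws after the same seed should match on the same device:

```python
import torch
from transformers import set_seed

def seeded_draw(seed: int) -> torch.Tensor:
    # set_seed seeds python's `random`, NumPy, and torch (CPU and CUDA)
    set_seed(seed)
    return torch.rand(4)

a = seeded_draw(42)
b = seeded_draw(42)
print(torch.equal(a, b))  # same seed, same device -> identical tensors
```

In the actual run, the same seed is also passed as `TrainingArguments(seed=42, ...)`, which the Trainer applies again at the start of `train()`.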

I tested the same code with the Mistral model and did not observe the same behavior. Any idea what could cause this difference?

I see the same issue – I set the random seed prior to model loading. In my case the behavior does not occur with Phi.

Update: it seems the cause is flash attention.