I’m currently working on a document-based score analysis using Gemini 2.0 Flash. The goal is to evaluate Document X by extracting certain metrics from it and comparing them with a reference document, Document Y: essentially comparing “what I have” (X) versus “what’s required” (Y).
Issue:
We’re using a consistent prompt to compare Document X against Document Y. However, we’ve noticed that the generated scores vary between runs, even when the input documents and prompt remain unchanged.
Most of the time, the score fluctuation is within a range of ~5%, which is acceptable.
But occasionally, we see a much larger variation — sometimes as high as 25%, which is problematic for our use case where reliability and consistency are critical.
Looking for Suggestions:
How can we reduce this inconsistency and make the score generation more stable and reliable?
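For anyone who wants to reproduce the fluctuation, here is a minimal sketch of how the run-to-run spread can be measured. It assumes the google-genai Python SDK; the prompt wording, the “SCORE:” parsing, and the document loading are simplified placeholders, not our actual pipeline.

```python
import re
import statistics

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

# Placeholder documents: "what I have" vs. "what's required".
doc_x = open("document_x.txt").read()
doc_y = open("document_y.txt").read()

PROMPT = (
    "Compare Document X against the reference Document Y and return a single "
    "compliance score from 0 to 100 on the last line as 'SCORE: <number>'.\n\n"
    "Document X:\n{doc_x}\n\nDocument Y:\n{doc_y}"
)


def get_score() -> float:
    """Run one comparison and parse the numeric score out of the response."""
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=PROMPT.format(doc_x=doc_x, doc_y=doc_y),
        config=types.GenerateContentConfig(temperature=0.0),
    )
    match = re.search(r"SCORE:\s*([\d.]+)", response.text)
    if match is None:
        raise ValueError(f"no score in response: {response.text!r}")
    return float(match.group(1))


# Repeat the identical request and look at the spread between runs.
scores = [get_score() for _ in range(10)]
print("scores:", scores)
print(f"mean={statistics.mean(scores):.2f} "
      f"stdev={statistics.pstdev(scores):.2f} "
      f"range={max(scores) - min(scores):.2f}")
```

With the same documents and prompt, the reported range is usually within ~5%, but occasionally jumps to ~25%, which is what we’re trying to eliminate.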
Yep, I tried temperature, topP, and topK and couldn’t achieve determinism. Thanks for sharing the paper; I noticed a new parameter, “seed”, which I hadn’t come across in my research and which seems interesting.
I’ll try it out and see if it solves the problem (rough sketch of what I plan to run is below).
Thanks for helping out and for your time.
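For reference, this is roughly what I’m planning to test with the google-genai Python SDK. The model name and config values are placeholders, and whether the `seed` field actually gives bit-for-bit determinism on Gemini 2.0 Flash is exactly what I want to verify.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

# Pin every sampling knob plus the seed, then send the identical request
# twice and compare the raw outputs.
config = types.GenerateContentConfig(
    temperature=0.0,  # what I had already tried
    top_p=1.0,
    top_k=1,
    seed=42,          # the parameter I hadn't tried yet
)

prompt = "Compare Document X against Document Y and return a score from 0 to 100."

run_1 = client.models.generate_content(
    model="gemini-2.0-flash", contents=prompt, config=config
)
run_2 = client.models.generate_content(
    model="gemini-2.0-flash", contents=prompt, config=config
)

print("identical output:", run_1.text == run_2.text)
```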
I don’t know what the cause is, but it reminds me of chaos theory and sensitivity to initial conditions with floating-point numbers.
So my simple idea is to restart everything from a fresh boot and run it in isolation: record the output, then run it again and see if it differs.
But feel free to ignore my suggestions, as I’m not even running my own setup here yet.