I’m currently working on a document-based score analysis using Gemini 2.0-Flash. The goal is to evaluate Document X by extracting certain metrics from it and comparing them with a reference document, Document Y — essentially comparing “what I have” (X) versus “what’s required” (Y).
Issue:
We’re using a consistent prompt to compare Document X against Document Y. However, we’ve noticed that the generated scores vary between runs, even when the input documents and prompt remain unchanged.
Most of the time, the score fluctuation is within a range of ~5%, which is acceptable.
But occasionally, we see a much larger variation — sometimes as high as 25%, which is problematic for our use case where reliability and consistency are critical.
Looking for Suggestions:
How can we reduce this inconsistency and make the score generation more stable and reliable?
Yep, I tried temperature, topP, and topK and couldn’t achieve determinism. Thanks for sharing the paper; I noticed a new parameter, “seed”, which I hadn’t come across in my research and which seems interesting.
I’ll try it out and see if it solves the problem.
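For reference, this is roughly what I plan to test. A minimal sketch, assuming the google-genai Python SDK and that GenerateContentConfig accepts a seed field; the prompt below is just a placeholder for our real comparison prompt:

```python
# Minimal sketch (assumes the google-genai Python SDK; the prompt is a placeholder).
from google import genai
from google.genai import types

client = genai.Client()  # picks up the GOOGLE_API_KEY environment variable

config = types.GenerateContentConfig(
    temperature=0.0,  # remove randomness from token selection as far as possible
    top_p=1.0,
    top_k=1,
    seed=42,          # fixed seed so the sampler starts from the same state every run
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Compare Document X against Document Y and return a single numeric score.",
    config=config,
)
print(response.text)
```

If the seed really does pin down the sampler, two runs of this should come back identical.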
Thanks for your help and your time.
I do not know what the matter is, but it reminds me of chaos theory and sensitivity to initial conditions with floating-point numbers.
So, my simple idea is to restart everything from a fresh boot and just run it alone. Record the output and then run it again to see if it differs.
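Something like this is what I have in mind, completely untested on my side (it assumes the google-genai Python SDK and uses a placeholder prompt):

```python
# Untested sketch of that check: call the exact same prompt twice and diff the outputs.
from difflib import unified_diff
from google import genai

client = genai.Client()  # picks up the GOOGLE_API_KEY environment variable

def run_once() -> str:
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents="Compare Document X against Document Y and return a single numeric score.",
    )
    return response.text

first, second = run_once(), run_once()
diff = list(unified_diff(first.splitlines(), second.splitlines(), lineterm=""))
print("outputs are identical" if not diff else "\n".join(diff))
```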
But feel free to ignore my suggestions as I am not even running my own stuff here yet.
I tried the approach mentioned in the paper (seed), but it didn’t actually help much.
To be frank, I got better results without configuring anything than with {temperature, topP, topK, seed} set. I am not sure what is going wrong; something feels off.
The AI is not being deterministic, or at least not within our allowed range of ±5%.
I even tried OpenAI GPT-4o, but nothing changed much.
If anyone is facing the same issue and was able to find a solution, please feel free to share your approach.
Setting seed to a fixed value just ensures repeatable results because it initialises the pseudo-random number generator (PRNG) to the same state for each run, but it doesn’t ensure the correctness of the result. If there are large discrepancies from run to run, it’s more sensible to repeat a large number of runs with the same inputs and average their results, using the sample’s standard deviation to estimate a confidence interval. (This assumes that the distribution of the results is normal; otherwise some more robust estimators of mean and spread should be used.)
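Something along these lines, where score_document() is a hypothetical placeholder for whatever call produces a single score from your prompt:

```python
# Sketch of the repeat-and-aggregate approach: run the scorer N times on the same
# inputs, then report the mean with a normal-approximation confidence interval.
import math
import statistics

def score_document(doc_x: str, doc_y: str) -> float:
    # Hypothetical placeholder: call your Gemini comparison prompt here and
    # return the single numeric score it produces.
    raise NotImplementedError

def score_with_ci(doc_x: str, doc_y: str, n_runs: int = 20) -> tuple[float, float]:
    scores = [score_document(doc_x, doc_y) for _ in range(n_runs)]
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores)               # sample standard deviation
    half_width = 1.96 * spread / math.sqrt(n_runs)  # ~95% CI half-width
    return mean, half_width

# mean_score, half_width = score_with_ci(doc_x_text, doc_y_text)
# print(f"score = {mean_score:.1f} ± {half_width:.1f}")
```

The 1.96 factor gives an approximately 95% interval under the normality assumption mentioned above; with only a handful of runs, a t-based interval would be more appropriate.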
Thanks for your suggestions, but here I am validating a document and reasoning about why it got the score it did; it’s not just for research but also for a product, so re-running each document multiple times won’t be a good option, I guess.
Also, these are fairly domain-specific documents and analyses. Do you think using some RAG with similar docs in a vector store would give a more deterministic score?
Thanks, there is so much to AI and it keeps changing ever faster, so it made me smile to be even a little useful.
I have all the stuff to set up my old workstation as an AI lab so that is on the todo.
I’m not sure: in RAG systems a vector database stores chunks of documents, and can retrieve chunks semantically close to the query. But in your case, if I understand it correctly, you should estimate the closeness of whole documents.
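If the goal is a deterministic measure of how close the two whole documents are, one option is to embed each document and compare the vectors, for example with cosine similarity. A rough sketch, assuming the google-genai SDK and its text-embedding-004 model (both of which are my assumptions, not something from your setup):

```python
# Rough sketch: embed each whole document and compare the vectors with cosine
# similarity, which is fully deterministic once the embeddings are computed.
# Assumes the google-genai SDK and the text-embedding-004 model.
import math
from google import genai

client = genai.Client()  # picks up the GOOGLE_API_KEY environment variable

def embed(text: str) -> list[float]:
    result = client.models.embed_content(model="text-embedding-004", contents=text)
    return result.embeddings[0].values

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# similarity = cosine_similarity(embed(doc_x_text), embed(doc_y_text))
```

Keep in mind that embedding models have input length limits, so very long documents may still need truncation or chunk-level averaging before this works as-is.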