I’m currently working on a document-based score analysis using Gemini 2.0-Flash. The goal is to evaluate Document X by extracting certain metrics from it and comparing them with a reference document, Document Y — essentially comparing “what I have” (X) versus “what’s required” (Y).
Issue:
We’re using a consistent prompt to compare Document X against Document Y. However, we’ve noticed that the generated scores vary between runs, even when the input documents and prompt remain unchanged.
Most of the time, the score fluctuation is within a range of ~5%, which is acceptable.
But occasionally, we see a much larger variation — sometimes as high as 25%, which is problematic for our use case where reliability and consistency are critical.
Looking for Suggestions:
How can we reduce this inconsistency and make the score generation more stable and reliable?
Yep, I tried temperature, topP, and topK and couldn’t achieve determinism. Thanks for sharing the paper; I noticed a new parameter, “seed”, which I hadn’t come across in my research and which seems interesting.
I’ll try it out and see if it solves the problem.
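For reference, this is roughly what I plan to test. A minimal sketch, assuming the google-genai Python SDK and that GenerateContentConfig accepts a seed field; the prompt below is just a placeholder for our real comparison prompt:

```python
# Minimal sketch (assumes the google-genai Python SDK; the prompt is a placeholder).
from google import genai
from google.genai import types

client = genai.Client()  # picks up the GOOGLE_API_KEY environment variable

config = types.GenerateContentConfig(
    temperature=0.0,  # remove randomness from token selection as far as possible
    top_p=1.0,
    top_k=1,
    seed=42,          # fixed seed so the sampler starts from the same state every run
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Compare Document X against Document Y and return a single numeric score.",
    config=config,
)
print(response.text)
```

If the seed really does pin down the sampler, two runs of this should come back identical.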
Thanks for your help and your time.
I do not know what the matter is, but it reminds me of chaos theory and sensitivity to initial conditions with floating-point numbers.
So, my simple idea is to restart everything from a fresh boot and just run it alone. Record the output and then run it again to see if it differs.
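Something like this is what I have in mind, completely untested on my side (it assumes the google-genai Python SDK and uses a placeholder prompt):

```python
# Untested sketch of that check: call the exact same prompt twice and diff the outputs.
from difflib import unified_diff
from google import genai

client = genai.Client()  # picks up the GOOGLE_API_KEY environment variable

def run_once() -> str:
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents="Compare Document X against Document Y and return a single numeric score.",
    )
    return response.text

first, second = run_once(), run_once()
diff = list(unified_diff(first.splitlines(), second.splitlines(), lineterm=""))
print("outputs are identical" if not diff else "\n".join(diff))
```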
But feel free to ignore my suggestions as I am not even running my own stuff here yet.
I tried the approach mentioned in the paper (seed), but it didn’t actually help much.
To be frank, I got better results without configuring anything than with {temperature, topP, topK, seed} set. I am not sure what is going wrong; something feels off.
The AI is not being deterministic, or at least not within our allowed range of ±5%.
I even tried OpenAI GPT-4o, but nothing changed much.
If anyone is facing the same issue and was able to find a solution, please feel free to share your approach.
Setting seed to a fixed value just ensures repeatable results because it initialises the pseudo-random number generator (PRNG) to the same state for each run, but it doesn’t ensure the correctness of the result. If there are large discrepancies from run to run, it’s more sensible to repeat a large number of runs with the same inputs and average their results, using the sample’s standard deviation to estimate a confidence interval. (This assumes that the distribution of the results is normal; otherwise some more robust estimators of mean and spread should be used.)
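Something along these lines, where score_document() is a hypothetical placeholder for whatever call produces a single score from your prompt:

```python
# Sketch of the repeat-and-aggregate approach: run the scorer N times on the same
# inputs, then report the mean with a normal-approximation confidence interval.
import math
import statistics

def score_document(doc_x: str, doc_y: str) -> float:
    # Hypothetical placeholder: call your Gemini comparison prompt here and
    # return the single numeric score it produces.
    raise NotImplementedError

def score_with_ci(doc_x: str, doc_y: str, n_runs: int = 20) -> tuple[float, float]:
    scores = [score_document(doc_x, doc_y) for _ in range(n_runs)]
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores)               # sample standard deviation
    half_width = 1.96 * spread / math.sqrt(n_runs)  # ~95% CI half-width
    return mean, half_width

# mean_score, half_width = score_with_ci(doc_x_text, doc_y_text)
# print(f"score = {mean_score:.1f} ± {half_width:.1f}")
```

The 1.96 factor gives an approximately 95% interval under the normality assumption mentioned above; with only a handful of runs, a t-based interval would be more appropriate.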
Thanks for your suggestions, but here I am validating a document and reasoning about why it got the score it did; it’s not just for research but also for a product, so re-running each document multiple times won’t be a good option, I guess.
Also, these are fairly domain-specific documents and analyses. Do you think using some RAG with similar docs in a vector store would give a more deterministic score?
Thanks, there is so much to AI and it keeps changing ever faster, so it made me smile to be even a little useful.
I have all the stuff to set up my old workstation as an AI lab so that is on the todo.
I’m not sure: in RAG systems a vector database stores chunks of documents, and can retrieve chunks semantically close to the query. But in your case, if I understand it correctly, you should estimate the closeness of whole documents.
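If the goal is a deterministic measure of how close the two whole documents are, one option is to embed each document and compare the vectors, for example with cosine similarity. A rough sketch, assuming the google-genai SDK and its text-embedding-004 model (both of which are my assumptions, not something from your setup):

```python
# Rough sketch: embed each whole document and compare the vectors with cosine
# similarity, which is fully deterministic once the embeddings are computed.
# Assumes the google-genai SDK and the text-embedding-004 model.
import math
from google import genai

client = genai.Client()  # picks up the GOOGLE_API_KEY environment variable

def embed(text: str) -> list[float]:
    result = client.models.embed_content(model="text-embedding-004", contents=text)
    return result.embeddings[0].values

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# similarity = cosine_similarity(embed(doc_x_text), embed(doc_y_text))
```

Keep in mind that embedding models have input length limits, so very long documents may still need truncation or chunk-level averaging before this works as-is.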