Making Llama text generation deterministic

Following the text generation code template here, I’ve been trying to generate some outputs from Llama 2, but I’m running into stochastic generations.
For instance, running the same prompt through model.generate() twice produces two different outputs, as shown in the example below.

I’ve used model.generate() with other LLMs (e.g., Flan-T5) with the same parameters and obtained deterministic outputs.
I also tried AutoModelForCausalLM instead of LlamaForCausalLM, but still got a different output each time for the same prompt.

How do I make sure I get the same text generated each time?

Code to reproduce:

from transformers import AutoTokenizer, LlamaForCausalLM

model_name = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir="/data2/racball/llms")
model = LlamaForCausalLM.from_pretrained(
    model_name,
    cache_dir="/data2/racball/llms",
    device_map="sequential",
)

prompt = "What is up?"
inputs = tokenizer(prompt, return_tensors="pt")
# Generate
generate_ids = model.generate(inputs.input_ids, max_length=30)

tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

# Run1: 'What is up?\n\nI have a problem with my `docker-compose.yml` file. I have a service that should run a'
# Run2: "What is up?\n\nIt's been a while since I've posted, but I've been pretty busy with work and other"

Resolved [here](https://github.com/huggingface/transformers/issues/25507) (Is non-determinism in outputs generated by LlamaForCausalLM, the expected behavior? · Issue #25507 · huggingface/transformers · GitHub).

TL;DR: pass do_sample=False to model.generate(), i.e. replace model.generate(inputs.input_ids, max_length=30) with model.generate(inputs.input_ids, max_length=30, do_sample=False).
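
For reference, a minimal sketch that reuses model, tokenizer, and inputs from the snippet above. The greedy-decoding call is the fix from the linked issue; the seeded-sampling variant is an extra alternative I’m adding here (not from the issue), using transformers.set_seed, in case you want varied but reproducible outputs.

from transformers import set_seed

# Greedy decoding: do_sample=False forces deterministic decoding regardless of
# any sampling defaults in the checkpoint's generation config, so every run
# yields the same text.
greedy_ids = model.generate(inputs.input_ids, max_length=30, do_sample=False)
print(tokenizer.batch_decode(greedy_ids, skip_special_tokens=True)[0])

# Alternative (assumption, not from the linked issue): keep sampling but fix
# the global RNG seed so repeated runs reproduce the same sampled output.
set_seed(42)
sampled_ids = model.generate(inputs.input_ids, max_length=30, do_sample=True)
print(tokenizer.batch_decode(sampled_ids, skip_special_tokens=True)[0])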
