Following the text generation code template here, I've been trying to generate outputs from Llama 2 but keep running into stochastic generations.
For instance, running the same prompt through model.generate() twice produces two different outputs, as shown in the example below.
I've used model.generate() with other LLMs (e.g., Flan-T5) with the other parameters unchanged and obtained deterministic outputs (roughly the setup sketched after the reproduction code below).
I also tried AutoModelForCausalLM instead of LlamaForCausalLM (also sketched below), but still got different outputs each time for the same prompt.
How do I make sure I get the same text generated each time?
Code to reproduce:
from transformers import AutoTokenizer, LlamaForCausalLM
model_name = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir="/data2/racball/llms")
model = LlamaForCausalLM.from_pretrained(
    model_name,
    cache_dir="/data2/racball/llms",
    device_map="sequential",
)
prompt = "What is up?"
inputs = tokenizer(prompt, return_tensors="pt")
# Generate
generate_ids = model.generate(inputs.input_ids, max_length=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
# Run1: 'What is up?\n\nI have a problem with my `docker-compose.yml` file. I have a service that should run a'
# Run2: "What is up?\n\nIt's been a while since I've posted, but I've been pretty busy with work and other"
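
For reference, the Flan-T5 setup that gave me identical outputs on repeated runs looked roughly like the sketch below (the exact checkpoint name and max_length here are illustrative, not necessarily the ones I used):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative checkpoint name
t5_name = "google/flan-t5-xl"
t5_tokenizer = AutoTokenizer.from_pretrained(t5_name)
t5_model = AutoModelForSeq2SeqLM.from_pretrained(t5_name)

t5_inputs = t5_tokenizer("What is up?", return_tensors="pt")
t5_ids = t5_model.generate(t5_inputs.input_ids, max_length=30)
# Running this repeatedly decodes to the same text every time
print(t5_tokenizer.batch_decode(t5_ids, skip_special_tokens=True)[0])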
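
The AutoModelForCausalLM attempt was the same script with only the model class swapped:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir="/data2/racball/llms",
    device_map="sequential",
)
# Same generate() call as above; the outputs still differ between runs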