Model.generate use_cache=True generates different results than use_cache=False

Using the Hugging Face transformers library, I see different outputs when generating text with model.generate depending on whether use_cache is enabled. Is this intended, and how can I avoid it?
The scores when I use the cache (from the second generated token onwards) are different. AFAIK use_cache is an optimization that shouldn't affect the outputs. I also see this discrepancy on GPU (in the code below I use 'cpu'). Code to reproduce:

from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorWithPadding
import torch

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"

MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # overrides the Qwen model above

tokenizer = AutoTokenizer.from_pretrained(MODEL, padding_side='left')
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)  # , cache_dir="/workspace/fmrai/")

tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.pad_token = tokenizer.eos_token

model.eval()

device = 'cpu'  # 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

# Example text to generate from

prompt = "Tell me something that is very exciting"

# Format the prompt using the chat template

formatted_prompt = tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False)

# Tokenize the input

inputs = tokenizer(formatted_prompt, return_tensors="pt", padding=True)

def generate(model, tokenizer, inputs, use_cache, output_attentions=False, device='cpu'):
    with torch.no_grad():
        outputs = model.generate(
            inputs["input_ids"].to(device),
            attention_mask=inputs["attention_mask"].to(device),
            max_new_tokens=2,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=False,
            return_dict_in_generate=True,
            use_cache=use_cache,
            output_scores=True,
            # output_hidden_states=True,
            output_attentions=output_attentions,
        )
    return outputs

outputs_cache = generate(model, tokenizer, inputs, use_cache=True, device=device)
outputs_no_cache = generate(model, tokenizer, inputs, use_cache=False, device=device)
outputs_cache_attentions = generate(model, tokenizer, inputs, use_cache=True, output_attentions=True, device=device)
outputs_no_cache_attentions = generate(model, tokenizer, inputs, use_cache=False, output_attentions=True, device=device)

for i in range(2):
    print(f"Cache {i}: {outputs_cache.scores[i]}")
    print(f"No Cache {i}: {outputs_no_cache.scores[i]}")
    print(f"Cache Att {i}: {outputs_cache_attentions.scores[i]}")
    print(f"No Cache Att {i}: {outputs_no_cache_attentions.scores[i]}")


Similar case?

That case is about a problem in a custom implementation; I am seeing the problem with the official model.generate function in Hugging Face transformers.


I found a cause: numerical precision. Loading the model in torch.float32 instead of torch.bfloat16 brings the cached and non-cached scores into agreement down to the last printed digit:

#model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16) # , cache_dir="/workspace/fmrai/")
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32) # , cache_dir="/workspace/fmrai/")
Cache 0:        tensor([[ 3.6974, 22.4915, 14.4676,  ..., -5.7953, -3.8699, -7.3202]],
       device='cuda:0')
No Cache 0:     tensor([[ 3.6974, 22.4915, 14.4676,  ..., -5.7953, -3.8699, -7.3202]],
       device='cuda:0')
Cache Att 0:    tensor([[ 3.6974, 22.4915, 14.4676,  ..., -5.7953, -3.8699, -7.3202]],
       device='cuda:0')
No Cache Att 0: tensor([[ 3.6974, 22.4915, 14.4676,  ..., -5.7953, -3.8699, -7.3202]],
       device='cuda:0')
Cache 1:        tensor([[ 7.8056,  6.0996, -1.6019,  ..., -3.3531,  4.3285,  1.0036]],
       device='cuda:0')
No Cache 1:     tensor([[ 7.8056,  6.0996, -1.6019,  ..., -3.3530,  4.3285,  1.0037]],
       device='cuda:0')
Cache Att 1:    tensor([[ 7.8056,  6.0996, -1.6019,  ..., -3.3530,  4.3285,  1.0037]],
       device='cuda:0')
No Cache Att 1: tensor([[ 7.8056,  6.0996, -1.6019,  ..., -3.3530,  4.3285,  1.0037]],
       device='cuda:0')
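
For context (general background, not from the replies above): bfloat16 stores only 7 mantissa bits, i.e. roughly 2-3 significant decimal digits, and the cached and non-cached code paths perform the same attention math with differently shaped operations (one new query against cached keys/values vs. recomputing the full sequence), so rounding errors accumulate differently even though the two paths are mathematically equivalent. A minimal sketch of how coarse bfloat16 rounding is:

# Illustration only: a single float round-tripped through bfloat16 loses
# precision beyond the first few significant digits.
import torch

x = torch.tensor(4.3285, dtype=torch.float32)
print(x)                                       # tensor(4.3285)
print(x.to(torch.bfloat16).to(torch.float32))  # ~4.34, rounded to the nearest representable bfloat16

In float32 the per-operation rounding error is far smaller, which is why the scores above agree so closely once the model is loaded with torch_dtype=torch.float32.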