In recent Transformers releases, the KV cache API has been overhauled, so you will likely need to move to the new Cache implementation.
Your crash is caused by generate() inferring an empty cache_position when you resume with a prefilled cache. _cache_dependant_input_preparation then indexes cache_position[-1] and raises IndexError. Solution: use the new Cache API (DynamicCache) and pass an explicit cache_position that starts at the prefill length. Do not hand a raw tuple to past_key_values. (Hugging Face)
What’s happening, in context
- Prefill creates a KV cache of length N. Resuming should place the next K tokens at positions [N, N+K-1]. If cache_position is missing or empty, generate()'s heuristics can fail with an empty tensor and crash. This exact failure is reported by multiple users when resuming from past_key_values. (Hugging Face)
- Transformers has moved off legacy tuples. The default is now Cache classes (e.g., DynamicCache). Legacy tuple input is deprecated and more brittle. Convert tuple → Cache, edit, and pass the Cache (a short sketch follows this list). (Hugging Face)
- LLaVA-specific note. Modern LLaVA paths thread cache_position through attention updates; older forks may not accept it and force a manual decode loop. (Hugging Face)
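Before the full fix below, here are the two ingredients in isolation. This is a minimal sketch; outputs and next_ids are placeholder names for a prefill forward pass and the tokens you are about to feed.
# Minimal sketch of the two ingredients (outputs and next_ids are placeholder names)
import torch
from transformers import DynamicCache

cache = DynamicCache.from_legacy_cache(outputs.past_key_values)  # legacy tuple -> Cache
N = cache.get_seq_length()                                       # tokens already cached
K = next_ids.shape[1]                                            # tokens about to be fed
cache_position = torch.arange(N, N + K, device=next_ids.device)  # [N, ..., N+K-1]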
Drop-in fix (version-robust)
Use DynamicCache and set cache_position explicitly. Avoid in-place edits on tensors that can be view-shared.
# Requires: transformers >= 4.47, torch >= 2.2
# Docs:
# https://huggingface.co/docs/transformers/en/cache_explanation
# https://huggingface.co/docs/transformers/en/kv_cache
import torch
from transformers import DynamicCache
@torch.inference_mode()
def generate_with_steering(model, processor, image, prompt_text,
                           steering_k_list, steering_v_list, coeff_k=0.1, coeff_v=2.0):
    # 0) Build the multimodal prompt with LLaVA
    prompt = f"USER: <image>\n{prompt_text}\nASSISTANT:"
    inputs = processor(text=prompt, images=image, return_tensors='pt').to(model.device, torch.float16)

    # 1) Prefill to build the cache
    out = model(**inputs, use_cache=True, return_dict=True)
    cache = DynamicCache.from_legacy_cache(out.past_key_values)  # convert tuple -> Cache

    # 2) Edit the last-token KV per layer
    legacy = list(cache.to_legacy_cache())  # [(k, v), ...] with shapes [B, H, T, D]
    for i, (k, v) in enumerate(legacy):
        nh, hd = k.shape[1], k.shape[3]
        k2 = k.clone()
        v2 = v.clone()
        k2[0, :, -1, :] += coeff_k * steering_k_list[i].reshape(nh, hd).to(k2.dtype).to(k2.device)
        v2[0, :, -1, :] += coeff_v * steering_v_list[i].reshape(nh, hd).to(v2.dtype).to(v2.device)
        legacy[i] = (k2, v2)
    cache = DynamicCache.from_legacy_cache(tuple(legacy))  # rewrap the edits

    # 3) Seed generation with the last text token
    seed_ids = inputs["input_ids"][:, -1:]                 # K = 1
    past_len = cache.get_seq_length()                      # N
    cache_pos = torch.arange(past_len, past_len + seed_ids.shape[1],
                             device=seed_ids.device)       # [N]
    # Attention mask covering cached positions + the seed token (assumes an unpadded single prompt)
    attn_mask = torch.ones(1, past_len + seed_ids.shape[1],
                           dtype=torch.long, device=seed_ids.device)

    # 4) Resume decoding
    out_ids = model.generate(
        input_ids=seed_ids,
        attention_mask=attn_mask,        # covers past + new tokens
        past_key_values=cache,           # pass Cache object
        cache_position=cache_pos,        # explicit, avoids empty cache_position bug
        max_new_tokens=100,
        do_sample=False,
    )
    return processor.batch_decode(out_ids, skip_special_tokens=True)[0].strip()
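For context, a hypothetical call site. The model id, image path, and zero steering vectors are placeholders, and the config attribute names assume a LLaVA-1.5-style checkpoint; substitute your own extracted steering vectors.
# Hypothetical usage sketch (placeholders throughout)
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # placeholder checkpoint
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

cfg = model.config.text_config
n_kv = getattr(cfg, "num_key_value_heads", cfg.num_attention_heads)
head_dim = cfg.hidden_size // cfg.num_attention_heads
# Zero vectors are a no-op; replace them with the steering vectors you extracted.
steer_k = [torch.zeros(n_kv * head_dim) for _ in range(cfg.num_hidden_layers)]
steer_v = [torch.zeros(n_kv * head_dim) for _ in range(cfg.num_hidden_layers)]

image = Image.open("photo.jpg")  # placeholder image
print(generate_with_steering(model, processor, image, "Describe the image.", steer_k, steer_v))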
Why this works:
- cache_position must equal [N, N+K-1] for the next K tokens. You set it explicitly, so no empty tensor can be inferred. (Hugging Face)
- DynamicCache is the supported path. Converting to and from the legacy format is the documented way to make custom edits safely; a quick verification sketch follows. (Hugging Face)
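If you want to confirm the edit actually landed before resuming, a small check like this works inside the function above. orig_k_last is an assumed variable: a clone of the unedited layer-0 keys at the last position, captured before the edit loop.
# orig_k_last = legacy[0][0][0, :, -1, :].clone()   # capture this before the edit loop
edited_k_last = cache.to_legacy_cache()[0][0][0, :, -1, :]
assert not torch.equal(edited_k_last, orig_k_last), "steering edit did not modify the last-position keys"
assert cache.get_seq_length() == past_len, "editing values must not change the cached length"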
If you must keep legacy tuples for now
Passing a tuple still works if you give cache_position yourself.
# Minimal workaround for legacy tuples
past_len = past_key_values[0][0].shape[2] # seq length T from any layer's K
seed_ids = inputs["input_ids"][:, -1:]
cache_pos = torch.arange(past_len, past_len + 1, device=seed_ids.device)
output = model.generate(
    input_ids=seed_ids,
    past_key_values=past_key_values,   # legacy tuple
    cache_position=cache_pos,          # critical
    max_new_tokens=100,
    do_sample=False,
)
This sidesteps the empty-cache_position inference in generate() that triggers your exact IndexError. Multiple users have reported the same empty-tensor failure when resuming with past_key_values. (GitHub)
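If your code sometimes sees a Cache object and sometimes a legacy tuple (this varies by version), a tiny helper keeps the position arithmetic uniform. The helper name is mine, not part of the library.
# Hypothetical helper: works for both legacy tuples and Cache objects
def cache_length(past_key_values):
    # Cache classes (DynamicCache, etc.) expose get_seq_length(); legacy tuples do not.
    if hasattr(past_key_values, "get_seq_length"):
        return past_key_values.get_seq_length()
    return past_key_values[0][0].shape[2]  # legacy tuple: K has shape [B, H, T, D]

past_len = cache_length(past_key_values)
cache_pos = torch.arange(past_len, past_len + seed_ids.shape[1], device=seed_ids.device)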
If your LLaVA build rejects cache_position
Some older LLaVA forks raise TypeError: ... unexpected keyword argument 'cache_position'. In that case, bypass generate() and step a manual loop:
# Manual decode loop if your model.forward(...) lacks cache_position
tokens = inputs["input_ids"][:, -1:]                       # seed with the last text token
cache = DynamicCache.from_legacy_cache(out.past_key_values)
outs = []
for _ in range(100):
    fwd = model(input_ids=tokens, past_key_values=cache, use_cache=True)
    cache = fwd.past_key_values                            # keep the updated cache reference explicit
    next_token = fwd.logits[:, -1].argmax(-1, keepdim=True)  # greedy step
    tokens = next_token
    outs.append(next_token)
continuation = torch.cat(outs, dim=1)
text = processor.batch_decode(continuation, skip_special_tokens=True)[0]
This avoids prepare_inputs_for_generation and the failing inference branch entirely. Verify whether your fork threads cache_position through attention as in newer LLaVA code paths. (GitHub)
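One way to probe for this up front is to inspect the forward signature. This is only a heuristic, since some models accept cache_position through **kwargs.
# Heuristic check: does model.forward declare cache_position (or accept **kwargs)?
import inspect

params = inspect.signature(model.forward).parameters
accepts_cache_position = "cache_position" in params or any(
    p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
)
print("forward accepts cache_position:", accepts_cache_position)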
Detailed checklist and common pitfalls
- Give cache_position explicitly when resuming. Do not rely on inference. This prevents the empty-tensor path that causes IndexError: index -1 ... size 0. (GitHub)
- Use a Cache class. DynamicCache is the default; legacy tuples are deprecated. Convert legacy ↔ Cache for your edits. (Hugging Face)
- Edit the correct time step. For one-shot interventions, update the last prefill token: k[:, :, -1, :] and v[:, :, -1, :]. The cache docs assume absolute positions, which your cache_position aligns with. (Hugging Face)
- No in-place edits on shared storage. Clone K/V before editing to avoid view aliasing across the Cache object. The snippet above uses clone() and then rewraps. (General PyTorch + HF cache guidance.) (Hugging Face)
- Dtype/device alignment. Cast steering tensors to k.dtype/v.dtype on the same device. Avoid silent host→device copies mid-decode. (HF docs assume this.) (Hugging Face)
- Do not resend images during decode. Seed with one text token only. The cache already contains the image tokens from prefill; modern processors expand image features during prefill (see the sanity check after this list). (SemanticDiff)
- Version sensitivity. Several cache-position fixes landed across 4.44–4.49; newer releases prefer cache_position as the primary source of truth. If you stay on legacy tuples, always pass cache_position. (SemanticDiff)
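A quick sanity check on the image-token point above. cache and inputs come from the prefill step; the exact relationship depends on whether your processor or the model expands the image placeholder.
# Sanity check: the cache may hold more positions than the raw prompt tokens
n_text = inputs["input_ids"].shape[1]     # tokenized prompt length (may already include expanded image tokens)
n_cached = cache.get_seq_length()         # positions actually stored in the KV cache
print(f"prompt tokens: {n_text}, cached positions: {n_cached}")
# Newer processors expand the <image> placeholder themselves, so the two match; older code paths
# expand inside the model, so n_cached is larger (roughly +576 for LLaVA-1.5).
# Either way, seed generation from n_cached, never from the raw prompt length.
assert n_cached >= n_text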
Background: why cache_position matters
- Definition. “Cache position tracks where to insert new tokens in the attention cache. If you cached N tokens, the next K tokens use positions [N ... N+K-1].” Generation and RoPE masking depend on these absolute indices. (Hugging Face)
- API change. Transformers v4.4x made Cache classes first-class. Legacy past_key_values tuples are converted internally and may produce edge cases if you hand them back to generate() without cache_position. (GitHub)
Quick test to validate the fix
- Run prefill → resume with explicit cache_position.
- Confirm no exception.
- Generate a few tokens, then resume again using the returned cache and a new cache_position that continues from the last one (sketched below). This double-resume catches the exact bug class you hit; reports show failures on the second resume when cache_position is inferred. (GitHub)
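A sketch of that double-resume test. It assumes cache and seed_ids were built by the prefill steps above and relies on generate() returning the updated cache when return_dict_in_generate=True.
# Double-resume smoke test: resume twice with explicit cache_position each time
pos1 = torch.arange(cache.get_seq_length(), cache.get_seq_length() + seed_ids.shape[1],
                    device=seed_ids.device)
gen1 = model.generate(input_ids=seed_ids, past_key_values=cache, cache_position=pos1,
                      max_new_tokens=5, do_sample=False, return_dict_in_generate=True)

cache2 = gen1.past_key_values              # updated cache after the first resume
last_tok = gen1.sequences[:, -1:]          # last generated token (not yet in the cache)
pos2 = torch.arange(cache2.get_seq_length(), cache2.get_seq_length() + 1, device=last_tok.device)
gen2 = model.generate(input_ids=last_tok, past_key_values=cache2, cache_position=pos2,
                      max_new_tokens=5, do_sample=False)
print("double resume OK")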
References and further reading
Core docs
- Cache API overview, legacy ↔ DynamicCache conversion, and position semantics. (Hugging Face)
- Generation internals and utilities touchpoints. (Hugging Face)
Bugs and issues that match your trace
- Empty cache_position inferred → IndexError when resuming with past_key_values. (GitHub)
- Progressive generation interactions with inputs_embeds and cache_position. (GitHub)
- Iterative generation cache edge cases noted by users. (GitHub)
LLaVA specifics
- LLaVA and related multimodal code paths propagating cache_position in attention updates; older forks may lack it. (Hugging Face)
- Known LLaVA prepare/merge behaviors with image tokens. Useful when reasoning about prefill length and seeding. (GitHub)
KV-cache steering background
- Paper: “KV Cache Steering for Controlling Frozen LLMs.” One-shot KV edits that mirror your method. (arXiv)
- Repo with implementation notes and examples. (GitHub)
- Short paper page summary. (Hugging Face)
Minimal legacy-tuple patch you can try immediately
If you want to keep your code shape and only change a few lines:
# 1) After prefill
outputs = model(**inputs, use_cache=True, return_dict=True)
past_key_values = outputs.past_key_values # legacy tuple
# 2) After your in-place edits, just add:
seed = inputs["input_ids"][:, -1:]
past_len = past_key_values[0][0].shape[2] # K.shape[2] is prefill length
cache_pos = torch.arange(past_len, past_len + 1, device=seed.device)
# 3) Generate with explicit cache_position
out = model.generate(
    input_ids=seed,
    past_key_values=past_key_values,
    cache_position=cache_pos,
    max_new_tokens=100,
    do_sample=False,
)
This alone should eliminate IndexError: index -1 is out of bounds for dimension 0 with size 0. If your fork doesn’t accept cache_position, switch to the DynamicCache path or manual loop as shown above. (GitHub)
Curated extras
Implementation guides
- Cache strategies and when to initialize/pass caches manually. Good examples of explicit cache handling. (Hugging Face)
- Best practices for generation with caches, including quantized caches if you later need memory headroom during steering experiments. (Hugging Face)
Useful code to inspect
- cache_utils.py and model prepare_inputs_for_generation code paths for models that already wire cache_position. Reading these clarifies why inference sometimes picks the wrong branch. (fossies.org)