IndexError: index -1 is out of bounds for dimension 0 with size 0

I am currently experimenting with modifying the KV cache of the LLaVA model in order to perform controlled interventions during generation (similar to cache-steering methods in recent research). The goal is to alter the cached key-value tensors after the prefill phase and then continue decoding from the modified cache.

However, whenever I try to resume generation using model.generate() with my modified past_key_values, I consistently encounter the following error:
Code:

def generate_with_steering(model, processor, image, prompt_text, steering_k_list, steering_v_list, coeff_k, coeff_v):
    """
    Generates a caption with one-shot KV cache steering, as described in the paper.[1]
    """
    prompt = f"USER: <image>\n{prompt_text}\nASSISTANT:"
    inputs = processor(text=prompt, images=image, return_tensors='pt').to("cuda", torch.float16)
    
    # 1. Prefill the KV cache by running a forward pass
    with torch.no_grad():
        outputs = model(**inputs, use_cache=True)
        past_key_values = outputs.past_key_values

    # 2. Modify the KV cache object
    for i in range(len(steering_k_list)):
        k, v = past_key_values[i] 
        
        num_heads = k.shape[1]
        head_dim = k.shape[3]
        
        reshaped_k = steering_k_list[i].reshape(num_heads, head_dim)
        reshaped_v = steering_v_list[i].reshape(num_heads, head_dim)
        
        # Apply the steering vector IN-PLACE to the cache of the *last* token
        # This modifies the tensors *inside* the past_key_values object
        k[0, :, -1, :] += coeff_k * reshaped_k
        v[0, :, -1, :] += coeff_v * reshaped_v
        
    # 3. Generate text using the modified cache
    output = model.generate(
        input_ids=inputs['input_ids'][:, -1:],  
        past_key_values=past_key_values, # Pass the original, modified Cache object
        max_new_tokens=100,
        do_sample=False
    )
    
    full_response_list = processor.batch_decode(output, skip_special_tokens=True)
    # The output from generate() when using past_key_values might not include the prompt
    return full_response_list[0].strip()

Error:

--- 2. Generating WITH STEERING (k_coeff=0.1, v_coeff=2.0) ---
Traceback (most recent call last):
  File "/home/gpuuser3/Pulkit/CACHE_STEERING/kv-steering-for-vlm/src/verify_steering_vector.py", line 138, in <module>
    steered_caption = generate_with_steering(
                      ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gpuuser3/Pulkit/CACHE_STEERING/kv-steering-for-vlm/src/verify_steering_vector.py", line 85, in generate_with_steering
    output = model.generate(
             ^^^^^^^^^^^^^^^
  File "/home/gpuuser3/.pyenv/versions/kv-steering-vlm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/gpuuser3/.pyenv/versions/kv-steering-vlm/lib/python3.11/site-packages/transformers/generation/utils.py", line 2564, in generate
    result = decoding_method(
             ^^^^^^^^^^^^^^^^
  File "/home/gpuuser3/.pyenv/versions/kv-steering-vlm/lib/python3.11/site-packages/transformers/generation/utils.py", line 2781, in _sample
    model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gpuuser3/.pyenv/versions/kv-steering-vlm/lib/python3.11/site-packages/transformers/models/llava/modeling_llava.py", line 466, in prepare_inputs_for_generation
    model_inputs = super().prepare_inputs_for_generation(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gpuuser3/.pyenv/versions/kv-steering-vlm/lib/python3.11/site-packages/transformers/generation/utils.py", line 574, in prepare_inputs_for_generation
    inputs_embeds, input_ids = self._cache_dependant_input_preparation(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gpuuser3/.pyenv/versions/kv-steering-vlm/lib/python3.11/site-packages/transformers/generation/utils.py", line 476, in _cache_dependant_input_preparation
    or (cache_position[-1] >= input_ids.shape[1])  # Exception 3
        ~~~~~~~~~~~~~~^^^^
IndexError: index -1 is out of bounds for dimension 0 with size 0

In relatively newer Transformers, the KV cache specifications have been overhauled, so you’ll likely need to use the new implementation.


Your crash is caused by generate() inferring an empty cache_position when you resume with a prefilled cache. Then _cache_dependant_input_preparation does cache_position[-1] and throws IndexError. Solution: use the new Cache API (DynamicCache) and pass an explicit cache_position that starts at the prefill length. Do not hand a raw tuple to past_key_values. (Hugging Face)

What’s happening, in context

  • Prefill creates a KV cache of length N. Resuming should place the next K tokens at positions [N, N+K-1]. If cache_position is missing or empty, generate()’s heuristics can fail with an empty tensor and crash; a minimal reproduction of that failing indexing follows this list. This exact failure is reported by multiple users when resuming from past_key_values. (Hugging Face)
  • Transformers moved off legacy tuples. The default is now Cache classes (e.g., DynamicCache). Legacy tuple input is deprecated and more brittle. Convert tuple → Cache, edit, and pass the Cache. (Hugging Face)
  • LLaVA-specific note. Modern LLaVA paths thread cache_position through attention updates; older forks may not accept it and force a manual decode loop. (Hugging Face)
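
A minimal reproduction of the failing indexing, outside of generate() (purely illustrative; the real tensor is built by the position-inference heuristic inside Transformers):

import torch

# An empty position tensor is what the heuristic can end up producing on resume ...
empty_pos = torch.tensor([], dtype=torch.long)
try:
    empty_pos[-1]                     # ... and indexing it reproduces the crash
except IndexError as e:
    print(e)                          # index -1 is out of bounds for dimension 0 with size 0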

Drop-in fix (version-robust)

Use DynamicCache and set cache_position explicitly. Avoid in-place edits on tensors that can be view-shared.

# Requires: transformers >= 4.47, torch >= 2.2
# Docs:
#   https://huggingface.co/docs/transformers/en/cache_explanation
#   https://huggingface.co/docs/transformers/en/kv_cache

import torch
from transformers import DynamicCache

@torch.inference_mode()
def generate_with_steering(model, processor, image, prompt_text,
                           steering_k_list, steering_v_list, coeff_k=0.1, coeff_v=2.0):
    # 0) Build the multimodal prompt with LLaVA
    prompt = f"USER: <image>\n{prompt_text}\nASSISTANT:"
    inputs = processor(text=prompt, images=image, return_tensors='pt').to(model.device, torch.float16)

    # 1) Prefill to build cache
    out = model(**inputs, use_cache=True, return_dict=True)
    cache = DynamicCache.from_legacy_cache(out.past_key_values)  # convert tuple -> Cache

    # 2) Edit last-token KV per layer
    legacy = list(cache.to_legacy_cache())  # [(k,v), ...] with shapes [B, H, T, D]
    for i, (k, v) in enumerate(legacy):
        nh, hd = k.shape[1], k.shape[3]
        k2 = k.clone()
        v2 = v.clone()
        k2[0, :, -1, :] += coeff_k * steering_k_list[i].reshape(nh, hd).to(k2.dtype).to(k2.device)
        v2[0, :, -1, :] += coeff_v * steering_v_list[i].reshape(nh, hd).to(v2.dtype).to(v2.device)
        legacy[i] = (k2, v2)
    cache = DynamicCache.from_legacy_cache(tuple(legacy))  # rewrap edits

    # 3) Seed generation with the last text token
    seed_ids = inputs["input_ids"][:, -1:]                  # K = 1
    past_len = cache.get_seq_length()                       # N
    cache_pos = torch.arange(past_len, past_len + seed_ids.shape[1],
                             device=seed_ids.device)        # [N]

    # 4) Resume decoding
    out_ids = model.generate(
        input_ids=seed_ids,
        past_key_values=cache,      # pass Cache object
        cache_position=cache_pos,   # explicit, avoids empty cache_position bug
        max_new_tokens=100,
        do_sample=False,
    )
    return processor.batch_decode(out_ids, skip_special_tokens=True)[0].strip()

Why this works:

  • cache_position must equal [N, N+K-1] for the next tokens. You set it explicitly, so no empty tensor can be inferred. (Hugging Face)
  • DynamicCache is the supported path. Converting to and from legacy is the documented way to do custom edits safely. (Hugging Face)

If you must keep legacy tuples for now

Passing a tuple still works if you give cache_position yourself.

# Minimal workaround for legacy tuples
past_len = past_key_values[0][0].shape[2]          # seq length T from any layer's K
seed_ids = inputs["input_ids"][:, -1:]
cache_pos = torch.arange(past_len, past_len + 1, device=seed_ids.device)

output = model.generate(
    input_ids=seed_ids,
    past_key_values=past_key_values,  # legacy tuple
    cache_position=cache_pos,         # critical
    max_new_tokens=100,
    do_sample=False,
)

This sidesteps the empty-cache_position inference in generate() that triggers your exact IndexError. Multiple users have reported the same empty-tensor failure when resuming with past_key_values. (GitHub)


If your LLaVA build rejects cache_position

Some older LLaVA forks raise TypeError: ... unexpected keyword argument 'cache_position'. In that case, bypass generate() and step a manual loop:

# Manual greedy decode loop if your model.forward(...) lacks cache_position
tokens = inputs["input_ids"][:, -1:]
cache = DynamicCache.from_legacy_cache(out.past_key_values)
# (apply the same per-layer steering edits to `cache` here as in step 2 above)
outs = []
for _ in range(100):
    fwd = model(input_ids=tokens, past_key_values=cache, use_cache=True)  # cache is updated in place
    next_token = fwd.logits[:, -1].argmax(-1, keepdim=True)
    outs.append(next_token)
    if next_token.item() == processor.tokenizer.eos_token_id:
        break
    tokens = next_token
continuation = torch.cat(outs, dim=1)

This avoids prepare_inputs_for_generation and the failing inference branch entirely. Verify whether your fork threads cache_position through attention as in newer LLaVA code paths. (GitHub)


Detailed checklist and common pitfalls

  • Give cache_position explicitly when resuming. Do not rely on inference. This prevents the empty-tensor path that causes IndexError: index -1 ... size 0. (GitHub)
  • Use a Cache class. DynamicCache is default; legacy tuples are deprecated. Convert legacy ↔ cache for your edits. (Hugging Face)
  • Edit the correct time step. For one-shot interventions, update the last prefill token: k[:, :, -1, :] and v[:, :, -1, :]. Cache docs assume absolute positions, which your cache_position aligns with. (Hugging Face)
  • No in-place on shared storage. Clone K/V before editing to avoid view aliasing across the Cache object. The snippet above uses clone() then rewraps. (General PyTorch + HF cache guidance.) (Hugging Face)
  • Dtype/device alignment. Cast steering tensors to k.dtype/v.dtype on the same device, and avoid silent host→device copies mid-decode; a short sketch of this clone-and-cast pattern follows the list. (HF docs assume this.) (Hugging Face)
  • Do not resend images during decode. Seed with one text token only. The cache already contains the image tokens from prefill; modern processors expand image features during prefill. (SemanticDiff)
  • Version sensitivity. Several cache-position fixes landed across 4.44–4.49; newer releases prefer cache_position as the primary source of truth. If you stay on legacy tuples, always pass cache_position. (SemanticDiff)
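
A short sketch of the clone-and-cast pattern from the checklist, assuming k and v are one layer’s cached tensors of shape [B, H, T, D] and steer_k / steer_v are hypothetical steering tensors of shape [num_heads, head_dim]:

# Clone first so the edit cannot leak through view-shared storage,
# then cast the steering tensor to the cache's dtype and device before adding.
k2, v2 = k.clone(), v.clone()
k2[0, :, -1, :] += coeff_k * steer_k.to(dtype=k2.dtype, device=k2.device)
v2[0, :, -1, :] += coeff_v * steer_v.to(dtype=v2.dtype, device=v2.device)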

Background: why cache_position matters

  • Definition. “Cache position tracks where to insert new tokens in the attention cache. If you cached N tokens, the next K tokens use positions [N ... N+K-1].” Generation and RoPE masking depend on these absolute indices; a concrete example follows this list. (Hugging Face)
  • API change. Transformers v4.4x made Cache classes first-class. Legacy past_key_values tuples are converted internally and may produce edge cases if you hand them back during generate() without cache_position. (GitHub)
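
Concretely, for the definition above (the prefill length of 600 is a made-up number; in practice read it from cache.get_seq_length() or from K.shape[2]):

import torch

N, K = 600, 1                              # hypothetical: 600 cached prefill tokens, 1 new seed token
cache_position = torch.arange(N, N + K)    # tensor([600]), i.e. positions [N, N+K-1]
# With K = 3 the next step would use torch.arange(N, N + 3) -> tensor([600, 601, 602])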

Quick test to validate the fix

  1. Run prefill → resume with explicit cache_position.
  2. Confirm no exception.
  3. Generate a few tokens, then resume again using the returned cache and a new cache_position that continues from the last one. This double-resume catches the exact bug class you hit; reports show failures on the second resume when cache_position is inferred. A sketch of this check follows. (GitHub)
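
A minimal sketch of that double-resume check, reusing the names from the drop-in fix above (model, cache, seed_ids, cache_pos); return_dict_in_generate=True is used so the first call also returns the updated cache (use_cache stays at its default):

# First resume: decode a few tokens and keep the returned cache.
out1 = model.generate(
    input_ids=seed_ids, past_key_values=cache, cache_position=cache_pos,
    max_new_tokens=10, do_sample=False, return_dict_in_generate=True,
)

# Second resume: continue from where the first call left off.
cache2 = out1.past_key_values
next_seed = out1.sequences[:, -1:]                 # last generated token, not yet in the cache
past_len2 = cache2.get_seq_length()
cache_pos2 = torch.arange(past_len2, past_len2 + 1, device=next_seed.device)
out2 = model.generate(
    input_ids=next_seed, past_key_values=cache2, cache_position=cache_pos2,
    max_new_tokens=10, do_sample=False,
)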

References and further reading

Core docs

  • Cache API overview, legacy ↔ DynamicCache, and position semantics. (Hugging Face)
  • Generation internals and utilities touchpoints. (Hugging Face)

Bugs and issues that match your trace

  • Empty cache_position inferred → IndexError when resuming with past_key_values. (GitHub)
  • Progressive generation interactions with inputs_embeds and cache_position. (GitHub)
  • Iterative generation cache edge cases noted by users. (GitHub)

LLaVA specifics

  • LLaVA and related multimodal code paths propagating cache_position in attention updates; older forks may lack it. (Hugging Face)
  • Known LLaVA prepare/merge behaviors with image tokens. Useful when reasoning about prefill length and seeding. (GitHub)

KV-cache steering background

  • Paper: “KV Cache Steering for Controlling Frozen LLMs.” One-shot KV edits that mirror your method. (arXiv)
  • Repo with implementation notes and examples. (GitHub)
  • Short paper page summary. (Hugging Face)

Minimal legacy-tuple patch you can try immediately

If you want to keep your code shape and only change a few lines:

# 1) After prefill
outputs = model(**inputs, use_cache=True, return_dict=True)
past_key_values = outputs.past_key_values  # legacy tuple

# 2) After your in-place edits, just add:
seed = inputs["input_ids"][:, -1:]
past_len = past_key_values[0][0].shape[2]   # K.shape[2] is prefill length
cache_pos = torch.arange(past_len, past_len + 1, device=seed.device)

# 3) Generate with explicit cache_position
out = model.generate(
    input_ids=seed,
    past_key_values=past_key_values,
    cache_position=cache_pos,
    max_new_tokens=100,
    do_sample=False,
)

This alone should eliminate IndexError: index -1 is out of bounds for dimension 0 with size 0. If your fork doesn’t accept cache_position, switch to the DynamicCache path or manual loop as shown above. (GitHub)


Curated extras

Implementation guides

  • Cache strategies and when to initialize/pass caches manually. Good examples of explicit cache handling. (Hugging Face)
  • Best practices for generation with caches, including quantized caches if you later need memory headroom during steering experiments. (Hugging Face)

Useful code to inspect

  • cache_utils.py and model prepare_inputs_for_generation code paths for models that already wire cache_position. Reading these clarifies why inference sometimes picks the wrong branch. (fossies.org)

Thanks for the amazing approach! Just one small doubt about how this code works: if I pass coeff_v and coeff_k as 0, shouldn’t the response match the baseline response I get from a plain generate() call? Currently I’ve made just one small change to the code you provided, adding the attention_mask to the call like:

    out_ids = model.generate(
        input_ids=seed_ids,
        past_key_values=cache,      # pass Cache object
        cache_position=cache_pos,   # explicit, avoids empty cache_position bug
        attention_mask=inputs['attention_mask'],
        max_new_tokens=100,
        do_sample=False,
    )
    return processor.batch_decode(out_ids, skip_special_tokens=True)[0].strip()

but the baseline and steered results do not match


Due to attention_mask?


Yes. With coeff_k = coeff_v = 0, the continuation should match the baseline, and the mismatch comes from passing an un-updated attention_mask. When you resume with a prefilled cache, the mask must cover past + current tokens. Supplying the original mask of length N while giving input_ids of length 1 and cache_position=N changes how masks and positions are computed, so the logits differ. Either omit attention_mask or extend it to length N+1 before calling generate(). (Hugging Face)

Fix

Extend the mask by one and pass it. Keep everything else identical.

# assumes: cache = DynamicCache(...); seed_ids = inputs["input_ids"][:, -1:]
past_len = cache.get_seq_length()  # N

# attention_mask must represent past + new tokens
attn = torch.cat(
    [inputs["attention_mask"], inputs["attention_mask"].new_ones((inputs["attention_mask"].size(0), seed_ids.size(1)))],
    dim=-1
)

cache_pos = torch.arange(past_len, past_len + seed_ids.shape[1], device=seed_ids.device)

out_ids = model.generate(
    input_ids=seed_ids,
    past_key_values=cache,
    cache_position=cache_pos,
    attention_mask=attn,        # now length N+1
    max_new_tokens=100,
    do_sample=False,
)

Why this is required: when reusing a cache, the attention module expects a mask whose length equals past_kv_length + new_tokens_length. Hugging Face’s cache docs show this explicitly and demonstrate appending 1s to the mask each step; PRs around static/FlashAttention also note that incorrect or partial masks yield wrong generations. (Hugging Face)
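
A sketch of that invariant in a manual loop, assuming (as in the fix above) that the processor’s mask length matches the cached sequence length; one 1 is appended per decoded token so the mask always covers past + current:

attn = inputs["attention_mask"]
tokens = seed_ids
for _ in range(3):                                 # a few steps, just to show the bookkeeping
    attn = torch.cat([attn, attn.new_ones((attn.size(0), 1))], dim=-1)
    past_len = cache.get_seq_length()
    pos = torch.arange(past_len, past_len + 1, device=tokens.device)
    fwd = model(input_ids=tokens, attention_mask=attn, past_key_values=cache,
                cache_position=pos, use_cache=True)
    tokens = fwd.logits[:, -1].argmax(-1, keepdim=True)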

Minimal parity checklist

If coeff_k = coeff_v = 0, verify these to get exact parity with the baseline:

  1. Mask length: attention_mask.shape[1] == past_len + current_len for the resume call. If unsure, don’t pass the mask at all and let generate() infer it, but be consistent across baseline and steered paths. (Hugging Face)
  2. Positions: cache_position = arange(past_len, past_len + current_len). Off-by-one breaks equality. (Hugging Face)
  3. Eval mode: ensure model.eval() for both runs to eliminate dropout differences. (Hugging Face)
  4. Cache integrity: when “editing with zeros,” still rewrap a cloned cache to avoid aliasing. Do not mutate views in place. (HF cache doc recommends careful shape/concat semantics.) (Hugging Face)
  5. Same seed token: seed with exactly the final text token from prefill. Do not resend image features; LLaVA merges image tokens during prefill and tracks positions relative to those tokens. Mismatched merge logic or masks around image tokens will alter positions. (Hugging Face)

Quick A/B to confirm

Run:

# Baseline
baseline = model.generate(**inputs, max_new_tokens=100, do_sample=False)

# Resume path with coeffs = 0 and fixed mask
steered0 = out_ids  # from the snippet above

assert torch.equal(baseline[:, inputs["input_ids"].shape[1]:], steered0[:, 1:]), "Mismatch after resume"

If the assertion fails, print and compare cache.get_seq_length(), attn.shape, and cache_position.
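
For example, using the names from the snippet above:

print("cache length  :", cache.get_seq_length())
print("mask shape    :", tuple(attn.shape))
print("cache_position:", cache_pos)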

References

  • HF cache guide: mask must be past+current; example shows appending 1 and advancing cache_position. Also explains absolute position semantics. (Hugging Face)
  • Static/FlashAttention note: wrong or incomplete mask leads to wrong generations. (SemanticDiff)
  • Reports of cache_position inference pitfalls when resuming, which you already avoided by setting it explicitly. (GitHub)
  • LLaVA code path that remaps text positions around image tokens, so masks/positions must stay consistent when resuming. (Hugging Face)