Hello! :wave: I’m benchmarking inference performance using Whisper and the .generate() method, switching between using and not using the k-v cache. My understanding is that when using the cache, inference should be faster (since we don’t recompute k-v states and cache them instead), but VRAM usage hi…

Nice write-up! I think the decoder sequence length and the hidden states of the model might be too small to see a difference in VRAM here. VRAM should be higher when caching the k,v states because we cache the projected k,v states of every layer. This means that our cache is of size: …
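To get a feel for the sizes involved, here is a back-of-the-envelope sketch of the cache footprint. It is an estimate under stated assumptions (one key and one value tensor per decoder layer, each of shape batch x seq_len x d_model, stored in fp16), and not necessarily the exact expression the reply above goes on to give. The function name `kv_cache_bytes` is made up for illustration; the Whisper dimensions plugged in (4 decoder layers and d_model 384 for tiny, 32 layers and d_model 1280 for large, 448 maximum decoder positions, 1500 encoder positions) are taken from the published model configs.

```python
def kv_cache_bytes(num_layers, d_model, seq_len, batch_size=1, bytes_per_elem=2):
    """Rough size in bytes of the cached keys and values for one attention type.

    Assumption: each layer stores one K and one V tensor of shape
    (batch_size, seq_len, d_model); bytes_per_elem=2 corresponds to fp16.
    """
    return 2 * num_layers * batch_size * seq_len * d_model * bytes_per_elem


MiB = 1024 ** 2

# Decoder self-attention cache at the maximum decoder length (448 tokens).
tiny_self = kv_cache_bytes(num_layers=4, d_model=384, seq_len=448)
large_self = kv_cache_bytes(num_layers=32, d_model=1280, seq_len=448)

# If the implementation also caches cross-attention K/V, those cover the
# 1500 encoder positions instead of the decoded tokens.
large_cross = kv_cache_bytes(num_layers=32, d_model=1280, seq_len=1500)

print(f"tiny  self-attention cache:  {tiny_self / MiB:.3f} MiB")
print(f"large self-attention cache:  {large_self / MiB:.3f} MiB")
print(f"large cross-attention cache: {large_cross / MiB:.3f} MiB")
```

Under these assumptions the self-attention cache stays in the range of a few MiB to tens of MiB for a single sequence, which is small compared with the model weights and could plausibly be hard to spot in a coarse VRAM reading.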

Generate: using k-v cache is faster but no difference to memory usage

🤗Transformers

vhr1007 June 3, 2025, 9:25pm 6

Good analysis, but in general you need to monitor the peak CUDA memory allocation (in PyTorch, `torch.cuda.max_memory_allocated()`) to find the memory choke point of an inference call. That value tells you the real peak VRAM usage.
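A minimal sketch of that kind of measurement, assuming PyTorch: reset the peak-memory counter, run the call, then read the peak. The helper name `peak_cuda_bytes` is hypothetical, and the guard around the import is only there so the snippet also runs on a machine without PyTorch or without a GPU (it then returns `None`).

```python
try:
    import torch
except ImportError:  # keep the sketch usable where PyTorch is not installed
    torch = None


def peak_cuda_bytes(fn):
    """Run fn() and return the peak CUDA memory allocated during it, in bytes.

    Returns None (after still calling fn) when PyTorch or CUDA is unavailable.
    """
    if torch is None or not torch.cuda.is_available():
        fn()
        return None
    torch.cuda.synchronize()              # finish pending work before measuring
    torch.cuda.reset_peak_memory_stats()  # start the peak counter from the current level
    fn()
    torch.cuda.synchronize()              # make sure the measured call has completed
    return torch.cuda.max_memory_allocated()
```

It could be used as `peak_cuda_bytes(lambda: model.generate(input_features, use_cache=True))` and again with `use_cache=False`, where `model` and `input_features` stand in for whatever the benchmark already loads. Note that `max_memory_allocated` counts tensor allocations, not the memory reserved by the caching allocator, so it will usually read lower than what `nvidia-smi` shows.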

