Thank you very much for the detailed response!
It makes sense that the difference in VRAM with/without the cache is not significant for a model with such low dimensionality.
Repeating the experiment with the large-v2 checkpoint (hidden_size=1280, num_layers=32) and generating to 256 tokens yields a measurable, albeit still marginal, difference in VRAM:
VRAM with: 7597
VRAM without: 7515
Diff: 82
(all values in MB)
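For reference, a measurement like this can be set up roughly as follows. This is a minimal sketch, assuming the `transformers` `WhisperForConditionalGeneration` API, fp32 weights, a dummy 30 s silent input, and `torch.cuda.max_memory_allocated` for the peak reading (which will differ somewhat from an `nvidia-smi` figure):

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

device = "cuda"
checkpoint = "openai/whisper-large-v2"
model = WhisperForConditionalGeneration.from_pretrained(checkpoint).to(device)
processor = WhisperProcessor.from_pretrained(checkpoint)

# Dummy input: 30 s of silence at 16 kHz as a stand-in for a real audio sample
inputs = processor(torch.zeros(30 * 16000).numpy(), sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device)

def peak_vram_mb(use_cache, num_tokens):
    """Peak allocated VRAM (MB) for a single generation pass."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model.generate(
            input_features,
            use_cache=use_cache,
            min_new_tokens=num_tokens,  # suppress early EOS so we reach the target length
            max_new_tokens=num_tokens,
        )
    return torch.cuda.max_memory_allocated() / 1024**2

with_cache = peak_vram_mb(use_cache=True, num_tokens=256)
without_cache = peak_vram_mb(use_cache=False, num_tokens=256)
print(f"with: {with_cache:.0f} MB, without: {without_cache:.0f} MB, diff: {with_cache - without_cache:.0f} MB")
```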
As expected, the effect is amplified at 512 tokens, scaling (almost) linearly with decoder_length:
VRAM with: 7639
VRAM without: 7519
Diff: 120
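For intuition about where that diff comes from: the decoder self-attention cache holds one key and one value tensor per layer, each of shape (decoder_length, hidden_size) per batch item, so it grows linearly with the number of generated tokens. A back-of-envelope estimate (assuming fp32, batch size 1, and ignoring the cross-attention cache, whose size is fixed by the encoder sequence length):

```python
# Rough size of the decoder self-attention k-v cache for large-v2 (fp32 assumed)
num_layers, hidden_size, bytes_per_elem = 32, 1280, 4

def kv_cache_mb(decoder_length):
    # 2 tensors (K and V) per layer, each of shape (decoder_length, hidden_size)
    return 2 * num_layers * decoder_length * hidden_size * bytes_per_elem / 1024**2

print(kv_cache_mb(256), kv_cache_mb(512))  # 80.0 MB, 160.0 MB
```

The measured diffs won't match this exactly, since the no-cache path has its own peak-activation profile and the CUDA allocator rounds allocations, but it gives the right order of magnitude and the same linear scaling with decoder length.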
ASR models tend to generate quite short decoder sequences. For example, the average transcript length in the LibriSpeech validation corpus is just ~20 tokens. Setting the max length accordingly, we get:
VRAM with: 7515
VRAM without: 7511
Diff: 4
So pretty insignificant! My intuition is that since the VRAM difference with/without cache is proportional to decoder length, the k-v cache doesn't have a big effect on VRAM for ASR models, even for larger checkpoints.
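For reference, the ~20-token average can be estimated with something like the sketch below. It assumes the `librispeech_asr` dataset on the Hub and the Whisper tokenizer, and it counts the tokenizer's special/prefix tokens as well, so the exact figure will vary slightly with the setup:

```python
from datasets import load_dataset
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v2")
# Stream the validation split rather than caching it locally
ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)

# Tokenize each reference transcript and average the lengths
lengths = [len(tokenizer(sample["text"]).input_ids) for sample in ds]
print(f"average decoder length: {sum(lengths) / len(lengths):.1f} tokens")
```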