Thank you very much for the detailed response!
It makes sense that the difference in VRAM with/without the cache is not significant for a model with such low dimensionality.
Repeating the experiment with the large-v2 checkpoint (hidden_size=1280, num_layers=32) and generating to 256 tokens yields a measurable, albeit still marginal, difference in VRAM:
VRAM with: 7597
VRAM without: 7515
Diff: 82
(all values in MB)
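For reference, a measurement like this can be set up roughly as follows. This is a minimal sketch, assuming the `transformers` `WhisperForConditionalGeneration` API, fp32 weights, a dummy 30 s silent input, and `torch.cuda.max_memory_allocated` for the peak reading (which will differ somewhat from an `nvidia-smi` figure):

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

device = "cuda"
checkpoint = "openai/whisper-large-v2"
model = WhisperForConditionalGeneration.from_pretrained(checkpoint).to(device)
processor = WhisperProcessor.from_pretrained(checkpoint)

# Dummy input: 30 s of silence at 16 kHz as a stand-in for a real audio sample
inputs = processor(torch.zeros(30 * 16000).numpy(), sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device)

def peak_vram_mb(use_cache, num_tokens):
    """Peak allocated VRAM (MB) for a single generation pass."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model.generate(
            input_features,
            use_cache=use_cache,
            min_new_tokens=num_tokens,  # suppress early EOS so we reach the target length
            max_new_tokens=num_tokens,
        )
    return torch.cuda.max_memory_allocated() / 1024**2

with_cache = peak_vram_mb(use_cache=True, num_tokens=256)
without_cache = peak_vram_mb(use_cache=False, num_tokens=256)
print(f"with: {with_cache:.0f} MB, without: {without_cache:.0f} MB, diff: {with_cache - without_cache:.0f} MB")
```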
As expected, the effect is amplified at 512 tokens, scaling (almost) linearly with decoder_length:
VRAM with: 7639
VRAM without: 7519
Diff: 120
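For intuition about where that diff comes from: the decoder self-attention cache holds one key and one value tensor per layer, each of shape (decoder_length, hidden_size) per batch item, so it grows linearly with the number of generated tokens. A back-of-envelope estimate (assuming fp32, batch size 1, and ignoring the cross-attention cache, whose size is fixed by the encoder sequence length):

```python
# Rough size of the decoder self-attention k-v cache for large-v2 (fp32 assumed)
num_layers, hidden_size, bytes_per_elem = 32, 1280, 4

def kv_cache_mb(decoder_length):
    # 2 tensors (K and V) per layer, each of shape (decoder_length, hidden_size)
    return 2 * num_layers * decoder_length * hidden_size * bytes_per_elem / 1024**2

print(kv_cache_mb(256), kv_cache_mb(512))  # 80.0 MB, 160.0 MB
```

The measured diffs won't match this exactly, since the no-cache path has its own peak-activation profile and the CUDA allocator rounds allocations, but it gives the right order of magnitude and the same linear scaling with decoder length.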
ASR models tend to generate quite short decoder sequences. For example, the average transcript length in the LibriSpeech validation corpus is just ~20 tokens. Setting the max length accordingly, we get:
VRAM with: 7515
VRAM without: 7511
Diff: 4
So pretty insignificant! My intuition is that since the VRAM difference with/without cache is proportional to decoder length, the k-v cache doesn't have a big effect on VRAM for ASR models, even for larger checkpoints.
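For reference, the ~20-token average can be estimated with something like the sketch below. It assumes the `librispeech_asr` dataset on the Hub and the Whisper tokenizer, and it counts the tokenizer's special/prefix tokens as well, so the exact figure will vary slightly with the setup:

```python
from datasets import load_dataset
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v2")
# Stream the validation split rather than caching it locally
ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)

# Tokenize each reference transcript and average the lengths
lengths = [len(tokenizer(sample["text"]).input_ids) for sample in ds]
print(f"average decoder length: {sum(lengths) / len(lengths):.1f} tokens")
```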