Hi
In most self-attention layers there is a variable named past_key_value that stores the keys and values from previous steps (decoder-only setting).
What is the purpose of storing them?
Why is it concatenated to the current keys and values in a unidirectional setup? (see RoBERTa; the snippet below shows the pattern I mean)
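For reference, here is a minimal sketch of the concatenation I am asking about, simplified from the self-attention forward pass; the shapes and sizes are made up for illustration and are not the real config values:

```python
import torch

# Illustrative shapes only (batch, heads, head_dim are assumptions).
batch, heads, head_dim = 1, 12, 64

# Keys/values cached from previously processed tokens (e.g. 5 past positions).
past_key = torch.randn(batch, heads, 5, head_dim)
past_value = torch.randn(batch, heads, 5, head_dim)
past_key_value = (past_key, past_value)

# Keys/values computed for the current step (e.g. 1 new position).
key_layer = torch.randn(batch, heads, 1, head_dim)
value_layer = torch.randn(batch, heads, 1, head_dim)

# The part I am asking about: past keys/values are concatenated with the
# current ones along the sequence axis, so the current query attends over
# past + current positions.
if past_key_value is not None:
    key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
    value_layer = torch.cat([past_key_value[1], value_layer], dim=2)

print(key_layer.shape)    # torch.Size([1, 12, 6, 64])
print(value_layer.shape)  # torch.Size([1, 12, 6, 64])
```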
How is the attention mask handled in this case?
Thank you.