Hi
In most self-attention layers there is a variable named past_key_value that stores the keys and values from previous steps (decoder-only setting).
What is the purpose of storing them?
Why is it concatenated to the current keys and values in a unidirectional setup? (see RoBERTa; the snippet below shows the pattern I mean)
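For reference, here is a minimal sketch of the concatenation I am asking about, simplified from the self-attention forward pass; the shapes and sizes are made up for illustration and are not the real config values:

```python
import torch

# Illustrative shapes only (batch, heads, head_dim are assumptions).
batch, heads, head_dim = 1, 12, 64

# Keys/values cached from previously processed tokens (e.g. 5 past positions).
past_key = torch.randn(batch, heads, 5, head_dim)
past_value = torch.randn(batch, heads, 5, head_dim)
past_key_value = (past_key, past_value)

# Keys/values computed for the current step (e.g. 1 new position).
key_layer = torch.randn(batch, heads, 1, head_dim)
value_layer = torch.randn(batch, heads, 1, head_dim)

# The part I am asking about: past keys/values are concatenated with the
# current ones along the sequence axis, so the current query attends over
# past + current positions.
if past_key_value is not None:
    key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
    value_layer = torch.cat([past_key_value[1], value_layer], dim=2)

print(key_layer.shape)    # torch.Size([1, 12, 6, 64])
print(value_layer.shape)  # torch.Size([1, 12, 6, 64])
```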
How is the attention mask handled in this case?
Thank you.