Role of past_key_value in self-attention


In most self-attention layers there is a variable named past_key_value that stores the keys and values from previous decoding steps (decoder-only setup).

What is the purpose of storing them?
Why is it concatenated with the current keys and values in a unidirectional setup? (see, e.g., the RoBERTa implementation)
How is the attention mask handled in this case?
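For context, here is my rough understanding of the concatenation step as a minimal single-head sketch (the names, shapes, and helper function are my own, not taken from the library). During generation, only the current token's query is computed, while keys and values from earlier tokens are reused from the cache:

```python
import numpy as np

def attend(q, k, v):
    # q: (1, d); k, v: (t, d) — single-head scaled dot-product attention
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v

rng = np.random.default_rng(0)
d = 4
# cached keys/values for 3 previously generated tokens (the "past")
past_k = rng.normal(size=(3, d))
past_v = rng.normal(size=(3, d))
# key/value/query computed for the current token only
new_k = rng.normal(size=(1, d))
new_v = rng.normal(size=(1, d))
q = rng.normal(size=(1, d))

# concatenate past and current along the sequence axis,
# so the current query attends over all 4 positions
k = np.concatenate([past_k, new_k], axis=0)
v = np.concatenate([past_v, new_v], axis=0)
out = attend(q, k, v)
print(out.shape)
```

Since the current token may attend to every cached position in a unidirectional model, no causal masking of the past seems necessary at this step, which is part of what I am asking about.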

Thank you.