Self-attention query vs. key size in GPT-2

For self-attention (as opposed to cross-attention), I’d expect the query and key sizes to always be the same.

Is there a reason why they could be different sizes here?

During generation, the query length is typically 1: it corresponds to the most recently predicted token being fed back through the model, attending to the cached past tokens (and itself). The keys and values, on the other hand, will have length > 1, as they correspond to the past tokens that have already been generated. The keys and values always have the same length as each other.
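
Here is a minimal sketch of what that looks like inside a single attention head, with made-up sizes (1 new token attending over 5 cached positions) just to show that the query and key lengths don't have to match:

```python
import math
import torch

# Toy single-head attention with a cached past: the new token's query (length 1)
# attends over keys/values for all 5 tokens seen so far (4 cached past tokens + itself).
d = 8                                  # head dimension (illustrative)
q = torch.randn(1, 1, d)               # (batch, query_len=1, d)
k = torch.randn(1, 5, d)               # (batch, key_len=5, d)
v = torch.randn(1, 5, d)               # (batch, key_len=5, d) -- same length as keys

scores = q @ k.transpose(-1, -2) / math.sqrt(d)   # (1, 1, 5): one score per cached position
out = scores.softmax(dim=-1) @ v                  # (1, 1, d): one output row per query
print(out.shape)
```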

There are cases where all three of them will have the same length: when you pass a full set of input_ids to the model (e.g., the initial prompt), the query, key, and value lengths are all equal for that one forward pass. After that, the query length goes back to 1.
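
You can see both cases by inspecting the cache shapes with transformers. This is a rough sketch assuming the legacy tuple-style `past_key_values` layout, where each layer's key tensor has shape (batch, num_heads, seq_len, head_dim); the exact indexing may differ with newer cache classes:

```python
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Prompt pass: query, key, and value lengths all equal the prompt length.
input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
out = model(input_ids, use_cache=True)
print(input_ids.shape[-1])                        # query length == prompt length
print(out.past_key_values[0][0].shape[-2])        # key length == prompt length

# Incremental pass: feed only the newest token, so the query length is 1,
# while the cached keys/values cover all previous tokens plus the new one.
next_token = out.logits[:, -1:].argmax(-1)        # (batch, 1)
out = model(next_token, past_key_values=out.past_key_values, use_cache=True)
print(next_token.shape[-1])                       # query length == 1
print(out.past_key_values[0][0].shape[-2])        # key length == prompt length + 1
```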