Self-attention query vs. key size in GPT-2

For self-attention (as opposed to cross-attention), I’d expect the query and key sizes to always be the same.

Is there a reason why they could be different sizes here?

During generation, the query length is typically 1: it corresponds to the most recently predicted token being fed back through the model, attending to the cached past tokens (and itself). The keys and values, on the other hand, will have length > 1, as they correspond to the past tokens that have already been generated. The keys and values always have the same length as each other.
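
Here is a minimal sketch of what that looks like inside a single attention head, with made-up sizes (1 new token attending over 5 cached positions) just to show that the query and key lengths don't have to match:

```python
import math
import torch

# Toy single-head attention with a cached past: the new token's query (length 1)
# attends over keys/values for all 5 tokens seen so far (4 cached past tokens + itself).
d = 8                                  # head dimension (illustrative)
q = torch.randn(1, 1, d)               # (batch, query_len=1, d)
k = torch.randn(1, 5, d)               # (batch, key_len=5, d)
v = torch.randn(1, 5, d)               # (batch, key_len=5, d) -- same length as keys

scores = q @ k.transpose(-1, -2) / math.sqrt(d)   # (1, 1, 5): one score per cached position
out = scores.softmax(dim=-1) @ v                  # (1, 1, d): one output row per query
print(out.shape)
```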

There are cases where all three of them will have the same length: when you pass a full set of input_ids to the model (e.g., the initial prompt), the query, key, and value lengths are all equal for that one forward pass. After that, the query length goes back to 1.
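
You can see both cases by inspecting the cache shapes with transformers. This is a rough sketch assuming the legacy tuple-style `past_key_values` layout, where each layer's key tensor has shape (batch, num_heads, seq_len, head_dim); the exact indexing may differ with newer cache classes:

```python
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Prompt pass: query, key, and value lengths all equal the prompt length.
input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
out = model(input_ids, use_cache=True)
print(input_ids.shape[-1])                        # query length == prompt length
print(out.past_key_values[0][0].shape[-2])        # key length == prompt length

# Incremental pass: feed only the newest token, so the query length is 1,
# while the cached keys/values cover all previous tokens plus the new one.
next_token = out.logits[:, -1:].argmax(-1)        # (batch, 1)
out = model(next_token, past_key_values=out.past_key_values, use_cache=True)
print(next_token.shape[-1])                       # query length == 1
print(out.past_key_values[0][0].shape[-2])        # key length == prompt length + 1
```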