In `transformers/src/transformers/modeling_gpt2.py`, what are the `nd` and `ns` variables on line 150?
```python
def _attn(self, q, k, v, attention_mask=None, head_mask=None, output_attentions=False):
    w = torch.matmul(q, k)
    if self.scale:
        w = w / (float(v.size(-1)) ** 0.5)
    nd, ns = w.size(-2), w.size(-1)
    mask = self.bias[:, :, ns - nd : ns, :ns]
```
Since the GPT model performs self-attention, aren't `nd` and `ns` always the same?
What is the meaning of `ns - nd : ns`?
As you can see on line 187,

```python
query, key, value = x.split(self.split_size, dim=2)
```

the query and key come from the same input, so they should have the same sequence length, i.e. `nd` should always equal `ns`.
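For what it's worth, `nd` (query length) and `ns` (key length) can differ when cached past key/values are used during incremental generation: the query then holds only the new token(s), while the key spans the full history, so `self.bias[:, :, ns - nd : ns, :ns]` selects just the mask rows belonging to the new positions. A minimal sketch of that slicing with plain Python lists instead of tensors (the `n_ctx` value here is made up for illustration):

```python
# self.bias in GPT-2 is a precomputed lower-triangular causal mask of
# shape (n_ctx, n_ctx); we imitate it with a 2-D list of 0/1 values.
n_ctx = 8  # hypothetical context size for this sketch
bias = [[1 if j <= i else 0 for j in range(n_ctx)] for i in range(n_ctx)]

def causal_mask(nd, ns):
    """Mimics self.bias[:, :, ns - nd : ns, :ns] on the 2-D list above."""
    return [row[:ns] for row in bias[ns - nd : ns]]

# Prompt pass: no cache, so nd == ns == 3 -> the full lower triangle.
print(causal_mask(3, 3))  # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]

# One decoding step with 3 cached tokens: nd == 1 new query position,
# ns == 4 total key positions -> the single new row attends to all keys.
print(causal_mask(1, 4))  # [[1, 1, 1, 1]]
```

So the slice `ns - nd : ns` aligns the *last* `nd` rows of the causal mask with the `nd` new query positions, which is exactly what incremental decoding needs.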