Sorry for asking the same question again, but I need help.
I am trying to understand the GPT-2 structure from the bottom up.
What I am having trouble with is why the causal masking uses variables like `ns` and `nd`.
I think `nd` and `ns` are always the same, because they were split from the same input `x`.
So both `nd` and `ns` would be the sequence length. Is that right?
The text below is my question in more detail. Thank you for your help.
In transformers/src/transformers/modeling_gpt2.py, what are the `nd` and `ns` variables in line 150?
```python
def _attn(self, q, k, v, attention_mask=None, head_mask=None, output_attentions=False):
    w = torch.matmul(q, k)
    if self.scale:
        w = w / (float(v.size(-1)) ** 0.5)
    nd, ns = w.size(-2), w.size(-1)
    mask = self.bias[:, :, ns - nd : ns, :ns]
```
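To make my question concrete, here is a toy sketch (my own example, not the library code) of what that `ns - nd : ns` slice selects when `nd == ns`, as I assume is always the case. `bias` here stands in for `self.bias`, the precomputed lower-triangular causal mask:

```python
import torch

# Stand-in for self.bias: lower-triangular ones of shape
# (1, 1, max_positions, max_positions), precomputed once.
max_positions = 8
bias = torch.tril(torch.ones(max_positions, max_positions)).view(
    1, 1, max_positions, max_positions
)

# My assumption: q and k come from the same x, so both the query
# length (nd) and the key length (ns) equal the sequence length.
nd, ns = 5, 5

# The slice from line 150: with nd == ns this is just bias[:, :, :ns, :ns],
# i.e. the full 5x5 causal triangle.
mask = bias[:, :, ns - nd : ns, :ns]
print(mask.shape)  # torch.Size([1, 1, 5, 5])
```

If `nd == ns` always held, the `ns - nd` part would always be 0, which is why I do not understand why the slice is written this way.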
Because the GPT model performs self-attention, aren't `nd` and `ns` always the same?
What is the meaning of `ns - nd : ns`?
As you can see in line 187,

```python
query, key, value = x.split(self.split_size, dim=2)
```

the query and key should have the same sequence length, so `nd` and `ns` should be equal.
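Here is a small sketch (my own toy shapes, not the library's) of why I think the split in line 187 forces the query and key to share a sequence length:

```python
import torch

# Toy shapes: split_size stands in for self.split_size, and x stands in
# for the output of the preceding projection, whose last dimension is
# 3 * split_size so it can be split into query, key, and value.
batch, seq_len, split_size = 2, 5, 12
x = torch.randn(batch, seq_len, 3 * split_size)

# Splitting along dim=2 only divides the hidden dimension; the sequence
# dimension (dim=1) is untouched, so all three tensors share seq_len.
query, key, value = x.split(split_size, dim=2)
print(query.shape)  # torch.Size([2, 5, 12])
print(key.shape)    # torch.Size([2, 5, 12])
```

Since `nd` comes from the query and `ns` from the key, this is why I expect `nd == ns`.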