GPT2 Implementation from scratch

jisng-prk · August 11, 2020, 3:00pm

Sorry for asking a similar question again, but I need help.
I am trying to understand GPT2 structure from bottom to top.

What I am having trouble with is why the causal masking uses variables like ns and nd.
I think nd and ns are always the same because they were split from the same ‘x’ input.
So nd and ns would both be the sequence length. Is that right?

A more detailed version of the question is below. Thank you for your help.

In transformers/src/transformers/modeling_gpt2.py:

What are the nd and ns variables in line 150?

def _attn(self, q, k, v, attention_mask=None, head_mask=None, output_attentions=False):
    w = torch.matmul(q, k)
    if self.scale:
        w = w / (float(v.size(-1)) ** 0.5)
    nd, ns = w.size(-2), w.size(-1)
    mask = self.bias[:, :, ns - nd : ns, :ns]

Because the GPT model performs self-attention, aren’t nd and ns always the same?
What is the meaning of “ns - nd : ns”?
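
To make the slice concrete, here is a minimal sketch of what “ns - nd : ns” selects. It assumes self.bias is a lower-triangular matrix of ones (bias[i][j] is 1 when position i may attend to position j); the helper name causal_mask_slice and the size n_ctx=8 are made up for illustration, and plain Python lists stand in for the tensor.

```python
def causal_mask_slice(nd, ns, n_ctx=8):
    """Return rows ns-nd..ns-1 and columns 0..ns-1 of a lower-triangular mask."""
    # Stand-in for self.bias (assumption): 1 where j <= i, else 0.
    bias = [[1 if j <= i else 0 for j in range(n_ctx)] for i in range(n_ctx)]
    # Same slice as self.bias[:, :, ns - nd : ns, :ns], minus the two leading dims.
    return [row[:ns] for row in bias[ns - nd:ns]]

# Whole sequence at once: nd == ns, so the slice is the usual square causal mask.
full = causal_mask_slice(nd=4, ns=4)

# One new query on top of 3 earlier key positions: nd == 1, ns == 4.
# Only the last row is selected, so the new token can see every earlier position.
step = causal_mask_slice(nd=1, ns=4)

for row in full:
    print(row)
print(step)
```

When nd == ns the slice starts at row 0 and nothing special happens. The offset only matters when there are fewer query positions than key positions, in which case the queries are treated as the last nd positions of the ns-long sequence.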

As you can see in line 187,

query, key, value = x.split(self.split_size, dim=2)

so the query and key should have the same sequence length, which is nd and ns.
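
One case where the two lengths could differ, as far as I can tell, is generation with cached keys: the split gives query and key the same length, but previously computed keys are then concatenated onto key before the matmul. The sketch below only illustrates the shapes. The sizes, the split_heads helper, and the past_key tensor are made-up stand-ins, and the key is transposed inside the matmul for simplicity rather than stored transposed as the library code does.

```python
import torch

batch, n_embd, n_head = 1, 8, 2
head_dim = n_embd // n_head

def split_heads(t):
    # (batch, seq, n_embd) -> (batch, n_head, seq, head_dim)
    return t.reshape(t.size(0), t.size(1), n_head, head_dim).permute(0, 2, 1, 3)

# Stand-in for the projected input of ONE new token (assumption).
x = torch.randn(batch, 1, 3 * n_embd)
query, key, value = x.split(n_embd, dim=2)  # all three have sequence length 1
query, key, value = split_heads(query), split_heads(key), split_heads(value)

# Hypothetical cache holding the keys of 3 earlier tokens.
past_key = torch.randn(batch, n_head, 3, head_dim)
key = torch.cat((past_key, key), dim=2)     # key length is now 3 + 1 = 4

w = torch.matmul(query, key.transpose(-2, -1))
nd, ns = w.size(-2), w.size(-1)
print(nd, ns)  # query length vs. key length
```

Without the concatenation step, nd and ns come out equal, which matches the reasoning in the question.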

Thank you
