Is attention_mask in language models such as GPT2LMHeadModel related to the attention mechanism, or is it just to specify padding tokens?

Hi,

I was wondering: is attention_mask in language models such as GPT2LMHeadModel related to the attention mechanism, or is it just there to specify padding tokens?

Specifically, when the code below computes the logits, will the model still use causal attention (each token attends only to the tokens before it), or will it attend to all tokens, since I set attention_mask to 1 for every position except padding?

import torch
from transformers import GPT2LMHeadModel

# Right-pad every example to the length of the longest one (token id 0 as filler)
max_len = max(len(ex_enc) for ex_enc in examples_enc)
padding_lens = [max_len - len(ex_enc) for ex_enc in examples_enc]
padded_examples_enc = [ex_enc + [0] * pad_len for ex_enc, pad_len in zip(examples_enc, padding_lens)]
examples_tensor = torch.tensor(padded_examples_enc, dtype=torch.long)

# 1 = real token, 0 = padding
padding_mask = torch.ones((len(examples_enc), max_len))
for i, pad_len in enumerate(padding_lens):
    if pad_len:  # guard: with pad_len == 0, [-0:] would zero out the whole row
        padding_mask[i, -pad_len:] = 0

model = GPT2LMHeadModel.from_pretrained('gpt2').to('cuda')
logits = model(examples_tensor.to('cuda'), attention_mask=padding_mask.to('cuda')).logits

The attention_mask does feed into the attention computation, but its only job here is to tell the model which positions hold real tokens and which are padding, so the padded positions are ignored.

With your mask setup, the model will still use causal attention, because GPT-2 is a decoder-only architecture: the causal (lower-triangular) mask is applied internally and combined with your attention_mask. So each token only attends to previous non-padding tokens, even in the non-padded parts.
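For reference, here is a minimal sketch of the same setup where the tokenizer builds the padding and the attention_mask for you (the example texts are just placeholders; GPT-2 has no pad token by default, so this reuses the EOS token for padding):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

texts = ["The cat sat on the mat", "Hello"]  # placeholder examples
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    # attention_mask only hides the padding positions; the causal
    # mask is still applied internally on top of it.
    logits = model(input_ids=batch["input_ids"],
                   attention_mask=batch["attention_mask"]).logits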

Thank you so much!

The parameter name got me a bit confused; I was expecting to see something like padding_mask.