Is attention_mask in language models such as GPT2LMHeadModel related to the attention mechanism, or is it just to specify padding tokens?

Hi,

I was wondering: is attention_mask in language models such as GPT2LMHeadModel related to the attention mechanism, or is it just there to specify padding tokens?

Specifically, when the code below computes the logits, will the model still use causal attention (each token attends only to the tokens before it), or will it attend to all tokens, since I set attention_mask to 1 for every position except padding?

import torch
from transformers import GPT2LMHeadModel

# Right-pad every example to the length of the longest one (token id 0 as filler)
max_len = max(len(ex_enc) for ex_enc in examples_enc)
padding_lens = [max_len - len(ex_enc) for ex_enc in examples_enc]
padded_examples_enc = [ex_enc + [0] * pad_len for ex_enc, pad_len in zip(examples_enc, padding_lens)]
examples_tensor = torch.tensor(padded_examples_enc, dtype=torch.long)

# 1 = real token, 0 = padding
padding_mask = torch.ones((len(examples_enc), max_len))
for i, pad_len in enumerate(padding_lens):
    if pad_len:  # guard: with pad_len == 0, [-0:] would zero out the whole row
        padding_mask[i, -pad_len:] = 0

model = GPT2LMHeadModel.from_pretrained('gpt2').to('cuda')
logits = model(examples_tensor.to('cuda'), attention_mask=padding_mask.to('cuda')).logits

The attention_mask does feed into the attention computation, but its only job here is to tell the model which positions hold real tokens and which are padding, so the padded positions are ignored.

With your mask setup, the model will still use causal attention, because GPT-2 is a decoder-only architecture: the causal (lower-triangular) mask is applied internally and combined with your attention_mask. So each token only attends to previous non-padding tokens, even in the non-padded parts.
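For reference, here is a minimal sketch of the same setup where the tokenizer builds the padding and the attention_mask for you (the example texts are just placeholders; GPT-2 has no pad token by default, so this reuses the EOS token for padding):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

texts = ["The cat sat on the mat", "Hello"]  # placeholder examples
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    # attention_mask only hides the padding positions; the causal
    # mask is still applied internally on top of it.
    logits = model(input_ids=batch["input_ids"],
                   attention_mask=batch["attention_mask"]).logits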

Thank you so much!

The parameter name got me a bit confused; I was expecting to see something like padding_mask.