Hi,
I was wondering: is the attention_mask in language models such as GPT2LMHeadModel related to the attention mechanism itself, or is it only there to mark padding tokens?
Specifically, when the code below computes the logits, will the model still use causal attention (each token attending only to the tokens before it), or will it now attend to all tokens, since I set attention_mask to 1 for every token except the padding?
import torch
from transformers import GPT2LMHeadModel

# examples_enc is a list of token-id lists, one per example
max_len = max(len(ex_enc) for ex_enc in examples_enc)
padding_lens = [max_len - len(ex_enc) for ex_enc in examples_enc]
padded_examples_enc = [ex_enc + [0] * pad_len for ex_enc, pad_len in zip(examples_enc, padding_lens)]
examples_tensor = torch.tensor(padded_examples_enc, dtype=torch.long)

# attention_mask: 1 = real token, 0 = padding
padding_mask = torch.ones((len(examples_enc), max_len))
for i, pad_len in enumerate(padding_lens):
    if pad_len > 0:  # pad_len == 0 would otherwise zero out the whole row
        padding_mask[i, -pad_len:] = 0

model = GPT2LMHeadModel.from_pretrained('gpt2').to('cuda')
logits = model(examples_tensor.to('cuda'), attention_mask=padding_mask.to('cuda')).logits
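One way I thought of checking this (just a sketch, using a made-up example sentence on a single unpadded sequence): if the model really keeps its causal mask, the logits at an early position should be identical whether or not the later tokens are present in the input.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2').eval()

# Single example, no padding needed
ids = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors='pt').input_ids

with torch.no_grad():
    full_logits = model(ids).logits            # logits for the whole sequence
    prefix_logits = model(ids[:, :4]).logits   # logits for just the first 4 tokens

# With causal attention, the first 4 positions should agree (up to numerical noise)
print(torch.allclose(full_logits[:, :4], prefix_logits, atol=1e-4))

Is that a fair way to test it, or does attention_mask override the causal masking somehow?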