```python
# Importing necessary modules
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Loading pre-trained GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Encoding input text
input_ids = tokenizer.encode("The dog is running", return_tensors='pt')

# Generating model output with attention information
output = model.generate(
    input_ids,
    max_length=6,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    output_attentions=True,
    return_dict_in_generate=True,
)

# Extracting attention tensors
attn = output.attentions
```
My observations are as follows.
- The `attn` variable is a tuple with two items, one per newly generated token (the prompt has 4 tokens and `max_length=6`, so 6 - 4 = 2 tokens are generated).
- Each item is itself a tuple of 12 tensors, one per transformer layer of the GPT-2 model.
- The tensors in the first item have shape [1, 12, 4, 4], while those in the second item have shape [1, 12, 1, 5].
- When visualized (see the sketch after this list), the tensor of shape [1, 12, 4, 4] looks like a masked (causal) attention pattern.
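For reference, here is a minimal sketch of how I checked these shapes and looked at one attention head. It assumes the `output` object from the `generate()` call above; matplotlib is only used for the heat map.

```python
import matplotlib.pyplot as plt

attn = output.attentions

# Structure: one entry per generated token, each a tuple of per-layer tensors
print(len(attn))           # 2  -> two newly generated tokens
print(len(attn[0]))        # 12 -> one tensor per transformer layer
print(attn[0][0].shape)    # torch.Size([1, 12, 4, 4])
print(attn[1][0].shape)    # torch.Size([1, 12, 1, 5])

# Heat map of layer 0, head 0 from the first item: a 4x4 lower-triangular
# (masked) pattern over the 4 prompt tokens
plt.imshow(attn[0][0][0, 0].detach().numpy())
plt.colorbar()
plt.show()
```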
Here are my questions.
- What do tensors with shapes [1, 12, 4, 4] and [1, 12, 1, 5] represent? How are they different?
- Which decoding stage does each of these tensors come from?