```python
# Importing necessary modules
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Loading pre-trained GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Encoding input text
input_ids = tokenizer.encode("The dog is running", return_tensors='pt')

# Generating model output with attention information
output = model.generate(
    input_ids,
    max_length=6,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    output_attentions=True,
    return_dict_in_generate=True,
)

# Extracting attention tensors
attn = output.attentions
```
My observations are as follows.
- The `attn` variable is a tuple with two items, one per newly generated token (the prompt has 4 tokens and `max_length=6`, so 6 - 4 = 2 tokens are generated).
- Each item is itself a tuple of 12 tensors, one per transformer layer of the GPT-2 model.
- The tensors in the first item have shape [1, 12, 4, 4], while those in the second item have shape [1, 12, 1, 5].
- When visualized (see the sketch after this list), the tensor of shape [1, 12, 4, 4] looks like a masked (causal) attention pattern.
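For reference, here is a minimal sketch of how I checked these shapes and looked at one attention head. It assumes the `output` object from the `generate()` call above; matplotlib is only used for the heat map.

```python
import matplotlib.pyplot as plt

attn = output.attentions

# Structure: one entry per generated token, each a tuple of per-layer tensors
print(len(attn))           # 2  -> two newly generated tokens
print(len(attn[0]))        # 12 -> one tensor per transformer layer
print(attn[0][0].shape)    # torch.Size([1, 12, 4, 4])
print(attn[1][0].shape)    # torch.Size([1, 12, 1, 5])

# Heat map of layer 0, head 0 from the first item: a 4x4 lower-triangular
# (masked) pattern over the 4 prompt tokens
plt.imshow(attn[0][0][0, 0].detach().numpy())
plt.colorbar()
plt.show()
```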
Here are my questions.
- What do tensors with shapes [1, 12, 4, 4] and [1, 12, 1, 5] represent? How are they different?
- Which decoding stage does each of these tensors come from?