Error when trying to visualize attention in T5 model

I have a pretrained T5 model that predicts the solutions to quadratic equations, and I want to use the bertviz library to visualize its attention. In all the examples I found, the input and output have the same length, but in my case they differ.

tokenizer = PreTrainedTokenizerFast.from_pretrained("my_repo/content")
model = T5ForConditionalGeneration.from_pretrained("my_repo/content", output_attentions=True)

For example, for this input:

inputs = tokenizer("7*x^2+3556*x+451612=0", return_tensors="pt")

The model predicts:

outputs = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask, max_length=80, min_length=10, output_attentions=True, return_dict_in_generate=True)

This sequence:

D = 3556 ^ 2 - 4 * 7 * 4 5 1 6 1 2 = 2 1 ; x 1 = ( - 3556 + ( 2 1 ) * * 0. 5 ) / / ( 2 * 7 ) = - 2. 0 ; x 2 = ( - 3556 - ( 2 1 ) * * 0. 5 ) / / ( 2 * 7 ) = - 2. 0

Its length is 79, whereas the length of the input is 18.

Then I do as in the example:

encoder_input_ids = tokenizer("7*x^2+3556*x+451612=0", return_tensors="pt", add_special_tokens=True).input_ids

with tokenizer.as_target_tokenizer():
    decoder_input_ids = tokenizer("D = 3556 ^ 2 - 4 * 7 * 4 5 1 6 1 2 = 2 1 ; x 1 = ( - 3556 + ( 2 1 ) * * 0. 5 ) / / ( 2 * 7 ) = - 2. 0 ; x 2 = ( - 3556 - ( 2 1 ) * * 0. 5 ) / / ( 2 * 7 ) = - 2. 0", 
                                  return_tensors="pt", add_special_tokens=True).input_ids

encoder_text = tokenizer.convert_ids_to_tokens(encoder_input_ids[0])
decoder_text = tokenizer.convert_ids_to_tokens(decoder_input_ids[0])

model_view(
    cross_attention = outputs.cross_attentions[0],
    encoder_attention = encoder_attention, 
    decoder_attention = decoder_attention,
    encoder_tokens = encoder_text,
    decoder_tokens = decoder_text)

However, I get this error:

AttributeError: 'tuple' object has no attribute 'shape'

For some reason the attentions I get come back as tuples (the cross attention is even a tuple of tuples). The bertviz docs say the dimensions should be like this:

For encoder-decoder models:
                encoder_attention: list of ``torch.FloatTensor``(one for each layer) of shape
                    ``(batch_size(must be 1), num_heads, encoder_sequence_length, encoder_sequence_length)``
                decoder_attention: list of ``torch.FloatTensor``(one for each layer) of shape
                    ``(batch_size(must be 1), num_heads, decoder_sequence_length, decoder_sequence_length)``
                cross_attention: list of ``torch.FloatTensor``(one for each layer) of shape
                    ``(batch_size(must be 1), num_heads, decoder_sequence_length, encoder_sequence_length)``
                encoder_tokens: list of tokens for encoder input
                decoder_tokens: list of tokens for decoder input

I don’t understand why I am not getting the right dimensions. Is it because my input and output sequences are of different sizes?

I trained my model on a custom dataset. I changed its lm_head.out_features to 1 and its vocab_size to 100_000.

Some info about dimensions:

cross_attention = outputs.cross_attentions
len(cross_attention[0]) = 12
cross_attention[0][0].shape = torch.Size([1, 12, 1, 18])

decoder_attention = outputs.decoder_attentions
len(decoder_attention[0]) = 12
decoder_attention[0][0].shape = torch.Size([1, 12, 1, 1])

encoder_attention = outputs.encoder_attentions
len(encoder_attention) = 12
encoder_attention[0].shape = torch.Size([1, 12, 18, 18])

If I do something like this:

model_view(
    cross_attention = outputs.cross_attentions[0],
    encoder_attention = encoder_attention[0], 
    decoder_attention = decoder_attention[0],
    encoder_tokens = encoder_text,
    decoder_tokens = decoder_text)

I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-ae747cb8eba0> in <cell line: 1>()
----> 1 model_view(
      2     cross_attention = outputs.cross_attentions[0],
      3     encoder_attention = encoder_attention[0],
      4     decoder_attention = decoder_attention[0],
      5     encoder_tokens = encoder_text,

1 frames
/usr/local/lib/python3.9/dist-packages/bertviz/model_view.py in model_view(attention, tokens, sentence_b_start, prettify_tokens, display_mode, encoder_attention, decoder_attention, cross_attention, encoder_tokens, decoder_tokens, include_layers, include_heads, html_action)
    128             if include_heads is None:
    129                 include_heads = list(range(n_heads))
--> 130             encoder_attention = format_attention(encoder_attention, include_layers, include_heads)
    131             attn_data.append(
    132                 {

/usr/local/lib/python3.9/dist-packages/bertviz/util.py in format_attention(attention, layers, heads)
      9         # 1 x num_heads x seq_len x seq_len
     10         if len(layer_attention.shape) != 4:
---> 11             raise ValueError("The attention tensor does not have the correct number of dimensions. Make sure you set "
     12                              "output_attentions=True when initializing your model.")
     13         layer_attention = layer_attention.squeeze(0)

ValueError: The attention tensor does not have the correct number of dimensions. Make sure you set output_attentions=True when initializing your model.

Hi, what’s going on here is that when you get attentions back from a generate call, you get one tuple of attentions for each token you generated; that’s what the outer tuple is for. The inner tuples hold the attentions for each layer of the model.

For example, if you run:

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tokenizer(
    "Translate from English to French: Hello, how are you?",
    return_tensors="pt",
)

output = model.generate(**inputs, return_dict_in_generate=True, output_attentions=True)

sequences = output.sequences

encoder_attentions = output.encoder_attentions
cross_attentions = output.cross_attentions
decoder_attentions = output.decoder_attentions

print(f"{sequences.shape = }")
print(f"{len(cross_attentions) = }")
print(f"{len(decoder_attentions) = }")

you’ll get:

sequences.shape = torch.Size([1, 10])
len(cross_attentions) = 9
len(decoder_attentions) = 9

Note that it’s actually num_tokens_generated - 1, because the final token (usually the EOS token) is never fed back through the decoder after it’s predicted, so there are no attention scores for it. That’s why len(cross_attentions) = len(decoder_attentions) = num_tokens_generated - 1.
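Continuing the snippet above, you can sanity-check that relationship directly (using output.sequences.shape[1] as the number of generated tokens):

# each decoding step yields one attention tuple; the last token in
# output.sequences never goes back through the decoder
assert len(decoder_attentions) == output.sequences.shape[1] - 1
assert len(cross_attentions) == output.sequences.shape[1] - 1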

Then if you check how long the inner tuples are, you get:

len(cross_attentions[0]) = 12
len(decoder_attentions[0]) = 12

because there are 12 attention layers in t5-base.

In contrast, for the encoder attentions, you don’t get a nested tuple because the encoder is only used once at the beginning. You simply get a tuple containing the encoder attentions from each encoder layer (12 layers for t5-base).
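Continuing the same example, the contrast between the two structures shows up if you print a few shapes (the exact sequence lengths depend on your input):

# cross/decoder attentions: outer index = generation step, inner index = layer
print(f"{cross_attentions[0][0].shape = }")   # (1, num_heads, 1, encoder_seq_len)

# encoder attentions: just one tensor per layer, since the encoder runs once
print(f"{encoder_attentions[0].shape = }")    # (1, num_heads, encoder_seq_len, encoder_seq_len)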

Getting to the issue of how to make this work with Bertviz: the error you got from Bertviz says the decoder_attentions need to be a tuple of tensors (one tensor of attentions for each layer), and each tensor needs to have shape (1, num_heads, decoder_sequence_length, decoder_sequence_length). In the example above, that would be (1, 12, 9, 9) (batch size 1, 12 heads, 9 positions attending to 9 positions). The tricky part is that the decoder attentions you get back from a generate call don’t have a consistent shape, because the decoder sequence grows by one each time a new token is generated.

For example, you can check it like this:

decoder_attentions[0][0].shape = torch.Size([1, 12, 1, 1]) # token 0, layer 0
decoder_attentions[1][0].shape = torch.Size([1, 12, 1, 2]) # token 1, layer 0
decoder_attentions[2][0].shape = torch.Size([1, 12, 1, 3]) # token 2, layer 0

At each step, there’s one query token attending to the sequence that has been written so far. To work with Bertviz, the per-token attentions for each layer need to be combined into one tensor by concatenating along dimension 2, but that’s tricky because the tensors vary in size along dimension 3, which causes a PyTorch error. You can get around this by padding all the decoder attentions to the same length on dimension 3 (9 in this example) with dummy tensors of 0s, i.e. pad so that the shape of decoder_attentions[0][0] goes from [1, 12, 1, 1] to [1, 12, 1, 9].
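To make the padding idea concrete, here is a minimal sketch (the helper name stack_generation_attentions is mine, not something from transformers or bertviz; it assumes batch size 1 and right-pads each step’s attention with zeros):

import torch
import torch.nn.functional as F

def stack_generation_attentions(per_token_attentions, target_len):
    # per_token_attentions: tuple over generation steps, each a tuple over layers
    # of tensors with shape (1, num_heads, 1, step + 1)
    num_layers = len(per_token_attentions[0])
    layer_tensors = []
    for layer in range(num_layers):
        rows = []
        for step_attn in per_token_attentions:
            attn = step_attn[layer]                   # (1, num_heads, 1, step + 1)
            pad = target_len - attn.shape[-1]         # zeros for positions not generated yet
            rows.append(F.pad(attn, (0, pad)))        # right-pad the last dimension
        layer_tensors.append(torch.cat(rows, dim=2))  # (1, num_heads, num_steps, target_len)
    return layer_tensors

# e.g. for the decoder self-attentions from the generate call above:
# decoder_attn = stack_generation_attentions(decoder_attentions,
#                                            target_len=len(decoder_attentions))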

Though ultimately it would probably be easiest to just pass your generated outputs back through the model to get the attentions in a form that’s easier to work with for your purposes. For example, you could do something like this:

# Generate
output = model.generate(**inputs, return_dict_in_generate=True)

# Pass the outputs through the model again to get the attentions
out = model(**inputs, decoder_input_ids=output.sequences, output_attentions=True, return_dict=True)

encoder_attentions = out.encoder_attentions
cross_attentions = out.cross_attentions
decoder_attentions = out.decoder_attentions

print(f"{len(encoder_attentions) = }")
print(f"{len(cross_attentions) = }")
print(f"{len(decoder_attentions) = }")

Then you get back:

len(encoder_attentions) = 12
len(cross_attentions) = 12
len(decoder_attentions) = 12

So each type of attention has 12 tensors (one for each layer), as Bertviz expects, and you no longer have the nested tuples from before. Feeding these outputs to Bertviz should work.
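From there, a minimal sketch of the Bertviz call could look like this (continuing the snippet above; the token lists are built from the same ids that produced the attentions):

from bertviz import model_view

encoder_tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
decoder_tokens = tokenizer.convert_ids_to_tokens(output.sequences[0])

model_view(
    encoder_attention=out.encoder_attentions,
    decoder_attention=out.decoder_attentions,
    cross_attention=out.cross_attentions,
    encoder_tokens=encoder_tokens,
    decoder_tokens=decoder_tokens,
)

Note that each token list has to be exactly as long as the corresponding attention dimension, otherwise model_view will complain about a mismatch between the number of attention positions and the number of tokens.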

Hello! Thank you so much for your time and the clear explanation! It finally worked!

Hi @snork-maiden, could you share your notebook if possible? I tried this but I’m still getting an error about a mismatch between the attention size and the number of tokens (I am using a text-to-SQL T5 model):

ValueError: Attention has 17 positions, while number of tokens is 16 for tokens:

Hello @rr-7, here is the part of my notebook where I do the visualization.

Hope this helps.