Multihead attention

Hi all,
I started looking at the modules in the Flan-T5 model that I have downloaded, trying to understand how multi-head attention is implemented there. But when I print the encoder, I can see only self-attention layers.
Can anyone please explain this?

model.encoder

T5Stack(
  (embed_tokens): Embedding(32128, 512)
  (block): ModuleList(
    (0): T5Block(
      (layer): ModuleList(
        (0): T5LayerSelfAttention(
          (SelfAttention): T5Attention(
            (q): Linear(in_features=512, out_features=384, bias=False)
            (k): Linear(in_features=512, out_features=384, bias=False)
            (v): Linear(in_features=512, out_features=384, bias=False)
            (o): Linear(in_features=384, out_features=512, bias=False)
            (relative_attention_bias): Embedding(32, 6)
          )
          (layer_norm): T5LayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (1): T5LayerFF(
          (DenseReluDense): T5DenseGatedActDense(
            (wi_0): Linear(in_features=512, out_features=1024, bias=False)
            (wi_1): Linear(in_features=512, out_features=1024, bias=False)
            (wo): Linear(in_features=1024, out_features=512, bias=False)
            (dropout): Dropout(p=0.1, inplace=False)
            (act): NewGELUActivation()
          )
          (layer_norm): T5LayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (1-7): 7 x T5Block(
      (layer): ModuleList(
        (0): T5LayerSelfAttention(
          (SelfAttention): T5Attention(
            (q): Linear(in_features=512, out_features=384, bias=False)
            (k): Linear(in_features=512, out_features=384, bias=False)
            (v): Linear(in_features=512, out_features=384, bias=False)
            (o): Linear(in_features=384, out_features=512, bias=False)
          )
          (layer_norm): T5LayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (1): T5LayerFF(
          (DenseReluDense): T5DenseGatedActDense(
            (wi_0): Linear(in_features=512, out_features=1024, bias=False)
            (wi_1): Linear(in_features=512, out_features=1024, bias=False)
            (wo): Linear(in_features=1024, out_features=512, bias=False)
            (dropout): Dropout(p=0.1, inplace=False)
            (act): NewGELUActivation()
          )
          (layer_norm): T5LayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (final_layer_norm): T5LayerNorm()
  (dropout): Dropout(p=0.1, inplace=False)
)


Hey!
Actually, the multi-head attention mechanism is not a separate sub-module, so it cannot be seen in this view.
It all happens inside the forward of the attention layer (the call in the TensorFlow version): the Q, K and V projections are computed with the q, k and v linear layers shown above, each result is then sliced into several sub-tensors (the heads) and reshaped, attention is computed per head, and at the end the heads are re-aggregated into a single tensor and passed through the o projection.
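
As a quick sanity check, you can confirm that the heads are already packed into those linear layers. A minimal sketch, assuming the checkpoint is google/flan-t5-small (its 512/384 shapes and the Embedding(32, 6) relative attention bias match the printout above):

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
attn = model.encoder.block[0].layer[0].SelfAttention

print(model.config.num_heads)   # 6 heads
print(model.config.d_kv)        # 64 dims per head
print(attn.q.out_features)      # 384 == 6 * 64, all heads packed into one Linear

So the q/k/v layers project 512 -> 384, and that 384 is simply 6 heads of size 64 concatenated.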

To understand the process, with 3 attention heads you can think of:

  • Step 1: Input tensor of shape (batch size, sequence length, embedding dim)
  • Step 2: Sliced tensor of shape (batch size, sequence length, 3, embedding dim/3)
  • Step 3: Reshaped tensor of shape (batch size, 3, sequence length, embedding dim/3)
  • Step 4: Compute the attention weights per head and apply them to the “Values” tensor
  • Step 5: Re-aggregate the heads to get back (batch size, sequence length, embedding dim)

All of these steps are plain PyTorch (or TensorFlow) tensor operations, “hard-coded” inside the forward/call of the attention layer, which is why no separate multi-head module shows up when you print the model.
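
Here is a minimal PyTorch sketch of those five steps, using the shapes from the printout (d_model = 512, 6 heads of size 64, so the projections are 512 -> 384). The layer names and the simplified attention (no mask, no relative position bias, no dropout) are illustrative, not the exact Hugging Face T5Attention code:

import torch
import torch.nn.functional as F

batch, seq_len, d_model, n_heads, d_head = 2, 5, 512, 6, 64
inner_dim = n_heads * d_head  # 384, as in the q/k/v layers above

q_proj = torch.nn.Linear(d_model, inner_dim, bias=False)
k_proj = torch.nn.Linear(d_model, inner_dim, bias=False)
v_proj = torch.nn.Linear(d_model, inner_dim, bias=False)
o_proj = torch.nn.Linear(inner_dim, d_model, bias=False)

x = torch.randn(batch, seq_len, d_model)               # Step 1: (batch, seq, d_model)

def split_heads(t):
    # Steps 2 and 3: (batch, seq, inner_dim) -> (batch, n_heads, seq, d_head)
    return t.view(batch, seq_len, n_heads, d_head).transpose(1, 2)

q, k, v = split_heads(q_proj(x)), split_heads(k_proj(x)), split_heads(v_proj(x))

# Step 4: attention weights applied to the values, independently per head.
# (The real T5 adds a learned relative position bias to the scores and skips
# the 1/sqrt(d_head) scaling; both are left out here for brevity.)
scores = torch.matmul(q, k.transpose(-1, -2))           # (batch, n_heads, seq, seq)
weights = F.softmax(scores, dim=-1)
context = torch.matmul(weights, v)                      # (batch, n_heads, seq, d_head)

# Step 5: merge the heads and project back to d_model with the o layer.
out = o_proj(context.transpose(1, 2).reshape(batch, seq_len, inner_dim))
print(out.shape)                                        # torch.Size([2, 5, 512])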
