Multihead attention

Hi all,
I started looking at the modules in the Flan-T5 model that I have downloaded, trying to understand how multi-head attention is implemented there. But when I print the encoder, I can see only self-attention layers.
Can anyone please explain this?

model.encoder

T5Stack(
  (embed_tokens): Embedding(32128, 512)
  (block): ModuleList(
    (0): T5Block(
      (layer): ModuleList(
        (0): T5LayerSelfAttention(
          (SelfAttention): T5Attention(
            (q): Linear(in_features=512, out_features=384, bias=False)
            (k): Linear(in_features=512, out_features=384, bias=False)
            (v): Linear(in_features=512, out_features=384, bias=False)
            (o): Linear(in_features=384, out_features=512, bias=False)
            (relative_attention_bias): Embedding(32, 6)
          )
          (layer_norm): T5LayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (1): T5LayerFF(
          (DenseReluDense): T5DenseGatedActDense(
            (wi_0): Linear(in_features=512, out_features=1024, bias=False)
            (wi_1): Linear(in_features=512, out_features=1024, bias=False)
            (wo): Linear(in_features=1024, out_features=512, bias=False)
            (dropout): Dropout(p=0.1, inplace=False)
            (act): NewGELUActivation()
          )
          (layer_norm): T5LayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (1-7): 7 x T5Block(
      (layer): ModuleList(
        (0): T5LayerSelfAttention(
          (SelfAttention): T5Attention(
            (q): Linear(in_features=512, out_features=384, bias=False)
            (k): Linear(in_features=512, out_features=384, bias=False)
            (v): Linear(in_features=512, out_features=384, bias=False)
            (o): Linear(in_features=384, out_features=512, bias=False)
          )
          (layer_norm): T5LayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (1): T5LayerFF(
          (DenseReluDense): T5DenseGatedActDense(
            (wi_0): Linear(in_features=512, out_features=1024, bias=False)
            (wi_1): Linear(in_features=512, out_features=1024, bias=False)
            (wo): Linear(in_features=1024, out_features=512, bias=False)
            (dropout): Dropout(p=0.1, inplace=False)
            (act): NewGELUActivation()
          )
          (layer_norm): T5LayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (final_layer_norm): T5LayerNorm()
  (dropout): Dropout(p=0.1, inplace=False)
)


Hey!
Actually, the multi-head attention mechanism is not a separate sub-module, so it cannot be seen in this view.
It all happens inside the forward of the attention layer (the call in the TensorFlow version): the Q, K and V projections are computed with the q, k and v linear layers shown above, each result is then sliced into several sub-tensors (the heads) and reshaped, attention is computed per head, and at the end the heads are re-aggregated into a single tensor and passed through the o projection.
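
As a quick sanity check, you can confirm that the heads are already packed into those linear layers. A minimal sketch, assuming the checkpoint is google/flan-t5-small (its 512/384 shapes and the Embedding(32, 6) relative attention bias match the printout above):

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
attn = model.encoder.block[0].layer[0].SelfAttention

print(model.config.num_heads)   # 6 heads
print(model.config.d_kv)        # 64 dims per head
print(attn.q.out_features)      # 384 == 6 * 64, all heads packed into one Linear

So the q/k/v layers project 512 -> 384, and that 384 is simply 6 heads of size 64 concatenated.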

To understand the process, with 3 attention heads you can think of:

  • Step 1: Input tensor of shape (batch size, sequence length, embedding dim)
  • Step 2: Sliced tensor of shape (batch size, sequence length, 3, embedding dim/3)
  • Step 3: Reshaped tensor of shape (batch size, 3, sequence length, embedding dim/3)
  • Step 4: Compute the attention weights per head and apply them to the “Values” tensor
  • Step 5: Re-aggregate the heads to get back (batch size, sequence length, embedding dim)

All of these steps are plain PyTorch (or TensorFlow) tensor operations, “hard-coded” inside the forward/call of the attention layer, which is why no separate multi-head module shows up when you print the model.
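
Here is a minimal PyTorch sketch of those five steps, using the shapes from the printout (d_model = 512, 6 heads of size 64, so the projections are 512 -> 384). The layer names and the simplified attention (no mask, no relative position bias, no dropout) are illustrative, not the exact Hugging Face T5Attention code:

import torch
import torch.nn.functional as F

batch, seq_len, d_model, n_heads, d_head = 2, 5, 512, 6, 64
inner_dim = n_heads * d_head  # 384, as in the q/k/v layers above

q_proj = torch.nn.Linear(d_model, inner_dim, bias=False)
k_proj = torch.nn.Linear(d_model, inner_dim, bias=False)
v_proj = torch.nn.Linear(d_model, inner_dim, bias=False)
o_proj = torch.nn.Linear(inner_dim, d_model, bias=False)

x = torch.randn(batch, seq_len, d_model)               # Step 1: (batch, seq, d_model)

def split_heads(t):
    # Steps 2 and 3: (batch, seq, inner_dim) -> (batch, n_heads, seq, d_head)
    return t.view(batch, seq_len, n_heads, d_head).transpose(1, 2)

q, k, v = split_heads(q_proj(x)), split_heads(k_proj(x)), split_heads(v_proj(x))

# Step 4: attention weights applied to the values, independently per head.
# (The real T5 adds a learned relative position bias to the scores and skips
# the 1/sqrt(d_head) scaling; both are left out here for brevity.)
scores = torch.matmul(q, k.transpose(-1, -2))           # (batch, n_heads, seq, seq)
weights = F.softmax(scores, dim=-1)
context = torch.matmul(weights, v)                      # (batch, n_heads, seq, d_head)

# Step 5: merge the heads and project back to d_model with the o layer.
out = o_proj(context.transpose(1, 2).reshape(batch, seq_len, inner_dim))
print(out.shape)                                        # torch.Size([2, 5, 512])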
