Why should the output size in multi-head attention be the same as the input size?
After we concatenate the outputs of the multiple self-attention heads, why do we need to apply a linear transformation to the result?
Why can't the size just be kept as it is after concatenation?
Great question! Let’s break it down step by step to understand why this design choice is made in the Transformer architecture.
The main reason is residual connections (skip connections). In Transformers, the input of a layer is added directly to its output before moving to the next layer. For this addition to work, the dimensions of the input and output have to match, so keeping the output size equal to the input size lets every layer apply the residual addition seamlessly, as the small sketch below shows.
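Here is a minimal PyTorch sketch of that constraint (the batch size, sequence length, and $d_{model}$ value are just illustrative, not fixed by the architecture):

```python
import torch

d_model = 512                       # illustrative model dimension
x = torch.randn(2, 10, d_model)     # (batch, sequence length, d_model)

# The sub-layer (e.g. multi-head attention) must return the same shape
# as its input for the residual connection to work.
sublayer_out = torch.randn(2, 10, d_model)
y = x + sublayer_out                # OK: shapes match

# If the sub-layer returned a different last dimension (e.g. the raw
# concatenation of all heads), this addition would raise a shape error.
```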
When using Multi-Head Attention, we divide the input into several “heads.” Each head learns different patterns or relationships in the input, and their outputs are concatenated along the feature dimension to form a larger tensor. For example, with $h$ heads that each produce a $d_k$-dimensional output, the concatenation has size $h \times d_k$.

In general this concatenated size does not have to match the input size $d_{model}$, so we apply a linear transformation (a simple matrix multiplication with learned weights, usually written $W^O$) to map it back to the original size. The sketch below traces the shapes through this step.
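A minimal sketch of those shapes, assuming PyTorch; the sizes ($d_{model} = 512$, $h = 8$, $d_k = 96$) and names such as `W_q`, `W_o`, and `split_heads` are illustrative choices, not prescribed values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, seq_len, d_model = 2, 10, 512
h, d_k = 8, 96                      # illustrative: 8 heads, 96 dims per head

x = torch.randn(batch, seq_len, d_model)

# Learned projections producing all heads' queries, keys, and values at once.
W_q = nn.Linear(d_model, h * d_k)
W_k = nn.Linear(d_model, h * d_k)
W_v = nn.Linear(d_model, h * d_k)

def split_heads(t):
    # (batch, seq_len, h * d_k) -> (batch, h, seq_len, d_k)
    return t.view(batch, seq_len, h, d_k).transpose(1, 2)

q, k, v = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))

# Scaled dot-product attention, computed independently for each head.
scores = q @ k.transpose(-2, -1) / d_k ** 0.5
heads = F.softmax(scores, dim=-1) @ v            # (batch, h, seq_len, d_k)

# Concatenate the heads along the feature axis: size becomes h * d_k = 768.
concat = heads.transpose(1, 2).reshape(batch, seq_len, h * d_k)

# The output projection W_o maps the concatenation back to d_model = 512,
# so the block's output shape matches its input shape.
W_o = nn.Linear(h * d_k, d_model)
out = W_o(concat)

print(concat.shape)  # torch.Size([2, 10, 768])
print(out.shape)     # torch.Size([2, 10, 512])
```

With these numbers the concatenation has 768 features while the input has 512, which is exactly why the final `W_o` projection is needed to restore the input size.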
There is more to this projection than just resizing, though. If we skipped the linear transformation and left the size at $h \times d_k$, a few issues would arise:

- The residual connection could no longer add the layer's input to its output, because the two shapes would differ.
- Every subsequent layer would have to be designed around the larger size instead of a single, consistent $d_{model}$.
- Within the attention block itself, nothing would combine the information extracted by the different heads; each head's output would simply sit next to the others.
The linear transformation after concatenation therefore does two jobs at once:

- It projects the concatenated tensor back to the input size $d_{model}$, so residual connections and the following layers work unchanged.
- It mixes the outputs of the different heads with learned weights, letting the model decide how to combine what each head has found (see the sketch below).
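To make the mixing point concrete, here is a small sketch (PyTorch assumed, same illustrative sizes as above) showing that the output projection $W^O$ can be read as one learned $d_k \times d_{model}$ block per head whose contributions are summed, so every output feature draws on all heads:

```python
import torch
import torch.nn as nn

h, d_k, d_model = 8, 96, 512                    # illustrative sizes
W_o = nn.Linear(h * d_k, d_model, bias=False)

heads = torch.randn(2, 10, h, d_k)              # stand-in for the h attention outputs
concat = heads.reshape(2, 10, h * d_k)

# Standard path: project the concatenation with W_o.
out = W_o(concat)

# Equivalent view: split W_o into one (d_k, d_model) block per head and
# sum the per-head projections. Every output feature therefore blends
# contributions from all heads.
blocks = W_o.weight.T.reshape(h, d_k, d_model)
out_per_head = sum(heads[:, :, i, :] @ blocks[i] for i in range(h))

print(torch.allclose(out, out_per_head, atol=1e-5))  # True
```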
I hope this helps clear things up! Let me know if you have any follow-up questions.