I am going through the fusion-in-decoder paper. Basically, the authors encode each "chunk" independently and concatenate the encoded outputs before feeding them to the T5 decoder. The resulting representation has a sequence length longer than the maximum sequence length of T5 (512).
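To make the question concrete, here is a minimal sketch of the concatenation step as I understand it, assuming the Hugging Face transformers T5 implementation (the chunk strings are placeholders, not the paper's actual inputs):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")

# Placeholder chunks; in the paper each one is a question + retrieved passage.
chunks = ["question: ... context: passage one",
          "question: ... context: passage two"]

# Each chunk is encoded independently, so each stays within the 512-token limit.
encoded = [
    model.encoder(**tokenizer(c, return_tensors="pt")).last_hidden_state
    for c in chunks
]  # each: (1, seq_len_i, d_model)

# The encoder outputs are concatenated along the sequence axis, so the decoder
# cross-attends over all chunks at once; total_len here can exceed 512.
fused = torch.cat(encoded, dim=1)  # (1, total_len, d_model)

# The cross-attention weights of the first decoder block, for inspection:
cross_attn = model.decoder.block[0].layer[1].EncDecAttention
print(cross_attn.q.weight.shape, cross_attn.k.weight.shape, cross_attn.v.weight.shape)
```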
How does this affect the dimensions of the query, key, and value weight matrices of the cross-attention at initialization?
Does this mean that the first 512 tokens use the pretrained weights while the others are randomly initialized?