ALiBi and Extrapolation


I read the ALiBi paper about extrapolating to longer sequence lengths, and I want to modify a given decoder-only transformer to use ALiBi attention. I understand the ALiBi implementation itself, but I'm stuck on the extrapolation part. In the paper they also test extrapolation with a standard positional-embedding network, and it doesn't work well. Okay, got it, ALiBi is better, but how did they actually do the extrapolation? I mean, if the model was trained with 512 input embeddings, how do I feed 1024 embeddings into it?
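To make concrete what I mean by "I understand the ALiBi implementation", here is my rough sketch of the bias in NumPy (the slope formula assumes the number of heads is a power of two, as in the paper; `alibi_bias` is just my own helper name):

```python
import numpy as np

def alibi_slopes(n_heads):
    # Per-head slopes: a geometric sequence starting at 2^(-8/n_heads),
    # as given in the paper for head counts that are powers of two.
    start = 2 ** (-8.0 / n_heads)
    return np.array([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads, seq_len):
    # Additive bias on attention scores: slope * -(i - j) for key j, query i.
    # Note that nothing here depends on a trained maximum length, so the same
    # function produces a valid bias for seq_len = 512 or 1024.
    distance = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]  # j - i
    bias = alibi_slopes(n_heads)[:, None, None] * distance[None, :, :]
    # Causal mask: a query at position i may only attend to keys with j <= i.
    bias = np.where(distance[None, :, :] > 0, -np.inf, bias)
    return bias  # shape (n_heads, seq_len, seq_len)
```

So the bias matrix itself can be built for any sequence length. What I don't get is the rest of the network: where do the extra 512 positions go at inference time?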
Can anyone help? Thank you so much!