Relative Position Representation/Encoding for Transformer

  1. In the GPT-NeoX-20B: An Open-Source Autoregressive Language Model paper, why did the authors state that rotary embeddings are a form of static relative positional embeddings? (see the first sketch below)

  2. In https://medium.com/@init/how-self-attention-with-relative-position-representations-works-28173b8c245a, could anyone explain the rationale for the lookup indices after the 3rd element all being 6? (see the second sketch below)

  3. What is the actual purpose of the skewing mechanism? (see the third sketch below)
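
For context on question 1, here is a minimal 2-D sketch of the rotary property as I understand it (single frequency, toy vectors chosen by me, not the GPT-NeoX implementation): after rotating queries and keys by an angle proportional to their positions, the attention score depends only on the position offset, which seems to be what "relative" refers to.

```python
import numpy as np

def rotate(x, pos, theta=10000.0):
    """Rotate a 2-D vector by an angle proportional to its position
    (a single-frequency toy version of a rotary embedding)."""
    angle = pos / theta
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return rot @ x

q = np.array([1.0, 2.0])   # toy query
k = np.array([0.5, -1.0])  # toy key

# <R_m q, R_n k> = q^T R_{n-m} k: the score depends only on the offset n - m,
# so the positional information entering attention is purely relative.
print(rotate(q, 3) @ rotate(k, 7))    # positions (3, 7), offset 4
print(rotate(q, 10) @ rotate(k, 14))  # positions (10, 14), same offset 4 -> same score
```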
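
For question 2, my guess is that the indices come from clipping the relative offset j - i to a maximum distance k and then shifting it to be non-negative, as in Shaw et al. (2018). Assuming the post uses k = 3 (an assumption on my part), every key more than 3 positions to the right of the query maps to the same index 2k = 6:

```python
import numpy as np

def relative_lookup_indices(seq_len, k=3):
    """Lookup indices in the style of Shaw et al. (2018): the offset j - i
    is clipped to [-k, k] and shifted by +k into the range [0, 2k]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return np.clip(j - i, -k, k) + k

# Row 0 holds the indices the first query uses for each key position.
# Every key further than k = 3 positions away is clipped to the same
# index 2k = 6, which would explain why the values after the 3rd element
# are all 6.
print(relative_lookup_indices(10)[0])  # [3 4 5 6 6 6 6 6 6 6]
```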
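
For question 3, my current understanding (from the Music Transformer paper) is that skewing is an indexing trick: it rearranges the (query x relative-distance) logits Q Eᵀ into (query x key-position) order without materializing an L x L x d tensor of per-pair embeddings. A NumPy sketch of that reading (variable names and the naive comparison are mine):

```python
import numpy as np

L, d = 4, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(L, d))   # queries
E = rng.normal(size=(L, d))   # relative embeddings; row r <-> distance r - (L - 1)

A = Q @ E.T                   # A[i, r] = q_i . e_r, indexed by relative distance

# Naive gather: for each (query i, key j) with j <= i, pick the logit for
# the relative distance j - i.
naive = np.zeros((L, L))
for i in range(L):
    for j in range(i + 1):
        naive[i, j] = A[i, (L - 1) + (j - i)]

# Skewing: pad a zero column on the left, reshape to (L + 1, L), drop the
# first row. The lower triangle now matches the naive gather; the upper
# triangle is junk that the causal mask removes anyway.
skewed = np.pad(A, ((0, 0), (1, 0))).reshape(L + 1, L)[1:]

mask = np.tril(np.ones((L, L), dtype=bool))
assert np.allclose(naive[mask], skewed[mask])
print("skewing reproduces the naive relative logits on the causal part")
```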