Relative Position Representation/Encoding for Transformer

  1. In the paper "GPT-NeoX-20B: An Open-Source Autoregressive Language Model", why do the authors state that rotary embeddings are a form of static relative positional embeddings?

  2. In , could anyone explain why the lookup indices after the 3rd element are all 6?

  3. What is the actual purpose of the skewing mechanism?
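
For context on question 2, here is a minimal sketch of one common way such lookup indices arise: relative offsets are clipped to a maximum distance before being used as indices into an embedding table. This is only an assumption about the scheme in question, since the source being asked about is not reproduced here. The function name `clipped_relative_indices`, the clipping distance of 3, and the sequence length of 6 are all hypothetical and chosen purely for illustration.

```python
def clipped_relative_indices(seq_len, max_distance):
    """Build a seq_len x seq_len table of relative-position lookup indices.

    Entry [i][j] is clip(j - i, -k, k) + k with k = max_distance, so every
    index falls in the range 0 .. 2k and can address a table of 2k + 1
    relative-position embeddings. (Assumed scheme, for illustration only.)
    """
    k = max_distance
    return [
        [max(-k, min(k, j - i)) + k for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Hypothetical sizes: 6 positions, offsets clipped to the range [-3, 3].
indices = clipped_relative_indices(seq_len=6, max_distance=3)
for row in indices:
    print(row)
```

Under these assumptions, every offset larger than +3 is clipped to +3 and therefore shares the last index, 2 * 3 = 6, which is why the tail of the first row repeats the value 6. Whether this matches the figure in question depends on the clipping distance used there.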