In the LLaMA model of the Transformers library, the application of rotary embeddings appears to be implemented incorrectly, specifically in the `rotate_half` function (first link). For a query vector `[1, 2, 3, 4, 5, 6]`, the expected output is `[-2, 1, -4, 3, -6, 5]`, but the function returns `[-4, -5, -6, 1, 2, 3]`. The slicing should have been interleaved rather than split into halves. The RoFormer implementation (second link) does look correct, though.
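To make the discrepancy concrete, here is a minimal NumPy sketch of the two rotation styles on the example vector (the function names and NumPy port are mine; the library itself uses torch):

```python
import numpy as np

def rotate_half(x):
    # HF LLaMA style: split into two halves, negate the second, swap them
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate((-x2, x1), axis=-1)

def rotate_interleaved(x):
    # RoFormer style: rotate adjacent pairs (x0, x1) -> (-x1, x0)
    out = np.empty_like(x)
    out[..., 0::2] = -x[..., 1::2]
    out[..., 1::2] = x[..., 0::2]
    return out

q = np.array([1, 2, 3, 4, 5, 6])
print(rotate_half(q))         # [-4 -5 -6  1  2  3]
print(rotate_interleaved(q))  # [-2  1 -4  3 -6  5]
```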

@reminisce I think you're correct; I saw the same thing. The embeddings are supposed to be interleaved. As written, the embedding at dim 0 has the same form as it would at dim `x.shape[-1] // 2`.

(The screenshot below shows what I'm referring to, in poor detail: it visualizes dummy data, `torch.arange(0, 256).unsqueeze(0).repeat(16, 1)`, which is why there's a smooth background color change from 0 to 255, encoded with this rotary PE code.)

(Note: I haven't rigorously examined this test, but this simple examination raises my concern that it's not correct.)
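The angle duplication is also easy to check numerically. This sketch assumes the `cat(freqs, freqs)` construction that accompanies `rotate_half` (standard RoPE base 10000; ported to NumPy for illustration):

```python
import numpy as np

dim = 8
pos = 5
# one inverse frequency per channel pair, as in standard RoPE
inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
# the half-rotation form duplicates the angles across the two halves
emb = np.concatenate((pos * inv_freq, pos * inv_freq))
print(emb[0] == emb[dim // 2])  # True: channel 0 and channel dim//2 share an angle
```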

Thanks for checking this @GalacticKip7. I actually later found that simply rotating by halves is also a correct form of rotary embedding (see the following vector-vector multiply-add form and the equivalent matrix-vector multiplication form). As long as the matrix (R) satisfies the third equation for any q and k vectors, it's a valid form of rotary embedding. The only caveat is to use the same form for both training and inference.
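A quick numerical sanity check of that claim: with the half-rotation form, the query-key dot product should depend only on the relative position of the two vectors (a NumPy sketch under standard RoPE assumptions, not the library code):

```python
import numpy as np

def rotate_half(x):
    half = x.shape[-1] // 2
    return np.concatenate((-x[half:], x[:half]))

def apply_rotary(x, pos, dim):
    # half-rotation form: angles duplicated across the two halves,
    # matching the cat(freqs, freqs) construction
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    emb = np.concatenate((pos * inv_freq, pos * inv_freq))
    return x * np.cos(emb) + rotate_half(x) * np.sin(emb)

rng = np.random.default_rng(0)
dim = 8
q, k = rng.standard_normal(dim), rng.standard_normal(dim)

# both pairs have relative position 4, so the dot products should match
d1 = apply_rotary(q, 3, dim) @ apply_rotary(k, 7, dim)
d2 = apply_rotary(q, 10, dim) @ apply_rotary(k, 14, dim)
print(np.isclose(d1, d2))  # True
```

Each channel pair `(x_i, x_{i + dim//2})` undergoes a plane rotation by `pos * inv_freq[i]`, and the dot product of two rotated vectors only sees the difference of the two angles, which is why the positions cancel.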

@reminisce Thank you for the clarification. I've only seen the original implementation (see pic); your pictures help clarify this alternative derivation, though. I'll try to go over it again once I have a pencil and paper in front of me.