Is LLaMA rotary embedding implementation correct?

In the LLaMA model of the Transformers library, the rotary embedding does not appear to be applied correctly, specifically in the rotate_half function (first link). For a query vector [1, 2, 3, 4, 5, 6], the expected output is [-2, 1, -4, 3, -6, 5], but the function returns [-4, -5, -6, 1, 2, 3]. The slicing should have been interleaved. The RoFormer implementation (second link) looks correct, though.
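Here is a minimal sketch (my own, not the library code verbatim) contrasting the two helpers being discussed: rotate_half mirrors the half-split behavior described above, and rotate_every_two is the interleaved variant used by RoFormer-style implementations.

```python
import torch

def rotate_half(x):
    # Half-split style: swap the two halves of the last dim and negate the second half.
    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def rotate_every_two(x):
    # Interleaved style: pair adjacent dims (x0, x1), (x2, x3), ... and rotate each pair.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

q = torch.tensor([1., 2., 3., 4., 5., 6.])
print(rotate_half(q))       # tensor([-4., -5., -6.,  1.,  2.,  3.])
print(rotate_every_two(q))  # tensor([-2.,  1., -4.,  3., -6.,  5.])
```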


@reminisce I think you’re correct. I saw the same thing. The embeddings are supposed to be interleaved. In this manner, the embeddings have the same form at dim 0 as they would at dim x.shape[-1]//2.
(The screenshot below shows what I’m referring to, admittedly in poor detail. It shows dummy data (torch.arange(0, 256).unsqueeze(0).repeat(16, 1), which is why there’s the smooth background color change from 0 to 255) encoded with this rotary PE code.)

(Note: I haven’t rigorously examined this test, but this quick check raises my concern that it’s not correct. A rough sketch of the check is below.)
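For reference, here is a hedged sketch of that kind of visual check, assuming the half-split (rotate_half) layout for the cos/sin tables; the helper names are illustrative, not the library’s, and the exact code behind the screenshot may differ.

```python
import torch
import matplotlib.pyplot as plt

def build_cos_sin(seq_len, dim, base=10000.0):
    # Standard RoPE frequencies, laid out in the "rotate_half" (half-split) order.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)         # (seq_len, dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)  # (seq_len, dim)
    return emb.cos(), emb.sin()

def rotate_half(x):
    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

seq_len, dim = 16, 256
x = torch.arange(0, dim).float().unsqueeze(0).repeat(seq_len, 1)  # dummy data, (16, 256)
cos, sin = build_cos_sin(seq_len, dim)
x_rot = x * cos + rotate_half(x) * sin

plt.imshow(x_rot.numpy(), aspect="auto")
plt.xlabel("feature dim")
plt.ylabel("position")
plt.show()
```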

Thanks for checking this @GalacticKip7. I actually later found that simply rotating the half is also a correct form of rotary embedding (see the following vector-vector multiplication-addition form and the equivalent matrix-vector multiplication form). As long as the matrix R satisfies the third equation for any q and k vectors, it’s a valid form of rotary embedding. The only caveat is to use the same form for both training and inference.
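As a quick sanity check (my own sketch, assuming the usual relative-position property ⟨R_m q, R_n k⟩ = ⟨q, R_{n−m} k⟩, i.e. the attention score depends only on the offset n − m), both the interleaved form and the rotate-half form satisfy it numerically, so either is valid as long as it is used consistently:

```python
import torch

def make_cos_sin(pos, dim, base=10000.0, interleaved=True):
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    freqs = pos * inv_freq
    # The interleaved layout repeats each frequency for adjacent dims; the
    # rotate-half layout concatenates the frequency vector twice.
    emb = torch.repeat_interleave(freqs, 2) if interleaved else torch.cat((freqs, freqs))
    return emb.cos(), emb.sin()

def rotate_half(x):
    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def rotate_every_two(x):
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

def apply_rope(x, pos, interleaved=True):
    cos, sin = make_cos_sin(pos, x.shape[-1], interleaved=interleaved)
    rot = rotate_every_two(x) if interleaved else rotate_half(x)
    return x * cos + rot * sin

torch.manual_seed(0)
dim = 64
q, k = torch.randn(dim), torch.randn(dim)

for interleaved in (True, False):
    # Same offset (4) at two different absolute position pairs -> same dot product.
    a = torch.dot(apply_rope(q, 3.0, interleaved), apply_rope(k, 7.0, interleaved))
    b = torch.dot(apply_rope(q, 10.0, interleaved), apply_rope(k, 14.0, interleaved))
    print("interleaved" if interleaved else "rotate_half", torch.allclose(a, b, atol=1e-4))
```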

@reminisce Thank you for the clarification. I’ve only seen the original implementation (see pic) - your pictures help clarify this alternative derivation, though. I’ll try to go over it again once I have a pencil and paper in front of me :slight_smile: