Is LLaMA rotary embedding implementation correct?

In the LLaMA model of the Transformers library, the rotary embedding seems to be implemented incorrectly, specifically in the rotate_half function (first link). For a query vector [1, 2, 3, 4, 5, 6], the expected output is [-2, 1, -4, 3, -6, 5], but the function returns [-4, -5, -6, 1, 2, 3]: the slicing should have been interleaved. The RoFormer implementation (second link) looks correct, though.
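To make the difference concrete, here is a minimal sketch of the two rotation conventions (plain Python; the function names are my own, not the library's):

```python
def rotate_half(x):
    # Hugging Face LLaMA style: negate the second half, then swap halves.
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]

def rotate_interleaved(x):
    # RoFormer / original RoPE style: each adjacent pair (x0, x1) -> (-x1, x0).
    out = []
    for i in range(0, len(x), 2):
        out += [-x[i + 1], x[i]]
    return out

q = [1, 2, 3, 4, 5, 6]
print(rotate_half(q))         # [-4, -5, -6, 1, 2, 3]
print(rotate_interleaved(q))  # [-2, 1, -4, 3, -6, 5]
```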


@reminisce I think you’re correct. I saw the same thing. The embeddings are supposed to be interleaved: that way, the embedding at dim 0 has the same form it would have at dim x.shape[-1]//2.
(The screenshot below roughly shows what I’m referring to. It shows dummy data, torch.arange(0, 256).unsqueeze(0).repeat(16, 1) — hence the smooth background color change from 0 to 255 — encoded with this rotary PE code.)

(Note: I haven’t rigorously examined this, but even this simple check raises my concern that the implementation isn’t correct.)

Thanks for checking this @GalacticKip7. I actually later found that simply rotating half is also a correct form of rotary embedding (see the following vector-vector multiplication-addition form and the equivalent matrix-vector multiplication form). As long as the matrix R satisfies the third equation for any q and k vectors, it’s a valid form of rotary embedding. The only caveat is to use the same form for both training and inference.
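As a quick numerical check of that claim (my own sketch, not the Transformers code), one can verify that with the half-rotation layout the q·k dot product depends only on the relative position, which is the defining property of rotary embedding:

```python
import numpy as np

np.random.seed(0)

def rope_half(x, pos, theta):
    # "Rotate half" layout: dims i and i + d/2 share frequency theta[i],
    # so the per-dim angles are [t0, t1, ..., t0, t1, ...] * pos.
    ang = np.concatenate([theta, theta]) * pos
    half = len(x) // 2
    rot = np.concatenate([-x[half:], x[:half]])
    return x * np.cos(ang) + rot * np.sin(ang)

dim = 8
theta = 10000.0 ** (-np.arange(dim // 2) / (dim // 2))
q, k = np.random.randn(dim), np.random.randn(dim)

# The score depends only on the relative position (here 4), not the
# absolute positions of q and k:
d1 = rope_half(q, 3, theta) @ rope_half(k, 7, theta)
d2 = rope_half(q, 100, theta) @ rope_half(k, 104, theta)
assert np.isclose(d1, d2)
```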


@reminisce Thank you for the clarification. I’ve only seen the original implementation (see pic); your pictures help clarify this alternative derivation, though. I’ll go over it again once I have pencil and paper in front of me.

Hi, I have the same question after reading the source code in Hugging Face. The key point here is that we load the original Meta LLaMA weights into the Hugging Face LLaMA model, so the rotary embedding should be structured the same way. The original Meta LLaMA code also uses the same rotary embedding as the original RoPE paper (let me know if I am wrong).

To test whether we really need the same setup during training and inference, I changed the source code of the current transformers main branch to use the original paper’s interleaved form. This is the response I got from CodeLlama-7b-hf:

import socket

def ping_exponential_backoff(host: str):
    for i in range(1, 10):
    try:

    #     s = socket.socket(socket.AF_INETH, socket.SOCK_STREQ, socket.SOCK_STREUSE_TCP_NODELAY)
    s.connect((host, 80))
    s.close()
    return True
  except:
    return False




def ping_host(host: str):
  for i in range(1, 10):
    if ping_host(host):
      return True
    return False


def ping_host(host: str):
  for i in range(1, 10):
    if ping_host(host):
      return False




def ping_host

If I keep the current transformers main branch, I still get a meaningful result:

def ping_exponential_backoff(host: str):
    """
    Ping a host with exponential backoff.
    """
    for i in range(1, 10):
        try:
            socket.create_connection((host, 80), 5).close()
            return True
        except OSError as e:
            if i < 10:
                time.sleep(2 ** i)
            else:
                raise e


def ping_exponential_backoff_with_timeout(host: str, timeout: int):
    """
    Ping a host with exponential backoff and timeout.
    """
    for i in range(1, 10):
        try:
            socket.create_connection((host, 80), timeout).close()
            return True
        except OSError as e

I guess a more in-depth investigation would be required to compare the overall performance change…


Hello, @alexchen4ai

In fact, the checkpoints are not exactly the same.
Please see [LLaMA] Rotary positional embedding differs with official implementation · Issue #25199 · huggingface/transformers · GitHub

They permuted some weights when converting the LLaMA weights to the Hugging Face format.
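The idea can be sketched as follows (a hypothetical permutation of a single head’s dimensions, just for illustration — the actual conversion script permutes the rows of the q/k projection weights, see the linked issue). Reordering each head’s dims as “even indices first, then odd” makes the rotate-half form reproduce the interleaved form:

```python
def rotate_half(x):
    # Hugging Face LLaMA style rotation.
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]

def rotate_interleaved(x):
    # Original RoPE / Meta LLaMA style rotation.
    return [v for i in range(0, len(x), 2) for v in (-x[i + 1], x[i])]

q = [1, 2, 3, 4, 5, 6]
perm = [0, 2, 4, 1, 3, 5]        # even dims first, then odd dims

y = rotate_half([q[p] for p in perm])
out = [0] * len(q)
for i, p in enumerate(perm):     # undo the permutation
    out[p] = y[i]

assert out == rotate_interleaved(q)  # the two forms now agree
```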