Why do the llama weights in huggingface need a permute on wq/wk?

In convert_llama_weights_to_hf.py (huggingface/transformers on GitHub) there is a function called permute which is applied to wq and wk:

    # permute for sliced rotary
    def permute(w):
        return w.view(n_heads, dim // n_heads // 2, 2, dim).transpose(1, 2).reshape(dim, dim)
    for layer_i in range(n_layers):
        filename = f"pytorch_model-{layer_i + 1}-of-{n_layers + 1}.bin"
        if model_size == "7B":
            # Unsharded
            state_dict = {
                f"model.layers.{layer_i}.self_attn.q_proj.weight": permute(
                    loaded[f"layers.{layer_i}.attention.wq.weight"]
                ),
                f"model.layers.{layer_i}.self_attn.k_proj.weight": permute(
                    loaded[f"layers.{layer_i}.attention.wk.weight"]
                ),
                f"model.layers.{layer_i}.self_attn.v_proj.weight": loaded[f"layers.{layer_i}.attention.wv.weight"],
                f"model.layers.{layer_i}.self_attn.o_proj.weight": loaded[f"layers.{layer_i}.attention.wo.weight"],
                f"model.layers.{layer_i}.mlp.gate_proj.weight": loaded[f"layers.{layer_i}.feed_forward.w1.weight"],
                f"model.layers.{layer_i}.mlp.down_proj.weight": loaded[f"layers.{layer_i}.feed_forward.w2.weight"],
                f"model.layers.{layer_i}.mlp.up_proj.weight": loaded[f"layers.{layer_i}.feed_forward.w3.weight"],
                f"model.layers.{layer_i}.input_layernorm.weight": loaded[f"layers.{layer_i}.attention_norm.weight"],
                f"model.layers.{layer_i}.post_attention_layernorm.weight": loaded[f"layers.{layer_i}.ffn_norm.weight"],
            }

I don’t understand why we need this, since the 7B model doesn’t use any parallelism strategy here.


I found the reason: the usage of rotary embeddings in the huggingface llama is different from the original facebook llama, so be careful!


Can you explain the reason? I have the same question


@unkmaster late reply, but if you or anyone else is wondering about this, it’s because of what @irasinn mentioned: the rotary embedding operations in huggingface transformers assume a different input layout than the facebook llama code.

When applying RoPE to some vector q, transformers assumes that q is laid out as

[ r_1, r_2, ..., r_n, i_1, i_2, ..., i_n ]

where r_k is the real part of the kth pair and i_k is the imaginary part.

But llama assumes it is laid out as

[ r_1, i_1, r_2, i_2, ..., r_n, i_n ]
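
In code, the two conventions look roughly like this (a minimal sketch, not the exact transformers or llama implementations; the names rope_half_split / rope_interleaved and the theta argument holding the n rotation angles are just for this example):

    import torch

    def rope_half_split(x, theta):
        # transformers-style layout: x = [r_1..r_n, i_1..i_n]
        # element k is paired with element k + n ("rotate_half")
        n = x.shape[-1] // 2
        cos = torch.cat([theta.cos(), theta.cos()], dim=-1)
        sin = torch.cat([theta.sin(), theta.sin()], dim=-1)
        rotated = torch.cat([-x[..., n:], x[..., :n]], dim=-1)
        return x * cos + rotated * sin

    def rope_interleaved(x, theta):
        # llama-style layout: x = [r_1, i_1, ..., r_n, i_n]
        # adjacent elements form a complex pair rotated by exp(i * theta_k)
        xc = torch.view_as_complex(x.reshape(*x.shape[:-1], -1, 2))
        rot = torch.polar(torch.ones_like(theta), theta)
        return torch.view_as_real(xc * rot).flatten(-2)

    # same complex pairs, two layouts, same rotation
    n = 4
    theta = torch.rand(n)
    pairs = torch.rand(n, 2)                # row k = (r_k, i_k)
    x_interleaved = pairs.flatten()         # r_1 i_1 r_2 i_2 ...
    x_half_split = pairs.t().flatten()      # r_1 .. r_n i_1 .. i_n

    out_i = rope_interleaved(x_interleaved, theta)
    out_h = rope_half_split(x_half_split, theta)
    assert torch.allclose(out_i.reshape(n, 2).t().flatten(), out_h)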

llama has learned wq and wk to output the latter, so we permute the matrices to output the former: we first view the matrix rows (per head) as [r_k, i_k] pairs, then transpose the pair and real/imaginary axes, and reshape back.
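
To check that the permute in the conversion script really turns the interleaved row layout into the half-split one, here is a toy-sized example (2 heads, head_dim 4; these sizes are made up just so the reordering is easy to read off):

    import torch

    n_heads, dim = 2, 8  # toy sizes, just for illustration

    def permute(w):
        # same reshape as in convert_llama_weights_to_hf.py
        return w.view(n_heads, dim // n_heads // 2, 2, dim).transpose(1, 2).reshape(dim, dim)

    # row k of w is filled with the value k, so we can track where each row goes
    w = torch.arange(dim).unsqueeze(1).repeat(1, dim)

    print(w[:, 0].tolist())           # [0, 1, 2, 3, 4, 5, 6, 7] -> r1 i1 r2 i2 | r1 i1 r2 i2
    print(permute(w)[:, 0].tolist())  # [0, 2, 1, 3, 4, 6, 5, 7] -> r1 r2 i1 i2 | r1 r2 i1 i2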
