Why do the llama weights in huggingface need a permute on wq/wk?

In convert_llama_weights_to_hf.py (huggingface/transformers on GitHub) there is a function called permute which is applied to wq and wk:

    # permute for sliced rotary
    def permute(w):
        return w.view(n_heads, dim // n_heads // 2, 2, dim).transpose(1, 2).reshape(dim, dim)
    for layer_i in range(n_layers):
        filename = f"pytorch_model-{layer_i + 1}-of-{n_layers + 1}.bin"
        if model_size == "7B":
            # Unsharded
            state_dict = {
                f"model.layers.{layer_i}.self_attn.q_proj.weight": permute(
                    loaded[f"layers.{layer_i}.attention.wq.weight"]
                ),
                f"model.layers.{layer_i}.self_attn.k_proj.weight": permute(
                    loaded[f"layers.{layer_i}.attention.wk.weight"]
                ),
                f"model.layers.{layer_i}.self_attn.v_proj.weight": loaded[f"layers.{layer_i}.attention.wv.weight"],
                f"model.layers.{layer_i}.self_attn.o_proj.weight": loaded[f"layers.{layer_i}.attention.wo.weight"],
                f"model.layers.{layer_i}.mlp.gate_proj.weight": loaded[f"layers.{layer_i}.feed_forward.w1.weight"],
                f"model.layers.{layer_i}.mlp.down_proj.weight": loaded[f"layers.{layer_i}.feed_forward.w2.weight"],
                f"model.layers.{layer_i}.mlp.up_proj.weight": loaded[f"layers.{layer_i}.feed_forward.w3.weight"],
                f"model.layers.{layer_i}.input_layernorm.weight": loaded[f"layers.{layer_i}.attention_norm.weight"],
                f"model.layers.{layer_i}.post_attention_layernorm.weight": loaded[f"layers.{layer_i}.ffn_norm.weight"],
            }

I don’t understand why we need this, since the 7B model doesn’t use any parallelism strategy here.


I found the reason: the usage of rotary embeddings in the huggingface llama is different from the original facebook llama, so be careful!


Can you explain the reason? I have the same question


@unkmaster late reply, but if you or anyone else is wondering about this, it’s because of what @irasinn mentioned: the rotary embedding operations in huggingface transformers assume a different input layout than the facebook llama code.

When applying RoPE to some vector q, transformers assumes that q is laid out as

[ r_1, r_2, ..., r_n, i_1, i_2, ..., i_n ]

where r_k is the real part of the kth pair and i_k is the imaginary part.

But llama assumes it is laid out as

[ r_1, i_1, r_2, i_2, ..., r_n, i_n ]
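
In code, the two conventions look roughly like this (a minimal sketch, not the exact transformers or llama implementations; the names rope_half_split / rope_interleaved and the theta argument holding the n rotation angles are just for this example):

    import torch

    def rope_half_split(x, theta):
        # transformers-style layout: x = [r_1..r_n, i_1..i_n]
        # element k is paired with element k + n ("rotate_half")
        n = x.shape[-1] // 2
        cos = torch.cat([theta.cos(), theta.cos()], dim=-1)
        sin = torch.cat([theta.sin(), theta.sin()], dim=-1)
        rotated = torch.cat([-x[..., n:], x[..., :n]], dim=-1)
        return x * cos + rotated * sin

    def rope_interleaved(x, theta):
        # llama-style layout: x = [r_1, i_1, ..., r_n, i_n]
        # adjacent elements form a complex pair rotated by exp(i * theta_k)
        xc = torch.view_as_complex(x.reshape(*x.shape[:-1], -1, 2))
        rot = torch.polar(torch.ones_like(theta), theta)
        return torch.view_as_real(xc * rot).flatten(-2)

    # same complex pairs, two layouts, same rotation
    n = 4
    theta = torch.rand(n)
    pairs = torch.rand(n, 2)                # row k = (r_k, i_k)
    x_interleaved = pairs.flatten()         # r_1 i_1 r_2 i_2 ...
    x_half_split = pairs.t().flatten()      # r_1 .. r_n i_1 .. i_n

    out_i = rope_interleaved(x_interleaved, theta)
    out_h = rope_half_split(x_half_split, theta)
    assert torch.allclose(out_i.reshape(n, 2).t().flatten(), out_h)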

llama has learned wq and wk to output the latter, so we permute the matrices to output the former: we first view the matrix rows (per head) as [r_k, i_k] pairs, then transpose the pair and real/imaginary axes, and reshape back.
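
To check that the permute in the conversion script really turns the interleaved row layout into the half-split one, here is a toy-sized example (2 heads, head_dim 4; these sizes are made up just so the reordering is easy to read off):

    import torch

    n_heads, dim = 2, 8  # toy sizes, just for illustration

    def permute(w):
        # same reshape as in convert_llama_weights_to_hf.py
        return w.view(n_heads, dim // n_heads // 2, 2, dim).transpose(1, 2).reshape(dim, dim)

    # row k of w is filled with the value k, so we can track where each row goes
    w = torch.arange(dim).unsqueeze(1).repeat(1, dim)

    print(w[:, 0].tolist())           # [0, 1, 2, 3, 4, 5, 6, 7] -> r1 i1 r2 i2 | r1 i1 r2 i2
    print(permute(w)[:, 0].tolist())  # [0, 2, 1, 3, 4, 6, 5, 7] -> r1 r2 i1 i2 | r1 r2 i1 i2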
