Hi,
I have created a custom Llama2 model by replacing all linear layers with a custom linear layer, as follows:
```python
def replace_quantlinear_layers(model):
    # Collect the names and modules to be replaced
    layers_to_replace = {}
    for name, module in model.named_modules():
        if isinstance(module, QuantLinear):
            layers_to_replace[name] = module

    # Snapshot the module tree once so parents can be looked up by name
    all_modules = dict(model.named_modules())

    # Replace the layers
    for name, module in layers_to_replace.items():
        # Create a new instance of the custom quantized layer
        new_linear = CustomLinear(
            module.bits,
            module.group_size,
            module.infeatures,
            module.outfeatures,
            module.bias is not None,
        )
        # Transfer weights (and biases) from the original layer
        new_linear.qweight.data = module.qweight.data.clone().to("cuda")
        new_linear.qzeros.data = module.qzeros.data.clone().to("cuda")
        new_linear.scales.data = module.scales.data.clone().to("cuda")
        new_linear.wf.data = module.wf.data.clone().to("cuda")
        if module.bias is not None:
            new_linear.bias.data = module.bias.data.clone().to("cuda")

        # Find the parent module and replace the original layer with the new one
        if "." in name:
            parent_name, child_name = name.rsplit(".", 1)
            setattr(all_modules[parent_name], child_name, new_linear)
        else:
            # For top-level modules
            setattr(model, name, new_linear)
```
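For reference, the parent-lookup and `setattr` pattern used above can be sketched independently of PyTorch. This is a minimal pure-Python illustration only; `Container`, `Leaf`, and `NewLeaf` are hypothetical stand-ins for the model, `QuantLinear`, and `CustomLinear`, and `named_modules` mimics torch's dotted-name iteration:

```python
class Leaf:
    """Stand-in for the layer type being replaced (e.g. QuantLinear)."""

class NewLeaf:
    """Stand-in for the replacement layer (e.g. CustomLinear)."""

class Container:
    """Toy 'model' with one top-level and one nested Leaf."""
    def __init__(self):
        self.proj = Leaf()
        self.block = type("Block", (), {})()  # anonymous sub-module
        self.block.inner = Leaf()

    def named_modules(self):
        # Mimic torch's (dotted_name, module) iteration for this toy tree
        yield "proj", self.proj
        yield "block", self.block
        yield "block.inner", self.block.inner

def replace_leaves(model):
    # Collect target names first, then mutate, so iteration stays safe
    to_replace = [name for name, m in model.named_modules()
                  if isinstance(m, Leaf)]
    modules = dict(model.named_modules())
    for name in to_replace:
        new = NewLeaf()
        if "." in name:
            # Split "block.inner" into parent "block" and child "inner"
            parent_name, child_name = name.rsplit(".", 1)
            setattr(modules[parent_name], child_name, new)
        else:
            # Top-level attribute: set directly on the model
            setattr(model, name, new)

m = Container()
replace_leaves(m)
# Both the top-level and the nested Leaf are now NewLeaf instances
```

The key design point is splitting the dotted name with `rsplit(".", 1)` so only the immediate parent is looked up; replacing the child attribute on that parent rewires the tree without touching any other module.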
The original model is `TheBloke/Llama-2-7B-Chat-GPTQ`.
The resulting model prints as:
```
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (rotary_emb): LlamaRotaryEmbedding()
          (k_proj): CustomLinear()
          (o_proj): CustomLinear()
          (q_proj): CustomLinear()
          (v_proj): CustomLinear()
        )
        (mlp): LlamaMLP(
          (act_fn): SiLU()
          (down_proj): CustomLinear()
          (gate_proj): CustomLinear()
          (up_proj): CustomLinear()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): CustomLinear(in_features=4096, out_features=32000, bias=False)
)
```
All other layers are unchanged; only the `QuantLinear` and `Linear` layers of the GPTQ Llama 2 model have been replaced with my custom linear layers.
What is the easiest and quickest way to upload this model to the Hugging Face Hub so that I can quickly load the weights and run it? Do I have to write a complete model definition in PyTorch and a config?