Hi,
I have created a custom Llama2 model by replacing all linear layers with a custom linear layer, as follows:
```python
def replace_quantlinear_layers(model):
    # Collect the names and modules to be replaced
    layers_to_replace = {}
    for name, module in model.named_modules():
        if isinstance(module, QuantLinear):
            layers_to_replace[name] = module

    # Snapshot the module tree once so parents can be looked up by name
    all_modules = dict(model.named_modules())

    # Replace the layers
    for name, module in layers_to_replace.items():
        # Create a new instance of the custom quantized layer
        new_linear = CustomLinear(
            module.bits,
            module.group_size,
            module.infeatures,
            module.outfeatures,
            module.bias is not None,
        )
        # Transfer weights (and biases) from the original layer
        new_linear.qweight.data = module.qweight.data.clone().to("cuda")
        new_linear.qzeros.data = module.qzeros.data.clone().to("cuda")
        new_linear.scales.data = module.scales.data.clone().to("cuda")
        new_linear.wf.data = module.wf.data.clone().to("cuda")
        if module.bias is not None:
            new_linear.bias.data = module.bias.data.clone().to("cuda")

        # Find the parent module and replace the original layer with the new one
        if "." in name:
            parent_name, child_name = name.rsplit(".", 1)
            setattr(all_modules[parent_name], child_name, new_linear)
        else:
            # For top-level modules
            setattr(model, name, new_linear)
```
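For reference, the parent-lookup and `setattr` pattern used above can be sketched independently of PyTorch. This is a minimal pure-Python illustration only; `Container`, `Leaf`, and `NewLeaf` are hypothetical stand-ins for the model, `QuantLinear`, and `CustomLinear`, and `named_modules` mimics torch's dotted-name iteration:

```python
class Leaf:
    """Stand-in for the layer type being replaced (e.g. QuantLinear)."""

class NewLeaf:
    """Stand-in for the replacement layer (e.g. CustomLinear)."""

class Container:
    """Toy 'model' with one top-level and one nested Leaf."""
    def __init__(self):
        self.proj = Leaf()
        self.block = type("Block", (), {})()  # anonymous sub-module
        self.block.inner = Leaf()

    def named_modules(self):
        # Mimic torch's (dotted_name, module) iteration for this toy tree
        yield "proj", self.proj
        yield "block", self.block
        yield "block.inner", self.block.inner

def replace_leaves(model):
    # Collect target names first, then mutate, so iteration stays safe
    to_replace = [name for name, m in model.named_modules()
                  if isinstance(m, Leaf)]
    modules = dict(model.named_modules())
    for name in to_replace:
        new = NewLeaf()
        if "." in name:
            # Split "block.inner" into parent "block" and child "inner"
            parent_name, child_name = name.rsplit(".", 1)
            setattr(modules[parent_name], child_name, new)
        else:
            # Top-level attribute: set directly on the model
            setattr(model, name, new)

m = Container()
replace_leaves(m)
# Both the top-level and the nested Leaf are now NewLeaf instances
```

The key design point is splitting the dotted name with `rsplit(".", 1)` so only the immediate parent is looked up; replacing the child attribute on that parent rewires the tree without touching any other module.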
The original model is `TheBloke/Llama-2-7B-Chat-GPTQ`.
The resulting model prints as:
```
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (rotary_emb): LlamaRotaryEmbedding()
          (k_proj): CustomLinear()
          (o_proj): CustomLinear()
          (q_proj): CustomLinear()
          (v_proj): CustomLinear()
        )
        (mlp): LlamaMLP(
          (act_fn): SiLU()
          (down_proj): CustomLinear()
          (gate_proj): CustomLinear()
          (up_proj): CustomLinear()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): CustomLinear(in_features=4096, out_features=32000, bias=False)
)
```
All other layers are unchanged; only the `QuantLinear` and `Linear` layers of the GPTQ Llama 2 model have been replaced with my custom linear layers.
What is the easiest and quickest way to upload this model to the Hugging Face Hub so that I can quickly load the weights and run it? Do I have to write a complete model definition in PyTorch and a config?