Hi everyone,
I trained a quantized Phi-3 mini model with a LoRA adapter and Accelerate. At the end I saved the adapter, and I would now like to reload the model. This is how I loaded the model, trained it, and saved it (note: I did not use flash-attn because I have trouble installing it, but I get acceptable results anyway):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model

bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_use_double_quant=True,
                                bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)
peft_config = LoraConfig(inference_mode=False, r=16, lora_alpha=64, lora_dropout=0.1,
                         task_type=TaskType.CAUSAL_LM, target_modules="all-linear")
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, trust_remote_code=True)
model = get_peft_model(model, peft_config)
model, train_dataloader, val_dataloader, test_dataloader, optimizer, lr_scheduler = accelerator.prepare(
    model, train_dataloader, val_dataloader, test_dataloader, optimizer, lr_scheduler)
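The training loop itself is just the standard Accelerate pattern, roughly like this (simplified sketch; the actual epoch count, gradient accumulation, evaluation and logging code are left out):

for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)      # batches are already on the right device via accelerator.prepare
        loss = outputs.loss
        accelerator.backward(loss)    # instead of loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()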
After the training, I saved the model in the following way:
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(save_path, is_main_process=accelerator.is_main_process, save_function=accelerator.save)
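For reference, the reloading script does roughly this (the base model is loaded again with the same bnb_config as above; the PeftModel.from_pretrained call is the one at line 15 of the traceback below):

from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, trust_remote_code=True)
inference_model = PeftModel.from_pretrained(model, "./phi3_2e-5_10ep_temp1_500_checkpoint/")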
Now, when I try to load the model like this, I get many, many errors like the following:
Traceback (most recent call last):
  File "/homes/mkulcsar/behaviour_classification/llms/reload_phi3.py", line 15, in <module>
    inference_model=PeftModel.from_pretrained(model, "./phi3_2e-5_10ep_temp1_500_checkpoint/")
  File "/homes/mkulcsar/.conda/envs/new_autism_diag_bubble/lib/python3.9/site-packages/peft/peft_model.py", line 430, in from_pretrained
    model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
  File "/homes/mkulcsar/.conda/envs/new_autism_diag_bubble/lib/python3.9/site-packages/peft/peft_model.py", line 988, in load_adapter
    load_result = set_peft_model_state_dict(
  File "/homes/mkulcsar/.conda/envs/new_autism_diag_bubble/lib/python3.9/site-packages/peft/utils/save_and_load.py", line 353, in set_peft_model_state_dict
    load_result = model.load_state_dict(peft_model_state_dict, strict=False)
  File "/homes/mkulcsar/.conda/envs/new_autism_diag_bubble/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2189, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
size mismatch for base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([16, 3072]).
size mismatch for base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([3072, 16]).
size mismatch for base_model.model.model.layers.0.self_attn.qkv_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([16, 3072]).
size mismatch for base_model.model.model.layers.0.self_attn.qkv_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([9216, 16]).
size mismatch for base_model.model.model.layers.0.mlp.gate_up_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([16, 3072]).
size mismatch for base_model.model.model.layers.0.mlp.gate_up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([16384, 16]).
size mismatch for base_model.model.model.layers.0.mlp.down_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([16, 8192]).
size mismatch for base_model.model.model.layers.0.mlp.down_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([3072, 16]).
Since the number of characters is limited, I can't paste the whole traceback here, but these errors occur for every layer. Others have had similar issues, e.g. huggingface/peft issue #1443 on GitHub ("size mismatch ... copying a param with shape torch.Size([0]) from checkpoint"), but there has not really been a solution there. I would be very grateful for any advice!