Parameter Count & Shape Discrepancies in 4-bit vs. Higher-Bit LLMs

I’ve run into a puzzling issue with model quantization and would appreciate insights from the community. Specifically, I’ve observed that parameter counts and shapes differ between 4-bit and 8/16/32-bit versions of the same model.

I loaded the same checkpoint at different bit widths, and the 4-bit model reports different parameter counts and shapes than its 8/16/32-bit counterparts. I checked this on both Phi and Llama.

from transformers import AutoModelForCausalLM

# Same checkpoint, loaded in 4-bit and 8-bit precision
m1 = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True, load_in_4bit=True)
m2 = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True, load_in_8bit=True)

# Compare the shape of the same parameter in both models
print(list(m1.parameters())[1].shape, list(m2.parameters())[1].shape)
# Output: torch.Size([3276800, 1]) torch.Size([2560, 2560])
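
For anyone wanting to reproduce, here’s a quick way to inspect what the 4-bit parameter actually is (a minimal sketch, continuing from the snippet above; I’m assuming the 4-bit path goes through bitsandbytes, since that’s what load_in_4bit uses):

# Inspect the class and dtype of the same parameter in both models;
# if the 4-bit weights are stored packed, the dtype here should be an
# integer storage type rather than fp16/fp32.
p4 = list(m1.parameters())[1]
p8 = list(m2.parameters())[1]
print(type(p4), p4.dtype)
print(type(p8), p8.dtype)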

Why does the 4-bit model report different parameter shapes and counts than the higher-bit models? Is this expected behaviour from the quantization process, or is something else going on?
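
One thing I noticed while writing this up: the 4-bit parameter has exactly half as many elements as the original 2560 × 2560 matrix, which would be consistent with two 4-bit values being packed into each stored byte (just my guess from the numbers):

# Sanity check on the counts from the output above
full_elems = 2560 * 2560            # 6,553,600 values in the unquantized weight matrix
packed_elems = 3276800              # elements reported by the 4-bit parameter ([3276800, 1])
print(full_elems / packed_elems)    # 2.0 -> two 4-bit values per stored element?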

I found a related discussion, “Less Trainable Parameters after quantization”, which hints at quantization affecting the trainable parameters.

Any insights or theories on why this discrepancy occurs would be incredibly valuable.

Looking forward to your thoughts and discussions!