When we load in 4-bit, the linear layers are replaced with Linear4bit layers, and these layers report half the number of parameters. I am still not clear on why the parameter count becomes half.
In the source code, when 4-bit is picked, the parameter count is divided by 2.
I don’t know why; you can check the code.
I still don’t have a clear answer for this and I would love to know, bumping for visibility.
I too have the same doubt. @sgugger @sayakpaul sorry to tag you guys… I am thinking I could use your help here.
Hi everyone. Have you figured this out? Is there a way to get access to the full weights with 4-bit quantization?
cc @ybelkada
This is normal: a torch.int4 dtype is not supported in PyTorch. Instead, we pack the 4-bit data into a torch.int8 tensor, hence the number of parameters is divided by 2 when we quantize in 4-bit!
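To make the packing idea concrete, here is a minimal sketch of my own (an illustration, not the actual bitsandbytes implementation): two 4-bit values share each 8-bit storage element, so the stored tensor has half as many elements as the original weight matrix.

```python
import torch

def pack_4bit(values: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit values (0..15) into single uint8 elements.

    Illustrative only; the real bitsandbytes packing layout may differ.
    """
    assert values.numel() % 2 == 0, "need an even number of elements to pair up"
    flat = values.flatten()
    high = flat[0::2] & 0x0F  # first value of each pair -> high nibble
    low = flat[1::2] & 0x0F   # second value of each pair -> low nibble
    return ((high << 4) | low).to(torch.uint8)

def unpack_4bit(packed: torch.Tensor) -> torch.Tensor:
    """Recover the original 4-bit values from the packed uint8 tensor."""
    high = (packed >> 4) & 0x0F
    low = packed & 0x0F
    return torch.stack([high, low], dim=1).flatten()

# A weight matrix quantized to 4-bit codes (one code per original weight):
w = torch.randint(0, 16, (4096, 4096), dtype=torch.uint8)
packed = pack_4bit(w)
print(w.numel(), packed.numel())  # the packed tensor holds half the elements
```

So if a framework counts parameters by counting elements of the stored tensors, the 4-bit model shows half the count even though it still represents the same number of weights.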
From @marcsun13