Fewer Trainable Parameters After Quantization

Hey all!
I think I have a reasonably satisfying answer to this question.

If you load Mistral-7B without quantization and print out the named parameters from the embedding layer and the first decoder layer (along with their shape, number of elements, and whether they’re trainable), here’s what you’ll see:

Parameter Name                                              Dimensions       Total Values    Trainable

==== Embedding Layer ====

model.embed_tokens.weight                                   32,000 x 4,096           125M    True

==== First Decoder ====

model.layers.0.self_attn.q_proj.weight                       4,096 x 4,096            16M    True
model.layers.0.self_attn.k_proj.weight                       1,024 x 4,096             4M    True
model.layers.0.self_attn.v_proj.weight                       1,024 x 4,096             4M    True
model.layers.0.self_attn.o_proj.weight                       4,096 x 4,096            16M    True
model.layers.0.mlp.gate_proj.weight                         14,336 x 4,096            56M    True
model.layers.0.mlp.up_proj.weight                           14,336 x 4,096            56M    True
model.layers.0.mlp.down_proj.weight                          4,096 x 14,336           56M    True
model.layers.0.input_layernorm.weight                        4,096 x -                 4K    True
model.layers.0.post_attention_layernorm.weight               4,096 x -                 4K    True

These are the correct weight matrix shapes and parameter counts.
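
For reference, here is a minimal sketch of the kind of loop that produces a table like the one above. The checkpoint name and the layer filter are my assumptions, not code from the original post:

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed checkpoint; any Mistral-7B variant behaves the same way.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,  # no quantization
)

# Report name, shape, element count, and trainability for the embedding
# and the first decoder layer only.
for name, p in model.named_parameters():
    if name.startswith(("model.embed_tokens", "model.layers.0.")):
        dims = " x ".join(f"{d:,}" for d in p.shape)
        print(f"{name:<55} {dims:>20} {p.numel():>14,} {p.requires_grad}")
```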

But with 4-bit quantization enabled, this becomes:

Parameter Name                                              Dimensions       Total Values    Trainable

==== Embedding Layer ====

model.embed_tokens.weight                                   32,000 x 4,096           125M    True

==== First Decoder ====

model.layers.0.self_attn.q_proj.weight                   8,388,608 x 1                 8M    False
model.layers.0.self_attn.k_proj.weight                   2,097,152 x 1                 2M    False
model.layers.0.self_attn.v_proj.weight                   2,097,152 x 1                 2M    False
model.layers.0.self_attn.o_proj.weight                   8,388,608 x 1                 8M    False
model.layers.0.mlp.gate_proj.weight                     29,360,128 x 1                28M    False
model.layers.0.mlp.up_proj.weight                       29,360,128 x 1                28M    False
model.layers.0.mlp.down_proj.weight                     29,360,128 x 1                28M    False
model.layers.0.input_layernorm.weight                        4,096 x -                 4K    True
model.layers.0.post_attention_layernorm.weight               4,096 x -                 4K    True
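
For completeness, here is a sketch of a 4-bit load that produces a table like this one; the exact BitsAndBytesConfig options are my assumption, since only “4-bit quantization” is mentioned above. The same printing loop as before is then reused on the quantized model:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # assumed; fp4 packs the same way
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",            # same assumed checkpoint as above
    quantization_config=bnb_config,
    device_map="auto",
)
# Re-running the printing loop on model_4bit gives the flattened uint8 shapes above.
```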

The weight matrices have been flattened into column vectors and the number of elements has been cut in half. That’s because bitsandbytes packs two 4-bit values into each uint8 element: q_proj’s 4,096 x 4,096 = 16,777,216 weights, at 4 bits apiece, fit into 8,388,608 bytes. Notice also that the packed weights are now marked as not trainable; you can’t backpropagate into packed integer data, which is why fine-tuning a 4-bit model is normally done through adapters (e.g. LoRA) rather than the frozen base weights.
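
You can check the packing directly by inspecting one of the quantized parameters. A sketch, assuming the model_4bit object from the snippet above and a recent bitsandbytes release (older versions may not expose quant_state the same way):

```python
w = model_4bit.model.layers[0].self_attn.q_proj.weight

print(type(w))              # bitsandbytes Params4bit (a torch.nn.Parameter subclass)
print(w.dtype)              # torch.uint8 -- each byte packs two 4-bit values
print(w.shape)              # torch.Size([8388608, 1])
print(w.quant_state.shape)  # torch.Size([4096, 4096]) -- the original, unpacked shape

# 4,096 * 4,096 = 16,777,216 weights at 4 bits each = 8,388,608 bytes of storage.
```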

Someone discussed this a bit here, saying: “in general the quantized weight is not simply saved as a quantized tensor with X elements each having Y bits, rather it has to be saved as packedparams…”.

So, with the quantized model, if you try to count the model parameters by looping over the weights and tallying their numel, you’ll get the wrong total: every packed 4-bit layer reports only half of its real parameter count.
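
One way around this is a counting helper that recovers the logical shape from each 4-bit parameter’s quant_state. This is only a sketch, and it assumes a recent bitsandbytes where Params4bit carries a quant_state holding the original shape (a cruder alternative is simply doubling the numel of every packed uint8 weight):

```python
import math
import bitsandbytes as bnb

def true_param_count(model):
    """Count parameters, expanding bitsandbytes 4-bit weights to their logical size."""
    total = 0
    for p in model.parameters():
        if isinstance(p, bnb.nn.Params4bit) and p.quant_state is not None:
            # quant_state.shape is the shape of the original, unpacked weight matrix.
            total += math.prod(p.quant_state.shape)
        else:
            total += p.numel()
    return total

print(f"{true_param_count(model_4bit):,}")   # should match the unquantized total
```

With that adjustment the quantized and unquantized models report the same total: only the storage format changed, not the number of underlying weights.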