Fewer Trainable Parameters after quantization

I have been playing with quantization and PEFT, and I noticed that the number of trainable parameters is significantly reduced after applying quantization (but before applying PEFT). Does anyone know why this happens? Is it normal?

Here is an example:

Counting Functions:

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

def count_trainable_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

Normal Model:

from transformers import AutoModelForCausalLM

model_name = "ehartford/WizardLM-Uncensored-Falcon-7b"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True
)

print_trainable_parameters(model)
count_trainable_params(model)
# trainable params: 6921725248 || all params: 6921725248 || trainable%: 100.0
# 6921725248

Quantized One:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "ehartford/WizardLM-Uncensored-Falcon-7b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    quantization_config=bnb_config, 
    trust_remote_code=True
)

print_trainable_parameters(model)
count_trainable_params(model)
# trainable params: 295773504 || all params: 3608749376 || trainable%: 8.19601122668819
# 295773504

Actually, I pasted this exact question into ChatGPT, and here is its response:

Yes, it is normal for the number of trainable parameters to decrease after applying quantization to a neural network. Quantization is a technique that reduces the precision of the weights and activations in the network, typically from floating-point values (32-bit or 16-bit) to lower-precision integer values (such as 8-bit or 4-bit).

When you apply quantization to a neural network, some weights and activations may become redundant and can be removed. For example, if an 8-bit weight has only 4 unique values, it can be represented using only 2 bits instead of 8 bits, which reduces the size of the weight tensor and the number of trainable parameters. Additionally, if the quantization technique is designed to impose sparsity, some of the weights may become zero and can be pruned, further reducing the number of trainable parameters.

In your example, you applied quantization using the BitsAndBytesConfig from the Hugging Face Transformers library. This configuration applies a 4-bit quantization scheme called “nf4” that reduces the precision of the weights to 4 bits and uses half-precision floating-point numbers to compute the quantized weights. This accounts for the significant reduction in the number of trainable parameters in the quantized model.

It’s worth noting that the reduction in the number of trainable parameters does not necessarily mean a loss in model performance. In fact, quantization can often improve the performance of neural networks by reducing their memory footprint and increasing their inference speed.

Sounds reasonable, although I don’t think any pruning is actually applied on the fly. However, I have no way of knowing how much ChatGPT is hallucinating.

After some investigation, I think it might be because Linear4bit (https://github.com/TimDettmers/bitsandbytes/blob/main/bitsandbytes/nn/modules.py#L207) sets requires_grad=False on the quantized weights, which removes a lot of parameters from the trainable count.
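You can check this directly; a quick sketch, assuming the model was loaded with the 4-bit BitsAndBytesConfig from my example above:

import bitsandbytes as bnb

# Every Linear4bit module stores its weight with requires_grad=False,
# so those weights simply drop out of the trainable-parameter tally.
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear4bit):
        print(name, module.weight.requires_grad)  # expected: False
        break  # one layer is enough to see the pattern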

Hello, @dfrank.

I’m looking for an answer to the same question. Along with what you mentioned above, the total number of parameters is reduced post-quantization as well. Please let me know if you figured out why.


+1, I have the same question: why does the total number of parameters (all parameters) also drop with quantization alone? This method should not prune the model in any way, should it?

One explanation I have is from @dfrank’s ChatGPT answer, which says that with quantization some parameters may become 0 and the matrices may become sparser, so that pruning can be applied. That sounds logical, but how is this procedure justified in the construction of the model (does torch or bitsandbytes do this automatically)?

Can @TheBloke provide us with an answer? We would appreciate some clarity on what is happening here.


Has anyone found an answer to this in the meantime?

Hey all!
I think I have a reasonably satisfying answer to this question.

If you load Mistral-7B without quantization, and print out the named parameters from a single decoder layer (along with their shape, number of elements, and whether they’re set to train), here’s what you’ll see:

Parameter Name                                              Dimensions       Total Values    Trainable

==== Embedding Layer ====

model.embed_tokens.weight                                   32,000 x 4,096           125M    True

==== First Decoder ====

model.layers.0.self_attn.q_proj.weight                       4,096 x 4,096            16M    True
model.layers.0.self_attn.k_proj.weight                       1,024 x 4,096             4M    True
model.layers.0.self_attn.v_proj.weight                       1,024 x 4,096             4M    True
model.layers.0.self_attn.o_proj.weight                       4,096 x 4,096            16M    True
model.layers.0.mlp.gate_proj.weight                         14,336 x 4,096            56M    True
model.layers.0.mlp.up_proj.weight                           14,336 x 4,096            56M    True
model.layers.0.mlp.down_proj.weight                          4,096 x 14,336           56M    True
model.layers.0.input_layernorm.weight                        4,096 x -                 4K    True
model.layers.0.post_attention_layernorm.weight               4,096 x -                 4K    True

These are the correct weight matrix shapes and parameter counts.

But with 4-bit quantization enabled, this becomes:

Parameter Name                                              Dimensions       Total Values    Trainable

==== Embedding Layer ====

model.embed_tokens.weight                                   32,000 x 4,096           125M    True

==== First Decoder ====

model.layers.0.self_attn.q_proj.weight                   8,388,608 x 1                 8M    False
model.layers.0.self_attn.k_proj.weight                   2,097,152 x 1                 2M    False
model.layers.0.self_attn.v_proj.weight                   2,097,152 x 1                 2M    False
model.layers.0.self_attn.o_proj.weight                   8,388,608 x 1                 8M    False
model.layers.0.mlp.gate_proj.weight                     29,360,128 x 1                28M    False
model.layers.0.mlp.up_proj.weight                       29,360,128 x 1                28M    False
model.layers.0.mlp.down_proj.weight                     29,360,128 x 1                28M    False
model.layers.0.input_layernorm.weight                        4,096 x -                 4K    True
model.layers.0.post_attention_layernorm.weight               4,096 x -                 4K    True

The weight matrices have been flattened and the number of elements has been cut in half.

Someone discussed this a bit here saying: “in general the quantized weight is not simply saved as a quantized tensor with X elements each having Y bits, rather it has to be saved as packedparams…”.

So, with the quantized model, if you try to count the model parameters by looping over the weights and tallying their numel, you’ll get the wrong total.
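(If you still want a closer count, here is a rough sketch of a corrected tally. It assumes a recent bitsandbytes where a 4-bit weight carries a quant_state whose .shape records the original, unpacked weight shape; check your version before relying on it.)

import math

def count_params_unpacked(model):
    """Parameter count that undoes the 4-bit packing (sketch, see caveat above)."""
    total = 0
    for _, param in model.named_parameters():
        quant_state = getattr(param, "quant_state", None)
        if quant_state is not None:
            # Packed uint8 storage: numel() is half the real element count,
            # so use the original shape recorded in the quant state instead.
            total += math.prod(quant_state.shape)
        else:
            total += param.numel()
    return total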

As for the number of trainable parameters…

The print_trainable_parameters function iterates over the “named parameters” (the different weight matrices) and, if they are set to train, it adds all of the elements in that weight matrix to the tally. ChatGPT’s comments about individual values being trainable or not are leading us astray; that’s not relevant here (and I don’t know whether it’s even true :man_shrugging:).

So loading in 4-bit breaks that parameter counting code. It’s not as simple as doubling it, either, because note how the Mistral embedding matrix didn’t change size in the quantized version.

Lastly, I think we need to be careful in what significance we give to seeing the number of trainable parameters going down… You can arbitrarily reduce the number of “trainable parameters” in a model simply by choosing to freeze parts of the model (you just set requires_grad = False on a weight matrix), and I think that’s all that’s happening here.
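To convince yourself that freezing alone produces this effect, here is a toy sketch (plain PyTorch, no quantization involved), reusing the print_trainable_parameters function from the first post:

import torch.nn as nn

# A tiny toy model, just to show that freezing shrinks the reported
# "trainable params" without removing anything from the model.
toy = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 10))
print_trainable_parameters(toy)   # trainable%: 100.0

for p in toy[0].parameters():     # freeze the first layer
    p.requires_grad = False

print_trainable_parameters(toy)   # trainable% drops to 50.0, same model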

Don’t conflate that with “parameter efficient fine tuning” techniques like LoRA, where you get to train fewer parameters while still getting a similar effect to training all of them.
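For contrast, here is roughly what the LoRA route looks like, as a sketch using the peft library (the target_modules value below is an assumption for Falcon-style models and varies by architecture):

from peft import LoraConfig, get_peft_model

# LoRA keeps the frozen (quantized) base weights and adds small trainable
# adapter matrices on top of the chosen projection layers.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],  # assumed Falcon-style attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter parameters are trainable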

So I went and investigated the actual counting of parameters, and this is the behaviour (using Mistral-7B as an example): the quantized layers have half the parameters of the non-quantized layers.

And the reason for this is the actual Linear4bit implementation (bitsandbytes/bitsandbytes/nn/modules.py at 048a2d404c6a909e6f835ba18182fbfae130ba09 · TimDettmers/bitsandbytes · GitHub)

Gemini explains this better than I can, hehe:

Let's consider a simplified example with a Linear4bit layer having 4 input features and 4 output features. This means the original weight tensor (before quantization) would have a shape of [4, 4], containing 16 individual weights.

1. Original Weight Tensor (FP16):
weights = torch.randn(4, 4, dtype=torch.float16)
print(weights)
This might output something like:
tensor([[ 0.2344, -0.1234,  0.5678, -0.9876],
        [-0.4567,  0.8765, -0.3456,  0.7890],
        [ 0.1234, -0.5678,  0.9876, -0.2345],
        [-0.7890,  0.3456, -0.7890,  0.1234]], dtype=torch.float16)

2. Quantization and Packing:
When we quantize this layer using Linear4bit, each of these 16 weights will be converted to a 4-bit representation. Since 8 bits make up a byte, we can pack two 4-bit weights into a single element of a torch.uint8 tensor.

3. Packed Weight Tensor (uint8):
After quantization and packing, the weight tensor will have a shape of [8, 1]. Each element in this tensor will hold two 4-bit quantized weights.
The exact values in the packed tensor will depend on the specific quantization map used (e.g., fp4 or nf4). However, the key point is that the information from the original 16 weights is now stored in 8 elements, effectively achieving a 50% reduction in memory usage for the weights.
Step by step, here is how the shape goes from [4, 4] to [8, 1]:
1. Original Shape:
We start with a weight tensor of shape [4, 4], containing 16 individual 16-bit weights.
2. 4-bit Quantization:
Each 16-bit weight is converted to a 4-bit representation. We now have 16 4-bit weights.
3. Packing:
Since 8 bits make up a byte, we can pack two 4-bit weights into a single element of a torch.uint8 tensor.
Therefore, the 16 4-bit weights can be packed into 8 elements.
4. Final Shape:
The most logical and efficient way to store these 8 packed elements would be in a 1D tensor of shape [8].
However, the Linear4bit layer implementation uses a slightly different approach for internal reasons related to memory alignment and computation.
It stores the packed weights in a 2D tensor of shape [8, 1].
This shape technically has the same number of elements (8) as the 1D [8] shape, but it's organized differently in memory.
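If you want to see the two-per-byte packing concretely, here is a toy pure-PyTorch sketch (just an illustration of the idea, not the actual bitsandbytes kernel):

import torch

# 16 four-bit codes (values 0..15) standing in for quantized weight indices.
codes = torch.randint(0, 16, (16,), dtype=torch.uint8)

# Pack two codes per byte: high nibble from even positions, low nibble from odd ones.
packed = ((codes[0::2] << 4) | codes[1::2]).reshape(-1, 1)  # shape [8, 1], like Linear4bit

# Unpacking recovers all 16 original codes.
flat = packed.flatten()
unpacked = torch.stack(((flat >> 4) & 0xF, flat & 0xF), dim=1).flatten()
assert torch.equal(unpacked, codes)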

Hi @jjovalle99, do you mind sharing your code for checking the parameters of the quantized/non-quantized models? Thank you.

Sure! TBH it’s nothing fancy:

import pandas as pd
from transformers import AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # quantization_config=bnb_config,
    device_map='auto',
)

params_dict = {
    'Name': [],
    'Shape': [],
    'Parameters': [],
    'RequiresGrad': []
}

for name, param in model.named_parameters():
    params_dict['Name'].append(name)
    params_dict['Shape'].append(param.shape)
    params_dict['Parameters'].append(param.numel())
    params_dict['RequiresGrad'].append(param.requires_grad)

pd.DataFrame(params_dict)
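For the quantized run, the quantization_config line gets uncommented and pointed at a 4-bit config like the one from the first post (the exact settings below are an assumption):

import torch
from transformers import BitsAndBytesConfig

# Same kind of 4-bit setup as in the original question; with this passed to
# from_pretrained, the loop above prints the packed [N, 1] uint8 weight
# shapes with RequiresGrad=False for every Linear4bit layer.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)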

This happens with Whisper as well. Quantizing does have an effect, both quantitatively and qualitatively. I’ve trained and tested hundreds of Whisper models out in the real world. I think what people need to be careful of are eval metrics. I’ve trained models that had great numbers but struggled with basic Japanese audio translations in practice, and also the opposite: horrible WER but the model understood things that surprise me even now. Whispers can be a bit odd, though.

Whisper-tiny
trainable params: 37184640 || all params: 37760640 || trainable%: 98.47
trainable params: 20640384 || all params: 29503104 || trainable%: 69.96

That explains a lot…

I wrote a forward hook to undo that but didn’t know what was doing it… Now I do.
Good catch.

“requires_grad=False, # quantized weights should be frozen by default”