Fewer Trainable Parameters After Quantization

I have been playing with quantization and PEFT, and I noticed that the number of trainable parameters drops significantly after applying quantization (but before applying PEFT). Does anyone know why this happens? Is it normal?

Here is an example:

Counting Functions:

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

def count_trainable_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

Normal Model:

from transformers import AutoModelForCausalLM

model_name = "ehartford/WizardLM-Uncensored-Falcon-7b"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True
)

print_trainable_parameters(model)
print(count_trainable_params(model))
# trainable params: 6921725248 || all params: 6921725248 || trainable%: 100.0
# 6921725248

Quantized One:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "ehartford/WizardLM-Uncensored-Falcon-7b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    quantization_config=bnb_config, 
    trust_remote_code=True
)

print_trainable_parameters(model)
print(count_trainable_params(model))
# trainable params: 295773504 || all params: 3608749376 || trainable%: 8.19601122668819
# 295773504

Actually, I pasted this exact question into ChatGPT, and here is its response:

Yes, it is normal for the number of trainable parameters to decrease after applying quantization to a neural network. Quantization is a technique that reduces the precision of the weights and activations in the network, typically from floating-point values (32-bit or 16-bit) to lower-precision integer values (such as 8-bit or 4-bit).

When you apply quantization to a neural network, some weights and activations may become redundant and can be removed. For example, if an 8-bit weight has only 4 unique values, it can be represented using only 2 bits instead of 8 bits, which reduces the size of the weight tensor and the number of trainable parameters. Additionally, if the quantization technique is designed to impose sparsity, some of the weights may become zero and can be pruned, further reducing the number of trainable parameters.

In your example, you applied quantization using the BitsAndBytesConfig from the Hugging Face Transformers library. This configuration applies a 4-bit quantization scheme called “nf4” that reduces the precision of the weights to 4 bits and uses half-precision floating-point numbers to compute the quantized weights. This accounts for the significant reduction in the number of trainable parameters in the quantized model.

It’s worth noting that the reduction in the number of trainable parameters does not necessarily mean a loss in model performance. In fact, quantization can often improve the performance of neural networks by reducing their memory footprint and increasing their inference speed.

That sounds reasonable, although I don’t think pruning would be done on the fly. However, I have no way of knowing how much ChatGPT is hallucinating.

After some investigation, I think it might be because Linear4bit (https://github.com/TimDettmers/bitsandbytes/blob/main/bitsandbytes/nn/modules.py#L207) sets requires_grad=False on its weights, which would remove a lot of parameters from the trainable count.
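One way to check the requires_grad theory is to group the parameter counts by dtype and requires_grad instead of looking only at the two totals. The helper below is a minimal sketch; `param_breakdown` and `print_param_breakdown` are hypothetical names, and they only assume that the model exposes `named_parameters()` with parameters that have `dtype`, `requires_grad` and `numel()`, as a `torch.nn.Module` does. On a 4-bit bitsandbytes model I would expect the quantized weights to appear as a separate non-trainable group, but that is an assumption to verify on your own model.

```python
from collections import defaultdict


def param_breakdown(model):
    """Return {(dtype_name, requires_grad): total_numel} for a model.

    Works with any object exposing named_parameters(), e.g. a torch.nn.Module.
    """
    counts = defaultdict(int)
    for _, param in model.named_parameters():
        key = (str(param.dtype), param.requires_grad)
        counts[key] += param.numel()
    return dict(counts)


def print_param_breakdown(model):
    """Print one line per (dtype, requires_grad) group."""
    for (dtype, trainable), n in sorted(param_breakdown(model).items()):
        print(f"dtype={dtype:<16} requires_grad={trainable!s:<5} numel={n}")
```

Calling `print_param_breakdown(model)` on both the normal and the quantized model should show which groups lose elements and which groups stop being trainable.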

Hello, @dfrank.

Looking for an answer to the same question. Along with what you have mentioned above, the total number of parameters also drops after quantization. Please let me know if you figured out why.


+1, looking for an answer to the same question: why does the total number of parameters in the model (all parameters) drop as well with just quantization? This method should not prune the model in any way.

One explanation I have comes from @dfrank's ChatGPT answer, which states that with quantization some parameters may become 0 and the matrices may become sparser overall, so that pruning can be applied. That sounds logical, but how is this procedure justified in the construction of the model (does torch or bitsandbytes do this automatically)?

Can @TheBloke provide us with an answer? We would appreciate some clarity on what is happening here.


Has anyone found an answer to this in the meantime?
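A quick arithmetic check on the numbers from the first post suggests that no pruning is needed to explain the smaller total. The sketch below rests on two assumptions: the 4-bit weights are stored packed, two values per uint8 element, so `numel()` reports half as many elements for them, and the parameters that remain trainable in the quantized model are exactly the ones that were not quantized. Under those assumptions the predicted total matches the reported one.

```python
# Numbers reported in the original post.
full_total = 6_921_725_248       # all params, unquantized model
quant_trainable = 295_773_504    # trainable params, 4-bit model
quant_total = 3_608_749_376      # all params, 4-bit model

# Assumption: everything that is not trainable after quantization
# is a quantized linear weight.
quantized_weights = full_total - quant_trainable

# Assumption: two 4-bit values are packed into each stored element,
# so numel() counts half as many elements for these weights.
packed_elements = quantized_weights // 2

predicted_total = packed_elements + quant_trainable
print(predicted_total, predicted_total == quant_total)
```

If this reading is right, the model still has the same number of logical weights; only the way they are stored and counted changes.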