Fewer Trainable Parameters after Quantization

I have been playing with quantization and PEFT, and I noticed that the number of trainable parameters is significantly reduced after applying quantization (but before applying PEFT). Does anyone know why this happens? Is it normal?

Here is an example:

Counting Functions:

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

def count_trainable_params(model):
    # Same count as above, but returned as a number instead of printed.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

Normal Model:

from transformers import AutoModelForCausalLM

model_name = "ehartford/WizardLM-Uncensored-Falcon-7b"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True
)

print_trainable_parameters(model)
count_trainable_params(model)
# trainable params: 6921725248 || all params: 6921725248 || trainable%: 100.0
# 6921725248

Quantized Model:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "ehartford/WizardLM-Uncensored-Falcon-7b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    quantization_config=bnb_config, 
    trust_remote_code=True
)

print_trainable_parameters(model)
count_trainable_params(model)
# trainable params: 295773504 || all params: 3608749376 || trainable%: 8.19601122668819
# 295773504
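
For reference, one quick way to poke at what the 4-bit load did to the individual tensors is to print the dtype, shape, and requires_grad flag of the first few parameters. This is just a minimal sketch, assuming the quantized model from the snippet above is still loaded; itertools.islice is only used to keep the output short:

import itertools

# Peek at the first few parameter tensors of the quantized model to see
# which ones are stored as packed low-precision buffers (no gradients)
# and which ones are still regular float tensors with requires_grad=True.
for name, param in itertools.islice(model.named_parameters(), 10):
    print(f"{name}: dtype={param.dtype}, shape={tuple(param.shape)}, requires_grad={param.requires_grad}")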

Actually, I copied this exact question into ChatGPT, and here is its response:

Yes, it is normal for the number of trainable parameters to decrease after applying quantization to a neural network. Quantization is a technique that reduces the precision of the weights and activations in the network, typically from floating-point values (32-bit or 16-bit) to lower-precision integer values (such as 8-bit or 4-bit).

When you apply quantization to a neural network, some weights and activations may become redundant and can be removed. For example, if an 8-bit weight has only 4 unique values, it can be represented using only 2 bits instead of 8 bits, which reduces the size of the weight tensor and the number of trainable parameters. Additionally, if the quantization technique is designed to impose sparsity, some of the weights may become zero and can be pruned, further reducing the number of trainable parameters.

In your example, you applied quantization using the BitsAndBytesConfig from the Hugging Face Transformers library. This configuration applies a 4-bit quantization scheme called “nf4” that reduces the precision of the weights to 4 bits and uses half-precision floating-point numbers to compute the quantized weights. This accounts for the significant reduction in the number of trainable parameters in the quantized model.

It’s worth noting that the reduction in the number of trainable parameters does not necessarily mean a loss in model performance. In fact, quantization can often improve the performance of neural networks by reducing their memory footprint and increasing their inference speed.

Sounds reasonable, although I don't think any pruning is actually being done on the fly here; still, I have no way of knowing how much ChatGPT is hallucinating.