Fewer Trainable Parameters after quantization

I have been playing with quantization and PEFT, and I noticed that the number of trainable parameters is significantly reduced after applying quantization (but before applying PEFT). Does anyone know why this happens? Is it normal?

Here is an example:

Counting Functions:

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

def count_trainable_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

Normal Model:

from transformers import AutoModelForCausalLM

model_name = "ehartford/WizardLM-Uncensored-Falcon-7b"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True
)

print_trainable_parameters(model)
count_trainable_params(model)
# trainable params: 6921725248 || all params: 6921725248 || trainable%: 100.0
# 6921725248

Quantized One:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "ehartford/WizardLM-Uncensored-Falcon-7b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    quantization_config=bnb_config, 
    trust_remote_code=True
)

print_trainable_parameters(model)
count_trainable_params(model)
# trainable params: 295773504 || all params: 3608749376 || trainable%: 8.19601122668819
# 295773504

Actually, I pasted this exact question into ChatGPT, and here is its response:

Yes, it is normal for the number of trainable parameters to decrease after applying quantization to a neural network. Quantization is a technique that reduces the precision of the weights and activations in the network, typically from floating-point values (32-bit or 16-bit) to lower-precision integer values (such as 8-bit or 4-bit).

When you apply quantization to a neural network, some weights and activations may become redundant and can be removed. For example, if an 8-bit weight has only 4 unique values, it can be represented using only 2 bits instead of 8 bits, which reduces the size of the weight tensor and the number of trainable parameters. Additionally, if the quantization technique is designed to impose sparsity, some of the weights may become zero and can be pruned, further reducing the number of trainable parameters.

In your example, you applied quantization using the BitsAndBytesConfig from the Hugging Face Transformers library. This configuration applies a 4-bit quantization scheme called “nf4” that reduces the precision of the weights to 4 bits and uses half-precision floating-point numbers to compute the quantized weights. This accounts for the significant reduction in the number of trainable parameters in the quantized model.

It’s worth noting that the reduction in the number of trainable parameters does not necessarily mean a loss in model performance. In fact, quantization can often improve the performance of neural networks by reducing their memory footprint and increasing their inference speed.

Sounds reasonable, although I don’t think any pruning is actually applied on the fly. However, I have no way of knowing how much ChatGPT is hallucinating.

After some investigation, I think it might be because Linear4bit (https://github.com/TimDettmers/bitsandbytes/blob/main/bitsandbytes/nn/modules.py#L207) sets requires_grad=False on the quantized weights, which removes a lot of parameters from the trainable count.
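You can check this directly; a quick sketch, assuming the model was loaded with the 4-bit BitsAndBytesConfig from my example above:

import bitsandbytes as bnb

# Every Linear4bit module stores its weight with requires_grad=False,
# so those weights simply drop out of the trainable-parameter tally.
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear4bit):
        print(name, module.weight.requires_grad)  # expected: False
        break  # one layer is enough to see the pattern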

Hello, @dfrank.

I’m looking for an answer to the same question. Along with what you mentioned above, the total number of parameters is reduced post-quantization as well. Please let me know if you figured out why.


+1, I have the same question: why does the total number of parameters (all parameters) also drop with quantization alone? This method should not prune the model in any way, should it?

One explanation I have is from @dfrank’s ChatGPT answer, which says that with quantization some parameters may become 0 and the matrices may become sparser, so that pruning can be applied. That sounds logical, but how is this procedure justified in the construction of the model (does torch or bitsandbytes do this automatically)?

Can @TheBloke provide us with an answer? We would appreciate some clarity on what is happening here.


Has anyone found an answer to this in the meantime?

Hey all!
I think I have a reasonably satisfying answer to this question.

If you load Mistral-7B without quantization, and print out the named parameters from a single decoder layer (along with their shape, number of elements, and whether they’re set to train), here’s what you’ll see:

Parameter Name                                              Dimensions       Total Values    Trainable

==== Embedding Layer ====

model.embed_tokens.weight                                   32,000 x 4,096           125M    True

==== First Decoder ====

model.layers.0.self_attn.q_proj.weight                       4,096 x 4,096            16M    True
model.layers.0.self_attn.k_proj.weight                       1,024 x 4,096             4M    True
model.layers.0.self_attn.v_proj.weight                       1,024 x 4,096             4M    True
model.layers.0.self_attn.o_proj.weight                       4,096 x 4,096            16M    True
model.layers.0.mlp.gate_proj.weight                         14,336 x 4,096            56M    True
model.layers.0.mlp.up_proj.weight                           14,336 x 4,096            56M    True
model.layers.0.mlp.down_proj.weight                          4,096 x 14,336           56M    True
model.layers.0.input_layernorm.weight                        4,096 x -                 4K    True
model.layers.0.post_attention_layernorm.weight               4,096 x -                 4K    True

These are the correct weight matrix shapes and parameter counts.

But with 4-bit quantization enabled, this becomes:

Parameter Name                                              Dimensions       Total Values    Trainable

==== Embedding Layer ====

model.embed_tokens.weight                                   32,000 x 4,096           125M    True

==== First Decoder ====

model.layers.0.self_attn.q_proj.weight                   8,388,608 x 1                 8M    False
model.layers.0.self_attn.k_proj.weight                   2,097,152 x 1                 2M    False
model.layers.0.self_attn.v_proj.weight                   2,097,152 x 1                 2M    False
model.layers.0.self_attn.o_proj.weight                   8,388,608 x 1                 8M    False
model.layers.0.mlp.gate_proj.weight                     29,360,128 x 1                28M    False
model.layers.0.mlp.up_proj.weight                       29,360,128 x 1                28M    False
model.layers.0.mlp.down_proj.weight                     29,360,128 x 1                28M    False
model.layers.0.input_layernorm.weight                        4,096 x -                 4K    True
model.layers.0.post_attention_layernorm.weight               4,096 x -                 4K    True

The weight matrices have been flattened and the number of elements has been cut in half.

Someone discussed this a bit here saying: “in general the quantized weight is not simply saved as a quantized tensor with X elements each having Y bits, rather it has to be saved as packedparams…”.

So, with the quantized model, if you try to count the model parameters by looping over the weights and tallying their numel, you’ll get the wrong total.
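(If you still want a closer count, here is a rough sketch of a corrected tally. It assumes a recent bitsandbytes where a 4-bit weight carries a quant_state whose .shape records the original, unpacked weight shape; check your version before relying on it.)

import math

def count_params_unpacked(model):
    """Parameter count that undoes the 4-bit packing (sketch, see caveat above)."""
    total = 0
    for _, param in model.named_parameters():
        quant_state = getattr(param, "quant_state", None)
        if quant_state is not None:
            # Packed uint8 storage: numel() is half the real element count,
            # so use the original shape recorded in the quant state instead.
            total += math.prod(quant_state.shape)
        else:
            total += param.numel()
    return total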

As for the number of trainable parameters…

The print_trainable_parameters function iterates over the “named parameters” (the different weight matrices) and, if they are set to train, it adds all of the elements in that weight matrix to the tally. ChatGPT’s comments about individual values being trainable or not are leading us astray; that’s not relevant here (and I don’t know whether it’s even true :man_shrugging:).

So loading in 4-bit breaks that parameter counting code. It’s not as simple as doubling it, either, because note how the Mistral embedding matrix didn’t change size in the quantized version.

Lastly, I think we need to be careful in what significance we give to seeing the number of trainable parameters going down… You can arbitrarily reduce the number of “trainable parameters” in a model simply by choosing to freeze parts of the model (you just set requires_grad = False on a weight matrix), and I think that’s all that’s happening here.
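To convince yourself that freezing alone produces this effect, here is a toy sketch (plain PyTorch, no quantization involved), reusing the print_trainable_parameters function from the first post:

import torch.nn as nn

# A tiny toy model, just to show that freezing shrinks the reported
# "trainable params" without removing anything from the model.
toy = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 10))
print_trainable_parameters(toy)   # trainable%: 100.0

for p in toy[0].parameters():     # freeze the first layer
    p.requires_grad = False

print_trainable_parameters(toy)   # trainable% drops to 50.0, same model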

Don’t conflate that with “parameter efficient fine tuning” techniques like LoRA, where you get to train fewer parameters while still getting a similar effect to training all of them.
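For contrast, here is roughly what the LoRA route looks like, as a sketch using the peft library (the target_modules value below is an assumption for Falcon-style models and varies by architecture):

from peft import LoraConfig, get_peft_model

# LoRA keeps the frozen (quantized) base weights and adds small trainable
# adapter matrices on top of the chosen projection layers.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],  # assumed Falcon-style attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter parameters are trainable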

So I went and investigated the actual counting of parameters, and this is the behaviour (using Mistral-7B as an example): the quantized layers have half the parameters of the non-quantized layers.

And the reason for this is the actual Linear4bit implementation (bitsandbytes/bitsandbytes/nn/modules.py at 048a2d404c6a909e6f835ba18182fbfae130ba09 · TimDettmers/bitsandbytes · GitHub)

Gemini explains this better than I can, hehe:

Let's consider a simplified example with a Linear4bit layer having 4 input features and 4 output features. This means the original weight tensor (before quantization) would have a shape of [4, 4], containing 16 individual weights.

1. Original Weight Tensor (FP16):
weights = torch.randn(4, 4, dtype=torch.float16)
print(weights)
This might output something like:
tensor([[ 0.2344, -0.1234,  0.5678, -0.9876],
        [-0.4567,  0.8765, -0.3456,  0.7890],
        [ 0.1234, -0.5678,  0.9876, -0.2345],
        [-0.7890,  0.3456, -0.7890,  0.1234]], dtype=torch.float16)

2. Quantization and Packing:
When we quantize this layer using Linear4bit, each of these 16 weights will be converted to a 4-bit representation. Since 8 bits make up a byte, we can pack two 4-bit weights into a single element of a torch.uint8 tensor.

3. Packed Weight Tensor (uint8):
After quantization and packing, the weight tensor will have a shape of [8, 1]. Each element in this tensor will hold two 4-bit quantized weights.
The exact values in the packed tensor will depend on the specific quantization map used (e.g., fp4 or nf4). However, the key point is that the information from the original 16 weights is now stored in 8 elements, effectively achieving a 50% reduction in memory usage for the weights.
Step by step, here is how the shape goes from [4, 4] to [8, 1]:
1. Original Shape:
We start with a weight tensor of shape [4, 4], containing 16 individual 16-bit weights.
2. 4-bit Quantization:
Each 16-bit weight is converted to a 4-bit representation. We now have 16 4-bit weights.
3. Packing:
Since 8 bits make up a byte, we can pack two 4-bit weights into a single element of a torch.uint8 tensor.
Therefore, the 16 4-bit weights can be packed into 8 elements.
4. Final Shape:
The most logical and efficient way to store these 8 packed elements would be in a 1D tensor of shape [8].
However, the Linear4bit layer implementation uses a slightly different approach for internal reasons related to memory alignment and computation.
It stores the packed weights in a 2D tensor of shape [8, 1].
This shape technically has the same number of elements (8) as the 1D [8] shape, but it's organized differently in memory.
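If you want to see the two-per-byte packing concretely, here is a toy pure-PyTorch sketch (just an illustration of the idea, not the actual bitsandbytes kernel):

import torch

# 16 four-bit codes (values 0..15) standing in for quantized weight indices.
codes = torch.randint(0, 16, (16,), dtype=torch.uint8)

# Pack two codes per byte: high nibble from even positions, low nibble from odd ones.
packed = ((codes[0::2] << 4) | codes[1::2]).reshape(-1, 1)  # shape [8, 1], like Linear4bit

# Unpacking recovers all 16 original codes.
flat = packed.flatten()
unpacked = torch.stack(((flat >> 4) & 0xF, flat & 0xF), dim=1).flatten()
assert torch.equal(unpacked, codes)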

Hi @jjovalle99, do you mind sharing your code for checking the parameters of the quantized/non-quantized models? Thank you.

Sure! TBH it’s nothing fancy:

import pandas as pd
from transformers import AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # quantization_config=bnb_config,
    device_map='auto',
)

params_dict = {
    'Name': [],
    'Shape': [],
    'Parameters': [],
    'RequiresGrad': []
}

for name, param in model.named_parameters():
    params_dict['Name'].append(name)
    params_dict['Shape'].append(param.shape)
    params_dict['Parameters'].append(param.numel())
    params_dict['RequiresGrad'].append(param.requires_grad)

pd.DataFrame(params_dict)
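For the quantized run, the quantization_config line gets uncommented and pointed at a 4-bit config like the one from the first post (the exact settings below are an assumption):

import torch
from transformers import BitsAndBytesConfig

# Same kind of 4-bit setup as in the original question; with this passed to
# from_pretrained, the loop above prints the packed [N, 1] uint8 weight
# shapes with RequiresGrad=False for every Linear4bit layer.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)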

This happens with Whisper as well. Quantizing does have an effect, both quantitatively and qualitatively. I’ve trained and tested hundreds of Whisper models out in the real world. I think what people need to be careful of are eval metrics. I’ve trained models that had great numbers but struggled with basic Japanese audio translations in practice, and also the opposite: horrible WER but the model understood things that surprise me even now. Whispers can be a bit odd, though.

Whisper-tiny
trainable params: 37184640 || all params: 37760640 || trainable%: 98.47
trainable params: 20640384 || all params: 29503104 || trainable%: 69.96

That explains a lot…

I wrote a forward hook to undo that but didn’t know what was doing it… Now I do.
Good catch.

“requires_grad=False, # quantized weights should be frozen by default”