Does quantization compress the model weights?

I am calculating some values using the model weights and its input.

I want to use “meta-llama/Meta-Llama-3-8B-Instruct” for a generation task.

The weights of the first transformer layer have the following shape:

model.embed_tokens.weight: torch.Size([128256, 4096])
model.layers.0.self_attn.q_proj.weight: torch.Size([4096, 4096])
model.layers.0.self_attn.k_proj.weight: torch.Size([1024, 4096])
model.layers.0.self_attn.v_proj.weight: torch.Size([1024, 4096])
model.layers.0.self_attn.o_proj.weight: torch.Size([4096, 4096])
model.layers.0.mlp.gate_proj.weight: torch.Size([14336, 4096])
model.layers.0.mlp.up_proj.weight: torch.Size([14336, 4096])
model.layers.0.mlp.down_proj.weight: torch.Size([4096, 14336])
model.layers.0.input_layernorm.weight: torch.Size([4096])
model.layers.0.post_attention_layernorm.weight: torch.Size([4096])

Now, when I use the quantized version, “unsloth/llama-3-8b-bnb-4bit”, the weights of the first transformer layer have the following shapes:

model.embed_tokens.weight: torch.Size([128256, 4096])
model.layers.0.self_attn.q_proj.weight: torch.Size([8388608, 1])
model.layers.0.self_attn.k_proj.weight: torch.Size([2097152, 1])
model.layers.0.self_attn.v_proj.weight: torch.Size([2097152, 1])
model.layers.0.self_attn.o_proj.weight: torch.Size([8388608, 1])
model.layers.0.mlp.gate_proj.weight: torch.Size([29360128, 1])
model.layers.0.mlp.up_proj.weight: torch.Size([29360128, 1])
model.layers.0.mlp.down_proj.weight: torch.Size([29360128, 1])
model.layers.0.input_layernorm.weight: torch.Size([4096])
model.layers.0.post_attention_layernorm.weight: torch.Size([4096])

As per my limited knowledge, in the quantization step we convert the float16 or float32 values to int4 or int8. Also, for fast access, the weights are reshaped into 1-D.

But when you look at, say, the weights of q_proj in the self-attention of the first transformer layer of “meta-llama/Meta-Llama-3-8B-Instruct”, the weight shape is torch.Size([4096, 4096]).

When you convert it into 1-D, it will be (16777216, 1). But if you look at the shape of the corresponding weight in “unsloth/llama-3-8b-bnb-4bit”, it is **torch.Size([8388608, 1])**.

I have two questions:

  1. How is the quantized weight shape in this particular case **torch.Size([8388608, 1])**?
  2. If I want to reshape the weight for some calculation, how can I do it (from **torch.Size([8388608, 1])** to **torch.Size([4096, 4096])**)?


How is the quantized weight shape in this particular case **torch.Size([8388608, 1])**?

This is because the NF4 (4-bit NormalFloat) quantization algorithm is quite ingenious, unlike a plain dtype cast.
The same applies to GGUF quantization and the like, but it is easier to think of it as a kind of compression rather than a simple conversion.
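For example, the size arithmetic works out exactly to the shape you see, assuming two 4-bit values are packed into one uint8 byte (which is discussed further down the thread); a minimal sketch:

n_values = 4096 * 4096        # q_proj has 16,777,216 weights
n_bytes_nf4 = n_values // 2   # two 4-bit values per uint8 byte
print(n_values, n_bytes_nf4)  # 16777216 8388608

That 8,388,608 matches the torch.Size([8388608, 1]) you observed (ignoring the small per-block absmax metadata stored alongside it).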

If I want to reshape the weight for some calculation, how can I do it (from **torch.Size([8388608, 1])** to **torch.Size([4096, 4096])**)?

I think there is a torch function for that (or rather, there is a function for almost any calculation…), but I’m not a math expert, so I’ll leave that part to someone else.
But is the data itself safe if you simply reshape the tensor?

I want to transform the weight shape:
return the weights to their original shape without changing their dtype, for faster calculation.

Suppose you know the input of an intermediate layer; then you can use these weights to calculate the output of that layer.
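For example, with the unquantized fp16 checkpoint the attention projections have no bias, so the layer output is just a matrix product with the transposed weight. A minimal sketch, assuming model is the loaded LlamaForCausalLM and x is a batch of hidden states:

import torch

W = model.model.layers[0].self_attn.q_proj.weight          # torch.Size([4096, 4096]) in the fp16 model
x = torch.randn(10, 4096, dtype=W.dtype, device=W.device)  # example hidden states
y = x @ W.T                                                 # query projection output, shape [10, 4096]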

When you load the Llama 3 model using Hugging Face in PyTorch, you can get all the trainable weights using the following code (a sketch for inspecting the quantized parameters follows the output below):

for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")

Output:

model.embed_tokens.weight: torch.Size([128256, 4096])
model.layers.0.self_attn.q_proj.weight: torch.Size([8388608, 1])
model.layers.0.self_attn.k_proj.weight: torch.Size([2097152, 1])
model.layers.0.self_attn.v_proj.weight: torch.Size([2097152, 1])
model.layers.0.self_attn.o_proj.weight: torch.Size([8388608, 1])
model.layers.0.mlp.gate_proj.weight: torch.Size([29360128, 1])
model.layers.0.mlp.up_proj.weight: torch.Size([29360128, 1])
model.layers.0.mlp.down_proj.weight: torch.Size([29360128, 1])
model.layers.0.input_layernorm.weight: torch.Size([4096])
model.layers.0.post_attention_layernorm.weight: torch.Size([4096])
model.layers.1.self_attn.q_proj.weight: torch.Size([8388608, 1])
model.layers.1.self_attn.k_proj.weight: torch.Size([2097152, 1])
model.layers.1.self_attn.v_proj.weight: torch.Size([2097152, 1])
model.layers.1.self_attn.o_proj.weight: torch.Size([8388608, 1])
model.layers.1.mlp.gate_proj.weight: torch.Size([29360128, 1])
...
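As far as I can tell, those flat uint8 tensors are bitsandbytes Params4bit objects, and the original shape plus the scaling metadata live in a quant_state attached to each one. A minimal sketch for inspecting it (attribute names are bitsandbytes internals and may differ between versions):

w = model.model.layers[0].self_attn.q_proj.weight
print(type(w))               # bitsandbytes Params4bit
print(w.dtype, w.shape)      # torch.uint8, torch.Size([8388608, 1])
print(w.quant_state.shape)   # original shape, e.g. torch.Size([4096, 4096])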

I use this:


import torch

print('Convert to FP16...')
model.to(torch.float16)

There is no loss: I use my quantized 4-bit models for training and my fp16 model for converting to GGUF; I also use my 4-bit models locally. So I use the 4-bit to download the model and this script to convert the downloaded model back to fp16.
There is no loss!


Also, to change settings: they are controlled in the config file, so you can effectively change them there, but there will be loss!

You need GGUF surgery! It is a part of llama.cpp, or there is something similar in mergekit.

(It is part of the GGUF files they have in the source.)

Here you can do a calculation: you can touch each tensor and divide or multiply (scale) it, so you apply a scaling algorithm to the tensors to change their size without loss! All that happens is multiplication or division, but the data is rescaled. This is quantization in its raw form: it is basically scaling (see the sketch below)!

So again, bitsandbytes will also have something in its source which may help, which will allow you to customize the scaling to your liking (for you it would be a factorization!).
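To illustrate the “it is basically scaling” idea, here is a minimal toy sketch of absmax int8 quantization. This is not the actual NF4 or GGUF code, just the general rescale-and-round principle:

import torch

W = torch.randn(4096, 4096)                                # toy fp32 weight
scale = W.abs().max() / 127                                # one scale for the whole tensor
W_q = (W / scale).round().clamp(-127, 127).to(torch.int8)  # quantize: rescale and round
W_back = W_q.float() * scale                               # dequantize: multiply by the scale again
print((W - W_back).abs().max())                            # a small rounding error remains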


@LeroyDyer, can you please share the link to the Hugging Face implementation on GitHub?

I am trying but unable to find it.

Well, in the case of bitsandbytes, this is closer to the use case, although it’s 8 bits instead of 4.

8-bit (LLM.int8() algorithm)

8-bit models can offload weights between the CPU and GPU to support fitting very large models into memory. The weights dispatched to the CPU are actually stored in float32, and aren’t converted to 8-bit.

By the way, it is easy to change back to a float. Virtually one line.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=BitsAndBytesConfig(load_in_4bit=True))
model.dequantize()  # Done!
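After dequantize() the projection weights should be back to their 2-D shapes, which is easy to verify (a quick check that works regardless of the architecture):

for name, param in model.named_parameters():
    if "q_proj" in name:
        print(name, param.shape, param.dtype)  # expect a 2-D shape and a float dtype
        break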

(It took a bit of searching to find this, by the way.)


@John6666 I have tried the dequantization, but it changes the storage type from 4-bit or 8-bit to FP16 or FP32.

Is there any other way to get the original weights while the storage type remains 4-bit or 8-bit?

Also, are two 4-bit values packed into a single 8-bit byte?

No, it is easier to think of 4-bit and 8-bit as different formats. So it would be easiest to dequantize the 4-bit weights to float and then quantize to 8-bit. There would be some memory consumption along the way, but…
It would be easier if NF4 supported this directly, but I’m sure there must be some structural difficulty or some other reason.
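A minimal sketch of that route, i.e. dequantize the 4-bit checkpoint to float, save it, then reload with 8-bit quantization; the output directory name is just a placeholder:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# the checkpoint is prequantized, so it loads in 4-bit via bitsandbytes automatically
model = AutoModelForCausalLM.from_pretrained("unsloth/llama-3-8b-bnb-4bit")
model.dequantize()                           # back to a float dtype in memory
model.save_pretrained("llama-3-8b-dequant")  # placeholder output directory

# reload the float checkpoint with 8-bit (LLM.int8()) quantization instead
model_8bit = AutoModelForCausalLM.from_pretrained(
    "llama-3-8b-dequant",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)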

I think the GGUF suggested above is a good idea, but the HF ecosystem and the llama.cpp (GGUF) ecosystem are completely separate, and if you start with one of them, it’s pretty hard to switch afterwards. You have to make a decision there according to your use case. Alternatively, there may be a way to keep the curricula and make them reusable.


Hmm, I’m not sure.
When I make my model I use Unsloth, so I quantize the model into 4-bit (forced).
When I load the model I can load it in 4-bit, 8-bit, or bfloat16.
So after training I always make two: one full precision (fp16 or bf16, whatever it merges as) and also a 4-bit.
I always use my 4-bit for training, or even to download and use the weights locally.
So once the 4-bit model loads, it always uses bitsandbytes, and you can reconvert the model back to fp16?
It did not seem to be a problem whether the model was saved as 4-bit, 8-bit, or fp16; it’s basically the same outputs. I train in the 4-bit double-quantized state every time, so I don’t expect there to be loss.
My 4-bit models also got better after I changed the training to just the attention heads: in fact the model was training even better and faster, and retaining more of its past training without damage!
I will note there is a difference between “quantized” and 4-bit, as the GGUF quantized version is not always the same as the 4-bit. I think it’s more stable, as the chat template, the max tokens, and the tokenizer have been embedded. It’s a bit final, but Hugging Face did release GGUF-to-weights conversion.




import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


def ConvertGGUF_toPretrained(model_id="LeroyDyer/MODELS", filename="Mixtral_AI_Q4.gguf", OUTPUT_DIR=""):
    print('Load model...')
    # load tokenizer and model directly from the GGUF file (dequantized on load)
    tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
    model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)

    print('Extract and convert to FP16...')
    model.to(torch.float16)

    print('Saving model to', OUTPUT_DIR)
    tokenizer.save_pretrained(OUTPUT_DIR)
    model.save_pretrained(OUTPUT_DIR)
    return model, tokenizer
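Used roughly like this (the repo and file are the defaults from the function above; the output directory is just a placeholder):

model, tokenizer = ConvertGGUF_toPretrained(
    model_id="LeroyDyer/MODELS",
    filename="Mixtral_AI_Q4.gguf",
    OUTPUT_DIR="Mixtral_AI_fp16",
)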


This adds the ability to extract the fp16 from the GGUF.
So there are a lot of options for keeping and storing a model.
I use the 4-bit weights (not quantized) (but saying that, 4-bit is a quantization and 8-bit is a quantization ~ LOL!).

I think the problem you’re facing is storage, and I would suggest using GGUF as normal, or downloading the 4-bit weights and using them locally (but if bitsandbytes messes up, as it sometimes does on Windows, you can’t load those models locally). So the fp16 model stored locally is the real best option, as it can do anything and you can mess around with it! (Personally I would not worry about conversions too much, as this technology will catch up, so build your libraries.)


Nice one! Even I had not found this out yet. dequantize, lol!


The GGUF that can be loaded with this is, so to speak, a file dedicated to transformers, using GGUF quantization instead of NF4, and without the config you get an error.
It doesn’t mean that you can use a quantized GGUF file with llama.cpp; it just means that you can use GGUF quantization instead of BNB quantization.

People may want to use GGUF for llama.cpp, so what’s the point of adding more dialects in the opposite direction!
I guess I could write my own config file for transformers, but that’s too much trouble!
Or maybe they fixed it so that we don’t need the config anymore…

In any case, I think it’s a good idea to keep a copy in the original format, e.g. float16. It will be safe even if the original model disappears.

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)

YES, this is the key phrase here!

The NF4 double-quantized model can only be loaded with bitsandbytes!
Hence, when loading a model, you apply the quantization as a bnb config; this way the original weights stay the same. When you quantize to GGUF with Unsloth or llama.cpp, choose the lowest level you feel is best; for me that is Q4_K_S or Q4_K_M.
Now that my machine is better I should go to 8-bit instead, but the 4-bit has worked very well for me!
The models which I double-quantized to NF4 could not load when the bitsandbytes library failed (so I choose to keep an fp16 copy when I know I need to use it locally, but it’s not necessary). The GGUF loads, and the 4-bit NF4 loads (just use the bitsandbytes config).
The float model needs the config to be specified to double-quantize, but a model which has been prequantized will automatically use bitsandbytes to process it.
This bitsandbytes library is the key! (Hence I do not update this library, and if I do, I revert, as it is sensitive to your CUDA setup too.)

I think it was about 3 months ago?
It has become much harder for a bitsandbytes pip install to fail on Windows. It was really bad before; I would say it was at a level where it was practically impossible to operate outside of a virtual environment.

Probably, thanks to the image model Flux, demand for NF4 increased all at once, and they did their best to stabilize it.
But I often use GGUF’s Q4_K_M in Spaces for llama.cpp. Q5_K_M is also good.
The question is whether each format can be used for training or for processing tensors in transformers in their quantized state, but perhaps not many people have tried it…


Admittedly, this part initially threw me off, as I was expecting the 4-bit representation to be packed into a 4-bit data type, which assumes exactly 16 unique values, not an 8-bit data type with 256 unique values. However, after going through the code, it turns out the author of bitsandbytes converts the 4-bit values into 8-bit by packing two 4-bit values into a single 8-bit value; this results, of course, in a different shape for the quantized tensor. This is because PyTorch does not support 4-bit data types, and the smallest type it supports is 8 bits, as of the writing of this post.

Furthermore, the reason it uses an 8-bit integer format and not an 8-bit floating-point format (FP8) is the lack of native support for FP8 in PyTorch. The packing operation is exactly what PyTorch’s quantized 4-bit integer data type torch.quint4x2 does as well, as you can see in the documentation. Packing two 4-bit values into 8 bits is very straightforward using simple bitwise operations. The actual packing step in bitsandbytes is performed in this part of the code, but make sure to follow along to see our implementation.

It is clearly mentioned that PyTorch does not support 4-bit data types, so the author of bitsandbytes converts the 4-bit values into 8-bit by packing two 4-bit values into a single 8-bit value.

Now, the issue is how we can get the 4-bit values back from a single 8-bit value.

If we unpack them, we have a 1-D tensor of shape [4096 * 4096, 1].

So, how can we unpack the tensor carefully?

Does it pack two consecutive 4-bit values into an 8-bit value, or something else?
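To make the packing idea concrete, here is a minimal sketch of packing and unpacking two 4-bit codes per uint8 with bitwise operations. This is only the generic idea; whether bitsandbytes puts the first value in the high or low nibble (and how it orders elements within a block) is an implementation detail to confirm in its source:

import torch

codes = torch.randint(0, 16, (16,), dtype=torch.uint8)   # sixteen 4-bit code indices

# pack: first value into the high nibble, second into the low nibble (assumed layout)
packed = (codes[0::2] << 4) | codes[1::2]                 # shape [8], dtype uint8

# unpack: recover both nibbles from each byte
high = packed >> 4
low = packed & 0x0F
unpacked = torch.stack([high, low], dim=1).reshape(-1)    # shape [16] again

print(torch.equal(unpacked, codes))                       # True

Getting real weight values back would additionally require mapping each 4-bit code through the NF4 lookup table and multiplying by the per-block absmax, which is what the dequantization routine does.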


Oh, so that’s why they can’t support it… it can’t be helped if torch doesn’t support it.
As for the calculation, if you could dequantize each tensor on the fly, that would be a sure thing, but I’ve never thought to try it, so I don’t know whether it’s possible…
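For per-tensor dequantization there is, as far as I know, a functional API in bitsandbytes that uses the quant_state stored on each 4-bit parameter. A minimal sketch (the exact signature may vary between bitsandbytes versions):

import bitsandbytes.functional as F

w = model.model.layers[0].self_attn.q_proj.weight   # Params4bit, torch.Size([8388608, 1])
w_fp = F.dequantize_4bit(w.data, w.quant_state)     # back to torch.Size([4096, 4096])
print(w_fp.shape, w_fp.dtype)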

If all you need to modify is weights, it would be easier to offload them from the beginning.

device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}
