+1, I'm looking into the same question: why does the number of parameters of the model (all parameters) also decrease with quantization alone? This method should not prune the model in any way, should it?
One explanation I have is from @dfrank's ChatGPT answer, which states that with quantization some parameters may become 0 and the matrices may overall become sparser, so that pruning can then be applied. That sounds logical, but how is this procedure justified in the construction of the model (does torch or bitsandbytes do this automatically)?
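For reference, here is roughly how I'm comparing the counts (a minimal sketch; the model id is just a placeholder for whatever checkpoint you're loading, and `device_map="auto"` is only there because 4-bit loading needs a device):

```python
# Sketch: compare the reported parameter count with and without 4-bit quantization.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "facebook/opt-125m"  # placeholder model, use your own checkpoint

# Full-precision load
model_fp = AutoModelForCausalLM.from_pretrained(model_id)
print("fp params:", sum(p.numel() for p in model_fp.parameters()))

# 4-bit quantized load via bitsandbytes
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
print("4-bit params:", sum(p.numel() for p in model_4bit.parameters()))
```

With this kind of count the second number comes out lower for me, which is what prompted the question.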