Fewer Trainable Parameters after Quantization

As for the number of trainable parameters…

The print_trainable_parameters function iterates over the "named parameters" (the individual weight matrices) and, if a matrix is set to train, adds all of its elements to the tally. ChatGPT's comments about individual values being trainable or not are leading us astray; that's not relevant here (and I don't know whether it's even true :man_shrugging:).
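
For reference, here's a minimal sketch of what that counting loop does (a paraphrase of the logic, not PEFT's exact source):

```python
def count_trainable_parameters(model):
    """Tally elements across named parameters, split by trainability.

    A paraphrase of print_trainable_parameters, not the library's
    exact implementation.
    """
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        n = param.numel()  # number of elements in this weight matrix
        total += n
        if param.requires_grad:  # i.e. this matrix is set to train
            trainable += n
    print(f"trainable params: {trainable} || all params: {total} "
          f"|| trainable%: {100 * trainable / total:.4f}")
```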

So loading in 4-bit breaks that parameter counting code: bitsandbytes packs two 4-bit weights into each stored uint8 element, so numel() on a quantized matrix reports half the true count. Fixing the total isn't as simple as doubling it, either, because not every matrix gets quantized; note how the Mistral embedding matrix didn't change size in the quantized version.
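
If you want to see this directly, a sketch like the following should do it. The model name is just an example, it assumes bitsandbytes is installed, and the Params4bit class-name check is my assumption about how bitsandbytes labels its packed 4-bit weights:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Example checkpoint; any bitsandbytes-quantizable model behaves the same way.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

for name, param in model.named_parameters():
    # Quantized linear weights show up as packed uint8 tensors with half
    # the true element count; the embedding matrix keeps its original
    # dtype and shape, so its numel() is unchanged.
    print(name, param.dtype, tuple(param.shape), param.numel())

def count_params_4bit_aware(model):
    """Corrected tally: count each packed 4-bit element as two parameters."""
    total = 0
    for _, param in model.named_parameters():
        n = param.numel()
        if param.__class__.__name__ == "Params4bit":
            n *= 2  # two 4-bit values per stored element
        total += n
    return total

print(count_params_4bit_aware(model))
```

In the printout you should see the quantized linear layers as uint8 with roughly half their original element counts, while the embedding weight keeps its full shape, which is why a blanket doubling would overcount.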