Lastly, I think we need to be careful about how much significance we give to the number of trainable parameters going down. You can arbitrarily reduce the number of “trainable parameters” in a model simply by freezing parts of it (you just set requires_grad = False on a weight matrix), and I think that’s all that’s happening here.
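Here’s a minimal PyTorch sketch of what I mean (the model shape is made up, just for illustration): flipping requires_grad drops the reported trainable-parameter count, but nothing clever has happened.

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

def count_trainable(m):
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

print(count_trainable(model))  # 35594 -- everything trainable

# "Reduce" trainable parameters by freezing the first layer
for p in model[0].parameters():
    p.requires_grad = False

print(count_trainable(model))  # 2570 -- smaller number, same model, frozen weights
```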
Don’t conflate that with “parameter-efficient fine-tuning” techniques like LoRA, where you get to train fewer parameters while still getting an effect similar to training all of them.
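For contrast, a rough sketch of the LoRA idea (hand-rolled, not the actual peft library API): the base weights are frozen, but a small trainable low-rank update still adapts the layer’s effective weight matrix.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base Linear plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights frozen, as above
        # Only these two small matrices are trained
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        # Behaves like (W + B @ A) x, but gradients only flow to A and B
        return self.base(x) + x @ self.A.T @ self.B.T
```

The trainable count here is rank × (in_features + out_features) instead of in_features × out_features, yet the layer’s effective weights still get adapted. That’s the difference from simply freezing things.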