I have fine-tuned a Mixtral-8x7B model with SFTTrainer and accelerate, using the official training script.
One of the parameters I set during training was --load_in_4bit. After training, I pushed the model to the Hub, and the tensor type shown there is F32.
I think this behavior is correct. While the weights are stored quantized to 4 bits, they are de-quantized during computation. I don’t know much about the specifics myself, but it seems that fine-tuning methods like QLoRA have been optimized to make this work.
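For reference, here is a minimal sketch of how the --load_in_4bit flag typically maps onto a BitsAndBytesConfig (the exact defaults the official script uses may differ), which is where the "stored in 4-bit, de-quantized for compute" behavior comes from:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch only: base weights are stored in 4-bit, but matmuls run in the
# higher-precision compute dtype after on-the-fly de-quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NF4 quantization (QLoRA default)
    bnb_4bit_compute_dtype=torch.bfloat16,  # de-quantize to bf16 for compute
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",          # example model id
    quantization_config=bnb_config,
    device_map="auto",
)
```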
@icpro I’ve never tried working with llama.cpp before.
My initial guess would be no; the quantized weights are determined by the quantization algorithm that is applied to the original weights, and I would assume that different libraries implement different quantization algorithms.
Then again, I still haven’t fully grasped how load_in_4bit performs quantization, so my answer is a guess at best.
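To illustrate why I'd guess the weights aren't interchangeable, here's a toy blockwise absmax 4-bit quantizer. This is not the actual NF4 or GGUF algorithm, just an illustration that the stored codes depend entirely on the quantization scheme, so two libraries with different schemes produce different serialized weights:

```python
import torch

def quantize_block_4bit(w: torch.Tensor):
    # Toy scheme: scale the block so its values fit signed 4-bit codes in [-7, 7].
    scale = w.abs().max() / 7.0
    codes = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)
    return codes, scale

def dequantize_block_4bit(codes: torch.Tensor, scale: torch.Tensor):
    # De-quantize back to float for computation.
    return codes.to(torch.float32) * scale

block = torch.randn(64)
codes, scale = quantize_block_4bit(block)
approx = dequantize_block_4bit(codes, scale)
print((block - approx).abs().max())  # small reconstruction error
```

A scheme with a different codebook or block layout (as NF4 and the llama.cpp quant formats have) would store entirely different codes for the same original weights, which is why I wouldn't expect one library to read the other's quantized tensors directly.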