Understanding how changing bnb_4bit_compute_dtype affects outputs

I’m struggling to understand the impact of bnb_4bit_compute_dtype. Specifically: if I use the quantization config below and switch bnb_4bit_compute_dtype to float32, should nothing change in terms of the model's outputs/quality? My reasoning is that a 4-bit value should fit into both 16 and 32 bits.

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
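
For context, a minimal way to test this empirically would be to load the same checkpoint twice with the two compute dtypes and compare greedy generations (placeholder model_id below; assumes a CUDA GPU with bitsandbytes installed):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")

    results = {}
    for dtype in (torch.float16, torch.float32):
        config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=dtype,
        )
        model = AutoModelForCausalLM.from_pretrained(
            model_id, quantization_config=config, device_map="auto"
        )
        with torch.no_grad():
            ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
        results[str(dtype)] = tokenizer.decode(ids[0], skip_special_tokens=True)
        del model
        torch.cuda.empty_cache()

    print(results)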

Does anyone have an explanation for this?

From QLoRA paper:

“QLORA has one low-precision storage data type, in our case usually 4-bit, and one computation data type that is usually BFloat16. In practice, this means whenever a QLORA weight tensor is used, we dequantize the tensor to BFloat16, and then perform a matrix multiplication in 16-bit. For CausalLM models, the last lm_head is kept in its original dtype.”

This simply means that the tensors are stored in the 4-bit quantized format proposed by QLoRA, and whenever a computation needs to be performed, they are dequantized to the “compute” data type, which is usually FP16 or BF16. So switching bnb_4bit_compute_dtype to float32 does affect the outputs: the stored 4-bit weights are identical, but the matrix multiplications then run in FP32 instead of FP16, which produces slightly different numerical results (and is typically slower on GPU).
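
To make that concrete, here is a toy sketch in plain PyTorch (not the actual bitsandbytes NF4 kernel) of the store-low-bit / dequantize-to-compute-dtype / multiply pattern. Float32 vs float64 stand in for the float16 vs float32 choice so it runs on any machine, but the point is the same: the matmul precision follows the compute dtype, so the two results are close but not identical.

    import torch

    def fake_quantize_4bit(w):
        # Crude absmax quantization to 16 levels; real NF4 uses a
        # normal-float codebook rather than uniform steps.
        scale = w.abs().max() / 7.0
        q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
        return q, scale

    def quantized_matmul(x, q, scale, compute_dtype):
        # Dequantize to the compute dtype, then multiply in that dtype.
        w = q.to(compute_dtype) * scale.to(compute_dtype)
        return x.to(compute_dtype) @ w.T

    w = torch.randn(64, 64)
    x = torch.randn(1, 64)
    q, scale = fake_quantize_4bit(w)

    y_lo = quantized_matmul(x, q, scale, torch.float32)
    y_hi = quantized_matmul(x, q, scale, torch.float64)
    print((y_lo.double() - y_hi).abs().max())  # small, but not exactly zero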

Note that in QLoRA (a parameter-efficient fine-tuning technique), the purpose is to freeze the original layers (which are now quantized in 4-bit and only converted to bnb_4bit_compute_dtype when calculations are performed) and to train only the new adapter weights, either in FP32 or in mixed precision if fp16=True is specified during fine-tuning.
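
For completeness, here is a minimal sketch of that setup with PEFT, reusing the model_4bit object from the snippet above and assuming LLaMA-style module names for target_modules (these vary by architecture):

    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model_4bit = prepare_model_for_kbit_training(model_4bit)

    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],  # adjust to your model's layer names
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    model_4bit = get_peft_model(model_4bit, lora_config)
    # Only the adapter weights are trainable; the 4-bit base stays frozen.
    model_4bit.print_trainable_parameters()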
