I'm struggling to understand the impact of bnb_4bit_compute_dtype. Specifically, I'm thinking that if I were to use the quantization config below and switched bnb_4bit_compute_dtype to float32, nothing should change in terms of the outputs/quality of the model? My reasoning is that a 4-bit value should fit into both 16 and 32 bits.

```
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
```

Does anyone have an explanation for this?

From the QLoRA paper:

*"QLORA has one low-precision storage data type, in our case usually 4-bit, and one computation data type that is usually BFloat16. In practice, this means whenever a QLORA weight tensor is used, we dequantize the tensor to BFloat16, and then perform a matrix multiplication in 16-bit. For CausalLM models, the last lm_head is kept in its original dtype."*

This simply means that the tensors are stored in the 4-bit quantized format proposed by QLoRA, and whenever a computation needs to be performed, they are dequantized to the "computation" data type, which is usually FP16 or BF16. So the compute dtype does affect outputs: the dequantized values and all subsequent matmuls are carried out at that precision, and FP32 vs. FP16/BF16 will produce (slightly) different numerics, with FP32 typically being slower but marginally more precise.
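The store-in-4-bit / compute-in-higher-precision idea can be illustrated with a minimal sketch. This is *not* the real bitsandbytes NF4 kernel (which uses a normal-distribution-aware codebook and per-block scaling on GPU); it is a toy absmax 4-bit scheme in plain Python, just to show that weights live as small integers and are expanded to the compute dtype on every forward pass:

```python
# Toy sketch of 4-bit storage + higher-precision compute.
# Illustrative only -- real NF4 uses a 16-value normal-float codebook.

def quantize_4bit(weights):
    """Map each float weight to a signed integer in [-7, 7] (absmax scaling)."""
    scale = max(abs(w) for w in weights)  # per-block absmax
    q = [round(w / scale * 7) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats in the compute dtype."""
    return [v / 7 * scale for v in q]

def linear(q_weights, scale, x):
    """Dequantize, then perform the dot product in the compute dtype."""
    w = dequantize(q_weights, scale)  # happens at every use of the tensor
    return sum(wi * xi for wi, xi in zip(w, x))

weights = [0.9, -0.3, 0.05, 0.7]
q, scale = quantize_4bit(weights)        # stored form: small ints + one scale
y = linear(q, scale, [1.0, 2.0, 3.0, 4.0])  # compute happens after dequantization
```

The key point is that `q` (the stored form) never changes; only the precision used in `dequantize` and the dot product corresponds to `bnb_4bit_compute_dtype`.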

Note that in QLoRA (a parameter-efficient fine-tuning technique), the purpose is to freeze the original layers (which are now quantized to 4-bit and only dequantized to bnb_4bit_compute_dtype when calculations are performed) and to train only the new adapter weights, in FP32 or in mixed precision if `fp16=True` is specified while fine-tuning.
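For completeness, here is a hedged sketch of how that setup typically looks with the `peft` library: the 4-bit base model is frozen and LoRA adapters are added on top. The `model_id` placeholder and the `target_modules` names are illustrative assumptions and depend on your model architecture:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # dtype used when dequantizing for matmuls
)

# "model_id" is a placeholder; substitute your checkpoint.
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)  # freezes quantized base weights

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # model-dependent; shown as an example
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the adapter weights are trainable
```

With this setup, gradients flow only through the small LoRA matrices, while every forward pass through the frozen base layers goes via the dequantize-to-compute-dtype path quoted from the paper above.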