Hi @peterhung! Indeed, 4-bit and 8-bit quantization through bitsandbytes
makes it possible to reduce the memory footprint of the model. However, when the output of a layer is computed, the weights of that layer are dequantized (cast) to 16-bit or 32-bit precision, which is why it remains compatible with Flash Attention.
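
For reference, here is a minimal sketch of combining the two (assuming a recent `transformers` version, `flash-attn` installed, and a CUDA GPU; the model id is just an example). The 4-bit weights are dequantized to `bnb_4bit_compute_dtype` during each forward pass, and that is the precision Flash Attention operates in:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # example checkpoint, swap in your own

# Weights are stored in 4-bit (NF4) but dequantized to the compute dtype
# (bfloat16 here) whenever a layer's output is computed.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```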
Here is a more detailed explanation: Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA