Hi @peterhung! Indeed, 4-bit and 8-bit quantization through bitsandbytes
makes it possible to reduce the memory footprint of the model. However, when the output of a layer is computed, the weights of that layer are dequantized (cast) to 16-bit or 32-bit precision, which is why it remains compatible with Flash Attention.
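
For reference, here is a minimal sketch of combining the two (assuming a recent `transformers` version, `flash-attn` installed, and a CUDA GPU; the model id is just an example). The 4-bit weights are dequantized to `bnb_4bit_compute_dtype` during each forward pass, and that is the precision Flash Attention operates in:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # example checkpoint, swap in your own

# Weights are stored in 4-bit (NF4) but dequantized to the compute dtype
# (bfloat16 here) whenever a layer's output is computed.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```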
Here is a more detailed explanation: Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA