FlashAttention-2's 16 bit requirement

On the GPU Inference docs page, in the FlashAttention-2 section, it says:

FlashAttention-2 can only be used when the model’s dtype is fp16 or bf16. Make sure to cast your model to the appropriate dtype and load them on a supported device before using FlashAttention-2.

But then below that, it says FlashAttention-2 can be used with a 4-bit quantized model:

```python
from transformers import AutoModelForCausalLM

# load in 4bit
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    attn_implementation="flash_attention_2",
)
```

These two statements seem contradictory to me, because a 4-bit model isn't stored in fp16, right?

Thanks for your clarifications.

Hi @peterhung! Indeed, 4-bit and 8-bit quantization through bitsandbytes reduces the memory footprint of the model. However, when the output of a layer is computed, the weights of that layer are dequantized and cast to 32-bit or 16-bit precision on the fly, which is why it is compatible with FlashAttention-2.

Here is a more detailed explanation: Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA
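To make that explicit, here is a minimal sketch of loading a 4-bit model with the compute dtype pinned to bf16 so it satisfies FlashAttention-2's fp16/bf16 requirement. It assumes a recent transformers + bitsandbytes install and a FlashAttention-2-capable GPU, and `model_id` is just a placeholder checkpoint for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Placeholder checkpoint; substitute any causal LM you have access to.
model_id = "meta-llama/Llama-2-7b-hf"

# Weights are stored in 4-bit, but each layer is dequantized to bf16
# at compute time, which is what FlashAttention-2 requires.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```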

Hi @regisss

Thanks for your explanations - they’re clear.

And thanks for the link - I'll check it out as well.