Compatibility of Flash Attention 2 with the dtype conversion done by accelerator.prepare

Hi, I’m trying to fine-tune a BLIP-2 model with Flash Attention 2 enabled on its OPT-2.7B language model, but FA2 produces significantly higher loss than eager attention. This looks similar to previously reported issues (#26498, #28925, #28142).
According to the comments in those issues, the recommended way to use FA2 is to load the model in full precision and train it inside an autocast context.
However, when using the accelerate library, the accelerator.prepare function converts the model to the specified dtype (bf16 in my case), including the layer norms.
I suspect this is what causes the problem for me, but I’m not sure.
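
To make it concrete, here is a minimal sketch of what I mean (not my exact training script; the checkpoint name and the submodule path used for the dtype check are just taken from the public BLIP-2/OPT-2.7B config):

```python
import torch
from transformers import Blip2ForConditionalGeneration
from accelerate import Accelerator

# Load in full precision with FA2, as recommended in the linked issues.
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float32,
    attn_implementation="flash_attention_2",
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Layer norms are fp32 at this point, as expected.
print(model.language_model.model.decoder.layers[0].self_attn_layer_norm.weight.dtype)

accelerator = Accelerator(mixed_precision="bf16")
model, optimizer = accelerator.prepare(model, optimizer)

# In my setup, the weights (including the layer norms) show up as bf16 after
# prepare(), instead of staying fp32 with only the autocast regions in bf16.
print(model.language_model.model.decoder.layers[0].self_attn_layer_norm.weight.dtype)
```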

Could you check this behavior and offer any suggestions? I’m using transformers==4.40.0.dev0, accelerate==0.23.0, and flash_attn==2.5.5.
If there are any details I should elaborate on, please let me know.
Thanks in advance :slight_smile: