Swapping GPT-2 Attention with Flash Attention

Hi all,

Is there currently a way to extract the attention attribute from a model such as GPT-2 and swap it with Flash-Attention?

Thank you,



I think you can multiply the positional embeddings, from what I have read, but it's not empirically tested.

I forgot to close this out. I resolved it a while ago. You can swap the attention layers by building a wrapper.

Can you share your code on how to swap the standard attention with flash attention on HF models?
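
Not the code from the post above, but here is a minimal sketch of what a wrapper-based swap could look like, under some loud assumptions. `NaiveCausalSelfAttention` is a hypothetical toy stand-in for a GPT-2 style layer (fused `c_attn` and `c_proj` projections, here plain `nn.Linear`), not the actual HF class. `SDPAAttentionWrapper` and `swap_attention` are names made up for illustration. The wrapper calls PyTorch's `torch.nn.functional.scaled_dot_product_attention` (PyTorch 2.0+), which may dispatch to a FlashAttention kernel when the hardware, dtype, and shapes allow it and falls back to another backend otherwise. A real HF model would also need its attention masks, KV cache, and dropout handled.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class NaiveCausalSelfAttention(nn.Module):
    """Hypothetical toy stand-in for a GPT-2 style attention layer (not the HF class)."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.c_attn = nn.Linear(embed_dim, 3 * embed_dim)  # fused Q, K, V projection
        self.c_proj = nn.Linear(embed_dim, embed_dim)      # output projection

    def split_heads(self, x):
        # (batch, seq, embed) -> (batch, heads, seq, head_dim)
        b, t, c = x.shape
        return x.view(b, t, self.num_heads, c // self.num_heads).transpose(1, 2)

    def forward(self, x):
        b, t, c = x.shape
        q, k, v = (self.split_heads(p) for p in self.c_attn(x).chunk(3, dim=-1))
        # Standard attention: materializes the full (seq x seq) score matrix.
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        mask = torch.ones(t, t, dtype=torch.bool, device=x.device).tril()
        scores = scores.masked_fill(~mask, float("-inf"))
        out = scores.softmax(dim=-1) @ v
        return self.c_proj(out.transpose(1, 2).reshape(b, t, c))


class SDPAAttentionWrapper(nn.Module):
    """Reuses the wrapped layer's weights but computes attention with a fused kernel."""

    def __init__(self, inner: NaiveCausalSelfAttention):
        super().__init__()
        self.inner = inner

    def forward(self, x):
        b, t, c = x.shape
        q, k, v = (self.inner.split_heads(p) for p in self.inner.c_attn(x).chunk(3, dim=-1))
        # PyTorch chooses the backend; FlashAttention is used only when supported.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.inner.c_proj(out.transpose(1, 2).reshape(b, t, c))


def swap_attention(model: nn.Module) -> int:
    """Replace every NaiveCausalSelfAttention child in place; return how many were swapped."""
    swapped = 0
    # Snapshot the module list first so newly created wrappers are not revisited.
    for parent in list(model.modules()):
        for name, child in list(parent.named_children()):
            if isinstance(child, NaiveCausalSelfAttention):
                setattr(parent, name, SDPAAttentionWrapper(child))
                swapped += 1
    return swapped
```

Calling `swap_attention(model)` on a model built from these toy layers replaces each one with a wrapper that shares the same weights, so outputs should match up to floating-point tolerance. As far as I know, recent versions of transformers also accept an `attn_implementation` argument in `from_pretrained`; whether GPT-2 is covered depends on the installed version, so check the docs before writing your own wrapper.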
