Hi all,
Is there currently a way to extract the attention attribute from a model such as GPT-2 and swap it with Flash-Attention?
Thank you,
Enrico
From what I have read, I think you can multiply the positional embeddings, but it's not empirically tested.
I forgot to close this out; I resolved it a while ago. You can swap the attention layers by building a wrapper.
Can you share your code for swapping the standard attention with FlashAttention in HF models?
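The original code wasn't posted, but here is a minimal sketch of the wrapper approach described above. It is an illustration, not the poster's code: the `FlashGPT2Attention` class name is made up, and it routes attention through PyTorch's `scaled_dot_product_attention` (PyTorch >= 2.0), which dispatches to a FlashAttention kernel on supported GPUs, rather than the standalone `flash_attn` package. It also ignores KV caching, attention masks, and head masking, so it suits training or full-sequence forward passes; generation with `use_cache=True` would need more plumbing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2LMHeadModel
from transformers.models.gpt2.modeling_gpt2 import GPT2Attention


class FlashGPT2Attention(nn.Module):
    """Wraps an existing GPT2Attention module, reusing its weights."""

    def __init__(self, orig: GPT2Attention):
        super().__init__()
        self.c_attn = orig.c_attn        # fused q/k/v projection (Conv1D)
        self.c_proj = orig.c_proj        # output projection
        self.resid_dropout = orig.resid_dropout
        self.num_heads = orig.num_heads
        self.head_dim = orig.head_dim
        self.embed_dim = orig.embed_dim

    def forward(self, hidden_states, **kwargs):
        # **kwargs swallows layer_past / attention_mask / etc.; this sketch
        # does not implement caching or padding masks.
        bsz, seq_len, _ = hidden_states.size()
        # Project once, then split into query / key / value.
        q, k, v = self.c_attn(hidden_states).split(self.embed_dim, dim=2)
        # Reshape to (batch, heads, seq, head_dim) as SDPA expects.
        shape = (bsz, seq_len, self.num_heads, self.head_dim)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        # is_causal=True reproduces GPT-2's causal mask; the Flash kernel
        # is selected automatically when dtype/hardware allow it.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(bsz, seq_len, self.embed_dim)
        attn = self.resid_dropout(self.c_proj(attn))
        # GPT2Block expects a tuple; None stands in for the attn weights.
        return attn, None


model = GPT2LMHeadModel.from_pretrained("gpt2")
# Replace each block's attention module in place.
for block in model.transformer.h:
    block.attn = FlashGPT2Attention(block.attn)
```

The exact return-tuple shape that `GPT2Block` expects varies across transformers versions, so check your installed version before relying on this.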