Are there any plans for replacing attention in transformers?

marksverdhei · April 8, 2022, 10:32am

I’m currently doing a lot of work on replacing quadratic attention with linear attention mechanisms in pretrained transformer models that don’t yet have them implemented, so I was wondering: are there any plans to port efficient attention substitutes to different models and make them easily swappable in the transformers library?
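For readers unfamiliar with the distinction, here is one common formulation of it. The post does not say which linear mechanism is meant, so the kernelized variant below (with a generic feature map \phi) is an assumed example, not necessarily the one in question. Here n is the sequence length and d the head dimension.

```latex
% Softmax attention: forming the n x n matrix QK^T costs O(n^2 d)
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V

% Kernelized linear attention: the sums over j are computed once
% and shared by every query, so the cost is O(n d^2)
\mathrm{LinAttn}(Q, K, V)_i =
  \frac{\phi(q_i)^\top \sum_{j=1}^{n} \phi(k_j)\, v_j^\top}
       {\phi(q_i)^\top \sum_{j=1}^{n} \phi(k_j)}
```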

nielsr · April 8, 2022, 12:20pm

Short answer: no.

Longer answer: this blog post (especially point 3, cc @patrickvonplaten):

We do implement models rather independently in the library, hence we do have some linear attention variants available, including:

There’s also a Flax implementation of Performer available, which you can find here: transformers/examples/research_projects/performer at main · huggingface/transformers · GitHub

marksverdhei · April 8, 2022, 12:26pm

Ok, thanks! I think these models are great, but I was thinking of a way the attention could be abstracted, so that I could, for example, load a checkpoint that was trained using quadratic self-attention and then fine-tune it using linear attention just by specifying an argument. I suppose this could be an idea for a separate library.
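A minimal sketch of what such an abstraction might look like, in pure Python so that it runs without any dependencies. Everything here is hypothetical: `ATTENTION_REGISTRY`, `SelfAttention`, and the `attention_type` argument are illustrative names and not part of the transformers library, and the linear variant assumes an elu(x) + 1 feature map as one possible choice.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_attention(q, k, v):
    """Quadratic attention: materializes the full n x n score matrix."""
    scale = math.sqrt(len(q[0]))
    weights = []
    for row in matmul(q, transpose(k)):
        scaled = [s / scale for s in row]
        m = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, v)

def feature_map(x):
    """elu(x) + 1, a positive feature map (an assumed choice)."""
    return [[xi + 1.0 if xi > 0 else math.exp(xi) for xi in row] for row in x]

def linear_attention(q, k, v):
    """Linear attention: never builds the n x n matrix."""
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)             # d x d_v summary of keys and values
    k_sum = [sum(col) for col in zip(*fk)]    # d-dimensional normalizer
    numerators = matmul(fq, kv)
    denominators = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[x / z for x in row] for row, z in zip(numerators, denominators)]

# Hypothetical registry: the attention mechanism is picked by name.
ATTENTION_REGISTRY = {"softmax": softmax_attention, "linear": linear_attention}

class SelfAttention:
    """Single-head self-attention whose mechanism is chosen by an argument."""

    def __init__(self, wq, wk, wv, attention_type="softmax"):
        self.wq, self.wk, self.wv = wq, wk, wv
        self.attention_type = attention_type

    def __call__(self, x):
        q, k, v = matmul(x, self.wq), matmul(x, self.wk), matmul(x, self.wv)
        return ATTENTION_REGISTRY[self.attention_type](q, k, v)

# The same projection weights (standing in for a loaded checkpoint)
# are reused with a different attention mechanism.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
quad_out = SelfAttention(identity, identity, identity, "softmax")(x)
lin_out = SelfAttention(identity, identity, identity, "linear")(x)
```

The two mechanisms share the projection weights but generally produce different outputs for the same input, which is why fine-tuning after the swap would still be needed.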

jackcai1206 · July 11, 2024, 10:31pm

This repo seems like a good foundation for this use case.
