As I’m currently doing a lot of work on replacing quadratic attention with linear attention mechanisms in pretrained transformer models that don’t yet implement it, I was wondering: are there any plans to port efficient attention substitutes to other models and make them easily swappable in the transformers library?
Short answer: no.
Longer answer: see this blog post (especially point 3, cc @patrickvonplaten):
We implement models fairly independently in the library, so we do have some linear-attention variants available, including:
There’s also a Flax implementation of Performer available, which you can find here: transformers/examples/research_projects/performer at main · huggingface/transformers · GitHub
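For what it’s worth, these efficient-attention models load just like any other model in the library. A minimal sketch, assuming Longformer counts as one of the variants mentioned above (the checkpoint name is just an example):

```python
from transformers import AutoTokenizer, LongformerModel

# Longformer uses sparse/windowed attention that scales linearly in sequence length.
tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("Linear-complexity attention over a long document.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```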
Ok, thanks! I think these models are great, but I was thinking of a way the attention could be abstracted, so that I could, for example, load a checkpoint that was trained with quadratic self-attention and then fine-tune it with linear attention just by specifying an argument. I suppose this could be an idea for a separate library.
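Just to illustrate what I mean, here’s a rough sketch of the kind of swap I have in mind. This is not an existing transformers feature: `LinearSelfAttention` is a made-up module (using the elu+1 kernel trick from the linear-attention literature), it ignores padding masks for brevity, and the module path follows the current BERT implementation (`model.encoder.layer[i].attention.self`):

```python
import torch
from torch import nn
from transformers import BertModel


class LinearSelfAttention(nn.Module):
    """Hypothetical linear-attention drop-in for BertSelfAttention."""

    def __init__(self, config):
        super().__init__()
        self.num_heads = config.num_attention_heads
        self.head_dim = config.hidden_size // config.num_attention_heads
        self.query = nn.Linear(config.hidden_size, config.hidden_size)
        self.key = nn.Linear(config.hidden_size, config.hidden_size)
        self.value = nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, hidden_states, *args, **kwargs):
        b, t, _ = hidden_states.shape
        shape = (b, t, self.num_heads, self.head_dim)
        # elu(x) + 1 as the kernel feature map, so attention weights stay positive.
        q = torch.nn.functional.elu(self.query(hidden_states).view(shape)) + 1
        k = torch.nn.functional.elu(self.key(hidden_states).view(shape)) + 1
        v = self.value(hidden_states).view(shape)
        # O(t) attention: aggregate keys/values first, then apply the queries.
        kv = torch.einsum("bthd,bthe->bhde", k, v)
        z = 1.0 / (torch.einsum("bthd,bhd->bth", q, k.sum(dim=1)) + 1e-6)
        out = torch.einsum("bthd,bhde,bth->bthe", q, kv, z)
        # Return a tuple to match BertSelfAttention's output convention.
        return (out.reshape(b, t, -1),)


model = BertModel.from_pretrained("bert-base-uncased")
for layer in model.encoder.layer:
    linear_attn = LinearSelfAttention(model.config)
    # Reuse the pretrained Q/K/V projections so fine-tuning starts close
    # to the original checkpoint.
    linear_attn.query.load_state_dict(layer.attention.self.query.state_dict())
    linear_attn.key.load_state_dict(layer.attention.self.key.state_dict())
    linear_attn.value.load_state_dict(layer.attention.self.value.state_dict())
    layer.attention.self = linear_attn
```

The idea would be to hide all of this behind a single argument, e.g. something like an `attention_type` flag at load time, so users don’t have to patch submodules by hand.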
This repo seems like a good foundation for this use case.