Making available BERT-like models work on longer sequences (FlashAttention)

If I understand correctly, FlashAttention should make it much easier to encode long documents. I've only seen it applied to LLMs since it was announced, but suppose I wanted to encode a novel, for example, and save some GPU compute. Instead of training a BERT-like model from scratch, I would take something that's already pre-trained, with the vocabulary, embeddings and weights available to us, tweak the positional embeddings if needed (depending on the model), maybe continue pre-training it a bit on a language-modeling objective, and then fine-tune it on downstream tasks.
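
Here is a minimal sketch of what I have in mind, assuming the Hugging Face transformers + PyTorch stack. The model name, the target length `NEW_MAX_LEN`, the tiling heuristic for the new position rows, and the availability of the `attn_implementation="sdpa"` option for BERT in your transformers version are all assumptions, not a tested recipe:

```python
import torch
from transformers import AutoTokenizer, BertForMaskedLM

NEW_MAX_LEN = 4096  # hypothetical target sequence length

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained(
    "bert-base-uncased",
    # PyTorch scaled_dot_product_attention, which can dispatch to a
    # FlashAttention kernel on supported GPUs; drop this argument if your
    # transformers version does not support it for BERT.
    attn_implementation="sdpa",
)

# Grow the learned absolute position embeddings from 512 to NEW_MAX_LEN.
# The old rows are kept and tiled to initialise the new ones, which is one
# common heuristic before continuing pre-training.
old_pos = model.bert.embeddings.position_embeddings.weight.data  # (512, hidden)
hidden = old_pos.size(1)
new_pos = torch.nn.Embedding(NEW_MAX_LEN, hidden)
with torch.no_grad():
    for start in range(0, NEW_MAX_LEN, old_pos.size(0)):
        end = min(start + old_pos.size(0), NEW_MAX_LEN)
        new_pos.weight[start:end] = old_pos[: end - start]

model.bert.embeddings.position_embeddings = new_pos
# Resize the cached position_ids / token_type_ids buffers to the new length.
model.bert.embeddings.register_buffer(
    "position_ids", torch.arange(NEW_MAX_LEN).expand((1, -1)), persistent=False
)
model.bert.embeddings.register_buffer(
    "token_type_ids", torch.zeros((1, NEW_MAX_LEN), dtype=torch.long), persistent=False
)
model.config.max_position_embeddings = NEW_MAX_LEN
tokenizer.model_max_length = NEW_MAX_LEN

# From here, continue masked-language-model training on long documents,
# then fine-tune on the downstream task as usual.
```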

Has anyone tried this already? What are your thoughts?