If I understand correctly, FlashAttention should make it much easier to encode long documents. I've only seen it applied to LLMs since it was announced, but I was wondering about the encoder side: say I wanted to encode a novel, and I wanted to save some GPU compute. Instead of training a BERT-like model from scratch, I would take something that's already pre-trained, with the vocabulary, embeddings, and weights available to us, tweak the positional embeddings if needed (depending on the model), maybe continue pre-training it a bit on a language-modeling objective over long texts, and then fine-tune it for downstream tasks.
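For concreteness, here is a rough sketch of the positional-embedding step I have in mind, assuming a Hugging Face-style encoder with learned absolute position embeddings; the model name, the 4096 target length, and the tiling initialization are just placeholders, and details like the `position_ids` buffer differ between model families, so treat this as an illustration rather than a recipe:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder: any pre-trained encoder with learned positions
new_max_len = 4096                # placeholder: the longer context I'd like to support

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Existing learned position embeddings (512 x hidden for vanilla BERT).
old_pos = model.bert.embeddings.position_embeddings.weight.data
old_max_len, hidden = old_pos.shape

# Build a longer position-embedding table, initialized by tiling the old one
# so the pre-trained positions are reused instead of thrown away.
new_pos = old_pos.new_empty(new_max_len, hidden)
for start in range(0, new_max_len, old_max_len):
    end = min(start + old_max_len, new_max_len)
    new_pos[start:end] = old_pos[: end - start]

model.bert.embeddings.position_embeddings = torch.nn.Embedding.from_pretrained(
    new_pos, freeze=False
)

# Keep the cached buffers consistent with the new maximum length.
model.bert.embeddings.register_buffer(
    "position_ids", torch.arange(new_max_len).unsqueeze(0), persistent=False
)
if hasattr(model.bert.embeddings, "token_type_ids"):
    model.bert.embeddings.register_buffer(
        "token_type_ids", torch.zeros(1, new_max_len, dtype=torch.long), persistent=False
    )
model.config.max_position_embeddings = new_max_len
tokenizer.model_max_length = new_max_len

# From here: continue MLM pre-training on long documents (with FlashAttention
# if the attention backend supports it), then fine-tune on the downstream task.
```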
Has anyone tried this already? What are your thoughts?