Aggregate encoder states in encoder-decoder models for long sequences?

Hi. I would like to train a text-to-text QA model for long documents.
I was wondering if anyone has seen success aggregating the encoder states of a long document in some way (e.g. pooling) before passing them to the decoder, similar to the sliding-window technique used for, e.g., classification with BERT. I'm well aware of models like Longformer, etc., but I'm wondering whether this approach has any utility, and if not, why not?
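
To make the idea concrete, here's a minimal sketch of what I have in mind: mean-pool the encoder hidden states over fixed windows to shrink the sequence the decoder cross-attends over. The shapes, window size, and function name are just illustrative assumptions, not from any particular library.

```python
import torch
import torch.nn.functional as F

def pool_encoder_states(states: torch.Tensor, window: int) -> torch.Tensor:
    """Mean-pool encoder states over non-overlapping windows.

    states: (batch, seq_len, hidden); seq_len is padded up to a
    multiple of `window` before pooling.
    Returns: (batch, ceil(seq_len / window), hidden).
    """
    b, t, h = states.shape
    pad = (-t) % window  # pad so seq_len divides evenly into windows
    if pad:
        states = F.pad(states, (0, 0, 0, pad))
    # Reshape into (batch, n_windows, window, hidden) and average each window
    return states.view(b, -1, window, h).mean(dim=2)

# Long-document encoder output: 4096 tokens, hidden size 768
encoder_states = torch.randn(2, 4096, 768)
pooled = pool_encoder_states(encoder_states, window=8)
print(pooled.shape)  # torch.Size([2, 512, 768])
```

The pooled states would then be passed as the cross-attention memory in place of the full-length encoder output, cutting the decoder's cross-attention cost by the window factor.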