Using Longformer with full attention for comparison

I’m doing some research on Longformer’s ability to classify long texts on a dataset where the majority of samples are shorter than 1024 tokens. Because the windowed attention mechanism should, in theory, never beat full attention on shorter texts, I was wondering if I could easily estimate the tradeoff between the two.

From the model documentation it does not seem possible to set full attention directly, but can I approximate it by truncating to 1024 tokens in the tokenizer and setting the (local) attention window size to 1024?
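Concretely, this is roughly what I have in mind (just a sketch: the checkpoint name and num_labels=2 are placeholders for my setup, and I haven’t verified that raising attention_window alone really reproduces full attention):

    from transformers import LongformerTokenizerFast, LongformerForSequenceClassification

    model_name = "allenai/longformer-base-4096"
    tokenizer = LongformerTokenizerFast.from_pretrained(model_name)

    texts = ["a long document ...", "another long document ..."]
    # Truncate/pad everything to 1024 tokens so the local window can cover the whole input.
    encodings = tokenizer(texts, truncation=True, max_length=1024,
                          padding="max_length", return_tensors="pt")

    # Override the default local attention window (512) so it spans the truncated length.
    model = LongformerForSequenceClassification.from_pretrained(
        model_name, attention_window=1024, num_labels=2
    )
    outputs = model(**encodings)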

Longformer should be a little better on inputs with more than 512 tokens (with a 512 window size).
If you change the local window to 1024, it should behave like full attention.
However, since the model has not been pretrained with such a large window, you should expect some performance degradation.

I personally observed a performance loss between a model trained on 512-token sequences with full attention and the same model with local attention instead, once the attention window exceeds 768 tokens (i.e. 3 blocks of 256 tokens).
See this repo for instance if you want to extrapolate existing models.

Thank you for the valuable input and the link. That is really interesting work.

Silly me didn’t check the Longformer source code. It seems that you can set BERT’s full attention mechanism by passing this config option:

attention_mode = 'n2'
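For anyone who finds this later, this is roughly how the AllenAI repo wires it up (going from their README; the checkpoint path is a placeholder for a locally downloaded model):

    from longformer.longformer import Longformer, LongformerConfig

    config = LongformerConfig.from_pretrained("longformer-base-4096/")
    # 'n2' is plain full (n^2) self-attention; 'tvm' and 'sliding_chunks'
    # are the sparse sliding-window implementations.
    config.attention_mode = "n2"
    model = Longformer.from_pretrained("longformer-base-4096/", config=config)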

I will have to check (after I finish the current fine-tuning) whether I can fit a 1024 token length with full attention on my GPU, and how the model performs. Longformer has been pretrained on very long sequences, so that should not be a limitation.
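Back-of-envelope, assuming a base-size model (12 layers, 12 heads) and fp16 activations, the full-attention score tensors alone at length 1024 come to roughly:

    # Rough estimate only: attention scores are layers x heads x seq_len x seq_len values.
    seq_len, heads, layers, bytes_per_val = 1024, 12, 12, 2
    scores_bytes = layers * heads * seq_len * seq_len * bytes_per_val
    print(f"{scores_bytes / 2**20:.0f} MiB per example")  # ~288 MiB, before gradients and other activations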

The attention configuration and implementation by AllenAI differ from HF’s: HF doesn’t support switching the attention mode, and AllenAI has no classification head example. I looked at subclassing, but couldn’t estimate how easy it would be to replace the attention mechanism with BERT/RoBERTa’s, as there are implementation differences. Does anyone know?
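For reference, this is how far I got poking at the HF model before getting stuck (inspection only, nothing swapped yet):

    from transformers import LongformerForSequenceClassification

    model = LongformerForSequenceClassification.from_pretrained("allenai/longformer-base-4096")
    # Each encoder layer holds a LongformerSelfAttention module; this is what a
    # BERT/RoBERTa-style full-attention module would have to replace, and its
    # forward() signature differs from RobertaSelfAttention's, so it is not a
    # plain attribute swap.
    print(type(model.longformer.encoder.layer[0].attention.self))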

When I adjust the attention_window to 1024 and enable gradient checkpointing on the HF pre-trained model (4096 max token length), I get a very high training loss and no performance metrics for Precision, Recall, or F1. I understand that performance might degrade due to the pretraining differences, but shouldn’t it still produce metrics on a 20K-sample evaluation?
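In case it matters, this is roughly the setup (num_labels=2 is a placeholder), including the metrics hook, assuming a standard Trainer with a compute_metrics callback; I’m wondering whether the missing Precision/Recall/F1 is a zero_division issue in the evaluation rather than the model itself:

    import numpy as np
    from sklearn.metrics import precision_recall_fscore_support
    from transformers import LongformerForSequenceClassification

    model = LongformerForSequenceClassification.from_pretrained(
        "allenai/longformer-base-4096", attention_window=1024, num_labels=2
    )
    model.gradient_checkpointing_enable()

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        # zero_division=0 keeps the metrics defined even if the model collapses
        # to predicting a single class, which the very high training loss suggests.
        precision, recall, f1, _ = precision_recall_fscore_support(
            labels, preds, average="macro", zero_division=0
        )
        return {"precision": precision, "recall": recall, "f1": f1}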