Using Longformer with full attention for comparison

khv2202 · November 1, 2022, 12:31pm

I’m doing some research on Longformer’s ability to classify long texts on a dataset where the majority of samples are shorter than 1024 tokens. Because the sparse attention mechanism will theoretically never be better than full attention for shorter texts, I was wondering if I could easily estimate the tradeoff between the two.

From the model documentation it does not seem possible to set full attention directly, but can I approximate this by truncating to 1024 tokens in the tokenizer and setting the (local) window size to 1024?

ccdv · November 1, 2022, 7:11pm

Longformer should be a little better on inputs with > 512 tokens (with a 512 window size).
If you change the local window to 1024, it should work like full attention.
However, since the model has not been pretrained on a large window, you should see some performance degradation.
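One caveat worth checking before relying on a 1024 window: a rough sketch, assuming the window is split symmetrically so that each token sees attention_window // 2 neighbours on each side (as in the Longformer paper and, as far as I can tell, the HF implementation). Under that assumption a 1024 window on a 1024-token input covers only about three quarters of all token pairs, and full coverage needs a one-sided reach of seq_len - 1. The function names here are hypothetical helpers, and the sketch ignores global attention (e.g. on the CLS token).

```python
def window_coverage(seq_len: int, attention_window: int) -> float:
    """Fraction of (query, key) pairs visible under a symmetric sliding window."""
    # Assumption: the window is split evenly between left and right context.
    one_sided = attention_window // 2
    visible = 0
    for i in range(seq_len):
        lo = max(0, i - one_sided)            # leftmost key this query can see
        hi = min(seq_len - 1, i + one_sided)  # rightmost key this query can see
        visible += hi - lo + 1
    return visible / (seq_len * seq_len)


def min_full_window(seq_len: int) -> int:
    """Smallest even window whose one-sided reach spans the whole sequence."""
    return 2 * (seq_len - 1)


if __name__ == "__main__":
    for window in (512, 1024, min_full_window(1024)):
        print(window, round(window_coverage(1024, window), 4))
```

If this assumption holds for your setup, the window would have to be roughly twice the sequence length to behave like full attention, which also moves further away from the pretraining configuration.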

I personally observed a performance loss between a model trained on 512 length sequences (full attention) and the same model with local attention instead when the attention window is > 768 tokens (== 3 blocks of 256 tokens).
See this repo for instance if you want to extrapolate existing models.

khv2202 · November 2, 2022, 5:46pm

Thank you for the valuable input and the link. That is really interesting work.

Silly me didn’t check the Longformer source code. It seems that you can set BERT’s full attention mechanism by passing this config option:

attention_mode = 'n2'

Will have to check (after I finish the current fine-tuning) if I can fit 1024-token inputs with full attention on my GPU, and how the model performs. Longformer has been pretrained on very long sequences, so that should not be a limitation.
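For the GPU-fit question, a back-of-envelope count of attention score entries gives a first idea of the gap between full and local attention. This is only a sketch: it assumes 12 layers and 12 heads (the base model size), fp32 scores, and that every layer's score matrix is kept for the backward pass. It ignores weights, other activations, optimizer state and any savings from gradient checkpointing. The helper names are made up for illustration.

```python
def attention_score_entries(seq_len, keys_per_query,
                            num_heads=12, num_layers=12, batch_size=1):
    """Number of attention scores held across all layers and heads."""
    return batch_size * num_layers * num_heads * seq_len * keys_per_query


def to_mib(entries, bytes_per_entry=4):
    """Convert a count of score entries to MiB (fp32 assumed by default)."""
    return entries * bytes_per_entry / 2**20


if __name__ == "__main__":
    seq_len = 1024
    # Full (n^2) attention: every query scores every key.
    full = attention_score_entries(seq_len, seq_len)
    # Local attention with a 512 window: 256 keys per side plus the token itself.
    local = attention_score_entries(seq_len, 2 * 256 + 1)
    print(f"full : {to_mib(full):.1f} MiB per sample")
    print(f"local: {to_mib(local):.1f} MiB per sample")
```

Under these assumptions full attention at 1024 tokens needs about twice the score memory of a 512 window, so the per-sample overhead is modest; batch size will matter more.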

khv2202 · November 18, 2022, 8:20am

The attention configuration and implementation by AllenAI differ from HF’s. HF doesn’t support switching attention mode, and AllenAI has no classification head example. I looked at subclassing, but couldn’t estimate how easy it is to replace the attention mechanism with BERT/RoBERTa’s, as there are implementation differences. Does anyone know?

When adjusting the attention_window to 1024 and enabling gradient checkpointing in the HF pre-trained model (4096 max token length), I get very high training loss and no performance metrics for precision, recall or F1. I get that performance might degrade due to pretraining differences, but shouldn’t it still produce metrics when evaluating on 20K samples?
