Longformer speed compared to BERT model

We are trying to use Longformer and BERT models for multi-label classification of documents.

When we use the BERT model (BertForSequenceClassification) with max length 512 and batch size 8, each epoch takes approximately 30 minutes.

When we use Longformer (LongformerForSequenceClassification with 'allenai/longformer-base-4096' and gradient_checkpointing=True) with max length 4096, batch size 1, and 8 gradient accumulation steps, each epoch takes approximately 12 hours.

Is this expected, or are we missing something?
Is there anything we can try to make training faster?
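For a rough sanity check, here is a back-of-envelope estimate of the expected slowdown. It assumes Longformer's sliding-window attention uses the default window of 512 tokens, that both models have the same hidden size and layer count, and that gradient checkpointing adds roughly 40% overhead from the extra forward pass; the helper function and the 1.4 factor are illustrative assumptions, not measured values.

```python
def rel_cost(seq_len, batch, window=None, checkpoint=1.0):
    """Relative per-step cost: linear (FFN) token work + attention work.

    Full self-attention scales with seq_len^2 per sequence; a sliding
    window of size w scales with seq_len * w instead. The `checkpoint`
    factor crudely models gradient-checkpointing overhead (assumption).
    """
    tokens = seq_len * batch
    attn = tokens * (window if window is not None else seq_len)
    return (tokens + attn) * checkpoint

# BERT: max length 512, batch size 8, full attention.
bert = rel_cost(512, 8)

# Longformer: max length 4096, batch size 1, window 512,
# ~1.4x overhead assumed for gradient checkpointing.
longf = rel_cost(4096, 1, window=512, checkpoint=1.4)

# Longformer needs 8 gradient-accumulation steps per effective batch of 8.
ratio = (8 * longf) / bert
print(f"expected slowdown ≈ {ratio:.1f}x")  # ≈ 11.2x under these assumptions
```

This crude model predicts roughly an 11x slowdown, so a 24x gap suggests additional overhead, likely from the batch size of 1 underutilizing the GPU at each step.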


I was using LED and found it is also roughly 10 times slower than the BART model.