Fine-tuning BERT with sequences longer than 512 tokens

In my experience, Longformer and BigBird require a lot of GPU memory. I tried using them on a 14 GB GPU, but I was limited to batch_size=1, which made training extremely slow and yielded rather poor results.
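For reference, the batch_size=1 bottleneck can sometimes be softened with gradient accumulation, gradient checkpointing, and mixed precision. Below is a minimal sketch assuming the Hugging Face transformers Trainer; the model checkpoint, the tiny in-memory dataset, and all hyperparameters are illustrative placeholders, not something from my actual setup.

```python
# Sketch: fitting Longformer fine-tuning onto a ~14 GB GPU.
# Assumptions: transformers + datasets installed; data and hyperparameters
# below are placeholders to be replaced with your own task.
from datasets import Dataset
from transformers import (
    LongformerForSequenceClassification,
    LongformerTokenizerFast,
    Trainer,
    TrainingArguments,
)

model_name = "allenai/longformer-base-4096"
tokenizer = LongformerTokenizerFast.from_pretrained(model_name)
model = LongformerForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder data: swap in your own long documents and labels.
train_ds = Dataset.from_dict(
    {"text": ["a very long document ..."] * 8, "label": [0, 1] * 4}
)

def tokenize(batch):
    # Pad/truncate to the model's 4096-token window.
    return tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=4096
    )

train_ds = train_ds.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="longformer-finetune",   # hypothetical output path
    per_device_train_batch_size=1,      # all that fits at 4096 tokens
    gradient_accumulation_steps=16,     # simulates an effective batch of 16
    gradient_checkpointing=True,        # recompute activations to save memory
    fp16=True,                          # halve activation memory on supported GPUs
    learning_rate=2e-5,
    num_train_epochs=3,
    logging_steps=10,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_ds)
trainer.train()
```

This doesn't reduce per-step memory use of the attention itself, but accumulating gradients over several steps at least avoids updating with noisy single-example gradients, which may be part of why batch_size=1 runs converge poorly.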