Comparison of methods for large token inputs

We're currently working on using a pretrained clinical-style model to perform multilabel classification on long text inputs.

We have a few options here: use a model with a larger context window like Longformer/BigBird, or alternatively chunk the input and use something with a smaller token limit like ClinicalBERT/BioBERT.

I imagine BigBird or Longformer would be preferable to chunking here, since if we chunk the input we have to devise some method at inference time to combine the per-chunk predictions into one final classification per record, rather than getting one classification per chunk. Is anyone aware of general benchmarking comparing these methods?
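For concreteness, the aggregation step I'd need under the chunking approach might look something like this (a minimal sketch with NumPy; max-pooling over chunks and the 0.5 threshold are just assumptions on my part, not an established recipe):

```python
import numpy as np

def aggregate_chunk_probs(chunk_probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Combine per-chunk sigmoid probabilities, shape (n_chunks, n_labels),
    into one multilabel prediction for the whole record by max-pooling:
    a label fires if ANY chunk predicts it above the threshold."""
    record_probs = chunk_probs.max(axis=0)          # (n_labels,)
    return (record_probs >= threshold).astype(int)  # binary label vector

# Three chunks of one record, four candidate labels
chunk_probs = np.array([
    [0.9, 0.2, 0.1, 0.4],
    [0.3, 0.7, 0.2, 0.4],
    [0.1, 0.1, 0.6, 0.4],
])
print(aggregate_chunk_probs(chunk_probs))  # [1 1 1 0]
```

Mean-pooling or a learned attention over chunk embeddings are other obvious choices, which is exactly why I'm hoping benchmarks exist.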

Also - can someone explain, in terms of their architectures, why BigBird appears to fit into 10 GB of VRAM + 32 GB of RAM while Longformer does not (both at batch size = 1)? And when VRAM is saturated, is regular RAM used for the remaining required memory? I think I'm missing a fundamental piece of understanding about the hard memory limits of large transformers.
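To make the question concrete, here's the back-of-envelope I've been using (a rough sketch: fp32, attention score matrices only, ignoring model weights, other activations, and optimizer state; the BERT-base-like head/layer counts are just for scale):

```python
def full_attn_bytes(seq_len: int, n_heads: int, n_layers: int,
                    bytes_per_float: int = 4) -> int:
    """Rough size of the full self-attention score matrices alone:
    one (seq_len x seq_len) matrix per head per layer -> quadratic in seq_len."""
    return seq_len ** 2 * n_heads * n_layers * bytes_per_float

def sparse_attn_bytes(seq_len: int, window: int, n_heads: int, n_layers: int,
                      bytes_per_float: int = 4) -> int:
    """Same, for windowed/sparse attention (the Longformer/BigBird idea):
    each token attends to ~window positions -> linear in seq_len."""
    return seq_len * window * n_heads * n_layers * bytes_per_float

# 4096 tokens, 12 heads, 12 layers
full = full_attn_bytes(4096, 12, 12)
sparse = sparse_attn_bytes(4096, 512, 12, 12)
print(f"full attention:   {full / 2**30:.2f} GiB")    # ~9 GiB
print(f"sparse attention: {sparse / 2**30:.2f} GiB")  # ~1.1 GiB
```

If this accounting is roughly right, the quadratic-vs-linear term explains why long-input models are feasible at all, but it doesn't explain the BigBird-vs-Longformer gap I'm seeing, since both claim linear attention. Hence the question.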

If anyone has resources on this, I'd much appreciate you sharing them. Thank you in advance!