Comparison of methods for large token inputs

We're currently working on using a pretrained clinical-style model to perform multilabel classification on long text inputs.

We have a few options here: use a model with a larger context window like Longformer/BigBird, or alternatively chunk the input and use something with a smaller token limit like ClinicalBERT/BioBERT.

I imagine BigBird or Longformer would be preferable to chunking here, since if we chunk the input we have to devise some method at inference time to combine the per-chunk predictions into one final classification per record, rather than getting one classification per chunk. Is anyone aware of general benchmarking comparing these methods?
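For concreteness, the aggregation step I'd need under the chunking approach might look something like this (a minimal sketch with NumPy; max-pooling over chunks and the 0.5 threshold are just assumptions on my part, not an established recipe):

```python
import numpy as np

def aggregate_chunk_probs(chunk_probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Combine per-chunk sigmoid probabilities, shape (n_chunks, n_labels),
    into one multilabel prediction for the whole record by max-pooling:
    a label fires if ANY chunk predicts it above the threshold."""
    record_probs = chunk_probs.max(axis=0)          # (n_labels,)
    return (record_probs >= threshold).astype(int)  # binary label vector

# Three chunks of one record, four candidate labels
chunk_probs = np.array([
    [0.9, 0.2, 0.1, 0.4],
    [0.3, 0.7, 0.2, 0.4],
    [0.1, 0.1, 0.6, 0.4],
])
print(aggregate_chunk_probs(chunk_probs))  # [1 1 1 0]
```

Mean-pooling or a learned attention over chunk embeddings are other obvious choices, which is exactly why I'm hoping benchmarks exist.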

Also - can someone explain, in terms of their architectures, why BigBird appears to fit into 10 GB of VRAM + 32 GB of RAM while Longformer does not (both at batch size = 1)? And when VRAM is saturated, is regular RAM used for the remaining required memory? I think I'm missing a fundamental piece of understanding about the hard memory limits of large transformers.
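To make the question concrete, here's the back-of-envelope I've been using (a rough sketch: fp32, attention score matrices only, ignoring model weights, other activations, and optimizer state; the BERT-base-like head/layer counts are just for scale):

```python
def full_attn_bytes(seq_len: int, n_heads: int, n_layers: int,
                    bytes_per_float: int = 4) -> int:
    """Rough size of the full self-attention score matrices alone:
    one (seq_len x seq_len) matrix per head per layer -> quadratic in seq_len."""
    return seq_len ** 2 * n_heads * n_layers * bytes_per_float

def sparse_attn_bytes(seq_len: int, window: int, n_heads: int, n_layers: int,
                      bytes_per_float: int = 4) -> int:
    """Same, for windowed/sparse attention (the Longformer/BigBird idea):
    each token attends to ~window positions -> linear in seq_len."""
    return seq_len * window * n_heads * n_layers * bytes_per_float

# 4096 tokens, 12 heads, 12 layers
full = full_attn_bytes(4096, 12, 12)
sparse = sparse_attn_bytes(4096, 512, 12, 12)
print(f"full attention:   {full / 2**30:.2f} GiB")    # ~9 GiB
print(f"sparse attention: {sparse / 2**30:.2f} GiB")  # ~1.1 GiB
```

If this accounting is roughly right, the quadratic-vs-linear term explains why long-input models are feasible at all, but it doesn't explain the BigBird-vs-Longformer gap I'm seeing, since both claim linear attention. Hence the question.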

If anyone has resources on this, I'd much appreciate you sharing them. Thank you in advance!