Token Classification Models on (Very) Long Text

Hi everyone,

From what I have seen, most token classification models have a max token length below 1k. Are there any models that can be used (i.e. customized) with very long texts (long-form documents)?

If a model’s max token length is customizable, I assume its memory footprint has to be light so that a large batch of embeddings and weights can fit in GPU memory?

Any help/recommendation would be greatly appreciated in tackling this problem.


Hi @nottakumasato

Most models have a 512-token limit and cannot extrapolate to longer sequences.
The memory footprint also increases quadratically with sequence length because standard attention is O(n²).
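
For a rough sense of scale, here is a back-of-the-envelope sketch of how the attention score matrices alone grow with sequence length. The defaults (12 heads, 4-byte floats, batch size 1) are assumptions chosen for illustration, not measurements of any particular model, and real memory use also includes activations, weights and optimizer state.

```python
def attention_scores_bytes(seq_len, num_heads=12, bytes_per_value=4, batch_size=1):
    """Bytes needed to hold the seq_len x seq_len attention scores of one layer."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value


for n in (512, 4096, 16384):
    mib = attention_scores_bytes(n) / 2**20
    print(f"{n:>6} tokens: {mib:,.0f} MiB per layer")
```

Under these assumptions, going from 512 to 16384 tokens (32x longer) multiplies this term by 1024.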

The best way to handle long sequences is to use a custom attention mechanism.
You can try this repo with a small model and a small block size; you should be able to process 16k-token sequences.
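
To illustrate why a small block size helps, here is a minimal sketch that counts (query, key) pairs for a generic block-local pattern, where each block of tokens attends only to itself and its two neighbouring blocks. This is an assumption made for illustration and is not the actual attention pattern implemented in the repo; `local_block_pairs` is a hypothetical helper name.

```python
def local_block_pairs(seq_len, block_size):
    """Count (query, key) pairs when each block attends to itself and adjacent blocks."""
    num_blocks = -(-seq_len // block_size)  # ceiling division
    pairs = 0
    for b in range(num_blocks):
        q_start = b * block_size
        q_end = min(q_start + block_size, seq_len)
        # Keys span the previous, current and next block, clipped to the sequence.
        k_start = max(0, q_start - block_size)
        k_end = min(seq_len, q_end + block_size)
        pairs += (q_end - q_start) * (k_end - k_start)
    return pairs


seq_len, block_size = 16384, 128
full = seq_len * seq_len
local = local_block_pairs(seq_len, block_size)
print(f"full attention: {full:,} pairs")
print(f"block-local:    {local:,} pairs ({full / local:.1f}x fewer)")
```

The local count grows linearly with sequence length for a fixed block size, which is what makes 16k-token inputs feasible in this kind of scheme.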

The BART model goes up to 1024 tokens.
Then there are models that can take up to 16k tokens, but they’re more custom and not always available out of the box on HuggingFace. One example is the Longformer, which can be accessed via HuggingFace as shown here. You may also want to take a look at this recent paper from Google. It describes a model designed for text generation (not exactly classification as you asked, but it gives you an idea of what’s possible), and the authors have also made their code available (you can see more details here and here). There is still an open PR which will be merged into the main HuggingFace branch soon, so right now you’d have to take their code from the fork.

Thank you for the replies @ccdv & @AndreaSottana !

So I guess there are two ways to tackle this:

  1. Split up the input text into segments that are less than the model’s max sequence (token) length
  2. Find a model like Pegasus-X or Longformer that can handle all the samples (based on their sequence length) in my dataset
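
Option 1 could be sketched roughly as follows: split the token sequence into overlapping windows, classify each window separately, and for tokens covered by two windows keep the label from the window where the token sits farther from an edge (so it has more context). The names `split_into_windows`, `merge_predictions` and `dummy_predict` are hypothetical, and the 512/128 values are assumptions; `dummy_predict` stands in for a real model call.

```python
def split_into_windows(tokens, max_len=512, stride=128):
    """Return (start, window) pairs; consecutive windows overlap by `stride` tokens."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break
        start += step
    return windows


def merge_predictions(num_tokens, windows, predict):
    """Label each token using the window in which it lies farthest from an edge."""
    labels = [None] * num_tokens
    best_margin = [-1] * num_tokens
    for start, window in windows:
        window_labels = predict(window)  # one label per token in the window
        for i, label in enumerate(window_labels):
            margin = min(i, len(window) - 1 - i)
            if margin > best_margin[start + i]:
                best_margin[start + i] = margin
                labels[start + i] = label
    return labels


def dummy_predict(window):
    """Placeholder for a real token classification model."""
    return ["O"] * len(window)


tokens = [f"tok{i}" for i in range(1000)]
windows = split_into_windows(tokens, max_len=512, stride=128)
labels = merge_predictions(len(tokens), windows, dummy_predict)
print(len(windows), len(labels))
```

If you use a Hugging Face fast tokenizer, the `stride` and `return_overflowing_tokens=True` arguments should produce overlapping windows for you, though it is worth checking the documentation for your version.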

Option #1 seems more plausible, so I will give it a try.

Is it also possible to use RNN (non-Transformer) based models? I assume the tradeoff is model “accuracy” vs the max sequence length?

Best way to handle long sequences is to use a custom attention mechanism.

Is there a specific reason that you didn’t recommend using earlier RNN-based models? Since they don’t have an attention mechanism, their memory footprint should theoretically be linear in the sequence length, right?

You can try this repo

Is there a paper about this LSG attention mechanism? It looks interesting, and any further info would be appreciated to help me understand it a bit more.