Sentiment analysis for long text - canonical solution

What is the canonical way of dealing with text sequences longer than 512 tokens when doing sentiment analysis? I can split the text into chunks of fewer than 512 tokens each, using some fairly convoluted custom code (convert to ids, split the id lists into chunks of fewer than 512 ids each, re-convert to text, then predict for each chunk), but I was wondering whether there is a more general solution, ideally one that doesn't stray outside the pipeline.

E.g. this code will fail because the text is too long:

import torch
from transformers import pipeline
from faker import Faker

model_name = "bhadresh-savani/distilbert-base-uncased-emotion"
classifier = pipeline(
    "text-classification",
    model=model_name,
    top_k=None,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)
f = Faker()

# ~1000 random words, well beyond the model's 512-token limit
text = " ".join(f.words(nb=1000))
classifier(text)

The error is:

Token indices sequence length is longer than the specified maximum sequence length for this model (1002 > 512). Running this sequence through the model will result in indexing errors

There are multiple ways to handle this:

  • you can process the sequence in chunks and add a layer on top to merge the per-chunk predictions
  • you can truncate the input (see the snippet right after this list)
  • you can use long-input models (Longformer, BigBird)
  • you can convert an existing model into a long-input variant
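Truncation is the simplest option and stays entirely inside the pipeline: as far as I know, the text-classification pipeline forwards extra keyword arguments to the tokenizer, so something like this should work, reusing the classifier from the question:

# Anything beyond the model's 512-token limit is cut off and ignored
predictions = classifier(text, truncation=True, max_length=512)
print(predictions)

The obvious trade-off is that any sentiment signal in the discarded tail is lost.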

The first option (chunk and merge) won't be compatible with the pipeline out of the box; a rough sketch of that approach follows.
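This is only a minimal sketch, assuming the simplest merge strategy of averaging the per-chunk probabilities; the helper name classify_long and the chunk_size value are mine for illustration, not part of any API:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bhadresh-savani/distilbert-base-uncased-emotion"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def classify_long(text, chunk_size=510):
    # Tokenize once without special tokens; [CLS]/[SEP] are re-added per chunk,
    # which is why chunk_size is 510 rather than 512.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    probs = []
    for chunk in chunks:
        input_ids = torch.tensor([tokenizer.build_inputs_with_special_tokens(chunk)])
        with torch.no_grad():
            logits = model(input_ids=input_ids).logits
        probs.append(logits.softmax(dim=-1))
    # Naive merge: average the per-chunk probability distributions
    mean_probs = torch.cat(probs).mean(dim=0)
    return {model.config.id2label[i]: p.item() for i, p in enumerate(mean_probs)}

print(classify_long(text))  # `text` from the snippet in the question

Plain averaging is the crudest merge; weighting chunks by their token count or max-pooling per label are easy variations.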
