Sentiment analysis for long text - canonical solution

What is the canonical way of dealing with text sequences longer than 512 tokens when doing sentiment analysis? I can split the text into chunks of fewer than 512 tokens each, using some fairly convoluted custom code (convert to ids, split the id lists into chunks of fewer than 512 ids each, re-convert to text, then predict for each chunk), but I was wondering whether there is a more general solution, ideally one that doesn't stray outside the pipeline.

E.g. this code will fail because the text is too long:

import torch
from transformers import pipeline
from faker import Faker

model_name = "bhadresh-savani/distilbert-base-uncased-emotion"
classifier = pipeline(
    "text-classification",
    model=model_name,
    top_k=None,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)
f = Faker()

# ~1000 random words, well beyond the model's 512-token limit
text = " ".join(f.words(nb=1000))
classifier(text)

The error is:

Token indices sequence length is longer than the specified maximum sequence length for this model (1002 > 512). Running this sequence through the model will result in indexing errors

There are multiple ways to handle this:

  • you can process the sequence in chunks and add a layer on top to merge the per-chunk predictions
  • you can truncate the input (see the snippet right after this list)
  • you can use long-input models (Longformer, BigBird)
  • you can convert an existing model into a long-input variant
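Truncation is the simplest option and stays entirely inside the pipeline: as far as I know, the text-classification pipeline forwards extra keyword arguments to the tokenizer, so something like this should work, reusing the classifier from the question:

# Anything beyond the model's 512-token limit is cut off and ignored
predictions = classifier(text, truncation=True, max_length=512)
print(predictions)

The obvious trade-off is that any sentiment signal in the discarded tail is lost.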

The first option (chunk and merge) won't be compatible with the pipeline out of the box; a rough sketch of that approach follows.
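This is only a minimal sketch, assuming the simplest merge strategy of averaging the per-chunk probabilities; the helper name classify_long and the chunk_size value are mine for illustration, not part of any API:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bhadresh-savani/distilbert-base-uncased-emotion"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def classify_long(text, chunk_size=510):
    # Tokenize once without special tokens; [CLS]/[SEP] are re-added per chunk,
    # which is why chunk_size is 510 rather than 512.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    probs = []
    for chunk in chunks:
        input_ids = torch.tensor([tokenizer.build_inputs_with_special_tokens(chunk)])
        with torch.no_grad():
            logits = model(input_ids=input_ids).logits
        probs.append(logits.softmax(dim=-1))
    # Naive merge: average the per-chunk probability distributions
    mean_probs = torch.cat(probs).mean(dim=0)
    return {model.config.id2label[i]: p.item() for i, p in enumerate(mean_probs)}

print(classify_long(text))  # `text` from the snippet in the question

Plain averaging is the crudest merge; weighting chunks by their token count or max-pooling per label are easy variations.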
