What is the canonical way of dealing with text sequences longer than 512 tokens when doing sentiment analysis? I can split the text into chunks of fewer than 512 tokens each using some fairly convoluted custom code (convert the text to ids, split the id list into chunks of fewer than 512 ids each, re-convert each chunk to text, then predict on each chunk; a sketch of this workaround is at the end of the question), but I was wondering whether there is a more general solution, ideally one that does not stray outside the pipeline.
For example, this code fails because the text is too long:
import torch
from transformers import pipeline
from faker import Faker

model_name = "bhadresh-savani/distilbert-base-uncased-emotion"
classifier = pipeline(
    "text-classification",
    model=model_name,
    top_k=None,  # return scores for all labels
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)

f = Faker()
text = " ".join(f.words(nb=1000))  # ~1000 random words, well over 512 tokens
classifier(text)
The error is:
Token indices sequence length is longer than the specified maximum sequence length for this model (1002 > 512). Running this sequence through the model will result in indexing errors
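For completeness, my current chunking workaround looks roughly like the sketch below (it reuses model_name, classifier, and text from the snippet above; the helper name and the 510-token budget are my own choices, with 510 leaving room for the [CLS] and [SEP] special tokens):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

def classify_in_chunks(text, chunk_size=510):
    # Convert the text to ids, skipping special tokens for now
    ids = tokenizer.encode(text, add_special_tokens=False)
    # Split the id list into chunks of at most chunk_size ids
    chunks = [ids[i : i + chunk_size] for i in range(0, len(ids), chunk_size)]
    # Re-convert each chunk to text and predict on each chunk
    return classifier([tokenizer.decode(chunk) for chunk in chunks])

results = classify_in_chunks(text)  # one list of label scores per chunk

This works, but it is exactly the kind of round-tripping through the tokenizer I would like to avoid, and it leaves me to aggregate the per-chunk scores myself.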