I’m trying to use the sentiment-analysis pipeline; currently I’m using:
import transformers
from transformers import pipeline

nlp = pipeline('sentiment-analysis')
nlp.tokenizer = transformers.DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
to classify a large corpus of textual data. Most of my data points are < 512 tokens, and the pipeline seems to be working well for them. However, some data points (~15% of the whole dataset) have more than 512 tokens. I’ve tried splitting them into chunks of 512 tokens and aggregating the per-chunk results (roughly what I did is sketched below), but this didn’t seem to work very well. Is there a principled/recommended approach for such situations? Perhaps a different model/tokenizer? I’ve tried XLNet but didn’t get very good results…
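For reference, this is roughly what I tried, in simplified form; the chunk size of 500 (to stay safely under the 512-token limit) and the averaging of the POSITIVE probability across chunks are my own choices, not anything I found recommended:

import transformers
from transformers import pipeline

nlp = pipeline('sentiment-analysis')
tokenizer = transformers.DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
nlp.tokenizer = tokenizer

def classify_long_text(text, chunk_size=500):
    # Tokenize once without special tokens, then split into chunks that
    # fit the model limit (500 leaves headroom for [CLS]/[SEP] and for
    # small decode/re-encode differences).
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]
    chunk_texts = [tokenizer.decode(chunk) for chunk in chunks]
    results = nlp(chunk_texts)
    # Naive aggregation: average the probability of POSITIVE over all chunks.
    pos_scores = [r["score"] if r["label"] == "POSITIVE" else 1.0 - r["score"] for r in results]
    avg = sum(pos_scores) / len(pos_scores)
    return ("POSITIVE" if avg >= 0.5 else "NEGATIVE", avg)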