Sentiment analysis for long sequences

I’m trying to use the sentiment-analysis pipeline; currently I’m using:

from transformers import pipeline, DistilBertTokenizer

nlp = pipeline('sentiment-analysis')
nlp.tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

to classify a large corpus of textual data. Most of my data points are < 512 tokens, and the pipeline seems to be working well. However, some data points (~15% of the whole dataset) have more than 512 tokens. I’ve tried splitting them into chunks of 512 tokens and aggregating the results, but that didn’t seem to work very well. Is there any principled/recommended approach for such situations? Perhaps a different model/tokenizer? I’ve tried XLNet but didn’t get very good results…
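For reference, this is roughly the chunk-and-aggregate approach I tried; the classify_long_text helper and the signed-score averaging are only an illustrative sketch of what I mean, not a vetted recipe:

from transformers import pipeline, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
nlp = pipeline("sentiment-analysis")

def classify_long_text(text, max_tokens=510):  # leave room for [CLS]/[SEP]
    # Tokenize once, then split the token ids into fixed-size chunks
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [ids[i:i + max_tokens] for i in range(0, len(ids), max_tokens)]
    texts = [tokenizer.decode(chunk) for chunk in chunks]
    results = nlp(texts)
    # Naive aggregation: average the signed scores across chunks
    signed = [r["score"] if r["label"] == "POSITIVE" else -r["score"] for r in results]
    avg = sum(signed) / len(signed)
    return ("POSITIVE" if avg >= 0 else "NEGATIVE"), abs(avg)

label, confidence = classify_long_text(some_long_document)

The per-chunk predictions often disagree, which is why the averaged result isn’t great.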

Hi @adamh, if your context is really long then you can consider using the Longformer model; it supports sequences of up to 4096 tokens. But you’ll need to fine-tune the model first.
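As a rough illustration (the checkpoint name and the two-label head are just placeholders for whatever you end up fine-tuning), running Longformer for sequence classification looks something like this:

import torch
from transformers import LongformerTokenizer, LongformerForSequenceClassification

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2
)

long_text = "..."  # your full document here
inputs = tokenizer(long_text, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()  # class index; only meaningful after fine-tuning

Note that the base checkpoint isn’t trained for sentiment, so the predictions are only useful once you’ve fine-tuned on labelled data.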

Thanks! Most of the data points that have more than 512 tokens are between 500 and 1000, but some go up to 2000 or so (which is more “document sentiment analysis” than “sentence sentiment analysis”).

Regarding Longformer - is there a reasonable way to assess how many examples would be needed for fine-tuning?


@adamh - did you find a viable solution? I am looking to solve this myself.
