Sentiment analysis for long sequences

I’m trying to use the sentiment-analysis pipeline; currently I’m using:

from transformers import pipeline, DistilBertTokenizer

nlp = pipeline('sentiment-analysis')
nlp.tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

to classify a large corpus of textual data. Most of my data points are < 512 tokens, and the pipeline seems to be working well. However, some data points (~15% of the whole dataset) have more than 512 tokens. I’ve tried splitting them into chunks of 512 tokens and aggregating the results, but that didn’t seem to work very well. Is there any principled/recommended approach for such situations? Perhaps a different model/tokenizer? I’ve tried XLNet but didn’t get very good results…
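For reference, this is roughly the chunk-and-aggregate approach I tried; the classify_long_text helper and the signed-score averaging are only an illustrative sketch of what I mean, not a vetted recipe:

from transformers import pipeline, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
nlp = pipeline("sentiment-analysis")

def classify_long_text(text, max_tokens=510):  # leave room for [CLS]/[SEP]
    # Tokenize once, then split the token ids into fixed-size chunks
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [ids[i:i + max_tokens] for i in range(0, len(ids), max_tokens)]
    texts = [tokenizer.decode(chunk) for chunk in chunks]
    results = nlp(texts)
    # Naive aggregation: average the signed scores across chunks
    signed = [r["score"] if r["label"] == "POSITIVE" else -r["score"] for r in results]
    avg = sum(signed) / len(signed)
    return ("POSITIVE" if avg >= 0 else "NEGATIVE"), abs(avg)

label, confidence = classify_long_text(some_long_document)

The per-chunk predictions often disagree, which is why the averaged result isn’t great.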

Hi @adamh, if your context is really long then you can consider using the Longformer model; it supports sequences of up to 4096 tokens. But you’ll need to fine-tune the model first.
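As a rough illustration (the checkpoint name and the two-label head are just placeholders for whatever you end up fine-tuning), running Longformer for sequence classification looks something like this:

import torch
from transformers import LongformerTokenizer, LongformerForSequenceClassification

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2
)

long_text = "..."  # your full document here
inputs = tokenizer(long_text, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()  # class index; only meaningful after fine-tuning

Note that the base checkpoint isn’t trained for sentiment, so the predictions are only useful once you’ve fine-tuned on labelled data.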

Thanks! Most of the data points that have more than 512 tokens are between 500 and 1000, but some go up to 2000 or so (which is more “document sentiment analysis” than “sentence sentiment analysis”).

Regarding Longformer - is there a reasonable way to assess how many examples would be needed for fine-tuning?


@adamh - did you find a viable solution? I am looking to solve this myself.
