Longformer and sentiment analysis

Models like BERT and RoBERTa have a maximum sequence length of 512 tokens. Note that these models use subword tokenization, which means that a given word might be split into several tokens, so in practice they can handle fewer than 512 words. If you really want to use the pipeline API with a very long text, you can use models like Longformer or BigBird, which can handle 4096 tokens in a single forward pass.
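For example, a Longformer checkpoint fine-tuned for sentiment analysis plugs straight into the pipeline API (the checkpoint name below is a placeholder; substitute whichever fine-tuned Longformer or BigBird model you have):

from transformers import pipeline

# "my-org/longformer-sentiment" is hypothetical; use a real Longformer/BigBird
# checkpoint that has been fine-tuned for sentiment analysis
classifier = pipeline("sentiment-analysis", model="my-org/longformer-sentiment")

result = classifier("a very long text ...")
print(result)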

As an alternative (if you still want to use a model like BERT/RoBERTa), you can implement a sliding window approach yourself. The tokenizer supports this directly, like so:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "a very long text"
# truncation and max_length are needed for the tokenizer to actually split the
# text into chunks; stride sets the token overlap between consecutive windows
encoding = tokenizer(text, truncation=True, max_length=512, stride=128, return_overflowing_tokens=True)

This will create multiple examples (encodings) for a given text by sliding a window across it, with consecutive windows overlapping by 128 tokens (the stride). You can then feed each window through the model and average the predictions, as sketched below.
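Here's a minimal sketch of that approach, using the "distilbert-base-uncased-finetuned-sst-2-english" checkpoint as an example sentiment model. Mean-pooling the per-window probabilities is one simple choice; max-pooling or length-weighted averaging would also work:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "a very long text"
# padding=True makes all windows the same length so they can be batched
encoding = tokenizer(
    text,
    truncation=True,
    max_length=512,
    stride=128,
    return_overflowing_tokens=True,
    padding=True,
    return_tensors="pt",
)

with torch.no_grad():
    # each row of input_ids is one window of the same document
    outputs = model(
        input_ids=encoding["input_ids"],
        attention_mask=encoding["attention_mask"],
    )

# average the per-window probabilities to get one prediction for the full text
probs = outputs.logits.softmax(dim=-1).mean(dim=0)
label = model.config.id2label[probs.argmax().item()]
print(label, probs)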
