Longformer and sentiment analysis

codedancer · August 23, 2021, 5:51pm

I am trying to use longformer to do a sentiment analysis and I am wondering what the best way is to do it. I have the following code:

from transformers import LongformerTokenizer, EncoderDecoderModel
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/longformer2roberta-cnn_dailymail-fp16")
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")

nielsr · August 23, 2021, 6:58pm

Hi,

Longformer itself is a Transformer encoder, and that’s more than sufficient to perform sentiment analysis. You can just use LongformerForSequenceClassification, like so:

from transformers import LongformerTokenizer, LongformerForSequenceClassification
import torch

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096')

inputs = tokenizer("This text is positive", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits

Note that this model will have a classification head that is randomly initialized, so you’ll need to fine-tune it on a custom dataset.
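As a rough illustration of how the logits relate to a label and a score, here is a minimal sketch in plain Python, so it runs without downloading a model. The logit values and the `id2label` mapping are hypothetical placeholders, and `logits_to_prediction` is an illustrative helper, not part of the transformers library. With a real fine-tuned model you would presumably pass something like `outputs.logits[0].tolist()` and `model.config.id2label` instead.

```python
import math

def logits_to_prediction(logits, id2label):
    """Convert one example's raw logits into a pipeline-style dict."""
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]  # softmax probabilities
    # Pick the class index with the highest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical values: logits for a 2-class head and an assumed label mapping.
example_logits = [-1.2, 2.3]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
prediction = logits_to_prediction(example_logits, id2label)
print(prediction)
```

The score is the softmax probability of the highest-scoring class. It only becomes meaningful once the classification head has been fine-tuned.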

codedancer · August 23, 2021, 7:08pm

This is great! I think I am getting it, but I am just wondering how you get the label and score then. I normally use pipeline with sentiment-analysis and I can get something like [{'label': 'NEGATIVE', 'score': 0.9982099533081055}].

nielsr · August 24, 2021, 7:00am

So you want to do inference with an already fine-tuned model?

Looking on the hub, there are currently no Longformer checkpoints fine-tuned on a sentiment analysis dataset, so feel free to upload the first one.

Once you’ve trained a model, you can plug it into the pipeline API for quick inference.

Here’s an example with a RoBERTa model from the hub, fine-tuned on sentiment analysis:

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = "cardiffnlp/twitter-roberta-base-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

text = "this is a very positive text"

outputs = nlp(text)

In this case, printing the outputs returns [{'label': 'LABEL_2', 'score': 0.9623518586158752}]. This is of course not that helpful, as we don’t know what LABEL_2 means. This is because the authors of this model have not updated the config.id2label. However, looking at this file (mentioned in the code example on the model page), it appears that LABEL_2 means “positive”.
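One way to make such outputs readable is to remap the generic label names after the fact. The sketch below assumes the mapping LABEL_0 = negative, LABEL_1 = neutral, LABEL_2 = positive. Only the last of these is stated above, so check the other two against the model's mapping file. `rename_labels` is a hypothetical helper, not a transformers function.

```python
# Assumed mapping for this checkpoint; only LABEL_2 -> "positive" is confirmed in this thread.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}

def rename_labels(outputs, label_names):
    """Return a copy of pipeline-style outputs with readable label names."""
    return [
        # Unknown labels are passed through unchanged.
        {**item, "label": label_names.get(item["label"], item["label"])}
        for item in outputs
    ]

raw_outputs = [{"label": "LABEL_2", "score": 0.9623518586158752}]
readable = rename_labels(raw_outputs, LABEL_NAMES)
print(readable)
```

Alternatively, setting `model.config.id2label` to a dictionary such as `{0: "negative", 1: "neutral", 2: "positive"}` before creating the pipeline should make the pipeline return the readable names directly.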

codedancer · August 24, 2021, 11:32am

Thanks @nielsr! I think I’d need to train one and if I get to finish that, I’ll share it out. I guess the other way to tackle the longform text is to focus on the maximum limit of the text the pre-trained model can take. Do you happen to know what the limit is - like how much text I can push into the pipeline for the sentiment?

nielsr · August 24, 2021, 7:00pm

Models like BERT, RoBERTa, etc. all take a max sequence length of 512 tokens. Note that these models use subword tokenization, which means that a given word might be split into several tokens, so in practice these models can take in fewer than 500 words. So if you really want to use the pipeline API with a very long text, you can use models like Longformer or BigBird, which can handle 4096 tokens in a single forward pass.

As an alternative (if you still want to use a model like BERT/RoBERTa), you can implement it using a sliding window approach. You can do this using the tokenizer, like so:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "a very long text"
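# truncation must be enabled, otherwise no overflowing windows are produced
encoding = tokenizer(text, truncation=True, max_length=512, return_overflowing_tokens=True, stride=128)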

This will create multiple examples (encodings) for a given text by sliding a window across it, with consecutive windows overlapping by 128 tokens. You can then feed each window through the model and average the predictions.
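To make the windowing and averaging concrete, here is a simplified sketch in plain Python. It only mimics the idea: the real tokenizer also reserves room for special tokens in each window. The function names, the toy token ids, and the per-window probabilities are all hypothetical; in practice the probabilities would come from running each window through a fine-tuned classifier.

```python
def sliding_windows(token_ids, max_length, stride):
    """Split token_ids into windows of at most max_length tokens,
    with `stride` tokens of overlap between consecutive windows."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must be in [0, max_length)")
    step = max_length - stride  # how far the window advances each time
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the text
        start += step
    return windows

def average_predictions(window_probs):
    """Average per-window class probabilities into one distribution."""
    n = len(window_probs)
    num_classes = len(window_probs[0])
    return [sum(p[c] for p in window_probs) / n for c in range(num_classes)]

# Hypothetical stand-in for real token ids: 10 tokens, windows of 4 with 2 overlapping.
token_ids = list(range(10))
windows = sliding_windows(token_ids, max_length=4, stride=2)
print(windows)

# Hypothetical per-window probabilities, as if each window had been classified.
window_probs = [[0.5, 0.5], [0.25, 0.75], [0.0, 1.0], [0.25, 0.75]]
document_probs = average_predictions(window_probs)
print(document_probs)
```

A plain mean treats every window equally. Weighting windows by length or taking the maximum are other options, and which works best probably depends on the dataset.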

shensmobile · August 30, 2023, 11:49pm

Hi nielsr,

Sorry to bother you on this! If I’m looking to utilize RoBERTa for long sequence sentiment analysis and I want to use tokenizer stride during training, do I need to write an entirely custom train() logic to average the predictions during the batch evaluation? Or is there something already built-in to trainer.train() that I could leverage to take advantage of strides?
