Longformer and sentiment analysis

I am trying to use Longformer for sentiment analysis and I am wondering what the best way to do it is. I have the following code:

from transformers import LongformerTokenizer, EncoderDecoderModel
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/longformer2roberta-cnn_dailymail-fp16")
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")

Hi,

Longformer itself is a Transformer encoder, and that’s more than sufficient to perform sentiment analysis. You can just use LongformerForSequenceClassification, like so:

from transformers import LongformerTokenizer, LongformerForSequenceClassification
import torch

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096')

inputs = tokenizer("This text is positive", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits

Note that this model will have a classification head that is randomly initialized, so you’ll need to fine-tune it on a custom dataset.

This is great! I think I am getting it, but I am wondering how you then get the label and score. I normally use pipeline with sentiment-analysis and get something like [{'label': 'NEGATIVE', 'score': 0.9982099533081055}].

So you want to do inference with an already fine-tuned model?

Looking on the hub, there are currently no Longformer checkpoints fine-tuned on a sentiment analysis dataset. So feel free to upload the first one to the hub :wink:

Once you’ve trained a model, you can plug it into the pipeline API for quick inference.
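To make the connection between raw logits and the pipeline-style label and score concrete, here is a minimal sketch in plain Python. It assumes a single-label classifier, where the score is the softmax probability of the most likely class. The logits and the id2label mapping below are made-up values for illustration, not output from a real model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    # Subtract the maximum before exponentiating for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_label_and_score(logits, id2label):
    """Pick the most probable class and report it in a pipeline-style format."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": id2label[best], "score": probs[best]}]

# Hypothetical values, for illustration only
example_logits = [3.2, -3.1]
example_id2label = {0: "NEGATIVE", 1: "POSITIVE"}

result = to_label_and_score(example_logits, example_id2label)
print(result)
```

With a real model, the list of logits would come from outputs.logits for one input, and the mapping from the model config.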

Here’s an example with a RoBERTa model from the hub, fine-tuned on sentiment analysis:

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = "cardiffnlp/twitter-roberta-base-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

text = "this is a very positive text"

outputs = nlp(text)

In this case, printing the outputs returns [{'label': 'LABEL_2', 'score': 0.9623518586158752}]. This is of course not that helpful, as we don’t know what LABEL_2 means. The generic name appears because the authors of this model have not set id2label in the model config. However, looking at this file (mentioned in the code example on the model page), it appears that LABEL_2 means “positive”.
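One way to make such outputs readable is to rename the generic labels after the fact. The sketch below is plain Python that post-processes a pipeline-style result. Only the LABEL_2 entry comes from the discussion above; the names for LABEL_0 and LABEL_1 are my assumption about this checkpoint, and rename_labels is a hypothetical helper, not part of the library.

```python
# Assumed mapping: only LABEL_2 -> "positive" is confirmed above,
# the other two entries are assumptions for this checkpoint.
LABEL_NAMES = {
    "LABEL_0": "negative",
    "LABEL_1": "neutral",
    "LABEL_2": "positive",
}

def rename_labels(outputs, label_names):
    """Return a copy of pipeline-style outputs with readable label names.

    Labels missing from the mapping are left unchanged.
    """
    return [
        {**item, "label": label_names.get(item["label"], item["label"])}
        for item in outputs
    ]

outputs = [{"label": "LABEL_2", "score": 0.9623518586158752}]
readable = rename_labels(outputs, LABEL_NAMES)
print(readable)
```

Setting model.config.id2label on the loaded model before building the pipeline should achieve the same result without post-processing.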

Thanks @nielsr! I think I’d need to train one, and if I get to finish that, I’ll share it. I guess the other way to tackle long-form text is to stay within the maximum input length the pre-trained model can take. Do you happen to know what the limit is, i.e. how much text I can push into the pipeline for sentiment analysis?

Models like BERT, RoBERTa, etc. all take a max sequence length of 512 tokens. Note that these models use subword tokenization, which means that a given word might be split into several tokens, so in practice these models can take in fewer than 512 words. So if you really want to use the pipeline API with a very long text, you can use models like Longformer or BigBird, which can handle 4096 tokens in a single forward pass.

As an alternative (if you still want to use a model like BERT/RoBERTa), you can implement it using a sliding window approach. You can do this using the tokenizer, like so:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "a very long text"
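# Truncation must be enabled, otherwise no overflowing windows are produced
encoding = tokenizer(text, truncation=True, max_length=512, return_overflowing_tokens=True, stride=128)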

This will create multiple examples (encodings) for a given text, by sliding a window across the text, with consecutive windows overlapping by 128 tokens. You can then feed each example through the model, and average the predictions.
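To illustrate the idea independently of any particular tokenizer, here is a minimal sketch in plain Python. sliding_windows and average_predictions are hypothetical helpers, the token ids and probabilities are made-up values, and special tokens (which a real tokenizer counts toward the maximum length) are ignored. The averaging is a simple unweighted mean, which is one reasonable choice among several.

```python
def sliding_windows(token_ids, max_length, stride):
    """Split token_ids into windows of at most max_length tokens.

    Consecutive windows share `stride` tokens of overlap.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must be non-negative and smaller than max_length")
    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the text
        start += step
    return windows

def average_predictions(window_probs):
    """Average per-class probabilities across windows (unweighted mean)."""
    count = len(window_probs)
    return [sum(column) / count for column in zip(*window_probs)]

# Hypothetical token ids: 10 tokens, windows of 4 with 2 tokens of overlap
token_ids = list(range(10))
windows = sliding_windows(token_ids, max_length=4, stride=2)
print(windows)

# Hypothetical per-window class probabilities, one row per window
window_probs = [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2], [0.6, 0.4]]
document_probs = average_predictions(window_probs)
print(document_probs)
```

In practice, each window would be run through the model to obtain its probabilities before averaging.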

Hi nielsr,

Sorry to bother you on this! If I’m looking to utilize RoBERTa for long sequence sentiment analysis and I want to use tokenizer stride during training, do I need to write an entirely custom train() logic to average the predictions during the batch evaluation? Or is there something already built-in to trainer.train() that I could leverage to take advantage of strides?