Different sentiments when texts are processed in batches vs. individually

kayne88 · June 22, 2022, 3:28pm

I am seeing strange behavior in sentiment analysis with a fine-tuned model and tokenizer. When I tokenize the texts and pass them to the model one at a time, I get different class probabilities than when I tokenize them and pass them as a batch.

Here’s my observation:

text = tweets["eth"]["text"].values.tolist()[0]
print(text)
encoded_input = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
output = model(**encoded_input)
softmax(output.logits.detach().numpy())

Outputs:

ethereum will be back above xxx at some point im buying more now
array([[1.9731345e-04, 1.1291850e-03, 9.9867338e-01]], dtype=float32)

Whereas

text = tweets["eth"]["text"].values.tolist()[:2]
print(text)
encoded_input = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
output = model(**encoded_input)
softmax(output.logits.detach().numpy())

Outputs:

['ethereum will be back above xxx at some point im buying more now', 'tripledigit eth is also a chance of a lifetime']
array([[1.12888454e-04, 6.46036817e-04, 5.71367383e-01],
       [8.55998078e-05, 1.14106620e-03, 4.26646918e-01]], dtype=float32)

So, the same sentence yields different class probabilities. What is the issue here? I would of course like to process my data in batches.

Cheers

kayne88 · July 3, 2022, 3:21pm

Any ideas would be helpful.
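
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the six batched values posted above add up to about 1 across the whole array, instead of each row adding up to 1. That pattern would appear if `softmax` were normalizing over every element of the 2D logits array rather than over the class axis. The import of `softmax` is not shown in the post; if it is `scipy.special.softmax`, its `axis` argument defaults to `None`, which normalizes over the entire array. The check below uses only the numbers from the post and NumPy; the variable names are illustrative.

```python
import numpy as np

# Probabilities exactly as posted above.
single = np.array([[1.9731345e-04, 1.1291850e-03, 9.9867338e-01]])
batch = np.array([
    [1.12888454e-04, 6.46036817e-04, 5.71367383e-01],
    [8.55998078e-05, 1.14106620e-03, 4.26646918e-01],
])

# The whole batched array sums to ~1, while the individual rows do not.
total = batch.sum()
row_sums = batch.sum(axis=1)

# Renormalizing each row on its own gives per-text probabilities.
per_row = batch / batch.sum(axis=1, keepdims=True)

# The first row then agrees with the single-text result from the post.
first_row_matches = np.allclose(per_row[0], single[0], rtol=1e-3)

print("sum over whole batch:", total)
print("row sums:", row_sums)
print("renormalized rows:\n", per_row)
print("first row matches single-text output:", first_row_matches)
```

If that assumption holds, passing the class axis explicitly, e.g. `softmax(output.logits.detach().numpy(), axis=-1)`, or using `torch.softmax(output.logits, dim=-1)` before converting to NumPy, should make the batched probabilities match the single-text ones.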
