I start by loading a pretrained model (I tried two here) and its tokenizer:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import pandas as pd
import re
import torch
model_name = "Souvikcmsa/SentimentAnalysisDistillBERT"
# model_name = "Souvikcmsa/BERT_sentiment_analysis"  # same issue found with a different model!
model = AutoModelForSequenceClassification.from_pretrained(model_name, use_auth_token=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=True)
Then I load some text data:
# Read the data
url = 'https://raw.githubusercontent.com/Giskard-AI/examples/main/datasets/twitter_us_airline_sentiment_analysis.csv'
data = pd.read_csv(url)
Next, I define a basic preprocessor:
# Preprocess text (username and link placeholders)
# Replace usernames with @user and URLs with http so the model sees consistent placeholders
def preprocess(text):
    text = " ".join(text.split())            # collapse runs of whitespace
    text = re.sub(r'http\S+', 'http', text)  # URL -> http
    text = re.sub(r'@\S+', '@user', text)    # @mention -> @user
    text = text.lower()
    return text
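To show what the preprocessor does, here is a small self-contained check on a made-up tweet (the sample string is my own illustration, not a row from the dataset):

```python
import re

def preprocess(text):
    text = " ".join(text.split())            # collapse runs of whitespace
    text = re.sub(r'http\S+', 'http', text)  # URL -> http
    text = re.sub(r'@\S+', '@user', text)    # @mention -> @user
    return text.lower()

# Hypothetical input, for illustration only
sample = "@VirginAmerica   Check https://t.co/abc NOW"
cleaned = preprocess(sample)
print(cleaned)  # @user check http now
```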
I then take two subsets of the data (the first 5 and the first 100 entries):
torch.set_printoptions(precision=10)
# Freeze the base model's parameters
for param in model.base_model.parameters():
    param.requires_grad = False
# ----- 1. Preprocess data -----#
X = list(data["text"].apply(preprocess))
X_tokenized = tokenizer(X, padding=True, return_tensors="pt")
# First subset: first 5 rows of every tokenizer output
num_subsample1 = 5
X_tokenized_subset1 = {}
for key in X_tokenized.keys():
    X_tokenized_subset1[key] = X_tokenized[key][:num_subsample1]
# Second subset: first 100 rows of every tokenizer output
num_subsample2 = 100
X_tokenized_subset2 = {}
for key in X_tokenized.keys():
    X_tokenized_subset2[key] = X_tokenized[key][:num_subsample2]
The output of the model for the first entry of the first subset:
display_index=0
outputs1 = model(**X_tokenized_subset1)
outputs1[0][display_index]
gives:
tensor([-1.6196994781, 3.0899136066, -1.3701400757],
grad_fn=<SelectBackward0>)
while the output of the model for the first entry of the second subset (effectively the same entry) is:
outputs2 = model(**X_tokenized_subset2)
outputs2[0][display_index]
gives:
tensor([-1.6196994781, 3.0899133682, -1.3701400757],
grad_fn=<SelectBackward0>)
Although the two outputs should be identical, the second logit differs slightly between the two runs:
outputs2[0][display_index]-outputs1[0][display_index]
which gives:
tensor([ 0.0000000000e+00, -2.3841857910e-07, 0.0000000000e+00],
grad_fn=<SubBackward0>)
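For scale, the gap can be compared with the spacing between adjacent float32 values near 3.09. This is a minimal sketch, assuming the logits are float32 (the PyTorch default; I have not cast the model to another dtype). It uses only the standard library, and the helper name `float32_ulp` is my own, not part of any library:

```python
import struct

def float32_ulp(x: float) -> float:
    """Gap between x (rounded to float32) and the next larger float32, for x > 0."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    current = struct.unpack("<f", struct.pack("<I", bits))[0]
    following = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return following - current

logit_small_batch = 3.0899136066  # value printed for the 5-entry subset
logit_large_batch = 3.0899133682  # value printed for the 100-entry subset

ulp = float32_ulp(logit_small_batch)
diff = abs(logit_small_batch - logit_large_batch)
print(ulp)         # float32 spacing in [2, 4) is 2**-22
print(diff / ulp)  # close to 1, i.e. the two logits are neighbouring float32 values
```

So the two results differ by a single float32 step. A tolerance-based comparison such as `torch.allclose(outputs1[0][display_index], outputs2[0][display_index], atol=1e-6)` would treat them as equal.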
Any insights? Thanks