I have a pandas DataFrame with 9 million rows. Each row holds a short comment (like a tweet). I would like to add a new column with the label and score from a sentiment analysis model. I'm trying to do that with the DataFrame's apply function:
import torch
import torch.nn.functional as F
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

tqdm.pandas()  # registers progress_apply on pandas objects

model_name = 'finiteautomata/bertweet-base-sentiment-analysis'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, normalization=True)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

def apply_model(row):
    # Tokenize one comment and run a single forward pass for it
    batch = tokenizer(row.text, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**batch)
        predictions = F.softmax(outputs.logits, dim=1)
        labels = torch.argmax(predictions, dim=1)
    label = [model.config.id2label[label_id] for label_id in labels.tolist()][0]
    score = torch.topk(predictions, 1)[0].item()
    return [label, score]

df['SA'] = df.progress_apply(apply_model, axis=1)
The estimated time to completion (according to tqdm) is around 62 hours, lol.
How can I speed up this process?
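One direction I could imagine, as a rough sketch rather than a tested solution: the code above runs one forward pass per row, so feeding the model lists of texts in batches might cut the overhead. The helper names below (`iter_batches`, `predict_in_batches`, `make_predict_batch`), the batch size of 64, and the `device` argument are all made up for illustration. Whether this is fast enough on 9 million rows will depend on the hardware, and a GPU is an assumption on my part.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Prediction = Tuple[str, float]  # (label, score)


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(
    texts: Sequence[str],
    predict_batch: Callable[[List[str]], List[Prediction]],
    batch_size: int = 64,  # assumed value; tune for available memory
) -> List[Prediction]:
    """Call predict_batch once per batch and concatenate the results in order."""
    results: List[Prediction] = []
    for batch in iter_batches(texts, batch_size):
        results.extend(predict_batch(list(batch)))
    return results


def make_predict_batch(model, tokenizer, device: str = "cpu"):
    """Hypothetical helper: wrap a sequence-classification model and its tokenizer."""
    import torch  # imported lazily so the batching helpers above work without torch

    model.to(device).eval()

    def predict_batch(batch: List[str]) -> List[Prediction]:
        # padding=True pads every text in the batch to the longest one,
        # so the whole batch goes through the model in one forward pass
        enc = tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to(device)
        with torch.no_grad():
            probs = torch.softmax(model(**enc).logits, dim=1)
        scores, ids = probs.max(dim=1)
        return [(model.config.id2label[i], s) for i, s in zip(ids.tolist(), scores.tolist())]

    return predict_batch
```

With the `model` and `tokenizer` from my snippet, this would be called as `preds = predict_in_batches(df['text'].tolist(), make_predict_batch(model, tokenizer, 'cuda'))`, and the resulting list of (label, score) pairs could then be assigned back to the DataFrame. Sorting the texts by length before batching might reduce wasted padding, but I have not measured that.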