Faster way to apply a model to a dataframe

I have a pandas dataframe with 9 million rows. Each row contains a short comment (like a tweet). I would like to add a new column with the label and score from a sentiment analysis model. I'm currently doing this with the dataframe's apply function.


import torch
import torch.nn.functional as F
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

tqdm.pandas()  # registers DataFrame.progress_apply

model_name = 'finiteautomata/bertweet-base-sentiment-analysis'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, normalization=True)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)  # not used below

def apply_model(row):
    # Tokenize a single comment, so the model sees one row per call
    batch = tokenizer(row.text, truncation=True, return_tensors='pt')

    with torch.no_grad():
        outputs = model(**batch)
        predictions = F.softmax(outputs.logits, dim=1)
        # Index of the most probable class for this row
        label_id = torch.argmax(predictions, dim=1).item()
        label = model.config.id2label[label_id]
        score = predictions[0, label_id].item()
    return [label, score]

# Each cell of 'SA' holds a [label, score] list
df['SA'] = df.progress_apply(apply_model, axis=1)

The estimated time to completion (according to tqdm) is around 62 hours, lol.

How can I speed up this process?
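
One direction that usually helps with this kind of workload is to run the model on batches of comments rather than one row per call. Below is a minimal sketch of just the batching loop, not a tested solution for this model. The names iter_batches, predict_in_batches and fake_predict_batch are hypothetical, and fake_predict_batch is only a stand-in so the loop can run without loading a model.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Prediction = Tuple[str, float]  # (label, score)


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def predict_in_batches(
    texts: Sequence[str],
    predict_batch: Callable[[List[str]], List[Prediction]],
    batch_size: int = 64,
) -> List[Prediction]:
    """Call `predict_batch` once per batch and return results in input order."""
    results: List[Prediction] = []
    for batch in iter_batches(texts, batch_size):
        preds = predict_batch(batch)
        if len(preds) != len(batch):
            raise ValueError("predict_batch must return one prediction per text")
        results.extend(preds)
    return results


def fake_predict_batch(batch: List[str]) -> List[Prediction]:
    """Hypothetical stand-in for the real model call (assumption, not the real model)."""
    return [("POS" if "good" in text else "NEG", 1.0) for text in batch]


if __name__ == "__main__":
    comments = ["good day", "bad day", "good"]
    print(predict_in_batches(comments, fake_predict_batch, batch_size=2))
```

With the real model, predict_batch would presumably tokenize the whole list at once (tokenizer(batch, padding=True, truncation=True, return_tensors='pt')), run model(**encoded) under torch.no_grad(), and take the softmax over each row of the logits. The input would be df['text'].tolist(), and the returned list could be assigned back as a new column. The batch size of 64 is an arbitrary starting point. Moving the model and the tensors to a GPU, if one is available, is likely to matter at least as much as batching.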
