Faster way to apply a model to dataframe

thelara · March 2, 2022, 12:59pm

I have a pandas DataFrame with 9 million rows. Each row holds a short comment (like a tweet). I would like to add a new column with the label and score from a sentiment analysis model. I’m trying to do that with the DataFrame's apply function.


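import torch
import torch.nn.functional as F
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

tqdm.pandas()  # registers DataFrame.progress_apply

model_name = 'finiteautomata/bertweet-base-sentiment-analysis'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, normalization=True)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)  # built here but not used below

def apply_model(row):
    # Tokenize a single comment, so every row costs one forward pass (batch size 1)
    batch = tokenizer(row.text, truncation=True, return_tensors='pt')

    with torch.no_grad():
        outputs = model(**batch)
        predictions = F.softmax(outputs.logits, dim=1)
        # Index of the most probable class for this one comment
        label_id = torch.argmax(predictions, dim=1).item()
        label = model.config.id2label[label_id]
        score = predictions[0, label_id].item()
    return [label, score]

# df is an existing DataFrame with a 'text' column; each cell of 'SA' holds [label, score]
df['SA'] = df.progress_apply(apply_model, axis=1)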

The estimated time to completion (according to tqdm) is around 62 hours, lol.

How can I speed up this process?
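
For reference, one commonly suggested direction is to stop calling the model once per row and instead feed it padded batches of comments, ideally on a GPU if one is available. The following is only a rough sketch of that idea, not a measured fix: the helper names `batched`, `predict_in_batches`, and `make_hf_predictor` are hypothetical, the batch size of 64 is an arbitrary starting point, and the actual speedup depends on hardware and comment length.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Prediction = Tuple[str, float]  # (label, score)


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def predict_in_batches(
    texts: Sequence[str],
    predict_batch: Callable[[List[str]], List[Prediction]],
    batch_size: int = 64,
) -> List[Prediction]:
    """Run predict_batch over texts chunk by chunk, preserving order."""
    results: List[Prediction] = []
    for chunk in batched(texts, batch_size):
        results.extend(predict_batch(chunk))
    return results


def make_hf_predictor(model_name: str, device: str = "") -> Callable[[List[str]], List[Prediction]]:
    """Build a batch predictor around a sequence classification model (hypothetical helper)."""
    # Imported lazily so the batching helpers above stay dependency-free.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(model_name, normalization=True)
    model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device).eval()

    def predict_batch(texts: List[str]) -> List[Prediction]:
        # Pad to the longest comment in the batch so one forward pass covers all of them.
        enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
        with torch.no_grad():
            probs = torch.softmax(model(**enc).logits, dim=1)
        scores, ids = probs.max(dim=1)
        return [(model.config.id2label[i], s) for i, s in zip(ids.tolist(), scores.tolist())]

    return predict_batch
```

With the DataFrame from the post, this would be used roughly as `results = predict_in_batches(df['text'].tolist(), make_hf_predictor(model_name))`, followed by `df['label'], df['score'] = zip(*results)` to get two separate columns instead of a list per cell. Sorting the comments by length before batching may reduce padding further, at the cost of having to restore the original order afterwards.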
