Hey everyone,
So I’m working on a project that deals with textual data, and I have roughly 390k rows in the dataframe.
I tried mapping a function that uses a transformers pipeline to analyze the sentiment of each row, but it’s taking quite a while, approximately 25 hours for the full run. Is there any way I can speed this up?
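For reference, what I’m doing is roughly equivalent to this (simplified, and the column names are placeholders):

```python
from transformers import pipeline

pipe = pipeline("sentiment-analysis")

# One pipeline call per row, i.e. ~390k separate forward passes.
df["sentiment"] = df["text"].map(lambda t: pipe(t)[0]["label"])
```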
Thanks in advance.
Typically, the approach when starting from a dataframe is:
from datasets import Dataset

df = ...        # your pd.DataFrame
tokenizer = ... # your tokenizer

dataset = Dataset.from_pandas(df)
encoded_dataset = dataset.map(
    lambda examples: tokenizer(examples['sentence1']),
    batched=True,
)
Note the “batched=True” argument: the mapped function then receives batches of examples instead of single rows, so the tokenizer processes many sentences per call, which should greatly speed things up.
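If you want to stay with the pipeline API for the actual sentiment predictions, the pipeline itself can also batch and stream over a Dataset. A minimal sketch, assuming a standard sentiment-analysis pipeline; the batch_size value and the 'sentence1' column are placeholders you’d adjust:

```python
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

# device=0 uses the first GPU if one is available.
pipe = pipeline("sentiment-analysis", device=0)

# Stream one column of the dataset through the pipeline in batches,
# instead of calling the pipeline once per row.
labels = [
    out["label"]
    for out in pipe(KeyDataset(dataset, "sentence1"), batch_size=32)
]
```

Batching the forward passes (plus a GPU, if you have one) is usually where most of the 25 hours goes away.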