Weird execution time when using filter() with multiprocessing

MinhQuan2710 · February 19, 2023, 2:03pm

Hello everyone, I have a dataset that contains 9746395 rows, with the following structure:

Dataset({
    features: ['time', 'text', 'lang', 'sentiment_result'],
    num_rows: 9746395
})

The sentiment_result is “positive”, “negative”, or “neutral”.
For a given date in the time column (yyyy-mm-dd, for example), I want to count how many rows are “positive”, “negative”, and “neutral”.

My code is as follows:

for date in tqdm(df["Date"].tolist()):
    tmp = sentiment_dataset.filter(lambda x: x["time"] == date, keep_in_memory=True, num_proc=10)
    number_of_positives.append(len(tmp.filter(lambda x: x["sentiment_result"] == "positive", keep_in_memory=True, num_proc=10)))
    number_of_negatives.append(len(tmp.filter(lambda x: x["sentiment_result"] == "negative", keep_in_memory=True, num_proc=10)))
    number_of_neutral.append(len(tmp.filter(lambda x: x["sentiment_result"] == "neutral", keep_in_memory=True, num_proc=10)))

Each iteration takes around 2 minutes to complete (I timed only tmp and the ‘positive’ tmp.filter(…) call, and removed the ‘neutral’ and ‘negative’ filters because of the execution time). However, the process sometimes freezes for no apparent reason, and an iteration then takes around 8 minutes to finish.
As you can see in the following image.

Can anyone show me what the problem is and how I can resolve it?
Thank you for your help.
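
For reference, here is a minimal sketch of the same per-date counting done in a single pass over the data, instead of several filter() calls per date. It is only an illustration under assumptions: the helper name count_sentiments_by_date is hypothetical, and the two toy lists stand in for the columns sentiment_dataset["time"] and sentiment_dataset["sentiment_result"]. It does not explain the freezes described above; it only shows an alternative way to get the counts.

```python
from collections import Counter, defaultdict


def count_sentiments_by_date(times, sentiments):
    """Count sentiment labels per date in one pass over both columns.

    Hypothetical helper: `times` and `sentiments` are assumed to be
    equal-length iterables, e.g. the "time" and "sentiment_result" columns.
    """
    counts = defaultdict(Counter)
    for date, label in zip(times, sentiments):
        counts[date][label] += 1
    return counts


# Toy stand-ins for sentiment_dataset["time"] and
# sentiment_dataset["sentiment_result"] (assumed, not real data).
times = ["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-01"]
sentiments = ["positive", "negative", "neutral", "positive"]

counts = count_sentiments_by_date(times, sentiments)
print(counts["2023-01-01"]["positive"])
```

With counts built once, the three lists from the loop above could be filled by looking up counts[date]["positive"], counts[date]["negative"], and counts[date]["neutral"] for each date; a missing date or label simply yields 0.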
