Hello everyone, I have a dataset that contains 9746395 rows, with the following structure:
Dataset({
    features: ['time', 'text', 'lang', 'sentiment_result'],
    num_rows: 9746395
})
The sentiment_result is “positive”, “negative”, or “neutral”.
For a given date (in yyyy-mm-dd format, for example), I want to count how many rows on that date are “positive”, “negative”, and “neutral”.
My code is as follows:
from tqdm import tqdm

number_of_positives, number_of_negatives, number_of_neutral = [], [], []
for date in tqdm(df["Date"].tolist()):
    # Keep only the rows for this date, then count each sentiment label.
    tmp = sentiment_dataset.filter(lambda x: x["time"] == date, keep_in_memory=True, num_proc=10)
    number_of_positives.append(len(tmp.filter(lambda x: x["sentiment_result"] == "positive", keep_in_memory=True, num_proc=10)))
    number_of_negatives.append(len(tmp.filter(lambda x: x["sentiment_result"] == "negative", keep_in_memory=True, num_proc=10)))
    number_of_neutral.append(len(tmp.filter(lambda x: x["sentiment_result"] == "neutral", keep_in_memory=True, num_proc=10)))
Each iteration takes around 2 minutes to complete. (Because of the execution time, I measured this with only tmp and the “positive” filter, leaving out the “negative” and “neutral” ones.) On top of that, the process sometimes freezes for no apparent reason, and the iteration then takes around 8 minutes to finish.
As you can see in the following image.
Can anyone show me what the problem is and how I can resolve it?
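For reference, here is the kind of single-pass counting I am hoping to end up with instead of calling filter once per date. This is only a minimal sketch: it uses collections.Counter from the Python standard library rather than anything from the datasets API, the helper name count_sentiments_by_date is made up for illustration, and the sample rows are invented. It assumes the "time" and "sentiment_result" columns can be read as two parallel sequences (for example sentiment_dataset["time"] and sentiment_dataset["sentiment_result"]) and that the "time" values compare equal to the dates in df["Date"], as they do in my loop.

```python
from collections import Counter


def count_sentiments_by_date(times, sentiments):
    """Count (date, sentiment) pairs in one pass over two parallel sequences."""
    return Counter(zip(times, sentiments))


# Invented sample rows standing in for sentiment_dataset["time"] and
# sentiment_dataset["sentiment_result"] (assumption: both fit in memory).
times = ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02", "2021-01-02"]
sentiments = ["positive", "negative", "neutral", "positive", "positive"]

counts = count_sentiments_by_date(times, sentiments)

# Invented dates standing in for df["Date"].tolist().
dates = ["2021-01-01", "2021-01-02"]

# A Counter returns 0 for missing keys, so dates without a label get 0.
number_of_positives = [counts[(d, "positive")] for d in dates]
number_of_negatives = [counts[(d, "negative")] for d in dates]
number_of_neutral = [counts[(d, "neutral")] for d in dates]
```

With this approach the data is scanned once in total rather than once per date, so the cost would no longer grow with the number of dates.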
Thank you for your help.