Hello Hugging Face friends! I have about 7 million tweets that I plan to use for sentiment analysis. Does anyone have advice on processing such a large quantity in a timely manner? Thank you for your time!
Hi! That number of tweets isn't that much. You can import them into `datasets` and use `map` with multiprocessing (`num_proc` greater than 1).
Also, feel free to provide more info on the libraries/tools you plan to use for preprocessing, etc.
Soon we will have a page in the docs dedicated to using `datasets` at scale, with tips for situations like this.
That’s excellent! Could you point me to a demo/tutorial that does something similar? I need to import my data from CSVs and convert the resulting dataset into a pandas DataFrame, and I’m not exactly sure where the multiprocessing step fits in. Thanks for your help! I have a deadline approaching and this info will be a lifesaver.