Hello Huggingface friends! So I have about 7 million tweets that I’m planning to use for sentiment analysis. Does anyone have advice on processing such large quantities in a timely manner? Thank you for your time!
Hi! That number of tweets doesn’t seem like much. You can import them into datasets and use map with multiprocessing (num_proc greater than 1).
Also, feel free to provide more info on the libraries/tools you plan to use for preprocessing, etc.
Soon we will have a page in the docs dedicated to using datasets at scale to give tips for such situations.
That’s excellent! Could you point me to a demo/tutorial that does something similar? I need to be able to import my data from CSV files and transform the dict into a pandas DataFrame. I’m not exactly sure where to add the multiprocessing step. Thanks for your help! I have a deadline approaching and this info will be a lifesaver.