Hello Hugging Face friends! I have about 7 million tweets that I plan to use for sentiment analysis. Does anyone have advice on processing such a large quantity in a timely manner? Thank you for your time!
Hi! That number of tweets isn't that much. You can import them into `datasets` and use `map` with multiprocessing (`num_proc` greater than 1).
Also, feel free to provide more info on the libraries/tools you plan to use for preprocessing, etc.
Soon we will have a page in the docs dedicated to using `datasets` at scale, with tips for situations like this.
That’s excellent! Could you point me to a demo/tutorial that does something similar? I need to import my data from CSVs and convert the resulting dataset into a pandas DataFrame, and I’m not exactly sure where the multiprocessing step fits in. Thanks for your help! I have a deadline approaching and this info will be a lifesaver.