I need to perform a few tasks on certain columns in a dataset and, once done, merge all of these columns into a single column.
For a given column or set of columns, are there any built-in methods to compute the mean, mode, etc. (similar to the pandas DataFrame methods mean() and mode())?
For a given column, if I need to run a function (for example, counting the number of tokens in a text column), is it possible to run it in batches or with multiprocessing? Any examples would be of great help!
For a given column or set of columns, are there any built-in methods to compute the mean, mode, etc. (similar to the pandas DataFrame methods mean() and mode())?
You can convert the dataset to a pandas DataFrame to use such analytics methods. Just make sure your dataset fits in RAM. If it doesn't, you can try using map instead.
For a given column, if I need to run a function (for example, counting the number of tokens in a text column), is it possible to run it in batches or with multiprocessing? Any examples would be of great help!
Sure, map does support batching and multiprocessing:
If we convert the dataset to a pandas DataFrame, that takes extra memory to hold the DataFrame alongside the Hugging Face dataset, right? Is there a plan to support these operations within datasets, or is there a way to perform them using datasets with less memory?
@lhoestq - What are the pros and cons of using datasets.set_format(type="pandas") and then, say, datasets["column1"].mean()? Do we need to reset the format once done?
It creates a lot of cache files at each step. Is there a way to clean up the cache files after each map step?
Also, if we use keep_in_memory=True with num_proc > 1, it slows down.
I am using v1.16.1, and I have constraints that prevent me from upgrading.
Is there a better way to speed up these preprocessing steps without caching a lot of files (I have a constraint on disk space), or with keep_in_memory=True and num_proc > 1?
@lhoestq
So this is as good as doing df = ds.to_pandas()?
Also, it uses extra memory compared with in-memory processing of a single column (as in df["category"].mode()), right?