Copy columns in a dataset and compute statistics for a column

sriniv · August 26, 2022, 11:33am

Hi,
Need help with the following.

I need to perform a few tasks on certain columns in a dataset and, once done, merge all these columns into a single column.
For a given column or set of columns, are there any built-in methods available to compute ‘mean’, ‘mode’, etc. (similar to a pandas DataFrame’s mean(), mode())?
For a given column, if I need to run a function (for example, get the number of tokens in a text column), is it possible to run it in batches or using multiprocessing? Any examples would be of great help!

Thanks

sriniv · August 26, 2022, 11:34am

@lhoestq - Appreciate any examples for the above.

lhoestq · August 26, 2022, 1:14pm

Hi !

For a given column or set of columns, are there any built-in methods available to compute ‘mean’, ‘mode’, etc. (similar to a pandas DataFrame’s mean(), mode())?

You can convert the dataset to a pandas DataFrame to use such analytics methods. Just make sure your dataset fits in RAM. If it doesn’t, you can try to use map instead.

For a given column, if I need to run a function (for example, get the number of tokens in a text column), is it possible to run it in batches or using multiprocessing? Any examples would be of great help!

Sure, map does support batching and multiprocessing:

dataset = dataset.map(count_tokens, batched=True, batch_size=512, num_proc=4)

See more in the docs: Main classes
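For illustration, count_tokens could look like the sketch below. It is not the function from the thread: the column names "text" and "num_tokens" are hypothetical, and whitespace splitting stands in for a real tokenizer. With batched=True, map hands the function a dict that maps column names to lists of values and expects lists of the same length back.

```python
def count_tokens(batch):
    """Batched map function: return a token count for each text in the batch.

    Assumes a hypothetical "text" column; whitespace splitting is only a
    stand-in for whatever tokenizer you actually use.
    """
    return {"num_tokens": [len(text.split()) for text in batch["text"]]}


# Quick check on a plain dict shaped like a batch
example_batch = {"text": ["hello world", "one two three", ""]}
print(count_tokens(example_batch))
```

Because the function only reads the batch it is given and keeps no state, it should be safe to run with num_proc > 1.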

sriniv · August 27, 2022, 12:50am

Thanks, @lhoestq.

If we convert the dataset to a pandas DataFrame, that takes extra memory to hold the DataFrame alongside the Hugging Face dataset, right? Is there a plan to support these operations within datasets, or is there a way to perform them using datasets with lower memory usage?

lhoestq · August 27, 2022, 12:08pm

The datasets library uses memory mapping to load the data from disk without filling up your RAM.

A memory-efficient way is to use map to compute the mean (you can pass a stateful function).
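A minimal sketch of such a stateful function, assuming a numeric column with no missing values. The class name RunningMean and the column name "column1" are illustrative, not part of the library. The callable accumulates a sum and a count batch by batch, so only one batch is held in memory at a time.

```python
class RunningMean:
    """Accumulates the sum and count of one column across batches."""

    def __init__(self, column):
        self.column = column
        self.total = 0.0
        self.count = 0

    def __call__(self, batch):
        values = batch[self.column]
        self.total += sum(values)
        self.count += len(values)
        # Returning None: no columns are added or changed.

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")


running = RunningMean("column1")
# With a real dataset this would be something like:
#     dataset.map(running, batched=True)
# Here two plain-dict batches are fed by hand instead.
running({"column1": [1, 2, 3]})
running({"column1": [4, 5]})
print(running.mean)
```

Two caveats apply to this sketch. The state lives in a single process, so it assumes num_proc is left unset. Also, if map finds a cached result it may skip calling the function and leave the accumulator empty; passing load_from_cache_file=False is one way to avoid that.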

sriniv · August 29, 2022, 6:09am

@lhoestq - What are the pros and cons of using datasets.set_format(type="pandas") and then, say, datasets["column1"].mean()? Do we need to reset the format once done?

lhoestq · August 29, 2022, 9:16am

It simply brings “column1” into memory to compute the mean. If you don’t want to have to reset the format afterwards, you can use

mean = datasets.with_format("pandas")["column1"].mean()

with_format returns a new dataset with the specified format, so you don’t need to call datasets.reset_format.

sriniv · August 29, 2022, 11:04am

Thanks, @lhoestq .

Also, if I apply multiple .map() operations, as below:

ds = ds.map(preprocess1, batched=True, num_proc=8)
ds = ds.map(preprocess2, batched=True, num_proc=8)
ds = ds.map(preprocess3, batched=True, num_proc=8)
ds = ds.map(preprocess4, batched=True, num_proc=8)

It creates a lot of cache files at each step. Is there a way to clean up the cache files after each map step?

Also, if we use keep_in_memory=True with num_proc > 1, it slows down.

I am using v1.16.1 and have constraints that prevent me from upgrading.

Is there a better way to speed up these preprocessing steps without caching a lot of files (I have limited disk space), or with keep_in_memory=True and num_proc > 1?

lhoestq · August 30, 2022, 9:56am

One way would be to combine your different preprocess functions into 1.

Alternatively, you can clear the cache of an intermediate step with

processed_ds = ds.map(...)
ds.cleanup_cache_files()
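The first suggestion, combining the preprocess functions into one, can be sketched as follows. The combine helper and the two small functions are hypothetical stand-ins for preprocess1 through preprocess4. The sketch assumes every step is a batched function that returns a dict of new or updated columns and does not change the number of rows.

```python
def combine(*functions):
    """Chain batched map functions into a single function."""

    def combined(batch):
        batch = dict(batch)  # shallow copy so the caller's batch is untouched
        for fn in functions:
            updates = fn(batch)
            if updates:
                batch.update(updates)  # later steps see earlier results
        return batch

    return combined


# Hypothetical stand-ins for the thread's preprocess functions
def lowercase(batch):
    return {"text": [t.lower() for t in batch["text"]]}


def add_length(batch):
    return {"length": [len(t) for t in batch["text"]]}


preprocess = combine(lowercase, add_length)
print(preprocess({"text": ["Hello", "WORLD!"]}))
# In the thread's setting, one call would then replace the chain:
#     ds = ds.map(preprocess, batched=True, num_proc=8)
```

With a single map call there is only one set of cache files to write, instead of one set per step.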

sriniv · September 14, 2022, 6:17am

@lhoestq -

If we need to apply DataFrame methods to the whole dataset instead of a single column, the following code does not work.

df = load_dataset("csv", data_files=sys.argv[1], split="train")
df.set_format(type="pandas")

print(df["category"].mode())  # This works
print(df.sample(10))  # This DOES NOT work

AttributeError: 'Dataset' object has no attribute 'sample'

lhoestq · September 14, 2022, 9:34am

You need to query the Dataset to get a DataFrame

In particular if you need the full dataset as a dataframe you can do

ds.set_format("pandas")
df = ds[:]

sriniv · September 14, 2022, 9:53am

@lhoestq
So this is equivalent to doing df = ds.to_pandas()?
Also, it uses extra memory compared with loading just a single column into memory (as in df["category"].mode()), right?

lhoestq · September 14, 2022, 11:03am

Yes correct !

inweriok · July 10, 2024, 12:28am

solved my problem! thank you so much!
