Copy columns in a dataset and compute statistics for a column

Hi,
Need help with the following.

  • I need to perform a few tasks on certain columns in a dataset and, once done, merge all these columns into a single column.
  • For a given column or set of columns, are there any built-in methods to compute ‘mean’, ‘mode’, etc. (similar to the pandas DataFrame’s mean() and mode())?
  • For a given column, if I need to run a function (for example, get the number of tokens in a text column), is it possible to run it in batches or with multiprocessing? Any examples would be of great help!

Thanks

@lhoestq - Appreciate any examples for the above.

Hi !

  • For a given column or set of columns, are there any built-in methods to compute ‘mean’, ‘mode’, etc. (similar to the pandas DataFrame’s mean() and mode())?

You can convert the dataset to a pandas DataFrame to use such analytics methods. Just make sure your dataset fits in RAM. If it doesn’t, you can use map instead.

  • For a given column, if I need to run a function (for example, get the number of tokens in a text column), is it possible to run it in batches or with multiprocessing? Any examples would be of great help!

Sure, map does support batching and multiprocessing:

dataset = dataset.map(count_tokens, batched=True, batch_size=512, num_proc=4)

See more in the docs: Main classes
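For reference, here is a minimal sketch of what a batched count_tokens could look like. The column names "text" and "num_tokens" and the whitespace tokenization are assumptions made for illustration; swap in your own tokenizer and column names. In batched mode the function receives a dict mapping column names to lists of values and returns the new column(s) as lists of the same length.

```python
def count_tokens(batch):
    """Batched map function: `batch` maps column names to lists of values."""
    # Assumption: tokens are whitespace-separated words in a "text" column.
    return {"num_tokens": [len(text.split()) for text in batch["text"]]}


def add_token_counts(dataset):
    """Apply count_tokens to a datasets.Dataset, 512 rows at a time, on 4 processes."""
    return dataset.map(count_tokens, batched=True, batch_size=512, num_proc=4)
```

Since count_tokens only sees plain Python lists, you can try it on a small hand-written dict before running it on the full dataset.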

Thanks, @lhoestq.

If we convert the dataset to a pandas DataFrame, that requires extra memory to hold the DataFrame alongside the Hugging Face dataset, right? Is there a plan to support these operations within datasets, or is there a way to perform them using datasets with lower memory usage?

The datasets library uses memory mapping to load the data from disk without filling up your RAM.

A memory-efficient way is to use map to compute the mean (you can pass a stateful function).
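As a rough sketch of the stateful-function idea: a small callable object can accumulate a running sum and count while map streams over the batches. The class name RunningMean and the helper column_mean are made up for this example, and it assumes a numeric column with no missing values. It returns None from the callable so that map leaves the dataset unchanged, and it runs in a single process, because with num_proc > 1 each worker would update its own copy of the state.

```python
class RunningMean:
    """Hypothetical stateful callable that accumulates the sum and count of one column."""

    def __init__(self, column):
        self.column = column
        self.total = 0.0
        self.count = 0

    def __call__(self, batch):
        values = batch[self.column]
        self.total += sum(values)
        self.count += len(values)
        # Returning None: no columns are added or modified.

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")


def column_mean(dataset, column, batch_size=1000):
    """Stream over a datasets.Dataset in batches and return the mean of `column`."""
    stat = RunningMean(column)
    # Single process on purpose: state would not be shared across workers.
    dataset.map(stat, batched=True, batch_size=batch_size)
    return stat.mean
```

Only one batch of the column is held in memory at a time. It is worth checking the result against pandas on a small sample before relying on it.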

@lhoestq - What are the pros and cons of using datasets.set_format(type="pandas") and then, say, datasets["column1"].mean()? Do we need to reset the format once done?

It simply brings “column1” into memory to compute the mean. If you don’t want to reset the format afterwards, you can use:

mean = datasets.with_format("pandas")["column1"].mean()

with_format returns a new dataset with the specified format, so you don’t need to call datasets.reset_format.

Thanks, @lhoestq .

Also, if I apply multiple .map() operations, as below:

ds = ds.map(preprocess1, batched=True, num_proc=8)
ds = ds.map(preprocess2, batched=True, num_proc=8)
ds = ds.map(preprocess3, batched=True, num_proc=8)
ds = ds.map(preprocess4, batched=True, num_proc=8)

This creates a lot of cache files at each step. Is there a way to clean up the cache files after each map step?

Also, if we use keep_in_memory=True with num_proc > 1, it slows down.

I am using v1.16.1 and have constraints that prevent me from upgrading.

Is there a better way to speed up these preprocessing steps without caching a lot of files (I have limited disk space), or with keep_in_memory=True and num_proc > 1?

One way would be to combine your different preprocess functions into one.

Alternatively, you can clear the cache of an intermediate step with:
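A sketch of how the steps could be chained into a single batched function, so that only one map call (and one set of cache files) is needed. The compose helper and the two stand-in steps below are hypothetical; your preprocess1 to preprocess4 would take their place. It assumes each step accepts a batch dict and returns a dict of new or updated columns without changing the number of rows.

```python
def compose(*functions):
    """Chain several batched map functions into a single one."""

    def combined(batch):
        batch = dict(batch)  # work on a copy of the incoming batch
        for fn in functions:
            updates = fn(batch)
            if updates is not None:
                batch.update(updates)  # later steps see earlier steps' columns
        return batch

    return combined


# Hypothetical stand-ins for your own preprocessing steps.
def lowercase_text(batch):
    return {"text": [t.lower() for t in batch["text"]]}


def add_length(batch):
    return {"length": [len(t) for t in batch["text"]]}


def preprocess_all(dataset):
    """Run every step in a single pass over a datasets.Dataset."""
    return dataset.map(compose(lowercase_text, add_length), batched=True, num_proc=8)
```

With your functions this would be ds.map(compose(preprocess1, preprocess2, preprocess3, preprocess4), batched=True, num_proc=8).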

processed_ds = ds.map(...)
ds.cleanup_cache_files()

@lhoestq -

If we need to apply DataFrame methods at the whole-dataset level instead of the column level, the following code does not work.

import sys
from datasets import load_dataset

df = load_dataset("csv", data_files=sys.argv[1], split="train")
df.set_format(type="pandas")

print(df["category"].mode())  # This works
print(df.sample(10))  # This DOES NOT work

AttributeError: 'Dataset' object has no attribute 'sample'

You need to query the Dataset to get a DataFrame.

In particular, if you need the full dataset as a DataFrame, you can do:

ds.set_format("pandas")
df = ds[:]

@lhoestq
So this is equivalent to doing df = ds.to_pandas()?
Also, it uses extra memory compared to bringing only a single column into memory (as in df["category"].mode()), right?

Yes, correct!