Does the Dataset instance have a "batched reduce" style method?

YalunHu · March 21, 2024, 11:08am

Hey, I’d like to know: does the Dataset instance have a “batched reduce” style method?

For example, each row of my dataset has a column named “num of person”. To get the total number of persons across all rows, I currently have to loop over every row and add its “num of person” value to a running count.

So I want to know: is there a batched-reduce method that can achieve the goal above? Looping over every row is rather slow. I know that the HF Dataset instance already has a batched map method, so maybe it has an equivalent batched reduce method as well?
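To make the idea concrete, here is a minimal plain-Python sketch of what a "batched reduce" would do: reduce each batch to a partial result, then combine the partial results. The helper name `batched_reduce` and its parameters are hypothetical and are not part of the `datasets` API; the rows are made-up sample data.

```python
from itertools import islice


def batched_reduce(rows, batch_size, reduce_batch, combine, initial):
    """Hypothetical helper: reduce each batch, then fold the partial results."""
    it = iter(rows)
    acc = initial
    while True:
        # Take the next `batch_size` rows (fewer for the last batch).
        batch = list(islice(it, batch_size))
        if not batch:
            return acc
        acc = combine(acc, reduce_batch(batch))


# Made-up sample rows using the column name from the question.
rows = [{"num of person": n} for n in (3, 0, 5, 2, 7)]

total = batched_reduce(
    rows,
    batch_size=2,
    reduce_batch=lambda batch: sum(r["num of person"] for r in batch),
    combine=lambda a, b: a + b,
    initial=0,
)
print(total)  # 17
```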

mariosasko · March 22, 2024, 5:50pm

Hi! Adding a “batched reduce” has been attempted once, in the pull request “Add reduce function” by AJDERS (#5533 in huggingface/datasets), but we decided not to merge it for the reasons mentioned in the PR. There, you can find a Colab that explains how to use Dataset.map to get the same result.
