Performance tips for shuffle and flatten_indices

vblagoje · November 18, 2020, 5:25pm

How can I speed up shuffle + flatten_indices on a dataset with millions of instances? It’s painfully slow with every setting I tried.

TIA

lhoestq · November 18, 2020, 5:40pm

Hi !

By flatten, do you mean flatten_indices?
Is your dataset made of strings?

If so, then the speed bottleneck is the I/O: reading from the current dataset's Arrow file and writing to a new file.
The new file is written when you call flatten_indices.
To speed things up, you can use an SSD or distribute the writing (for example, by using shard on the shuffled dataset).

vblagoje · November 18, 2020, 6:12pm

Yes, flatten_indices. Not strings, but int arrays of various precisions. Let me backtrack. I need to concatenate two large datasets; all features are int arrays. Concatenating is fast, but the resulting dataset is not shuffled. I need a shuffled dataset, and I don’t want it to rely on indices, caches, etc. Just a clean dataset (dataset.arrow, dataset_info.json, and state.json). I found those to be the fastest when loading and processing.

How could I do that?

vblagoje · November 20, 2020, 2:30pm

I tried flatten_indices on a dataset with 30 million examples, the progress bar indicated a running time of 28 hours. How can we speed it up?

jxm · December 5, 2024, 4:20pm

I have the same question! A performance guide for these kinds of simple tasks with Datasets would be great.

lhoestq · December 11, 2024, 3:38pm

My guide is passing num_proc= to make it faster ^^’ We have also made various optimizations since 2020, so make sure to use a recent version of datasets.
