Performance tips for shuffle and flatten_indices

Hey @lhoestq,

How can I speed up shuffle + flatten on a dataset with millions of instances? It’s painfully slow with every setting I’ve tried.


Hi !

By flatten, do you mean flatten_indices?
Is your dataset made of strings?

If so, then the speed bottleneck is I/O: reading from the current dataset's Arrow file and writing to a new file.
The new file is written when you call flatten_indices.
To speed things up, you can use an SSD or distribute the writing (for example, by using shard on the shuffled dataset).

Yes, flatten_indices. Not strings, but int arrays of various precisions. Let me backtrack. I need to concatenate two large datasets, and all features are int arrays. Concatenation itself is fast, but the resulting dataset is not shuffled. I need a shuffled dataset, and I don’t want it to rely on indices, caches, etc. Just a clean dataset (dataset.arrow, dataset_info.json, and state.json). I found those to be the fastest when loading and processing.

How could I do that?

I tried flatten_indices on a dataset with 30 million examples, and the progress bar estimated a running time of 28 hours. How can we speed it up?