Hey @lhoestq,
How can I speed up shuffle + flatten on a dataset with millions of instances? It’s painfully slow with every setting I’ve tried.
TIA
Hi !
By flatten do you mean flatten_indices?
Is your dataset made of strings?
If so, the speed bottleneck is I/O: reading from the current dataset's arrow file and writing to a new one.
The new file is written when you call flatten_indices.
To speed things up you can use an SSD or distribute the writing (for example by calling shard on the shuffled dataset).
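A rough sketch of the shard approach, assuming `shuffled_dataset` is your shuffled dataset (the shard count and output paths are just placeholders):

```python
num_shards = 8  # arbitrary: pick based on how many parallel writers your disk can handle

for index in range(num_shards):
    # Each shard rewrites only its own slice of the data, so these iterations
    # can be launched as separate processes or jobs to parallelize the I/O.
    shard = shuffled_dataset.shard(num_shards=num_shards, index=index)
    shard.flatten_indices().save_to_disk(f"shuffled_shard_{index}")
```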
Yes, flatten_indices. Not strings, int arrays of various precisions. Let me backtrack: I need to concatenate two large datasets whose features are all int arrays. Concatenation is fast, but the resulting dataset is not shuffled. I need a shuffled dataset, and I don’t want it to rely on indices mappings, caches, etc., just a clean dataset (dataset.arrow, dataset_info.json, and state.json), which I’ve found to be the fastest to load and process.
How could I do that?
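For reference, this is roughly the pipeline I have in mind (only a sketch; the paths and seed are placeholders):

```python
from datasets import concatenate_datasets, load_from_disk

ds_a = load_from_disk("path/to/first_dataset")   # placeholder paths
ds_b = load_from_disk("path/to/second_dataset")

combined = concatenate_datasets([ds_a, ds_b])  # fast: no data is rewritten yet
shuffled = combined.shuffle(seed=42)           # only creates an indices mapping

# The slow part: flatten_indices rewrites the rows in shuffled order, so the
# saved dataset has no indices mapping, just the arrow data plus
# dataset_info.json and state.json.
shuffled.flatten_indices().save_to_disk("path/to/clean_shuffled_dataset")
```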
I tried flatten_indices on a dataset with 30 million examples, and the progress bar indicated a running time of 28 hours. How can we speed it up?
I have the same question! A performance guide for these kinds of simple tasks with Datasets would be great.
My guide is: pass num_proc= to make it faster ^^’ We’ve also made various optimizations since 2020, so make sure to use a recent version of datasets.
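For example (the num_proc value is only illustrative, tune it to your machine):

```python
# Rewrite the shuffled dataset using several writer processes in parallel.
flat = shuffled_dataset.flatten_indices(num_proc=8)  # 8 is an example value
```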