Does saving a shuffled dataset to arrow format eliminate the indirection?

dhruvgrammarly · November 25, 2024, 5:38am

According to the documentation page "Differences between Dataset and IterableDataset":

However as soon as your Dataset has an indices mapping (via Dataset.shuffle() for example), the speed can become 10x slower. This is because there is an extra step to get the row index to read using the indices mapping, and most importantly, you aren’t reading contiguous chunks of data anymore. To restore the speed, you’d need to rewrite the entire dataset on your disk again using Dataset.flatten_indices(), which removes the indices mapping. This may take a lot of time depending on the size of your dataset though.

However, if I save the shuffled dataset to disk using save_to_disk(), will it be written in the shuffled order, or in the original order together with the indices mapping? In other words, will random access on the saved dataset be fast, or 10x slower?

lhoestq · December 4, 2024, 10:50am

Hi! save_to_disk() rewrites the dataset in the right (shuffled) order, so reloading it from disk gives you maximum speed.

Spela40 · December 4, 2024, 3:25pm

I’m Spela and I’m here for the first time

system · December 5, 2024, 3:26am

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.