Does saving a shuffled dataset to Arrow format eliminate the indirection?

According to the docs page "Differences between Dataset and IterableDataset":

However as soon as your Dataset has an indices mapping (via Dataset.shuffle() for example), the speed can become 10x slower. This is because there is an extra step to get the row index to read using the indices mapping, and most importantly, you aren’t reading contiguous chunks of data anymore. To restore the speed, you’d need to rewrite the entire dataset on your disk again using Dataset.flatten_indices(), which removes the indices mapping. This may take a lot of time depending on the size of your dataset though:

However, if I save the dataset to disk using save_to_disk(), will it be written in the shuffled order, or in the original order along with the indices mapping? In other words, will random access on the saved dataset be fast, or 10x slower?


Hi! save_to_disk() rewrites the dataset in the shuffled order, so reloading from there gives you max speed :)


I’m Spela and I’m here for the first time
