According to the documentation page "Differences between Dataset and IterableDataset":
However as soon as your Dataset has an indices mapping (via Dataset.shuffle() for example), the speed can become 10x slower. This is because there is an extra step to get the row index to read using the indices mapping, and most importantly, you aren’t reading contiguous chunks of data anymore. To restore the speed, you’d need to rewrite the entire dataset on your disk again using Dataset.flatten_indices(), which removes the indices mapping. This may take a lot of time depending on the size of your dataset though:
My question: if I save the shuffled dataset to disk using save_to_disk(), will it be written in the shuffled order, or in the original order along with the indices mapping (the redirection sequence)? In other words, will random access on the saved and reloaded dataset be fast, or 10x slower?
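To make the two possibilities concrete, here is a toy sketch in plain Python of what I understand an indices mapping to be. ToyDataset and everything in it are hypothetical and written only for illustration; this is not how the datasets library is implemented, and it says nothing about what save_to_disk() actually does.

```python
import random


class ToyDataset:
    """Hypothetical stand-in for a dataset; NOT the datasets library's code."""

    def __init__(self, rows, indices=None):
        self.rows = rows          # rows in their stored (contiguous) order
        self.indices = indices    # optional indices mapping; None means "read rows directly"

    def __len__(self):
        return len(self.rows) if self.indices is None else len(self.indices)

    def __getitem__(self, i):
        if self.indices is None:
            return self.rows[i]            # direct read
        return self.rows[self.indices[i]]  # extra lookup through the mapping

    def shuffle(self, seed):
        # Shuffling only builds a new mapping; the stored rows are untouched.
        order = list(range(len(self)))
        random.Random(seed).shuffle(order)
        if self.indices is not None:
            order = [self.indices[j] for j in order]
        return ToyDataset(self.rows, order)

    def flatten_indices(self):
        # Rewrite the rows in the mapped order and drop the mapping.
        return ToyDataset([self[i] for i in range(len(self))])


ds = ToyDataset(list(range(10)))
shuffled = ds.shuffle(seed=42)          # original storage order + mapping
flat = shuffled.flatten_indices()       # storage rewritten in shuffled order, no mapping
```

In these terms, I am asking whether the saved dataset looks like `shuffled` (original order plus a mapping) or like `flat` (rows rewritten in shuffled order).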