Caching and Shuffling Datasets on the Same Machine


I am using a dataset from your Hub multiple times on the same machine and would like to cache it for improved efficiency. Each iteration is associated with a random seed and requires a corresponding unique shuffling of the dataset. I’m concerned about whether the shuffle operation overwrites the existing cached dataset. If so, can you please suggest a way to shuffle the dataset without overwriting the cache? Is calling disable_caching() immediately after downloading the dataset sufficient?

Additionally, I would appreciate clarification on the caching process. I was able to load a dataset from the Hub, cache it, and then shuffle it. Subsequently, when I loaded the cached dataset, it appeared to be identical to the original unshuffled dataset. Can you please explain this behavior?

Thank you.


1. I usually change the shuffle seed with each iteration:

   ```python
   for iteration in range(iterations):
       ds_data = ds_data.shuffle(seed=2023 + iteration)
       # continue the process
   ```

2. Otherwise, disabling caching is sufficient for your case, since `shuffle()` uses a random seed by default.