Caching and Shuffling Datasets on the Same Machine

Hello,

I am using a dataset from your Hub multiple times on the same machine and would like to cache it for improved efficiency. Each iteration is associated with a random seed and requires a corresponding unique shuffling of the dataset. I’m concerned about whether the shuffle operation overwrites the existing cached dataset. If so, can you please suggest a way to shuffle the dataset without overwriting the cache? Is calling disable_caching() immediately after downloading the dataset sufficient?
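
For concreteness, here is a minimal sketch of the workflow I have in mind (the dataset name, loop bounds, and seed scheme are placeholders):

```python
from datasets import load_dataset, disable_caching

# "some_user/some_dataset" stands in for the actual Hub dataset
ds_data = load_dataset("some_user/some_dataset", split="train")
disable_caching()  # called right after downloading; is this sufficient?

for iteration in range(10):  # one pass per random seed
    shuffled = ds_data.shuffle(seed=iteration)
    # ... run the iteration on `shuffled` ...
```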

Additionally, I would appreciate clarification on the caching process. I was able to load a dataset from the Hub, cache it, and then shuffle it. Subsequently, when I loaded the cached dataset, it appeared to be identical to the original unshuffled dataset. Can you please explain this behavior?
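
Here is roughly what I did, with "some_user/some_dataset" again standing in for the actual dataset:

```python
from datasets import load_dataset

ds_data = load_dataset("some_user/some_dataset", split="train")
shuffled = ds_data.shuffle(seed=42)
print(shuffled[0])      # rows come back in shuffled order

ds_reloaded = load_dataset("some_user/some_dataset", split="train")
print(ds_reloaded[0])   # same first row as the original, unshuffled dataset
```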

Thank you.

Hello,

1. I usually change the shuffle seed with each iteration:

   ```python
   for iteration in range(iterations):
       ds_data = ds_data.shuffle(seed=2023 + iteration)
       # continue the process
   ```
    
2. Otherwise, disabling caching is sufficient for your case, since `shuffle` uses a random seed by default; see the sketch after this list.
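
As a minimal sketch of that second option (the dataset name is again a placeholder): disable caching once at the start, then call `shuffle()` without a seed. The shuffled results are written to temporary files rather than the cache directory, so the original cached dataset on disk is left untouched.

```python
from datasets import disable_caching, load_dataset

disable_caching()  # transformed datasets go to temporary files, not the cache

ds_data = load_dataset("some_user/some_dataset", split="train")  # placeholder name
shuffled = ds_data.shuffle()  # no seed given: a fresh random generator is used each call
```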