Caching and Shuffling Datasets on the Same Machine


I am using a dataset from your Hub multiple times on the same machine and would like to cache it for improved efficiency. Each iteration is associated with a random seed and requires a corresponding unique shuffling of the dataset. I’m concerned about whether the shuffle operation overwrites the existing cached dataset. If so, can you please suggest a way to shuffle the dataset without overwriting the cache? Is calling disable_caching() immediately after downloading the dataset sufficient?

Additionally, I would appreciate clarification on the caching process. I was able to load a dataset from the Hub, cache it, and then shuffle it. Subsequently, when I loaded the cached dataset, it appeared to be identical to the original unshuffled dataset. Can you please explain this behavior?

Thank you.


1. I usually change the shuffle seed with each iteration:

   ```python
   for iteration in range(iterations):
       ds_data = ds_data.shuffle(seed=2023 + iteration)
       # continue the process
   ```

2. Otherwise, disabling caching is sufficient for your case, since `shuffle()` uses a random seed by default.