Caching and Shuffling Datasets on the Same Machine

Hello,

I am using a dataset from your Hub multiple times on the same machine and would like to cache it for improved efficiency. Each iteration is associated with a random seed and requires a corresponding unique shuffling of the dataset. I’m concerned about whether the shuffle operation overwrites the existing cached dataset. If so, can you please suggest a way to shuffle the dataset without overwriting the cache? Is calling disable_caching() immediately after downloading the dataset sufficient?
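
For concreteness, here is a minimal sketch of the workflow I have in mind (the dataset name, loop bounds, and seed scheme are placeholders):

```python
from datasets import load_dataset, disable_caching

# "some_user/some_dataset" stands in for the actual Hub dataset
ds_data = load_dataset("some_user/some_dataset", split="train")
disable_caching()  # called right after downloading; is this sufficient?

for iteration in range(10):  # one pass per random seed
    shuffled = ds_data.shuffle(seed=iteration)
    # ... run the iteration on `shuffled` ...
```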

Additionally, I would appreciate clarification on the caching process. I was able to load a dataset from the Hub, cache it, and then shuffle it. Subsequently, when I loaded the cached dataset, it appeared to be identical to the original unshuffled dataset. Can you please explain this behavior?
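
Here is roughly what I did, with "some_user/some_dataset" again standing in for the actual dataset:

```python
from datasets import load_dataset

ds_data = load_dataset("some_user/some_dataset", split="train")
shuffled = ds_data.shuffle(seed=42)
print(shuffled[0])      # rows come back in shuffled order

ds_reloaded = load_dataset("some_user/some_dataset", split="train")
print(ds_reloaded[0])   # same first row as the original, unshuffled dataset
```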

Thank you.

Hello,

1. I usually change the shuffle seed with each iteration:

   ```python
   for iteration in range(iterations):
       ds_data = ds_data.shuffle(seed=2023 + iteration)
       # continue the process
   ```
    
2. Otherwise, disabling caching is sufficient for your case, since `shuffle` uses a random seed by default; see the sketch after this list.
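
As a minimal sketch of that second option (the dataset name is again a placeholder): disable caching once at the start, then call `shuffle()` without a seed. The shuffled results are written to temporary files rather than the cache directory, so the original cached dataset on disk is left untouched.

```python
from datasets import disable_caching, load_dataset

disable_caching()  # transformed datasets go to temporary files, not the cache

ds_data = load_dataset("some_user/some_dataset", split="train")  # placeholder name
shuffled = ds_data.shuffle()  # no seed given: a fresh random generator is used each call
```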