Shuffling and buffer size

sachin · October 3, 2023, 10:46am

For the following, why is the last statement False? I would have assumed that setting the buffer size equal to the take argument would make the two sets of elements identical.

from datasets import load_dataset

dataset = load_dataset("laion/laion400m", split="train", streaming=True)
full_ds1 = dataset.shuffle(seed=42, buffer_size=10).take(10)
full_ds2 = dataset.shuffle(seed=43, buffer_size=10).take(10)
set(x["key"] for x in full_ds1) == set(x["key"] for x in full_ds2)

mariosasko · October 3, 2023, 1:26pm

Besides using the “shuffle” buffer, shuffle also shuffles the order of the shards (the underlying data files; dataset.n_shards returns how many there are) for more randomness. The shard order depends on the seed, so the two calls return different results unless they use the same seed.
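
To illustrate the effect of shard shuffling, here is a minimal pure-Python sketch. It is a toy model and not the datasets implementation: make_shards, shard_order, shuffled_stream and take are hypothetical helpers, and the replace-on-yield buffer is an assumption about how a shuffle buffer typically behaves.

```python
import random
from itertools import islice


def make_shards(n_shards=4, shard_len=100):
    # Hypothetical stand-in for data files: keys look like "s<shard>-<index>".
    return [[f"s{s}-{i}" for i in range(shard_len)] for s in range(n_shards)]


def shard_order(n_shards, seed):
    # The seed decides the order in which shards are read.
    order = list(range(n_shards))
    random.Random(seed).shuffle(order)
    return order


def shuffled_stream(shards, seed, buffer_size):
    # Step 1: read the shards in a seed-dependent order.
    def examples():
        for s in shard_order(len(shards), seed):
            yield from shards[s]

    # Step 2: shuffle buffer (assumed design). Once the buffer is full,
    # yield a random slot and refill that slot with the incoming example.
    rng = random.Random(seed)
    buffer = []
    for x in examples():
        if len(buffer) == buffer_size:
            i = rng.randrange(buffer_size)
            yield buffer[i]
            buffer[i] = x
        else:
            buffer.append(x)
    # Source exhausted: shuffle and flush what is left in the buffer.
    rng.shuffle(buffer)
    yield from buffer


def take(iterable, n):
    # Lazily take the first n examples, like .take(n) on a stream.
    return list(islice(iterable, n))


shards = make_shards()
sample_a = take(shuffled_stream(shards, seed=42, buffer_size=10), 10)
sample_b = take(shuffled_stream(shards, seed=43, buffer_size=10), 10)
print(shard_order(len(shards), 42), sample_a)
print(shard_order(len(shards), 43), sample_b)
```

In this model, the ten sampled keys always come from whichever shard ends up first in the seeded order, so two seeds that put different shards first cannot produce the same set of keys, whatever the buffer size.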
