Shuffling and buffer size

For the following, why is the last statement false? I would have assumed that setting the buffer size equal to the take argument would make the two sets of elements identical.

from datasets import load_dataset

dataset = load_dataset("laion/laion400m", split="train", streaming=True)
full_ds1 = dataset.shuffle(seed=42, buffer_size=10).take(10)
full_ds2 = dataset.shuffle(seed=43, buffer_size=10).take(10)
set(x["key"] for x in full_ds1) == set(x["key"] for x in full_ds2)

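Besides using the shuffle buffer, shuffle() also shuffles the order of the shards (the underlying data files; dataset.n_shards returns how many there are) for more randomness. Your two calls use different seeds (42 and 43), so the shards are read in a different order and the first 10 examples come from different files. The results only match when the seed is the same.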
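To make this concrete, here is a rough pure-Python simulation of the two steps (shard-order shuffle, then buffer shuffle). It is only a sketch under assumptions, not the library's actual code: simulated_stream, first_keys and the toy shards are hypothetical names and data made up for illustration, and the buffer logic is a simplified guess at how a shuffle buffer typically works.

```python
import random
from itertools import islice


def simulated_stream(shards, seed, buffer_size):
    """Yield examples after a shard-order shuffle followed by a buffer shuffle.

    Simplified illustration only; not the datasets library's implementation.
    """
    rng = random.Random(seed)

    # Step 1: shuffle the order in which shards (files) are read.
    shard_order = list(range(len(shards)))
    rng.shuffle(shard_order)

    # Step 2: pass the examples through a fixed-size shuffle buffer.
    buffer = []
    for shard_idx in shard_order:
        for example in shards[shard_idx]:
            if len(buffer) < buffer_size:
                buffer.append(example)
            else:
                # Buffer is full: emit a random slot and refill it.
                i = rng.randrange(buffer_size)
                yield buffer[i]
                buffer[i] = example

    # Flush whatever is left once the stream is exhausted.
    rng.shuffle(buffer)
    yield from buffer


# Hypothetical toy data: 8 shards with 100 keys each.
shards = [[f"shard{s}-row{r}" for r in range(100)] for s in range(8)]


def first_keys(seed, n=10, buffer_size=10):
    """Return the set of the first n keys, mimicking shuffle(...).take(n)."""
    return set(islice(simulated_stream(shards, seed, buffer_size), n))


same_seed_equal = first_keys(42) == first_keys(42)
diff_seed_equal = first_keys(42) == first_keys(43)
print("same seed:", same_seed_equal)
print("different seeds:", diff_seed_equal)
print(sorted(first_keys(42)))
print(sorted(first_keys(43)))
```

With the same seed the two sets always match. With different seeds they usually start from a different shard, so the keys differ. In this sketch the buffer also keeps pulling in new rows while it emits, so even within one shard the first 10 outputs are not restricted to the first 10 rows.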
