Behavior of shuffled parquet dataset

If a dataset is stored as Parquet files, loaded with the Hugging Face load_dataset() function, and then shuffled, does this mean that batches contain rows from several files? Or does it only shuffle the order in which the Parquet files are read?


Hi! By default, load_dataset() downloads the Parquet files and converts the data to Arrow. Then, if you call .shuffle(), it builds a mapping from dataset index → row index in the Arrow data and uses this mapping whenever examples are accessed. This means the shuffling is global, across all the original Parquet files.
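
Here is a minimal sketch of what that looks like in practice. The file names and the "train" split key are just placeholders for illustration:

```python
from datasets import load_dataset

# Load two local Parquet shards as a single dataset
# (file names here are hypothetical).
ds = load_dataset(
    "parquet",
    data_files={"train": ["part-0.parquet", "part-1.parquet"]},
    split="train",
)

# shuffle() creates a permutation over all dataset indices in the Arrow data,
# so consecutive examples can come from different original Parquet files.
shuffled = ds.shuffle(seed=42)

# A slice of the shuffled dataset mixes rows from across both files,
# rather than reading file-by-file.
print(shuffled[:8])
```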

Small note: shuffling a streaming dataset works differently; you can find more details in Differences between Dataset and IterableDataset.
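
For comparison, a sketch of the streaming case (same hypothetical file names as above): with streaming=True, .shuffle() shuffles the shard order and uses an approximate buffer-based shuffle rather than a global permutation.

```python
from datasets import load_dataset

# Stream the same Parquet shards instead of materializing them as Arrow data.
ids = load_dataset(
    "parquet",
    data_files={"train": ["part-0.parquet", "part-1.parquet"]},
    split="train",
    streaming=True,
)

# buffer_size controls how many examples are held in memory and sampled from,
# so the shuffle is approximate, not global.
shuffled_stream = ids.shuffle(seed=42, buffer_size=1_000)

for example in shuffled_stream.take(5):
    print(example)
```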
