Behavior of shuffled parquet dataset

If a dataset is stored as Parquet files, loaded with the Hugging Face load_dataset() function, and then shuffled, does this mean that batches contain rows from several files? Or does it only shuffle the order in which the Parquet files are read?


Hi! By default, load_dataset() downloads the Parquet files and converts the data to Arrow. Then, if you call .shuffle(), it builds a mapping from dataset index → row index in the Arrow data and uses this mapping whenever examples are accessed. This means the shuffling is global, across all the original Parquet files.
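
Here is a minimal sketch of what that looks like in practice. The file names and the "train" split key are just placeholders for illustration:

```python
from datasets import load_dataset

# Load two local Parquet shards as a single dataset
# (file names here are hypothetical).
ds = load_dataset(
    "parquet",
    data_files={"train": ["part-0.parquet", "part-1.parquet"]},
    split="train",
)

# shuffle() creates a permutation over all dataset indices in the Arrow data,
# so consecutive examples can come from different original Parquet files.
shuffled = ds.shuffle(seed=42)

# A slice of the shuffled dataset mixes rows from across both files,
# rather than reading file-by-file.
print(shuffled[:8])
```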

Small note: shuffling a streaming dataset works differently; you can find more details in Differences between Dataset and IterableDataset.
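
For comparison, a sketch of the streaming case (same hypothetical file names as above): with streaming=True, .shuffle() shuffles the shard order and uses an approximate buffer-based shuffle rather than a global permutation.

```python
from datasets import load_dataset

# Stream the same Parquet shards instead of materializing them as Arrow data.
ids = load_dataset(
    "parquet",
    data_files={"train": ["part-0.parquet", "part-1.parquet"]},
    split="train",
    streaming=True,
)

# buffer_size controls how many examples are held in memory and sampled from,
# so the shuffle is approximate, not global.
shuffled_stream = ids.shuffle(seed=42, buffer_size=1_000)

for example in shuffled_stream.take(5):
    print(example)
```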
