Dataset only has n_shards=1 even when the repo contains multiple shards

Hi,

The repo openclimatefix/gfs-surface-pressure-2.0deg contains 39 Parquet shards, but when I query the number of shards, I get 1. This means that when I use the dataset with a PyTorch DataLoader, I can't use more than a single worker (num_workers=1), and my training is very slow. Is there a different way I should push to the Hub so that all the shards are visible? Or is there something else I can do so I can use more than one worker?

import datasets

hf_ds = datasets.load_dataset("openclimatefix/gfs-surface-pressure-2.0deg", split="train", streaming=True)
print(hf_ds.n_shards)  # prints 1 instead of 39
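For reference, this is roughly how I consume the dataset (a minimal sketch; batch_size and num_workers are placeholder values):

from torch.utils.data import DataLoader

# datasets assigns shards to dataloader workers, so parallelism is
# capped at n_shards: with n_shards == 1, only one of the 8 workers
# ever receives data and the rest sit idle
loader = DataLoader(hf_ds, batch_size=32, num_workers=8)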

The same issue occurs with openclimatefix/gfs-surface-pressure-0.5deg, which has 411 Parquet shards but also reports n_shards = 1 when queried.

Curiously, on a different machine the 0.5 deg dataset reports 431 shards, and the 2 deg one reports 39 shards as expected, so I guess it's some odd platform-specific thing.

Hi! Our packaged loaders (CSV, Parquet, etc.) rely on a special generator to iterate over the data files/directories, but this generator hides the number of data files, which results in n_shards being equal to 1.

cc @lhoestq I think we can replace gen_kwargs = {"files": dl_manager.iter_files(files)} with gen_kwargs = {"files": [dl_manager.iter_files(file) for file in files]} to allow some parallelization.
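For illustration, here is roughly how that change would look inside a packaged builder's _split_generators (a sketch only: the class name is hypothetical, the surrounding details are simplified, and a real builder also defines _info, _generate_tables, etc.):

import datasets

class ParquetSketch(datasets.GeneratorBasedBuilder):
    # Hypothetical minimal builder, for illustration only
    def _split_generators(self, dl_manager):
        data_files = dl_manager.download_and_extract(self.config.data_files)
        splits = []
        for split_name, files in data_files.items():
            # Before: a single generator hides the file count, so n_shards == 1
            #   gen_kwargs = {"files": dl_manager.iter_files(files)}
            # After: one generator per file, so n_shards == len(files)
            # and each dataloader worker can be handed its own file(s)
            gen_kwargs = {"files": [dl_manager.iter_files(file) for file in files]}
            splits.append(datasets.SplitGenerator(name=split_name, gen_kwargs=gen_kwargs))
        return splits

Since n_shards is derived from the length of list-valued gen_kwargs entries, passing one generator per file makes the shard count match the number of data files.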
