Dataset only has n_shards=1 even though the repo has multiple shards


The repo openclimatefix/gfs-surface-pressure-2.0deg on the Hugging Face Hub contains 39 Parquet shards. But when I query the number of shards (n_shards), I get 1, which means that when using it with a PyTorch DataLoader I can't use more than a single worker, and my training is very slow. Is there a way I should push to the Hub differently so that all the shards are visible? Or is there something else I can do so I can use more than a single worker?

hf_ds = datasets.load_dataset("openclimatefix/gfs-surface-pressure-2.0deg", split='train', streaming=True)
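For context, a streaming dataset's shards are distributed across DataLoader workers, so with n_shards=1 every worker beyond the first has nothing to read. A minimal pure-Python sketch of a round-robin shard-to-worker assignment (an illustration of the constraint, not datasets' actual code):

```python
def shards_for_worker(n_shards: int, worker_id: int, num_workers: int) -> list[int]:
    """Assign shard indices to a worker round-robin: a worker keeps the
    shards whose index is congruent to its id modulo num_workers."""
    return [i for i in range(n_shards) if i % num_workers == worker_id]

# With 39 shards and 4 workers, every worker gets a useful slice of the data.
print([len(shards_for_worker(39, w, 4)) for w in range(4)])  # [10, 10, 10, 9]

# With n_shards == 1, only worker 0 gets data; the others sit idle.
print([len(shards_for_worker(1, w, 4)) for w in range(4)])   # [1, 0, 0, 0]
```

This is why datasets warns when num_workers exceeds n_shards: the extra workers cannot be given any shards to iterate over.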

The same issue occurs with openclimatefix/gfs-surface-pressure-0.5deg, which has 411 Parquet shards but reports only 1 shard when querying n_shards.

Curiously, on a different machine the last dataset reports 431 shards, and the 2.0 deg data 39 shards as expected, so I guess it's some odd platform thing.

Hi! Our packaged loaders (CSV, Parquet, etc.) rely on a special generator to iterate over data files/directories, but the generator hides the number of data files, resulting in n_shards being equal to 1.

cc @lhoestq I think we can replace gen_kwargs = {"files": dl_manager.iter_files(files)} with gen_kwargs = {"files": [dl_manager.iter_files(file) for file in files]} to allow some parallelization.
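To illustrate why that change helps: the shard count of a split is derived from the list-valued entries in gen_kwargs, so a single generator counts as one shard, while a list with one entry per file exposes one shard per file. A simplified sketch of that counting behaviour (an approximation for illustration, not the real implementation inside datasets):

```python
def number_of_shards(gen_kwargs: dict) -> int:
    """Simplified shard counting: only list-valued gen_kwargs can be
    split across jobs, so the shard count is the list length
    (1 if there are no lists, e.g. a bare generator)."""
    list_lengths = {len(v) for v in gen_kwargs.values() if isinstance(v, list)}
    if not list_lengths:
        return 1  # nothing splittable -> a single shard
    assert len(list_lengths) == 1, "lists in gen_kwargs must have equal lengths"
    return list_lengths.pop()

files = ["shard-00000.parquet", "shard-00001.parquet", "shard-00002.parquet"]

# A single generator-like object hides the file count:
print(number_of_shards({"files": iter(files)}))           # 1

# A list with one iterable per file exposes one shard per file:
print(number_of_shards({"files": [[f] for f in files]}))  # 3
```

With the proposed gen_kwargs, each list element wraps one data file, so n_shards matches the number of Parquet files and a multi-worker DataLoader can split the work.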
