Hi,
The repo openclimatefix/gfs-surface-pressure-2.0deg on the Hugging Face Hub contains 39 Parquet shards. But when I query the number of shards I get 1, which means that with a PyTorch DataLoader I can't use more than one worker (num_workers > 1), and my training is very slow. Should I push to the Hub differently so that all the shards are seen? Or is there something else I can do so that I can use more than a single worker?
import datasets
hf_ds = datasets.load_dataset("openclimatefix/gfs-surface-pressure-2.0deg", split='train', streaming=True)
# Prints 1 for me, although the repo holds 39 Parquet shards
print(hf_ds.n_shards)
The same issue occurs with openclimatefix/gfs-surface-pressure-0.5deg, which has 411 Parquet shards but reports only 1 shard when querying n_shards.
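To make the link between the shard count and the usable number of workers concrete, here is a minimal sketch of how the loader could be built. The helper names `pick_num_workers` and `build_loader` and the default of 8 requested workers are my own, purely for illustration, and I am assuming the streaming dataset can be passed directly to a PyTorch DataLoader.

```python
def pick_num_workers(n_shards: int, requested: int) -> int:
    """Cap the worker count at the number of shards the dataset reports.

    Hypothetical helper: workers beyond the shard count have no shard
    of their own to read, so asking for more does not help.
    """
    return max(0, min(requested, n_shards))


def build_loader(repo_id: str, requested_workers: int = 8):
    """Stream a Hub dataset and wrap it in a DataLoader (illustrative only)."""
    # Imported lazily so pick_num_workers can be used without these packages.
    import datasets
    from torch.utils.data import DataLoader

    hf_ds = datasets.load_dataset(repo_id, split="train", streaming=True)
    num_workers = pick_num_workers(hf_ds.n_shards, requested_workers)
    print(f"{repo_id}: n_shards={hf_ds.n_shards}, num_workers={num_workers}")
    return DataLoader(hf_ds, num_workers=num_workers)
```

With n_shards reported as 1, `pick_num_workers(1, 8)` gives 1, which is the slow single-worker case described above. With the expected 39 shards, the same call would allow all 8 workers.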
Curiously, on a different machine the 0.5 deg dataset reports 431 shards and the 2 deg dataset reports 39 shards as expected, so I guess it's some odd platform-specific issue.
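Since the result differs between machines, one thing worth comparing is the installed package versions on each. This is only a sketch: I don't know that a version difference is the cause, and the list of packages below is my guess at what might be relevant. The function name `environment_report` is made up for illustration.

```python
import platform
from importlib import metadata


def environment_report(packages=("datasets", "huggingface_hub", "fsspec", "pyarrow", "torch")):
    """Collect Python, OS, and package versions for side-by-side comparison.

    The default package list is an assumption, not a confirmed set of culprits.
    """
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            report[name] = None
    return report


# Run this on both machines and compare the output line by line.
for key, value in environment_report().items():
    print(f"{key}: {value}")
```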