Loading webdatasets across multiple nodes

Hi all,

I'm working on releasing a dataset of around 1M images of galaxies labelled by volunteers (www.galaxyzoo.org). I'm trying to understand the pros and cons of:

  1. Storing the dataset on the hub as either webdataset .tar or HF 'standard' .parquet shards
  2. Loading the data with either the webdataset library or the HF 'standard' Datasets library

My current setup uses webdatasets for both storage and loading. But I think other users might like the friendly Datasets API.

If I upload webdataset .tar files and load them with the Datasets load_dataset(…, streaming=True), what happens under the hood? (See the sketch after the bullets.)

  • Does each node download the full dataset?
  • Are the webdataset files optionally cached as arrow/parquet, so that from there onwards they can be treated as a native HF dataset?
  • How are the sequential-read .tar shards compatible with the Datasets approach of skipping individual indices to rebalance across nodes?
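
For concreteness, the call I have in mind is roughly the following (the repo id is a placeholder for my dataset repo):

```python
from datasets import load_dataset

# placeholder repo id: a Hub repo containing webdataset .tar shards
ds = load_dataset("user/galaxy-zoo-webdataset", streaming=True, split="train")

for example in ds:
    ...  # what gets downloaded (and cached?) at this point, per node?
```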

If I instead upload HF native .parquet files:

  • Will I see reduced I/O performance from losing sequential read access?

Lastly, I wanted to bump another user's thread who asked if (in general, i.e. assuming the typical arrow/parquet setup) the Datasets library is compatible with PyTorch distributed training via Lightning's Trainer.

Thank you for your time and for building these cool tools; it's especially awesome that uploading WDS to the Hub is so easy.

In streaming mode, only the requested samples are downloaded on the fly when iterating over the dataset.

  • Does each node download the full dataset?

It depends on how you split your dataset across nodes. By default, the Hugging Face Trainer will make every node stream the full dataset but skip samples to avoid duplicates. There is split_dataset_by_node(), though, which can assign shards to each node so that each node streams only the samples it uses for training:
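
A minimal sketch (the repo id is a placeholder, and rank/world_size are read from the environment variables set by torchrun):

```python
import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

# RANK / WORLD_SIZE are set by torchrun; the defaults cover single-process runs
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

ds = load_dataset("user/galaxy-zoo-webdataset", streaming=True, split="train")
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)

for example in ds:
    ...  # this node only streams the shards (or samples) assigned to it
```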

  • Are the webdataset files optionally cached as arrow/parquet, so that from there onwards they can be treated as a native HF dataset?

There is no cache in streaming mode.

Yes webdataset or any sharded format is compatible with this. See split_dataset_by_node() :wink:

If I instead upload HF native .parquet files:

  • Will I see reduced I/O performance from losing sequential read access?

It's a bit slower in a single process, but if you use multiple DataLoader workers you should be able to saturate your bandwidth.
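
For example, roughly (the repo id, batch size and worker count are placeholders to tune):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# placeholder repo id pointing at the parquet version of the dataset
ds = load_dataset("user/galaxy-zoo-parquet", streaming=True, split="train")

# several workers download and decode shards in parallel
loader = DataLoader(ds.with_format("torch"), batch_size=32, num_workers=8)

for batch in loader:
    ...
```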

Lastly, I wanted to bump another user's thread who asked if (in general, i.e. assuming the typical arrow/parquet setup) the Datasets library is compatible with PyTorch distributed training via Lightning's Trainer.

I'm not too familiar with Lightning's Trainer, but calling split_dataset_by_node in a custom IterableDataset should indeed work. Maybe we should add a way to split by node without having to specify the rank and world_size manually in split_by_node; this could solve the issue.

e.g. calling split_dataset_by_node(dataset) without rank and world_size could default to using the values from torch.distributed. WDYT? cc @rbrthogan as well
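
Roughly along these lines (just a sketch, and the helper name is made up):

```python
import torch.distributed as dist

from datasets.distributed import split_dataset_by_node


def split_dataset_by_node_auto(dataset):
    # hypothetical wrapper: fall back to torch.distributed when the process
    # group is already initialized, otherwise leave the dataset untouched
    if dist.is_available() and dist.is_initialized():
        return split_dataset_by_node(
            dataset, rank=dist.get_rank(), world_size=dist.get_world_size()
        )
    return dataset
```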

Thanks for the detailed answer @lhoestq! Very helpful. I've uploaded my datasets in HF native format and I'll test out how they work in distributed training with the Lightning Trainer and let you know.

I think it would make a lot of sense to default rank/world_size to the torch.distributed values if not provided, as I imagine most users would just be passing the torch.distributed values as args to split_dataset_by_node anyway. Maybe add some logging/warning to make it clear what's happening ("rank/world_size not provided, using torch.distributed values of…")? I generally find distributed data setups a bit opaque in tracking what is going where.

On split_by_node for WDS:

Yes webdataset or any sharded format is compatible with this. See split_dataset_by_node() :wink:

I think the part I'm puzzled by is this: WDS is sharded to allow fast sequential reads. If the shards don't divide evenly across nodes, HF will have all workers read all shards but have each worker skip most indices. But unlike HF's indexed parquet files, WDS is not designed to be read while skipping indices, right? You could read through a WDS shard and do nothing with most of the data, but that seems quite inefficient without some care (see wids for the 'official' version of this).
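
To make that concrete, my understanding of the divisibility condition (possibly wrong, and assuming n_shards reports the shard count of the streaming dataset) is roughly:

```python
import os

from datasets import load_dataset

world_size = int(os.environ.get("WORLD_SIZE", 1))
ds = load_dataset("user/galaxy-zoo-webdataset", streaming=True, split="train")

if ds.n_shards % world_size == 0:
    # split_dataset_by_node can hand whole shards to each node: sequential reads
    print("each node streams only its own shards")
else:
    # every node iterates over every shard and keeps 1 example in world_size
    print("all nodes read all shards and skip most examples")
```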