Loading webdatasets across multiple nodes

In streaming mode, only the requested samples are downloaded on the fly as you iterate over the dataset.

  • Does each node download the full dataset?

it depends on how you split your dataset across nodes. By default, the Hugging Face Trainer makes every node stream the full dataset but skip samples to avoid duplicates. There is split_dataset_by_node() though, which can assign distinct shards to each node so that each node streams only the samples it uses for training:
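Here's a minimal sketch of that setup (the repo name and world size are placeholders; rank and world_size would normally come from your launcher, e.g. torchrun's environment variables):

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

# Stream the shards on the fly instead of downloading everything up front
ds = load_dataset("user/my-webdataset", split="train", streaming=True)

# Give this node a disjoint subset of the shards (when the shard count is
# divisible by world_size); otherwise samples are skipped to avoid duplicates
ds = split_dataset_by_node(ds, rank=0, world_size=4)

for example in ds:
    ...  # this node only streams the shards it was assigned
```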

  • Are the webdataset files optionally cached as arrow/parquet, and from here onwards can be treated as a naive HF dataset?

there is no cache in streaming mode

Yes, webdataset or any sharded format is compatible with this. See split_dataset_by_node() :wink:

If I instead upload HF native .parquet files:

  • Will I see reduced I/O performance from losing sequential read access?

it’s a bit slower in a single process, but if you use multiple DataLoader workers you should be able to saturate your bandwidth
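For example, a sketch of the multi-worker setup (the repo name is a placeholder, and this assumes the examples collate cleanly into batches):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

ds = load_dataset("user/my-parquet-dataset", split="train", streaming=True)

# Each DataLoader worker reads from its own subset of the shards, so
# several Parquet files are fetched in parallel
loader = DataLoader(ds.with_format("torch"), batch_size=32, num_workers=4)

for batch in loader:
    ...
```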

Lastly, I wanted to bump another user’s thread asking whether (in general, i.e. assuming the typical Arrow/Parquet setup) the Datasets library is compatible with PyTorch distributed training via Lightning’s Trainer.

I’m not too familiar with Lightning’s Trainer, but calling split_dataset_by_node() in a custom IterableDataset should indeed work. Maybe we should add a way to split by node without having to specify rank and world_size manually in split_dataset_by_node(); that could solve the issue.

e.g. calling split_dataset_by_node(dataset) without rank and world_size could default to using the values from torch.distributed. WDYT? cc @rbrthogan as well
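For illustration, here's what that default could look like (split_by_node_auto is a hypothetical helper, not part of the datasets API today):

```python
import torch.distributed as dist
from datasets.distributed import split_dataset_by_node

def split_by_node_auto(dataset):
    # Hypothetical: fall back to torch.distributed when rank/world_size
    # are not passed explicitly
    if dist.is_available() and dist.is_initialized():
        rank, world_size = dist.get_rank(), dist.get_world_size()
    else:
        rank, world_size = 0, 1  # single-process fallback
    return split_dataset_by_node(dataset, rank=rank, world_size=world_size)
```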