Loading webdatasets across multiple nodes

In streaming mode, only the requested samples are downloaded on the fly as you iterate over the dataset.

  • Does each node download the full dataset?

it depends on how you split your dataset across nodes. By default, the Hugging Face Trainer makes every node stream the full dataset but skip samples to avoid duplicates. There is split_dataset_by_node() though, which can assign distinct shards to each node so that each node streams only the samples it uses for training:
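Here's a minimal sketch of that setup (the repo name and world size are placeholders; rank and world_size would normally come from your launcher, e.g. torchrun's environment variables):

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

# Stream the shards on the fly instead of downloading everything up front
ds = load_dataset("user/my-webdataset", split="train", streaming=True)

# Give this node a disjoint subset of the shards (when the shard count is
# divisible by world_size); otherwise samples are skipped to avoid duplicates
ds = split_dataset_by_node(ds, rank=0, world_size=4)

for example in ds:
    ...  # this node only streams the shards it was assigned
```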

  • Are the webdataset files optionally cached as arrow/parquet, and from here onwards can be treated as a naive HF dataset?

there is no cache in streaming mode

Yes, webdataset or any sharded format is compatible with this. See split_dataset_by_node() :wink:

If I instead upload HF native .parquet files:

  • Will I see reduced I/O performance from losing sequential read access?

it’s a bit slower in a single process, but if you use multiple DataLoader workers you should be able to saturate your bandwidth
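For example, a sketch of the multi-worker setup (the repo name is a placeholder, and this assumes the examples collate cleanly into batches):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

ds = load_dataset("user/my-parquet-dataset", split="train", streaming=True)

# Each DataLoader worker reads from its own subset of the shards, so
# several Parquet files are fetched in parallel
loader = DataLoader(ds.with_format("torch"), batch_size=32, num_workers=4)

for batch in loader:
    ...
```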

Lastly, I wanted to bump another user’s thread asking whether (in general, i.e. assuming the typical Arrow/Parquet setup) the Datasets library is compatible with PyTorch distributed training via Lightning’s Trainer.

I’m not too familiar with Lightning’s Trainer, but calling split_dataset_by_node() in a custom IterableDataset should indeed work. Maybe we should add a way to split by node without having to specify rank and world_size manually in split_dataset_by_node(); that could solve the issue.

e.g. calling split_dataset_by_node(dataset) without rank and world_size could default to using the values from torch.distributed. WDYT? cc @rbrthogan as well
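For illustration, here's what that default could look like (split_by_node_auto is a hypothetical helper, not part of the datasets API today):

```python
import torch.distributed as dist
from datasets.distributed import split_dataset_by_node

def split_by_node_auto(dataset):
    # Hypothetical: fall back to torch.distributed when rank/world_size
    # are not passed explicitly
    if dist.is_available() and dist.is_initialized():
        rank, world_size = dist.get_rank(), dist.get_world_size()
    else:
        rank, world_size = 0, 1  # single-process fallback
    return split_dataset_by_node(dataset, rank=rank, world_size=world_size)
```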