MultiGPU Training using split_dataset_by_node with PyTorch Lightning

mwalmsley · May 24, 2024, 1:21pm

So I’m not a pro here, but maybe relevant:

webdataset solves the problem of distributing shards across nodes with a dataset(…, nodesplitter_func=nodesplitter_func) approach. The nodesplitter_func can be anything, but it typically calls torch.distributed.get_rank() and similar, like you tried.
You can (and probably should) place the dataset() call inside a datamodule.
The distributed environment exists by the time datamodule.setup() is called, so functions called inside the datamodule’s get_dataloader functions can read the rank and world size.
nodesplitter_func is called each time a dataloader worker is set up, but that’s okay: it is deterministic, so it returns the same list for every worker on a given node.
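
To make the nodesplitter_func idea concrete, here is a minimal sketch rather than webdataset's actual API. It assumes the function receives a list of shard URLs and returns the sublist for the current process; the optional rank/world_size arguments, the environment-variable fallback, and the shard names are illustrative assumptions.

```python
import os


def get_rank_and_world_size():
    """Return (rank, world_size); (0, 1) when not running distributed."""
    try:
        import torch.distributed as dist

        if dist.is_available() and dist.is_initialized():
            return dist.get_rank(), dist.get_world_size()
    except ImportError:
        pass  # torch not installed: fall back to the environment variables
    # torchrun exports RANK and WORLD_SIZE for every process it launches.
    return int(os.environ.get("RANK", 0)), int(os.environ.get("WORLD_SIZE", 1))


def nodesplitter_func(urls, rank=None, world_size=None):
    """Keep every world_size-th shard, starting at this process's rank.

    Deterministic: repeated calls on the same process return the same list,
    so it is safe to call once per dataloader worker.
    """
    if rank is None or world_size is None:
        rank, world_size = get_rank_and_world_size()
    return list(urls)[rank::world_size]


shards = [f"shard-{i:03d}.tar" for i in range(5)]  # hypothetical shard names
print(nodesplitter_func(shards, rank=1, world_size=2))
# ['shard-001.tar', 'shard-003.tar']
```

Strided slicing keeps the per-rank lists disjoint, and together they cover every shard. Ranks can still end up with unequal shard counts when the number of shards is not a multiple of the world size.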

I think the same thing might work for split_dataset_by_node here. You would need to split the dataset inside the webdatamodule, in or after webdatamodule.setup(), at which point the distributed environment should already be set up for you.
