Is split_dataset_by_node (streaming dataset) compatible with multi processing?

proj-persona · September 26, 2024, 5:38am

Streaming dataset (IterableDataset) doesn’t support multi processing because we need to make sure each rank loads different data. However, if we use split_dataset_by_node, then it is possible to process data with multiple workers, right?

It is important when we process text data with on-the-fly tokenization. Naive streaming dataloader only has 1 worker to tokenize text, which is inefficient. Is there any good way to improve the efficiency? At least, each rank should have its own worker to tokenize?

lhoestq · October 23, 2024, 2:32pm

Is split_dataset_by_node (streaming dataset) compatible with multi processing?

Yes!

Streaming dataset (IterableDataset) doesn’t support multi processing because we need to make sure each rank loads different data. However, if we use split_dataset_by_node, then it is possible to process data with multiple workers, right?

Correct. That is actually what split_dataset_by_node was made for in the first place.

It is important when we process text data with on-the-fly tokenization. Naive streaming dataloader only has 1 worker to tokenize text, which is inefficient. Is there any good way to improve the efficiency? At least, each rank should have its own worker to tokenize?

You can use DataLoader(..., num_workers=num_workers) to run one or more worker processes per rank, and each worker tokenizes its own part of the data.
