Is split_dataset_by_node (streaming dataset) compatible with multi processing?

A streaming dataset (IterableDataset) doesn’t support multiprocessing because we need to make sure each rank loads different data. However, if we use split_dataset_by_node, then it is possible to process data with multiple workers, right?

This matters when we process text data with on-the-fly tokenization. A naive streaming dataloader has only 1 worker to tokenize text, which is inefficient. Is there a good way to improve the efficiency? At the very least, shouldn’t each rank have its own worker for tokenization?

Is split_dataset_by_node (streaming dataset) compatible with multi processing?

Yes!

A streaming dataset (IterableDataset) doesn’t support multiprocessing because we need to make sure each rank loads different data. However, if we use split_dataset_by_node, then it is possible to process data with multiple workers, right?

Correct. Actually, that’s what split_dataset_by_node was made for in the first place.

This matters when we process text data with on-the-fly tokenization. A naive streaming dataloader has only 1 worker to tokenize text, which is inefficient. Is there a good way to improve the efficiency? At the very least, shouldn’t each rank have its own worker for tokenization?

You can use DataLoader(..., num_workers=num_workers) to run one or more worker processes per rank that tokenize the data.
