Skip() not implemented for IterableDataset after split_dataset_by_node

tianyu-l · June 11, 2024, 4:40am

Hi!

When I tried to call skip() on an IterableDataset obtained from calling split_dataset_by_node() on the “parent” IterableDataset, I got:

SkipExamplesIterable doesn't implement shard_data_sources yet

I wonder if you have plans to support this functionality?

Otherwise I have two backup options:

1. Still use split_dataset_by_node, but manually skip examples one by one using next().
2. Implement my own split_dataset_by_node with round-robin data distribution, so that I can still use skip(). This is in the hope that skip() will eventually support “jumping to the target location” even in an IterableDataset.
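
For illustration, both backup options can be sketched in plain Python with itertools, using a range as a stand-in for the streamed dataset. The helper names round_robin_shard and skip_one_by_one are hypothetical and are not part of the datasets library; this is only a rough sketch of the idea, not how the library does it.

```python
from itertools import islice


def round_robin_shard(examples, rank, world_size):
    """Hypothetical helper (option 2): yield every world_size-th example,
    starting at position `rank` of the underlying stream."""
    return islice(examples, rank, None, world_size)


def skip_one_by_one(iterator, n):
    """Hypothetical helper (option 1): advance an iterator n times with
    next(), stopping early if the iterator runs out."""
    for _ in range(n):
        try:
            next(iterator)
        except StopIteration:
            break
    return iterator


# Stand-in for a streamed dataset: 12 examples, 2 nodes, this process is rank 1.
shard = iter(round_robin_shard(range(12), rank=1, world_size=2))

# Skip the first 3 examples of the local shard, then keep iterating.
resumed = list(skip_one_by_one(shard, 3))
```

Note that both options still read every skipped example from the stream; neither one "jumps" to the target location.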

lhoestq · June 11, 2024, 4:48pm

Hi! This shouldn’t be hard to implement as soon as we agree on what it means to skip examples in a distributed setup. Should each node skip n / world_size examples so that n examples are skipped in total?

tianyu-l · June 11, 2024, 5:55pm

From my perspective, it should be the other way around (depending, of course, on the definition of skip(n)).

Assume for simplicity that split_dataset_by_node distributes data to the nodes in a round-robin manner. After split_dataset_by_node we have an IterableDataset for each node. Now if we call skip(n) on any of the nodes, n should refer to a local counter, which means we are effectively skipping around n * world_size examples in the global counter.
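
To make the arithmetic concrete, here is a small sketch under the same round-robin assumption. The function name global_index_after_local_skip is made up for this example; the loop just checks the formula against a simulated split of a stand-in stream.

```python
from itertools import islice


def global_index_after_local_skip(n, rank, world_size):
    """Global position of the first example a rank sees after skipping n
    local examples, assuming a strict round-robin split."""
    return n * world_size + rank


world_size, n = 4, 3
stream = range(100)  # stand-in stream where each value equals its global index

for rank in range(world_size):
    # Round-robin shard for this rank, then a local skip of n examples.
    shard = islice(stream, rank, None, world_size)
    after_skip = islice(shard, n, None)
    assert next(after_skip) == global_index_after_local_skip(n, rank, world_size)
```

With 4 nodes and a local skip of 3, each rank has moved past 3 of its own examples, so all ranks together have moved past 3 * 4 = 12 examples of the global stream.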

lhoestq · June 11, 2024, 7:40pm

That makes sense, and it would actually be more aligned with other behaviors in the datasets lib regarding distributed setups.

tianyu-l · June 11, 2024, 8:18pm

Cool! Do you think there would be a plan/timeline on when this simple feature can be added? Thanks!

lhoestq · June 12, 2024, 12:31pm

Actually, it depends on whether skip/take is called before or after split_by_node. I opened a PR here: Improve skip take shuffling and distributed by lhoestq · Pull Request #6965 · huggingface/datasets · GitHub
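
As a rough illustration of why the order matters, the two orderings can be compared on a toy stream. This is one plausible reading under a round-robin assumption, not necessarily what the PR implements; shard and skip below are hypothetical stand-ins written with itertools, not the datasets API.

```python
from itertools import islice


def shard(examples, rank, world_size):
    """Hypothetical round-robin split: every world_size-th example from `rank`."""
    return islice(examples, rank, None, world_size)


def skip(examples, n):
    """Hypothetical skip: drop the first n examples of an iterable."""
    return islice(examples, n, None)


world_size, n = 2, 4
data = range(12)  # stand-in for the parent stream

# skip, then split: n examples are dropped from the global stream in total.
skip_then_split = [list(shard(skip(data, n), r, world_size)) for r in range(world_size)]

# split, then skip: n examples are dropped from each rank's local stream.
split_then_skip = [list(skip(shard(data, r, world_size), n)) for r in range(world_size)]
```

In this toy setup, skipping first removes 4 examples overall, while skipping after the split removes 4 per rank, i.e. 8 overall.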
