Skip() not implemented for IterableDataset after split_dataset_by_node

Hi!

When I tried to call skip() on an IterableDataset obtained from calling split_dataset_by_node() on the “parent” IterableDataset, I got:

SkipExamplesIterable doesn't implement shard_data_sources yet

I wonder if you have plans to support this functionality?

Otherwise I have two backup options:

  1. Still use split_dataset_by_node, but manually skip examples one by one using next().
  2. Implement my own split_dataset_by_node with round-robin data distribution, so that I can still use skip. This is in the hope that skip will eventually support “jumping to target location” even for IterableDataset.
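
For reference, the two backup options can be sketched in plain Python. This is only a rough model built on ordinary iterators, not the datasets API: `skip_one_by_one` and `round_robin_shard` are hypothetical helper names, and `range(20)` stands in for a streamed dataset.

```python
from itertools import islice


def skip_one_by_one(iterable, n):
    """Option 1: advance past n examples by calling next() n times."""
    it = iter(iterable)
    for _ in range(n):
        try:
            next(it)
        except StopIteration:
            break  # fewer than n examples were available
    return it


def round_robin_shard(iterable, rank, world_size):
    """Option 2: keep every world_size-th example, starting at index rank."""
    return islice(iterable, rank, None, world_size)


examples = range(20)  # stand-in for a streamed dataset
shard = round_robin_shard(examples, rank=1, world_size=4)  # yields 1, 5, 9, 13, 17
resumed = skip_one_by_one(shard, n=2)
print(list(resumed))  # [9, 13, 17]
```

Either way the skipped examples are still read and discarded, so the cost grows with n. That is why a skip that can jump directly to the target location would be preferable.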

Hi! This shouldn’t be hard to implement once we agree on what it means to skip examples in a distributed setup. Should each node skip n / world_size examples so that n examples are skipped in total?

From my perspective, it should be the other way around (depending, of course, on how skip(n) is defined).

Assume for simplicity that split_dataset_by_node distributes data to the nodes in a round-robin manner. After split_dataset_by_node, each node has its own IterableDataset. If we then call skip(n) on any of the nodes, n should refer to a local counter, which means we are effectively skipping around n * world_size examples in the global counter.
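
To make the local vs. global counter concrete, here is a small check in plain Python. It assumes the simplified round-robin split described above (which may not match how split_dataset_by_node actually shards data), and `first_global_index_after_skip` is a hypothetical helper, not a library function.

```python
from itertools import islice


def first_global_index_after_skip(n, rank, world_size):
    """Global index of the first example a node sees after a local skip(n),
    assuming a round-robin split where node `rank` gets indices
    rank, rank + world_size, rank + 2 * world_size, ..."""
    return rank + n * world_size


world_size, n = 4, 3
for rank in range(world_size):
    # Model the node's shard of a 100-example stream, then skip n locally.
    shard = islice(range(100), rank, None, world_size)
    first_seen = next(islice(shard, n, None))
    assert first_seen == first_global_index_after_skip(n, rank, world_size)
    print(rank, first_seen)
```

With 4 nodes and a local skip of 3, every node resumes at global index 12 + rank, so 12 examples (n * world_size) are skipped across all nodes.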

That makes sense, and it would actually be more aligned with other behaviors in the datasets lib regarding distributed setups.

Cool! Is there a plan or timeline for when this simple feature can be added? Thanks!

Actually, it depends on whether skip/take is called before or after split_by_node. I opened a PR here: Improve skip take shuffling and distributed by lhoestq · Pull Request #6965 · huggingface/datasets · GitHub
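
Why the order matters can be illustrated with a plain-Python model. This is only a sketch under the round-robin assumption used earlier in the thread; it does not reproduce what the PR implements, and `shard`, `skip`, `skip_then_split`, and `split_then_skip` are hypothetical helpers.

```python
from itertools import islice


def shard(it, rank, world_size):
    """Round-robin shard: every world_size-th example starting at rank."""
    return islice(it, rank, None, world_size)


def skip(it, n):
    """Drop the first n examples of an iterator."""
    return islice(it, n, None)


def skip_then_split(data, n, rank, world_size):
    # n counts examples in the global stream, before sharding.
    return list(shard(skip(data, n), rank, world_size))


def split_then_skip(data, n, rank, world_size):
    # n counts examples in this node's own shard.
    return list(skip(shard(data, rank, world_size), n))


data = range(12)
print(skip_then_split(data, 2, rank=0, world_size=2))  # [2, 4, 6, 8, 10]
print(split_then_skip(data, 2, rank=0, world_size=2))  # [4, 6, 8, 10]
```

In this model, skipping before the split removes n examples in total, while skipping after the split removes n examples per node.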
