Hi!
When I tried to call skip() on an IterableDataset obtained from calling split_dataset_by_node() on the “parent” IterableDataset, I got:
SkipExamplesIterable doesn't implement shard_data_sources yet
I wonder if you have plans to support this functionality?
Otherwise I have two backup options:
- Still use split_dataset_by_node, but manually skip examples one by one using next.
- Implement my own split_dataset_by_node by doing round-robin data distribution, while still being able to use skip. This is with the hope that in the future skip would support “jumping to target location” even in IterableDataset.
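For reference, the two backup options can be sketched with plain Python iterables standing in for an IterableDataset. This is only an illustration under that assumption: skip_manually and round_robin_shard are hypothetical helper names, not part of the datasets library.

```python
from itertools import islice

_SENTINEL = object()  # marks exhaustion without raising StopIteration


def skip_manually(iterable, n):
    """Option 1: skip n examples by consuming them one by one with next()."""
    it = iter(iterable)
    for _ in range(n):
        if next(it, _SENTINEL) is _SENTINEL:
            break  # the stream had fewer than n examples
    return it


def round_robin_shard(iterable, rank, world_size):
    """Option 2: keep the examples whose global index i has i % world_size == rank."""
    return islice(iterable, rank, None, world_size)


# A range stands in for the "parent" streaming dataset.
shard = round_robin_shard(range(20), rank=1, world_size=4)  # yields 1, 5, 9, 13, 17
resumed = skip_manually(shard, 2)
print(list(resumed))  # [9, 13, 17]
```

Both helpers still read every skipped example from the underlying stream, so neither gives a true “jump to target location”.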
Hi! This shouldn’t be hard to implement as soon as we agree on what it means to skip examples in a distributed setup. Should each node skip n / world_size examples so that n examples are skipped in total?
From my perspective, it should be the other way around (depending, of course, on the definition of skip(n)).
Assume for simplicity that split_dataset_by_node distributes data to the nodes in a round-robin manner. After split_dataset_by_node we have an IterableDataset for each node. Now if we call skip(n) on any of the nodes, n should refer to a local counter, which means we are effectively skipping around n * world_size examples in the global counter.
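The local-counter interpretation can be checked with a small sketch. It assumes the simplified round-robin split described above, which may not match how split_dataset_by_node actually distributes data, and local_skip_to_global is a hypothetical helper used only for illustration.

```python
from itertools import islice


def local_skip_to_global(n, rank, world_size):
    """Global index of the first example a node sees after skipping n local examples,
    assuming round-robin distribution (global index i goes to node i % world_size)."""
    return n * world_size + rank


world_size, n = 4, 3
global_stream = list(range(40))  # each value equals its own global index

for rank in range(world_size):
    # Round-robin shard for this node.
    shard = islice(global_stream, rank, None, world_size)
    # Local skip(n): drop the first n examples of the shard, then read the next one.
    first_after_skip = next(islice(shard, n, None))
    assert first_after_skip == local_skip_to_global(n, rank, world_size)
    print(rank, first_after_skip)
```

Across all nodes, n * world_size examples of the global stream are skipped in total, which is the "around n * world_size" figure mentioned above.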
That makes sense, and actually it would be more aligned with other behaviors in the datasets lib regarding distributed setups.
Cool! Do you think there would be a plan/timeline on when this simple feature can be added? Thanks!