How can I download a sizable subset of a dataset


I would like to download 5% of the German split of allenai/c4. It is a huge, 300+ billion-word dataset, so even 5% of it is very big. My initial idea was to stream it, sample roughly every 20th document, and then convert the iterable dataset back to a regular one, like this:

import random
from functools import partial
from typing import Any

import datasets

def gen_from_iterable_dataset(iterable_ds):
    yield from iterable_ds

def sample_fn(elem: dict | Any, rate: float):
    # Keep each document with probability `rate`
    return random.random() < rate

de_train_it = datasets.load_dataset(
    "allenai/c4", "de", split="train", streaming=True
)
de_train_it = de_train_it.filter(partial(sample_fn, rate=0.05))

de_train = datasets.Dataset.from_generator(
    partial(gen_from_iterable_dataset, de_train_it),
    num_proc=80,
)

Now, there are multiple problems with this code:

  1. Is there no way to convert an IterableDataset back to a Dataset without the clumsy gen_from_iterable_dataset approach (taken from “Can I convert an `IterableDataset` to a `Dataset`?” on Stack Overflow)?
  2. The num_proc=80 part is disregarded (“Setting num_proc from 80 back to 1 for the train split to disable multiprocessing as it only contains one shard.”). Is there no way around it?
  3. In either case, it seems I cannot force the iteration to happen in more than a single process, even though the original dataset has 256 shards. I end up with about 100 samples/second, at which rate generating the dataset would take some 25,000 hours. :)

So the question is: is there a way to download a subset of a dataset with acceptable speed? Thank you!

You could download only the first 100 files of the dataset.
That’s approximately 4.9% of the 2048 files of the "de" subset:

from datasets import load_dataset

# This glob matches the first 100 of the 2048 German shards (00000–00099)
data_files = "multilingual/c4-de.tfrecord-000*-of-*.json.gz"
de_train = load_dataset("allenai/c4", data_files=data_files, split="train", num_proc=50)

Alternatively, you can pass an explicit list of files to download if you want.
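For instance, the list equivalent of the glob above might be built like this. This is a sketch: it assumes the shards follow a `c4-de.tfrecord-XXXXX-of-02048.json.gz` naming (2048 being the file count mentioned above), so double-check the actual file names on the Hub before using it:

```python
# First 100 of the 2048 German shards (assumed naming; verify on the Hub)
data_files = [
    f"multilingual/c4-de.tfrecord-{i:05d}-of-02048.json.gz" for i in range(100)
]
print(data_files[0])  # multilingual/c4-de.tfrecord-00000-of-02048.json.gz

# Then pass the list instead of the glob:
# de_train = load_dataset("allenai/c4", data_files=data_files, split="train", num_proc=50)
```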