How can I download a sizable subset of a dataset


I would like to download 5% of the German split of allenai/c4. It is a huge, 300+ billion-word dataset, so even 5% of it is very big. My initial idea was to stream it, sample roughly every 20th document, and then convert the iterable dataset back to a regular one, like this:

import random
from functools import partial
from typing import Any

import datasets

def gen_from_iterable_dataset(iterable_ds):
    yield from iterable_ds

def sample_fn(elem: dict | Any, rate: float):
    # Keep each document with probability `rate`
    return random.random() < rate

de_train_it = datasets.load_dataset(
    "allenai/c4", "de", split="train", streaming=True
)
de_train_it = de_train_it.filter(partial(sample_fn, rate=0.05))

de_train = datasets.Dataset.from_generator(
    partial(gen_from_iterable_dataset, de_train_it),
    num_proc=80,
)

Now, there are multiple problems with this code:

  1. Is there no way to convert an IterableDataset back to a Dataset without the clumsy gen_from_iterable_dataset approach (taken from “Can I convert an `IterableDataset` to a `Dataset`?” on Stack Overflow)?
  2. The num_proc=80 part is disregarded (“Setting num_proc from 80 back to 1 for the train split to disable multiprocessing as it only contains one shard.”). Is there no way around it?
  3. In either case, it seems I cannot force the iteration to happen in more than a single process, even though the original dataset has 256 shards. I end up with about 100 samples/second, at which rate generating the dataset would take some 25,000 hours. :)

So the question is: is there a way to download a subset of a dataset with acceptable speed? Thank you!

You could download only the first 100 files of the dataset.
That’s approximately 4.9% of the 2048 files of the "de" subset:

from datasets import load_dataset

# This glob matches the first 100 of the 2048 German shards (00000–00099)
data_files = "multilingual/c4-de.tfrecord-000*-of-*.json.gz"
de_train = load_dataset("allenai/c4", data_files=data_files, split="train", num_proc=50)

Alternatively, you can pass an explicit list of files to download if you want.
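For instance, the list equivalent of the glob above might be built like this. This is a sketch: it assumes the shards follow a `c4-de.tfrecord-XXXXX-of-02048.json.gz` naming (2048 being the file count mentioned above), so double-check the actual file names on the Hub before using it:

```python
# First 100 of the 2048 German shards (assumed naming; verify on the Hub)
data_files = [
    f"multilingual/c4-de.tfrecord-{i:05d}-of-02048.json.gz" for i in range(100)
]
print(data_files[0])  # multilingual/c4-de.tfrecord-00000-of-02048.json.gz

# Then pass the list instead of the glob:
# de_train = load_dataset("allenai/c4", data_files=data_files, split="train", num_proc=50)
```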