How can I download a sizable subset of a dataset


I would like to download 5% of the German split of allenai/c4. It is a huge, 300+ billion-word dataset, so even 5% of it is very big. My initial idea was to stream it, sample roughly every 20th document, and then convert the iterable dataset back to a regular one, like this:

import random
from functools import partial
from typing import Any

import datasets

def gen_from_iterable_dataset(iterable_ds):
    yield from iterable_ds

def sample_fn(elem: dict | Any, rate: float):
    # Keep each document with probability `rate`
    return random.random() < rate

de_train_it = datasets.load_dataset(
    "allenai/c4", "de", split="train", streaming=True
)
de_train_it = de_train_it.filter(partial(sample_fn, rate=0.05))

de_train = datasets.Dataset.from_generator(
    partial(gen_from_iterable_dataset, de_train_it),
    num_proc=80,
)

Now, there are multiple problems with this code:

  1. Is there no way to convert an IterableDataset back to a Dataset without the clumsy gen_from_iterable_dataset approach (taken from “Can I convert an `IterableDataset` to a `Dataset`?” on Stack Overflow)?
  2. The num_proc=80 part is disregarded (“Setting num_proc from 80 back to 1 for the train split to disable multiprocessing as it only contains one shard.”). Is there no way around it?
  3. In either case, it seems I cannot force the iteration to happen in more than a single process, even though the original dataset has 256 shards. I end up with about 100 samples/second, at which rate generating the dataset would take some 25,000 hours. :)

So the question is: is there a way to download a subset of a dataset with acceptable speed? Thank you!

You could download only the first 100 files of the dataset.
That’s approximately 4.9% of the 2048 files of the "de" subset:

from datasets import load_dataset

# This glob matches the first 100 of the 2048 German shards (00000–00099)
data_files = "multilingual/c4-de.tfrecord-000*-of-*.json.gz"
de_train = load_dataset("allenai/c4", data_files=data_files, split="train", num_proc=50)

Alternatively, you can pass an explicit list of files to download if you want.
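For instance, the list equivalent of the glob above might be built like this. This is a sketch: it assumes the shards follow a `c4-de.tfrecord-XXXXX-of-02048.json.gz` naming (2048 being the file count mentioned above), so double-check the actual file names on the Hub before using it:

```python
# First 100 of the 2048 German shards (assumed naming; verify on the Hub)
data_files = [
    f"multilingual/c4-de.tfrecord-{i:05d}-of-02048.json.gz" for i in range(100)
]
print(data_files[0])  # multilingual/c4-de.tfrecord-00000-of-02048.json.gz

# Then pass the list instead of the glob:
# de_train = load_dataset("allenai/c4", data_files=data_files, split="train", num_proc=50)
```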