How to get the number of samples in a dataset without downloading the whole dataset?

mehran · September 1, 2023, 4:53pm

As said in the title:

How to get the number of samples in a dataset without downloading the whole dataset? Preferably, the number of samples should be per split.

mariosasko · September 1, 2023, 6:00pm

You can fetch this info using the /split endpoint, as explained here. In some cases this info is not available (e.g., for private datasets), so the only option then is to use the streaming feature, which iterates over a dataset’s samples without saving the whole dataset to disk:

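from datasets import load_dataset

ds_name = "username/dataset_name"  # placeholder: replace with the dataset's Hub repo id

# streaming=True yields examples lazily instead of saving the dataset to disk
ds = load_dataset(ds_name, streaming=True)

# Counting still reads every example over the network, so it can be slow
split_num_examples = {}
for split, split_ds in ds.items():
    split_num_examples[split] = sum(1 for _ in split_ds)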
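
For the endpoint route, here is a minimal sketch of reading per-split row counts over HTTP. It assumes the dataset viewer API exposes a `/size` endpoint at `https://datasets-server.huggingface.co` and that its JSON response lists splits under `size.splits` with a `num_rows` field. Both the URL and the response shape are assumptions to check against the current API docs, and the helper names `fetch_size` and `rows_per_split` are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; verify against the dataset viewer API docs before relying on it.
API_URL = "https://datasets-server.huggingface.co/size"


def fetch_size(dataset: str) -> dict:
    """Request the size report for a public Hub dataset (needs network access)."""
    query = urllib.parse.urlencode({"dataset": dataset})
    with urllib.request.urlopen(f"{API_URL}?{query}", timeout=30) as resp:
        return json.load(resp)


def rows_per_split(payload: dict) -> dict:
    """Map (config, split) -> num_rows from a size report in the assumed shape."""
    counts = {}
    for entry in payload.get("size", {}).get("splits", []):
        counts[(entry["config"], entry["split"])] = entry["num_rows"]
    return counts


# Hand-written payload in the assumed response shape (illustrative, not real data).
sample = {
    "size": {
        "splits": [
            {"config": "default", "split": "train", "num_rows": 100},
            {"config": "default", "split": "test", "num_rows": 20},
        ]
    }
}
print(rows_per_split(sample))
```

With network access, `rows_per_split(fetch_size(ds_name))` would give the same mapping for a real dataset, provided the assumptions above hold and the dataset is public.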

mehran · September 1, 2023, 6:12pm

Thanks, @mariosasko. If this is the only way, someone’s messed up.

mariosasko · September 4, 2023, 6:46pm

datasets and datasets-server are open-source, so feel free to contribute to these projects to make them better.

Some datasets store num_examples in the README YAML or dataset_infos.json (for each config and split), so this is one more option.
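
As a rough sketch of the metadata route, the snippet below pulls `num_examples` out of a `dataset_infos.json`-style dictionary. It assumes the file maps each config name to an object with a `splits` mapping whose entries carry `num_examples`; that layout, the helper name `num_examples_from_infos`, and the sample JSON are illustrative assumptions, and real files may omit the field.

```python
import json


def num_examples_from_infos(infos: dict) -> dict:
    """Map (config, split) -> num_examples; the value is None when the field is absent."""
    counts = {}
    for config, info in infos.items():
        # "splits" may be missing or null when the metadata was never filled in.
        for split, split_info in (info.get("splits") or {}).items():
            counts[(config, split)] = split_info.get("num_examples")
    return counts


# Hand-written stand-in for the contents of a dataset_infos.json file (not real data).
raw = """
{
  "default": {
    "splits": {
      "train": {"name": "train", "num_examples": 100},
      "test": {"name": "test", "num_examples": 20}
    }
  }
}
"""
example_counts = num_examples_from_infos(json.loads(raw))
print(example_counts)
```

When the metadata is populated, `datasets.load_dataset_builder(ds_name).info.splits` should expose the same per-split `num_examples` without downloading the data files; it can be empty for datasets that do not declare it.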
