How to get the number of samples in a dataset without downloading the whole dataset?

As said in the title:

How to get the number of samples in a dataset without downloading the whole dataset? Preferably, the number of samples should be per split.

You can fetch this info using the datasets-server /splits endpoint as explained here. In some cases, this info is not available (e.g., for private datasets), so the only option then is to use the streaming feature to iterate over a dataset’s samples and count them, without downloading the whole dataset first:

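from datasets import load_dataset

ds_name = "user/dataset"  # placeholder: replace with the dataset's repo ID
# streaming=True returns iterable splits that are read lazily
ds = load_dataset(ds_name, streaming=True)

# Count the examples in each split by iterating over it once
split_num_examples = {}
for split, split_ds in ds.items():
    split_num_examples[split] = sum(1 for _ in split_ds)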

Thanks, @mariosasko. If this is the only option, someone’s messed up :slight_smile:

datasets and datasets-server are open-source, so feel free to contribute to these projects to make them better :slightly_smiling_face:.

Some datasets store num_examples in the README YAML or dataset_infos.json (for each config and split), so this is one more option.
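
For the dataset_infos.json case, here is a minimal sketch of reading the counts once you have fetched that single file. It assumes the usual layout, where each top-level key is a config name and its "splits" entry maps split names to a record containing num_examples. The helper name split_sizes and the sample values are made up for illustration and not taken from a real dataset.

```python
import json


def split_sizes(dataset_infos):
    """Map each config name to {split name: num_examples}.

    Assumes the dataset_infos.json layout:
    {config: {"splits": {split: {"num_examples": int, ...}}}}
    """
    sizes = {}
    for config_name, info in dataset_infos.items():
        splits = info.get("splits") or {}
        sizes[config_name] = {
            split_name: split_info.get("num_examples")
            for split_name, split_info in splits.items()
        }
    return sizes


# Hypothetical excerpt of a dataset_infos.json file (values are invented)
sample = json.loads("""
{
  "default": {
    "splits": {
      "train": {"name": "train", "num_bytes": 123456, "num_examples": 1000},
      "test": {"name": "test", "num_bytes": 12345, "num_examples": 100}
    }
  }
}
""")

print(split_sizes(sample))
```

If the metadata is missing for a split, the helper returns None for it, which is the signal to fall back to the streaming count. As far as I know, load_dataset_builder(ds_name).info.splits in datasets exposes the same metadata without downloading the data files, when the dataset provides it.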
