As said in the title:
How to get the number of samples in a dataset without downloading the whole dataset? Preferably, the number of samples should be per split.
You can fetch this info using the /split endpoint as explained here. In some cases, this info is not available (e.g., for private datasets), so the only option then is to use the streaming feature to iterate over a dataset’s samples without downloading the whole dataset:
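from datasets import load_dataset

# ds_name is the dataset's name (repo id) on the Hub
ds = load_dataset(ds_name, streaming=True)
# Count the examples in each split by iterating over the stream
split_num_examples = {}
for split, split_ds in ds.items():
    split_num_examples[split] = sum(1 for _ in split_ds)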
Thanks, @mariosasko. If this is the only option, someone’s messed up.
datasets and datasets-server are open-source, so feel free to contribute to these projects to make them better.
Some datasets store num_examples in the README YAML or dataset_infos.json (for each config and split), so this is one more option.
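To illustrate that last option, here is a minimal sketch, assuming you have already fetched the contents of a dataset_infos.json file and that it follows the usual layout (config name, then "splits", then one entry per split with a num_examples field). The helper name num_examples_from_infos and the sample data are made up for the example; not every dataset ships this file, and the field can be missing.

```python
import json


def num_examples_from_infos(infos: dict) -> dict:
    """Map each config name to a {split: num_examples} dict.

    `infos` is the parsed content of a dataset_infos.json file.
    Hypothetical helper: it is not part of the datasets library.
    """
    counts = {}
    for config_name, info in infos.items():
        # "splits" may be absent or null if the metadata is incomplete
        splits = info.get("splits") or {}
        counts[config_name] = {
            split_name: split_info.get("num_examples")
            for split_name, split_info in splits.items()
        }
    return counts


# Made-up sample that mimics the assumed layout of dataset_infos.json
sample_json = """
{
  "default": {
    "splits": {
      "train": {"name": "train", "num_bytes": 1000, "num_examples": 80},
      "test": {"name": "test", "num_bytes": 250, "num_examples": 20}
    }
  }
}
"""

counts = num_examples_from_infos(json.loads(sample_json))
print(counts)  # {'default': {'train': 80, 'test': 20}}
```

Since this only reads a small metadata file, no data files are touched. The numbers are only as reliable as whatever the dataset author recorded, so the streaming count above remains the fallback when the metadata is missing or looks stale.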