Loading a fraction of data

Hi,
Is it possible to load a certain percentage of a given dataset with the load_dataset function?

Hi ! You can load a subset from a dataset this way:

subset = load_dataset(..., split="train[:30%]")

Note that it still downloads and prepares the full dataset - but only the requested subset is returned.
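For reference, here is a minimal sketch of that slicing syntax wrapped in a small helper. The function names percent_split and load_first_percent are made up for illustration and are not part of the datasets library; only load_dataset and the "train[:30%]" split string come from the answer above.

```python
def percent_split(split: str, percent: int) -> str:
    """Build a split string such as 'train[:30%]' (hypothetical helper)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return f"{split}[:{percent}%]"


def load_first_percent(path: str, percent: int, split: str = "train", **kwargs):
    """Return the first `percent` percent of a split (hypothetical helper)."""
    # Imported lazily so percent_split can be used without the library installed.
    from datasets import load_dataset

    # The full split is still downloaded and prepared;
    # only the requested slice is returned.
    return load_dataset(path, split=percent_split(split, percent), **kwargs)
```

Something like load_first_percent("some/dataset", 30) would then be equivalent to the one-liner above, with "some/dataset" standing in for a real dataset name.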

Hi,
Thanks a lot for your answer !
Unfortunately, it does not solve my issue: I need a way to download only a fraction of the data. Some datasets are huge, and I might need just a small fraction of one without downloading the whole thing.

Will Hugging Face consider this issue?
Thanks again :pray:

Hi, if dataset size is an issue you can consider streaming the dataset instead. This way you don’t have to download it, but you can still use it.
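A rough sketch of what this could look like, assuming you want the first n examples of a split and then a local copy of them. The helper names head and stream_head_to_disk are hypothetical; load_dataset(..., streaming=True), Dataset.from_list and save_to_disk are the library calls I am relying on, and Dataset.from_list needs a reasonably recent version of datasets.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def head(examples: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Collect the first n examples from any iterable (hypothetical helper).

    No further examples are pulled from the iterator once n have been read.
    """
    return list(islice(examples, n))


def stream_head_to_disk(path: str, n: int, out_dir: str,
                        split: str = "train", **kwargs):
    """Stream a split, keep its first n examples, and save them locally."""
    # Imported lazily so head() works without the library installed.
    from datasets import Dataset, load_dataset

    # With streaming=True nothing is downloaded up front;
    # examples are fetched as the stream is iterated.
    stream = load_dataset(path, split=split, streaming=True, **kwargs)

    # Materialize only the first n examples as a regular in-memory Dataset.
    local = Dataset.from_list(head(stream, n))
    local.save_to_disk(out_dir)
    return local
```

The streamed object also has a take(n) method that does the same job as head here. Note that this takes a fixed number of examples, not a percentage of the split.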

Awesome, I hadn’t noticed this feature.
Although streaming the dataset sounds awesome, sometimes you just want to have your dataset locally, so I have a few follow-up questions:

  • Am I able to load a small fraction of this dataset in streaming mode? For instance, by running the command below, with [:1%] in the train split and streaming=True? (PS: I tried it with no success…)
dataset = load_dataset('oscar-corpus/OSCAR-2201', 'en', split='train[:1%]', streaming=True)

  • If so, am I able to then download this fraction of data locally?

  • If not, can this issue be considered by Hugging Face?

I do think dataset streaming is a great tool; I am only asking these questions to get more details.
Thank you all :pray:

Loading a percentage in streaming mode is not implemented yet, because for some datasets we don’t know how many samples there are (e.g. for CSV files you need to download the full file to count the rows).

This can be supported for datasets with metadata about their length, or for supported formats like Parquet where the number of rows is available.
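Until then, a possible workaround when the row count is available in the dataset metadata is to convert the percentage into a number of examples yourself. This is only a sketch: rows_for_percent and known_row_count are hypothetical helpers, rounding down is my choice here, and the metadata lookup assumes the dataset's info actually lists its splits, which is not always the case.

```python
from typing import Optional


def rows_for_percent(total_rows: int, percent: int) -> int:
    """Number of rows in the first `percent` percent, rounded down."""
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return total_rows * percent // 100


def known_row_count(path: str, split: str = "train", **kwargs) -> Optional[int]:
    """Row count of a split from dataset metadata, or None if not recorded."""
    # Imported lazily so rows_for_percent works without the library installed.
    from datasets import load_dataset_builder

    info = load_dataset_builder(path, **kwargs).info
    # Assumption: splits is None (or lacks the split) when no metadata exists.
    if info.splits is None or split not in info.splits:
        return None
    return info.splits[split].num_examples
```

If known_row_count returns a number, you can pass rows_for_percent(count, 1) to take(...) on the streamed dataset to get roughly the first 1%. If it returns None, you are in the CSV-like case described above and have to pick a fixed number of examples instead.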