Download a fraction of data from HuggingFace Datasets

Hey all, I want to download about 15 GB of data from each subsection/language of the Stack dataset. Is there any way to do this?

I can download an entire subsection with the following code snippet, but I only want a fraction of the data (decided by a percentage or a number of files).

from datasets import load_dataset

# downloads every file under data/c before building the split
ds = load_dataset("bigcode/the-stack", data_dir="data/c", split="train")

Thanks in advance


I think it should be possible if you specify the directory name in the allow_patterns filter for snapshot_download.
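
For example (a rough sketch, not from the original thread), allow_patterns accepts glob patterns, so something like this should pull only the files under data/c; the "data/c/*" pattern is just an illustrative choice:

from huggingface_hub import snapshot_download

# sketch: fetch only the files matching the pattern, skipping the rest of the repo
local_dir = snapshot_download(
    repo_id="bigcode/the-stack",
    repo_type="dataset",
    allow_patterns=["data/c/*"],
)
print(local_dir)  # local snapshot path containing only the matched files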

I’m not very familiar with it, but it seems that the datasets library itself also has functions to extract a part of a dataset.
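
For instance (my own sketch, assuming a datasets version with streaming support), streaming mode lets you take only the first N examples without downloading whole files up front; the 1,000 figure below is arbitrary:

from datasets import load_dataset

# sketch: stream the data/c subset and keep only the first 1,000 examples
ds = load_dataset("bigcode/the-stack", data_dir="data/c", split="train", streaming=True)
subset = list(ds.take(1000))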

Yeah, it has data_dir and data_files, but what I was looking for was to be able to download 10 GB or 10 files of data; the above will still download the complete subdirectory.


I think this is close, but technically speaking, it’s not the number of files, and you can’t specify the size.
You can use HfApi to get a list of the files in a repo and see their sizes, but that’s almost like doing it manually…
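
A rough sketch of that manual route, assuming a recent huggingface_hub (the list_repo_tree helper, the 10 GB budget, and the data/c path are my own example choices):

from huggingface_hub import HfApi, hf_hub_download

# sketch: list files under data/c with their sizes, keep adding files until
# a ~10 GB budget is reached, then download only those files
api = HfApi()
budget = 10 * 1024**3
picked, total = [], 0
for entry in api.list_repo_tree(
    "bigcode/the-stack", path_in_repo="data/c", repo_type="dataset", recursive=True
):
    size = getattr(entry, "size", None)  # folder entries have no size
    if size is None:
        continue
    if total + size > budget:
        break
    picked.append(entry.path)
    total += size

local_files = [
    hf_hub_download("bigcode/the-stack", filename=path, repo_type="dataset")
    for path in picked
]

The downloaded paths can then be passed to load_dataset via data_files.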

Yes, that's what I did. Thanks for the help, btw.
