Download a fraction of data from HuggingFace Datasets

Hey all, I want to download about 15 GB of data from each subsection/language of the Stack dataset. Is there any way to do this?

I can download an entire subsection with the following code snippet, but I only want a fraction of the data (decided by a percentage or a number of files).

from datasets import load_dataset

# downloads every file under data/c before building the split
ds = load_dataset("bigcode/the-stack", data_dir="data/c", split="train")

Thanks in advance


I think it should be possible if you specify the directory name in the allow_patterns filter for snapshot_download.
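
For example (a rough sketch, not from the original thread), allow_patterns accepts glob patterns, so something like this should pull only the files under data/c; the "data/c/*" pattern is just an illustrative choice:

from huggingface_hub import snapshot_download

# sketch: fetch only the files matching the pattern, skipping the rest of the repo
local_dir = snapshot_download(
    repo_id="bigcode/the-stack",
    repo_type="dataset",
    allow_patterns=["data/c/*"],
)
print(local_dir)  # local snapshot path containing only the matched files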

I’m not very familiar with it, but it seems that the datasets library itself also has functions to extract a part of a dataset.
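
For instance (my own sketch, assuming a datasets version with streaming support), streaming mode lets you take only the first N examples without downloading whole files up front; the 1,000 figure below is arbitrary:

from datasets import load_dataset

# sketch: stream the data/c subset and keep only the first 1,000 examples
ds = load_dataset("bigcode/the-stack", data_dir="data/c", split="train", streaming=True)
subset = list(ds.take(1000))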

Yeah, it has data_dir and data_files, but what I was looking for was to be able to download 10 GB or 10 files of data; the above will still download the complete subdirectory.


I think this is close, but technically speaking, it’s not the number of files, and you can’t specify the size.
You can use HfApi to get a list of the files in a repo and see their sizes, but that’s almost like doing it manually…
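
A rough sketch of that manual route, assuming a recent huggingface_hub (the list_repo_tree helper, the 10 GB budget, and the data/c path are my own example choices):

from huggingface_hub import HfApi, hf_hub_download

# sketch: list files under data/c with their sizes, keep adding files until
# a ~10 GB budget is reached, then download only those files
api = HfApi()
budget = 10 * 1024**3
picked, total = [], 0
for entry in api.list_repo_tree(
    "bigcode/the-stack", path_in_repo="data/c", repo_type="dataset", recursive=True
):
    size = getattr(entry, "size", None)  # folder entries have no size
    if size is None:
        continue
    if total + size > budget:
        break
    picked.append(entry.path)
    total += size

local_files = [
    hf_hub_download("bigcode/the-stack", filename=path, repo_type="dataset")
    for path in picked
]

The downloaded paths can then be passed to load_dataset via data_files.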

Yes, that's what I did. Thanks for the help, btw.
