Some datasets are too big to fit in memory or even on disk.
For my data pipeline, I generally download a dataset, load it, add some quick annotations (e.g. string length), and save it as Parquet files before uploading to S3. This works fine with the following script:
from datasets import load_dataset_builder

# ds_name, config, and output_dir are defined elsewhere in the pipeline
builder = load_dataset_builder(ds_name, config, trust_remote_code=True)
builder.download_and_prepare(output_dir, file_format="parquet", num_proc=8)
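(The annotation step itself is nothing heavy; it's roughly the sketch below, where "text" just stands in for whichever column I'm measuring.)

from datasets import load_dataset

# "text" is a placeholder for the column being annotated
ds = load_dataset(ds_name, config, split="train")
ds = ds.map(lambda ex: {"text_length": len(ex["text"])}, num_proc=8)
ds.to_parquet(f"{output_dir}/train.parquet")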
The problem is when the dataset or a subset/config is too large to be cached on disk all at once. Is there a way to load a dataset in batches that is also fast (unlike streaming)?
For example, certain splits (e.g. 'json') of bigcode/commitpack are too big to be cached on disk all at once. I'd like to be able to load them in smaller segments, say the first 10M rows, then the next 10M, and so on, so I don't have to worry about filling up the disk.
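The only chunked approach I can think of is streaming and writing out fixed-size shards myself, along the lines of the sketch below, but iterating an IterableDataset row by row is far too slow for a dataset of this size (and the buffer would also have to be much smaller than 10M rows to fit in memory):

from datasets import load_dataset, Dataset

# ds_name, config, and output_dir as above; chunk_size kept small so the buffer fits in memory
chunk_size = 1_000_000
stream = load_dataset(ds_name, config, split="train", streaming=True, trust_remote_code=True)

buffer, shard_idx = [], 0
for example in stream:
    buffer.append(example)
    if len(buffer) == chunk_size:
        Dataset.from_list(buffer).to_parquet(f"{output_dir}/shard_{shard_idx:05d}.parquet")
        buffer, shard_idx = [], shard_idx + 1
if buffer:
    Dataset.from_list(buffer).to_parquet(f"{output_dir}/shard_{shard_idx:05d}.parquet")

Ideally I'd get the speed of download_and_prepare but with bounded disk usage like this.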