Some datasets are too big to fit in memory or even on disk.
For my data pipeline, I generally download a dataset, load it, add some quick annotations (e.g. string length), and save it as Parquet files before uploading to S3. This works fine with the following script:
from datasets import load_dataset_builder

# ds_name, config, and output_dir are defined elsewhere in the pipeline
builder = load_dataset_builder(ds_name, config, trust_remote_code=True)
builder.download_and_prepare(output_dir, file_format="parquet", num_proc=8)
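(The annotation step itself is nothing heavy; it's roughly the sketch below, where "text" just stands in for whichever column I'm measuring.)

from datasets import load_dataset

# "text" is a placeholder for the column being annotated
ds = load_dataset(ds_name, config, split="train")
ds = ds.map(lambda ex: {"text_length": len(ex["text"])}, num_proc=8)
ds.to_parquet(f"{output_dir}/train.parquet")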
The problem is when the dataset or a subset/config is too large to be cached on disk all at once. Is there a way to load a dataset in batches that is also fast (unlike streaming)?
For example, certain splits (e.g. 'json') of bigcode/commitpack are too big to be cached on disk all at once. I'd like to be able to load them in smaller segments, say the first 10M rows, then the next 10M, and so on, so I don't have to worry about filling up the disk.
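The only chunked approach I can think of is streaming and writing out fixed-size shards myself, along the lines of the sketch below, but iterating an IterableDataset row by row is far too slow for a dataset of this size (and the buffer would also have to be much smaller than 10M rows to fit in memory):

from datasets import load_dataset, Dataset

# ds_name, config, and output_dir as above; chunk_size kept small so the buffer fits in memory
chunk_size = 1_000_000
stream = load_dataset(ds_name, config, split="train", streaming=True, trust_remote_code=True)

buffer, shard_idx = [], 0
for example in stream:
    buffer.append(example)
    if len(buffer) == chunk_size:
        Dataset.from_list(buffer).to_parquet(f"{output_dir}/shard_{shard_idx:05d}.parquet")
        buffer, shard_idx = [], shard_idx + 1
if buffer:
    Dataset.from_list(buffer).to_parquet(f"{output_dir}/shard_{shard_idx:05d}.parquet")

Ideally I'd get the speed of download_and_prepare but with bounded disk usage like this.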