How do I download and load a dataset in batches without caching all of it?

There are certain datasets that are too big to fit in memory or on disk.

For my data pipeline, I generally download datasets, load them, add some quick annotations (say, the length of a string field) and save them as Parquet files before uploading them to S3. This works fine using the following script:

from datasets import load_dataset_builder

# ds_name, config, and output_dir are placeholders for the dataset name, subset/config, and target directory
builder = load_dataset_builder(ds_name, config, trust_remote_code=True)
builder.download_and_prepare(output_dir, file_format="parquet", num_proc=8)
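For reference, the annotation step could look roughly like this once the prepared Parquet files are on disk; the file glob, the "text" column name, and the output file name are placeholders, not part of the actual dataset:

from datasets import load_dataset

# Read back the parquet files written by download_and_prepare (glob is a placeholder)
ds = load_dataset("parquet", data_files=f"{output_dir}/*.parquet", split="train")

# Quick annotation, e.g. string length; "text" is a placeholder column name
ds = ds.map(lambda ex: {"text_length": len(ex["text"])}, num_proc=8)

# Write a single annotated parquet file to upload to S3
ds.to_parquet("annotated.parquet")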

This works fine unless the dataset or subset/config is too large to be cached on disk all at once. Is there a way to load a dataset in batches that is also fast (unlike streaming)?

For example, certain subsets/splits ('json') of this dataset, bigcode/commitpack, are too big to be loaded onto disk all at once. I'd like to be able to load them in smaller segments, say the first 10M rows, then the next 10M, and so on, so I don't have to worry about the disk filling up.
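If a slower one-off streaming pass were acceptable, the "first 10M rows, next 10M" idea could be sketched like this: iterate the split as a stream and flush every N rows to its own Parquet file, so neither memory nor disk ever has to hold the whole subset. This is only a minimal sketch; it assumes the loading script supports streaming, and the segment size, batch size, and file names are placeholders:

import itertools

import pyarrow as pa
import pyarrow.parquet as pq
from datasets import load_dataset

# Stream the split so nothing is cached up front (ds_name and config as above)
stream = load_dataset(ds_name, config, split="train", streaming=True, trust_remote_code=True)

rows_per_file = 10_000_000   # placeholder segment size
rows_per_batch = 10_000      # small write batches keep memory usage flat

it = iter(stream)
file_idx = 0
while True:
    writer = None
    written = 0
    while written < rows_per_file:
        batch = list(itertools.islice(it, rows_per_batch))
        if not batch:
            break
        table = pa.Table.from_pylist(batch)
        if writer is None:
            # One parquet file per segment; schema is inferred from the first batch
            writer = pq.ParquetWriter(f"segment-{file_idx:05d}.parquet", table.schema)
        writer.write_table(table)
        written += len(batch)
    if writer is None:            # stream exhausted exactly at a segment boundary
        break
    writer.close()
    file_idx += 1
    if written < rows_per_file:   # stream exhausted mid-segment: we are done
        break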

I've only done small-scale dataset processing, but the only way I can think of for huge files is to use the accelerate library.
Or, split the data into shard files and batch-process them…
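A rough sketch of the shard-file idea could look like the following: list the subset's raw data files on the Hub and convert a few of them at a time, so only one group's download and Arrow cache ever sits on disk. The "data/json/" prefix, the assumption that the shards are JSON Lines, the "new_contents" column, and the group size are all guesses about bigcode/commitpack, not confirmed details:

import shutil
import tempfile

from datasets import load_dataset
from huggingface_hub import HfApi

repo = "bigcode/commitpack"
# List the subset's raw data files; the "data/json/" prefix is a guess at the repo layout
shards = sorted(
    f for f in HfApi().list_repo_files(repo, repo_type="dataset")
    if f.startswith("data/json/")
)

group_size = 4  # shard files to download and process per round (placeholder)
for i in range(0, len(shards), group_size):
    urls = [f"https://huggingface.co/datasets/{repo}/resolve/main/{p}"
            for p in shards[i : i + group_size]]
    cache_dir = tempfile.mkdtemp()  # keep this group's cache isolated so it can be deleted
    # Assumes the shards are JSON Lines files that the generic "json" builder can read
    ds = load_dataset("json", data_files=urls, split="train", cache_dir=cache_dir)
    ds = ds.map(lambda ex: {"n_chars": len(ex["new_contents"])})  # column name is a guess
    ds.to_parquet(f"commitpack-json-part-{i // group_size:05d}.parquet")
    del ds
    shutil.rmtree(cache_dir)  # free the raw download + Arrow cache before the next group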