Batched I/O from disk when using the load_dataset API?

Hi,

I just downloaded a ~1 TB HF dataset to disk using git clone and am training a model with the load_dataset API in streaming mode. The stream reads samples from disk and populates a buffer, from which I construct batches for the model.

During training, the bottleneck is reading from disk. Is there a way to load from disk in batches when streaming=True is set? This is different from yielding a batch while still reading samples one by one.

Thank you!


Is this it? It’s probably a little different…

That’s not quite it, but I guess I can look into the source code to understand exactly how things are loaded. I think it’s fine to close the topic.
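For anyone landing here later: a minimal sketch of the distinction the question is drawing. Grouping a streamed iterator into batches at the Python level (as below, or via `IterableDataset.iter(batch_size=...)` in recent `datasets` versions) still reads samples one at a time underneath; truly batched disk reads depend on the on-disk format and loader. The `stream_samples` generator here is a hypothetical stand-in for a streaming dataset, not a `datasets` API:

```python
from itertools import islice

def stream_samples(n):
    """Stand-in for a streamed dataset that yields samples one at a time,
    as load_dataset(..., streaming=True) does conceptually."""
    for i in range(n):
        yield {"id": i}

def batched(iterable, batch_size):
    """Group a sample stream into lists of up to batch_size samples.
    Note: this batches only at the Python level; the underlying I/O
    is still one sample at a time, which is the concern in this thread."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

sizes = [len(b) for b in batched(stream_samples(10), 4)]
print(sizes)  # → [4, 4, 2]
```

Whether the disk reads themselves can be batched depends on the file format: columnar formats such as Parquet can be read a row group at a time, whereas line-oriented formats are read record by record.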
