Batched I/O from disk when using the load_dataset API?

Hi,

I just downloaded a ~1 TB HF dataset to disk using git clone and am training a model with the load_dataset API in streaming mode. The stream reads samples from disk and populates a buffer, from which I construct batches for the model.

During training, the bottleneck is reading from disk. Is there a way to load from disk in batches when streaming=True is set? This is different from yielding a batch while still reading samples one by one.

Thank you!


Is this it? It’s probably a little different…

That’s not quite it, but I guess I can look into the source code to understand exactly how things are loaded. I think it’s fine to close the topic.
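For anyone landing here later: a minimal sketch of the distinction the question is drawing. Grouping a streamed iterator into batches at the Python level (as below, or via `IterableDataset.iter(batch_size=...)` in recent `datasets` versions) still reads samples one at a time underneath; truly batched disk reads depend on the on-disk format and loader. The `stream_samples` generator here is a hypothetical stand-in for a streaming dataset, not a `datasets` API:

```python
from itertools import islice

def stream_samples(n):
    """Stand-in for a streamed dataset that yields samples one at a time,
    as load_dataset(..., streaming=True) does conceptually."""
    for i in range(n):
        yield {"id": i}

def batched(iterable, batch_size):
    """Group a sample stream into lists of up to batch_size samples.
    Note: this batches only at the Python level; the underlying I/O
    is still one sample at a time, which is the concern in this thread."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

sizes = [len(b) for b in batched(stream_samples(10), 4)]
print(sizes)  # → [4, 4, 2]
```

Whether the disk reads themselves can be batched depends on the file format: columnar formats such as Parquet can be read a row group at a time, whereas line-oriented formats are read record by record.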
