I just downloaded a ~1TB HF dataset onto disk using git clone and am training a model using the `load_dataset` API in streaming mode. The stream reads samples from disk one at a time into a buffer, from which I construct batches for the model.
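For reference, here is a minimal pure-Python sketch of the per-sample pipeline described above (the function names and buffer logic are illustrative, not the `datasets` internals): the stream is consumed one sample at a time, so each disk read fetches a single example even though training consumes whole batches.

```python
import random
from itertools import islice

def sample_stream(n):
    # Stand-in for a streamed dataset yielding one example at a time.
    for i in range(n):
        yield {"id": i}

def buffered_batches(stream, buffer_size, batch_size, seed=0):
    """Fill a shuffle buffer sample by sample, then draw batches from it.

    This mirrors the setup above: I/O happens per sample, so reads
    are not amortized across a batch.
    """
    rng = random.Random(seed)
    buffer = list(islice(stream, buffer_size))
    batch = []
    for sample in stream:
        # Swap a random buffered sample out for the incoming one.
        i = rng.randrange(len(buffer))
        batch.append(buffer[i])
        buffer[i] = sample
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Drain whatever remains once the stream is exhausted.
    leftover = batch + buffer
    rng.shuffle(leftover)
    for i in range(0, len(leftover), batch_size):
        yield leftover[i:i + batch_size]

batches = list(buffered_batches(sample_stream(10), buffer_size=4, batch_size=3))
```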
During training, the bottleneck is reading from disk. Is there a way to load samples from disk in batches when `streaming=True`? To be clear, this is different from yielding a batch while still reading the underlying samples one by one.