What are the most effective and reliable ways to load minibatches efficiently from HDD for deep learning training?

I’m training a deep learning model using PyTorch on ~100k individual files (each 7–12 MB, .mat) stored on a standard HDD (I cannot get an SSD right now, so it is not an option). Each file is one training example.

Current setup:

  • Batch size: 16
  • num_workers=16
  • prefetch_factor=2
  • pin_memory=True
  • persistent_workers=True
  • Dataset reads directly from disk, no decoding or preprocessing (a rough sketch of the loader is below)
  • Machine: 256 GB RAM, fast GPU (no GPU bottleneck)
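For reference, the loader is set up roughly like this (MatDataset, the file paths, and the "data"/"label" keys are placeholders for my actual code):

```python
import glob
import scipy.io
import torch
from torch.utils.data import Dataset, DataLoader

class MatDataset(Dataset):
    """Each .mat file on disk is one training example."""

    def __init__(self, files):
        self.files = files

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # One random 7-12 MB read from the HDD per example.
        sample = scipy.io.loadmat(self.files[idx])
        x = torch.from_numpy(sample["data"])   # placeholder key
        y = torch.from_numpy(sample["label"])  # placeholder key
        return x, y

files = sorted(glob.glob("/path/to/dataset/*.mat"))
loader = DataLoader(
    MatDataset(files),
    batch_size=16,
    shuffle=True,
    num_workers=16,
    prefetch_factor=2,
    pin_memory=True,
    persistent_workers=True,
)
```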

Problem:

Despite applying all the standard PyTorch performance tricks, I/O remains the bottleneck. Each epoch takes over an hour, and GPU utilization is poor because data loading is slow. Disabling shuffling helps slightly at first, but performance still degrades.

Already tried — did NOT help:

  • pin_memory, prefetch_factor, persistent_workers, high num_workers
  • LMDB: I packed everything into one enormous file, but in my case it was even slower, despite claims I had read that LMDB's supposedly better on-disk layout would help
  • Removing shuffling (no performance gain whatsoever)

Question:

What are effective, practical solutions to speed up minibatch loading from HDD in this scenario?

I’m open to:

  • Packing files into a single large file (LMDB, HDF5, tar, memory-mapped arrays, etc.); a sketch of the memory-mapped variant follows this list
  • Caching strategies or chunk loading
  • Anything that actually solves the issue in practice
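As a concrete example of the packing idea, this is roughly what I have in mind for a single memory-mapped array (the per-example shape, dtype, and "data" key are made up, and it assumes every example has the same shape):

```python
import glob
import numpy as np
import scipy.io

files = sorted(glob.glob("/path/to/dataset/*.mat"))
sample_shape = (256, 256, 8)  # made-up fixed per-example shape

# One-time packing pass: write every example into one large .npy file,
# so later reads are big sequential I/O instead of 100k small random reads.
packed = np.lib.format.open_memmap(
    "/path/to/packed.npy",
    mode="w+",
    dtype=np.float32,
    shape=(len(files), *sample_shape),
)
for i, f in enumerate(files):
    packed[i] = scipy.io.loadmat(f)["data"]  # placeholder key
packed.flush()

# At training time: memory-map the packed file read-only and slice
# contiguous chunks, shuffling chunk order rather than individual examples.
data = np.load("/path/to/packed.npy", mmap_mode="r")
chunk = np.asarray(data[0:1024])  # one sequential chunk read from the HDD
```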

Please suggest solutions that you’ve personally used successfully with large datasets stored on HDD.

What do people do in practice? I don’t understand how one can do better with an HDD… but maybe it is possible?


Hmm… Would WebDataset be a good choice? The Parquet format commonly used by Hugging Face datasets is also an option…
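I haven’t tried it on your data, but a WebDataset pipeline usually looks roughly like this (the shard pattern and the input.npy/target.npy keys are just examples). Since shards are plain tar files, the HDD mostly gets large sequential reads:

```python
import torch
import webdataset as wds

# Assumes the .mat files have already been repacked into ~100 tar shards.
shards = "/path/to/shards/shard-{000000..000099}.tar"

dataset = (
    wds.WebDataset(shards, shardshuffle=True)  # shuffle shard order each epoch
    .shuffle(1000)                             # small in-memory shuffle buffer
    .decode()                                  # default decoders handle .npy entries
    .to_tuple("input.npy", "target.npy")
)

loader = torch.utils.data.DataLoader(dataset, batch_size=16, num_workers=4)
```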