I'm training a deep learning model using PyTorch on ~100k individual files (each 7–12 MB, .mat) stored on a standard HDD (I cannot get an SSD right now, so that's not an option). Each file is one training example.
Current setup:
Batch size: 16
num_workers=16
prefetch_factor=2
pin_memory=True
persistent_workers=True
Dataset reads directly from disk, no decoding or preprocessing
Machine: 256 GB RAM, fast GPU (no GPU bottleneck)
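For reference, this is roughly what the current loading code looks like (the dataset class is simplified; the scipy.io.loadmat call, the "data"/"label" keys, and the path are stand-ins for my actual format):

```python
import glob

import scipy.io
import torch
from torch.utils.data import Dataset, DataLoader


class MatFileDataset(Dataset):
    """One .mat file per training example, read straight from the HDD."""

    def __init__(self, root):
        self.paths = sorted(glob.glob(f"{root}/*.mat"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        sample = scipy.io.loadmat(self.paths[idx])                 # reads one ~10 MB file per example
        x = torch.as_tensor(sample["data"], dtype=torch.float32)   # "data"/"label" keys are placeholders
        y = torch.as_tensor(sample["label"]).squeeze()
        return x, y


loader = DataLoader(
    MatFileDataset("/data/train"),   # hypothetical path
    batch_size=16,
    shuffle=True,
    num_workers=16,
    prefetch_factor=2,
    pin_memory=True,
    persistent_workers=True,
)
```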
Problem:
Despite applying the standard PyTorch performance tricks, I/O remains the bottleneck: each epoch takes over an hour, and GPU utilization is poor because data loading is slow. Disabling shuffling helps slightly at first, but performance still degrades.
Already tried (did NOT help):
- pin_memory, prefetch_factor, persistent_workers, high num_workers
- LMDB: I packed everything into one enormous file, but it was even slower in my case, despite claims I read that LMDB's memory-mapped storage layout would be an improvement (a rough sketch of what I did is below)
- Removing shuffling entirely (beyond the small initial effect mentioned above, no lasting gain)
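For completeness, the LMDB attempt looked roughly like this (one big environment, one key per sample; the zero-padded key format and pickle serialization here are assumptions, my real code differs in details):

```python
import pickle

import lmdb
import torch
from torch.utils.data import Dataset


class LmdbDataset(Dataset):
    """All ~100k samples packed into a single LMDB environment, one key per example."""

    def __init__(self, lmdb_path):
        self.lmdb_path = lmdb_path
        self.env = None
        # open once just to count entries, then close so DataLoader workers can fork safely
        env = lmdb.open(lmdb_path, readonly=True, lock=False)
        with env.begin() as txn:
            self.length = txn.stat()["entries"]
        env.close()

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # each worker lazily opens its own read-only handle
        if self.env is None:
            self.env = lmdb.open(self.lmdb_path, readonly=True, lock=False,
                                 readahead=False, meminit=False)
        with self.env.begin() as txn:
            buf = txn.get(f"{idx:08d}".encode())   # keys assumed to be zero-padded indices
        x, y = pickle.loads(buf)                   # pickle serialization is an assumption
        return torch.as_tensor(x), torch.as_tensor(y)
```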
Question:
What are effective, practical solutions to speed up minibatch loading from HDD in this scenario?
I’m open to:
- Packing files into a single large file (LMDB, HDF5, tar, memory-mapped arrays, etc.); a rough sketch of what I mean by the memory-mapped option is after this list
- Caching strategies or chunk loading
- Anything that actually solves the issue in practice
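For concreteness, this is the kind of thing I imagine for the memory-mapped option, assuming each sample can be converted to a fixed-shape float32 array (the sample count, shape, filenames, and helper names below are made up). Is something along these lines what people actually do?

```python
import numpy as np

# One-time packing pass: convert every sample to a fixed-shape float32 array and
# write them back to back into a single .npy file on the HDD.
N, SHAPE = 100_000, (1024, 1024)   # hypothetical sample count and per-sample shape
pack = np.lib.format.open_memmap(
    "train_pack.npy", mode="w+", dtype=np.float32, shape=(N, *SHAPE)
)
# for i, path in enumerate(mat_paths):   # packing loop left as pseudocode;
#     pack[i] = load_one_mat(path)       # mat_paths / load_one_mat are hypothetical
pack.flush()

# Training-time access: one read-only memory map, so fetching a contiguous block of
# indices becomes a few large sequential reads instead of many small file opens.
data = np.load("train_pack.npy", mmap_mode="r")
batch = np.asarray(data[:16])   # copies just this batch into RAM
```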
Please suggest solutions that you’ve personally used successfully with large datasets stored on HDD.
What do people do in practice? I don't understand how one can do better with an HDD, but maybe it is possible?