I'm training a deep learning model using PyTorch on ~100k individual files (each 7–12 MB, .mat) stored on a standard HDD (I cannot get an SSD right now, so that's not an option). Each file is one training example.
Current setup:
Batch size: 16
num_workers=16
prefetch_factor=2
pin_memory=True
persistent_workers=True
Dataset reads directly from disk, no decoding or preprocessing
Machine: 256 GB RAM, fast GPU (no GPU bottleneck)
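For reference, this is roughly what the current loading code looks like (the dataset class is simplified; the scipy.io.loadmat call, the "data"/"label" keys, and the path are stand-ins for my actual format):

```python
import glob

import scipy.io
import torch
from torch.utils.data import Dataset, DataLoader


class MatFileDataset(Dataset):
    """One .mat file per training example, read straight from the HDD."""

    def __init__(self, root):
        self.paths = sorted(glob.glob(f"{root}/*.mat"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        sample = scipy.io.loadmat(self.paths[idx])                 # reads one ~10 MB file per example
        x = torch.as_tensor(sample["data"], dtype=torch.float32)   # "data"/"label" keys are placeholders
        y = torch.as_tensor(sample["label"]).squeeze()
        return x, y


loader = DataLoader(
    MatFileDataset("/data/train"),   # hypothetical path
    batch_size=16,
    shuffle=True,
    num_workers=16,
    prefetch_factor=2,
    pin_memory=True,
    persistent_workers=True,
)
```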
Problem:
Despite applying the standard PyTorch performance tricks, I/O remains the bottleneck: each epoch takes over an hour, and GPU utilization is poor because data loading is slow. Disabling shuffling helps slightly at first, but performance still degrades.
Already tried (did NOT help):
- pin_memory, prefetch_factor, persistent_workers, high num_workers
- LMDB: I packed everything into one enormous file, but it was even slower in my case, despite claims I read that LMDB's memory-mapped storage layout would be an improvement (a rough sketch of what I did is below)
- Removing shuffling entirely (beyond the small initial effect mentioned above, no lasting gain)
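For completeness, the LMDB attempt looked roughly like this (one big environment, one key per sample; the zero-padded key format and pickle serialization here are assumptions, my real code differs in details):

```python
import pickle

import lmdb
import torch
from torch.utils.data import Dataset


class LmdbDataset(Dataset):
    """All ~100k samples packed into a single LMDB environment, one key per example."""

    def __init__(self, lmdb_path):
        self.lmdb_path = lmdb_path
        self.env = None
        # open once just to count entries, then close so DataLoader workers can fork safely
        env = lmdb.open(lmdb_path, readonly=True, lock=False)
        with env.begin() as txn:
            self.length = txn.stat()["entries"]
        env.close()

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # each worker lazily opens its own read-only handle
        if self.env is None:
            self.env = lmdb.open(self.lmdb_path, readonly=True, lock=False,
                                 readahead=False, meminit=False)
        with self.env.begin() as txn:
            buf = txn.get(f"{idx:08d}".encode())   # keys assumed to be zero-padded indices
        x, y = pickle.loads(buf)                   # pickle serialization is an assumption
        return torch.as_tensor(x), torch.as_tensor(y)
```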
Question:
What are effective, practical solutions to speed up minibatch loading from HDD in this scenario?
I’m open to:
- Packing files into a single large file (LMDB, HDF5, tar, memory-mapped arrays, etc.); a rough sketch of what I mean by the memory-mapped option is after this list
- Caching strategies or chunk loading
- Anything that actually solves the issue in practice
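For concreteness, this is the kind of thing I imagine for the memory-mapped option, assuming each sample can be converted to a fixed-shape float32 array (the sample count, shape, filenames, and helper names below are made up). Is something along these lines what people actually do?

```python
import numpy as np

# One-time packing pass: convert every sample to a fixed-shape float32 array and
# write them back to back into a single .npy file on the HDD.
N, SHAPE = 100_000, (1024, 1024)   # hypothetical sample count and per-sample shape
pack = np.lib.format.open_memmap(
    "train_pack.npy", mode="w+", dtype=np.float32, shape=(N, *SHAPE)
)
# for i, path in enumerate(mat_paths):   # packing loop left as pseudocode;
#     pack[i] = load_one_mat(path)       # mat_paths / load_one_mat are hypothetical
pack.flush()

# Training-time access: one read-only memory map, so fetching a contiguous block of
# indices becomes a few large sequential reads instead of many small file opens.
data = np.load("train_pack.npy", mmap_mode="r")
batch = np.asarray(data[:16])   # copies just this batch into RAM
```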
Please suggest solutions that you’ve personally used successfully with large datasets stored on HDD.
What do people do in practice? I don't understand how one can do better with an HDD, but maybe it is possible?