Hi, I have a ~1 TB dataset stored on the HF Hub. I can download it to disk and read it successfully, but it's large enough that I can't fit it in RAM.
What is the best practice for training a model on such a dataset?
I tried loading the dataset with load_dataset(..., streaming=True) and keeping two buffers: one that the training process consumes and loads onto the GPU, and one that a separate thread fills by streaming from the dataset. When the training buffer starts running low, I swap the buffers and refill the now-empty one in a new thread. This wasn't very successful, since reading from disk was still the bottleneck.
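Roughly, the double-buffer setup looked like this (a simplified sketch with placeholder names and sizes, not the exact code):

import threading
from datasets import load_dataset

BUFFER_SIZE = 10_000  # illustrative size

def fill_buffer(stream, buffer, n):
    # Pull up to n examples from the streaming iterator into `buffer`
    # (runs in a background thread).
    for _ in range(n):
        try:
            buffer.append(next(stream))
        except StopIteration:
            break

ds = load_dataset("user/dataset", split="train", streaming=True)  # placeholder repo
stream = iter(ds)

active, spare = [], []
fill_buffer(stream, active, BUFFER_SIZE)  # fill the first buffer up front
filler = threading.Thread(target=fill_buffer, args=(stream, spare, BUFFER_SIZE))
filler.start()

while active:
    example = active.pop()
    ...  # training step on `example`
    if not active:  # training buffer exhausted: swap and refill in a new thread
        filler.join()
        active, spare = spare, []
        filler = threading.Thread(target=fill_buffer, args=(stream, spare, BUFFER_SIZE))
        filler.start()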
Dealing with a dataset that’s too big for your RAM can be tricky, but there are some great ways to make it work smoothly:
Stream Your Data:
Instead of loading the whole dataset into memory, you can stream it directly using load_dataset(..., streaming=True). This way, you only load what you need, when you need it, saving memory! (There's a short sketch combining this with the shuffling and prefetching points after this list.)
Shuffle Smartly:
While streaming, you can shuffle with a buffer to keep things random but still memory-efficient. A buffer of a few thousand examples usually works well.
Speed Up Loading:
If reading from disk is slowing you down, try breaking the dataset into smaller chunks beforehand. Smaller files load faster and keep things moving.
Prefetch Data:
You can load data in advance while your model is training to avoid waiting for the next batch. Frameworks like PyTorch or TensorFlow have prefetching options built in.
Use Cloud Storage:
If your disk is still slow, storing the dataset on a fast cloud service like AWS or GCP might help. You can stream the data directly from there with better speeds.
Memory Mapping:
If the dataset is too big for RAM but fits on your disk, you can load just the parts you need on the fly without keeping the whole thing in memory.
Split the Dataset:
You could split the dataset into smaller parts, then load one part at a time as needed. This keeps your memory usage low and training smooth.
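Putting the streaming, shuffle-buffer, and prefetching points together, a minimal sketch could look something like this (the repo name, buffer size, batch size, and worker count are just placeholders to adapt):

from datasets import load_dataset
from torch.utils.data import DataLoader

# Stream the dataset so nothing has to be materialized in RAM.
ds = load_dataset("user/dataset", split="train", streaming=True)

# Approximate shuffling with a bounded in-memory buffer.
ds = ds.shuffle(seed=42, buffer_size=10_000)

# Yield torch tensors so batches can be moved straight to the GPU.
ds = ds.with_format("torch")

dl = DataLoader(
    ds,
    batch_size=32,
    num_workers=4,      # workers stream and decode in parallel
    prefetch_factor=2,  # batches each worker keeps ready in advance
    pin_memory=True,    # speeds up host-to-GPU copies
)

for batch in dl:
    ...  # training step

As far as I know, the files (shards) of a streaming dataset are split across the DataLoader workers, so it helps to have at least as many files as workers.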
streaming: I tried this, but it is just too slow. I even tried having separate threads here (as described in my post above) but was unsuccessful.
shuffle smartly: I didn’t find this to be a bottleneck
speed up loading: how? Ideally this would be multithreaded. My parquet files are about 130 MB each, and each file contains multiple data samples.
prefetch data: I’ll try this
cloud storage: not really an option, since everything runs on the university's HPC cluster
memory mapping: Can you explain this further? Is this related to load_dataset()?
split the dataset: I think this is what I tried with the buffers, but I hit a wall
So, how do I cache the dataset to maximize RAM usage and use memory mapping for the leftover part of the dataset? I tried following this, but it wouldn't really fill up the RAM unless I load the dataset with load_dataset(keep_in_memory=True), which then overloads memory.
Have you tried using a DataLoader with num_workers and prefetching? This usually helps!
from datasets import load_dataset
from torch.utils.data import DataLoader

# Stream the dataset and let the DataLoader workers prefetch in the background.
ds = load_dataset(..., streaming=True)
dl = DataLoader(ds, num_workers=..., prefetch_factor=...)
for example in dl:
    ...
Yes, I'm using num_workers=8 and the default prefetch_factor, which is 2. I tried setting it to 4 but haven't seen much of a difference.
My dataset has imbalanced binary labels: the splits are pos (1% of samples) and neg (99% of samples). I have a custom IterableDataset whose __iter__ pulls from the positive and negative samples (with streaming=True on both the pos and neg datasets) and constructs evenly balanced minibatches. Previously I tried a Dataset that implements __getitem__ instead of IterableDataset (with streaming=False for pos and neg), but that seems slightly slower due to indexing. I haven't really found a way to leverage more RAM to make everything go faster. The bottleneck seems to be disk I/O (I am using VAST storage).
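For reference, the balanced-batch iterator is roughly along these lines (a simplified sketch with placeholder repo/split names and batch size, not the exact code):

from datasets import load_dataset
from torch.utils.data import DataLoader, IterableDataset

class BalancedStream(IterableDataset):
    # Streams the pos and neg splits separately and interleaves them so that
    # each minibatch is roughly 50/50.
    def __init__(self, repo, batch_size=32):
        self.pos = load_dataset(repo, split="pos", streaming=True)
        self.neg = load_dataset(repo, split="neg", streaming=True)
        self.batch_size = batch_size

    def __iter__(self):
        pos_it, neg_it = iter(self.pos), iter(self.neg)
        while True:
            batch = []
            try:
                for _ in range(self.batch_size // 2):
                    batch.append(next(pos_it))  # half positives
                    batch.append(next(neg_it))  # half negatives
            except StopIteration:
                return  # stop once either split runs out
            yield batch

ds = BalancedStream("user/dataset")  # placeholder repo
# Batches are already built in __iter__; worker sharding isn't handled in this
# simplified version, so num_workers stays at 0 here.
dl = DataLoader(ds, batch_size=None, num_workers=0)

(datasets.interleave_datasets with probabilities=[0.5, 0.5] could probably achieve a similar balance on the streaming side.)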