Hi, I have a ~1 TB dataset stored on the HF Hub. I can download it to disk and read it successfully, but it's large enough that I can't fit it in RAM.
What is the best practice for training a model on such a dataset?
I tried loading the dataset with load_dataset(..., streaming=True) and keeping two buffers: one that the training process consumes and loads onto the GPU, and one that a separate thread fills by streaming from the dataset. When the training buffer starts running low, I swap the buffers and refill the now-empty one in a new thread. This wasn't very successful, since reading from disk was still the bottleneck.
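Roughly, the double-buffer setup looked like this (a simplified sketch with placeholder names and sizes, not the exact code):

import threading
from datasets import load_dataset

BUFFER_SIZE = 10_000  # illustrative size

def fill_buffer(stream, buffer, n):
    # Pull up to n examples from the streaming iterator into `buffer`
    # (runs in a background thread).
    for _ in range(n):
        try:
            buffer.append(next(stream))
        except StopIteration:
            break

ds = load_dataset("user/dataset", split="train", streaming=True)  # placeholder repo
stream = iter(ds)

active, spare = [], []
fill_buffer(stream, active, BUFFER_SIZE)  # fill the first buffer up front
filler = threading.Thread(target=fill_buffer, args=(stream, spare, BUFFER_SIZE))
filler.start()

while active:
    example = active.pop()
    ...  # training step on `example`
    if not active:  # training buffer exhausted: swap and refill in a new thread
        filler.join()
        active, spare = spare, []
        filler = threading.Thread(target=fill_buffer, args=(stream, spare, BUFFER_SIZE))
        filler.start()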
Dealing with a dataset that’s too big for your RAM can be tricky, but there are some great ways to make it work smoothly:
Stream Your Data:
Instead of loading the whole dataset into memory, you can stream it directly using load_dataset(..., streaming=True). This way, you only load what you need, when you need it, saving memory! (There's a short sketch combining this with the shuffling and prefetching points after this list.)
Shuffle Smartly:
While streaming, you can shuffle with a buffer to keep things random but still memory-efficient. A buffer of a few thousand examples usually works well.
Speed Up Loading:
If reading from disk is slowing you down, try breaking the dataset into smaller chunks beforehand. Smaller files load faster and keep things moving.
Prefetch Data:
You can load data in advance while your model is training to avoid waiting for the next batch. Frameworks like PyTorch or TensorFlow have prefetching options built in.
Use Cloud Storage:
If your disk is still slow, storing the dataset on a fast cloud service like AWS or GCP might help. You can stream the data directly from there with better speeds.
Memory Mapping:
If the dataset is too big for RAM but fits on your disk, you can load just the parts you need on the fly without keeping the whole thing in memory.
Split the Dataset:
You could split the dataset into smaller parts, then load one part at a time as needed. This keeps your memory usage low and training smooth.
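Putting the streaming, shuffle-buffer, and prefetching points together, a minimal sketch could look something like this (the repo name, buffer size, batch size, and worker count are just placeholders to adapt):

from datasets import load_dataset
from torch.utils.data import DataLoader

# Stream the dataset so nothing has to be materialized in RAM.
ds = load_dataset("user/dataset", split="train", streaming=True)

# Approximate shuffling with a bounded in-memory buffer.
ds = ds.shuffle(seed=42, buffer_size=10_000)

# Yield torch tensors so batches can be moved straight to the GPU.
ds = ds.with_format("torch")

dl = DataLoader(
    ds,
    batch_size=32,
    num_workers=4,      # workers stream and decode in parallel
    prefetch_factor=2,  # batches each worker keeps ready in advance
    pin_memory=True,    # speeds up host-to-GPU copies
)

for batch in dl:
    ...  # training step

As far as I know, the files (shards) of a streaming dataset are split across the DataLoader workers, so it helps to have at least as many files as workers.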
streaming: I tried this, but it is just too slow. I even tried having separate threads here (as described in my post above) but was unsuccessful.
shuffle smartly: I didn’t find this to be a bottleneck
speed up loading: how? Ideally this would be multithreaded. My parquet files are about 130 MB each, and each file contains multiple data samples.
prefetch data: I’ll try this
cloud storage: not really an option, since everything runs on the university's HPC cluster
memory mapping: Can you explain this further? Is this related to load_dataset()?
split the dataset: I think this is what I tried with the buffers, but I hit a wall
So, how do I cache the dataset to maximize RAM usage and use memory mapping for the leftover part of the dataset? I tried following this, but it wouldn't really fill up the RAM unless I load the dataset with load_dataset(keep_in_memory=True), which then overloads memory.
Have you tried using a DataLoader with num_workers and prefetching? This usually helps!
from datasets import load_dataset
from torch.utils.data import DataLoader

# Stream the dataset and let the DataLoader workers prefetch in the background.
ds = load_dataset(..., streaming=True)
dl = DataLoader(ds, num_workers=..., prefetch_factor=...)
for example in dl:
    ...
Yes, I'm using num_workers=8 and the default prefetch_factor, which is 2. I tried setting it to 4 but haven't seen much of a difference.
My dataset has imbalanced binary labels: the splits are pos (1% of samples) and neg (99% of samples). I have a custom IterableDataset whose __iter__ pulls from the positive and negative samples (with streaming=True on both the pos and neg datasets) and constructs evenly balanced minibatches. Previously I tried a Dataset that implements __getitem__ instead of IterableDataset (with streaming=False for pos and neg), but that seems slightly slower due to indexing. I haven't really found a way to leverage more RAM to make everything go faster. The bottleneck seems to be disk I/O (I am using VAST storage).
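For reference, the balanced-batch iterator is roughly along these lines (a simplified sketch with placeholder repo/split names and batch size, not the exact code):

from datasets import load_dataset
from torch.utils.data import DataLoader, IterableDataset

class BalancedStream(IterableDataset):
    # Streams the pos and neg splits separately and interleaves them so that
    # each minibatch is roughly 50/50.
    def __init__(self, repo, batch_size=32):
        self.pos = load_dataset(repo, split="pos", streaming=True)
        self.neg = load_dataset(repo, split="neg", streaming=True)
        self.batch_size = batch_size

    def __iter__(self):
        pos_it, neg_it = iter(self.pos), iter(self.neg)
        while True:
            batch = []
            try:
                for _ in range(self.batch_size // 2):
                    batch.append(next(pos_it))  # half positives
                    batch.append(next(neg_it))  # half negatives
            except StopIteration:
                return  # stop once either split runs out
            yield batch

ds = BalancedStream("user/dataset")  # placeholder repo
# Batches are already built in __iter__; worker sharding isn't handled in this
# simplified version, so num_workers stays at 0 here.
dl = DataLoader(ds, batch_size=None, num_workers=0)

(datasets.interleave_datasets with probabilities=[0.5, 0.5] could probably achieve a similar balance on the streaming side.)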