Hey Datasets community. I am having a minor issue with streaming a dataset. I have a very large dataset, over 500GB, that I have sharded into 5 files. Naturally I chose to use a streaming dataset since my machine does not have enough RAM to hold the entire dataset. I create my HF dataset with essentially the following line (file format and paths are simplified placeholders here):
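    from datasets import load_dataset

    # Simplified: "parquet" and the shard filenames are placeholders for my real files.
    dataset = load_dataset(
        "parquet",
        data_files=["shard_0.parquet", "shard_1.parquet", "shard_2.parquet",
                    "shard_3.parquet", "shard_4.parquet"],
        split="train",
        streaming=True,
    )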
I choose to have 4 workers. It works and is plenty fast, but during an epoch the memory footprint continually grows. When I look at htop in Linux, my training program spawned about 54 processes, all of which currently take up ~4.8GB of memory. Earlier today that same number for those processes was ~3GB. My conception of a streaming dataloader is that it should load in a batch and after a training step forfeit that memory so that it can load in the next batch. Therefore dataloading should always maintain the memory footprint of 1 batch*num_workers. Is there a memory leak here? I have attached a screenshot of htop to help you get an idea of what is going on.
After reading both posts, it's not clear that there is a resolution to the reported behavior. I see that Quentin recommended dataset.with_format("torch"). I tried this, but it slows down the dataloading compared to the dataset without the “with_format” method, almost by a factor of 2.
I have spent some time observing the different behavior of the two methods. Without “with_format” it loads data very quickly but gets hung up at certain points, where the dataloader is suspended for a few seconds, sometimes tens of seconds. When using “with_format” it never gets hung up, but as I said it is only about half as fast. Both methods take up considerable memory, although “with_format” uses a little less.
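For concreteness, the two loaders I am comparing are just the following (continuing the sketch above; batch size is a placeholder):

    from torch.utils.data import DataLoader

    # Variant A: plain streaming dataset (fast, but stalls every so often)
    loader_plain = DataLoader(dataset, batch_size=32, num_workers=4)

    # Variant B: torch-formatted streaming dataset (no stalls, but about half the speed)
    loader_torch = DataLoader(dataset.with_format("torch"), batch_size=32, num_workers=4)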
Furthermore, whether using “with_format” or not, dataloading with streaming appears to need 5-6 GB per worker, and that usage may be growing and shrinking unpredictably throughout training. Can anyone explain this behavior during streaming? If the dataset is streaming, why is the memory footprint still so large? A single batch has maybe 20,000 float32s, and thus should have a running memory footprint on the order of MB, not GB, right?
You’re using streaming=True with num_workers > 0.
This causes each worker to hold its own persistent iterator state, and Python doesn't automatically garbage-collect those iterators across workers unless they are manually torn down.
Over time, especially over long DataLoader runs, this creates persistent object growth in each process, even though you're “streaming.”
Fix / Mitigation Options:
1. Force workers to reinitialize frequently. Either set:

       persistent_workers=False

   or reinstantiate the DataLoader per epoch (see the first sketch after this list):

       for epoch in range(num_epochs):
           loader = build_dataloader(…)
2. Manually trigger GC cleanup inside collate_fn (a fuller usage sketch follows the list):

       import gc

       def collate_fn(batch):
           # run a collection each batch, then hand the samples back unchanged
           gc.collect()
           return batch
3. Use num_workers=0 for streaming datasets if memory is highly constrained: you sacrifice speed but regain determinism.
4. Iterate the Hugging Face dataset directly with iter(dataset) instead of wrapping it in a DataLoader if you want the tightest control (see the last sketch below).
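For option 1, a minimal sketch of the rebuild-per-epoch pattern (build_dataloader, streaming_dataset, and the loop bounds are hypothetical names, not from your script):

    from torch.utils.data import DataLoader

    def build_dataloader(dataset, batch_size=32, num_workers=4):
        # A fresh DataLoader means fresh worker processes; with
        # persistent_workers=False the workers are torn down once the
        # iterator is exhausted instead of being kept alive across epochs.
        return DataLoader(
            dataset,
            batch_size=batch_size,
            num_workers=num_workers,
            persistent_workers=False,
        )

    num_epochs = 10  # placeholder
    for epoch in range(num_epochs):
        loader = build_dataloader(streaming_dataset)
        for batch in loader:
            ...  # training step goes here
        del loader  # drop the worker references so their memory can be reclaimed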
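For option 2, if you still want the default tensor batching, one hedged variant wraps default_collate (importable from torch.utils.data in recent PyTorch versions):

    import gc
    from torch.utils.data import DataLoader, default_collate

    def gc_collate(batch):
        # collate_fn runs inside each worker process, so the collection happens
        # where the memory actually lives; note that gc.collect() on every batch
        # costs some throughput.
        gc.collect()
        return default_collate(batch)

    loader = DataLoader(streaming_dataset, batch_size=32, num_workers=4,
                        collate_fn=gc_collate)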
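For option 4, a rough sketch of manual iteration and batching in the main process (streaming_dataset as above; converting each batch to tensors is left out):

    def manual_batches(dataset, batch_size=32):
        # Pull examples straight off the streaming dataset; the only data held
        # in memory at any moment is the current partial batch.
        buffer = []
        for example in dataset:
            buffer.append(example)
            if len(buffer) == batch_size:
                yield buffer
                buffer = []
        if buffer:
            yield buffer  # final short batch

    for batch in manual_batches(streaming_dataset):
        ...  # build tensors and run the training step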
Why it happens:
StreamingDataset + multiprocessing = replicated internal state across subprocesses.
Memory isn't leaked; it's just retained longer than expected, because Python workers don't reset iterator scope automatically.
Fix provided by Triskel Data Deterministic AI.
Loop logic only works when memory is sealed.
Let me know if you’d like a memory-safe symbolic loader pattern.