Hey Datasets community. I am having a minor issue with streaming a dataset. I have a very large dataset, over 500GB, that I have sharded into 5 files. Naturally I chose to use a streaming dataset since my machine does not have enough RAM to hold the entire dataset. I create my HF dataset with essentially the following line (file format and paths are simplified placeholders here):
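    from datasets import load_dataset

    # Simplified: "parquet" and the shard filenames are placeholders for my real files.
    dataset = load_dataset(
        "parquet",
        data_files=["shard_0.parquet", "shard_1.parquet", "shard_2.parquet",
                    "shard_3.parquet", "shard_4.parquet"],
        split="train",
        streaming=True,
    )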
I choose to have 4 workers. It works and is plenty fast, but during an epoch the memory footprint continually grows. When I look at htop in Linux, my training program spawned about 54 processes, all of which currently take up ~4.8GB of memory. Earlier today that same number for those processes was ~3GB. My conception of a streaming dataloader is that it should load in a batch and after a training step forfeit that memory so that it can load in the next batch. Therefore dataloading should always maintain the memory footprint of 1 batch*num_workers. Is there a memory leak here? I have attached a screenshot of htop to help you get an idea of what is going on.
After reading both posts, it's not clear that there is a resolution to the reported behavior. I see that Quentin recommended dataset.with_format("torch"). I tried this, but it slows down the dataloading compared to the dataset without the “with_format” method, almost by a factor of 2.
I have spent some time observing the different behavior of the two methods. Without “with_format” it loads data very quickly but gets hung up at certain points, where the dataloader is suspended for a few seconds, sometimes tens of seconds. When using “with_format” it never gets hung up, but as I said it is only about half as fast. Both methods take up considerable memory, although “with_format” uses a little less.
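For concreteness, the two loaders I am comparing are just the following (continuing the sketch above; batch size is a placeholder):

    from torch.utils.data import DataLoader

    # Variant A: plain streaming dataset (fast, but stalls every so often)
    loader_plain = DataLoader(dataset, batch_size=32, num_workers=4)

    # Variant B: torch-formatted streaming dataset (no stalls, but about half the speed)
    loader_torch = DataLoader(dataset.with_format("torch"), batch_size=32, num_workers=4)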
Furthermore, whether using “with_format” or not, dataloading with streaming appears to need 5-6 GB per worker, and that usage may be growing and shrinking unpredictably throughout training. Can anyone explain this behavior during streaming? If the dataset is streaming, why is the memory footprint still so large? A single batch has maybe 20,000 float32s, and thus should have a running memory footprint on the order of MB, not GB, right?
You’re using streaming=True with num_workers > 0.
This causes each worker to hold its own persistent iterator state, and Python doesn't automatically garbage-collect those iterators across workers unless they are manually torn down.
Over time, especially over long DataLoader runs, this creates persistent object growth in each process, even though you're “streaming.”
Fix / Mitigation Options:
1. Force workers to reinitialize frequently. Either set:

       persistent_workers=False

   or reinstantiate the DataLoader per epoch (see the first sketch after this list):

       for epoch in range(num_epochs):
           loader = build_dataloader(…)
2. Manually trigger GC cleanup inside collate_fn (a fuller usage sketch follows the list):

       import gc

       def collate_fn(batch):
           # run a collection each batch, then hand the samples back unchanged
           gc.collect()
           return batch
3. Use num_workers=0 for streaming datasets if memory is highly constrained: you sacrifice speed but regain determinism.
4. Iterate the Hugging Face dataset directly with iter(dataset) instead of wrapping it in a DataLoader if you want the tightest control (see the last sketch below).
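For option 1, a minimal sketch of the rebuild-per-epoch pattern (build_dataloader, streaming_dataset, and the loop bounds are hypothetical names, not from your script):

    from torch.utils.data import DataLoader

    def build_dataloader(dataset, batch_size=32, num_workers=4):
        # A fresh DataLoader means fresh worker processes; with
        # persistent_workers=False the workers are torn down once the
        # iterator is exhausted instead of being kept alive across epochs.
        return DataLoader(
            dataset,
            batch_size=batch_size,
            num_workers=num_workers,
            persistent_workers=False,
        )

    num_epochs = 10  # placeholder
    for epoch in range(num_epochs):
        loader = build_dataloader(streaming_dataset)
        for batch in loader:
            ...  # training step goes here
        del loader  # drop the worker references so their memory can be reclaimed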
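For option 2, if you still want the default tensor batching, one hedged variant wraps default_collate (importable from torch.utils.data in recent PyTorch versions):

    import gc
    from torch.utils.data import DataLoader, default_collate

    def gc_collate(batch):
        # collate_fn runs inside each worker process, so the collection happens
        # where the memory actually lives; note that gc.collect() on every batch
        # costs some throughput.
        gc.collect()
        return default_collate(batch)

    loader = DataLoader(streaming_dataset, batch_size=32, num_workers=4,
                        collate_fn=gc_collate)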
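For option 4, a rough sketch of manual iteration and batching in the main process (streaming_dataset as above; converting each batch to tensors is left out):

    def manual_batches(dataset, batch_size=32):
        # Pull examples straight off the streaming dataset; the only data held
        # in memory at any moment is the current partial batch.
        buffer = []
        for example in dataset:
            buffer.append(example)
            if len(buffer) == batch_size:
                yield buffer
                buffer = []
        if buffer:
            yield buffer  # final short batch

    for batch in manual_batches(streaming_dataset):
        ...  # build tensors and run the training step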
Why it happens:
StreamingDataset + multiprocessing = replicated internal state across subprocesses.
Memory isn't leaked; it's just retained longer than expected, because Python workers don't reset iterator scope automatically.
Fix provided by Triskel Data Deterministic AI.
Loop logic only works when memory is sealed.
Let me know if you’d like a memory-safe symbolic loader pattern.