It seems like this may be intended behavior that does not occur when streaming Parquet-formatted datasets, but if that is not the case, could it be a bug?
### Describe the bug
I am using a dataset with streaming=True, and the issue I have is that RAM usage grows higher and higher until it is no longer sustainable.
I understand that Hugging Face keeps data in RAM during streaming, and that the more DataLoader workers there are, the more shards are held in RAM at once. The issue is that RAM usage is not constant: after each new shard is loaded, RAM usage climbs higher and higher.
### Steps to reproduce the bug
You can run this code and watch your RAM usage: after each shard of 255 examples, RAM usage grows.
```py
from datasets import load_dataset
from torch.utils.data import DataLoader
dataset = load_dataset("WaveGenAI/dataset", streaming=True)
dataloader = DataLoader(dataset["train"], num_workers=3)
for i, data in enumerate(dataloader):
    print(i, end="\r")
```
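One way to make the growth visible is to log the process's resident memory inside the loop, for example with psutil (an assumption on my part; any memory monitor works). A minimal sketch, summing the parent and worker processes since `num_workers=3` spawns separate processes:

```py
# Minimal sketch for observing RSS growth during iteration (assumes psutil is installed)
import psutil
from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("WaveGenAI/dataset", streaming=True)
dataloader = DataLoader(dataset["train"], num_workers=3)
process = psutil.Process()
for i, data in enumerate(dataloader):
    if i % 255 == 0:  # roughly once per shard of 255 examples
        # include DataLoader worker processes, which hold their own shard buffers
        rss = process.memory_info().rss + sum(
            c.memory_info().rss for c in process.children(recursive=True)
        )
        print(f"example {i}: total RSS {rss / 1e6:.0f} MB")
```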
### Expected behavior
RAM usage should stay constant (with just 3 shards loaded in RAM at a time).
### Environment info
- `datasets` version: 3.0.1
- Platform: Linux-6.10.5-arch1-1-x86_64-with-glibc2.40
- Python version: 3.12.4
- `huggingface_hub` version: 0.26.0
- PyArrow version: 17.0.0
- Pandas version: 2.2.3
- `fsspec` version: 2024.6.1
A dataset that lives in memory (e.g. one created with .from_dict()) doesn't have a cache file yet, so if you want your map() to write to disk instead of filling up your memory, you should pass a cache_file_name to map().
Note that at some point we might automatically allocate a cache file for such in-memory datasets to align with the general behavior.
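For illustration, a minimal sketch of passing cache_file_name to map() so results are written to an on-disk Arrow file instead of accumulating in memory (the dataset contents and the cache path below are made up for the example):

```py
from datasets import Dataset

# An in-memory dataset: created with .from_dict(), so no cache file exists yet
ds = Dataset.from_dict({"text": ["hello", "world"]})

# Passing cache_file_name makes map() write its results to this Arrow file on disk
# (the path is just an example)
ds = ds.map(
    lambda example: {"text": example["text"].upper()},
    cache_file_name="/tmp/my_mapped_dataset.arrow",
)
```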