jzju
June 19, 2025, 11:39am
```python
from datasets import Dataset

ids = Dataset.from_generator(gen)
ids.save_to_disk("ds")
```
I want to run code with results like the one above, but save to disk whenever one shard is filled instead of keeping the whole generator's output in RAM. Is there a way to manually flush a shard every n iterations?
In the petabyte case the data won't even fit on a single disk, so the generator and the flush should read from and write to a cloud bucket.
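Something like the manual loop below is roughly what I have in mind (just a sketch: `flush_in_shards`, the output path, and the shard size are placeholders, and writing to a bucket would also need the matching fsspec credentials via `storage_options`):

```python
from itertools import islice

from datasets import Dataset


def gen():
    for i in range(100_000):
        yield {"i": i}


def flush_in_shards(generator, rows_per_shard=10_000, out="ds_shards"):
    """Consume the generator chunk by chunk so only one shard sits in RAM at a time."""
    it = generator()
    chunks = iter(lambda: list(islice(it, rows_per_shard)), [])
    for shard_idx, chunk in enumerate(chunks):
        shard = Dataset.from_list(chunk)  # small in-memory dataset for this shard only
        # An fsspec URI such as "s3://my-bucket/ds" (plus storage_options=...) would
        # write each shard straight to a cloud bucket instead of the local disk.
        shard.save_to_disk(f"{out}/shard-{shard_idx:05d}")


flush_in_shards(gen)
```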
It seems like it could be done using the `writer_batch_size` parameter, but I'm not sure how to use it specifically…
opened 04:06PM - 05 Jul 23 UTC
closed 01:46PM - 10 Jul 23 UTC
### Describe the bug
Whenever I try to create a dataset which contains images using `Dataset.from_generator`, it freezes around 996 rows. I suppose it has something to do with memory consumption, but there's more memory available.
Somehow it worked a few times, but mostly this makes the datasets library much more cumbersome to work with, because generators are the easiest way to turn an existing dataset into a Hugging Face dataset.
I've let it run in the frozen state for way longer than it can possibly take to load the actual dataset.
Let me know if you have ideas how to resolve it!
### Steps to reproduce the bug
```python
from datasets import Dataset
import numpy as np
def gen():
    for row in range(10000):
        yield {"i": np.random.rand(512, 512, 3)}

Dataset.from_generator(gen)
# -> 90% of the time gets stuck around 1000 rows
```
### Expected behavior
Should continue and go through all the examples yielded by the generator, or at least throw an error or somehow communicate what's going on.
### Environment info
- `datasets` version: 2.8.0
- Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyArrow version: 12.0.1
- Pandas version: 1.5.1
By default, we write data to disk (so it can be memory-mapped) every 1000 rows/samples. You can control this with the `writer_batch_size` parameter.
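For example (a sketch: `Dataset.from_generator` forwards extra keyword arguments such as `writer_batch_size` to the underlying builder in recent `datasets` releases, so exact support may depend on your installed version):

```python
import numpy as np
from datasets import Dataset


def gen():
    for _ in range(10_000):
        yield {"i": np.random.rand(512, 512, 3)}


# Flush rows to the Arrow cache every 100 examples instead of the default 1000,
# so at most ~100 of these large arrays are buffered in RAM at a time.
ds = Dataset.from_generator(gen, writer_batch_size=100)
ds.save_to_disk("ds")
```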
opened 11:18AM - 15 Nov 23 UTC
enhancement
### Feature request
Add an argument in `save_to_disk` regarding batch size, which would be passed to `shard` and other methods.
### Motivation
The `Dataset.save_to_disk` method currently calls `shard` without passing a `writer_batch_size` argument, thus implicitly using the default value (1000). This can result in RAM saturation when using a lot of processes on long text sequences or other modalities, or for specific IO configs.
### Your contribution
I would be glad to submit a PR, as long as it does not imply extensive tests refactoring.
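As a point of reference, recent `datasets` releases already expose some shard-level knobs on `save_to_disk` (shard size or count, and `num_proc`), just not the writer batch size this issue asks for; a rough sketch with arbitrary values, continuing with the `ds` from the sketch above:

```python
# Cap each output shard at roughly 500 MB and write with 4 processes.
# The internal writer batch size is still the default (1000 rows), which is
# exactly the value the feature request above wants to make configurable.
ds.save_to_disk(
    "ds_sharded",
    max_shard_size="500MB",
    num_proc=4,
)
```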
opened 11:12AM - 05 Oct 23 UTC
enhancement
### Feature request
Hi,
could you add an implementation of a batched `IterableDataset`? It already supports an option to do batch iteration via `.iter(batch_size=...)`, but this cannot be used in combination with a torch `DataLoader`, since it just returns an iterator.
### Motivation
The current implementation loads each element of a batch individually, which can be very slow when `batch_size` is large. I did some experiments [here](https://discuss.huggingface.co/t/slow-dataloader-with-big-batch-size/57224), and using batched iteration would speed up data loading significantly.
### Your contribution
N/A
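Until such a feature lands, one common workaround (not part of the `datasets` API; the wrapper class and dataset name below are just illustrative) is to wrap the batched `.iter()` in a small torch `IterableDataset` and turn off the DataLoader's own batching:

```python
import torch
from datasets import load_dataset


class BatchedHFIterable(torch.utils.data.IterableDataset):
    """Yields ready-made batches from a Hugging Face dataset's .iter()."""

    def __init__(self, hf_dataset, batch_size):
        self.hf_dataset = hf_dataset
        self.batch_size = batch_size

    def __iter__(self):
        # .iter(batch_size=...) yields dicts mapping column names to lists of length batch_size
        yield from self.hf_dataset.iter(batch_size=self.batch_size)


hf_ds = load_dataset("imdb", split="train", streaming=True)
loader = torch.utils.data.DataLoader(
    BatchedHFIterable(hf_ds, batch_size=256),
    batch_size=None,  # batches are already formed, so don't re-collate
)

for batch in loader:
    ...  # each batch is a dict of 256-element column lists
```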