jzju
June 19, 2025, 11:39am
```python
from datasets import Dataset

ids = Dataset.from_generator(gen)
ids.save_to_disk("ds")
```
I want to run code with results like the one above, but save to disk whenever one shard is filled instead of keeping the whole generator's output in RAM. Is there a way to manually flush a shard every n iterations?
In the petabyte case the data won't even fit on a single disk, so the generator and the flush should read from and write to a cloud bucket.
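Something like the manual loop below is roughly what I have in mind (just a sketch: `flush_in_shards`, the output path, and the shard size are placeholders, and writing to a bucket would also need the matching fsspec credentials via `storage_options`):

```python
from itertools import islice

from datasets import Dataset


def gen():
    for i in range(100_000):
        yield {"i": i}


def flush_in_shards(generator, rows_per_shard=10_000, out="ds_shards"):
    """Consume the generator chunk by chunk so only one shard sits in RAM at a time."""
    it = generator()
    chunks = iter(lambda: list(islice(it, rows_per_shard)), [])
    for shard_idx, chunk in enumerate(chunks):
        shard = Dataset.from_list(chunk)  # small in-memory dataset for this shard only
        # An fsspec URI such as "s3://my-bucket/ds" (plus storage_options=...) would
        # write each shard straight to a cloud bucket instead of the local disk.
        shard.save_to_disk(f"{out}/shard-{shard_idx:05d}")


flush_in_shards(gen)
```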
It seems like it could be done using the `writer_batch_size` parameter, but I'm not sure how to use it specifically…
opened 04:06PM - 05 Jul 23 UTC
closed 01:46PM - 10 Jul 23 UTC
### Describe the bug
Whenever I try to create a dataset which contains images using `Dataset.from_generator`, it freezes around 996 rows. I suppose it has something to do with memory consumption, but there's more memory available.
Somehow it worked a few times, but mostly this makes the datasets library much more cumbersome to work with, because generators are the easiest way to turn an existing dataset into a Hugging Face dataset.
I've let it run in the frozen state for way longer than it can possibly take to load the actual dataset.
Let me know if you have ideas how to resolve it!
### Steps to reproduce the bug
```python
from datasets import Dataset
import numpy as np
def gen():
    for row in range(10000):
        yield {"i": np.random.rand(512, 512, 3)}

Dataset.from_generator(gen)
# -> 90% of the time gets stuck around 1000 rows
```
### Expected behavior
Should continue and go through all the examples yielded by the generator, or at least throw an error or somehow communicate what's going on.
### Environment info
- `datasets` version: 2.8.0
- Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyArrow version: 12.0.1
- Pandas version: 1.5.1
By default, we write data to disk (so it can be memory-mapped) every 1000 rows/samples. You can control this with the `writer_batch_size` parameter.
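For example (a sketch: `Dataset.from_generator` forwards extra keyword arguments such as `writer_batch_size` to the underlying builder in recent `datasets` releases, so exact support may depend on your installed version):

```python
import numpy as np
from datasets import Dataset


def gen():
    for _ in range(10_000):
        yield {"i": np.random.rand(512, 512, 3)}


# Flush rows to the Arrow cache every 100 examples instead of the default 1000,
# so at most ~100 of these large arrays are buffered in RAM at a time.
ds = Dataset.from_generator(gen, writer_batch_size=100)
ds.save_to_disk("ds")
```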
opened 11:18AM - 15 Nov 23 UTC
enhancement
### Feature request
Add an argument in `save_to_disk` regarding batch size, which would be passed to `shard` and other methods.
### Motivation
The `Dataset.save_to_disk` method currently calls `shard` without passing a `writer_batch_size` argument, thus implicitly using the default value (1000). This can result in RAM saturation when using a lot of processes on long text sequences or other modalities, or for specific IO configs.
### Your contribution
I would be glad to submit a PR, as long as it does not imply extensive tests refactoring.
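As a point of reference, recent `datasets` releases already expose some shard-level knobs on `save_to_disk` (shard size or count, and `num_proc`), just not the writer batch size this issue asks for; a rough sketch with arbitrary values, continuing with the `ds` from the sketch above:

```python
# Cap each output shard at roughly 500 MB and write with 4 processes.
# The internal writer batch size is still the default (1000 rows), which is
# exactly the value the feature request above wants to make configurable.
ds.save_to_disk(
    "ds_sharded",
    max_shard_size="500MB",
    num_proc=4,
)
```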
opened 11:12AM - 05 Oct 23 UTC
enhancement
### Feature request
Hi,
could you add an implementation of a batched `IterableDataset`? It already supports an option to do batch iteration via `.iter(batch_size=...)`, but this cannot be used in combination with a torch `DataLoader`, since it just returns an iterator.
### Motivation
The current implementation loads each element of a batch individually, which can be very slow when `batch_size` is large. I did some experiments [here](https://discuss.huggingface.co/t/slow-dataloader-with-big-batch-size/57224), and using batched iteration would speed up data loading significantly.
### Your contribution
N/A
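Until such a feature lands, one common workaround (not part of the `datasets` API; the wrapper class and dataset name below are just illustrative) is to wrap the batched `.iter()` in a small torch `IterableDataset` and turn off the DataLoader's own batching:

```python
import torch
from datasets import load_dataset


class BatchedHFIterable(torch.utils.data.IterableDataset):
    """Yields ready-made batches from a Hugging Face dataset's .iter()."""

    def __init__(self, hf_dataset, batch_size):
        self.hf_dataset = hf_dataset
        self.batch_size = batch_size

    def __iter__(self):
        # .iter(batch_size=...) yields dicts mapping column names to lists of length batch_size
        yield from self.hf_dataset.iter(batch_size=self.batch_size)


hf_ds = load_dataset("imdb", split="train", streaming=True)
loader = torch.utils.data.DataLoader(
    BatchedHFIterable(hf_ds, batch_size=256),
    batch_size=None,  # batches are already formed, so don't re-collate
)

for batch in loader:
    ...  # each batch is a dict of 256-element column lists
```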