It seems like this may be intended behavior that does not occur when streaming Parquet-formatted datasets, but if that is not the case, could it be a bug?
### Describe the bug
I am using a dataset with streaming=True, and the issue I have is that RAM usage grows higher and higher until it is no longer sustainable.
I understand that Hugging Face keeps data in RAM during streaming, and that the more DataLoader workers there are, the more shards are held in RAM at once. The issue is that RAM usage is not constant: after each new shard is loaded, RAM usage climbs higher and higher.
### Steps to reproduce the bug
You can run this code and watch your RAM usage: after each shard of 255 examples, RAM usage grows.
```py
from datasets import load_dataset
from torch.utils.data import DataLoader
dataset = load_dataset("WaveGenAI/dataset", streaming=True)
dataloader = DataLoader(dataset["train"], num_workers=3)
for i, data in enumerate(dataloader):
    print(i, end="\r")
```
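One way to make the growth visible is to log the process's resident memory inside the loop, for example with psutil (an assumption on my part; any memory monitor works). A minimal sketch, summing the parent and worker processes since `num_workers=3` spawns separate processes:

```py
# Minimal sketch for observing RSS growth during iteration (assumes psutil is installed)
import psutil
from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("WaveGenAI/dataset", streaming=True)
dataloader = DataLoader(dataset["train"], num_workers=3)
process = psutil.Process()
for i, data in enumerate(dataloader):
    if i % 255 == 0:  # roughly once per shard of 255 examples
        # include DataLoader worker processes, which hold their own shard buffers
        rss = process.memory_info().rss + sum(
            c.memory_info().rss for c in process.children(recursive=True)
        )
        print(f"example {i}: total RSS {rss / 1e6:.0f} MB")
```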
### Expected behavior
RAM usage should stay constant (with just 3 shards loaded in RAM at a time).
### Environment info
- `datasets` version: 3.0.1
- Platform: Linux-6.10.5-arch1-1-x86_64-with-glibc2.40
- Python version: 3.12.4
- `huggingface_hub` version: 0.26.0
- PyArrow version: 17.0.0
- Pandas version: 2.2.3
- `fsspec` version: 2024.6.1
A dataset that lives in memory (e.g. one created with .from_dict()) doesn't have a cache file yet, so if you want your map() to write to disk instead of filling up your memory, you should pass a cache_file_name to map().
Note that at some point we might automatically allocate a cache file for such in-memory datasets to align with the general behavior.
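For illustration, a minimal sketch of passing cache_file_name to map() so results are written to an on-disk Arrow file instead of accumulating in memory (the dataset contents and the cache path below are made up for the example):

```py
from datasets import Dataset

# An in-memory dataset: created with .from_dict(), so no cache file exists yet
ds = Dataset.from_dict({"text": ["hello", "world"]})

# Passing cache_file_name makes map() write its results to this Arrow file on disk
# (the path is just an example)
ds = ds.map(
    lambda example: {"text": example["text"].upper()},
    cache_file_name="/tmp/my_mapped_dataset.arrow",
)
```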