Hi,
I’m new to HF Datasets, and I’m trying to create datasets from data versioned in lakeFS (with a MinIO S3 bucket as the storage backend).
Here I’m using ~30,000 PIL images from the MNIST data; however, it takes around 12 minutes to execute, which is a lot!
From what I understand, it loads the images into the cache and then builds the dataset.
– Please find below the execution screenshot –
Is there a way to optimize this or am I doing something wrong?
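For context, the loading code is essentially the following sketch (the endpoint, credentials, and repository path are placeholders, not my actual values):

```python
from datasets import load_dataset

# lakeFS exposes the repository through an S3-compatible endpoint (MinIO backend),
# so the files can be read via fsspec/s3fs by passing storage_options
storage_options = {
    "key": "ACCESS_KEY_ID",                                          # placeholder
    "secret": "SECRET_ACCESS_KEY",                                   # placeholder
    "client_kwargs": {"endpoint_url": "http://lakefs.local:8000"},   # placeholder
}

dataset = load_dataset(
    "imagefolder",
    data_files="s3://my-repo/main/mnist/**",  # placeholder: lakeFS repo/branch/path
    storage_options=storage_options,
    split="train",
)
```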
Hmm… There is not much information available.
A related GitHub issue (opened 06 Dec 2023, closed 03 Jul 2024):
My dataset is stored on the company's lakefs server. How can I write code to load the dataset? It would be great if code examples or some references could be provided.
@Adam-Ben-Khalifa you can try loading the data in streaming mode; also, after you’ve converted the data with the datasets library, consider saving it locally or pushing it to the Hub.
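Something along these lines, as a rough sketch (the bucket path, credentials, and repo id below are placeholders):

```python
from datasets import load_dataset, load_from_disk

# same S3/MinIO credentials as above (placeholders)
storage_options = {"key": "ACCESS_KEY_ID", "secret": "SECRET_ACCESS_KEY",
                   "client_kwargs": {"endpoint_url": "http://lakefs.local:8000"}}

# streaming mode: samples are read lazily instead of building the full dataset up front
streamed = load_dataset("imagefolder",
                        data_files="s3://my-repo/main/mnist/**",  # placeholder path
                        storage_options=storage_options,
                        streaming=True, split="train")

# or build it once, then persist it so the slow first build is never repeated
dataset = load_dataset("imagefolder",
                       data_files="s3://my-repo/main/mnist/**",
                       storage_options=storage_options, split="train")
dataset.save_to_disk("mnist_local")                 # reload later with load_from_disk
# dataset.push_to_hub("username/mnist-lakefs")      # placeholder repo id
```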
I’m saving the dataset locally; the delay only happens the first time the dataset is created.
Also, I tried streaming and multiprocessing but I’m not seeing a difference – take a look:
`imagefolder` is mainly for small image datasets, so I don’t think it’s very fast.
A related GitHub issue (opened 01 Dec 2022):
### Describe the bug
While testing image dataset creation, I'm seeing significant performance bottlenecks with imagefolders when scanning a directory structure with a large number of images.
## Setup
* Nested directories (5 levels deep)
* 3M+ images
* 1 `metadata.jsonl` file
## Performance Degradation Point 1
Degradation occurs because [`get_data_files_patterns`](https://github.com/huggingface/datasets/blob/main/src/datasets/data_files.py#L231-L243) runs the exact same scan for many different types of patterns, and there doesn't seem to be a way to easily limit this. It's controlled by the definition of [`ALL_DEFAULT_PATTERNS`](https://github.com/huggingface/datasets/blob/main/src/datasets/data_files.py#L82-L85).
One scan with 3M+ files takes about 10-15 minutes to complete on my setup, so having those extra scans really slows things down – from 10 minutes to 60+. Most of the scans return no matches, but they still take a significant amount of time to complete – hence the poor performance.
As a side effect, when this scan is run on 3M+ image files, Python also consumes up to 12 GB of RAM, which is not ideal.
## Performance Degradation Point 2
The second performance bottleneck is in [`PackagedDatasetModuleFactory.get_module`](https://github.com/huggingface/datasets/blob/d7dfbc83d68e87ba002c5eb2555f7a932e59038a/src/datasets/load.py#L707-L711), which calls `DataFilesDict.from_local_or_remote`.
It runs for a long time (60min+), consuming significant amounts of RAM – even more than point 1 above. Based on `iostat -d 2`, it performs **zero** disk operations, which to me suggests that there is a code-based bottleneck there that could be sorted out.
### Steps to reproduce the bug
```python
from datasets import load_dataset
import os
import huggingface_hub
dataset = load_dataset(
    'imagefolder',
    data_dir='/some/path',
    # just to spell it out:
    split=None,
    drop_labels=True,
    keep_in_memory=False,
)
dataset.push_to_hub('account/dataset', private=True)
```
### Expected behavior
While it's certainly possible to write a custom loader to replace `ImageFolder` with, it'd be great if the off-the-shelf `ImageFolder` would by default have a setup that can scale to large datasets.
Or perhaps there could be a dedicated loader just for large datasets that trades off flexibility for performance? As in, maybe you have to define explicitly how you want it to work rather than it trying to guess your data structure like `_get_data_files_patterns()` does?
### Environment info
- `datasets` version: 2.7.1
- Platform: Linux-4.14.296-222.539.amzn2.x86_64-x86_64-with-glibc2.2.5
- Python version: 3.7.10
- PyArrow version: 10.0.1
- Pandas version: 1.3.5
Hi, I’m new to Hugging Face Datasets and I’m trying to train ControlNet for Stable Diffusion on a custom dataset with approximately 300k images; the size of these images is (768, 768).
Now, I’m stuck on the following lines of code:
dataset = load_dataset("imagefolder", data_dir="path/to/the/dataset")
print(dataset['train'][0])
Then, I have a few questions.
Does imagefolder load images (load and decode) into memory at setup? If so, can I disable it?
Are there any implicit process Datasets do wh…
I have a huge (100GB+) dataset of audio (.wav files) and its respective metadata. I was able to easily load the dataset using load_dataset and upload it using push_to_hub, which converts it to Parquet files. What is the best way to upload such a large dataset (particularly images and audio)? I want to be able to use streaming with it, and to update the metadata without having to re-upload the entire dataset.
This is helpful. I didn’t see these posts since I didn’t consider the data I’m testing with to be large (around 30k images, ~9 MB total).
I’ll check them and post an update
Thanks!
> Update
The bottleneck, from what I understand, was making one network request per file.
For 30k images, this meant 30k separate GET requests to the MinIO server through the S3 API, and that was killing the performance.
Using WebDataset to pack the large number of files into a few .tar files and passing “webdataset” instead of “imagefolder” to the load_dataset function worked perfectly (it took only ~11s).
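Roughly, the packing and loading steps look like the sketch below (paths, shard names, and credentials are illustrative, and the label handling assumes an imagefolder-style class-per-directory layout):

```python
import glob
import os
import webdataset as wds
from datasets import load_dataset

# 1) Pack the individual image files into a few .tar shards (one-time step, done on a
#    local copy of the files before uploading the shards to the lakeFS repo)
files = sorted(glob.glob("mnist_images/**/*.png", recursive=True))  # placeholder local copy
with wds.ShardWriter("mnist-%04d.tar", maxcount=10_000) as sink:
    for idx, path in enumerate(files):
        label = os.path.basename(os.path.dirname(path))  # class folder name
        with open(path, "rb") as f:
            sink.write({
                "__key__": f"{idx:06d}",
                "png": f.read(),
                "cls": label.encode(),
            })

# 2) Load the shards with the "webdataset" builder: a handful of large sequential reads
#    instead of one GET request per image
storage_options = {"key": "ACCESS_KEY_ID", "secret": "SECRET_ACCESS_KEY",
                   "client_kwargs": {"endpoint_url": "http://lakefs.local:8000"}}  # placeholders
dataset = load_dataset("webdataset",
                       data_files="s3://my-repo/main/mnist/mnist-*.tar",  # placeholder path
                       storage_options=storage_options, split="train")
```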
This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.