Hi,
I’m new to HF Datasets, and I’m trying to create datasets from data versioned in lakeFS (with a MinIO S3 bucket as the storage backend).
Here I’m using ~30,000 PIL images from the MNIST data; however, it takes around 12 minutes to execute, which is a lot!
From what I understand, it loads the images into the cache and then builds the dataset.
– Please find below the execution screenshot –
Is there a way to optimize this or am I doing something wrong?
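For context, the loading code is essentially the following sketch (the endpoint, credentials, and repository path are placeholders, not my actual values):

```python
from datasets import load_dataset

# lakeFS exposes the repository through an S3-compatible endpoint (MinIO backend),
# so the files can be read via fsspec/s3fs by passing storage_options
storage_options = {
    "key": "ACCESS_KEY_ID",                                          # placeholder
    "secret": "SECRET_ACCESS_KEY",                                   # placeholder
    "client_kwargs": {"endpoint_url": "http://lakefs.local:8000"},   # placeholder
}

dataset = load_dataset(
    "imagefolder",
    data_files="s3://my-repo/main/mnist/**",  # placeholder: lakeFS repo/branch/path
    storage_options=storage_options,
    split="train",
)
```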
Hmm… There is not much information available.
A related GitHub issue (opened 06 Dec 2023, closed 03 Jul 2024):
My dataset is stored on the company's lakefs server. How can I write code to load the dataset? It would be great if code examples or some references could be provided.
@Adam-Ben-Khalifa you can try loading the data in streaming mode; also, after you’ve converted the data with the datasets library, consider saving it locally or pushing it to the Hub.
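Something along these lines, as a rough sketch (the bucket path, credentials, and repo id below are placeholders):

```python
from datasets import load_dataset, load_from_disk

# same S3/MinIO credentials as above (placeholders)
storage_options = {"key": "ACCESS_KEY_ID", "secret": "SECRET_ACCESS_KEY",
                   "client_kwargs": {"endpoint_url": "http://lakefs.local:8000"}}

# streaming mode: samples are read lazily instead of building the full dataset up front
streamed = load_dataset("imagefolder",
                        data_files="s3://my-repo/main/mnist/**",  # placeholder path
                        storage_options=storage_options,
                        streaming=True, split="train")

# or build it once, then persist it so the slow first build is never repeated
dataset = load_dataset("imagefolder",
                       data_files="s3://my-repo/main/mnist/**",
                       storage_options=storage_options, split="train")
dataset.save_to_disk("mnist_local")                 # reload later with load_from_disk
# dataset.push_to_hub("username/mnist-lakefs")      # placeholder repo id
```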
I’m saving the dataset locally; the delay only happens the first time the dataset is created.
Also, I tried streaming and multiprocessing but I’m not seeing a difference – take a look:
`imagefolder` is mainly for small image datasets, so I don’t think it’s very fast.
A related GitHub issue (opened 01 Dec 2022):
### Describe the bug
While testing image dataset creation, I'm seeing significant performance bottlenecks with imagefolders when scanning a directory structure with a large number of images.
## Setup
* Nested directories (5 levels deep)
* 3M+ images
* 1 `metadata.jsonl` file
## Performance Degradation Point 1
Degradation occurs because [`get_data_files_patterns`](https://github.com/huggingface/datasets/blob/main/src/datasets/data_files.py#L231-L243) runs the exact same scan for many different types of patterns, and there doesn't seem to be a way to easily limit this. It's controlled by the definition of [`ALL_DEFAULT_PATTERNS`](https://github.com/huggingface/datasets/blob/main/src/datasets/data_files.py#L82-L85).
One scan with 3M+ files takes about 10-15 minutes to complete on my setup, so having those extra scans really slows things down – from 10 minutes to 60+. Most of the scans return no matches, but they still take a significant amount of time to complete – hence the poor performance.
As a side effect, when this scan is run on 3M+ image files, Python also consumes up to 12 GB of RAM, which is not ideal.
## Performance Degradation Point 2
The second performance bottleneck is in [`PackagedDatasetModuleFactory.get_module`](https://github.com/huggingface/datasets/blob/d7dfbc83d68e87ba002c5eb2555f7a932e59038a/src/datasets/load.py#L707-L711), which calls `DataFilesDict.from_local_or_remote`.
It runs for a long time (60min+), consuming significant amounts of RAM – even more than point 1 above. Based on `iostat -d 2`, it performs **zero** disk operations, which to me suggests that there is a code-based bottleneck there that could be sorted out.
### Steps to reproduce the bug
```python
from datasets import load_dataset
import os
import huggingface_hub
dataset = load_dataset(
    'imagefolder',
    data_dir='/some/path',
    # just to spell it out:
    split=None,
    drop_labels=True,
    keep_in_memory=False,
)
dataset.push_to_hub('account/dataset', private=True)
```
### Expected behavior
While it's certainly possible to write a custom loader to replace `ImageFolder` with, it'd be great if the off-the-shelf `ImageFolder` would by default have a setup that can scale to large datasets.
Or perhaps there could be a dedicated loader just for large datasets that trades off flexibility for performance? As in, maybe you have to define explicitly how you want it to work rather than it trying to guess your data structure like `_get_data_files_patterns()` does?
### Environment info
- `datasets` version: 2.7.1
- Platform: Linux-4.14.296-222.539.amzn2.x86_64-x86_64-with-glibc2.2.5
- Python version: 3.7.10
- PyArrow version: 10.0.1
- Pandas version: 1.3.5
Hi, I’m new to Hugging Face Datasets and I’m trying to train ControlNet for Stable Diffusion on a custom dataset with approximately 300k images; the size of these images is (768, 768).
Now, I’m stuck on the following lines of code:
dataset = load_dataset("imagefolder", data_dir="path/to/the/dataset")
print(dataset['train'][0])
Then, I have a few questions.
Does imagefolder load images (load and decode) into memory at setup? If so, can I disable it?
Are there any implicit process Datasets do wh…
I have a huge (100GB+) dataset of audio (.wav files) and its respective metadata. I was able to easily load the dataset using load_dataset and upload it using push_to_hub, which converts it to Parquet files. What is the best way to upload such a large dataset (particularly images and audio)? I want to be able to use streaming with it, and to update the metadata without having to re-upload the entire dataset.
This is helpful. I didn’t see these posts since I didn’t consider the data I’m testing with to be large (around 30k images, ~9 MB total).
I’ll check them and post an update
Thanks!
> Update
The bottleneck, from what I understand, was making one network request per file.
For 30k images, this meant 30k separate GET requests to the MinIO server through the S3 API, and that was killing the performance.
Using WebDataset to pack the large number of files into a few .tar files and passing “webdataset” instead of “imagefolder” to the load_dataset function worked perfectly (it took only ~11s).
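Roughly, the packing and loading steps look like the sketch below (paths, shard names, and credentials are illustrative, and the label handling assumes an imagefolder-style class-per-directory layout):

```python
import glob
import os
import webdataset as wds
from datasets import load_dataset

# 1) Pack the individual image files into a few .tar shards (one-time step, done on a
#    local copy of the files before uploading the shards to the lakeFS repo)
files = sorted(glob.glob("mnist_images/**/*.png", recursive=True))  # placeholder local copy
with wds.ShardWriter("mnist-%04d.tar", maxcount=10_000) as sink:
    for idx, path in enumerate(files):
        label = os.path.basename(os.path.dirname(path))  # class folder name
        with open(path, "rb") as f:
            sink.write({
                "__key__": f"{idx:06d}",
                "png": f.read(),
                "cls": label.encode(),
            })

# 2) Load the shards with the "webdataset" builder: a handful of large sequential reads
#    instead of one GET request per image
storage_options = {"key": "ACCESS_KEY_ID", "secret": "SECRET_ACCESS_KEY",
                   "client_kwargs": {"endpoint_url": "http://lakefs.local:8000"}}  # placeholders
dataset = load_dataset("webdataset",
                       data_files="s3://my-repo/main/mnist/mnist-*.tar",  # placeholder path
                       storage_options=storage_options, split="train")
```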
This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.