Creating a HF Dataset from lakeFS with S3 storage takes too much time!

> Update

The bottleneck, from what I understand, was making one network request per file.

For 30k images, this meant 30k separate GET requests to the MinIO server through the S3 API, which was killing the performance.

Using WebDataset to pack the large number of files into a few .tar files, and passing `"webdataset"` instead of `"imagefolder"` to the `load_dataset` function, worked perfectly (the load took only ~11s).
