Creating an HF Dataset from lakeFS with S3 storage takes too much time!

Hi,

I’m new to HF datasets, and I’m trying to create datasets from data versioned in lakeFS (with a MinIO S3 bucket as the storage backend).
Here I’m using about 30,000 PIL images from MNIST, but it takes around 12 minutes to execute, which is a lot!
From what I understand, it first loads the images into the cache and then builds the dataset.
Please find the execution screenshot below.

Is there a way to optimize this or am I doing something wrong?
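
For context, the dataset is created roughly like this (a minimal sketch of my setup; the bucket/repo/branch path and the MinIO credentials are placeholders, not my real values):

```python
from datasets import load_dataset

# Placeholder MinIO/lakeFS connection details (assumptions, not real values)
storage_options = {
    "key": "minio-access-key",
    "secret": "minio-secret-key",
    "client_kwargs": {"endpoint_url": "http://localhost:9000"},
}

# imagefolder discovers and fetches every matching object individually
dataset = load_dataset(
    "imagefolder",
    data_files="s3://my-lakefs-repo/main/mnist/**",
    storage_options=storage_options,
)
```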

Hmm… There is not much information available.

@Adam-Ben-Khalifa you can try loading the data in streaming mode. Also, once you’ve converted the data with the `datasets` library, consider saving it locally or pushing it to the Hub so you only pay the conversion cost once.
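
Something along these lines (a sketch; the paths, repo id, and credentials are placeholders):

```python
from datasets import load_dataset

storage_options = {
    "key": "minio-access-key",        # placeholder credentials
    "secret": "minio-secret-key",
    "client_kwargs": {"endpoint_url": "http://localhost:9000"},
}
data_files = "s3://my-lakefs-repo/main/mnist/**"

# Option 1: streaming mode, which avoids materializing everything up front
streamed = load_dataset(
    "imagefolder",
    data_files=data_files,
    storage_options=storage_options,
    streaming=True,
)

# Option 2: pay the cost once, then persist the result
ds = load_dataset("imagefolder", data_files=data_files,
                  storage_options=storage_options)
ds.save_to_disk("mnist_local")                # reload later with load_from_disk
ds.push_to_hub("your-username/mnist-lakefs")  # or share it on the Hub
```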

I’m already saving the dataset locally; the delay only happens the first time the dataset is created.
I also tried streaming and multiprocessing, but I’m not seeing a difference. Take a look:
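
(What I tried looks roughly like this; it’s a sketch, and the `num_proc` value, paths, and credentials are placeholder assumptions.)

```python
from datasets import load_dataset

storage_options = {
    "key": "minio-access-key",        # placeholder credentials
    "secret": "minio-secret-key",
    "client_kwargs": {"endpoint_url": "http://localhost:9000"},
}
data_files = "s3://my-lakefs-repo/main/mnist/**"

# Multiprocessing parallelizes the download, but still issues one GET per file
ds = load_dataset("imagefolder", data_files=data_files,
                  storage_options=storage_options, num_proc=8)

# Streaming skips the caching step, but iterating still fetches file by file
streamed = load_dataset("imagefolder", data_files=data_files,
                        storage_options=storage_options, streaming=True)
```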

`imagefolder` is mainly for small image datasets, so I don’t think it’s very fast.

This is helpful! I hadn’t seen those posts, since I didn’t consider the data I’m testing with to be large (around 30k images, ~9 MB total).
I’ll check them and post an update.
Thanks!

> Update

The bottleneck, from what I understand, was making one network request per file.

For 30k images, that meant 30k separate GET requests to the MinIO server through the S3 API, which was killing the performance.

Using WebDataset to pack the large number of small files into a few .tar shards, and passing `"webdataset"` instead of `"imagefolder"` to the `load_dataset` function, worked perfectly (it took only ~11 s).
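
Roughly what that looked like (a sketch; the shard pattern, paths, and credentials are placeholders, and I’m packing the raw image bytes with webdataset’s ShardWriter):

```python
import os

import webdataset as wds
from datasets import load_dataset

# 1) Pack the ~30k small files into a handful of .tar shards
image_dir = "mnist_images"  # local copy of the images (placeholder path)
with wds.ShardWriter("mnist-%06d.tar", maxcount=10_000) as sink:
    for i, name in enumerate(sorted(os.listdir(image_dir))):
        with open(os.path.join(image_dir, name), "rb") as f:
            sink.write({
                "__key__": f"{i:06d}",  # groups the files of one sample
                "png": f.read(),        # raw bytes are written as-is
            })

# 2) After uploading the shards to lakeFS/MinIO, load them in a few requests
storage_options = {
    "key": "minio-access-key",        # placeholder credentials
    "secret": "minio-secret-key",
    "client_kwargs": {"endpoint_url": "http://localhost:9000"},
}
dataset = load_dataset(
    "webdataset",
    data_files="s3://my-lakefs-repo/main/mnist-*.tar",
    storage_options=storage_options,
)
```

A few large sequential reads replace tens of thousands of tiny requests, which is where the speedup comes from.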
