Creating a HF Dataset from lakeFS with S3 storage takes too much time!

> Update

The bottleneck, from what I understand, was making one network request per file.

For 30k images, this meant 30k separate GET requests to the MinIO server through the S3 API, which was killing the performance.

Using WebDataset to pack the large number of files into a few .tar files, and passing `"webdataset"` instead of `"imagefolder"` to the `load_dataset` function, worked perfectly (the load took only ~11s).
