> Update
The bottleneck, from what I understand, was making one network request per file.
For 30k images, this meant 30k separate GET requests to the MinIO server through the S3 API, which was killing performance.
Using WebDataset to pack the large number of files into a few .tar files, and passing “webdataset” instead of “imagefolder” to the load_dataset function, worked perfectly (it took only ~11s).
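
For anyone who wants to try the same thing, here is a rough sketch of the packing step using only Python's standard `tarfile` module. It is not the exact script I ran: the function names, the `shard-00000.tar` naming, and the `files_per_shard` value are made up for illustration, and it assumes the images are on local disk before being uploaded as shards.

```python
import tarfile
from pathlib import Path


def pack_into_shards(image_dir, out_dir, files_per_shard=1000):
    """Pack every file under image_dir into numbered .tar shards.

    Returns the list of shard paths. Names and shard size are illustrative.
    """
    image_dir, out_dir = Path(image_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in image_dir.rglob("*") if p.is_file())

    shard_paths = []
    for start in range(0, len(files), files_per_shard):
        shard_path = out_dir / f"shard-{start // files_per_shard:05d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for path in files[start:start + files_per_shard]:
                # Store each file under its path relative to image_dir, so the
                # member name (minus extension) can serve as the sample key.
                tar.add(path, arcname=path.relative_to(image_dir).as_posix())
        shard_paths.append(str(shard_path))
    return shard_paths


def load_shards(shard_paths):
    """Load the shards with the "webdataset" builder instead of "imagefolder"."""
    # Imported lazily so the packing step works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("webdataset", data_files={"train": shard_paths}, split="train")
```

With something like this, 30k images at 1000 files per shard become 30 .tar files, so reading them takes on the order of 30 requests instead of 30k.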