I am importing an image dataset from an external source that is several terabytes in size.
Based on my research, WebDataset and Parquet seem to be the best choices. What are the advantages and disadvantages of WebDataset versus Parquet?
Currently, I am using the WebDataset format, but I have noticed several limitations.
The most critical issue is that all samples must share the same file extension. This means I have to convert PNG and other lossless formats with alpha channels into JPEG’s RGB format, which results in information loss.
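For concreteness, my current conversion looks roughly like this (shard pattern, paths, and JPEG quality are just illustrative):

```python
import io
from pathlib import Path

from PIL import Image
import webdataset as wds

# Write shards where every sample uses the same ".jpg" extension.
with wds.ShardWriter("shards-%05d.tar", maxcount=10_000) as writer:
    for path in Path("images").glob("*.png"):
        img = Image.open(path).convert("RGB")  # alpha channel is discarded here
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=95)  # lossy re-encode
        writer.write({"__key__": path.stem, "jpg": buf.getvalue()})
```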
Additionally, in my tests, WebDataset appears to be significantly slower than Parquet for both reading and writing.
I have not yet tried storing data in Parquet. Compared to WebDataset, does Parquet have any additional limitations?
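For what it’s worth, this is roughly what I have in mind for Parquet: keep the original encoded bytes in a binary column so no format conversion is needed. Untested on my end, and the column names are made up:

```python
import io
from pathlib import Path

from PIL import Image
import pyarrow as pa
import pyarrow.parquet as pq

paths = sorted(Path("images").glob("*"))
table = pa.table({
    "key": [p.stem for p in paths],
    "ext": [p.suffix.lstrip(".") for p in paths],
    "image_bytes": [p.read_bytes() for p in paths],  # original file bytes, lossless
})
pq.write_table(table, "images.parquet", compression="zstd")

# Read back: PIL decodes from the stored bytes, alpha intact for PNGs.
loaded = pq.read_table("images.parquet")
first = Image.open(io.BytesIO(loaded.column("image_bytes")[0].as_py()))
```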
I’m having the same concerns. Have you tried using the Arrow format and saving the images as PIL-encoded binary? I think this might be the fastest method, but I’m not sure whether it’s possible.
I also think that PIL.Image.save() is a good way to bypass the problem.
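Something like this is what I mean, assuming the `datasets` library (Arrow-backed) and made-up column names, so treat it as a sketch rather than a tested recipe:

```python
import io

from datasets import Dataset
from PIL import Image

def encode(img: Image.Image) -> bytes:
    buf = io.BytesIO()
    img.save(buf, format="PNG")  # lossless, keeps the alpha channel
    return buf.getvalue()

images = [Image.new("RGBA", (64, 64), (255, 0, 0, 128))]  # toy data
ds = Dataset.from_dict({"image_bytes": [encode(im) for im in images]})
ds.save_to_disk("my_arrow_dataset")  # stored as Arrow files on disk

# Decoding is the reverse: Image.open(io.BytesIO(row["image_bytes"]))
```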
I also dealt with a case where images needed to be added from time to time, although it was a small image dataset. If you upload many individual image files in a certain folder structure, Hugging Face’s server will organize them to a certain extent on its own, and they are presumably packed internally at load time, so that was the easiest approach. It’s not very smart from a Python perspective, and it’s a bit of a hassle to grasp the big picture…
But it’s definitely easy to add data, and you only need to upload the differences.
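Roughly like this, if I remember right (the repo id and folder name are placeholders):

```python
from datasets import load_dataset
from huggingface_hub import HfApi

# Re-running the upload only transfers new or changed files,
# so incremental additions stay cheap.
HfApi().upload_folder(
    folder_path="images",
    repo_id="your-username/your-image-dataset",
    repo_type="dataset",
)

# Load by folder structure; `datasets` packs and caches it as Arrow internally.
ds = load_dataset("imagefolder", data_dir="images")
```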