I am importing an image dataset from an external source that is several terabytes in size.
Based on my research, WebDataset and Parquet seem to be the best choices. What are the advantages and disadvantages of WebDataset versus Parquet?
Currently, I am using the WebDataset format, but I have noticed several limitations.
The most critical issue is that all samples must share the same file extension. This means I have to convert PNG and other lossless formats with alpha channels into JPEG’s RGB format, which results in information loss.
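For concreteness, my current conversion looks roughly like this (shard pattern, paths, and JPEG quality are just illustrative):

```python
import io
from pathlib import Path

from PIL import Image
import webdataset as wds

# Write shards where every sample uses the same ".jpg" extension.
with wds.ShardWriter("shards-%05d.tar", maxcount=10_000) as writer:
    for path in Path("images").glob("*.png"):
        img = Image.open(path).convert("RGB")  # alpha channel is discarded here
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=95)  # lossy re-encode
        writer.write({"__key__": path.stem, "jpg": buf.getvalue()})
```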
Additionally, in my tests, WebDataset appears to be significantly slower than Parquet for both reading and writing.
I have not yet tried storing data in Parquet. Compared to WebDataset, does Parquet have any additional limitations?
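For what it’s worth, this is roughly what I have in mind for Parquet: keep the original encoded bytes in a binary column so no format conversion is needed. Untested on my end, and the column names are made up:

```python
import io
from pathlib import Path

from PIL import Image
import pyarrow as pa
import pyarrow.parquet as pq

paths = sorted(Path("images").glob("*"))
table = pa.table({
    "key": [p.stem for p in paths],
    "ext": [p.suffix.lstrip(".") for p in paths],
    "image_bytes": [p.read_bytes() for p in paths],  # original file bytes, lossless
})
pq.write_table(table, "images.parquet", compression="zstd")

# Read back: PIL decodes from the stored bytes, alpha intact for PNGs.
loaded = pq.read_table("images.parquet")
first = Image.open(io.BytesIO(loaded.column("image_bytes")[0].as_py()))
```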
I’m having the same concerns. Have you tried using the Arrow format and saving the images as PIL-encoded binary? I think this might be the fastest method, but I’m not sure whether it’s possible.
I also think that PIL.Image.save() is a good way to bypass the problem.
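Something like this is what I mean, assuming the `datasets` library (Arrow-backed) and made-up column names, so treat it as a sketch rather than a tested recipe:

```python
import io

from datasets import Dataset
from PIL import Image

def encode(img: Image.Image) -> bytes:
    buf = io.BytesIO()
    img.save(buf, format="PNG")  # lossless, keeps the alpha channel
    return buf.getvalue()

images = [Image.new("RGBA", (64, 64), (255, 0, 0, 128))]  # toy data
ds = Dataset.from_dict({"image_bytes": [encode(im) for im in images]})
ds.save_to_disk("my_arrow_dataset")  # stored as Arrow files on disk

# Decoding is the reverse: Image.open(io.BytesIO(row["image_bytes"]))
```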
I also dealt with a case where images needed to be added from time to time, although it was a small image dataset. If you upload many individual image files in a certain folder structure, Hugging Face’s server will organize them to a certain extent on its own, and they are presumably packed internally at load time, so that was the easiest approach. It’s not very smart from a Python perspective, and it’s a bit of a hassle to grasp the big picture…
But it’s definitely easy to add data, and you only need to upload the differences.
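Roughly like this, if I remember right (the repo id and folder name are placeholders):

```python
from datasets import load_dataset
from huggingface_hub import HfApi

# Re-running the upload only transfers new or changed files,
# so incremental additions stay cheap.
HfApi().upload_folder(
    folder_path="images",
    repo_id="your-username/your-image-dataset",
    repo_type="dataset",
)

# Load by folder structure; `datasets` packs and caches it as Arrow internally.
ds = load_dataset("imagefolder", data_dir="images")
```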