Handling a Large-Scale Image Dataset

Hello. I want to train my VLM using a large-scale image dataset with the Huggingface trainer.

I initially planned to follow the method suggested by HuggingFaceM4, which involves embedding PIL images directly in Arrow files. However, I found this approach problematic when dealing with datasets like DocStruct4M: the challenges included handling large files such as infographics and processing millions of samples. Furthermore, uploading such a large image dataset to the Hub proved difficult.
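
For context, this is roughly what the embedded-image approach looks like with the `datasets` library; the synthetic image and output paths below are placeholders, not my actual setup:

```python
# Rough sketch of the embedded-image approach (placeholder data only).
from datasets import Dataset, Features, Image, Value
from PIL import Image as PILImage

features = Features({"image": Image(), "text": Value("string")})

# A tiny synthetic image stands in for a real infographic page.
pil_img = PILImage.new("RGB", (64, 64), color="white")

ds = Dataset.from_dict(
    {"image": [pil_img], "text": ["example caption"]},
    features=features,
)
ds.save_to_disk("docstruct_arrow_demo")  # the image bytes end up inside the Arrow shards
# ds.push_to_hub("user/docstruct-demo")  # uploading these shards is where it gets painful
```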

Therefore, I’m curious about what would be the best way to structure a large-scale image dataset.

There seem to be several ways to create a dataset that do not involve uploading large files to the Hub. The method of writing a dataset loading script has been used for some time. A newer method is to use the Builder class, which I think is cleaner; there’s a rough sketch after the links below.

However, with both of these methods, it is difficult to create as detailed a structure as when everything is managed locally and then uploaded… that’s just the way it is.

Create a dataset loading script

(in Japanese)
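
A loading script built on the Builder class might look like the sketch below; the archive URL, features, and file layout are hypothetical placeholders, so adapt them to your data:

```python
# Minimal GeneratorBasedBuilder sketch; the URL and features are hypothetical.
import datasets

_ARCHIVE_URLS = ["https://example.com/docstruct-shard-000.tar"]  # images hosted outside the repo

class MyDocDataset(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("1.0.0")

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"image": datasets.Image(), "text": datasets.Value("string")}
            )
        )

    def _split_generators(self, dl_manager):
        # Only the script lives in the dataset repo; the heavy files are downloaded here.
        archives = dl_manager.download(_ARCHIVE_URLS)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"archives": [dl_manager.iter_archive(a) for a in archives]},
            )
        ]

    def _generate_examples(self, archives):
        key = 0
        for archive in archives:
            for path, fobj in archive:  # iter_archive yields (path, file object) pairs
                if path.endswith((".png", ".jpg")):
                    yield key, {"image": {"path": path, "bytes": fobj.read()}, "text": path}
                    key += 1
```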

It seems we’re facing similar challenges. First of all, I really appreciate all the help you’ve provided with my problem.

I was wondering if you could tell me whether a dataset created by subclassing GeneratorBasedBuilder ends up as a map-style dataset or an iterable one? If possible, I’d like the map-style approach for faster processing, proper shuffling, and so that I know the total number of iterations.
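
For reference, this is the distinction I mean; the repo id below is just a placeholder:

```python
from datasets import load_dataset

# Map-style access: len(), indexing, and index-based shuffling all work.
ds = load_dataset("user/my-doc-dataset", split="train")  # placeholder repo id
print(len(ds), ds[0].keys())

# Iterable/streaming access: no len(), only buffer-based approximate shuffling.
ids = load_dataset("user/my-doc-dataset", split="train", streaming=True)
ids = ids.shuffle(buffer_size=1_000, seed=42)
for example in ids.take(2):
    print(example.keys())
```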

Hmm, we’ll have to ask the author to find out… @lhoestq

Oh, that’s actually a great idea.

hey mate, I think using Arrow files to construct large image datasets isn’t a good idea (Error when saving to disk a dataset of images · Issue #5717 · huggingface/datasets · GitHub). It seems Arrow has limitations in handling large image sets. I think storing image paths as strings might be the best solution.
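
Something like this is what I have in mind; the file paths are made up:

```python
# Sketch of the paths-as-strings idea; paths here are hypothetical.
from datasets import Dataset, Image

ds = Dataset.from_dict(
    {
        "image_path": ["images/doc_000001.png", "images/doc_000002.png"],
        "text": ["caption A", "caption B"],
    }
)

# Either open the files yourself in a transform / collate_fn, or cast the
# column to the Image feature so decoding happens lazily on access.
ds = ds.cast_column("image_path", Image())
```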

Have you tried using the Builder class? It seems to work in an iterative way, so I’m worried that, since you don’t know the total number of steps, it would be difficult to automatically schedule the learning rate, and shuffling might not work as freely as we’d like.

I see… In my case, the dataset was small in terms of sample count.

In that case, for large datasets, rather than handling them on a file-by-file basis, it might be faster to group the samples into a certain number of shards using WebDataset, or, although it’s a bit primitive, to divide the dataset repo into multiple volumes (Volume 1, Volume 2, etc.) and merge them manually when loading. Whether it’s smart or not is beside the point…
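
For the volume idea, the manual merge could look something like this; the repo names are hypothetical, and the commented WebDataset lines assume the `webdataset` package:

```python
# Hypothetical repo names; merge several volume repos at load time.
from datasets import load_dataset, concatenate_datasets

volumes = [
    load_dataset("user/docstruct-vol1", split="train"),
    load_dataset("user/docstruct-vol2", split="train"),
]
train_ds = concatenate_datasets(volumes)

# The WebDataset route instead packs samples into .tar shards and streams them:
# import webdataset as wds
# shards = "https://example.com/docstruct-{000000..000099}.tar"
# wd = wds.WebDataset(shards).decode("pil").to_tuple("png", "json")
```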
