Hello. I want to train my VLM on a large-scale image dataset with the Hugging Face Trainer.
I initially planned to follow the method used by HuggingFaceM4, which embeds PIL images directly inside arrow files. However, I found this approach problematic for datasets like DocStruct4M: it means handling large files such as infographics and processing millions of samples, and uploading such a large image dataset to the Hub proved difficult.
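(For context, the arrow-embedding approach I mean looks roughly like this — a minimal sketch, with placeholder paths and a hypothetical repo id:)

```python
from datasets import Dataset, Image

ds = Dataset.from_dict({
    "image": ["imgs/0001.png", "imgs/0002.png"],  # local paths (placeholders)
    "text": ["caption one", "caption two"],
})
# Casting to the Image feature makes `datasets` store the image bytes
# inside the arrow file when the dataset is saved or pushed.
ds = ds.cast_column("image", Image())
ds.push_to_hub("user/my-vlm-dataset")  # hypothetical repo id
```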
Therefore, I'm curious: what would be the best way to structure a large-scale image dataset?
There seem to be several ways to create a dataset without uploading large files to the Hub. Writing a dataset loading script is the long-standing method; a newer option is to subclass the Builder class, which I think is cleaner.
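For example, a minimal GeneratorBasedBuilder sketch might look like this (the dataset name, directory layout, and fields are placeholders, not the exact DocStruct4M schema):

```python
# my_dataset.py — a minimal loading-script sketch; all names/paths are placeholders.
import os
import datasets

class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({
                "image": datasets.Image(),
                "text": datasets.Value("string"),
            })
        )

    def _split_generators(self, dl_manager):
        # Images stay on local disk; nothing large is pushed to the Hub.
        data_dir = os.path.abspath("data")
        return [datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"data_dir": data_dir},
        )]

    def _generate_examples(self, data_dir):
        for idx, fname in enumerate(sorted(os.listdir(data_dir))):
            if fname.endswith(".jpg"):
                yield idx, {
                    "image": os.path.join(data_dir, fname),  # decoded lazily
                    "text": fname,  # placeholder annotation
                }
```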
However, with both of these methods, it is difficult to create as detailed a structure as when everything is managed locally and then uploaded… that's just the way it is.
It seems we're facing similar challenges. First of all, I really appreciate all the help you've provided with my problem.
I was wondering if you could tell me whether a dataset created by subclassing GeneratorBasedBuilder uses a map-style or an iterable approach? If possible, I'd like to use the map-style approach, for faster processing, full shuffling, and a known total number of iterations.
Have you tried using the Builder class? It seems to work iteratively, so I'm worried that, without knowing the number of steps, it would be difficult to schedule the learning rate automatically, and shuffling might not work as freely as we'd like.
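For now my fallback plan would be something like the following, assuming streaming; the repo id and step budget are placeholders:

```python
# A sketch of the iterable/streaming case with its usual workarounds.
from datasets import load_dataset
from transformers import TrainingArguments

stream = load_dataset("user/my-vlm-dataset", split="train", streaming=True)
# Approximate shuffling via a buffer, since a full shuffle needs random access.
stream = stream.shuffle(seed=42, buffer_size=10_000)

# With no known length, the Trainer needs an explicit step budget so the
# LR scheduler can be built; max_steps takes the place of num_train_epochs.
args = TrainingArguments(
    output_dir="out",
    max_steps=100_000,  # placeholder
    lr_scheduler_type="cosine",
)
```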
I see… In my case, the dataset was small in terms of sample count.
In that case, for large datasets, rather than handling them file by file, it might be faster to group the data into fixed-size shards with WebDataset, or, although it's a bit primitive, to split the dataset repo into multiple volumes (Volume 1, Volume 2, etc.) and merge them manually at load time, as in the sketch below. Whether it's smart or not is beside the point…
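For the volume approach, merging at load time could look like this (the repo ids are hypothetical):

```python
# Merge N smaller "volume" repos back into one map-style dataset.
from datasets import load_dataset, concatenate_datasets

volumes = [
    load_dataset(f"user/docstruct-vol{i}", split="train")  # hypothetical repos
    for i in range(1, 4)
]
full = concatenate_datasets(volumes)
# Map-style, so len(), full shuffling, and LR scheduling all work as usual.
print(len(full))
```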