Handling a Large-Scale Image Dataset

Hello. I want to train my VLM using a large-scale image dataset with the Huggingface trainer.

I initially planned to follow the method suggested by HuggingFaceM4, which involves embedding PIL images directly in Arrow files. However, I found this approach problematic when dealing with datasets like DocStruct4M: the challenges included handling large files such as infographics and processing millions of samples. Furthermore, uploading such a large image dataset to the Hub proved difficult.
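
For context, this is roughly what the embedded-image approach looks like with the `datasets` library; the synthetic image and output paths below are placeholders, not my actual setup:

```python
# Rough sketch of the embedded-image approach (placeholder data only).
from datasets import Dataset, Features, Image, Value
from PIL import Image as PILImage

features = Features({"image": Image(), "text": Value("string")})

# A tiny synthetic image stands in for a real infographic page.
pil_img = PILImage.new("RGB", (64, 64), color="white")

ds = Dataset.from_dict(
    {"image": [pil_img], "text": ["example caption"]},
    features=features,
)
ds.save_to_disk("docstruct_arrow_demo")  # the image bytes end up inside the Arrow shards
# ds.push_to_hub("user/docstruct-demo")  # uploading these shards is where it gets painful
```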

Therefore, I’m curious about what would be the best way to structure a large-scale image dataset.

There seem to be several ways to create a dataset that do not involve uploading large files to the Hub. The method of writing a dataset loading script has been used for some time. A newer method is to use the Builder class, which I think is cleaner; there’s a rough sketch after the links below.

However, with both of these methods, it is difficult to create as detailed a structure as when everything is managed locally and then uploaded… that’s just the way it is.

Create a dataset loading script

(in Japanese)
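
A loading script built on the Builder class might look like the sketch below; the archive URL, features, and file layout are hypothetical placeholders, so adapt them to your data:

```python
# Minimal GeneratorBasedBuilder sketch; the URL and features are hypothetical.
import datasets

_ARCHIVE_URLS = ["https://example.com/docstruct-shard-000.tar"]  # images hosted outside the repo

class MyDocDataset(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("1.0.0")

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"image": datasets.Image(), "text": datasets.Value("string")}
            )
        )

    def _split_generators(self, dl_manager):
        # Only the script lives in the dataset repo; the heavy files are downloaded here.
        archives = dl_manager.download(_ARCHIVE_URLS)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"archives": [dl_manager.iter_archive(a) for a in archives]},
            )
        ]

    def _generate_examples(self, archives):
        key = 0
        for archive in archives:
            for path, fobj in archive:  # iter_archive yields (path, file object) pairs
                if path.endswith((".png", ".jpg")):
                    yield key, {"image": {"path": path, "bytes": fobj.read()}, "text": path}
                    key += 1
```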

It seems we’re facing similar challenges. First of all, I really appreciate all the help you’ve provided with my problem.

I was wondering if you could tell me whether a dataset created by subclassing GeneratorBasedBuilder ends up as a map-style dataset or an iterable one? If possible, I’d like the map-style approach for faster processing, proper shuffling, and so that I know the total number of iterations.
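
For reference, this is the distinction I mean; the repo id below is just a placeholder:

```python
from datasets import load_dataset

# Map-style access: len(), indexing, and index-based shuffling all work.
ds = load_dataset("user/my-doc-dataset", split="train")  # placeholder repo id
print(len(ds), ds[0].keys())

# Iterable/streaming access: no len(), only buffer-based approximate shuffling.
ids = load_dataset("user/my-doc-dataset", split="train", streaming=True)
ids = ids.shuffle(buffer_size=1_000, seed=42)
for example in ids.take(2):
    print(example.keys())
```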

Hmm, we’ll have to ask the author to find out… @lhoestq

Oh, that’s actually a great idea.

hey mate, I think using Arrow files to construct large image datasets isn’t a good idea (Error when saving to disk a dataset of images · Issue #5717 · huggingface/datasets · GitHub). It seems Arrow has limitations in handling large image sets. I think storing image paths as strings might be the best solution.
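
Something like this is what I have in mind; the file paths are made up:

```python
# Sketch of the paths-as-strings idea; paths here are hypothetical.
from datasets import Dataset, Image

ds = Dataset.from_dict(
    {
        "image_path": ["images/doc_000001.png", "images/doc_000002.png"],
        "text": ["caption A", "caption B"],
    }
)

# Either open the files yourself in a transform / collate_fn, or cast the
# column to the Image feature so decoding happens lazily on access.
ds = ds.cast_column("image_path", Image())
```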

Have you tried using the Builder class? It seems to work in an iterative way, so I’m worried that, since you don’t know the total number of steps, it would be difficult to automatically schedule the learning rate, and shuffling might not work as freely as we’d like.

I see… In my case, the dataset was small in terms of sample count.

In that case, for large datasets, rather than handling them on a file-by-file basis, it might be faster to group the samples into a certain number of shards using WebDataset, or, although it’s a bit primitive, to divide the dataset repo into multiple volumes (Volume 1, Volume 2, etc.) and merge them manually when loading. Whether it’s smart or not is beside the point…
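
For the volume idea, the manual merge could look something like this; the repo names are hypothetical, and the commented WebDataset lines assume the `webdataset` package:

```python
# Hypothetical repo names; merge several volume repos at load time.
from datasets import load_dataset, concatenate_datasets

volumes = [
    load_dataset("user/docstruct-vol1", split="train"),
    load_dataset("user/docstruct-vol2", split="train"),
]
train_ds = concatenate_datasets(volumes)

# The WebDataset route instead packs samples into .tar shards and streams them:
# import webdataset as wds
# shards = "https://example.com/docstruct-{000000..000099}.tar"
# wd = wds.WebDataset(shards).decode("pil").to_tuple("png", "json")
```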
