Hi, I’m new to Hugging Face Datasets and I’m trying to train ControlNet for Stable Diffusion on a custom dataset of approximately 300k images, each of size (768, 768).
Does imagefolder load (and decode) the images into memory at setup? If so, can I disable that?
Is there any implicit processing that Datasets performs the first time I call load_dataset that makes it take so long?
What’s the best practice for loading a relatively large dataset? I’ve seen someone mention saving the dataset as Arrow and then loading it, but I don’t know how to do that specifically. There is an urgent need for a detailed tutorial on this in the official docs.
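For context, a minimal sketch of the “save as Arrow, then reload” workflow looks roughly like this (the folder names are placeholders, and this assumes the images can be read by the imagefolder builder):

from datasets import load_dataset, load_from_disk

# One-time conversion: decode and validate the images once, then write Arrow files to disk.
ds = load_dataset("imagefolder", data_dir="my-image-folder", split="train")
ds.save_to_disk("my-dataset-arrow")

# Later runs: load_from_disk memory-maps the Arrow files instead of re-scanning the image folder.
ds = load_from_disk("my-dataset-arrow")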
@panigrah thank you very much. Maybe you also know whether it’s possible to download a dataset in a multi-process way? For some reason, setting num_proc does not work at all… My dataset has 58 parquet files, and I was hoping that passing num_proc to load_dataset would spawn 58 Python processes, each downloading its own parquet file, so I could load my dataset in 1 minute instead of 50…
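For reference, a minimal sketch of how num_proc is meant to be passed (the repo id here is a placeholder; whether the download itself is parallelized depends on your datasets version):

from datasets import load_dataset

# num_proc asks load_dataset to download and prepare the data files with multiple processes.
ds = load_dataset("my-org/my-parquet-dataset", num_proc=8)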
The FolderBasedBuilder doesn’t seem to be optimized for large datasets. It performs several costly operations while parsing and validating the data.
A simple loading script sidesteps most of these costly operations, e.g.:
import datasets
import glob

_IMAGES = glob.glob("my-image-folder/*.png")

class MyDataset(datasets.GeneratorBasedBuilder):
    """My image dataset"""

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "image": datasets.Image(),
                    # add labels or additional metadata as needed
                }
            ),
        )

    def _split_generators(self, dl_manager):
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "images": _IMAGES,
                },
            ),
        ]

    def _generate_examples(self, images):
        for file_path in images:
            yield file_path, {
                "image": file_path,
                # parse labels or additional metadata as needed
            }
In my case, I was trying to create a 15M-image dataset: the “imagefolder” approach seemed to run out of memory after 8 hours or so, while with the loading script I was able to create the dataset in minutes.
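If it helps, a minimal sketch of how such a loading script is used, assuming it is saved as my_dataset.py in the current directory (note that script-based loading is deprecated in newer datasets releases):

from datasets import load_dataset

# Build the dataset from the local script; images are decoded lazily on access.
ds = load_dataset("my_dataset.py", split="train")
print(ds[0]["image"])  # a PIL image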
We just released datasets 2.16.1, which optimizes data file resolution and makes it possible to load datasets with millions of images. It also requires huggingface-hub >= 0.20.1.
Older versions of datasets and huggingface-hub are slow at handling that many files.
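A quick sanity check that your environment meets these requirements (just a sketch):

import datasets
import huggingface_hub

# The fast file resolution needs datasets >= 2.16.1 and huggingface-hub >= 0.20.1.
print(datasets.__version__, huggingface_hub.__version__)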