Loading images directly in data folder

Hi! I have an issue with a custom loading script for a dataset. For a specific reason, I have to keep the images directly in a folder named “data” (not compressed into an images.zip, for example).

The problem is that DownloadManager.download() (or download_and_extract()) expects file paths. So I have to iterate over the repository files to collect every image and then pass them to this function as a dict or list. When I do this, load_dataset takes a long time to load the images (I think because the function processes them one by one).

Is there a proper way to do this? Please!!

Hi! First, have you considered not using a loading script and simply structuring your dataset as an ImageFolder (+ metadata)? See the documentation here: Image Dataset. It would also enable the Dataset Viewer on HF.
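To make the ImageFolder (+ metadata) layout concrete, here is a minimal sketch of writing a metadata.jsonl file next to the images. ImageFolder matches each metadata row to an image through a "file_name" field; the helper name write_metadata and the example records are hypothetical, made up for illustration only.

```python
import json
from pathlib import Path


def write_metadata(folder, records):
    """Write one JSON object per line to <folder>/metadata.jsonl.

    Each record must carry a "file_name" key holding the image path
    relative to the folder; any other keys become dataset columns.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    out_path = folder / "metadata.jsonl"
    with out_path.open("w", encoding="utf-8") as f:
        for record in records:
            if "file_name" not in record:
                raise ValueError("every metadata record needs a 'file_name' key")
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


# Hypothetical usage: the image names and the "caption" column are placeholders.
# write_metadata("data", [
#     {"file_name": "0001.jpg", "caption": "first image"},
#     {"file_name": "0002.jpg", "caption": "second image"},
# ])
```

The resulting folder then holds the .jpg files plus metadata.jsonl side by side, which is the layout the Image Dataset documentation describes.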

Unfortunately, the DownloadManager doesn’t implement anything to download a full directory or to glob files. You can either hardcode the list of files or use the HfFileSystem, which does implement .glob().
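As a rough sketch of the HfFileSystem route: glob the folder once, turn the matches into download URLs, and hand the whole list to the download manager in a single call. The helper name list_image_urls, the "data" folder, the .jpg extension, and the "main" revision are assumptions for illustration; adapt them to your repo.

```python
from typing import List, Optional


def list_image_urls(repo_id: str, fs: Optional[object] = None,
                    folder: str = "data", ext: str = "jpg") -> List[str]:
    """Return sorted download URLs for every image in one folder of a dataset repo."""
    if fs is None:
        # Imported lazily so the helper can also be exercised with a stand-in object.
        from huggingface_hub import HfFileSystem
        fs = HfFileSystem()

    # HfFileSystem paths for dataset repos start with "datasets/<repo_id>/".
    prefix = f"datasets/{repo_id}/"
    paths = sorted(fs.glob(f"{prefix}{folder}/*.{ext}"))

    # Assumption: files live on the "main" revision of the repo.
    base_url = f"https://huggingface.co/datasets/{repo_id}/resolve/main/"
    return [base_url + path[len(prefix):] for path in paths]


# Hypothetical usage inside a loading script's _split_generators:
# urls = list_image_urls("<repo_id>")
# local_paths = dl_manager.download(urls)  # one call with the full list
```

Passing the whole list at once, rather than calling download() per file, at least avoids per-call overhead in your own loop.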

Hi! Thanks for your answer.

Regarding structuring the dataset as an ImageFolder (+ metadata): my repository (uploaded to the Hugging Face Hub) has a folder called “data” in the root of the repo. Inside this folder I have the images in JPG format and a metadata.jsonl file containing the metadata. How can I use the load_dataset function to load a dataset with this structure?

I have seen different configurations, but they use “train” as the name of the folder that I call “data”. Is that the problem?

When I call load_dataset with “datasets-examples/doc-image-5” (I use this one because I think it has the same structure as my repo), I get no problems. When I try with my dataset, I get a “DataFilesNotFoundError: No (supported) data files found in <repo_id>” error. I tried renaming my data folder to “train”, but I get the same error.

My repo also has some other folders (“rawdata”, “supportdata”, etc.) containing more data, but I don’t want them to be loaded because they are not part of my dataset. I don’t know whether they could be triggering the error.
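One thing that may be worth trying for a layout like this is to point load_dataset explicitly at the “data” folder through its data_files argument, so that the other folders are left out. This is only a sketch under the assumption that the repo has no loading script and that everything under “data” belongs to a single “train” split; the helper name build_load_kwargs is hypothetical, and it is not a confirmed fix for the error above.

```python
from typing import Any, Dict


def build_load_kwargs(repo_id: str, folder: str = "data",
                      split: str = "train") -> Dict[str, Any]:
    """Build load_dataset keyword arguments restricted to a single folder.

    The glob pattern is relative to the repo root, so folders such as
    "rawdata" or "supportdata" do not match it.
    """
    return {
        "path": repo_id,
        # Covers both the images and metadata.jsonl inside the folder.
        "data_files": {split: f"{folder}/**"},
    }


# Hypothetical usage (needs the `datasets` library and network access):
# from datasets import load_dataset
# ds = load_dataset(**build_load_kwargs("<repo_id>"))
```

Passing data_dir="data" instead of data_files is a similar way to restrict loading to that one folder.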