Uploading an image dataset to the Hugging Face Hub

Hi, I am trying to create an image dataset (training only) and upload it to the Hugging Face Hub. The data has two columns: 1) the image, and 2) the description text, i.e. the label. Essentially I'm trying to upload something similar to this.

I am using the ImageFolder approach and have my data folder structured as such:


In the metadata.jsonl file I've added the labels for the images as mentioned here:

{"file_name": "image_1.png", "text": "some description about image 1"}
{"file_name": "image_2.png", "text": "some description about image 2"}
{"file_name": "image_3.png", "text": "some description about image 3"}
{"file_name": "image_4.png", "text": "some description about image 4"}
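
Before loading, it can help to confirm that metadata.jsonl and the image files actually agree. The following is a minimal sketch, not part of the original script: the helper name check_metadata is hypothetical, and it assumes metadata.jsonl sits in the same directory as the images and that each file_name is relative to that directory.

```python
import json
from pathlib import Path

# Extensions treated as images here; this list is an assumption for the sketch.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp"}


def check_metadata(data_dir):
    """Return (missing_files, unlabeled_images) for a folder with a metadata.jsonl."""
    data_dir = Path(data_dir)

    # Collect every file_name listed in metadata.jsonl, skipping blank lines.
    listed = set()
    with open(data_dir / "metadata.jsonl", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                listed.add(json.loads(line)["file_name"])

    # Collect the image files actually present next to metadata.jsonl.
    on_disk = {
        p.name
        for p in data_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    }

    missing = sorted(listed - on_disk)    # listed in metadata but not on disk
    unlabeled = sorted(on_disk - listed)  # on disk but absent from metadata
    return missing, unlabeled
```

Calling check_metadata("data") and getting two empty lists would suggest the metadata side is consistent, which narrows the problem down to the loading step.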

My script to upload is simple, and looks something like this:

from datasets import load_dataset
dataset = load_dataset("imagefolder", data_dir="data", split="train")

When I run the script, it strangely seems to push only 1 image. I see a data/train-xxxx.parquet and a dataset_infos.json file generated in my repo, but judging by the size it clearly has not uploaded the full dataset from my local directory (1000+ images). Here is the output of the command:

ejcho@ejs-macbook-pro undraw-raw % python3 push_images.py
Using custom data configuration default-e70837628f6a2c62
Found cached dataset imagefolder (/Users/ejcho/.cache/huggingface/datasets/imagefolder/default-e70837628f6a2c62/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f)
100%|██████████| 1/1 [00:00<00:00, 68.35ba/s]
Pushing dataset shards to the dataset hub: 100%|██████████| 1/1 [00:02<00:00, 2.27s/it]
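
One way to catch this before anything is pushed is to compare the number of rows that were loaded with the number of image files on disk. The sketch below is only an illustration: count_local_images and load_and_check are hypothetical helper names, the extension list is an assumption, and passing download_mode="force_redownload" is there because the log reports a cached dataset, which may or may not be related to the problem.

```python
from pathlib import Path

# Extensions treated as images here; this list is an assumption for the sketch.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp"}


def count_local_images(data_dir):
    """Count image files under data_dir (recursively), judged by extension."""
    return sum(
        1
        for p in Path(data_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def load_and_check(data_dir="data"):
    """Load the folder and fail loudly if the row count does not match the disk."""
    # Imported lazily so count_local_images works without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(
        "imagefolder",
        data_dir=data_dir,
        split="train",
        download_mode="force_redownload",  # rebuild instead of reusing the cache
    )
    expected = count_local_images(data_dir)
    if len(dataset) != expected:
        raise RuntimeError(
            f"Loaded {len(dataset)} rows but found {expected} image files in {data_dir}"
        )
    return dataset
```

If load_and_check("data") returns without raising, the dataset can then be pushed as before; if it raises, the mismatch happens at load time rather than during the upload.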

Would anyone be able to help me see what's going on?


It seems there were some file names that load_dataset didn't like and would fail on. Once I removed those files (e.g. a file named "personal_training.png"), it worked.

Not sure if there is some issue in the parser where file names like these conflict with certain keywords.
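
One possible explanation, which nothing in this thread confirms, is that load_dataset infers splits from keywords such as "train" or "test" in file and directory names, so a file like "personal_training.png" could end up being treated as the whole train split. Under that assumption, a quick scan for such names might look like the sketch below. The keyword list and the helper name find_suspicious_names are illustrative guesses, not the library's actual matching rules.

```python
from pathlib import Path

# Keywords assumed to be relevant to split inference; this is a guess for the sketch.
SPLIT_KEYWORDS = ("train", "test", "valid", "dev", "eval")


def find_suspicious_names(data_dir):
    """Map each file name under data_dir to the split keywords its stem contains."""
    suspicious = {}
    for path in sorted(Path(data_dir).rglob("*")):
        if not path.is_file():
            continue
        # Plain substring match on the lowercased stem, deliberately aggressive.
        hits = [kw for kw in SPLIT_KEYWORDS if kw in path.stem.lower()]
        if hits:
            suspicious[path.name] = hits
    return suspicious
```

If the scan flags the same files that had to be removed, renaming them is one option. Passing the files explicitly, e.g. data_files={"train": "data/**"} instead of data_dir, might also sidestep the inference, though that is something to test rather than a confirmed fix.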

Hi! Very strange indeed. Do you mind sharing the image folder with us (with dummy images if you want to keep them private), so we can get to the root of this problem?