Hi, I am trying to create an image dataset (training only) and upload it to the Hugging Face Hub. The data has two columns: 1) the image, and 2) the description text, i.e., the label. Essentially I’m trying to upload something similar to this.
I am using the ImageFolder approach and have my data folder structured as such:
metadata.jsonl
data/train/image_1.png
data/train/image_2.png
data/train/image_3.png
data/train/image_4.png
...
In the metadata.jsonl file, I’ve added the labels for the images as mentioned here:
{"file_name": "image_1.png", "text": "some description about image 1"}
{"file_name": "image_2.png", "text": "some description about image 2"}
{"file_name": "image_3.png", "text": "some description about image 3"}
{"file_name": "image_4.png", "text": "some description about image 4"}
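For reference, a quick sanity check along these lines could confirm that every metadata entry matches an image file on disk. This is only a sketch and not part of my upload script: the names check_metadata and demo are hypothetical, the demo builds throwaway placeholder files in a temporary directory, and it assumes each file_name is a bare file name that should exist in the image folder passed in.

```python
import json
import tempfile
from pathlib import Path


def check_metadata(metadata_path, image_dir):
    """Compare file_name entries in a JSON Lines file with the PNG files in image_dir.

    Returns (missing, unlisted): names listed in the metadata but absent on disk,
    and PNG files on disk that have no metadata entry.
    """
    listed = set()
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            listed.add(json.loads(line)["file_name"])
    on_disk = {p.name for p in Path(image_dir).glob("*.png")}
    return sorted(listed - on_disk), sorted(on_disk - listed)


def demo():
    """Build a tiny throwaway layout (hypothetical data) and run the check on it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        train = root / "data" / "train"
        train.mkdir(parents=True)
        # Placeholder files standing in for images 1, 2 and 3.
        for i in (1, 2, 3):
            (train / f"image_{i}.png").write_bytes(b"")
        # Metadata lists images 1, 2 and 4, so 4 is missing and 3 is unlisted.
        rows = [
            {"file_name": f"image_{i}.png", "text": f"some description about image {i}"}
            for i in (1, 2, 4)
        ]
        meta = root / "metadata.jsonl"
        meta.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8")
        return check_metadata(meta, train)


if __name__ == "__main__":
    missing, unlisted = demo()
    print("listed but not on disk:", missing)
    print("on disk but not listed:", unlisted)
```

On the real data, the call would be something like check_metadata("metadata.jsonl", "data/train"); two empty lists would mean the metadata and the folder agree.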
My script to upload is simple, and looks something like this:
from datasets import load_dataset
dataset = load_dataset("imagefolder", data_dir="data", split="train")
dataset.push_to_hub("ejcho623/undraw-raw")
When I run the script, strangely enough, it seems to push only 1 image. I see a data/train-xxxx.parquet file and a dataset_infos.json file generated in my repo, but judging by the size, it clearly has not uploaded the full dataset from my local directory (1000+ images). Here is the output of the command:
ejcho@ejs-macbook-pro undraw-raw % python3 push_images.py
Using custom data configuration default-e70837628f6a2c62
Found cached dataset imagefolder (/Users/ejcho/.cache/huggingface/datasets/imagefolder/default-e70837628f6a2c62/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 68.35ba/s]
Pushing dataset shards to the dataset hub: 100%|████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.27s/it]
Would anyone be able to help me figure out what’s going on?
Thanks