Hi, I am trying to create an image dataset (training split only) and upload it to the Hugging Face Hub. The data has two columns: 1) the image, and 2) the description text, i.e., the label. Essentially I'm trying to upload something similar to this.
I am using the ImageFolder approach and have my data folder structured as follows:
metadata.jsonl
data/train/image_1.png
data/train/image_2.png
data/train/image_3.png
data/train/image_4.png
...
In the metadata.jsonl file I've added the labels for the images as mentioned here:
{"file_name": "image_1.png", "text": "some description about image 1"}
{"file_name": "image_2.png", "text": "some description about image 2"}
{"file_name": "image_3.png", "text": "some description about image 3"}
{"file_name": "image_4.png", "text": "some description about image 4"}
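For reference, here is a minimal sketch of how a metadata.jsonl file in this format can be generated; the captions dict below is a placeholder, not my real data:

```python
import json

# Placeholder captions -- in practice these come from the real descriptions.
captions = {
    "image_1.png": "some description about image 1",
    "image_2.png": "some description about image 2",
}

def build_metadata_lines(captions):
    """Return one JSON line per image, matching the ImageFolder metadata format."""
    return [json.dumps({"file_name": name, "text": text})
            for name, text in captions.items()]

lines = build_metadata_lines(captions)

# Write metadata.jsonl, one JSON object per line.
with open("metadata.jsonl", "w") as f:
    f.write("\n".join(lines) + "\n")
```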
My script to upload is simple, and looks something like this:
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="data", split="train")
dataset.push_to_hub("ejcho623/undraw-raw")
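As a sanity check before pushing, I can compare the images on disk against the entries in metadata.jsonl with a small helper like this (file names here are made up for illustration):

```python
import json

def missing_labels(image_names, metadata_lines):
    """Return (images with no metadata entry, metadata entries with no image)."""
    labeled = {json.loads(line)["file_name"]
               for line in metadata_lines if line.strip()}
    return sorted(set(image_names) - labeled), sorted(labeled - set(image_names))

# Toy example: image_2.png is on disk but has no metadata entry.
images = ["image_1.png", "image_2.png"]
meta = ['{"file_name": "image_1.png", "text": "a description"}']
unlabeled, orphaned = missing_labels(images, meta)
print(unlabeled)
```

In practice I would pass in the actual file listing of data/train and the lines of metadata.jsonl.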
When I run the script, strangely it seems to push only 1 image. I see a data/train-xxxx.parquet and a dataset_infos.json file generated in my repo, but, given the size, it clearly has not uploaded the full dataset from my local directory (1000+ images). Here is the output of the command:
ejcho@ejs-macbook-pro undraw-raw % python3 push_images.py
Using custom data configuration default-e70837628f6a2c62
Found cached dataset imagefolder (/Users/ejcho/.cache/huggingface/datasets/imagefolder/default-e70837628f6a2c62/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f)
100%|██████████| 1/1 [00:00<00:00, 68.35ba/s]
Pushing dataset shards to the dataset hub: 100%|██████████| 1/1 [00:02<00:00, 2.27s/it]
Would anyone be able to help me figure out what's going on?
Thanks