Hey there,
Iβve built a image dataset of 100k images + text pair as described here Create an image dataset
Now Iβm trying to push it to the hub but Iβm running into issues. First, I tried doing it via git directly, I added all the files in git lfs and pushed but I got hit with an error saying huggingface only accept up to 10k files in a folder.
So Iβm now trying with the push_to_hub()
func as follow:
from datasets import load_dataset
import os
dataset = load_dataset("imagefolder", data_dir="./data", split="train")
dataset.push_to_hub("tzvc/organization-logos", token=os.environ.get('HF_TOKEN'))
But again, this produces an error:
Resolving data files: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 100212/100212 [00:00<00:00, 439108.61it/s]
Downloading and preparing dataset imagefolder/default to /home/contact_theochampion/.cache/huggingface/datasets/imagefolder/default-20567ffc703aa314/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f...
Downloading data files: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 100211/100211 [00:00<00:00, 149323.73it/s]
Downloading data files: 100%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 1/1 [00:00<00:00, 15947.92it/s]
Extracting data files: 100%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 1/1 [00:00<00:00, 2245.34it/s]
Dataset imagefolder downloaded and prepared to /home/contact_theochampion/.cache/huggingface/datasets/imagefolder/default-20567ffc703aa314/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f. Subsequent calls will reuse this data.
Resuming upload of the dataset shards.
Pushing dataset shards to the dataset hub: 100%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 14/14 [00:31<00:00, 2.24s/it]
Downloading metadata: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 118/118 [00:00<00:00, 225kB/s]
Traceback (most recent call last):
File "/home/contact_theochampion/organization-logos/push_to_hub.py", line 5, in <module>
dataset.push_to_hub("tzvc/organization-logos", token=os.environ.get('HF_TOKEN'))
File "/home/contact_theochampion/.local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 5245, in push_to_hub
repo_info = dataset_infos[next(iter(dataset_infos))]
StopIteration
What am I missing here?
Cheers!