Cannot push dataset of 100k images

tzvc · March 26, 2023, 2:44pm

Hey there,

I’ve built an image dataset of 100k image + text pairs, as described here: Create an image dataset

Now I’m trying to push it to the Hub, but I’m running into issues. First, I tried doing it via git directly: I added all the files to Git LFS and pushed, but I got hit with an error saying Hugging Face only accepts up to 10k files in a folder.
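For the git route, one way around the per-folder limit would be to spread the files over several subfolders before committing. Below is a minimal sketch of that idea, not something I have run against the Hub. The helper name shard_into_subfolders, the part-0000 naming scheme, and the skip list are all made up for illustration, and the 10,000 default simply mirrors the limit quoted in the error.

```python
import shutil
from pathlib import Path


def shard_into_subfolders(src_dir, max_per_dir=10_000,
                          skip=("metadata.jsonl", "metadata.csv")):
    """Move the files directly under src_dir into numbered subfolders.

    Each subfolder receives at most max_per_dir files. Files whose names are
    listed in `skip` stay where they are. Returns the subfolders created.
    """
    src = Path(src_dir)
    # Sort so that the assignment of files to subfolders is reproducible.
    files = sorted(p for p in src.iterdir()
                   if p.is_file() and p.name not in skip)
    created = []
    for start in range(0, len(files), max_per_dir):
        # Hypothetical naming scheme: part-0000, part-0001, ...
        sub = src / f"part-{start // max_per_dir:04d}"
        sub.mkdir(exist_ok=True)
        for path in files[start:start + max_per_dir]:
            shutil.move(str(path), str(sub / path.name))
        created.append(sub)
    return created
```

Calling shard_into_subfolders("./data") would leave at most 10,000 images in each part-XXXX folder. If a metadata file references the images by relative path, those paths would need to be updated to include the new subfolder, which this sketch does not do.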

So I’m now trying with the push_to_hub() function, as follows:

from datasets import load_dataset
import os

dataset = load_dataset("imagefolder", data_dir="./data", split="train")
dataset.push_to_hub("tzvc/organization-logos", token=os.environ.get('HF_TOKEN'))

But again, this produces an error:

Resolving data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 100212/100212 [00:00<00:00, 439108.61it/s]
Downloading and preparing dataset imagefolder/default to /home/contact_theochampion/.cache/huggingface/datasets/imagefolder/default-20567ffc703aa314/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f...
Downloading data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 100211/100211 [00:00<00:00, 149323.73it/s]
Downloading data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 15947.92it/s]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2245.34it/s]
Dataset imagefolder downloaded and prepared to /home/contact_theochampion/.cache/huggingface/datasets/imagefolder/default-20567ffc703aa314/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f. Subsequent calls will reuse this data.
Resuming upload of the dataset shards.                                                                                                                                        
Pushing dataset shards to the dataset hub: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:31<00:00,  2.24s/it]
Downloading metadata: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 118/118 [00:00<00:00, 225kB/s]
Traceback (most recent call last):
  File "/home/contact_theochampion/organization-logos/push_to_hub.py", line 5, in <module>
    dataset.push_to_hub("tzvc/organization-logos", token=os.environ.get('HF_TOKEN'))
  File "/home/contact_theochampion/.local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 5245, in push_to_hub
    repo_info = dataset_infos[next(iter(dataset_infos))]
StopIteration
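
For reference, the failing expression in the traceback can be reproduced on its own: calling next() on an iterator over an empty mapping raises StopIteration. The sketch below only mirrors that one line. The function name first_info is made up, and why dataset_infos ends up empty inside the library is not visible from the log. One guess, and it is only a guess, is that the metadata fetched from the existing repo contains no dataset info entries.

```python
def first_info(dataset_infos):
    """Mirror the expression from the traceback: take the first entry."""
    return dataset_infos[next(iter(dataset_infos))]


# With at least one entry, the first value is returned.
print(first_info({"default": "some info"}))

# With an empty mapping, next() has nothing to yield and raises StopIteration.
try:
    first_info({})
except StopIteration:
    print("StopIteration: the mapping had no entries")
```

So the shards themselves appear to upload fine (14/14 in the log), and the failure happens afterwards, while the dataset info is being read.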

What am I missing here?

Cheers!
