Large image dataset, feedback and advice: data viewer, task template, and more

alkzar90 · November 4, 2022, 5:25pm

Hi fellows!

I noticed there aren’t many image datasets on the Hugging Face Hub, especially medical image datasets. So I contributed a benchmark dataset used in radiology: the NIH Chest-X-ray14 dataset (Wang 2017).

This dataset has more than 100k images with 14 classes (about 40 GB). I learned a lot about structuring the repository and creating the custom loading script, but I still have some questions and problems that maybe someone from the community can help me with:

1. The data viewer has problems displaying the train split. I still don’t know why, because the test split works. Any advice or a possible solution?
2. In the _info() method of the dataset generator class, I noticed there isn’t a task template for multi-label problems. Because the label column is a Sequence type, I got an error when I used the ImageClassification task template as shown below. What is the purpose of the task_templates attribute? What would be the advantage of having a task template for a multi-label image problem?

            task_templates=[
                datasets.ImageClassification(image_column="image", label_column="label")
            ],

3. The dataset is quite large and takes time to download, so I was wondering if there is room to improve my loading script and make it more efficient. I don’t have the best internet connection in the world, but it takes approximately 1 hour to download and load the dataset (a Google Colab notebook for loading and exploring the dataset).
4. I noticed some .zip files have a “pickle” badge in the repository, but others do not. Why the difference, and what does it mean?

[Image: screenshot showing the “pickle” badge on some .zip files]

Any feedback and comments would be appreciated.

best,
Cristóbal

julien-c · November 5, 2022, 6:50pm

cc’ing @severo just in case =)

severo · November 7, 2022, 2:02pm

Thanks for reporting. ~~It’s not normal. An error should have been reported. I opened an issue: Dataset Viewer issue for alkzar90/NIH-Chest-X-ray-dataset (Issue #630, huggingface/dataset-viewer)~~

Edit: there is no way to report an error for this kind of problem. Thanks @lhoestq for solving it for this dataset.

lhoestq · November 7, 2022, 2:13pm

The issue comes from the dataset script. I opened a PR at alkzar90/NIH-Chest-X-ray-dataset: “Pass a list instead of an iterator”. @alkzar90
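
The thread does not show the PR’s diff, so the following is only a general sketch of why a list and an iterator can behave differently when the same value is read more than once. The file names and the `count_items` helper are hypothetical and are not taken from the dataset script.

```python
# Hypothetical archive names, used only for illustration.
archive_names = ["images_001.zip", "images_002.zip", "images_003.zip"]


def count_items(items):
    """Consume `items` once and return how many elements it yielded."""
    return sum(1 for _ in items)


# An iterator is exhausted after a single pass.
names_iter = iter(archive_names)
iter_first_pass = count_items(names_iter)
iter_second_pass = count_items(names_iter)

# A list can be traversed as many times as needed.
names_list = list(archive_names)
list_first_pass = count_items(names_list)
list_second_pass = count_items(names_list)

print(iter_first_pass, iter_second_pass)
print(list_first_pass, list_second_pass)
```

If some code needs to walk over the same file names more than once, the second pass over an iterator sees nothing, while a list yields the same elements every time.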

alkzar90 · November 21, 2022, 8:54pm

Hey, I noticed that the data viewer for the train split is still failing after the code modification. Could there be another bug? Maybe it is the size of the training files?

The other thing I noticed is that every zip file now has a pickle “badge.” When I published this post (see the picture above), only a few zip files had one. Why is that?

Thanks

lhoestq · November 22, 2022, 10:06am

I opened another PR at alkzar90/NIH-Chest-X-ray-dataset: “Fix the way it reads the file name”

TLDR: use os.path.basename instead of path.split('/')[-1] to get a filename
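
As a rough illustration of the difference, here is a minimal sketch comparing the two approaches. The paths are hypothetical, and the thread does not say which case affected this dataset. `os.path` resolves to `posixpath` or `ntpath` depending on the platform, so both are called explicitly here to keep the result the same on any machine.

```python
import ntpath
import posixpath


def filename_by_split(path: str) -> str:
    """Naive approach: take whatever follows the last forward slash."""
    return path.split("/")[-1]


# Hypothetical paths, for illustration only.
posix_style = "images/example_001.png"
windows_style = "images\\example_001.png"

# Splitting on "/" only works when the separator really is "/".
split_posix = filename_by_split(posix_style)
split_windows = filename_by_split(windows_style)

# basename uses the separator rules of the path flavour it belongs to.
base_posix = posixpath.basename(posix_style)
base_windows = ntpath.basename(windows_style)

print(split_posix, "|", split_windows)
print(base_posix, "|", base_windows)
```

With the backslash-separated path, splitting on "/" returns the whole string instead of the file name, while `basename` returns `example_001.png` in both cases. It is also my understanding, not something stated in the thread, that the datasets library can substitute its own path helpers for `os.path` functions when a loading script runs in streaming mode, which a manual split would bypass.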
