Large image dataset, feedback and advice: data viewer, task template, and more

Hi fellows!

I noticed there aren't many image datasets on the Hugging Face Hub, especially medical image datasets. So I contributed a benchmark dataset for radiology images: the NIH Chest-X-ray14 dataset (Wang 2017).

This dataset has more than 100k images with 14 classes (about 40 GB). I learned a lot about structuring the repository and creating the custom loading script, but I still have some questions and problems that maybe someone from the community can help me with:

  1. The data viewer has problems visualizing the train split. I still don't know what is going wrong, because the test split works. Any advice or a possible solution?

  2. In the _info() method of the dataset builder class, I noticed there isn't a task template for multi-label problems. Because the label column is a Sequence type, I got an error when I used the ImageClassification task template as below. What is the purpose of the task_templates attribute? What are the advantages of having a task template for a multi-label image problem?

            task_templates=[
                datasets.ImageClassification(image_column="image", label_column="label")
            ],
  3. The dataset is quite large and takes time to download, so I was wondering if there is room to improve my loading script and make it more efficient. I don't have the best internet connection in the world, but it takes approximately 1 hour to download and load the dataset (a Google Colab for loading and exploring the dataset).
  4. I noticed some .zip files have a "pickle" badge in the repository, but others do not. Why the difference, and what does it mean?

Any feedback and comments would be appreciated.

best,
Cristóbal


cc'ing @severo just in case =)


Thanks for reporting. It's not normal. An error should have been reported. I opened an issue: Dataset Viewer issue for alkzar90/NIH-Chest-X-ray-dataset · Issue #630 · huggingface/datasets-server · GitHub

Edit: no way to report an error for this kind of problem. Thanks @lhoestq for solving it for this dataset.


The issue comes from the dataset script. I opened a PR at alkzar90/NIH-Chest-X-ray-dataset · Pass a list instead of an iterator @alkzar90 :wink:
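For anyone wondering why a list versus an iterator can matter: the thread does not spell out the exact failure, but one general difference is that an iterator can only be consumed once, while a list can be traversed any number of times. Here is a minimal sketch in plain Python, with made-up archive names (they are assumptions, not the real file names in the repository):

```python
# Hypothetical archive names, for illustration only.
archive_names = ["images_001.zip", "images_002.zip", "images_003.zip"]

# An iterator is exhausted after the first full pass.
as_iterator = iter(archive_names)
first_pass = list(as_iterator)
second_pass = list(as_iterator)  # nothing left to yield

# A list can be traversed as many times as needed.
as_list = list(archive_names)
first_list_pass = [name for name in as_list]
second_list_pass = [name for name in as_list]

print(len(first_pass), len(second_pass))
print(len(first_list_pass), len(second_list_pass))
```

If any code reads the value more than once, the second read of an iterator silently sees nothing, which is the kind of bug that produces no explicit error.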


Hey, I noticed that the dataset viewer is still failing on the train split after the code modification. Could there be another bug? :man_detective: Maybe it is the size of the training files?

The other thing I noticed is that every zip file now has a pickle "badge." When I published this post (above pic), only a few zip files had a pickle. Why is that?

Thanks :smile:

I opened another PR: alkzar90/NIH-Chest-X-ray-dataset · Fix the way it reads the file name

TL;DR: use os.path.basename instead of path.split('/')[-1] to get a filename :wink:
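To illustrate the general point (the thread does not say which paths broke the script, so the paths below are hypothetical): splitting on '/' assumes one specific separator, while os.path.basename follows the path rules of the platform it runs on. os.path is ntpath on Windows and posixpath elsewhere, so using those two modules directly shows both behaviours on any machine:

```python
import ntpath      # what os.path resolves to on Windows
import posixpath   # what os.path resolves to on Linux and macOS

# Hypothetical paths, for illustration only.
windows_path = r"C:\data\images\00000001_000.png"
posix_path = "/data/images/00000001_000.png"

# Splitting on "/" finds no separator in the Windows path,
# so it returns the whole string instead of the file name.
naive_windows = windows_path.split("/")[-1]
robust_windows = ntpath.basename(windows_path)

# On a POSIX-style path both approaches agree.
naive_posix = posix_path.split("/")[-1]
robust_posix = posixpath.basename(posix_path)

print(naive_windows)
print(robust_windows)
print(naive_posix, robust_posix)
```

On paths that only use '/', the two approaches give the same result, which is why the split version can appear to work until it meets a different kind of path.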
