Large image dataset, feedback and advice: data viewer, task template, and more

Hi fellows!

I noticed there aren’t many image datasets in the HuggingFace Hub, especially about medical images. So I contributed and added a benchmark dataset used for radiology images: the NIH Chest-X-ray14 dataset (Wang 2017).

This dataset has more than 100k images with 14 classes (about 40 GB). I learned a lot about structuring the repository and creating the custom loading script, but I still have some questions and problems that maybe someone from the community can help me with:

  1. The data viewer has problems visualizing the train split. I still don’t know what is going wrong, since the test split works. Any advice or a possible solution?

  2. In the _info() method of the dataset builder class, I noticed there isn’t a task template for multi-label problems. Because the label column is a Sequence type, I got an error when I used the ImageClassification task template as shown below. What is the purpose of the task_templates attribute? What are the advantages of having a task template for a multi-label image problem?

                datasets.ImageClassification(image_column="image", label_column="label")

  3. The dataset is quite large and takes time to download, so I was wondering if there is room to improve my loading script and make it more efficient. I don’t have the best internet connection in the world, but it takes approximately 1 hour to download and load the dataset (a Google Colab notebook for loading and exploring the dataset).

  4. I noticed some .zip files have a “pickle” badge in the repository, but others do not. Why the difference, and what does it mean?

Any feedback and comments would be appreciated.



cc’ing @severo just in case=)


Thanks for reporting. This isn’t normal; an error should have been reported. I opened an issue: Dataset Viewer issue for alkzar90/NIH-Chest-X-ray-dataset (Issue #630, huggingface/datasets-server on GitHub)

Edit: there is currently no way to report an error for this kind of problem. Thanks @lhoestq for solving it for this dataset.


The issue comes from the dataset script. I opened a PR at alkzar90/NIH-Chest-X-ray-dataset · Pass a list instead of an iterator @alkzar90 :wink:
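
For context, here is a minimal sketch of why a list is safer than an iterator. It is plain Python, not the actual loading script, and the `shard_paths` name and file names are made up for illustration. An iterator can only be consumed once, so anything that walks it a second time sees nothing, while a list can be traversed repeatedly.

```python
# Hypothetical file names, for illustration only (not the real repository layout).
shard_paths = ["images_001.zip", "images_002.zip", "images_003.zip"]


def count_twice(paths):
    """Walk `paths` twice and return how many items each pass saw."""
    first_pass = sum(1 for _ in paths)
    second_pass = sum(1 for _ in paths)
    return first_pass, second_pass


# An iterator is exhausted after the first pass, so the second pass is empty.
as_iterator = count_twice(iter(shard_paths))

# A list can be traversed as many times as needed.
as_list = count_twice(list(shard_paths))

print(as_iterator)  # (3, 0)
print(as_list)      # (3, 3)
```

If the script builds its file collection with something like a generator expression, wrapping it in `list(...)` before handing it on is usually enough.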


Hey, I noticed that the train dataset viewer is still failing after the code modification. Could there be another bug? :man_detective: Maybe it could be the size of the training files?

The other thing I noticed is that every zip file now has a “pickle” badge. When I published this post (above pic), only a few zip files had one. Why is that?

Thanks :smile:

I opened another PR: alkzar90/NIH-Chest-X-ray-dataset · Fix the way it reads the file name

TLDR: use os.path.basename instead of path.split('/')[-1] to get a filename :wink:
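
A small sketch of the difference, assuming made-up example paths (this is not necessarily the exact case that broke the viewer). `os.path` resolves to `posixpath` or `ntpath` depending on the platform, so the example calls those two modules directly to behave the same everywhere.

```python
import ntpath      # what os.path is on Windows
import posixpath   # what os.path is on Linux/macOS

# Hypothetical paths, for illustration only.
posix_path = "images/00000001_000.png"
windows_path = r"C:\data\images\00000001_000.png"

# Splitting on "/" only works when "/" really is the separator.
split_posix = posix_path.split("/")[-1]
split_windows = windows_path.split("/")[-1]

# basename applies the separator rules of the path flavour it belongs to.
base_posix = posixpath.basename(posix_path)
base_windows = ntpath.basename(windows_path)

print(split_posix)    # 00000001_000.png
print(split_windows)  # C:\data\images\00000001_000.png (nothing was split off)
print(base_posix)     # 00000001_000.png
print(base_windows)   # 00000001_000.png
```

In a loading script you would simply call `os.path.basename(path)` and let Python pick the right rules for the environment it runs in.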
