Large image dataset, feedback and advice: data viewer, task template, and more

Hi fellows!

I noticed there aren’t many image datasets on the Hugging Face Hub, especially medical image datasets. So I contributed a benchmark dataset for radiology images: the NIH Chest-X-ray14 dataset (Wang 2017).

This dataset has more than 100k images with 14 classes (~40 GB). I learned a lot about structuring the repository and creating the custom loading script, but I still have some questions and problems that maybe someone from the community can help me with:

  1. The data viewer has problems visualizing the train split. I don’t know what is going wrong, since the test split works. Any advice or a possible solution?

  2. In the _info() method of the dataset builder class, I noticed there isn’t a task template for multi-label problems. Because the label column is a Sequence type, I got an error when I used the ImageClassification task template as below. What is the purpose of the task_templates attribute? What are the advantages of having a task template for a multi-label image problem?

                datasets.ImageClassification(image_column="image", label_column="label")
  3. The dataset is quite large and takes time to download, so I was wondering if there is room to improve my loading script and make it more efficient. I don’t have the best internet connection in the world, but it takes approximately 1 hour to download and load the dataset (in a Google Colab notebook for loading and exploring the dataset).
  4. I noticed some .zip files have a “pickle” badge in the repository, but others do not. Why the difference, and what does it mean?

Any feedback and comments would be appreciated.


cc’ing @severo just in case=)


Thanks for reporting. It’s not normal. An error should have been reported. I opened an issue: Dataset Viewer issue for alkzar90/NIH-Chest-X-ray-dataset · Issue #630 · huggingface/datasets-server · GitHub

Edit: there is no way to report an error for this kind of problem. Thanks @lhoestq for solving it for this dataset.

The issue comes from the dataset script. I opened a PR at alkzar90/NIH-Chest-X-ray-dataset · Pass a list instead of an iterator @alkzar90 :wink:
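
For anyone hitting something similar, here is a minimal sketch of one general reason a list can work where an iterator does not: an iterator can only be consumed once. This is an assumption about the failure mode, not a description of exactly what the PR changes. The file names and the `count_files` helper are made up for illustration.

```python
def count_files(paths):
    """Consume `paths` once and count the entries (hypothetical helper)."""
    return sum(1 for _ in paths)


# Hypothetical archive names, not the real repository layout.
files = ["images_001.zip", "images_002.zip"]

# An iterator is exhausted after the first pass.
as_iterator = iter(files)
iterator_first_pass = count_files(as_iterator)
iterator_second_pass = count_files(as_iterator)

# A list can be traversed as many times as needed.
as_list = list(files)
list_first_pass = count_files(as_list)
list_second_pass = count_files(as_list)

print(iterator_first_pass, iterator_second_pass)
print(list_first_pass, list_second_pass)
```

If any code reads the argument more than once (for example, once per split or once per retry), the second read of an iterator silently yields nothing, while a list behaves the same every time.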


Hey, I noticed that the train dataset viewer is still failing after the code modification. Could there be another bug? :man_detective: Maybe it could be the size of the training files?

The other thing I noticed is that every zip file now has a pickle “badge.” When I published this post (above pic), only a few zip files had one. Why is that?

Thanks :smile:

I opened another PR: alkzar90/NIH-Chest-X-ray-dataset · Fix the way it reads the file name

TLDR: use os.path.basename instead of path.split('/')[-1] to get a filename :wink:
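
A rough sketch of why the two can differ, under some assumptions: as far as I understand, when a dataset is streamed the file path can be a chained URL rather than a plain local path, and the library handles os.path.basename in a URL-aware way, which a manual split bypasses. The path below and the `chained_basename` helper are hypothetical and only imitate that idea; they are not the library's actual code.

```python
import posixpath

# Hypothetical chained path: "<file inside the archive>::<URL of the archive>"
path = "zip://images/00000001_000.png::https://example.com/images_001.zip"


def naive_filename(p):
    """Take whatever follows the last slash of the whole string."""
    return p.split("/")[-1]


def chained_basename(p):
    """Simplified stand-in for a URL-aware basename (an assumption):
    keep only the part before '::', then take its last component."""
    inner = p.split("::")[0]
    return posixpath.basename(inner)


naive = naive_filename(path)
aware = chained_basename(path)

print(naive)  # the archive name, not the image
print(aware)  # the image file name
```

With a plain local path both approaches give the same result, so the problem only shows up in the viewer or in streaming mode.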

