I noticed there aren’t many image datasets on the Hugging Face Hub, especially medical imaging ones, so I contributed a benchmark dataset for radiology images: the NIH Chest-X-ray14 dataset (Wang 2017).
This dataset has more than 100k images across 14 classes (~40 GB). I learned a lot about structuring the repository and creating the custom loading script, but I still have some questions and problems that maybe someone from the community can help me with:
- The data viewer has problems visualizing the train split. I don’t know what is going wrong, because the test split works fine. Any advice or a possible solution?
- In the `._info()` method of the dataset generator class, I noticed there isn’t a task template for multi-label problems. Since the label column is a `Sequence` type, I got an error when I used the `ImageClassification` task template as below. What is the purpose of the `task_templates` attribute? What are the advantages of having a task template for a multi-label image problem?
  `task_templates=[datasets.ImageClassification(image_column="image", label_column="label")],`
- The dataset is quite large and takes time to download, so I was wondering if there is room to improve my loading script and make it more efficient. I don’t have the best internet connection in the world, but it takes approximately 1 hour to download and load the dataset (a Google Colab for loading and exploring the dataset).
- I noticed some `.zip` files have a “pickle” badge in the repository, but others do not. Why the difference, and what does it mean?
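
To make the multi-label question above more concrete, here is a minimal sketch of the label layout I mean, written in plain Python so it runs without the `datasets` library. It assumes each image carries a variable-length list of class indices (which is why the column is a sequence rather than a single class label). The class names are a shortened illustrative subset, and `to_multi_hot` is a hypothetical helper, not something from the loading script.

```python
from typing import List, Sequence

# Illustrative subset only (assumption): the real dataset has 14 classes.
CLASS_NAMES = ["Atelectasis", "Cardiomegaly", "Effusion", "Infiltration"]


def to_multi_hot(label_ids: Sequence[int], num_classes: int) -> List[int]:
    """Hypothetical helper: turn a list of class indices into a 0/1 vector."""
    vector = [0] * num_classes
    for idx in label_ids:
        if not 0 <= idx < num_classes:
            raise ValueError(f"label index {idx} out of range")
        vector[idx] = 1
    return vector


# Single-label case: one integer per image.
single_label = 1

# Multi-label case: a variable-length list of integers per image.
multi_label = [1, 2]

print(to_multi_hot(multi_label, len(CLASS_NAMES)))  # [0, 1, 1, 0]
print([CLASS_NAMES[i] for i in multi_label])        # ['Cardiomegaly', 'Effusion']
```

As far as I understand, a single-label template expects the first shape (one class per image), which would explain the error with a list-valued label column, but I may be missing something.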
Any feedback and comments would be appreciated.