Hi fellows!
I noticed there aren't many image datasets on the HuggingFace Hub, especially medical image datasets. So I contributed a benchmark dataset for radiology images: the NIH ChestX-ray14 dataset (Wang et al., 2017).
This dataset has more than 100k images with 14 classes (~40 GB). I learned a lot about structuring the repository and creating the custom loading script, but I still have some questions and problems that maybe someone from the community can help me with:
- The data viewer has problems visualizing the train split. I still don't know what is going wrong, because the test split works fine. Any advice or a possible solution?
- In the `._info()` method of the dataset builder class, I noticed there isn't a task template for multi-label problems. Because the label column is a `Sequence` type, I got an error when I used the `ImageClassification` task template as below. What is the purpose of the `task_templates` attribute? What would be the advantages of having a task template for a multi-label image problem?

      task_templates=[
          datasets.ImageClassification(image_column="image", label_column="label")
      ],
- The dataset is quite large and takes time to download, so I was wondering if there is room to improve my loading script and make it more efficient. I don't have the best internet connection in the world, but it takes approximately 1 hour to download and load the dataset (a Google Colab notebook for loading and exploring the dataset).
- I noticed some `.zip` files have a "pickle" badge in the repository, but others do not. Why the difference, and what does it mean?
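
To make the multi-label question concrete, here is a minimal sketch in plain Python (not the actual loading script) of why the label column ends up as a variable-length sequence of class ids rather than a single class id, which is what a single-label template like `ImageClassification` expects. The names `CLASS_NAMES`, `parse_labels`, and `to_multi_hot` are hypothetical, the class list is a small illustrative subset of the 14 classes, and the `|` separator is an assumption about how findings are joined in the raw annotations.

```python
# Illustrative subset of class names (assumption: the real dataset has 14).
CLASS_NAMES = ["Atelectasis", "Cardiomegaly", "Effusion", "Pneumonia"]
CLASS_TO_ID = {name: idx for idx, name in enumerate(CLASS_NAMES)}


def parse_labels(raw, sep="|"):
    """Turn a separator-joined string of findings into a sorted list of class ids.

    Names that are not in CLASS_TO_ID are skipped, so an image with no
    known finding yields an empty list.
    """
    if not raw:
        return []
    return sorted(CLASS_TO_ID[name] for name in raw.split(sep) if name in CLASS_TO_ID)


def to_multi_hot(label_ids, num_classes=len(CLASS_NAMES)):
    """Encode a variable-length list of class ids as a fixed-length 0/1 vector."""
    vector = [0] * num_classes
    for idx in label_ids:
        vector[idx] = 1
    return vector


example = parse_labels("Cardiomegaly|Effusion")
print(example)                # [1, 2]
print(to_multi_hot(example))  # [0, 1, 1, 0]
```

One image can map to zero, one, or several ids, so the column holds a list per example. A single-label task template has no obvious way to interpret that, which may be what the error is about.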
Any feedback and comments would be appreciated.
Best,
Cristóbal