How to clean/audit your image data?

For folks working on real-world CV projects or generative AI: you may find this open-source Python library useful. It detects common issues (even in famous datasets) such as images that are blurry, under-/over-exposed, low-information, oddly sized, or (near-)duplicates of others.

CleanVision is a new package you can use to quickly audit any image dataset (including HuggingFace Datasets) for a broad range of common issues lurking in real-world data. Instead of relying on manual inspection, which can be time-consuming and lacks coverage, CleanVision provides an automated systematic approach for detecting data issues.

Here’s all the code you need to run CleanVision on any Hugging Face image dataset and get a deeper understanding of the quality of its images.

from cleanvision import Imagelab
from datasets import load_dataset, concatenate_datasets

# Download and concatenate all splits of the dataset
dataset_dict = load_dataset("cifar10")
dataset = concatenate_datasets([d for d in dataset_dict.values()])

# `image_key` is the key of the Image feature in dataset.features
imagelab = Imagelab(hf_dataset=dataset, image_key="img")

# Run all checks and print a report of the detected issues
imagelab.find_issues()
imagelab.report()
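Beyond the printed report, the per-image results are exposed as a pandas DataFrame (`imagelab.issues`), with a boolean flag and a 0–1 quality score (lower = more severe) per issue type, so you can filter or rank flagged images programmatically. Here is a sketch of that pattern on a mock DataFrame of the same general shape; the exact column names depend on which checks ran, so inspect `imagelab.issues.columns` in your own run.

```python
import pandas as pd

# Mock of the per-image results table; in a real run, use `imagelab.issues`.
# Columns follow CleanVision's flag/score pattern (names here are illustrative).
issues = pd.DataFrame(
    {
        "is_blurry_issue": [True, False, False, True],
        "blurry_score": [0.12, 0.95, 0.88, 0.30],
        "is_dark_issue": [False, False, True, False],
        "dark_score": [0.80, 0.91, 0.05, 0.77],
    },
    index=["img_0", "img_1", "img_2", "img_3"],
)

# Images flagged as blurry, worst (lowest score) first
blurry = issues[issues["is_blurry_issue"]].sort_values("blurry_score")
print(blurry.index.tolist())  # ['img_0', 'img_3']

# Images with any issue at all, e.g. to exclude from training
flag_cols = [c for c in issues.columns if c.startswith("is_")]
bad = issues[issues[flag_cols].any(axis=1)].index.tolist()
print(bad)  # ['img_0', 'img_2', 'img_3']
```

The same boolean-mask filtering works on the real `imagelab.issues` table once `find_issues()` has run.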


You can find an example notebook here that walks through using CleanVision with Hugging Face datasets.

Here’s the blog for more details!