How to clean/audit your image data?

For folks working on real-world CV projects or generative AI: you may find this open-source Python library useful. It detects common issues (even in famous datasets) such as images that are blurry, under-/over-exposed, low-information, oddly sized, or (near-)duplicates of others.

CleanVision is a new package you can use to quickly audit any image dataset (including HuggingFace Datasets) for a broad range of common issues lurking in real-world data. Instead of relying on manual inspection, which can be time-consuming and lacks coverage, CleanVision provides an automated systematic approach for detecting data issues.

Here’s all the code you need to run CleanVision on any Hugging Face image dataset and get a deeper understanding of the quality of its images.

from cleanvision import Imagelab
from datasets import load_dataset, concatenate_datasets

# Download and concatenate all splits of the dataset
dataset_dict = load_dataset("cifar10")
dataset = concatenate_datasets([d for d in dataset_dict.values()])

# `image_key` is the key of the Image feature in dataset.features
imagelab = Imagelab(hf_dataset=dataset, image_key="img")

# Run all checks and print a report of the detected issues
imagelab.find_issues()
imagelab.report()
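Beyond the printed report, the per-image results are exposed as a pandas DataFrame (`imagelab.issues`), with a boolean flag and a 0–1 quality score (lower = more severe) per issue type, so you can filter or rank flagged images programmatically. Here is a sketch of that pattern on a mock DataFrame of the same general shape; the exact column names depend on which checks ran, so inspect `imagelab.issues.columns` in your own run.

```python
import pandas as pd

# Mock of the per-image results table; in a real run, use `imagelab.issues`.
# Columns follow CleanVision's flag/score pattern (names here are illustrative).
issues = pd.DataFrame(
    {
        "is_blurry_issue": [True, False, False, True],
        "blurry_score": [0.12, 0.95, 0.88, 0.30],
        "is_dark_issue": [False, False, True, False],
        "dark_score": [0.80, 0.91, 0.05, 0.77],
    },
    index=["img_0", "img_1", "img_2", "img_3"],
)

# Images flagged as blurry, worst (lowest score) first
blurry = issues[issues["is_blurry_issue"]].sort_values("blurry_score")
print(blurry.index.tolist())  # ['img_0', 'img_3']

# Images with any issue at all, e.g. to exclude from training
flag_cols = [c for c in issues.columns if c.startswith("is_")]
bad = issues[issues[flag_cols].any(axis=1)].index.tolist()
print(bad)  # ['img_0', 'img_2', 'img_3']
```

The same boolean-mask filtering works on the real `imagelab.issues` table once `find_issues()` has run.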


You can find an example notebook here that walks through using CleanVision with Hugging Face datasets.

Here’s the blog for more details!