Handling decoding errors such as UnidentifiedImageError

finnAndTheSharks · February 5, 2025, 4:18pm

Yeah, it is an issue. Iterating with the IterableDataset often fails altogether, and when it doesn't fail, it becomes extremely slow.

Edit: OK, I finally found a solution and am attaching it below in case anyone else needs it in the future:

import logging
import warnings
from datasets import load_dataset

def validate_download():
    dataset = load_dataset("ILSVRC/imagenet-1k")
    splits = ["train", "validation", "test"]

    logging.basicConfig(
        filename='/src/experiments/datasets/imagenet/validate_download.log', 
        level=logging.INFO,
        format='%(asctime)s %(levelname)s %(name)s %(message)s'
    )
    logger = logging.getLogger(__name__)

    # Treat all warnings as errors
    warnings.filterwarnings("error")

    for split in splits:
        logger.info(f"Validating all images of split '{split}'...")
        ds = dataset[split]

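        for idx in range(len(ds)):
            try:
                # Index once: every ds[idx] access decodes the image again,
                # so load() and close() must be called on the same object.
                image = ds[idx]['image']
                image.load()  # force a full decode of the pixel data
                image.close()
            except Exception as e:
                logger.error(f"{idx}: {e}")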

    # No longer treat warnings as errors
    warnings.resetwarnings()


if __name__ == '__main__':
    validate_download()
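
One caveat with the script above: warnings.resetwarnings() discards every warning filter, including ones set via -W or by other libraries, and it is skipped if the loop exits with an unexpected exception. A minimal sketch of a scoped alternative follows, using only the standard library. The names find_bad_items and demo_check are hypothetical; demo_check is just a stand-in for the image decoding step so that the example runs without downloading anything.

```python
import warnings
from typing import Callable, Iterable, List, Tuple


def find_bad_items(
    items: Iterable, check: Callable[[object], None]
) -> List[Tuple[int, str]]:
    """Run check(item) on every item and return (index, message) per failure."""
    failures = []
    # catch_warnings() saves the current filters and restores them on exit,
    # even if an unexpected exception escapes the loop.
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # warnings raised inside become exceptions
        for idx, item in enumerate(items):
            try:
                check(item)
            except Exception as e:  # also catches warnings promoted to errors
                failures.append((idx, f"{type(e).__name__}: {e}"))
    return failures


def demo_check(value):
    # Hypothetical stand-in for decoding an image.
    if value == "corrupt":
        raise ValueError("cannot identify image file")
    if value == "truncated":
        warnings.warn("possibly truncated file", UserWarning)


if __name__ == "__main__":
    for idx, message in find_bad_items(["ok", "corrupt", "truncated", "ok"], demo_check):
        print(idx, message)
```

In the validation script, check would be a small function that indexes the dataset once, then calls load() and close() on the returned image; the failures list can be written to the log afterwards.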
