Handling decoding errors such as UnidentifiedImageError

Thank you @lhoestq!

I was trying to pinpoint exactly which rows fail, but I am a bit lost in the process. When iterating over the Dataset, I can access each row just fine. But when iterating over the IterableDataset, a warning about the EXIF data is emitted:

/pfss/mlde/workspaces/mlde_wsp_KIServiceCenter/finngu/envs/dataset_download/lib/python3.10/site-packages/PIL/TiffImagePlugin.py:949: UserWarning: Corrupt EXIF data. Expecting to read 2 bytes but only got 0.

Do you have any idea how I could efficiently find all rows with faulty EXIF data?

import logging

from datasets import load_dataset


def validate_download():
    dataset = load_dataset("ILSVRC/imagenet-1k")
    splits = ["train", "validation", "test"]

    logging.basicConfig(
        filename='/pfss/mlde/workspaces/mlde_wsp_KIServiceCenter/finngu/LlavaGuard/src/experiments/datasets/imagenet/validate_download.log', 
        level=logging.INFO,
        format='%(asctime)s %(levelname)s %(name)s %(message)s'
    )
    logger = logging.getLogger(__name__)

    for split in splits:
        logger.info(f"Validating all images of split '{split}'...")

        iterable_ds = dataset[split].to_iterable_dataset()

        for idx, example in enumerate(iterable_ds):
            try:
                img = example['image']
            except Exception as e:
                logger.error(f"{idx}: {e}")
            
            if idx % 10000 == 0:
                logger.info(f"\t{idx} images done.")

        # ds = dataset[split]
        # ds_len = len(ds)

        # for idx in range(ds_len):
        #     try:
        #         _ = ds[idx]['image']
        #     except Exception as e:
        #         logger.error(f"{idx}: {e}")
            
        #     if idx % 10000 == 0:
        #         print(f"\t{idx} images done.")
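
One direction I am considering, as a rough sketch rather than a confirmed solution: since the EXIF message is a Python warning and not an exception, the `except` block above would never see it. The idea below is to turn off automatic decoding with `Image(decode=False)`, decode the raw bytes myself with Pillow, and record any warnings per row via `warnings.catch_warnings(record=True)`. The names `check_image_bytes` and `scan_split` are made up for illustration, and I am assuming that calling `getexif()` triggers the same "Corrupt EXIF data" warning and that rows without embedded bytes carry a readable file path.

```python
import io
import warnings


def check_image_bytes(image_bytes, decode=None):
    """Return a list of problems found while decoding one encoded image."""
    if decode is None:
        # Default decoder uses Pillow, imported lazily.
        from PIL import Image as PILImage

        def decode(data):
            with PILImage.open(io.BytesIO(data)) as img:
                img.getexif()  # parse EXIF (assumed source of the warning)
                img.load()     # decode the pixel data as well

    problems = []
    # Record warnings instead of printing them, so they can be tied to a row.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        try:
            decode(image_bytes)
        except Exception as exc:
            problems.append(f"error: {type(exc).__name__}: {exc}")
    problems.extend(f"warning: {w.category.__name__}: {w.message}" for w in caught)
    return problems


def scan_split(split_ds, logger, column="image"):
    """Log every row of one split whose image raises or warns when decoded."""
    from datasets import Image  # lazy import; only needed for the real scan

    # With decode=False each example holds {"bytes": ..., "path": ...}
    # instead of a decoded PIL image.
    raw = split_ds.cast_column(column, Image(decode=False))
    for idx, example in enumerate(raw):
        data = example[column]["bytes"]
        if data is None:  # assumption: such rows point to a file on disk
            with open(example[column]["path"], "rb") as f:
                data = f.read()
        for problem in check_image_bytes(data):
            logger.error(f"{idx}: {problem}")
```

Inside the loop over splits this would be called as `scan_split(dataset[split], logger)`. Because every problem is logged together with `idx`, the affected rows could be collected from the log afterwards.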