Handling decoding errors such as UnidentifiedImageError

I’m using the datasets library to load gigant/oldbookillustrations. It’s stored in Parquet files, with a couple of Image fields plus a handful of other short text fields holding assorted metadata.

It loads okay, and many records seem to work just fine, but when I try to filter or map over the whole thing, it throws an UnidentifiedImageError from within the decoder:

traceback
File datasets/features/image.py:184, in Image.decode_example(self, value, token_per_repo_id)
    182             image = PIL.Image.open(bytes_)
    183 else:
--> 184     image = PIL.Image.open(BytesIO(bytes_))
    185 image.load()  # to avoid "Too many open files" errors
    186 return image

File PIL/Image.py:3280, in open(fp, mode, formats)
   3278     warnings.warn(message)
   3279 msg = "cannot identify image file %r" % (filename if filename else fp)
-> 3280 raise UnidentifiedImageError(msg)

UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7fa50f2822a0>

How can I fix this or work around it?

Is there a way to install some kind of error handler on the decoder?

How can I identify which records are failing to decode, so that I can troubleshoot further or repair them?

Can I tell my operation to skip the failing record and move on?

env versions
  • datasets version: 2.15.0
  • Platform: Linux-5.15.0-89-generic-x86_64-with-glibc2.35
  • Python version: 3.11.5
  • huggingface_hub version: 0.19.4
  • PyArrow version: 13.0.0
  • Pandas version: 2.1.0
  • fsspec version: 2023.6.0

What version of Pillow are you using? Can you try updating Pillow?

Pillow version 10.1.0, the current version.

You can disable decoding and apply your own transform to decode the images:

from datasets import Image

def decode_images(batch):
    batch["rawscan"] = [decode_image(raw_data) for raw_data in batch["rawscan"]]
    return batch

ds = ds.cast_column("rawscan", Image(decode=False))  # yield raw bytes instead of decoding
ds = ds.with_transform(decode_images)
ds[0]["rawscan"]  # decoded on access using decode_images

Then you can also iterate over your dataset to print the list of invalid images (e.g. using a try/except in decode_images).
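For example, a minimal sketch that collects the offending row indices (this assumes the rawscan column has been cast with Image(decode=False) and no transform is set yet; PIL's verify() is a cheap integrity check):

import io
import PIL.Image

bad_rows = []
for idx, example in enumerate(ds):
    raw = example["rawscan"]  # {"bytes": ..., "path": ...} because decode=False
    try:
        PIL.Image.open(io.BytesIO(raw["bytes"])).verify()
    except Exception:
        bad_rows.append(idx)

print(bad_rows)  # indices of rows PIL cannot identify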


Thank you! Re-casting the column with decode=False worked to get at the raw data without having to re-specify all of that type’s internal fields. I managed to track down the problem from there. :+1:

(For a few records, the HTML of the source’s 404 page had been saved to that Image field.)
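In case it helps anyone else, records like that can be detected and dropped while the column is still cast with decode=False, by sniffing the raw bytes. A minimal sketch (the HTML check is specific to this failure mode):

def looks_like_html(example):
    head = example["rawscan"]["bytes"][:256].lstrip().lower()
    return head.startswith(b"<!doctype") or head.startswith(b"<html")

ds = ds.filter(lambda example: not looks_like_html(example))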

Which function decode_image() are you using in this example? I could not find a suitable import from the datasets library.


Which function decode_image() are you using in this example? I could not find a suitable import from the datasets library.

This would be your own custom decode function.
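A minimal sketch of one (returning None on failure is just one option; you could also log the path and re-raise):

import io
import PIL.Image

def decode_image(raw_data):
    # raw_data is the undecoded {"bytes": ..., "path": ...} dict
    # that the Image(decode=False) feature returns
    try:
        image = PIL.Image.open(io.BytesIO(raw_data["bytes"]))
        image.load()
        return image
    except PIL.UnidentifiedImageError:
        return None  # or log raw_data["path"] to find the bad record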


Thank you @lhoestq!

I was trying to pinpoint exactly which rows fail, but I’m a bit lost in the process. When iterating over the Dataset, I can access each row just fine. But when iterating over the IterableDataset, a warning about EXIF data is thrown:

/pfss/mlde/workspaces/mlde_wsp_KIServiceCenter/finngu/envs/dataset_download/lib/python3.10/site-packages/PIL/TiffImagePlugin.py:949: UserWarning: Corrupt EXIF data. Expecting to read 2 bytes but only got 0.

Do you have any idea how I could efficiently find all rows with faulty EXIF data?

import logging
from datasets import load_dataset

def validate_download():
    dataset = load_dataset("ILSVRC/imagenet-1k")
    splits = ["train", "validation", "test"]

    logging.basicConfig(
        filename='/pfss/mlde/workspaces/mlde_wsp_KIServiceCenter/finngu/LlavaGuard/src/experiments/datasets/imagenet/validate_download.log',
        level=logging.INFO,
        format='%(asctime)s %(levelname)s %(name)s %(message)s'
    )
    logger = logging.getLogger(__name__)

    for split in splits:
        logger.info(f"Validating all images of split '{split}'...")

        iterable_ds = dataset[split].to_iterable_dataset()

        for idx, example in enumerate(iterable_ds):
            try:
                img = example['image']  # decoded PIL image
            except Exception as e:
                logger.error(f"{idx}: {e}")

            if idx % 10000 == 0:
                logger.info(f"\t{idx} images done.")

        # Alternative with the map-style Dataset:
        # ds = dataset[split]
        # for idx in range(len(ds)):
        #     try:
        #         _ = ds[idx]['image']
        #     except Exception as e:
        #         logger.error(f"{idx}: {e}")
        #     if idx % 10000 == 0:
        #         print(f"\t{idx} images done.")

Is the corrupted EXIF really an issue? I think PIL is still able to load the image, no?


I found a similar case.

Yeah, it is an issue. Iterating with the IterableDataset oftentimes fails altogether, and when it doesn’t fail, it becomes extremely slow.

Edit: Ok, I finally found a solution and will attach it below if anyone else needs it in the future:

import logging
import warnings
from datasets import load_dataset

def validate_download():
    dataset = load_dataset("ILSVRC/imagenet-1k")
    splits = ["train", "validation", "test"]

    logging.basicConfig(
        filename='/src/experiments/datasets/imagenet/validate_download.log',
        level=logging.INFO,
        format='%(asctime)s %(levelname)s %(name)s %(message)s'
    )
    logger = logging.getLogger(__name__)

    # Treat all warnings (e.g. "Corrupt EXIF data") as errors so they land in the except below
    warnings.filterwarnings("error")

    for split in splits:
        logger.info(f"Validating all images of split '{split}'...")
        ds = dataset[split]

        for idx in range(len(ds)):
            try:
                img = ds[idx]['image']  # each indexing decodes the image anew, so access it once
                img.load()
                img.close()
            except Exception as e:
                logger.error(f"{idx}: {e}")

    # No longer treat warnings as errors
    warnings.resetwarnings()


if __name__ == '__main__':
    validate_download()
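As a side note, the same check can be done without mutating global warning state by using warnings.catch_warnings(), so the warnings-as-errors filter can’t leak into other code. A variant sketch (reusing ds and logger from above):

import warnings

with warnings.catch_warnings():
    warnings.simplefilter("error")  # warnings raise only inside this block
    for idx in range(len(ds)):
        try:
            img = ds[idx]['image']  # decoding (and EXIF parsing) happens here
            img.load()
            img.close()
        except Exception as e:
            logger.error(f"{idx}: {e}")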