Handling decoding errors such as UnidentifiedImageError

I’m using the datasets library to load gigant/oldbookillustrations. It’s stored in Parquet files, with a couple of Image fields plus a handful of short text fields holding assorted metadata.

It loads okay, and many records seem to work just fine, but when I try to filter or map over the whole thing, it throws an UnidentifiedImageError from within the decoder:

File datasets/features/image.py:184, in Image.decode_example(self, value, token_per_repo_id)
    182             image = PIL.Image.open(bytes_)
    183 else:
--> 184     image = PIL.Image.open(BytesIO(bytes_))
    185 image.load()  # to avoid "Too many open files" errors
    186 return image

File PIL/Image.py:3280, in open(fp, mode, formats)
   3278     warnings.warn(message)
   3279 msg = "cannot identify image file %r" % (filename if filename else fp)
-> 3280 raise UnidentifiedImageError(msg)

cannot identify image file <_io.BytesIO object at 0x7fa50f2822a0>

How can I fix this or work around it?

Is there a way to install some kind of error handler on the decoder?

How can I identify which records are failing to decode, so that I can troubleshoot further or repair them?

Can I tell my operation to skip the failing record and move on?

env versions
  • datasets version: 2.15.0
  • Platform: Linux-5.15.0-89-generic-x86_64-with-glibc2.35
  • Python version: 3.11.5
  • huggingface_hub version: 0.19.4
  • PyArrow version: 13.0.0
  • Pandas version: 2.1.0
  • fsspec version: 2023.6.0

What version of Pillow are you using? Can you try updating Pillow?

Pillow version 10.1.0, the current version.

You can disable decoding and apply your own transform to decode the images:

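from datasets import Image

def decode_images(batch):
    # decode_image is your own function: it receives the raw {"bytes", "path"} dict for one record
    batch["rawscan"] = [decode_image(raw_data) for raw_data in batch["rawscan"]]
    return batch

ds = ds.cast_column("rawscan", Image(decode=False))  # return raw bytes instead of decoding
ds = ds.with_transform(decode_images)
ds[0]["rawscan"]  # transformed using decode_images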

You can then iterate over your dataset to print the list of invalid images (e.g. using a try/except in decode_images).
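Here is a minimal sketch of that try/except idea. It assumes the column is called rawscan as above, and that each undecoded value is a dict with "bytes" and "path" keys, which is what Image(decode=False) yields. The helper names try_decode and find_bad_indices are made up for illustration and are not part of datasets.

```python
from io import BytesIO

import PIL.Image


def try_decode(raw):
    """Return a PIL image, or None if the record cannot be decoded.

    `raw` is assumed to be the {"bytes": ..., "path": ...} dict that an
    Image(decode=False) column returns; a missing value may be None.
    """
    if raw is None:
        return None
    try:
        if raw.get("bytes") is not None:
            image = PIL.Image.open(BytesIO(raw["bytes"]))
        else:
            image = PIL.Image.open(raw["path"])
        image.load()  # force a full decode so truncated files fail here too
        return image
    except OSError:
        # PIL.UnidentifiedImageError is a subclass of OSError.
        return None


def decode_images(batch):
    # Same shape as the transform above, but bad records become None
    # instead of raising.
    batch["rawscan"] = [try_decode(raw) for raw in batch["rawscan"]]
    return batch


def find_bad_indices(raw_values):
    """Positions of the values in `raw_values` that fail to decode."""
    return [i for i, raw in enumerate(raw_values) if try_decode(raw) is None]
```

To get the list of failing records, run find_bad_indices(row["rawscan"] for row in ds) after the cast_column call but before with_transform, so that the rows still hold raw bytes. The resulting indices can then be inspected one by one, or excluded with ds.select if skipping them is good enough.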

Thank you, re-casting the type with decode=False worked to get access to that data without having to re-specify all of that type’s internal fields. I managed to track down the problem from there. 👍

(For a few records, the HTML of the source’s 404 page had been saved to that Image field.)
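For anyone hitting the same thing, a cheap way to spot this particular failure is to peek at the first bytes of each record rather than attempting a full decode. The sketch below is only a heuristic under my own assumptions: looks_like_html is a hypothetical helper, and the prefixes it checks are just the common ways an HTML page starts.

```python
def looks_like_html(data: bytes) -> bool:
    """Heuristic: do these bytes start like an HTML document?"""
    # Skip leading whitespace and compare case-insensitively.
    head = data[:512].lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))


# With the column cast to Image(decode=False), each value is assumed to be
# a dict with a "bytes" key, so a scan could look like:
#   suspects = [i for i, row in enumerate(ds)
#               if row["rawscan"]["bytes"] and looks_like_html(row["rawscan"]["bytes"])]
```

This will not catch other kinds of corruption (truncated files, empty payloads), so it complements a try/except around the real decoder rather than replacing it.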