I’m using the datasets library to load gigant/oldbookillustrations. It’s stored in Parquet files, with a couple of Image fields plus a handful of short text fields holding assorted metadata. It loads fine, and many records work without issue, but when I try to filter or map over the whole dataset, it throws an UnidentifiedImageError from within the decoder:
```
File datasets/features/image.py:184, in Image.decode_example(self, value, token_per_repo_id)
    182     image = PIL.Image.open(bytes_)
    183 else:
--> 184     image = PIL.Image.open(BytesIO(bytes_))
    185 image.load()  # to avoid "Too many open files" errors
    186 return image

File PIL/Image.py:3280, in open(fp, mode, formats)
   3278 warnings.warn(message)
   3279 msg = "cannot identify image file %r" % (filename if filename else fp)
-> 3280 raise UnidentifiedImageError(msg)

UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7fa50f2822a0>
```
How can I fix this or work around it?
Is there a way to install some kind of error handler on the decoder?
How can I identify which records fail to decode, so that I can troubleshoot further or repair them?
Can I tell my operation to skip a failing record and move on?
Environment:
- datasets version: 2.15.0
- Platform: Linux-5.15.0-89-generic-x86_64-with-glibc2.35
- Python version: 3.11.5
- huggingface_hub version: 0.19.4
- PyArrow version: 13.0.0
- Pandas version: 2.1.0
- fsspec version: 2023.6.0