I’m using the datasets library to load gigant/oldbookillustrations. It’s stored in parquet files, with a couple of Image fields plus a handful of short text fields holding assorted metadata.
It loads without complaint, and many records decode just fine, but when I try to filter or map over the whole dataset, the decoder throws an UnidentifiedImageError:
Traceback:
File datasets/features/image.py:184, in Image.decode_example(self, value, token_per_repo_id)
182 image = PIL.Image.open(bytes_)
183 else:
--> 184 image = PIL.Image.open(BytesIO(bytes_))
185 image.load() # to avoid "Too many open files" errors
186 return image
File PIL/Image.py:3280, in open(fp, mode, formats)
3278 warnings.warn(message)
3279 msg = "cannot identify image file %r" % (filename if filename else fp)
-> 3280 raise UnidentifiedImageError(msg)
UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7fa50f2822a0>
How can I fix this or work around it?
Is there a way to install some kind of error handler on the decoder?
How can I identify which records are failing to decode, so that I can troubleshoot further or repair them?
Can I tell my operation to skip the failing record and move on?
Thank you! Re-casting the type with decode=False did the trick: I could access the raw data without having to re-specify all of that type’s internal fields, and I managed to track down the problem from there.
(For a few records, the HTML of the source’s 404 page had been saved to that Image field.)
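For reference, this is roughly what I ran (a sketch from memory; the column name "rawscan" is a placeholder, substitute whichever Image column is failing for you):

import io

import PIL.Image
from datasets import Image, load_dataset

ds = load_dataset("gigant/oldbookillustrations", split="train")

# Re-cast the column so it yields the raw {"bytes", "path"} payload
# instead of eagerly decoding to a PIL image.
# "rawscan" is a placeholder column name.
ds = ds.cast_column("rawscan", Image(decode=False))

# Decode each record by hand to find the rows PIL chokes on.
bad_rows = []
for idx, example in enumerate(ds):
    try:
        PIL.Image.open(io.BytesIO(example["rawscan"]["bytes"])).load()
    except PIL.UnidentifiedImageError:
        bad_rows.append(idx)

print(bad_rows)

Printing the first few bytes of those records is what revealed the 404 HTML.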
I was trying to pinpoint exactly which rows fail, but I’m a bit lost. When iterating over the Dataset, I can access each row just fine, but when iterating over the IterableDataset, a warning about corrupt EXIF data is thrown:
/pfss/mlde/workspaces/mlde_wsp_KIServiceCenter/finngu/envs/dataset_download/lib/python3.10/site-packages/PIL/TiffImagePlugin.py:949: UserWarning: Corrupt EXIF data. Expecting to read 2 bytes but only got 0.
Do you have any idea how I could find all rows with faulty EXIF data efficiently? My attempt so far:
import logging

from datasets import load_dataset


def validate_download():
    dataset = load_dataset("ILSVRC/imagenet-1k")
    splits = ["train", "validation", "test"]
    logging.basicConfig(
        filename='/pfss/mlde/workspaces/mlde_wsp_KIServiceCenter/finngu/LlavaGuard/src/experiments/datasets/imagenet/validate_download.log',
        level=logging.INFO,
        format='%(asctime)s %(levelname)s %(name)s %(message)s'
    )
    logger = logging.getLogger(__name__)
    for split in splits:
        logger.info(f"Validating all images of split '{split}'...")
        iterable_ds = dataset[split].to_iterable_dataset()
        for idx, example in enumerate(iterable_ds):
            try:
                # Accessing the field is what triggers the image decode.
                _ = example['image']
            except Exception as e:
                logger.error(f"{idx}: {e}")
            if idx % 10000 == 0:
                logger.info(f"\t{idx} images done.")
        # Alternative: index into the map-style Dataset directly.
        # ds = dataset[split]
        # for idx in range(len(ds)):
        #     try:
        #         _ = ds[idx]['image']
        #     except Exception as e:
        #         logger.error(f"{idx}: {e}")
        #     if idx % 10000 == 0:
        #         print(f"\t{idx} images done.")
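One idea I am toying with, since the corrupt-EXIF message is only a UserWarning and never reaches the except block above: decode the images by hand and promote warnings to errors for the duration of the loop. This is only a sketch and assumes the image bytes are embedded in the parquet shards:

import io
import warnings

import PIL.Image
from datasets import Image, load_dataset


def find_faulty_exif(split="train"):
    ds = load_dataset("ILSVRC/imagenet-1k", split=split)
    # Skip automatic decoding so the warning fires inside our own PIL call
    # rather than deep inside the dataset iterator.
    ds = ds.cast_column("image", Image(decode=False))
    faulty = []
    for idx, example in enumerate(ds):
        with warnings.catch_warnings():
            warnings.simplefilter("error")  # turn warnings into exceptions
            try:
                PIL.Image.open(io.BytesIO(example["image"]["bytes"])).load()
            except Exception as e:
                faulty.append((idx, str(e)))
    return faulty

If a plain loop turns out to be too slow, the same per-row check should also work inside Dataset.map with num_proc set, since each row is independent.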