Handling decoding errors such as UnidentifiedImageError

You can disable decoding and apply your own transform to decode the images:

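from datasets import Image

# decode_image is your own function; with decode=False each raw_data is a dict with "bytes" and "path" keys
def decode_images(batch):
    batch["rawscan"] = [decode_image(raw_data) for raw_data in batch["rawscan"]]
    return batch

ds = ds.cast_column("rawscan", Image(decode=False))  # keep the raw bytes instead of decoding
ds = ds.with_transform(decode_images)
ds[0]["rawscan"]  # transformed using decode_images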

Then you can also iterate over your dataset to list the invalid images (e.g. using a try/except in decode_images).
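
For illustration, here is a rough sketch of what that try/except could look like. The helper names safe_decode and find_invalid_indices are hypothetical (they are not part of datasets or Pillow), and the sketch assumes each raw value is the {"bytes": ..., "path": ...} dict you get with Image(decode=False), with the bytes field populated:

```python
import io

from PIL import Image as PILImage
from PIL import UnidentifiedImageError


def safe_decode(raw_data):
    """Return a PIL image, or None if the bytes cannot be decoded.

    Assumption: raw_data is a dict with a populated "bytes" key,
    as produced by a column cast to Image(decode=False).
    """
    try:
        image = PILImage.open(io.BytesIO(raw_data["bytes"]))
        image.load()  # force full decoding so truncated files fail here
        return image
    except (UnidentifiedImageError, OSError):
        return None


def decode_images(batch):
    # Invalid images become None instead of raising
    batch["rawscan"] = [safe_decode(raw_data) for raw_data in batch["rawscan"]]
    return batch


def find_invalid_indices(rows):
    """Indices of rows whose "rawscan" value fails to decode (hypothetical helper)."""
    return [i for i, row in enumerate(rows) if safe_decode(row["rawscan"]) is None]
```

Calling find_invalid_indices(ds) on the dataset after the cast_column step, but before with_transform, should give you the positions of the broken images, which you could then filter out. Returning None for bad images is only one option; you could also log the path or re-raise with more context.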
