Handle errors when loading images (404, corrupted, etc)

Hello, I am loading images into a Dataset by casting their URLs as datasets.Image objects.

from datasets import Dataset, Image

def load_dataset(db_client: DBClient) -> Dataset:
    """Loads the dataset from the given bucket."""
    paths = list(db_client.missing_image_paths())

    def url_from_path(path: str) -> str:
        return f'gs://{BUCKET}/{FOLDER}{path}'

    return Dataset.from_dict({
        'image': [url_from_path(path) for path in paths],
        'filename': paths
    }).cast_column('image', Image())

Now, some of these images don’t exist anymore. So with print(dataset[0]), I get:
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1920x1280 at 0x132D99190>, 'filename': 'b57ed2793e6a8ae06382c78a87863b8d.jpg'} ✅

But if I try to load more, at some point, I get a message similar to this: PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x13592fab0> 🚫

Is there some way to specify that we want to ignore those issues and discard the images when that happens?

Hi! You can remove invalid image files with:

import PIL.Image
import datasets

dataset = dataset.cast_column("image", datasets.Image(decode=False))

def has_valid_image(ex):
    # With decode=False, ex["image"] is a dict with "path" and "bytes" keys
    try:
        PIL.Image.open(ex["image"]["path"])
    except Exception:
        return False
    return True

dataset = dataset.filter(has_valid_image)
dataset = dataset.cast_column("image", datasets.Image(decode=True))
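For reference, the validity check itself can be sketched independently of datasets, using only Pillow. This is a minimal, hypothetical helper (not part of the snippet above) that works on raw bytes; PIL.Image.open raises UnidentifiedImageError when the data is not a recognizable image:

```python
import io

import PIL.Image


def is_valid_image(data: bytes) -> bool:
    """Return True if the bytes decode to a valid image, False otherwise."""
    try:
        # verify() checks the file for integrity without fully decoding pixels
        PIL.Image.open(io.BytesIO(data)).verify()
        return True
    except Exception:
        return False
```

The same idea applies whether the input is a path, a file object, or bytes: any decoding failure is caught and turned into a boolean.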

Hi @mariosasko, and thanks. It would work, but it requires loading the images twice. I wonder if there is any way to do the loading on the fly instead, to halve the amount of work.

Here is the lazy approach:

dataset = dataset.cast_column("image", datasets.Image(decode=False))

def invalid_images_as_none(batch):
    images = []
    for image in batch["image"]:
        try:
            image = PIL.Image.open(image["path"])
        except Exception:
            image = None
        images.append(image)
    batch["image"] = images
    return batch

dataset = dataset.with_transform(invalid_images_as_none)
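To illustrate the shape of such a batch transform outside of datasets, here is a minimal standalone sketch. The function name and the bytes-based storage are illustrative assumptions (the snippet above opens image["path"] instead); the point is that each batch maps invalid entries to None rather than raising:

```python
import io

import PIL.Image


def invalid_images_as_none_bytes(batch):
    """Batch transform: decode each image's bytes, substituting None on failure."""
    images = []
    for data in batch["image"]:
        try:
            img = PIL.Image.open(io.BytesIO(data))
        except Exception:
            img = None
        images.append(img)
    batch["image"] = images
    return batch
```

Because with_transform applies the function only when rows are accessed, the decoding cost is paid once, at read time, and failed decodes simply surface as None entries you can skip downstream.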

Thank you! @mariosasko