Thank you @lhoestq!
I was trying to pinpoint exactly which rows fail, but I am a bit lost in the process. When iterating over the Dataset, I can access each row just fine. But when iterating over the IterableDataset, a warning about the EXIF data is thrown:
/pfss/mlde/workspaces/mlde_wsp_KIServiceCenter/finngu/envs/dataset_download/lib/python3.10/site-packages/PIL/TiffImagePlugin.py:949: UserWarning: Corrupt EXIF data. Expecting to read 2 bytes but only got 0.
Do you have any idea how I could efficiently find all rows with faulty EXIF data? This is the script I am using:
import logging

from datasets import load_dataset


def validate_download():
    dataset = load_dataset("ILSVRC/imagenet-1k")
    splits = ["train", "validation", "test"]
    logging.basicConfig(
        filename='/pfss/mlde/workspaces/mlde_wsp_KIServiceCenter/finngu/LlavaGuard/src/experiments/datasets/imagenet/validate_download.log',
        level=logging.INFO,
        format='%(asctime)s %(levelname)s %(name)s %(message)s'
    )
    logger = logging.getLogger(__name__)
    for split in splits:
        logger.info(f"Validating all images of split '{split}'...")
        # Variant 1: iterate over the IterableDataset (emits the EXIF warning)
        iterable_ds = dataset[split].to_iterable_dataset()
        for idx, example in enumerate(iterable_ds):
            try:
                img = example['image']
            except Exception as e:
                logger.error(f"{idx}: {e}")
            if idx % 10000 == 0:
                logger.info(f"\t{idx} images done.")
        # Variant 2: index into the map-style Dataset (each row is accessible)
        # ds = dataset[split]
        # ds_len = len(ds)
        # for idx in range(ds_len):
        #     try:
        #         _ = ds[idx]['image']
        #     except Exception as e:
        #         logger.error(f"{idx}: {e}")
        #     if idx % 10000 == 0:
        #         print(f"\t{idx} images done.")
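One thing that might help here: the message above is a UserWarning, not an exception, so a try/except around the row access does not catch it by default. A minimal sketch of recording warnings per row with the standard library's `warnings.catch_warnings` could look like the following. The helper name `rows_with_warnings` and its `touch` parameter are hypothetical, not part of `datasets`. It also assumes that decoding happens in the main thread of the iterating process, since `catch_warnings` is not thread-safe and does not see warnings raised in worker processes.

```python
import warnings
from typing import Any, Callable, Iterable, Iterator, List, Tuple


def rows_with_warnings(
    examples: Iterable[Any],
    touch: Callable[[Any], Any] = lambda example: example,
    category: type = UserWarning,
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (index, messages) for each row that emits a warning.

    Hypothetical helper: warnings are recorded both while the next row is
    fetched from the iterator and while `touch(example)` runs.
    """
    iterator = iter(examples)
    idx = 0
    while True:
        with warnings.catch_warnings(record=True) as caught:
            # "always" bypasses the once-per-location default, so a repeated
            # warning is recorded for every row that triggers it.
            warnings.simplefilter("always", category)
            try:
                example = next(iterator)  # an iterable source may decode here
            except StopIteration:
                return
            touch(example)  # e.g. access a column to force decoding
        messages = [
            str(w.message) for w in caught if issubclass(w.category, category)
        ]
        if messages:
            yield idx, messages
        idx += 1
```

With the script above, this could be used as `for idx, msgs in rows_with_warnings(iterable_ds, touch=lambda ex: ex['image']): logger.error(f"{idx}: {msgs}")`, which would log only the indices of rows that produced a warning. Whether this is fast enough for the full train split is untested.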