Dataset Viewer issue: StreamingRowsError

The dataset viewer is not working.

The images I uploaded open without issues when tested locally, but the viewer returns an error after upload.

Is there any code or tool related to packaging images into Parquet files locally?

Error details:

Error code:   StreamingRowsError
Exception:    OSError
Message:      image file is truncated (20 bytes not processed)
Traceback:    Traceback (most recent call last):
                File "/src/services/worker/src/worker/utils.py", line 99, in get_rows_or_raise
                  return get_rows(
                File "/src/libs/libcommon/src/libcommon/utils.py", line 197, in decorator
                  return func(*args, **kwargs)
                File "/src/services/worker/src/worker/utils.py", line 77, in get_rows
                  rows_plus_one = list(itertools.islice(ds, rows_max_number + 1))
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 2097, in __iter__
                  example = _apply_feature_types_on_example(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 1635, in _apply_feature_types_on_example
                  decoded_example = features.decode_example(encoded_example, token_per_repo_id=token_per_repo_id)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/features/features.py", line 2044, in decode_example
                  return {
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/features/features.py", line 2045, in <dictcomp>
                  column_name: decode_nested_example(feature, value, token_per_repo_id=token_per_repo_id)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/features/features.py", line 1405, in decode_nested_example
                  return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/features/image.py", line 188, in decode_example
                  image.load()  # to avoid "Too many open files" errors
                File "/src/services/worker/.venv/lib/python3.9/site-packages/PIL/ImageFile.py", line 288, in load
                  raise OSError(msg)
              OSError: image file is truncated (20 bytes not processed)

cc @lhoestq .

Maybe this.

Thank you for your answer. I'd like to display the images in the Dataset Viewer, but it seems I can't modify the Dataset Viewer's code myself.

The Dataset Viewer is probably the one built into HF’s web UI. It looks like the only thing to do is to raise an issue on GitHub.

It seems that you can use the datasets library to convert to parquet on your own.

Is there any code or tool related to packaging images into Parquet files locally? Thank you.

I found it, but it’s still quite cumbersome🤢, so let’s change our approach.
https://arrow.apache.org/docs/python/parquet.html

If you open each image file this way and save it again, it will probably be converted to a format that HF can read automatically. However, be careful: this overwrites the files, so back up the whole dataset first.

def fix_truncated_image(filename):
    from PIL import ImageFile, Image
    # Tell Pillow to tolerate truncated files instead of raising OSError
    ImageFile.LOAD_TRUNCATED_IMAGES = True
    image = Image.open(filename)
    image.load()          # force a full decode before overwriting the file
    image.save(filename)  # re-encode; the saved copy is no longer truncated

fix_truncated_image("data001.jpg")
fix_truncated_image("data002.jpg")

Since there are probably a lot of images, it would be better to automate the process with Python’s glob or something similar.
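Something along these lines (a sketch; the directory name, pattern, and `resave_images` helper are assumptions):

```python
import glob
import os
from PIL import ImageFile, Image

ImageFile.LOAD_TRUNCATED_IMAGES = True  # tolerate truncated files

def resave_images(root, pattern="*.jpg"):
    """Re-save every matching image under `root` in place."""
    for path in glob.glob(os.path.join(root, "**", pattern), recursive=True):
        image = Image.open(path)
        image.load()        # decode fully; any missing tail is padded
        image.save(path)    # overwrite with a complete re-encoded file
```

Then a single `resave_images("dataset")` covers the whole tree instead of listing files one by one.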