Parquet image dataset

I uploaded a dataset with image and text columns in Parquet format. The image column is a dictionary with “bytes” and “path” keys, as required. But the dataset preview does not recognize the images and displays the dictionary instead. How can I fix this so that the images are displayed in the preview section?

Can you share the repository?

Thanks for replying. The dataset is this: LouisChen15/ConstructionSite · Datasets at Hugging Face. I removed the original dataset and left only ten images as a demo, but they show the same issue as described.

cc @lhoestq? Images in a parquet file.

Hi! Pandas sets the type of the “image” column to a struct of bytes and path, since it doesn’t have an image type (yet?).

Ideally it would be great to have a way to define type metadata. Maybe we could have something like this in the future?

df.image.attrs = {"dtype": "image"}
df.to_parquet("hf://datasets/LouisChen15/ConstructionSite/test_split_demo.parquet")

Anyway, right now if you want to set the type to image, you can define the types in the README.md in YAML:

dataset_info:
  features:
  - name: image
    dtype: image
  - name: image_id
    dtype: string
  - name: image_caption
    dtype: string
  - name: illumination
    dtype: string
  - name: camera_distance
    dtype: string
  - name: view
    dtype: string
  - name: quality_of_info
    dtype: string
  - name: rule_1_violation
    struct:
    - name: bounding_box
      sequence:
        sequence: float64
    - name: reason
      dtype: string
  - name: rule_2_violation
    dtype: 'null'
  - name: rule_3_violation
    struct:
    - name: bounding_box
      sequence:
        sequence: float64
    - name: reason
      dtype: string
  - name: rule_4_violation
    dtype: 'null'
  - name: excavator
    sequence:
      sequence: float64
  - name: rebar
    sequence:
      sequence: float64
  - name: worker_with_white_hard_hat
    sequence: 'null'
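
In case it helps with writing that block by hand, here is a minimal sketch that renders a `dataset_info` features block from a Python description of the columns. It only uses the standard library and only covers the shapes that appear in the YAML above (`dtype`, `struct`, `sequence`). The helper names (`type_lines`, `feature_lines`, `render_dataset_info`) and the tuple-based spec format are made up for this illustration; they are not part of `datasets` or pandas.

```python
def _leaf(dtype):
    # A bare null would be read by YAML as "no value", so quote it as in the YAML above.
    return "'null'" if dtype == "null" else dtype


def type_lines(spec, indent):
    """Render the type part of one feature (hypothetical spec format)."""
    pad = " " * indent
    if isinstance(spec, str):
        return [f"{pad}dtype: {_leaf(spec)}"]
    kind, inner = spec
    if kind == "sequence":
        if isinstance(inner, str):
            return [f"{pad}sequence: {_leaf(inner)}"]
        return [f"{pad}sequence:"] + type_lines(inner, indent + 2)
    if kind == "struct":
        lines = [f"{pad}struct:"]
        for name, sub in inner.items():
            lines += feature_lines(name, sub, indent)
        return lines
    raise ValueError(f"unsupported spec: {spec!r}")


def feature_lines(name, spec, indent):
    """Render '- name: ...' followed by its type lines."""
    pad = " " * indent
    return [f"{pad}- name: {name}"] + type_lines(spec, indent + 2)


def render_dataset_info(features):
    """Build the whole block; every column must be present in `features`."""
    lines = ["dataset_info:", "  features:"]
    for name, spec in features.items():
        lines += feature_lines(name, spec, 2)
    return "\n".join(lines)


# A shortened version of the columns from this thread, for illustration only.
features = {
    "image": "image",
    "image_id": "string",
    "rule_1_violation": ("struct", {
        "bounding_box": ("sequence", ("sequence", "float64")),
        "reason": "string",
    }),
    "rule_2_violation": "null",
}
yaml_block = render_dataset_info(features)
print(yaml_block)
```

The printed text is meant to be pasted into the YAML section at the top of README.md. Other feature types would need their own cases.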

Thank you, it finally works. The metadata is exactly what I needed. However, I did try using metadata before to solve the problem, but I only included the image part:

- name: image
  dtype: image

But it did not work. Does that mean I have to correctly specify the data type of every column so that the system can process it?

Yes, all the columns and their types are required in the YAML.
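
To catch this before uploading, a rough sketch of a check could look like the following. It scans the YAML text for top-level `- name:` entries and reports which columns are not declared. The function names and the hand-typed column list are assumptions for the example (in practice the list would come from the Parquet file, e.g. the DataFrame's columns), and the scan relies on the two-space indentation used in the YAML above rather than on a real YAML parser.

```python
def declared_columns(yaml_text, indent=2):
    """Return top-level feature names, assuming the indentation used in this thread."""
    prefix = " " * indent + "- name: "
    return [
        line[len(prefix):].strip()
        for line in yaml_text.splitlines()
        if line.startswith(prefix)  # nested struct fields are indented deeper
    ]


def missing_columns(yaml_text, columns):
    """List the columns that the YAML does not declare, in their original order."""
    declared = set(declared_columns(yaml_text))
    return [c for c in columns if c not in declared]


# The image-only attempt from the question, written with valid list syntax.
partial_yaml = """dataset_info:
  features:
  - name: image
    dtype: image
"""

# Hypothetical subset of the dataset's columns, typed by hand for the example.
columns = ["image", "image_id", "image_caption"]

missing = missing_columns(partial_yaml, columns)
print("Missing from YAML:", missing)
```

An empty list means every column has an entry. It does not check that the declared types are correct.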