Parquet compression for image dataset

Hi, I’m new to HF and recently, while working with the sidewalk-semantic dataset, I noticed that its Parquet version has an incredible compression rate (the raw images are ~2.4 GB, while the Parquet is 324 MB). I also found their post on this forum, but it didn’t help much. So I tried to compress images into Parquet myself, but got an increased size instead of a reduced one.
Here is my code:

from pathlib import Path

from datasets import Dataset
from PIL import Image


if __name__ == "__main__":
    workdir = Path("/path/to/data")
    images = workdir / "images"
    labels = workdir / "labels"
    out_path = workdir / "my_data.parquet"

    # Image.open(...).tobytes() returns the raw decoded pixel buffer,
    # not the PNG-encoded file contents
    data = {
        "images": [
            Image.open(images / "000000.png").tobytes(),
            Image.open(images / "000001.png").tobytes(),
        ],
        "labels": [
            Image.open(labels / "000000.png").tobytes(),
            Image.open(labels / "000001.png").tobytes(),
        ],
    }
    dataset = Dataset.from_dict(data)
    dataset.to_parquet(out_path)

Here I decompressed the PNG images into raw bytes and packed them into Parquet, hoping that to_parquet() would apply its own compression, but it turns out I’m doing something wrong (the resulting Parquet file is bigger than the source images). Please point me in the right direction.
My goal: compress PNG images on disk into a Parquet file with reduced size (like in sidewalk-semantic) locally, without push_to_hub().

You can open a discussion in the repo to ask the authors this question.

How did you get the size of the “raw images”? Parquet usually does not compress raw bytes (images) efficiently, but we still use it for images to be consistent with the other data types. Instead, you should use a different image format (or the compress_level in PIL.Image.save) to reduce the images’ size.
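For example, something along these lines (just a sketch; the paths and quality settings below are placeholders):

from PIL import Image

img = Image.open("/path/to/data/images/000000.png")

# Lossless: re-save the PNG with the maximum zlib compression level
img.save("/path/to/out/000000_max.png", format="PNG", compress_level=9)

# Lossy: switch to JPEG for a much smaller file (fine for photos, not for masks)
img.convert("RGB").save("/path/to/out/000000.jpg", format="JPEG", quality=85)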

Hi @mariosasko, thank you for the feedback! About the “raw images”: I took the dataset field ‘pixel_values’ (bytes type) and decoded the images from it via Image.open(). The resulting folder weighs ~2.4 GB (~2.4 MB per image * 1000 images). I understand your suggestion about compress_level, but the compression used in the dataset seems to be lossless, which is really interesting.
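Roughly, this is how I measured it (a sketch; I’m assuming the segments/sidewalk-semantic repo id, and the output directory is a placeholder):

import io
from pathlib import Path

from datasets import load_dataset
from PIL import Image

ds = load_dataset("segments/sidewalk-semantic", split="train")

out_root = Path("/path/to/raw")
for sub in ("images", "labels"):
    (out_root / sub).mkdir(parents=True, exist_ok=True)

for i, sample in enumerate(ds):
    for column, sub in (("pixel_values", "images"), ("label", "labels")):
        img = sample[column]
        # if the column comes back as raw bytes, decode it first
        if isinstance(img, (bytes, bytearray)):
            img = Image.open(io.BytesIO(img))
        # save everything as (lossless) PNG
        img.save(out_root / sub / f"{i:06d}.png")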

I followed your advice and started a discussion on the dataset page, but the reason I didn’t do that earlier is that, from the forum discussion linked above, I concluded that the compression happens inside the push_to_hub() method of the datasets library and is not specific to the sidewalk-semantic dataset.

310 MB is the size I get when I iterate over the dataset and save its images to an image folder (see the Google Colab notebook).
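Roughly, it does something like this (a sketch, not the exact notebook code; the output folders are placeholders):

from pathlib import Path

from datasets import load_dataset

ds = load_dataset("segments/sidewalk-semantic", split="train")

out = Path("image_folder")
(out / "images").mkdir(parents=True, exist_ok=True)
(out / "labels").mkdir(parents=True, exist_ok=True)

for i, sample in enumerate(ds):
    # the photos are saved as (lossy) JPEG,
    # the segmentation masks as (lossless) PNG
    sample["pixel_values"].save(out / "images" / f"{i:06d}.jpg")
    sample["label"].save(out / "labels" / f"{i:06d}.png")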

Ok, I see that you extracted “pixel_values” as JPG and “label” as PNG, while I extracted both as PNG. So the impressive compression rate they achieved is just plain JPEG :upside_down_face:. Thank you for the help.