Parquet compression for image dataset

Hi, I'm new to HF. Recently, while working with the sidewalk-semantic dataset, I noticed that its Parquet version has an incredible compression rate (the raw images are ~2.4 GB, while the Parquet files are 324 MB). I also found their post on this forum, but it didn't help much. So I tried to compress images into Parquet myself, but ended up with a larger file instead of a smaller one.
Here is my code:

from pathlib import Path
from datasets import Dataset
from PIL import Image


if __name__ == "__main__":
    workdir = Path("/path/to/data")
    images = workdir / "images"
    labels = workdir / "labels"
    out_path = workdir / "my_data.parquet"

    # .tobytes() returns the raw decoded pixel buffer, so the PNG data
    # is decompressed before it is written to the Parquet file
    data = {
        "images": [
            Image.open(images / "000000.png").tobytes(),
            Image.open(images / "000001.png").tobytes(),
        ],
        "labels": [
            Image.open(labels / "000000.png").tobytes(),
            Image.open(labels / "000001.png").tobytes(),
        ],
    }
    dataset = Dataset.from_dict(data)
    dataset.to_parquet(out_path)

There I decompressed the PNG images into raw bytes and packed them into Parquet, hoping that to_parquet() would apply its own compression, but it turns out I'm doing something wrong (the resulting Parquet file is bigger than the source images). Please point me in the right direction.
My goal: compress PNG images on disk into a smaller Parquet file (like in sidewalk-semantic), locally (without push_to_hub).

You can open a discussion in the repo to ask the authors this question.

How did you get the size of the "raw images"? Parquet usually does not compress raw bytes (images) efficiently, but we still use it for images to be consistent with the other data types. Instead, you should use a different image format (or the compress_level in PIL.Image.save) to reduce the images' size.
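For example, a rough sketch with Pillow (the filenames and quality setting are just placeholders):

from PIL import Image

img = Image.open("000000.png")

# Lossless: raise the PNG zlib compression level (0-9, default 6); gains are usually modest
img.save("000000_recompressed.png", compress_level=9)

# Lossy: re-encode as JPEG; this is typically where the big size reductions come from
img.convert("RGB").save("000000.jpg", quality=90)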

Hi @mariosasko, thank you for the feedback! About the "raw images": I took the dataset field 'pixel_values' (bytes type) and decoded the images from it via Image.open(). The resulting folder weighs ~2.4 GB (~2.4 MB per image * 1000). I understand your suggestion about compress_level, but the compression used in the dataset seems to be lossless, which is really interesting.

I followed your advice and started a discussion on the dataset page, but the reason I didn't do that earlier is that, from the forum discussion linked above, I concluded that the compression happens inside the push_to_hub() method of the datasets library and is not specific to the sidewalk-semantic dataset.

310 MB is the size I get when I iterate over the dataset and save its images to an image folder: Google Colab.
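For reference, that kind of export looks roughly like this (my own sketch based on this thread, not the exact notebook; the repo id and column handling are assumptions):

import io
import os
from datasets import load_dataset
from PIL import Image

ds = load_dataset("segments/sidewalk-semantic", split="train")  # repo id assumed

os.makedirs("imagefolder/images", exist_ok=True)
os.makedirs("imagefolder/labels", exist_ok=True)

for i, example in enumerate(ds):
    pixel_values, label = example["pixel_values"], example["label"]
    # Depending on the datasets version, the columns come back either as PIL
    # images (Image feature) or as raw bytes; handle both
    if isinstance(pixel_values, bytes):
        pixel_values = Image.open(io.BytesIO(pixel_values))
    if isinstance(label, bytes):
        label = Image.open(io.BytesIO(label))
    pixel_values.convert("RGB").save(f"imagefolder/images/{i:06d}.jpg")  # photo as lossy JPEG
    label.save(f"imagefolder/labels/{i:06d}.png")  # mask as lossless PNG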

OK, I see: you extracted "pixel_values" as JPG and "label" as PNG, while I extracted both as PNG. So the impressive compression rate they achieved is just plain JPEG :upside_down_face:. Thank you for the help.
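For anyone landing here later, this is a minimal sketch of what worked for me locally to get a comparable reduction (the paths and quality=90 are just placeholders; JPEG is lossy, so I only use it for the photos and keep the masks as PNG):

import io
from pathlib import Path
from datasets import Dataset, Features, Image as HFImage  # aliased to avoid clashing with PIL
from PIL import Image

workdir = Path("/path/to/data")
images_dir = workdir / "images"
labels_dir = workdir / "labels"

def encode(path, fmt, **save_kwargs):
    # Re-encode the file in the given format and return the compressed bytes
    # (instead of the raw pixel buffer from .tobytes())
    buf = io.BytesIO()
    img = Image.open(path)
    if fmt == "JPEG":
        img = img.convert("RGB")
    img.save(buf, format=fmt, **save_kwargs)
    return buf.getvalue()

data = {
    # photos: lossy JPEG -- this is where the big size reduction comes from
    "pixel_values": [encode(p, "JPEG", quality=90) for p in sorted(images_dir.glob("*.png"))],
    # segmentation masks: keep lossless PNG so the class ids are not altered
    "label": [encode(p, "PNG") for p in sorted(labels_dir.glob("*.png"))],
}

features = Features({"pixel_values": HFImage(), "label": HFImage()})
dataset = Dataset.from_dict(data, features=features)
dataset.to_parquet(str(workdir / "my_data.parquet"))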