Parquet compression for image dataset

Hi, I’m new to HF and recently, while working with the sidewalk-semantic dataset, I noticed that its Parquet version has an incredible compression rate (the raw images are ~2.4 GB, while the Parquet is 324 MB). I also found their post on this forum, but it didn’t help much. So I tried to compress images into Parquet myself, but got an increased size instead of a reduced one.
Here is my code:

from pathlib import Path

from datasets import Dataset
from PIL import Image


if __name__ == "__main__":
    workdir = Path("/path/to/data")
    images = workdir / "images"
    labels = workdir / "labels"
    out_path = workdir / "my_data.parquet"

    # Image.open(...).tobytes() returns the raw decoded pixel buffer,
    # not the PNG-encoded file contents
    data = {
        "images": [
            Image.open(images / "000000.png").tobytes(),
            Image.open(images / "000001.png").tobytes(),
        ],
        "labels": [
            Image.open(labels / "000000.png").tobytes(),
            Image.open(labels / "000001.png").tobytes(),
        ],
    }
    dataset = Dataset.from_dict(data)
    dataset.to_parquet(out_path)

Here I decompressed the PNG images into raw bytes and packed them into Parquet, hoping that to_parquet() would apply its own compression, but it turns out I’m doing something wrong (the resulting Parquet file is bigger than the source images). Please point me in the right direction.
My goal: compress PNG images on disk into a Parquet file with reduced size (like in sidewalk-semantic) locally, without push_to_hub().

You can open a discussion in the repo to ask the authors this question.

How did you get the size of the “raw images”? Parquet usually does not compress raw bytes (images) efficiently, but we still use it for images to be consistent with the other data types. Instead, you should use a different image format (or the compress_level in PIL.Image.save) to reduce the images’ size.
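For example, something along these lines (just a sketch; the paths and quality settings below are placeholders):

from PIL import Image

img = Image.open("/path/to/data/images/000000.png")

# Lossless: re-save the PNG with the maximum zlib compression level
img.save("/path/to/out/000000_max.png", format="PNG", compress_level=9)

# Lossy: switch to JPEG for a much smaller file (fine for photos, not for masks)
img.convert("RGB").save("/path/to/out/000000.jpg", format="JPEG", quality=85)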

Hi @mariosasko, thank you for the feedback! About the “raw images”: I took the dataset field ‘pixel_values’ (bytes type) and decoded the images from it via Image.open(). The resulting folder weighs ~2.4 GB (~2.4 MB per image * 1000 images). I understand your suggestion about compress_level, but the compression used in the dataset seems to be lossless, which is really interesting.
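Roughly, this is how I measured it (a sketch; I’m assuming the segments/sidewalk-semantic repo id, and the output directory is a placeholder):

import io
from pathlib import Path

from datasets import load_dataset
from PIL import Image

ds = load_dataset("segments/sidewalk-semantic", split="train")

out_root = Path("/path/to/raw")
for sub in ("images", "labels"):
    (out_root / sub).mkdir(parents=True, exist_ok=True)

for i, sample in enumerate(ds):
    for column, sub in (("pixel_values", "images"), ("label", "labels")):
        img = sample[column]
        # if the column comes back as raw bytes, decode it first
        if isinstance(img, (bytes, bytearray)):
            img = Image.open(io.BytesIO(img))
        # save everything as (lossless) PNG
        img.save(out_root / sub / f"{i:06d}.png")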

I followed your advice and started a discussion on the dataset page, but the reason I didn’t do that earlier is that, from the forum discussion linked above, I concluded that the compression happens inside the push_to_hub() method of the datasets library and is not specific to the sidewalk-semantic dataset.

310 MB is the size I get when I iterate over the dataset and save its images to an image folder (see the Google Colab notebook).
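Roughly, it does something like this (a sketch, not the exact notebook code; the output folders are placeholders):

from pathlib import Path

from datasets import load_dataset

ds = load_dataset("segments/sidewalk-semantic", split="train")

out = Path("image_folder")
(out / "images").mkdir(parents=True, exist_ok=True)
(out / "labels").mkdir(parents=True, exist_ok=True)

for i, sample in enumerate(ds):
    # the photos are saved as (lossy) JPEG,
    # the segmentation masks as (lossless) PNG
    sample["pixel_values"].save(out / "images" / f"{i:06d}.jpg")
    sample["label"].save(out / "labels" / f"{i:06d}.png")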

Ok, I see that you extracted “pixel_values” as JPG and “label” as PNG, while I extracted both as PNG. So the impressive compression rate they achieved is just plain JPEG :upside_down_face:. Thank you for the help.