Hi, I'm new to HF. Recently, while working with the sidewalk-semantic dataset, I noticed that its parquet version has an incredible compression rate (the raw images are ~2.4 GB, while the parquet is 324 MB). I also found a related post on this forum, but it didn't help much. So I tried to compress images into parquet myself, but the size increased instead of decreasing.
Here is my code:
from pathlib import Path

from datasets import Dataset
from PIL import Image

if __name__ == "__main__":
    workdir = Path("/path/to/data")
    images = workdir / "images"
    labels = workdir / "labels"
    out_path = workdir / "my_data.parquet"

    # Decode each PNG and keep the raw pixel bytes
    data = {
        "images": [
            Image.open(images / "000000.png").tobytes(),
            Image.open(images / "000001.png").tobytes(),
        ],
        "labels": [
            Image.open(labels / "000000.png").tobytes(),
            Image.open(labels / "000001.png").tobytes(),
        ],
    }

    dataset = Dataset.from_dict(data)
    dataset.to_parquet(out_path)
Here I decompressed the PNG images into raw pixel bytes and packed them into parquet, hoping that to_parquet() would apply its own compression, but apparently I'm doing something wrong: the resulting parquet is bigger than the source images. Please point me in the right direction.
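My current guess is that .tobytes() returns the decoded pixels, so the PNG compression is thrown away before the data even reaches parquet, and parquet's generic compression can't win it back. Would simply storing the still-compressed PNG file bytes be the right idea? A minimal sketch of what I mean (same paths as above, untested):

from pathlib import Path

from datasets import Dataset

workdir = Path("/path/to/data")
images = workdir / "images"
labels = workdir / "labels"

# Store the PNG files' bytes as-is, so the PNG compression is preserved
data = {
    "images": [
        (images / "000000.png").read_bytes(),
        (images / "000001.png").read_bytes(),
    ],
    "labels": [
        (labels / "000000.png").read_bytes(),
        (labels / "000001.png").read_bytes(),
    ],
}

dataset = Dataset.from_dict(data)
dataset.to_parquet(workdir / "my_data.parquet")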
My goal: compress PNG images on disk into a parquet file of reduced size (like in sidewalk-semantic), locally (without push_to_hub).
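I have also seen the datasets Image feature mentioned in other threads, and I wonder whether that is what the sidewalk-semantic parquet actually uses. The sketch below is what I imagine (untested; I'm assuming the Image feature accepts a dict with "bytes" and "path" keys and keeps the encoded PNG bytes in the parquet):

from pathlib import Path

from datasets import Dataset, Features, Image

workdir = Path("/path/to/data")
images = workdir / "images"
labels = workdir / "labels"
names = ["000000.png", "000001.png"]

# Declare both columns as Image features so loaders decode them as images
features = Features({"images": Image(), "labels": Image()})

data = {
    "images": [{"bytes": (images / n).read_bytes(), "path": n} for n in names],
    "labels": [{"bytes": (labels / n).read_bytes(), "path": n} for n in names],
}

dataset = Dataset.from_dict(data, features=features)
dataset.to_parquet(workdir / "my_data_image_feature.parquet")

Is either of these the right direction?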