Auto-converted Parquet is only a fraction of the size

I have a dataset that is already in Parquet format and takes about 13 GiB of space.

But the auto-converted version only uses ~2 GiB.

How can I achieve a compression ratio similar to the parquet-bot's?

I am currently using df.to_parquet.
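For reference, a minimal sketch of that call, assuming pandas with the pyarrow engine (file names are placeholders; pandas defaults to Snappy compression):

```python
import pandas as pd

# Hypothetical input file; the real dataset is loaded elsewhere.
df = pd.read_parquet("original.parquet")

# Current approach: default settings, i.e. Snappy compression via the pyarrow engine.
df.to_parquet("output.parquet", engine="pyarrow")
```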

Oh no, the reason is that we only keep the first 5GB of the Parquet file (see Dataset viewer).

In case you wonder why we convert to Parquet when it was already in Parquet, see https://huggingface.co/datasets/multimolecule/rnacentral/discussions/1:

When the dataset is already in Parquet format, the data are not converted and the files in refs/convert/parquet are links to the original files. This rule has an exception to ensure the dataset viewer API stays fast: if the row group size of the original Parquet files is too big, new Parquet files are generated.

In your case, we had to convert to a Parquet file with smaller row groups, and thus the 5GB limit was applied: the conversion is only partial. You can see this in the dataset viewer.
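To illustrate the idea (this is a hedged sketch with pyarrow, not the exact conversion the bot runs; the row group size here is an arbitrary example):

```python
import pyarrow.parquet as pq

# Rewrite the file with smaller row groups, so a reader (like the viewer)
# only has to load a small chunk to serve a page of rows.
table = pq.read_table("original.parquet")
pq.write_table(table, "smaller_row_groups.parquet", row_group_size=10_000)
```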

Hey, thank you for your explanation.

lol, I spent days trying to get a similar compression ratio and failed; I even thought Hugging Face had developed a new compression algorithm.

But I've also learned many things along the way.

For example, Brotli at compression level 4 has a good compression ratio while maintaining decent speed,
and I'm able to reduce the file size by over 50%~
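A minimal sketch of that setting, assuming pandas with the pyarrow engine (extra keyword arguments such as compression_level are passed through to pyarrow.parquet.write_table; file names are placeholders):

```python
import pandas as pd

df = pd.read_parquet("original.parquet")

# Brotli at level 4: noticeably smaller files than the default Snappy,
# at a still reasonable write speed.
df.to_parquet(
    "brotli_level4.parquet",
    engine="pyarrow",
    compression="brotli",
    compression_level=4,  # forwarded to pyarrow.parquet.write_table
)
```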


Can I confirm what "first 5GiB" means? It seems to be the first 5 GiB of raw data (rather than compressed data).
