Auto-converted Parquet is only a fraction of the size

I have a dataset that is already in Parquet format and takes about 13 GiB of space.

But the auto-converted version only uses ~2 GiB.

How can I achieve a compression ratio similar to the parquet-bot's?

I am currently using df.to_parquet.
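For reference, a minimal sketch of that call, assuming pandas with the pyarrow engine (file names are placeholders; pandas defaults to Snappy compression):

```python
import pandas as pd

# Hypothetical input file; the real dataset is loaded elsewhere.
df = pd.read_parquet("original.parquet")

# Current approach: default settings, i.e. Snappy compression via the pyarrow engine.
df.to_parquet("output.parquet", engine="pyarrow")
```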

Oh no, the reason is that we only keep the first 5GB of the Parquet file (see Dataset viewer).

In case you wonder why we convert to Parquet when it was already in Parquet, see https://huggingface.co/datasets/multimolecule/rnacentral/discussions/1:

When the dataset is already in Parquet format, the data are not converted and the files in refs/convert/parquet are links to the original files. This rule has an exception to ensure the dataset viewer API stays fast: if the row group size of the original Parquet files is too big, new Parquet files are generated.

In your case, we had to convert to a Parquet file with smaller row groups, and thus the 5GB limit was applied: the conversion is only partial. You can see this in the dataset viewer.
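To illustrate the idea (this is a hedged sketch with pyarrow, not the exact conversion the bot runs; the row group size here is an arbitrary example):

```python
import pyarrow.parquet as pq

# Rewrite the file with smaller row groups, so a reader (like the viewer)
# only has to load a small chunk to serve a page of rows.
table = pq.read_table("original.parquet")
pq.write_table(table, "smaller_row_groups.parquet", row_group_size=10_000)
```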

Hey, thank you for your explanation.

lol, I spent days trying to get a similar compression ratio and failed; I even thought Hugging Face had developed a new compression algorithm.

But I've also learned many things along the way.

For example, Brotli at compression level 4 has a good compression ratio while maintaining decent speed,
and I'm able to reduce the file size by over 50%~
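A minimal sketch of that setting, assuming pandas with the pyarrow engine (extra keyword arguments such as compression_level are passed through to pyarrow.parquet.write_table; file names are placeholders):

```python
import pandas as pd

df = pd.read_parquet("original.parquet")

# Brotli at level 4: noticeably smaller files than the default Snappy,
# at a still reasonable write speed.
df.to_parquet(
    "brotli_level4.parquet",
    engine="pyarrow",
    compression="brotli",
    compression_level=4,  # forwarded to pyarrow.parquet.write_table
)
```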


Can I confirm what "first 5GiB" means? It seems to be the first 5 GiB of raw data (rather than compressed data).
