Generated data from a Parquet file (90MB) results in really large cache files (20GB)

Hi, I have a private dataset repo with hundreds of configs. All datasets in each config are stored in the Parquet format, and the feature set is as follows:

from datasets import Features, Value

features = Features({
    "url": Value(dtype="string", id=None),
    "category": {
        "original": Value(dtype="string", id=None),
        "textType": Value(dtype="string", id=None),
        "major": Value(dtype="string", id=None),
        "minor": Value(dtype="string", id=None),
    },
    "date": Value(dtype="string", id=None),
    "title": Value(dtype="string", id=None),
    "content": [  # A list of dictionaries of texts and their HTML tag types
        {
            "type": Value(dtype="string", id=None),
            "text": Value(dtype="string", id=None),
        }
    ],
})

I've run into a strange issue: for some configs that contain only a single ~100MB Parquet file, the generated dataset takes up 20GB of storage in the cache!
Unfortunately I can't share any reproducible code since the dataset is gated. I just want to understand why this happens and whether there is any way to optimize this behavior.
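
For context, this is roughly how I'm measuring the cache usage; the repo and config names below are placeholders, not the actual gated dataset:

import os
from datasets import load_dataset

# Placeholder repo/config names -- the real dataset is gated.
ds = load_dataset("my-org/my-private-dataset", "some_config", split="train")

# Sum the sizes of the Arrow cache files backing this dataset.
cache_bytes = sum(os.path.getsize(f["filename"]) for f in ds.cache_files)
print(f"Arrow cache size: {cache_bytes / 1e9:.1f} GB")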

It’s possible, because Parquet is a format that takes advantage of column distributions to reduce size: it encodes the data smartly and also compresses it, whereas the local cache stores the data as uncompressed Arrow files. This generally leads to a tremendous ratio between the Parquet file size and the raw data size.
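
As a quick sanity check, you can compare the compressed download size with the uncompressed Arrow size without regenerating the cache. A minimal sketch, assuming placeholder repo and config names (these fields are only filled when the dataset metadata provides them):

from datasets import load_dataset_builder

# Placeholder repo/config names.
builder = load_dataset_builder("my-org/my-private-dataset", "some_config")

info = builder.info
if info.download_size and info.dataset_size:
    print(f"Compressed (Parquet) size: {info.download_size / 1e9:.2f} GB")
    print(f"Uncompressed (Arrow) size: {info.dataset_size / 1e9:.2f} GB")
    print(f"Expansion ratio: {info.dataset_size / info.download_size:.1f}x")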

If 20GB of raw data is too much, you might want to process your data in streaming mode (Stream), which avoids downloading and caching the whole dataset locally.
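
A minimal streaming sketch, again with placeholder repo and config names:

from datasets import load_dataset

# streaming=True returns an IterableDataset: examples are read on the fly
# from the remote Parquet files, so nothing is written to the local cache.
ds = load_dataset("my-org/my-private-dataset", "some_config", split="train", streaming=True)

for example in ds.take(5):
    print(example["title"])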
