Generated data from a Parquet file (90MB) results in really large cache files (20GB)

Hi, I have a private dataset repo with hundreds of configs. All datasets in each config are stored in the Parquet format, and the feature set is as follows:

from datasets import Features, Value

features = Features({
    "url": Value(dtype="string", id=None),
    "category": {
        "original": Value(dtype="string", id=None),
        "textType": Value(dtype="string", id=None),
        "major": Value(dtype="string", id=None),
        "minor": Value(dtype="string", id=None),
    },
    "date": Value(dtype="string", id=None),
    "title": Value(dtype="string", id=None),
    "content": [  # A list of dictionaries of texts and their HTML tag types
        {
            "type": Value(dtype="string", id=None),
            "text": Value(dtype="string", id=None),
        }
    ],
})

I've run into a strange issue: for some configs that contain only a single ~100MB Parquet file, the generated dataset takes up 20GB of storage in the cache!
Unfortunately I can't share any reproducible code since the dataset is gated. I just want to understand why this happens and whether there is any way to optimize this behavior.
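
For context, this is roughly how I'm measuring the cache usage; the repo and config names below are placeholders, not the actual gated dataset:

import os
from datasets import load_dataset

# Placeholder repo/config names -- the real dataset is gated.
ds = load_dataset("my-org/my-private-dataset", "some_config", split="train")

# Sum the sizes of the Arrow cache files backing this dataset.
cache_bytes = sum(os.path.getsize(f["filename"]) for f in ds.cache_files)
print(f"Arrow cache size: {cache_bytes / 1e9:.1f} GB")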

It’s possible, because Parquet is a format that takes advantage of column distributions to reduce size: it encodes the data smartly and also compresses it, whereas the local cache stores the data as uncompressed Arrow files. This generally leads to a tremendous ratio between the Parquet file size and the raw data size.
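
As a quick sanity check, you can compare the compressed download size with the uncompressed Arrow size without regenerating the cache. A minimal sketch, assuming placeholder repo and config names (these fields are only filled when the dataset metadata provides them):

from datasets import load_dataset_builder

# Placeholder repo/config names.
builder = load_dataset_builder("my-org/my-private-dataset", "some_config")

info = builder.info
if info.download_size and info.dataset_size:
    print(f"Compressed (Parquet) size: {info.download_size / 1e9:.2f} GB")
    print(f"Uncompressed (Arrow) size: {info.dataset_size / 1e9:.2f} GB")
    print(f"Expansion ratio: {info.dataset_size / info.download_size:.1f}x")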

If 20GB of raw data is too much, you might want to process your data in streaming mode (Stream), which avoids downloading and caching the whole dataset locally.
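
A minimal streaming sketch, again with placeholder repo and config names:

from datasets import load_dataset

# streaming=True returns an IterableDataset: examples are read on the fly
# from the remote Parquet files, so nothing is written to the local cache.
ds = load_dataset("my-org/my-private-dataset", "some_config", split="train", streaming=True)

for example in ds.take(5):
    print(example["title"])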
