Hi, I have a private dataset repo with hundreds of configs. The data for every config is stored in the Parquet format, and the feature set is as follows:
from datasets import Features, Value

features = Features({
    "url": Value(dtype="string", id=None),
    "category": {
        "original": Value(dtype="string", id=None),
        "textType": Value(dtype="string", id=None),
        "major": Value(dtype="string", id=None),
        "minor": Value(dtype="string", id=None),
    },
    "date": Value(dtype="string", id=None),
    "title": Value(dtype="string", id=None),
    "content": [  # a list of dictionaries, each holding a text chunk and its HTML tag type
        {
            "type": Value(dtype="string", id=None),
            "text": Value(dtype="string", id=None),
        }
    ],
})
I ran into a weird issue: for some configs that consist of only a single ~100MB Parquet file, the generated dataset takes up about 20GB of storage in the cache!
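For context, this is roughly how I load a config and measure the cache usage (the repo and config names below are placeholders, since the real dataset is gated):

import os
from datasets import load_dataset

# Placeholder repo/config names; the actual dataset is gated.
ds = load_dataset("my-org/my-dataset", "some_config", split="train")

# Total size of the Arrow files that `datasets` generated in the cache.
arrow_bytes = sum(os.path.getsize(f["filename"]) for f in ds.cache_files)
print(f"Arrow cache size: {arrow_bytes / 1e9:.1f} GB")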
Unfortunately I cannot share reproducible code since the dataset is gated. I just want to understand why something like this would happen, and whether there is any way to optimize this behavior.