Recommended tabular dataset format

I want to upload a tabular dataset to Hugging Face.

  1. Could you provide a general guideline on which format would be best (CSV, JSON, JSONL, or perhaps converted to e.g. Parquet) for a tabular dataset? I think the most important metric would be access speed once the dataset is uploaded and in use.
    The factors we could consider are:
    a) the size of the dataset, e.g. in terms of the number of rows or alternatively its total size,
    b) whether there are any features, e.g. strings of varying length, that would rule out certain formats.
    Are there any special considerations, like manually splitting the files at a certain size threshold?
  2. What would be the difference between choosing these settings manually vs. using push_to_hub?

I also noticed a fairly recent change from loading scripts to Parquet files directly (e.g. CIFAR10 is now stored directly in Parquet), so I’m wondering what the general direction is.


We support any of the formats listed in Uploading datasets.

You can use whichever works best for you, and I recommend choosing the format that will help you update/maintain your dataset in the future.

We will then automatically convert it to Parquet to power the dataset viewer and to let users access it as Parquet: see Dataset viewer and Overview.

If your data is already in Parquet format, we simply point (symlink) to the original files (provided they respect certain constraints, such as row group size; see the docs).