Recommended tabular datasets format

adamnarozniak · February 26, 2024, 11:28am

Hi,
I want to upload a tabular dataset to Hugging Face.

Could you provide a general guideline on which format would be best (CSV, JSON, JSONL, or something converted to e.g. Parquet) for a tabular dataset? I think the most important metric is how fast the dataset is to load and use once it's uploaded.
The factors we could consider are:
a) the size of the dataset, e.g. the number of rows or the size on disk,
b) whether any features, e.g. strings of varying length, would rule out certain formats.
Any special consideration, like dividing the files manually at a certain size point?
What would be the difference between manually deciding on these settings vs using push_to_hub?

I also noticed a fairly recent change from loading scripts to Parquet files (e.g. CIFAR10 is now stored directly in Parquet), so I'm wondering what the general direction is.

severo · February 27, 2024, 8:55am

Hi.

We support any of the following formats: Uploading datasets

You can use whatever works best for you, and I recommend choosing the format that will make it easiest to update and maintain your dataset in the future.

Then, we will automatically convert it to Parquet, to power the dataset viewer, and to let users access it as Parquet: see Dataset viewer and Overview.

If your data is already in Parquet format, we just copy it (in practice, symlink to the original files), provided the files meet some requirements, such as the size of the row groups (see the docs).
