Recommended tabular datasets format

Hi,
I want to upload a tabular dataset to Hugging Face.

  1. Could you provide a general guideline on which format is best for a tabular dataset (CSV, JSON, JSONL, or a conversion to e.g. Parquet)? I think the most important metric is how fast the dataset is to use once it has been uploaded.
    The factors we could consider are:
    a) the size of the dataset, e.g. the number of rows or the size on disk,
    b) whether any features, e.g. strings of varying length, would rule out certain formats.
    Is there any special consideration, like splitting the files manually at a certain size?
  2. What would be the difference between manually deciding on these settings vs using push_to_hub?

I also noticed a fairly recent move from loading scripts to storing Parquet files directly (e.g. CIFAR10 is now stored directly in Parquet), so I'm wondering what the general direction is.

Hi.

We support any of the formats listed in the docs: see Uploading datasets.

You can use whatever works best for you. I recommend choosing the format that will make it easiest to update and maintain your dataset in the future.

We then automatically convert it to Parquet, both to power the dataset viewer and to let users access the data as Parquet: see Dataset viewer and Overview.

If your data is already in Parquet format, we just copy (symlink) the original files, provided they meet certain requirements such as the length of the row groups (see the docs).