Understanding data of dataset_infos.json

dk-crazydiv · June 28, 2021, 5:52pm

Hi everyone,

I was exploring dataset_infos.json and couldn’t figure out what some of the keys in the file represent. Could someone please point me to a reference that I could use for key descriptions?

Examples of confusing keys: “download_size”, “dataset_size”, “size_in_bytes”, “post_processing_size” and “num_bytes” (under splits).

Two other keys I couldn’t interpret were “post_processed” and “supervised_keys”.

Is the structure documented anywhere, or is diving into the code behind the datasets-cli test command the right way to figure this out?

Example from Cifar-10 (canonical):

lewtun · June 28, 2021, 6:34pm

hey @dk-crazydiv you can find a description of all the DatasetInfo fields in the docs: Main classes — datasets 1.8.0 documentation

if something is unclear or could be improved, feel free to open a PR!
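
For anyone landing here later, a rough sketch of how the size keys seem to relate to each other. This is my reading of the DatasetInfo docs, not something confirmed in this thread: dataset_size looks like the sum of num_bytes over the splits, and size_in_bytes looks like download_size plus dataset_size. The config name "plain_text", the summarize helper, and every number below are made up for illustration; they are not the real Cifar-10 values.

```python
import json

# Hypothetical dataset_infos.json content. The config name and all
# numbers are invented for illustration only.
RAW = """
{
  "plain_text": {
    "supervised_keys": {"input": "img", "output": "label"},
    "post_processed": null,
    "splits": {
      "train": {"name": "train", "num_bytes": 1000, "num_examples": 10},
      "test": {"name": "test", "num_bytes": 200, "num_examples": 2}
    },
    "download_size": 500,
    "post_processing_size": null,
    "dataset_size": 1200,
    "size_in_bytes": 1700
  }
}
"""


def summarize(info: dict) -> dict:
    """Check the assumed relationships between the size fields."""
    # Assumption: each split's num_bytes is the size of its prepared data.
    split_bytes = sum(s["num_bytes"] for s in info["splits"].values())
    return {
        "split_bytes": split_bytes,
        # Assumption: dataset_size == sum of num_bytes over all splits.
        "matches_dataset_size": split_bytes == info["dataset_size"],
        # Assumption: size_in_bytes == download_size + dataset_size.
        "matches_size_in_bytes": (
            info["download_size"] + info["dataset_size"]
            == info["size_in_bytes"]
        ),
    }


# The file maps each config name to one serialized DatasetInfo.
infos = json.loads(RAW)
for config_name, info in infos.items():
    print(config_name, summarize(info))
```

If either check comes back False on a real file, treat that as a sign the assumptions above don't hold for that dataset or library version, and go by the docs instead.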

dk-crazydiv · June 29, 2021, 4:08am

Thank you. This explains it.
