How to configure dataset.description for a dataset without a loading script?

As the documentation states, if the data is in some known format, no loading script is necessary and that works just fine.

However, when loading the dataset, the “description” attribute is empty and I cannot figure out how to configure it (the metadata editor on the dataset card page does not contain a description field).

What is the recommended way to set this and possibly other features of the dataset?

You can add this info to the dataset card (as Markdown). The datasets project started as a fork of TensorFlow Datasets when the Hub (for datasets) did not exist. Hence, most DatasetInfo attributes come from the fork (e.g., description, homepage, etc.) and are not integrated well with the Hub (or datasets), so we plan to deprecate this class eventually.

Thanks for this info, but I am still confused: there seem to be 3 different ways to provide information about a dataset: the features/attributes that can be set in the loader for the Dataset class, the metainformation in the YAML part of the Readme file, and the textual information in the markdown part.

The problem is that it is not clear what is supported in the YAML part of the Readme, and which of those fields make it into the attributes and are thus available programmatically.

So, in order to make the description available in the program from the dataset representation, can I do this without implementing a loader class? And if that functionality gets deprecated, obviously it would not be wise to implement it now, but how will metainformation then become available in the code?

In other words: what is the best and future-proof way to specify all metainformation in a way that makes it show up on the Hub AND available within Python via the API, including the description?

You can put this info in the Dataset Description part (after importing the template), then use huggingface_hub’s RepoCard API to download and parse the card.
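A minimal sketch of that approach, assuming the card follows the template and has a heading literally titled "Dataset Description". The helper names extract_section and load_card_info and the repo id are made up for illustration; DatasetCard.load, card.data and card.text come from huggingface_hub. The heading matcher is deliberately naive and would be confused by "#" lines inside fenced code blocks in the card.

```python
import re


def extract_section(markdown_text: str, heading: str = "Dataset Description") -> str:
    """Return the text under the first heading titled `heading`.

    The section ends at the next heading of the same or a higher level.
    Returns an empty string if the heading is not found.
    """
    lines = markdown_text.splitlines()
    start, level = None, 0
    for i, line in enumerate(lines):
        m = re.match(r"^(#{1,6})\s+(.*?)\s*$", line)
        if m and m.group(2).strip().lower() == heading.lower():
            start, level = i + 1, len(m.group(1))
            break
    if start is None:
        return ""
    body = []
    for line in lines[start:]:
        m = re.match(r"^(#{1,6})\s+", line)
        if m and len(m.group(1)) <= level:
            break  # next section at the same or a higher level
        body.append(line)
    return "\n".join(body).strip()


def load_card_info(repo_id: str) -> dict:
    """Download a dataset card from the Hub and return its YAML metadata
    plus the description section (hypothetical helper, needs network access)."""
    from huggingface_hub import DatasetCard  # imported lazily; requires huggingface_hub

    card = DatasetCard.load(repo_id)
    return {
        "metadata": card.data.to_dict(),            # YAML part of the Readme
        "description": extract_section(card.text),  # markdown part
    }
```

Usage would look like info = load_card_info("your-username/your-dataset"), where the repo id is a placeholder; info["description"] then holds the text written under the description heading, and info["metadata"] holds whatever the YAML block contains.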
