Constructing a dataset with categorical labels

All of the dataset examples appear to hard-code the list of labels, e.g. ag_news:

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features(
                {
                    "text": datasets.Value("string"),
                    "label": datasets.features.ClassLabel(names=["World", "Sports", "Business", "Sci/Tech"]),
                }
            ),
            homepage="http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html",
            citation=_CITATION,
            task_templates=[TextClassification(text_column="text", label_column="label")],
        )

I’d like to load my labels from a file, i.e. either use the names_file argument of ClassLabel or read a JSON file directly and construct the names argument.

The issue I’m having is that the _info(self) method doesn’t give me access to a download_manager, so I cannot get the path to names_file in the cache.

I don’t want to hard-code my labels: I have different variants with different labels, and I want to include a metadata file that lists the labels per variant along with additional identifiers.

Note: I’ve also posted this to Stack Overflow as "Creating a Huggingface Dataset with categorical class labels from a file".

I think the simplest solution is to define a (Python) module that contains the lists of labels and then import this module and compute the labels.
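A minimal sketch of that module approach, in case it helps. The names LABELS_BY_VARIANT and label_names, and the "sentiment" variant, are made up for illustration; only the ag_news label list comes from the example above. It assumes the variant is selected by the builder config name and that this module can be imported from the loading script.

```python
# Hypothetical labels module kept next to the dataset loading script.
# Each key is a dataset variant; each value is that variant's ordered label list.
LABELS_BY_VARIANT = {
    "ag_news": ["World", "Sports", "Business", "Sci/Tech"],
    "sentiment": ["negative", "neutral", "positive"],  # invented example variant
}


def label_names(variant):
    """Return a copy of the label list for one variant, or raise ValueError."""
    try:
        return list(LABELS_BY_VARIANT[variant])
    except KeyError:
        known = ", ".join(sorted(LABELS_BY_VARIANT))
        raise ValueError(f"Unknown variant {variant!r}; expected one of: {known}") from None


# Inside the builder, _info() could then compute the labels without a
# download manager (assuming self.config.name holds the variant):
#
#     names = label_names(self.config.name)
#     features = datasets.Features(
#         {
#             "text": datasets.Value("string"),
#             "label": datasets.features.ClassLabel(names=names),
#         }
#     )
```

Since the labels live in plain Python, they are available as soon as the script is imported, which sidesteps the cache path question for _info().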

names_file is a concept that comes from TensorFlow Datasets (this project started as a fork of TFDS).

Yeah, I’ve been stepping through the load_dataset source and confirmed that self._info() is called before the cache dir is set up, so it looks like this might be the best option. :(