How do I set feature types when loading a dataset (ClassLabel etc.)?

MaximusDecimusMeridi · January 18, 2022, 9:40pm

I am loading my dataset from a local file, and I’m getting the error “TypeError: new(): invalid data type ‘numpy.str_’”, which I believe is due to the features not being defined.

It’s mentioned here, and one solution is to pass a features dictionary when loading, but I am having trouble with the format.

I’ve tried things like:

emotions = load_dataset("csv", data_files="train.txt", sep=";",
                        names=["text", "label"],
                        features={'text': datasets.Value(dtype='int32', id=None),
                                  'label': datasets.ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None)})

and

load_dataset("csv", data_files="train.txt", sep=";",
             names=["text", "label"],
             features={'text': 'str',
                       'label': ['not_equivalent', 'equivalent']})

Without success.

I’m trying to follow the documentation here but can’t seem to figure it out. Is there an example of how to do this somewhere? Thanks!

https://huggingface.co/docs/datasets/_modules/datasets/features/features.html#Features

mariosasko · January 19, 2022, 12:40pm

Hi! This should work:

import datasets
features = datasets.Features({
    "text": datasets.Value("string"),
    "label": datasets.ClassLabel(names=["not_equivalent", "equivalent"])})
dset = datasets.load_dataset("csv", data_files="train.txt", sep=";",
                             names=["text", "label"], features=features)

Also note that the label column in your CSV file has to contain numbers as labels (0 and 1) and not strings (not_equivalent and equivalent); otherwise you’ll get an error.
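If the file currently stores the label names as strings, one way to meet that requirement is to rewrite the label column as integer ids before loading. The following is only a minimal sketch using the Python standard library; it assumes a headerless, two-column `text;label` file, and `LABEL_NAMES` and `encode_labels` are hypothetical names chosen for illustration, not part of 🤗 Datasets. The ids follow the order of the names list, which should match the order passed to ClassLabel.

```python
import csv
import io

# Assumed label order; it should match ClassLabel(names=[...]) so ids line up.
LABEL_NAMES = ["not_equivalent", "equivalent"]


def encode_labels(src, dst, names=LABEL_NAMES, sep=";"):
    """Copy `text;label` rows from src to dst, replacing label names with ids.

    Hypothetical helper for illustration; raises KeyError on an unknown label.
    """
    name_to_id = {name: idx for idx, name in enumerate(names)}
    reader = csv.reader(src, delimiter=sep)
    writer = csv.writer(dst, delimiter=sep, lineterminator="\n")
    for row in reader:
        if not row:  # skip blank lines
            continue
        text, label = row
        writer.writerow([text, name_to_id[label.strip()]])


# In-memory demonstration; with real files, pass open file handles instead.
raw = io.StringIO("first sentence;equivalent\nsecond sentence;not_equivalent\n")
encoded = io.StringIO()
encode_labels(raw, encoded)
print(encoded.getvalue())
```

With real data, the same function could be called on `open("train.txt", newline="")` and an output file opened for writing, and the output file then passed as `data_files` to `load_dataset`.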

MaximusDecimusMeridi · January 19, 2022, 2:50pm

Thank you!
