How do I set feature types when loading a dataset (ClassLabel, etc.)?

I am loading my dataset from a local file, and I’m getting the error “TypeError: new(): invalid data type ‘numpy.str_’”, which I believe is due to the features not being defined.

It’s mentioned here, and the suggested solution is to pass a features dictionary when loading, but I am having trouble with the format.

I’ve tried things like:

emotions = load_dataset("csv", data_files="train.txt", sep=";",
                              names=["text", "label"],features = {'text': datasets.Value(dtype='int32', id=None),
 'label':datasets.ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None)})

and

load_dataset("csv", data_files="train.txt", sep=";",
                              names=["text", "label"],features = {'text': 'str',
 'label':['not_equivalent', 'equivalent']})

Without success.

I’m trying to follow the documentation here but can’t seem to figure it out. Is there an example of how to do this somewhere? Thanks!

https://huggingface.co/docs/datasets/_modules/datasets/features/features.html#Features

Hi! This should work:

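import datasets
# Declare the schema explicitly: "text" is a string column, "label" is a ClassLabel with two named classes.
features = datasets.Features({"text": datasets.Value("string"), "label": datasets.ClassLabel(names=['not_equivalent', 'equivalent'])})
# Pass the schema to the CSV loader so the columns are typed on load instead of inferred.
dset = datasets.load_dataset("csv", data_files="train.txt", sep=";", names=["text", "label"], features=features)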

Also note that the label column in your CSV file has to contain integer labels (0 and 1), not the strings “not_equivalent” and “equivalent”; otherwise you’ll get an error.
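
If the file currently stores the label names as strings, one option is to rewrite it with integer ids before loading. The following is only a minimal sketch using the standard library, assuming a two-column text;label file with no header row; the helper name encode_labels and the output file name in the usage note are hypothetical, not part of the datasets library.

```python
import csv

# Label names in the same order as passed to ClassLabel(names=...),
# so each name's position in the list is its integer id.
LABEL_NAMES = ["not_equivalent", "equivalent"]
LABEL_TO_ID = {name: idx for idx, name in enumerate(LABEL_NAMES)}


def encode_labels(src_path, dst_path, sep=";"):
    """Copy a text;label file, replacing string labels with integer ids.

    Hypothetical helper for illustration; raises KeyError on an unknown label.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter=sep)
        writer = csv.writer(dst, delimiter=sep)
        for row in reader:
            if not row:
                continue  # skip blank lines
            text, label = row
            writer.writerow([text, LABEL_TO_ID[label.strip()]])
```

For example, calling encode_labels("train.txt", "train_encoded.txt") would write a copy with 0/1 labels, and data_files="train_encoded.txt" could then be passed to load_dataset together with the features shown above.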

Thank you!